So how many words do you think it is?

Last Update 14 Feb 2015: Analysis based on SDL Trados Studio 2014 SP2 CU8 (11.2.4364.8)

It’s not unusual for people to see the word-count in a translation tool, compare it to the word-count in their authoring tool (usually MSWord..!) and then wonder why there are so many differences.  Quite often I get sent documents and am asked to explain the differences which I often put down to a simple explanation; ultimately I think you need to be fair with this and Word is a simple word count, whereas a translation tool is designed to try and reflect the effort of the Translator.  Studio even separates out the placeables and numbers so that the real effort becomes apparent and a fair way of measuring and charging for the work is achieved.  How successful this is often depends on who’s asking and exactly what the source material is..!

To get an idea where some of the differences lie and to see exactly how Studio does its count I have  consolidated a few things I’ve shared with others in the past (thank you for asking, Kevin Lossner), stolen a few things from the work of others (thank you for your great article Tuomas Kostiainen) and have cobbled a few things together myself.  So I created a word document that looks like this:

#1.1

I then analysed this in Studio, Trados 2007 and MSWord.  The results were as follows:

#2.1

There’s not a lot of difference here is there?  The truth is that more often than not a quick check will reveal something similar and no-one’s worried.  But the text above contains things that do make a difference and if the text you are translating contains a lot of something that affects the count then this is when the conversations about who’s right begin.

But before we investigate the differences let’s take a look at the Studio count itself.  Studio counts tags and placeables, compared to Trados 2007 that counted some placeables and tags but didn’t distinguish between them… and Trados 2007 sometimes incorrectly counts them.  So in this file if you take all the separate segments and add them up the total placeables was seven and not the four shown here… an important contributory factor when comparing the analysis between these tools but not something worth investigating because it will never be fixed..!  The basic explanation for what Studio is counting can be generally explained using the first four segments of this example:

Taking it segment by segment the analysis is like this:

#1 : 5 words
#2 : 7 words, with one placeable that is counted in the wordcount
#3 : 7 words, with two tags. The tags are also placeables and are not included in the wordcount
#4 : 9 words, with four placeables each of which are counted as a single word

Segment #4 is the active segment showing the blue underlined placeables that are recognised by Studio.  So the analysis of this part alone is reported as follows:

The conclusion being:

  • all placeables are counted as words unless they are also tags
  • tags themselves can be identified for any manual adjustment to the overall rate for tag handling
  • numbers themselves cannot be identified from the analysis (subtract the tags and you are left with 5 placeables… one number, two dates and two variables)

A complication worth noting is that if the placeables are not recognised because they are formatted in a way that Studio does not identify with then this also affects the analysis.  So for example the same file analysed without these being recognised could generate the following:

This is based on the long date format no longer being recognised and therefore the wordcount has increased by three:

This is actually the same as we see in MSWord now because Word doesn’t see “placeables” and counts this segment as 12 words.  So Studio sees this one as 12 words with 5 placeables.  You can see how Studio expects to see the formatting using the National Language Support (NLS) API Reference from MicroSoft.  You change the operating system to suit the one you are using and the language code details are changed accordingly.  Find the language you’re interested in and then click on the link like this:

So in addition to understanding the importance of the source formatting when analysing a file we can also see the first difference between the tools.

Numbers and Dates

Today is Monday, August 8, 2011, aka 08/08/2011 in US and UK.
  • (9 words, 4 placeables) Studio counts numbers and full dates when they are recognised
  • (9 words, 1 placeable) Trados 2007 does not count numbers and only recognises some date formats (the long date above is not recognised)
  • (12 words) MSWord counts numbers and recognises some dates when counting (the long date above is not recognised)

Hyphenated Words

He has a devil-may-care attitude when it comes to hyphenated words

Studio 2014 SP2 introduced some changes here and this is quite a tricky area with many differences depending on the type of hyphens being used.  I tested with ten different types, but even this is not exhaustive.  In general I think Studio is mirroring the MSWord approach somewhat, although not for the same kinds of hyphens.

Studio

  • (11 words) Studio counts hyphenated words as one word for the following
    • U+002D – hypen minus
    • U+00AD – soft hyphen
    • U+2010 – hyphen
    • U+2012 – figure dash
    • U+2013 – en dash
    • U+2014 – em dash
    • U+2015 – horizontal bar
  • (13 words) Studio splits the words for the following
    • optional hypen (unique to MSWord as far as I know)
    • U+2011 – non breaking hyphen
    • U+2212 – minus sign

Trados

  • (11 words) Trados counts hyphenated words as one word for the following
    • optional hyphen
    • U+2011 – non breaking hyphen
  •  (12 words) Trados does something weird here…!!
    • U+00AD – soft hyphen
  •  (13 words) Trados splits the words for the following
    • U+002D – hypen minus
    • U+2010 – hyphen
    • U+2012 – figure dash
    • U+2013 – en dash
    • U+2014 – em dash
    • U+2015 – horizontal bar
    • U+2212 – minus sign

MSWord

  • (11 words) MSWord counts hyphenated words as one word for the following
    • U+002D – hypen minus
    • U+00AD – soft hyphen
    • U+2010 – hyphen
    • U+2011 – non breaking hyphen
    • U+2012 – figure dash
    • U+2015 – horizontal bar
    • U+2212 – minus sign
    • optional hypen (unique to MSWord as far as I know)
  • (13 words) MSWord splits the words for the following
    • U+2013 – en dash
    • U+2014 – em dash

 

Forward slash and back slash

Using the solidus for he/she/it is often discouraged, except in this
case.
c:\Users\pfilkin\Documents\Studio 2011\Projects\87 - Joomla ini files
\gd-GB\en-AU.tpl_atomic.sys.ini.sdlxliff is handled similarly.
  • (30 words, 3 placeables) Studio counts words separated by either “slash” as separate words
  • (26 words) Trados counts words separated by either “slash” (so \ or /) as separate words but doesn’t count the numbers (it’s also inconsistent in its approach so justifying the 26 is an interesting challenge… feel free to post your answers below..!)
  • (21 words) MSWord treats words separated by either “slash” as a single word

Dotted lines (dashed really…) and underscores

Dotted lines count --------- as a word per line but solid underscore 
lines are ignored ________.
  • (15 words) Studio counts the underscore line as a word but ignores the dotted line (changed from previouos versions and now more like the old Trados count)
  • (15 words) Trados counts the underscore line as a word but ignores the dotted line (so the opposite of the text above… my initial assumption)
  • (16 words) MSWord treats both the dotted line and the underscore line as a single word

You do need to pay attention to this though because if the lines are on their own, and not with a sentence as shown here then Trados won’t count them at all.  Studio won’t count dashes at all, but will count individual underscores as separate words.  MSWord will treat them consistently, also counting individual dashes or underscores separated by spaces as separate words.

 

Hyperlinks

You can find anything with { HYPERLINK "http://www.google.co.uk" }.
  • (7 words, 3 placeables, 2 tags) Studio treats a hyperlink as a single placeable but also separates the hyperlink from the link text so both are counted
  • (8 words, 2 placeables) Trados also treats the hyperlink as a word, but 2 words (in a docx)
  • (6 words) MSWord treats the entire hyperlink as a single word

One of the differences with hyperlinks is that Studio will separate out the link from the link text making it obvious and also allowing the translation of the link itself if required.  The two tags and the link in #10 accounting for the three placeables:

Trados does the same thing but splits the hyperlink itself into two words because of the colon and because the entire link is not handled as a placeable as in Studio.  This can vary with the file type as well because DOC and DOCX are treated differently in Trados.  The same text as a DOC returns this analysis because of the inconsistent way Trados 2007 treats hyperlinks between file types:

  • (7 words, 3 placeables, 2 tags) Studio counts as before
  • (6 words, 1 placeables) Trados doesn’t count the hyperlink at all
  • (6 words) MSWord counts as before

Numbered lists

1. If you put
2. things into a numbered list
3. the numbers are excluded
4. unless you do them manually
5. like these last two
  • (23 words, 2 placeables) Studio moves the correctly formatted numbers (automatic numbering in MSWord) outside the segment and only counts the 4. and 5. because these are placeables in the segment.
  • (21 words) Trados does a similar thing but then ignores the numbers that are in the text as part of the count.
  • (26 words) MSWord treats the manual and automatic numbering in the same way and they are all counted.

In Studio these actually look like this:

So I think Studio reflects the effort required more accurately than Trados or MSWord.

Numbers and units

50% is written correctly and 50 % is not.
  • (8 words, 2 placeables) Studio counts to number and the percentage as a single placeable
  • (8 words) Trados ignores the number on its own and counts 50% as one word
  • (9 words) MSWord treats numbers and the separated percentage as separate words

In this example 50% and 50 % are both considered correct ways to write percentages with the language pair I have used.  This may not always be the case and sometimes the analysis in Studio will vary if the formatting of the numbers do not match that which is expected.

Hidden text… using MSWord hidden text feature

If I hide some of the text the count will be funny.
  • (9 words, 1 placeable, 1 tag) Studio ignores the hidden text but does display it as a tag which doesn’t contribute to the word count total
  • (9 words, 1 placeable) Trados behaves in a similar way but doesn’t distinguish between a tag and another form of placeable
  • (9 words) MSWord doesn’t count hidden text at all

The red text is hidden using the “Hidden text” font feature in MSWord rather than the special non-translatable styles that can be applied to text in Word.

Chemical names

Chemical names are tricky, like (2R)-2-methylsulfanyl-3-
hydroxybutanedioate
  • (8 words, 3 placeables) Studio counts 2R, -2 and -3 as placeables and so separates the chemical name itself into three words for the count
  • (10 words) Trados counts the hyphenated words separately and because of the number recognition breaks up the chemical name around the 3… I think
  • (6 words) MSWord treats the chemical name as one word consistently handling hyphenated words as a single word

This is actually a good example of where none of the applications do a good job here because the effort involved in writing these things is not recognised at all.  If they are added to the variable list, or a termbase then the effort will be much reduced but for new text this is definitely a good example of a problem area for all applications.

Writing styles

Using an ellipsis … causes different counts…depending on the 
style . . . that you use.
  • (12 words) Studio ignores the ellipsis unless there are no spaces, in which case it treats this as a hyphenated word and treats the word…word as one
  • (13 words) Trados counts all the word separately ignoring the ellipsis.
  • (16 words) MSWord treats the first ellipsis as a word and the last three dots of the final ellipsis as separate words… it treated the middle ellipsis as a hyphenated word as Studio.

There is no single rule for using an ellipsis… as far as I know.  So some texts ask for a space before and after the dots, others say there should be none and I found one guide that asked for a space before and after each dot.  I use a space after the dots only (looks like my bad!)… but all these differences are handled in different ways and also lead to analysis inconsistencies.

The other interesting thing about this is segmentation… in Studio I see this:

So the number of segments may vary too depending on the style used and as a result the effort required to merge the segments as you work.  They are not paragraph segments so they can be merged, and you could also create a segmentation rule to handle this automatically, but it’s still a good example of how important the authoring styles can be to productivity.

Alphanumerics

#11

  • (6 words, 1 placeable) Studio recognises the product code and treats this as one placeable
  • (8 words) Trados separates the Product code into three words around the hypens
  • (6 words) Word treats the product code as one word

Generally Studio will recognise alphanumeric patterns and treat them as a placeable, but there are a few rules around this to explain any exceptions:

  • must not start or end with underscores, hyphens or full stops
  • must not contain both dashes and full stops
  • must contain at least one number and one letter
  • must not contain lowercase characters and dashes

I ran a few tests in Studio with a few examples to make these a little more clear (I hope).  You can see the results here:

 

Dotted lines

……………………………………………………………………
  • (0 words) Studio only counts the characters and not the words.  However it still displays this as a segment that you have to handle.
  • (0 words) Trados handles the count the same way but doesn’t make this a translatable segment.
  • (1 word) MSWord treats this as a single word

Apostrophes

Since SDL Trados Studio 2014 SP2 words containing an apostrophe are counted as one word.  For example “Relax, it’s Friday” would be three words.  However, there are exceptions to this rule and it adds more possibilities for wordcount discrepancies as you extend past the typographical “curly” apostrophe (U+2019) and the straight apostrophe (U+0027).

One important exception worth mentioning is the use of the prime apostrophe (U+2032-34), a good example might be measurements in feet and inches:

prime

  • (5 words, 2 placeables) Studio splits on the prime apostrophe and counts them as separate words.
  • (5 words) Trados splits on the prime apostrophe and counts them as separate words.
  • (4 words) MSWord counts all words containing the apostrophe as a single word.

I think both Studio and Trados get it right here.  But generally it is probably debatable whether it’s always right, or always wrong, or somewhere between.  Certainly the treatment of the apostrophe is an emotive area.  What’s worse is that if the author doesn’t know how to insert a prime apostrophe and incorrectly uses a “curly or straight” apostrophe then this exception will no longer apply and you lose a word for every measurement.

But let’s a take a longer example for the rest.  I took this sentence and analysed it as I did for other examples, and then considering we also see different types of apostrophe (correctly or incorrectly used) I added a more detailed analysis as well:

It’s Friday night, so if we were in the ‘80s it’d be disco fever.

Based on the simple and more common apostrophes, “curly and straight”, we see this:

  •  (14 words) Studio counts all words containing the apostrophe as a single word.
  • (16 words) Trados counts all words containing the apostrophe as two words.
  • (14 word) MSWord counts all words containing the apostrophe as a single word.

So Studio is mirroring MSWord more for the more common usecases, but looking at a wider range of possibilities and also considering placeables it paints a very different picture and could easily be the cause of dicussion over the analysis if these were used.

Before I start I’ll just add that the use of these “exotic apostrophes” in this example is not really appropriate and I’m only using this to illustrate the point.  For example the “U+02BC – modifier letter apostrophe” is really just a space modifier and shoud not be used at all in an example like this; similarly the use of the Okina (U+02BB – modifier letter turned comma) which does look as though it is an apostrophe shouldn’t really be used here; but it’s interesting to see how they are handled… and this is not an exhaustive list either.

Studio

  • (14 words, 0 placeable) Studio counts all words containing the apostrophe as a single word.  The placeable for 80s is ignored altogether
    • U+02BC – modifier letter apostrophe
    • U+02EE – modifier letter double apostrophe
  • 14 words, 1 placeable) Studio counts all words containing the apostrophe as a single word.  The placeable is for 80s.
    • U+0027 – apostrophe
    • U+2019 – right single quotation mark
    • U+2034 – triple prime
  • (14 words, 7 placeables, 6 tags) Studio counts all words containing the apostrophe as a single word, each apostrophe being controlled by a tag pair.  The placeable is for 80s.
    • U+055A – armenian apostrophe
    • U+FF07 – full width apostrophe
  • (19 words, 7 placeables, 6 tags) Studio splits the words and counts each apostrophe too, each apostrophe being controlled by a tag pair. The additional placeable is for 80s.
    • U+02BB – modifier letter turned comma
    • U+A78B – Latin Capital Letter
    • U+A78C – Latin Small Letter

Trados

  • (16 words, 0 placeable) Trados counts all words containing the apostrophe as two words, the exception being where the apostrophe is preceded by a space.
    • U+0027 – apostrophe
    • U+02BC – modifier letter apostrophe
    • U+2019 – right single quotation mark
    • U+2034 – triple prime
  • 16 words, 6 placeables) Trados counts all words containing the apostrophe as two words, each apostrophe being controlled by a tag pair.
    • U+055A – armenian apostrophe
    • U+FF07 – full width apostrophe
  • (19 words, 0 placeables) Trados splits the words and counts each apostrophe too
    • U+02EE – modifier letter double apostrophe
  • (19 words, 6 placeables) Trados splits the words and counts each apostrophe too, each apostrophe being controlled by a tag pair.
    • U+02BB – modifier letter turned comma
    • U+A78B – Latin Capital Letter
    • U+A78C – Latin Small Letter

MSWord

  • (14 words) MSWord counts all words containing the apostrophe as one words.
    • U+0027 – apostrophe
    • U+02BC – modifier letter apostrophe
    • U+2019 – right single quotation mark
    • U+2034 – triple prime
    • U+055A – armenian apostrophe
    • U+02EE – modifier letter double apostrophe
    • U+02BB – modifier letter turned comma
    • U+A78B – Latin Capital Letter
    • U+A78C – Latin Small Letter
  • (19 words) MSWord counts all words containing the apostrophe as two words, including the apostrophe itself.
    • U+FF07 – full width apostrophe

 Conclusion

After writing this, and after looking at various examples of texts I have received to investigate since the launch of Studio it is clear there is no single simple answer to why the counts differ.  Each text needs to be considered on it’s own merit and often the reasons for the differences are clear.  These differences are not only there when considering the differences between Studio, Trados 2007 and MSWord… they will also be there when comparing the counts between other translation tools as well.

All in all I think I agree with the conclusion drawn by Tuomas Kostiainen that Studio makes a good attempt to be fair and to reflect the effort made by the translator in having to handle the work.  Even when placeables are not recognised the analysis is reflective of the effort.  Studio is also consistent between file types and manually verifying the counts on a segment by segment basis is simple compared to Trados.

Hopefully this explanation of the Studio analysis and consolidation of the differences in counts between tools will help to explain why things are different the next time you are asked why.  It may not help to resolve how you are actually paid and this is where the difficulties come from… and is another, less technical reason, why vendors often prefer translators to use the same tools throughout the supply chain.  There isn’t an easy answer to a dispute over counts, only fair and sensible compromise… although you could consider an independent tool for counting so that irrespective of the tools used for translation the counts are always consistent.  I don’t have an opinion on the merits of these tools but I do know some translators who use them… tools like PractiCount for example… but even here these tools are probably not as comprehensive in terms of file support as your translation tool.

35 comments
  1. Agenor said:

    Simply brilliant! Even better than the regex tutorial.

    Like

    • Glad you think so Agenor… although sometimes I think regex is easier to understand..!

      Like

  2. Thank you, Paul, for an excellent explanation of this frequently relevant topic. I hope some day Studio will not only count tags and placeables but will also include weighting functions for costing. Or that tools integrating with Studio will consider this.

    Like

    • Thanks Kevin. That would be nice indeed, although I think we have a couple of applications on the OpenExchange, both from Kaleidoscope, that can do this already. GoAnalyze is designed just for this – http://goo.gl/3Qtc1 – and Connecting Content – http://goo.gl/4R1R4
      The latter would be part of a bigger solution but GoAnalyze is great.

      Like

  3. Hi Paul,
    I find that the biggest difference in word counts between Studio and Trados 2007 is in files with number-only segments (Studio counts them and Trados 2007 doesn’t). I know of one agency that uses the Trados 2007 log report to calculate job price and then sends me a Studio package. In one case this meant a difference of 15,000 words in a 90,000 word project. The excuse used to be that Studio couldn’t export its analysis. With the OpenExchange Export Analysis Report and later the same feature within Studio the problem’s been solved, but people should be aware of this potentially big difference between Trados 2007 and Studio word counts.
    Translating number-only segments needs time and care, even with auto-propagation enabled. It’s good that Studio takes this into account.
    Emma

    Like

    • Hi Emma, I think I’d agree. Most of the differences I come across are easily reconciled because of numbers. It would be nice if Studio separated (was that right ;-)) the numbers specifically but this is probably only useful for number only segments as you still need to do some work for the rest… as well as QA the numbers properly at the end to make sure nothing was incorrectly transposed.

      Like

  4. Paul Harrap said:

    I/am/waiting/to/hear/about/a/customer/authoring/content/like/this/so/Word/insists/the/translator/only/gets/paid/for/one/word.

    Like

  5. Patrick Hartnett said:

    Excellent article Paul; very interesting the differences between the applications with regards ‘what is a word’; this article will help me a lot with finalizing some functionality with ‘Post-Edit Compare’.

    Patrick.

    Like

    • Excellent… and I’m looking forward to seeing the first release of Post-Edit Compare… so far it seems an excellent application!

      Like

  6. Tisha said:

    Hi Paul, thanks for your fantastically interesting and useful blog!
    My question today is whether you can get Trados to ignore hyperlinks in the word count. I am analyzing a file for a client who has a TM with matches. Word gives a plain word count of 540 words. Trados Studio 2011 takes the matches into account, but also the many hyperlinks, and gives a total word count of 703. I have already gone into Tools-Options-File Types – Word-Common and selected both “never process hyperlinks” and “extract only hyperlink text” but the word count remains the same (running the project again from scratch).
    I do not want to be unfair to the client (or to myself!) so it feels stupid to go after the Word count and ignore the matches, but also unfair to count in a lot of hyperlinks that are not to be translated at all.
    Thanks for any suggestions! Much appreciated.
    Regards!

    Like

    • Hi Tisha, I’m surprised the wordcount would remain the same and wonder if you are perhaps not changing the things you believe. So maybe confusing Project Settings with Global Settings, or using a Project Template that still has the extract hyperlinks set?
      If I take this sentence “This has a hyperlink” and then analyse this after extracting the hyperlinks I get:
      – 5 words
      – 2 segments
      – 3 tokens
      – 2 tags
      The same thing without extracting the hyperlinks is:
      – 4 words
      – 1 segment
      – 2 tokens
      – 2 tags

      So this behaves correctly and will reduce the count because the hyperlink itself is not added as an additional segment for translation, and lower a word, and as it is a recognised token I also lower the token count.

      I think you may not be using the settings you changed.

      If you were using Studio 2014 I would suggest the SDLXLIFF toolkit as another way to handle this because you could lock the tagged segments (all hyperlinks) and then exclude locked segments from the analysis. But with Studio 2011 you need a different approach. So maybe use the SDLXLIFF to Legacy Converter to create a TTX without the hyperlinks and then analyse that? Just an idea as described here : Identifying numbers in your analysis. The article was written for a different purpose but you could use the same idea and find the hyperlinks with the display filter and then lock them all.

      But I think I’d double check your mechanism for obtaining the analysis first because I am pretty sure you are not getting what you think.

      Regards

      Paul

      Like

      • Tisha said:

        Thanks for such a quick reply, Paul! I double-checked everything (it was correctly set), restarted Trados, and double-checked and re-analyzed, and now the word count is ok. Pfew! 🙂
        Thank you!
        (ps. it was Studio 2011)

        Like

  7. Micaela Novas said:

    Paul, I’m finding discrepancies between the analysis of packages I create in Studio 2014 Professional, and what my colleagues see on their end, when they open the package either in Studio 2011 Professional or Studio 2009. (Total word count is the same). Where should I start looking for the cause of these discrepancies? I would have thought that the settings I use to create a package overwrite any other settings on my colleagues’ machines.

    Like

    • Hi Micaela, I think first of all I’d be asking whether this is the same report viewed in different versions of the tool or if the analysis is being rerun? Then I think I’d look closely at the analysis to see if I could see which bands, or even which files, the discrepancies were appearing in. From this something might stick out that would cause me to look a little closer at it. So perhaps taking one of the files, splitting it down a little so you have a smaller set of data to look at would help, and then you may be able to see where the problem is coming from.
      I wouldn’t expect to see a big difference even though 2014 can report things differently based on criteria that is available in 2014 but not in 2009/2011.
      If you can share the package with me I’d be glad to take a look and perhaps I can explain it… or find something that should be fixed… both are possible! I’ll drop you an email.

      Like

  8. CJIoHuKu said:

    Hi, guys. Cna anyone tell me why sometimes Trados does not show terms from termbase in the term recognition window? I know for sure that I did enter that term but it does not apperar in recognition and (more of that) it is not used in automation. This was the 1st question. The second one is : Why trados does not EVER recognise terms liks g/l, km/h and other terms with slash? Thanks a lot.

    Like

    • Hi CJIoHuKu, seems a little off topic for this post which is about analysis as opposed to term recognition!
      When you say Trados what exactly are you talking about? Do you mean Studio 2014 or Trados 6 for example… helps to be able to try and answer the question.
      Assuming Studio then the term recognition can be affected by long sentences, so I’d need a little more to go on if you want a specific answer to your question. On the slashes, I think it’s inconsistent. So recognise/terms and km/h are recognised for me. But g/l is not. The term recognition in Multiterm is in need of a refresh.

      Like

  9. Hi Paul,

    Thanks for your clear and detailed explanation of this controversial topic. The best I’ve read in years. 🙂

    I have been working with Trados since version 1.16, and noticed differences in wordcounts between almost all versions, but also between Trados and other CAT tools. And sometimes the discrepancies in wordcounts can be quite surprising, sometimes even alarming.

    Paul, it would be great if you could also run a detailed wordcount comparison with your above test-case file between the latest version of Trados and that of other non-SDL CAT tools such as DejaVu, Idiom WorldServer, MemoQ, XTM, … and share the results with us.

    I am really really really interested in finally finding out what is causing the differences in wordcounts between all those CAT tools, and what exactly is counted as a word in one tool and not counted in another. I am pretty sure there are many people in the translation business who would like to have this cleared up once and for all.

    Maybe this exercise has already been done by you or another CAT tool provider. In that case, please tell me where I can find that information. If not, it would be great to see SDL take up the gauntlet and lead this initiative.

    Thanks in advance.

    John

    Like

  10. Hi Paul,

    Excellent post. It inspired me to create a “Word Count Analyzer” tool that allows translators to easily see the potential word count gray areas in their text and also allows them to customize how those gray areas are treated in the count. The tool is available here: https://www.tm-town.com/word-count-analyzer and I also wrote up a short blog post about it here: https://www.tm-town.com/blog/deconstructing-word-count.

    Thanks again for the great post!
    Kevin

    Like

    • Hi Kevin, I think that’s a great idea to develop a tool like this. Glad you got some inspiration here!

      Like

  11. Hi Paul,

    I have a question regarding wordcounts in SDL Trados Studio. When analyzing a file with number-only segments, Studio shows them as 100% matches. Is there a way to report these number-only segments as no matches/new words in the analysis report?

    Thanks a lot!

    Cecilia

    Like

    • Hi Cecelia, these are auto-localised placeables which is why they are 100%. As you noted, you can’t separate numbers in this way, so you’d have to workaround. If you lock all segments with numbers first then you can run an analysis and have the locked segments counted separately. This gives you the number, so now you add these to the no-match instead in your own copy of the analysis. I don’t think you can have them automatically handled this way.

      Like

  12. Ciao Paul,

    Thank you again for letting me make reference to this wonderfully researched article with my API documentation for Segment Comparison at http://posso.codificare.org/

    I updated the documentation this afternoon; hopefully it should be a little easier on the eyes now 🙂

    ciao,
    Patrick.

    Like

    • It’s an absolute pleasure Patrick… thank you for all the great work you’ve done on this and all the other cool things you work on!!

      Like

  13. I’ve recently been analysing word counts after a difference with a customer’s word count.

    I couldn’t account for some minor differences, even after re-reading your excellent article, Paul. Finally I realised that in automatic bullet point lists, MS Word doesn’t count the bullet symbol used (dot, arrow, etc.) EXCEPT when the symbol is a long (–) or short (-) dash.

    Go figure.

    Like

    • That’s interesting Emma, one to test after Xmas 😉

      Like

  14. Katja said:

    Dear all,

    I wonder if anyone among you can help me with the following issue:

    Today I created a report in Studio 2011 for the translation of an XML file. Afterwards I sent it to our language service provider and asked for a quote. The LSP created another report in Studio 2011 at their end. They used the same XML file and the same memory, however their word counts for the different match categories differed slightly from the ones I got, e.g. their report showed more No Matches and less 100% Matches compared to mine.
    Has anyone an idea how this happens? As far as I can see one hasn’t that many options in Trados changing the word count or the assignment to different match categories. But maybe I’m missing sth? Could you please give me a hint which setting I need to align with our LSP so that we get the same report results? (we always get the same total word count, that’s not an issue)

    Many thanks in advance for your help!

    Cheers, Katja

    Like

    • Hi Katja,
      Most likely, with XML, the wordcount differences will be down to file preparation. Are you using the same filetype? XML is normally prepared with a custom filetype to only extract the translatable text. Review this article.

      Like

  15. There appears to be significant divergence in word count between Studio 2014 and Studio 2015, perhaps it would be worth updating your article with figures for these versions (and especially the latter).

    Like

    • Hi Spiros, I think the main difference is additional options… I will take a look at this at some point. Do you know where your differences are coming from?

      Like

  16. Paul, I am not sure, but in a big “Help” project I tested recently the discrepancy was quite significant. I could forward it to you if it helps.

    Like

    • Sure… send it over and give me the stats you expected.

      Like

  17. Dear all

    I found that Studio reports lower word counts (about 5% less) than Microsoft Word in most cases. These texts concern civil engineering. They contain many numbers accompanied by SI units in the running text, in tables, and in formulae. The main reason seems to be that placeables include the unit and count 1 word in Studio, whereas Word counts them as 2 words.
    It also looks as if Studio would not count the equal sign (=) used in the formulae and sometimes elsewhere. Am I right?

    Best regards, Johannes

    Like

    • Most probably… but I think this is 5% for your texts and not in general. The word count is entirely dependent on the context, and Studio tries to be as fair as possible to reflect the effort.

      Like

  18. Malcolm said:

    Hi. In Studio 2015 SR2 12.2.5087.4, I analysed a single file for translation into 4 languages (F.I.G.S.). For some reason the Spanish analysis came out with a total word count 8 words higher than the others. Does anyone have any ideas why this might be so?

    The file was a Word file, prepped as a two-column table with column 1 hidden, so that it would end up as a bilingual table.

    Like

    • Maybe differences in the TMs somewhere, in how they have been set up…

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: