DOGS and CATS… Regular Expressions Part 4!

When I first started adding articles about how to use regular expressions I thought I’d only write three… but I had an interesting question from one of our resellers, Agenor (actually Agenor always asks me the hardest questions!), about how to use the display filter to find segments that contain one word, but not another. It was tricky, but once you have it you can use the expression all the time. I have a collection of such things from when people ask me, so I thought I’d share how this problem was solved and also post a list of some of the useful regular expressions I have saved for the display filter in Studio 2011.

The task from Agenor was how to display segments that contained one word, but exclude segments that contained another, even if they were both in the same segment.  So DOGS but not CATS, and not DOGS and CATS because CATS is included… so basically make this happen:

Narrowing the results with the display filter

The solution was to use an expression that used some things called lookarounds and word boundaries… probably things I haven’t mentioned in any real detail yet.

^(?=.*\bDOGS\b)(?:(?!\bCATS\b).)+$

That looks complicated doesn’t it? But help is at hand if you have RegexBuddy (I know I plug this a lot but it’s very handy) because this will explain the expression for you and help you to write more, and create variations on this. Basically this is what it does:

^
Start at the beginning of the segment

(?=)
This is a Positive lookahead and it means take a look in front of the current position and match the regular expression inside these brackets

.*
We’ve seen this before.  The dot means find any character and the * means find between zero and unlimited times… so keep looking for anything until you are told otherwise

\b
This is called a word boundary, and by placing the word you are looking for in between a pair of them you can ensure you find complete words. So I was looking for DOGS and CATS like this \bDOGS\b and \bCATS\b

?:
This one is not really needed as the expression will work without it. When you place things inside brackets (apart from the lookaheads themselves) this creates a capturing group that a backreference can refer to… we used these before for search and replace… and adding the ?: tells the expression I don’t want to store what the group matched as I don’t need it, so it might help the expression to perform faster for very large files. For most files you may never notice the difference.

(?!)+
This part, (?!), is called a negative lookahead and it means take a look in front of the current position and make sure you cannot match the regular expression inside these brackets… so in this case don’t match CATS. The dot after it then moves on by one character, and the + at the end repeats that pair (check, then move on one character) between one and unlimited times, “greedy” like the star, so the check is made at every position in the segment.

$
This one you also know by now. It marks the end of the segment, so the “greedy” part before it has to carry on all the way to the end and stop there.

So when you look at it like this it’s not so hard to follow… I think, and it might be a useful one to write down and remember for the future?
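
If you want to try the expression outside Studio, here is a minimal sketch in Python. Studio uses the .NET regex flavor, so treat Python's `re` module as a stand-in that should behave the same for the constructs used here. The sample segments and the `wanted` helper are invented for illustration.

```python
import re

# The expression from the article: the segment must contain the whole
# word DOGS, and the whole word CATS must not start at any position.
DOGS_NOT_CATS = re.compile(r"^(?=.*\bDOGS\b)(?:(?!\bCATS\b).)+$")


def wanted(segment: str) -> bool:
    """Return True if the expression matches the segment (hypothetical helper)."""
    return DOGS_NOT_CATS.search(segment) is not None


# Hypothetical sample segments, not taken from a real project.
segments = [
    "I like DOGS a lot",   # DOGS only
    "I like CATS a lot",   # CATS only
    "DOGS chase CATS",     # both words
    "DOGS and WILDCATS",   # CATS only as part of a longer word
    "HOTDOGS only",        # DOGS only as part of a longer word
]

for segment in segments:
    print(wanted(segment), segment)
```

The last two samples are there to show what the word boundaries do: CATS inside WILDCATS does not count as the word CATS, and DOGS inside HOTDOGS does not count as the word DOGS.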

Here’s a few other useful expressions I stored in a spreadsheet for safe keeping, I hope you find them useful?  Maybe you can add the useful ones you have into the comments for others to share too:

Display filter usecases

For copying and pasting!

^\d+\,\d+$
^([^<>])+
[^0-9.,]
^\D+$
^[\d,.]+$
^(?=.*SDL)(?=.*Trados).*$
^(\S+\s+){0,3}\S+$
(\b\w+\b)\s\1\b
[\.,]\s*[\.,]+
\. \s+$
\s+a\s+[aeiou]
\s+an\s+[^aeiou]
\s\s+
\.[A-Z]
\bWORD\b
^(?:(?!\bWORD\b).)*$
^.*\bWORD\b.*$
^(.*\bONE\b.*\bTWO\b)|(.*\bTWO\b.*\bONE\b)
^(?=.*\bONE\b)(?:(?!\bTWO\b).)+$
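
A quick way to get a feel for a few of these is to run them against short test strings. The sketch below uses Python's `re` module as a stand-in for the .NET flavor used by Studio, which is an assumption on my part. The stated purpose of each expression is only my reading of it, and the sample strings and the `found` helper are invented.

```python
import re

# Each entry: (expression from the list, my reading of its purpose,
#              a sample that should match, a sample that should not).
checks = [
    (r"^\d+\,\d+$", "segment is only a number with a comma",
     "3,14", "3.14"),
    (r"^(?=.*SDL)(?=.*Trados).*$", "contains both words, in any order",
     "Trados by SDL", "Trados only"),
    (r"(\b\w+\b)\s\1\b", "same word repeated",
     "this is is wrong", "this is fine"),
    (r"\s\s+", "two or more whitespace characters in a row",
     "double  space", "single space"),
    (r"\s+a\s+[aeiou]", "'a' followed by a word starting with a vowel",
     "eat a apple", "eat a pear"),
    (r"^(?:(?!\bWORD\b).)*$", "does not contain WORD",
     "nothing here", "the WORD is here"),
]


def found(pattern: str, segment: str) -> bool:
    """Return True if the pattern matches anywhere in the segment."""
    return re.search(pattern, segment) is not None


for pattern, purpose, hit, miss in checks:
    print(f"{purpose}: {found(pattern, hit)} / {found(pattern, miss)}")
```

Note that Python is case sensitive here unless you pass `re.IGNORECASE`, so results in Studio may differ depending on how the display filter is set up.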

There are a couple of other articles that are worth reviewing:

Regex….. what regex..! (Regular Expressions – Part 1)

Regex… and “economy of accuracy” (Regular Expressions – Part 2)

Search and replace with Regex in Studio – Regular Expressions Part 3

36 comments
  1. Michal Skoczynski said:

    Hi Paul,

    Glad Agenor brought this up with you, this was originally a problem posted in a PL Facebook group and the problem was to filter out “Information Sheet” and “Additional Information Sheet”. This seems to work like a charm now. Also a lot of additional interesting RegEx stuff here and generally a very, very useful Studio blog – appreciate it, keep up the good work!

    • Thank you Michal, good to see it worked in practice so fast!

  2. Agenor said:

    Thanks Paul, it’s now time to create a regular expression that will do the translation for us 🙂

  3. Indeed very insightful; regular expressions can be so useful, but also lead to quite some headache if not well constructed.

    By the way, another very nice way to better understand what a regex does is to visualise it. There is a rather well made site by Jeff Avallone which offers exactly that: http://www.regexper.com/

    • I like that… I hadn’t seen it before but it is a nice way to look at it. Thanks for the link!

  4. I love your cheatsheet, Paul! Any chance of uploading it in an editable format rather than an image, so we can copy&paste the entries?

  5. Marco said:

    Thank you Paul, very useful! (Especially the copying and pasting version).

    • I believe it… I have the same problem. I think if you don’t use them often enough then you have to stop and think every time. Thanks for the link… that site is maintained by the author of regex buddy. I’d also recommend his book, the regex cookbook… a readable and very useful resource if you want to have some reference material to hand.

  6. Nicolas said:

    Excellent resources Paul, as usual. I am really a regex and RegexBuddy addict myself and lookaheads are really powerful stuff!

  7. Thank you, Paul. You have saved my day. I have a huge file with a lot of numbers and some localization problems. Now I’m confident that all numbers are in good order in the translation.

  8. While honing my QA rules I found that in the .NET flavor of regex, lookbehinds can be of variable width, but not so the lookaheads, which is a bit of a game spoiler.

    • Drop me an example Piotr. Sometimes we do come across “funnies” and if validated we can get them fixed.

  9. For example this is a variable width negative lookbehind which checks if a comma was forgotten before certain words:

    (?<!(\,\s)(na\s|w\s)?)(\b(któr|że\b|ale\b|czy\b)).+

    and it works very nicely.

    Now if I want to use a negative lookahead which includes an expression indicating a variable number of words after a specific word starting a compound sentence which should be separated by a comma at some point (say from 1 to 10) before the non-existent comma, it does not match.

    I used a simpler regex instead which checks if all characters are ‘non-commas’ until the end of sentence, but it is less precise.

    This reference http://regexhero.net/reference/ says that variable width lookbehinds are possible, so I assume that variable width lookaheads are not possible.

    • Hello Piotr… can you mail me some text to play with that you’d use this with? Some that should pass and some that should fail.

  10. Andrew said:

    HI, Paul.

    Your regex, ^(?=.*\bDOGS\b)(?:(?!\bCATS\b).)+$ matches segments not containing CATS but DOGS. But that case if I would like to allow segments with both DOGS and CATS, are there any methods?

    So a regex I want should match

    ….DOGS… (o)

    ….CATS… (x)

    DOGS …. CATS (o)

    • Hi Andrew, maybe this will do the trick:

      DOGS|CATS

      Nice and simple… just means match DOGS or CATS which will do all your cases.

  11. Andrew said:

    Thanks, but what you are referring to matches

    …CATS.. (o)

    Therefore, this is not what I want. I guess a much more complicated regex is required.

    • I’m not entirely sure what you want now? First I thought you just wanted DOGS, or CATS, or DOGS and CATS. Now I see the little (x) or (o) at the end probably meant no match or match. Try this:

      DOGS

      So this will match a segment with DOGS, and it will match a segment with DOGS and CATS. Or am I getting confused by what you mean with the (o) as you now put CATS (o) when last time it was CATS (x).
      The main point being, you do not have to try and match the whole segment exactly to achieve what you want in the display filter. This makes it somewhat easier.

      • Andrew said:

        Thanks for your response and sorry for having confused you with the little (x) and (o) marks. I needed the regex as I am in charge of linguistic quality at a translation company. If I were a freelancer, using display filter with incomplete regex would be enough. But in-house reviewers use regexes that I make. Anyways by chance I just found out this regex “\bDOGS\b(?<!\bCATS\b)” exactly matches my intention.

        I have one more question. I saw your other article about “using backreference in Trados”. Is there any way to apply \b, “word boundary”, to $(number)? If I enter \b$1\b in target, Trados does not recognize that because of the escaping sign. A regex that I input was like the below.

        Source : ([A-Z]{2,})
        Target : \b$1\b

        "Grouped regex – source and target"

        In case I apply the word boundary to only source like below, it is recognized only in source not target.

        Source : \b([A-Z]{2,})\b
        Target : $1

        Should I accept the limitations of Trados?

      • Hi Andrew, I think your corrected expression is a complicated way of achieving exactly the same as just this:

        DOGS

        It produces exactly the same result. But as long as it works and you’re happy, that’s the main thing!

        On your other question. Can you explain what you are trying to achieve. In your first example the syntax is incorrect for a replace, unless you wanted the literal text \b to be included. But then you seem to be referring to verification now which is a different usecase. Please explain what you wish to find rather than where you think the problem is and this might make it easier to see what’s needed.

      • Or maybe… if I understand you correctly you want this:

        Source : (\b[A-Z]{2,}\b)
        Target : $1

        If you have an example of what you would like to catch it would be easier.

    • Andrew said:

      Hello, I have tested your regex but it does not match any. Why don’t you test it on any regex site or Trados?

    • Hi Raphaël, always fun to play with these things. I think your regex will match DOGS inside a string, but not DOGS alone or DOGS at the end of a string. It does match DOGS followed by CATS specifically as well.

      • Andrew said:

        Paul is right. but it matches only “DOGS followed by CATS”.

      • Will try it tomorrow at work.

        Have to admit that I am a bit confused about the regex Andrew provided “\bDOGS\b(?<!\bCATS\b)”. If I decompose it correctly, it should match the following:
        whole word "DOGS" and then a negative lookbehind without anything, i.e. normally the part "(?<!\bCATS\b)" should be followed by something to which it refers. Gonna test that too…

      • Indeed… which is why I think DOGS alone achieves what he needed. It will basically allow the display filter to find anything with DOGS in it as a separate word and the lookbehind is not really doing anything at all.
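
The point about the lookbehind can be checked with a small sketch. It uses Python's `re` module as a stand-in for the .NET engine in Studio, which is an assumption, although both should treat these particular constructs the same way. The sample segments and the `same_result` helper are made up for illustration.

```python
import re

# Andrew's expression and the plain whole-word search it is compared with.
with_lookbehind = re.compile(r"\bDOGS\b(?<!\bCATS\b)")
plain = re.compile(r"\bDOGS\b")

# Hypothetical sample segments.
segments = [
    "only DOGS",
    "only CATS",
    "DOGS then CATS",
    "CATS then DOGS",
    "CATS DOGS",
    "HOTDOGS",
]


def same_result(segment: str) -> bool:
    """True if both expressions agree on whether the segment matches."""
    return bool(with_lookbehind.search(segment)) == bool(plain.search(segment))


for segment in segments:
    print(segment, bool(plain.search(segment)), same_result(segment))
```

The lookbehind is tested right after DOGS has been matched, so the four characters behind that position are always DOGS and never CATS. That is why it cannot change the result.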

      • Andrew said:

        I wanted to catch out words consisting of only upper cased alphabets such as SIM, PIN, UN.

        However, certain acronyms that I did not need to catch out, began to appear, and fortunately I have found out a regex below.

        (\b(?:[A-Z]{2,})\b(?<!\bSIM\b|\bSD\b|\bFDDT\b))

        Now I can add terms which I do not need such as above – SIM, SD, FDDT…, catching out every possible acronym.

        Trados technicians should resolve remaining issues below.

        1) Any of escaped regexes is not available on the backreference – any word boundary on the source does not affect the target. it does only on the source. Please see my final regex far below.

        2) The search option detecting "matched but different counts on the source and target" should be available in the "grouped search" in order to prevent

        a possible error below – wrong spelling of ADT from being ignored.

        Source segment : ….. SIM ….. PIM ….. ADT

        Target segment : ….. SIM ….. PIM …… ADTT

        Therefore, my final destination is the following.

        Source : (\b(?:[A-Z]{2,})\b(?<!\bSIM\b|\bSD\b|\bFDDT\b))

        Target : \1\b\1

        Grouped search regex – source matched but not target

        – source and target matched but with different count

      • Hi Andrew, I don’t think there is a bug here. As I said before, you cannot use regex in the target of the Grouped Search for anything other than referring to the backreference. Anything else will be looking for literal text. In your target you only require this as far as I can see:

        Target: $1

        Nice regex though, I’m sure this example will be useful for anyone looking to do a similar thing where they have a general expression for everything, but with a few exceptions to the rule.

        Your second point… I agree. It would be a nice option as an enhancement to the checker if the grouped search also counted matches.

  12. Jiri Proniuk said:

    Hello, could you help me?
    I need to lock (exclude) the text in an Excel file between [ ], for example [% return maindetail.label() %]
    What is the correct regular expression for SDL Studio 2014?

    • Hi Jiri, for any version of Studio (regex is not a Studio thing!) you could use something like this (based on exactly what you said, so not including the opening and closing square brackets):

      (?<=\[)[^]]+

      So the expression does a positive look behind to make sure it was able to find the opening square bracket, and then it keeps finding everything until it comes across a closing square bracket.
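
As a rough illustration of what that expression picks up, here is a minimal sketch in Python. Python's `re` module is only a stand-in for the engine Studio uses, and the `cell` text around Jiri's example is invented.

```python
import re

# Positive lookbehind for "[", then everything up to (not including) "]".
inside_brackets = re.compile(r"(?<=\[)[^]]+")

# Hypothetical cell content built around the example from the question.
cell = "Hello [% return maindetail.label() %] world"

print(inside_brackets.findall(cell))
```

Text without any square brackets gives no matches at all, because the lookbehind never finds an opening bracket.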
