The elusive regex with GPT-4

A DALL·E generated image of running digits

Whilst solving regular expressions with ChatGPT seems like a great way to give yourself superpowers, I have stayed away from writing about this use case until now.  Yes, ChatGPT is great for those simple things that anyone with some basic knowledge could probably write themselves in the time it took to explain what was wanted.  But I like regular expressions… I’m definitely not a real expert, but I do like to play around with them and would consider myself an above-average user.  So when I decided to test ChatGPT with a regular expression I asked it to solve something I have never been able to achieve on my own.  In fact I have never seen anyone else do this either… although I’m certain there are many people out there who would be very capable of doing it.  But whenever I’ve asked, I have never had a satisfactory solution that didn’t use code or multiple search & replace operations.

So why am I writing about this now?  Well, this evening ChatGPT was upgraded to GPT-4.  What does this mean?  In a nutshell it means it has become a lot smarter and is now multimodal, which means it can handle various types of data, such as text, images, audio, video, or other sensor-based inputs.  How is it smarter?  The OpenAI website reported that ChatGPT scored in the 10th percentile on a Uniform Bar Exam (a standardized bar exam designed to test the knowledge and skills that are necessary to practice law in a wide range of jurisdictions in the United States), whilst GPT-4 scored in the 90th percentile.  In what’s known as the Biology Olympiad (a series of tests, practical exams, and theoretical problems designed to challenge students’ knowledge, problem-solving skills, and understanding of biological concepts), the vision-powered GPT-4 scored in the 99th percentile, while ChatGPT finished in the 31st percentile.

There are other improvements of course.  A significant one is that it can now handle over 25,000 words of text, which will support the creation of much longer content, extended conversations, and improved document search and analysis.  There’s more too, but the bit I’m interested in was my simple little test!

Not so simple test

This was my test.

Create a regular expression using .NET syntax suitable for use in EditPad Pro that can add thousand separators, using a comma, to any number that may or may not have a decimal separator, using a period, already.

Example data would be:

Input: 123456789.5678
Expected Output: 123,456,789.5678

Input: 3456789.567843356
Expected Output: 3,456,789.567843356

Input: 6789
Expected Output: 6,789

How hard can that be?  I think very hard, given the efforts to date by anyone I have asked and my own failures to solve it.  I often see users in the forums asking for this seemingly simple expression and always getting responses that never really provide one expression to solve it all.

In previous iterations of ChatGPT I have never been able to solve this either, and it often made syntax errors as well.  So this exercise became my acid test for regex success.  In GPT-4 it still failed to do this correctly for the first four attempts, but on the fifth it seems to have got it right!  I’ll provide the full response (as it corrected itself) as it is a great learning experience:

Here’s the corrected regex pattern for use in EditPad Pro that won’t add separators after the decimal place:

Find what: (?<=\d)(?=(\d{3})+(?!\d)(?<!\.\d*))
Replace with: ,$0

This regex pattern breaks down as follows:

  • (?<=\d): a positive lookbehind assertion that checks if there’s a digit immediately before the current position
  • (?=...): a positive lookahead assertion that checks if the pattern inside the parentheses is present, but doesn’t consume any characters
  • (\d{3})+: matches a sequence of one or more groups of exactly three digits
  • (?!\d): a negative lookahead assertion that checks that there isn’t another digit after the matched group of three digits
  • (?<!\.\d*): a negative lookbehind assertion that checks that there isn’t a decimal point followed by zero or more digits before the matched group of three digits

Using this regex pattern in the Find and Replace dialog in EditPad Pro will add thousand separators to the given numbers without adding separators after the decimal point.
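The behaviour described above can also be checked outside an editor.  The following is only a minimal sketch, not part of the ChatGPT response: it assumes a JavaScript engine with ES2018 lookbehind support (which, like .NET, accepts a quantifier inside a lookbehind), and the function name is made up for the illustration.  Because every match is zero-width, the replacement is just a comma, which has the same effect as ,$0 in EditPad Pro.

```typescript
// Pattern from the article: (?<=\d)(?=(\d{3})+(?!\d)(?<!\.\d*))
// Backslashes are doubled only because the pattern is written as a string.
// Assumption: the engine supports variable-length lookbehind (ES2018+).
const thousands = new RegExp("(?<=\\d)(?=(\\d{3})+(?!\\d)(?<!\\.\\d*))", "g");

// Hypothetical helper name, used only for this illustration.
function addThousandSeparators(text: string): string {
  // Each match is an empty position between two digits, so replacing it
  // with "," inserts a comma without removing anything.
  return text.replace(thousands, ",");
}

const samples: string[] = ["123456789.5678", "3456789.567843356", "6789"];
for (const input of samples) {
  console.log(input + " -> " + addThousandSeparators(input));
}
// 123456789.5678 -> 123,456,789.5678
// 3456789.567843356 -> 3,456,789.567843356
// 6789 -> 6,789
```

Engines that require fixed-width lookbehind will reject the (?<!\.\d*) part, so this sketch says nothing about how the pattern behaves in tools built on those engines.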

I used EditPad Pro as the context because it seemed easier to use software it probably knows more about and because it’s my preferred editor.  I could also be sure the same expression should work in Trados Studio as I could test the exact same syntax flavour.

Conclusion

I know this is really a trivial test of a tool like this, but the recent upgrade of ChatGPT to GPT-4 has led to significant improvements in its capabilities, including its performance in solving complex regular expressions.  While previous versions struggled to provide a satisfactory solution for the given regex problem, GPT-4 was ultimately able to generate the correct regex pattern for adding thousand separators without affecting the decimal part.  The successful solution of this problem may not really showcase the impressive advancements in GPT-4’s capability and its potential for providing valuable assistance in various fields, including text and data manipulation, but it does demonstrate its usefulness in supporting yet another task that we are regularly challenged with in localization projects.  As the AI continues to evolve, I think we can all expect even more refined and powerful results, making it an increasingly useful tool for a wide range of applications.

I think there is no doubt that AI is impacting the industry we work in and we are all going to experience a transformation in the nature of work, with a focus on higher-value tasks, post-editing, and collaboration with AI tools.  So I think that to stay competitive in this evolving landscape we do need to consider embracing AI technology and upskilling to handle more complex, specialized, or context-sensitive assignments.  Quite frankly, I have not witnessed anything so disruptive in my lifetime and the pace of change seems quite incredible in our digital world.

So if you’re not already using these types of AI resources and thinking about how you can use them to help you with your work, and also about how you can adapt the way you work so that you are less affected, then there’s no time like the present to get started.  Things will change with you or without you… better with you!

6 thoughts on “The elusive regex with GPT-4”

  1. When will these new GPT AI developments be impacting RWS machine translation and more specifically the Language Weaver resource in Trados Studio?

    1. Hi Ray, I’d be interested to hear what you think we might do with these resources? Language Weaver is a secure machine translation resource where we do not use any of the material companies push through for anything other than to return a translation to you. GPT is not secure or private and they are quite clear about the risks of using it for anything private. I think that whilst the technology itself is very attractive and offers many advantages that we will surely look at, allowing it to be used in an unsecure way for content being returned for translation will never happen. I can tell you we have looked at building apps for the appstore that will take advantage of this promethean technology but to use them you’ll need to set up your own account and take on board the risks of sharing content yourself. These apps we develop will not have anything to do with Language Weaver. Keep an eye on the appstore and you’ll start to see more and more references to it and not just from us. I know that Custom.MT who offer a plugin in the appstore have already integrated with GPT-3.5. They have said this with regard to Data Privacy in this blog article:

      Your data flowing to OpenAI is opted out of the optional sharing program. While OpenAI is still your data subprocessor, we believe that your translations and prompts will not be available to any other users. Custom.MT never stores your data and deletes it as soon as the translation is complete.

      So I think you’ll need to validate the data privacy yourself, as you will have to for anything we do provide through the appstore. But I can’t see anything we do with Language Weaver will change with regard to the very important measures we place on privacy and security for our customers which are the bedrock of our machine translation solutions.

    2. Paul, I’m a little cranky about this one and I’m rather reluctant to post the correction. Hallucinations from ChatGPT are contagious it seems, and you are good enough with regex and tools for checking it that you can see the errors. Your question as it turns out, is an interesting one for which I see a few applications in our translation workflows, so thank you for that. But please look carefully at this. If it took four or five rounds to get the right answer from ChatGPT, then I think you copied your notes from one of those unsuccessful solutions.
      And why EditPad Pro? Try this in Trados Studio and you’ll see the problem. I looked at it in RegexStorm on the tester page and in memoQ, and it can be fixed without much ado, but the two passes for a find & replace aren’t elegant. There are, however, elegant solutions for the fixed regex which can offer placeable hits of the format-corrected numbers in the working grid of memoQ, and the solution can be adapted easily to a number of other common formats. Just for fun I may teach it for the EU English styleguide the DGT puts out. A good way to wake some listeners up in my next class perhaps.
      You never asked me this question, alas. It would have been fun. But I’m glad I found it, however, because it led me to a solution I had not realized that we need, so I owe you a beer next time we cross paths. Beware of this AI nonsense. You know much better than what was (accidentally?) published here.

  2. Hi Paul,
    Interesting article, but there is something wrong with the Regex syntax you mentioned. It does not work at all at my end and I have advanced text editors such as UltraEdit.
    I also do not understand what is the point of having a decimal point followed by a lot of digits as in the examples you have mentioned. I think your Regex
    (?<=\d)(?=(\d{3})+(?!\d)(?<!.\d*))

    contains an error in this part:

    A quantifier inside a lookbehind makes it non-fixed width

    I think we can first get rid of the decimal point here because it is of no significance if we want to have that number read properly in English language, then use

    (\d)(?=(\d{3})+(?!\d))(?!.\d*)

    and replace with

    $1,

    let me know what you think. In all cases, this would certainly need a two stage Regex run, and not just one that does all! That is what I think, so correct me if I am wrong.

    Sameh

    1. Hi Kevin and Sameh,
      Nice to see you mulling over my articles and I’m really happy you found a use for the use case. I don’t agree with your conclusions though 🙂 I thought the best approach would be to just show you how it works, for the very specific examples I discussed here. So here’s a short video explaining:

    2. Hi Sameh,
      On your comments specifically…

      I also do not understand what is the point of having a decimal point followed by a lot of digits as in the examples you have mentioned

      Ever heard of PI? I don’t really understand your question since the point of the exercise here was to simply test an expression that would handle the task I gave it. That’s it, nothing more complicated than that.

      I think we can first get rid of the decimal point here because it is of no significance if we want to have that number read properly in English language, then use

      (\d)(?=(\d{3})+(?!\d))(?!.\d*)

      and replace with

      $1,

      let me know what you think. In all cases, this would certainly need a two stage Regex run, and not just one that does all!

      Well, I hope you can see now that this can be done in one pass. But on your suggestion… removing the decimal point may not always be desirable, especially if the numbers are not meant to be integers or if the decimal portion is significant for the context in which the numbers are used.
      The regular expression (\d)(?=(\d{3})+(?!\d))(?!.\d*) has a small error in the negative lookahead at the end: (?!.\d*) should probably be written as (?!\.\d+) or (?!\.\d*). As it stands, . would match any character, not specifically a decimal point, and \d* will match zero digits, which doesn’t really prevent anything. Therefore, it may not work as expected.
      The main point to keep in mind is whether you want to preserve the decimals or not. If you don’t, then your approach of a two-stage regex run might work well for your needs. However, if you want to keep the decimal points intact, then it becomes more complex and might still require a one-step approach that takes care of both the integer and decimal parts.
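The point about the unescaped dot in this last reply can be checked with a small sketch.  It makes the same assumption as before of a JavaScript engine rather than any of the editors mentioned in the thread, and the helper and variable names are made up for the illustration.  It runs the pattern as posted, and then a variant with the dot escaped, using $1, as the replacement.

```typescript
// Pattern as posted in the comment: the "." in the last lookahead is unescaped.
const unescapedDot = new RegExp("(\\d)(?=(\\d{3})+(?!\\d))(?!.\\d*)", "g");
// Variant with the dot escaped so that it only means a literal decimal point.
const escapedDot = new RegExp("(\\d)(?=(\\d{3})+(?!\\d))(?!\\.\\d*)", "g");

// Hypothetical helper: apply a pattern with the "$1," replacement.
function applyPattern(pattern: RegExp, text: string): string {
  return text.replace(pattern, "$1,");
}

// With the unescaped dot, the character after the captured digit is always
// another digit, "." matches it, and the negative lookahead rejects every
// candidate, so nothing is changed.
console.log(applyPattern(unescapedDot, "6789"));           // 6789
// With the escaped dot the integer case works...
console.log(applyPattern(escapedDot, "6789"));             // 6,789
// ...but the final lookahead only inspects the next character, so digits
// after the decimal point still get a separator.
console.log(applyPattern(escapedDot, "123456789.5678"));   // 123,456,789.5,678
```

In this sketch, escaping the dot alone does not protect the decimal part, which is consistent with the remark that preserving decimals makes the problem more complex.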
