Tuesday, April 10, 2007

How to optimize a regular expression

Regular expressions are in the toolbox of most developers. They provide an easy way to match patterns that are just a bit too complex for IndexOf. A friend of mine once said: "regular expressions must have been invented by the devil. They first lure you in with their beauty and terseness, allowing you to do complex things with a seemingly simple code. But later, when you encounter problems with your expressions, you find yourself in debugging hell."

And after having fought regular expressions last week, I tend to agree with him.

In a project I'm involved with we have to rewrite the URLs in HTML pages. The process is pretty simple, especially in pseudo code:

for each URL in the HTML
determine the new destination
replace the URL with the new destination
The most tricky part is of course in finding the URLs in the HTML in the first place. Parsing HTML is not exactly exact science, something the browser vendors at least seem to agree upon. But for this project, we couldn't afford writing our own HTML parsing logic and apparently there aren't many good HTML parsers available for .NET. So one of our developers decided to use regular expressions to find the URLs.

And as my friend had already predicted, it started out very well. Recognizing href and src attributes in HTML is not that difficult:
(?<=\ssrc|\shref)\s*=\s*([^>\s]*)([^>]*)(>|/>)+
It of course becomes a bit more tricky when you also need to handle quotes around the URLs:
(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1
I was very surprised that I couldn't find a single construct that matches both quoted and unquoted values, but apparently it really can't be done. So we decided to use separate expressions for them, doubling the required processing time but at least allowing us to rewrite all URLs.

Things got even more tricky once you we found out that we needed to treat base tags different from other tags and want to exclude them from the result:
(?!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1
This last one uses a construct called "zero-width negative lookbehind assertion", which I had never heard of before. But it seemed to do exactly what we needed: only match when the URL is not inside a base tag.

Observing readers might have noticed the fact that this expression matches hrefs anywhere, so also in things like HTML comments. We decided to not call that a bug, but label them as unusual HTML constructs. Anything to keep the project going forward. :-)

Until of course... we ran into performance problems. As it turns out the last regular expression takes over 600ms to find 20 matches in a 21Kb HTML document. And if you run a sequence of similarly complex regular expressions to handle different constructs against that HTML document, the time to rewrite the HTML quickly becomes unacceptable (over 3 seconds in our case). And since the original developer who wrote these regular expressions had since left the project, I was tasked with searching for a way to optimize them. Oh lucky me!

If you don't know regular expressions very well, the process of optimizing them can turn into a very mind numbing activity. Remove a character here or there, re-run, check whether the results are the same as before and whether it is faster. By now I know this process pretty well, because I tried it for a day and a half. And the results have not been too promising: about 25% improvement of the performance. The trick was to replace the "Zero-width positive lookahead assertion" with a "zero-width positive lookahead assertion". To you and me that means removing the < (less then) after the first question mark:
(?!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1
A 25% performance improvement might sound good, but the response times are still unacceptable and I have the feeling that there is quite some room for more improvement in there.

I recall having to optimize a pretty complex XSLT a few years ago. It quickly turned into a similar exercise of "change and test", without any idea of the likeliness of success. That is... until I discovered the option to compile the XSLT into code (Java code in this case). The idea behind allowing you to do that is that Java code would execute faster and would take less memory while executed. But an additional benefit I found was... the fact that Java code can be profiled. I still recall the excitement of the first few runs of my XSLT in a profiler. The analyzing of the call stacks, the finding out that named templates become methods named temlate[name] and match templates are translated into methods named template[index]. It was great and it allowed me to tune the XSLT within days.

Now if only something similar exists for regular expressions. I know .NET has the ability to compile them into an assembly. But I'm using the pro version of Visual Studio, so I don't have access to the profiler. And I'm not aware of any good, free or cheap profiler for .NET.

Meanwhile, does anyone have an idea of how to further optimize this regular expression? If not, I'll probably see whether I can reduce the number of expressions we run by merging some of them together. It will probably hurt readability even further, but at this stage I'll sacrifice readability to get closer to shipping. God, I know I will hate myself for that in a few months time.

16 comments:

Alan Moore said...

I know it's been a while, but if you're still waiting for an answer, try this (remove the line breaks):

(<(?!base\b)\w+\s+(?>(?:[^>sh]+|s(?!rc\b)
|h(?!ref))*))(src|href)\s*=\s*(?:(["'])
((?>(?:(?!\2).)*))\2|((?>[^>\s]*)))

Readability is a joke, of course, but it should be a lot faster than what you posted. I'll try to explain how it works if you're interested.

Frank van Puffelen said...

Hi Alan,

An explanation would be most appreciated, as I find it hard to understand the differences and especially why this version would result in better performance.

I'll see if I can get the project working again on my work machine early next week, so I can run some tests with your suggested regex. We were able to ship the product with the performance we got, but I wouldn't mind increasing performance without having to redo the entire architecture in the next upgrade.

Frank

Alan Moore said...

In general, it's much easier to match something than to not match something, and it's easier and faster to match according to positive rather than negative criteria. That's why I rewrote the regex to match everything from the beginning of the tag to the src/href attribute. If you know which exact elements you want to change, you can speed things up some more by listing them, instead of matching "anything but a BASE tag" like I did.

The most effective thing you can do to speed up regex matching is to make sure that, if a match attempt is going to fail, it does so as quickly as possible. And the best way to do that is to eliminate unnecessary backtracking. That's what atomic groups are for (they're the ones that start with "(?>" ), but you have to make sure the subexpression inside the group doesn't match the part that's supposed to come after it, which can be fairly tricky.

For instance, after matching the tag name and the following whitespace, I wanted to skip over anything else preceding the src/href attribute. I could have used [^>]*?, but non-greedy quantifiers are inherently slower than greedy ones, and I'm going for maximum speed. If I use a greedy quantifier, I have to make sure the quantified subexpression doesn't match the attribute names I'm looking for. That way, if the next thing the regex sees after the quantifier quits is not "src" or "href", I know there's no point backtracking into what it just matched, so it's safe to wrap that subexpression in an atomic group.

The trickiest part (aside from understanding what this backtracking/atomic group stuff is all about) is coming up with a subexpression that excludes certain multi-character sequences while running as quickly as possible. The first-character discrimination gimmick I used here is the best way I've come up with so far. Another option is the lookahead-and-dot approach I used later on, but that means doing a lookahead at every position. This way only does them after certain characters are seen.

Speaking of not-matching things, there's an error in the final part of your regex, where you match the URL. A backreference like \1 can match any number of characters, but a character class matches exactly one character at a time. In Java, a backreference in a character class is treated as a syntax error; I don't know what .NET thinks it means, but it isn't working the way you obviously expect it to (if it were, you wouldn't need the non-greedy quantifier there). My regex shows a more correct way to match double-quoted and single-quoted values with one construct, but for the sake of speed and maintainability, I recommend using a separate alternative for each.

I just realized there's an error in that part of my regex, too; I used \2 when I should have used \3. Here's the corrected regex:

(<(?!base\b)\w+\s+(?>(?:[^>sh]+|s(?!rc\b)|h(?!ref\b))*))
(src|href)\s*=\s*(?:(["'])((?>(?:(?!\3).)*))\3|((?>[^>\s]*)))

The part before the line break matches everything preceding the src/href attribute and saves it in group #1. The attribute name is saved in group #2, and the URL will be in group #4 or #5, depending on whether it was quoted or not. Below is the separate-alternatives version; in it, the URL is saved in group #3, #4 or #5.

(<(?!base\b)\w+\s+(?>(?:[^>sh]+|s(?!rc\b)|h(?!ref\b))*))
(src|href)\s*=\s*(?:"((?>[^"]*))"|'((?>[^']))'|((?>[^>\s]*)))

On a side note, in Java I would use possessive quantifiers instead of atomic groups; they're so much neater! I never use atomic groups in Java. I could have used them more effectively here, but the readability tradeoff got to be too much even for me. ^_^

Frank van Puffelen said...

Thanks for the incredibly detailed explanation. While writing this post I already knew which aspects of the regex were hurting performance, but wasn't sure how to solve them. It seems like you have way more experience with it than I have.

Separating capture groups per quote type indeed improves readability. I recently saw someone use it in one of our other projects and decided that I'll also introduce that in the next version. Aside readability it also means that you can match quoted and non-quoted hrefs with a single regex. Of course it does mean that we'll need to modify the code slighty to handle the three capture groups, but that is a small price to pay for this improvement.

Thanks for the advice and the explanation. Once I run some performance tests (early next week), I'll let you know the results.

Frank van Puffelen said...

Hi Alan,

Your expression indeed seems to run quite a lot faster. That's great!

But unfortunately we now hit way too many patterns in our code. It seems that changing this one expression, caused a lot of our other regexes to hit the same URLs. So in those cases we're now rewriting the same URL multiple times. Which unfortunately is not very good. :-(

Is there any way to keep the speed increase, but make sure that the start of the tag (everything up to the href) is not included in the match?

Alan Moore said...

I don't see how the change I suggested could lead to the result you describe, but that's probably because I don't know how the regexes are being applied. However, there is another efficiency concern I thought of after my last comment.

The full regex only gets applied when the angle bracket marking the beginning of a tag is seen. If it fails to match, the regex engine backs up to the angle bracket and starts looking again at the very next position. What you would like it to do is skip over the whole tag before starting to search again. You might be able to use this regex to accomplish that:

(<(?!base\b)\w+\s+(?>(?:[^>sh]+|s(?!rc\b)|h(?!ref\b))*))
(?>(?:(src|href)\s*=\s*(?:"((?>[^"]*))"|'((?>[^']))'|((?>[^>\s]*))))?)[^>]*>

Th regex now matches an entire start tag (except BASE tags), whether it has a src/href attribute or not. If there is one, it will be captured as before, but if you check group #2 and find it empty (or maybe null), you know there's no URL to rewrite in the current tag. Whether you can use that given the existing structure of your app, I don't know. I hope it helps.

Frank van Puffelen said...

Hi Alan,

Thanks for all the help. I'm slowly picking up some things about where your pattern would speed things up. There are still some problems with it, that I can't quite figure out.

Unfortunately it now doesn't seem to match the href in a string like:
<li><a href="/subscription/subscription.jsp">Subscribe</a></li>

It does match href in the following:
<link rel='stylesheet' href='/multimedia/main.css' />

But the single quote is somehow part of the match group.
{1}=<link rel='stylesheet'
{2}=href
{3}=
{4}=
{5}='/multimedia/main.css'

For double quoted href's, the URL is in group 3 and the quote is not part of the match group.

And do you have any idea where group 4 is coming from? It is always empty in the cases I have seen.

Thanks in advance,
Frank

Alan Moore said...

The problem is that I left out an asterisk. This:

'((?>[^']))'

will only match a single-quoted string consisting of exactly one non-apostrophe character. I meant to write this:

'((?>[^']*))'

When that alternative fails, the final alternative matches the attribute value instead, and it treats the apostrophe as just another character.

This is a good illustration of why, when you work with well-known formats like HTML, you should prefer dedicated tools over regexes. You're less likely to make mistakes like this, and when you do, you tend to get more useful feedback. (I know you didn't have that option, Frank; this comment is directed at anyone who's still in the "lured in" phase.) ^_^

Frank van Puffelen said...

Hi Alan,

No offense taken. I would rather have used another solution as well. :-) Although I am interested if anyone knows of a good HTML parser that is available for both Java and .NET. The latter is especially problematic as Tidy versions for .NET are few and far between. :-(

But back to the regular expression. The * indeed fixes the problems with single quoted string. I now finally realized that the 5th match group is used if there is an unquoted href. I've removed that group for the moment, since I don't want to disturb the rest of the code too much. Once the regex works reliable with quoted refs, I'll add it back in for some further speed gains.

But it seems that the expression currently doesn't match the href in the following:

<div id="logo"><a href="/"><img border="0" alt="" src="server/show.ar?ticks=633114792620000000&url=/multimedia/logo_tridion.gif&base_url=http%3A//www.salesdemo.com/&profile=profile1" width="192" height="32"/></a></div>

The src attribute of the image was matched and rewritten, but somehow it missed the href. I don't really understand why this is, since all match groups use a * quantifier.

Maybe I should just send you my test file, instead of just providing one problem area at a time?

Again, thanks for all your explanations. I hope others find this thread as educational and entertaining as I do.

Alan Moore said...

It works for me. If you're writing these regexes in the form of C# string literals, you may need to escape the quotation marks by doubling them:

""((?>[^""]*))""

Although, if that were the problem, I would have expected it to throw an exception instead of (sometimes) failing to match. But I don't do C#, so I could be wrong.

If you want to prevent the final alternative from matching quoted attributes, you can disallow quoting characters as well as the other stuff:

((?>[^>\s""']*))

I probably should have done that from the beginning.

Frank van Puffelen said...

I've correctly escaped the double quotes. As a Java developer I always type the wrong sequence the first time, but luckily the compiler is pretty good at catching that type of mistake.

So the fragment above results in 2 matches when you try it? It is really weird then, that it doesn't do so for me. I'll try the fragment in a more confined test to see if I can get further.

Frank van Puffelen said...

Some more experimenting shows that (in my test cases) the

<a href="/">

isn't being matched because there is no space between the closing double quote and the >.

Frank van Puffelen said...

Ah.... I seem to have missed the * right at the end of your regex, during previous copy actions. With that it matches the &a href="/"> correctly.

sothim said...

I am quite sure that this optimization discussion could be totally avoided. Regex for optimization should be converted into an FSA on which the minimization should be applied - this is prooven to yield the minimal automaton possible that accepts this language and thus give the best performance and is even canonical! Thus the notation of the regex should not matter.

Any decent regex impl should do this either automatically or offer some optimize function to generate such an optimized matcher.

See:
http://en.wikipedia.org/wiki/Finite_state_machine#Optimization

A complete algorithm should be found eg in Hopcroft's Intro to automata theory.

Frank said...

Thanks for the link Sothim. I can only wish thath the Microsoft regex implementers read it and apply the suggested optimization to their regex engine. I sure would've liked to spend the time elsewhere. Although I have to admit, optimizing this kind of stuff is also fun to do.

Anonymous said...

Hi found good contents on the blog,
i am new to regex and have one query about it.
Suppose if i have an expression (hello1) | (hello2) |(hello3)
Now on regex match i want to know which subexpression got matched, how can i get this information.