Regular expressions are in the toolbox of most developers. They provide an easy way to match patterns that are just a bit too complex for IndexOf. A friend of mine once said: "regular expressions must have been invented by the devil. They first lure you in with their beauty and terseness, allowing you to do complex things with seemingly simple code. But later, when you encounter problems with your expressions, you find yourself in debugging hell."
And after having fought regular expressions last week, I tend to agree with him.
In a project I'm involved with, we have to rewrite the URLs in HTML pages. The process is pretty simple, at least in pseudo code:
for each URL in the HTML
    determine the new destination
    replace the URL with the new destination
The trickiest part is of course finding the URLs in the HTML in the first place. Parsing HTML is not exactly an exact science, something the browser vendors at least seem to agree on. But for this project we couldn't afford to write our own HTML parsing logic, and apparently there aren't many good HTML parsers available for .NET. So one of our developers decided to use regular expressions to find the URLs.
And as my friend had already predicted, it started out very well. Recognizing href and src attributes in HTML is not that difficult:
(?<=\ssrc|\shref)\s*=\s*([^>\s]*)([^>]*)(>|/>)+
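In .NET, that pseudo code maps fairly directly onto Regex.Replace with a MatchEvaluator. The sketch below is only an illustration of that plumbing, not our production code; DetermineNewDestination is a hypothetical stand-in for whatever mapping logic you actually have.

    using System;
    using System.Text.RegularExpressions;

    class UrlRewriter
    {
        // Stand-in for the real URL-mapping logic of the project.
        static string DetermineNewDestination(string url)
        {
            return "http://new.example.com/" + url.TrimStart('/');
        }

        static string RewriteUrls(string html)
        {
            // The unquoted-attribute expression from above; group 1 holds the URL.
            var regex = new Regex(@"(?<=\ssrc|\shref)\s*=\s*([^>\s]*)([^>]*)(>|/>)+");

            return regex.Replace(html, match =>
            {
                string url = match.Groups[1].Value;
                if (url.Length == 0) return match.Value; // nothing to rewrite
                return match.Value.Replace(url, DetermineNewDestination(url));
            });
        }

        static void Main()
        {
            Console.WriteLine(RewriteUrls("<img src=logo.gif><a href=/page.html>link</a>"));
        }
    }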
It of course becomes a bit more tricky when you also need to handle quotes around the URLs:
(?<=\ssrc|\shref)\s*=\s*(["'])([^\1]*?)\1
I was very surprised that I couldn't find a single construct that matches both quoted and unquoted values, but apparently it really can't be done. So we decided to use separate expressions for them, doubling the required processing time but at least allowing us to rewrite all URLs.
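Concretely, "separate expressions" means two passes over the same document. The sketch below shows roughly what that looks like; the rewrite callback is a placeholder, and the unquoted pass has to skip values the quoted pass already handled, because the unquoted expression happily matches quoted values as well (quotes included).

    using System;
    using System.Text.RegularExpressions;

    class TwoPassRewrite
    {
        static readonly Regex Quoted = new Regex(
            @"(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1");

        static readonly Regex Unquoted = new Regex(
            @"(?<=\ssrc|\shref)\s*=\s*([^>\s]*)([^>]*)(>|/>)+");

        // Placeholder rewrite: prefix every URL with a new host.
        static string Rewrite(Match m, int urlGroup)
        {
            string url = m.Groups[urlGroup].Value;
            if (url.Length == 0 || url[0] == '"' || url[0] == '\'')
                return m.Value; // empty, or a quoted value the other pass handles
            return m.Value.Replace(url, "http://new.example.com" + url);
        }

        static void Main()
        {
            string html = "<a href=\"/a.html\">a</a><img src=/b.gif>";

            // Pass 1: quoted attribute values; pass 2: unquoted ones.
            html = Quoted.Replace(html, m => Rewrite(m, 2));
            html = Unquoted.Replace(html, m => Rewrite(m, 1));

            Console.WriteLine(html);
        }
    }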
Things got even trickier once we found out that we needed to treat base tags differently from other tags and exclude them from the result:
(?<!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*(["'])([^\1]*?)\1
This last one uses a construct called "zero-width negative lookbehind assertion", which I had never heard of before. But it seemed to do exactly what we needed: only match when the URL is not inside a base tag.
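As an aside: this construct only works because .NET is one of the few regex engines that allows variable-length lookbehinds. A toy example, unrelated to our production expression, to show what it does:

    using System;
    using System.Text.RegularExpressions;

    class LookbehindDemo
    {
        static void Main()
        {
            // Zero-width negative lookbehind: match "href" only when it is not
            // preceded, within the same tag, by "<base ".
            const string pattern = @"(?<!<base\s[^>]*)href";

            Console.WriteLine(Regex.IsMatch("<a href='x'>", pattern));    // True
            Console.WriteLine(Regex.IsMatch("<base href='x'>", pattern)); // False
        }
    }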
Observant readers might have noticed that this expression matches hrefs anywhere, including in things like HTML comments. We decided not to call that a bug, but to label those as unusual HTML constructs. Anything to keep the project moving forward. :-)
Until of course... we ran into performance problems. As it turns out, the last regular expression takes over 600 ms to find 20 matches in a 21 KB HTML document. And if you run a sequence of similarly complex regular expressions to handle different constructs against that HTML document, the time to rewrite the HTML quickly becomes unacceptable (over 3 seconds in our case). And since the original developer who wrote these regular expressions had since left the project, I was tasked with finding a way to optimize them. Oh lucky me!
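If you want to reproduce that kind of number yourself, a Stopwatch around the match call is all it takes. A minimal sketch (it assumes the page has been saved locally as page.html; MatchCollection is lazy, so you have to ask for Count to force it to find every match):

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Text.RegularExpressions;

    class RegexTiming
    {
        static void Main()
        {
            string html = File.ReadAllText("page.html");

            var regex = new Regex(
                @"(?<!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1");

            var watch = Stopwatch.StartNew();
            int matches = regex.Matches(html).Count; // Count forces full evaluation
            watch.Stop();

            Console.WriteLine("{0} matches in {1} ms", matches, watch.ElapsedMilliseconds);
        }
    }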
If you don't know regular expressions very well, the process of optimizing them can turn into a very mind-numbing activity. Remove a character here or there, re-run, check whether the results are the same as before and whether it is faster. By now I know this process pretty well, because I tried it for a day and a half. And the results have not been too promising: about a 25% performance improvement. The trick was to replace the zero-width negative lookbehind assertion with a zero-width negative lookahead assertion. To you and me that means removing the < (less than) after the first question mark:
(?!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*(["'])([^\1]*?)\1
A 25% performance improvement might sound good, but the response times are still unacceptable, and I have the feeling that there is still plenty of room for improvement in there.
I recall having to optimize a pretty complex XSLT a few years ago. It quickly turned into a similar exercise of "change and test", without any idea of the likelihood of success. That is... until I discovered the option to compile the XSLT into code (Java code in this case). The idea behind allowing you to do that is that the Java code would execute faster and take less memory while running. But an additional benefit I found was... the fact that Java code can be profiled. I still recall the excitement of the first few runs of my XSLT in a profiler: analyzing the call stacks, finding out that named templates become methods named template[name] and that match templates are translated into methods named template[index]. It was great, and it allowed me to tune the XSLT within days.
Now if only something similar existed for regular expressions. I know .NET has the ability to compile them into an assembly. But I'm using the Professional edition of Visual Studio, so I don't have access to the profiler. And I'm not aware of any good, free or cheap profiler for .NET.
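For completeness, this is roughly what the two compilation flavours look like. RegexOptions.Compiled compiles the expression to IL inside the current process, while Regex.CompileToAssembly writes it out as a separate assembly on disk (full .NET Framework only). The type and assembly names below are made up for the sketch:

    using System.Reflection;
    using System.Text.RegularExpressions;

    class CompileRegexes
    {
        static void Main()
        {
            const string pattern =
                @"(?!<\s*base [^>]*)(?<=\ssrc|\shref)\s*=\s*([""'])([^\1]*?)\1";

            // In-process compilation to IL: slower to construct, faster per match.
            var compiled = new Regex(pattern, RegexOptions.Compiled);
            compiled.IsMatch("<a href='/x.html'>");

            // Compile into a standalone assembly (type UrlAttributeRegex in namespace
            // Rewriting), which ends up as Rewriting.Regexes.dll on disk.
            var info = new RegexCompilationInfo(pattern, RegexOptions.None,
                                                "UrlAttributeRegex", "Rewriting", true);
            Regex.CompileToAssembly(new[] { info }, new AssemblyName("Rewriting.Regexes"));
        }
    }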
Meanwhile, does anyone have an idea of how to further optimize this regular expression? If not, I'll probably see whether I can reduce the number of expressions we run by merging some of them together. It will probably hurt readability even further, but at this stage I'll sacrifice readability to get closer to shipping. God, I know I will hate myself for that in a few months' time.