Really, really long regular expressions – don’t do it!

Posted by chad on November 08, 2007

For one of my projects, we have a list of around 100 domains in a ‘blacklist’. Because of the way the blacklist works, we need to allow non-programmers to enter sites using simple wildcards. I then convert each wildcard entry into a regular expression:

Regexp.new("^#{regexp.gsub(".","\\.").gsub("*",".*")}$",Regexp::IGNORECASE)
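The one-liner above only escapes dots, so an entry containing another regex metacharacter (say a `+` or a `?`) would change the meaning of the pattern. Here is a minimal sketch of a stricter conversion. The helper name `wildcard_to_regexp` is made up for illustration and is not part of the original project; it also assumes whole-string anchors (`\A` and `\z`) are wanted instead of the line anchors `^` and `$`.

```ruby
# Hypothetical helper (not from the original project): turn a simple
# wildcard entry such as "*.example.com" into a case-insensitive Regexp.
def wildcard_to_regexp(wildcard)
  # Regexp.escape quotes every metacharacter, including "*" (as "\*").
  escaped = Regexp.escape(wildcard)
  # Turn each escaped "*" back into ".*" so it matches any run of characters.
  body = escaped.gsub('\*') { '.*' }
  # \A and \z anchor to the whole string; ^ and $ only anchor to a line.
  Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE)
end

pattern = wildcard_to_regexp("*.example.com")
puts "blacklisted" if "WWW.Example.com" =~ pattern
puts "allowed" unless "wwwXexampleXcom" =~ pattern
```

Note that with this sketch `*.example.com` does not match the bare `example.com`, since the pattern still requires the literal dot.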

When spidering sites, we exclude any site we find that is on this list. I was building a regular expression for each entry and ‘unioning’ them all together into one mega-regular expression. Then, to see if a domain is on the ‘blacklist’, I just do:

domain =~ my_mega_regex
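For illustration, here is roughly how such a union could be assembled and queried. This is a sketch with three made-up entries, not the project's real list, and `blacklisted?` is a hypothetical helper name.

```ruby
# Hypothetical blacklist entries, already converted to Regexp objects.
blacklist = [
  /\A.*\.example\.com\z/i,
  /\Aspam\.test\z/i,
  /\Aads\..*\z/i
]

# Regexp.union joins its arguments with "|" into a single pattern.
# The splat passes the array elements as individual arguments.
my_mega_regex = Regexp.union(*blacklist)

# =~ returns the match position or nil; convert that to true/false.
def blacklisted?(domain, mega_regex)
  !(domain =~ mega_regex).nil?
end

puts blacklisted?("www.example.com", my_mega_regex)
puts blacklisted?("ruby-lang.org", my_mega_regex)
```

With only a handful of entries this works fine; the trouble described next only shows up as the union grows.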

However, I found that a regular expression with more than about 159 separate alternatives causes Ruby to segfault. This happens on Ruby 1.8.4 and 1.8.5. Here’s the simplest code I can repro this with:

r = Regexp.new(/^$/); 1.upto(1000) { |i| r = Regexp.union(r, Regexp.new("^#{i}$")); puts i unless "foo" =~ r }
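One way to sidestep the giant union is to keep the individual patterns in an array and test them one at a time. This is only a sketch of that idea, not necessarily the fix used in the project; `BLACKLIST` and `on_blacklist?` are hypothetical names, and the two entries stand in for the real list.

```ruby
# Hypothetical blacklist; in practice this would hold the ~100 converted patterns.
BLACKLIST = [
  /\A.*\.example\.com\z/i,
  /\Aspam\.test\z/i
].freeze

# True if any single pattern matches. No combined regular expression is built,
# so no individual pattern ever gets longer than one blacklist entry.
def on_blacklist?(domain, patterns = BLACKLIST)
  patterns.any? { |pattern| domain =~ pattern }
end

puts "skipping mail.example.com" if on_blacklist?("mail.example.com")
puts "crawling ruby-lang.org" unless on_blacklist?("ruby-lang.org")
```

For a list of around 100 entries, looping over the patterns costs a few dozen small matches per domain, which is a reasonable trade for not depending on how the regex engine copes with one very large pattern.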
