Regex reuse: Difference between revisions

From EggeWiki
No edit summary
mNo edit summary
Line 55: Line 55:


Adding a group another group to the regex means all these group number are going to change.
Adding a group another group to the regex means all these group number are going to change.
Back to [[Parsers Presentation]]

Revision as of 20:43, 4 August 2009

In a project I worked on, thousands of emails were parsed each day, with hundreds of different formats. When a new email would come along which would fail to parse, another regular expression would be added to the Perl program which did the parsing. This soon turned into the most unreadable program one could imagine, and someone needing to tweak it would have no idea where to start. The end solution was to rewrite the app in Python, and use one of Python's lex/yacc tools. Since the described system is encumbered, I'll use the example of needing to parse mailing addresses and stick them into a database. Let's say your an eBay merchant, and people just email you their addresses in any format they choose.

When you start your project, you say, this will only a simple regular expression. Your first address is:

312 Cox Rd
Cheyenne, WY 82009

So your write your first expression:

(\d+) +(\w+) +(\w+)\n(\w+), (\w+) (\d+)

This works, but fails on your next address due to an apartment and a two word city.

270 Marin Blvd Apt 20T
Jersey City, NJ 07302

Since Java doesn't support named groups, adding another group is going to break your code. You can use a regex library which does support named groups, you can change your code, or have multiple regular expressions. Let's say you modify the regular expression.

^(\d+) +(\w+) +(\w+)( (\w+) (\w+))?\n([\w ]+), (\w+) (\d+)$

Again, things go well until you get this address, which has a fraction, a zip+4, and a period of the street type.

93 1/2 E 7th St.
New York, NY 10009-5731

Just extend the regex a little more...

^(\d+( \d/\d)?) +([NSEW] )?(\w+) +(\w+)(.)?( (\w+) (\w+))?\n([\w ]+), (\w+) (\d{5})(-\d{4})?$

An suppose you want to only accept valid US States and territories....

^(\d+( \d/\d)?) +([NSEW] )?(\w+) +(\w+)(.)?( (\w+) (\w+))?\n([\w ]+), (AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|AS|DC|FM|GU|MH|MP|PW|PR|VI|AE|AA|AE|AE|AE|AP) (\d{5})(-\d{4})?$

Continuing to add more rules to the same regular expression, or adding additional regular expressions to handle all the different cases can soon lead to completely unreadable code. See the [RFC822 regex for an example of a 6000 character long regex expression.

Lastly, you must get the data out of the various capturing groups. Identifying where to find the data in the examples 13 groups can be rather tricky. The built in Java regex package doesn't supported named group, but other languages and libraries do. Using numeric groups, your code will look something like:

address.setStreetNumber(join(matcher.group(1), matcher.group(2)));
address.setStreetName(join(matcher.group(3), matcher.group(4), matcher.group(5)));
address.setApartment(matcher.group(9));
address.setCity(matcher.group(10));
address.setState(matcher.group(11));
address.setPostcode(matcher.group(12) + matcher.group(13) == null ? "" : ("-" + matcher.group(13)));

Adding a group another group to the regex means all these group number are going to change.

Back to Parsers Presentation