Regex reuse: Difference between revisions

Revision as of 01:01, 5 August 2009

In a project I worked on, thousands of emails were parsed each day, with hundreds of different formats. When a new email would come along which would fail to parse, another regular expression would be added to the Perl program which did the parsing. This soon turned into the most unreadable program one could imagine, and someone needing to tweak it would have no idea where to start. The end solution was to rewrite the app in Python, and use one of Python's lex/yacc tools. Since the described system is encumbered, I'll use the example of needing to parse mailing addresses and stick them into a database. Let's say your an eBay merchant, and people just email you their addresses in any format they choose.

When you start your project, you say, this will only a simple regular expression. Your first address is:

<geshi> 312 Cox Rd Cheyenne, WY 82009 </geshi >

So your write your first expression: <geshi> (\d+) +(\w+) +(\w+)\n(\w+), (\w+) (\d+) </geshi >

This works, but fails on your next address due to an apartment and a two word city.

<geshi> 270 Marin Blvd Apt 20T Jersey City, NJ 07302 </geshi>

Since Java doesn't support named groups, adding another group is going to break your code. You can use a regex library which does support named groups, you can change your code, or have multiple regular expressions. Let's say you modify the regular expression.

Again, things go well until you get this address, which has a fraction, a zip+4, and a period of the street type. <geshi> 93 1/2 E 7th St. New York, NY 10009-5731 </geshi>

Just extend the regex a little more... <geshi> ^(\d+( \d/\d)?) +([NSEW] )?(\w+) +(\w+)(.)?( (\w+) (\w+))?\n([\w ]+), (\w+) (\d{5})(-\d{4})?$ </geshi>

An suppose you want to only accept valid US States and territories.... <geshi> ^(\d+( \d/\d)?) +([NSEW] )?(\w+) +(\w+)(.)?( (\w+) (\w+))?\n([\w ]+), (AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|AS|DC|FM|GU|MH|MP|PW|PR|VI|AE|AA|AE|AE|AE|AP) (\d{5})(-\d{4})?$ </geshi>

Continuing to add more rules to the same regular expression, or adding additional regular expressions to handle all the different cases can soon lead to completely unreadable code. See the [RFC822 regex for an example of a 6000 character long regex expression.

Lastly, you must get the data out of the various capturing groups. Identifying where to find the data in the examples 13 groups can be rather tricky. The built in Java regex package doesn't supported named group, but other languages and libraries do. Using numeric groups, your code will look something like: <geshi> address.setStreetNumber(join(matcher.group(1), matcher.group(2))); address.setStreetName(join(matcher.group(3), matcher.group(4), matcher.group(5))); address.setApartment(matcher.group(9)); address.setCity(matcher.group(10)); address.setState(matcher.group(11)); address.setPostcode(matcher.group(12) + matcher.group(13) == null ? "" : ("-" + matcher.group(13))); </geshi>

Adding a group another group to the regex means all these group number are going to change.

Lexer Solution

Instead of stringing together more and more regex rules, one can use a lexer to make reusable regex components. Here's the token definitions:

<geshi lang="java"> TOKEN : {

 < FRACTION : ["1"-"9"] (["0"-"9"])* "/" ["1"-"9"] (["0"-"9"])* >

| < CARDINAL : "N" | "S" | "E" | "W" | "NORTH" | "SOUTH" | "EAST" | "WEST" > | < USSTATE : "AL" | "AK" | "AZ" | "AR" | "CA" | "CO" | "CT" | "DE" | "FL" | "GA" | "HI" | "ID" | "IL" | "IN" | "IA" | "KS" | "KY" | "LA" | "ME" | "MD" | "MA" | "MI" | "MN" | "MS" | "MO" | "MT" | "NE" | "NV" | "NH" | "NJ" | "NM" | "NY" | "NC" | "ND" | "OH" | "OK" | "OR" | "PA" | "RI" | "SC" | "SD" | "TN" | "TX" | "UT" | "VT" | "VA" | "WA" | "WV" | "WI" | "WY" | "AS" | "DC" | "FM" | "GU" | "MH" | "MP" | "PW" | "PR" | "VI" | "AE" | "AA" | "AE" | "AE" | "AE" | "AP" > | < CANSTATE : "AB" | "BC" | "MB" | "NB" | "NL" | "NT" | "NS" | "NU" | "ON" | "PE" | "QC" | "SK" | "YT" > | < APT : "APT" | "UNIT" > | < USZIP : (["0"-"9"]){5} ("-" (["0"-"9"]){4} )? > | < CANPOST : ["A"-"Z"] ["0"-"9"] ["A"-"Z"] " " ["0"-"9"] ["A"-"Z"] ["0"-"9"] > | < NUMBER : ["1"-"9"] (["0"-"9"])* > | < WORD : (["0"-"9", "A"-"Z"])+ > | < EOL : "\n" | "\r" | "\r\n" > } </geshi>

One can then define the productions:

<geshi lang="java"> Address parse() : {

 Address a = new Address();

} {

LOOKAHEAD(256)
   usAddress(a) { return a; }
 | canadaAddress(a) { return a; }

} void usAddress(Address a) : {

 a.setCountry("USA");

} {

 streetNumber(a)
 streetName(a)
 streetType(a)
 ( apartment(a) )?
 <EOL>
 city(a) (",")?
 state(a)
 zip(a)
 <EOF>

} void zip(Address a) : {

 Token z;

} {

 z = <USZIP>
 { a.setPostcode(z.image); }

} </geshi>

When does it makes sense to use a lexer?

Here's some rough guidelines on when to switch from a few regular expressions to a lexer/parser:

More than a dozen regular expressions
More than three capturing groups
Many similar rules
You start building multiple case/switch statements to keep track of application state. A parser is really just a DSL to create finite state machines.

Back to Parsers Presentation

@@ Line 126: / Line 126: @@
 Back to [[Parsers Presentation]]
+[[Category:Java]]
+[[Category:Parsers]]
+[[Category:Regex]]