Copybook parser

Latest revision as of 10:09, 22 September 2009

Copybooks

Parsers can be used on all sorts of data, not just computer languages; any data with some structure to it is a candidate. When I joined the Cards project, I found that we connected to a mainframe running COBOL. Communication with the mainframe is done by exchanging [copybook] messages with the server. A copybook defines the physical layout of a fixed-length record. Here's an example of the kind of copybook that we needed to use.

000100*N NAG0106                                                        PRCITXDA 
000200******************************************************************PRCITXDA 
000300*    YADA YADA YADA                                              *PRCITXDA 
000400*    SOME COPYBOOK RECORD                                        *PRCITXDA 
000500******************************************************************PRCITXDA 
000600 01  AU-CR-INT-INTERFACE-REC.                                     PRCITXDA 
000700     05  AU-REC-TYPE-CD                  PIC  X(01).              PRCITXDA 
000800         88  AU-TAX-DETAIL                    VALUE 'D'.          PRCITXDA 
000900         88  AU-TAX-TRAILER                   VALUE 'Z'.          PRCITXDA 
001000     05  FILLER                          PIC  X(98).              PRCITXDA 
001100*                                                                 PRCITXDA 
001200 01  AU-CR-INT-DETAIL-REC  REDEFINES                              PRCITXDA 
001300                           AU-CR-INT-INTERFACE-REC.               PRCITXDA 
001400     05  AU-TAX-DETAIL-REC-CD            PIC  X(01).              PRCITXDA 
001500         88  AU-TAX-DETAIL-REC                VALUE 'D'.          PRCITXDA 
001600     05  AU-ACCOUNT-NO                   PIC S9(17)     COMP-3.   PRCITXDA 
001700     05  AU-DETAIL-PROD-ID               PIC  X(04).              PRCITXDA 
001800         88  AU-DETAIL-PROD                   VALUE 'CP  '.       NAG0106 
001900     05  AU-NAB-COMP-EXT-CD              PIC  9(02).              PRCITXDA 
002000     05  AU-TYPE-IND                     PIC  X(02).              PRCITXDA 
002100     05  AU-FROM-EFF-DATE-CYMD           PIC S9(09)     COMP-3.   PRCITXDA 
002200     05  AU-CUST-RESIDENT-CD             PIC  X(01).              PRCITXDA 
002300     05  AU-REASON-CD                    PIC  X(01).              PRCITXDA 
002400     05  AU-CR-INT-EARNED-AMT            PIC S9(15)V99  COMP-3.   PRCITXDA 
002500     05  AU-CR-INT-TAX-AMT               PIC S9(11)V99  COMP-3.   PRCITXDA 
002600     05  AU-CARDHOLDER-NBR               PIC S9(17)     COMP-3.   PRCITXDA 
002700     05  AU-INPUT-SOURCE-ID              PIC  X(04).              PRCITXDA 
002800     05  AU-PGM-ACTION-IND               PIC  X(01).              PRCITXDA 
002900     05  AU-PRINC-AMT                    PIC S9(13)V99  COMP-3.   PRCITXDA 
003000     05  AU-WH-PRIN-TAX-EXMPT-AMT        PIC S9(13)V99  COMP-3.   PRCITXDA 
003100     05  AU-TAX-EXMPT-CERT-NO            PIC  X(07).              PRCITXDA 
003200     05  AU-WH-PRINC-TAX-AMT             PIC S9(11)V99  COMP-3.   PRCITXDA 
003300     05  AU-DETAIL-FILLER                PIC  X(14).              PRCITXDA 
003400*                                                                 PRCITXDA 
003500 01  AU-CR-INT-TRAILER-REC REDEFINES                              PRCITXDA 
003600                           AU-CR-INT-INTERFACE-REC.               PRCITXDA 
003700     05  AU-TAX-TRAILER-REC-CD           PIC  X(01).              PRCITXDA 
003800         88  AU-TAX-TRAILER-REC               VALUE 'Z'.          PRCITXDA 
003900     05  AU-TRAILER-PROD-ID              PIC  X(04).              PRCITXDA 
004000         88  AU-TRAILER-PROD                  VALUE 'CP  '.       NAG0106 
004100     05  AU-TRAILER-DATE-CYMD            PIC  9(08).              PRCITXDA 
004200     05  AU-TRAILER-TIME                 PIC  9(08).              PRCITXDA 
004300     05  AU-TRAILER-REC-CNT              PIC S9(13)     COMP-3.   PRCITXDA 
004400     05  AU-TRAILER-FILLER               PIC  X(71).              PRCITXDA 
004500*                                                                 PRCITXDA 
004600**** END OF PRCITXDA ******************************************** PRCITXDA

Initial approach

The initial approach to digesting and composing the messages was to use a kind of iterator that would consume or write fields. For each message, a Java class was written by hand listing the fields, their sizes, and their order. The iterator would take that list and either read the field values into a hash map or write them out from it. This worked, but every additional copybook required the same manual effort to translate the field mapping. It was also error-prone, since not every developer understood the nuances of the copybook format.
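To make the hand-coded approach concrete, here is a minimal sketch of what such a class might have looked like. The class, the `Field` type, and the `read` method are hypothetical names, not the project's actual code. Only the first four fields of the trailer record are listed, all of them plain text; packed (COMP-3) fields would need byte-level handling that is left out here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HandCodedTrailer {
    /** One fixed-width field: a name and a size in characters (hypothetical type). */
    static final class Field {
        final String name;
        final int size;

        Field(String name, int size) {
            this.name = name;
            this.size = size;
        }
    }

    // Transcribed by hand from the copybook; a typo in any size shifts every later field.
    static final List<Field> FIELDS = Arrays.asList(
        new Field("AU-TAX-TRAILER-REC-CD", 1),
        new Field("AU-TRAILER-PROD-ID", 4),
        new Field("AU-TRAILER-DATE-CYMD", 8),
        new Field("AU-TRAILER-TIME", 8));

    /** Walks the field list in order and slices each value out of the message. */
    static Map<String, String> read(String message) {
        Map<String, String> values = new LinkedHashMap<>();
        int offset = 0;
        for (Field field : FIELDS) {
            values.put(field.name, message.substring(offset, offset + field.size));
            offset += field.size;
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample data: record code, product id, date, time.
        System.out.println(read("ZCP  2009092210090000"));
    }
}
```

Every new copybook means another list like `FIELDS` typed in by hand, which is exactly the repetitive step a parser can take over.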

Parser approach

Looking at the copybook data, I figured it wouldn't be too difficult to parse it and generate Java code. A parser could normalize the copybook into a usable structure, and that structure could then be used to read and write the record data itself. The next choice is whether to parse the copybook at compile time or at runtime. The advantage of parsing at runtime is the flexibility to handle new and changing formats without recompiling your program. The compile-time approach gives you code completion and compile-time checks. For example, with a runtime parse you might have code which looks like:

void setCard(String card) {
   if (card.length() != fields.get("AU-CARDHOLDER-NBR").fieldSize()) {
       return;
   }
   fields.get("AU-CARDHOLDER-NBR").setValue(card);
}

With generated code, the compile-time equivalent might look like this:

void setCard(String card) {
   if (card.length() != AU_CARDHOLDER_NBR_LENGTH) {
       return;
   }
   setAU_CARDHOLDER_NBR(card);
}
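The parsing step itself can be sketched as follows. This is an illustration under assumptions, not the parser from the download: it assumes standard fixed-format COBOL columns (1-6 sequence number, 7 indicator, 8-72 statement, 73-80 identification) and recognizes only single-line elementary items with a PIC clause. Group items, 88 levels, and the two-line REDEFINES entries are skipped. `CopybookLineParser`, `Item`, `lengthConstant`, and `fixedFormat` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CopybookLineParser {
    /** Normalized form of one elementary item (hypothetical structure). */
    static final class Item {
        final int level;
        final String name;
        final String picture;
        final boolean packed;

        Item(int level, String name, String picture, boolean packed) {
            this.level = level;
            this.name = name;
            this.picture = picture;
            this.packed = packed;
        }
    }

    // level, name, PIC string, optional COMP-3, terminating period.
    private static final Pattern ITEM = Pattern.compile(
        "^\\s*(\\d{2})\\s+([A-Z0-9-]+)\\s+PIC\\s+(\\S+?)(?:\\s+(COMP-3))?\\s*\\.\\s*$");

    static Optional<Item> parse(String line) {
        // Column 7 is the indicator area; '*' marks a comment line.
        if (line.length() < 8 || line.charAt(6) == '*') {
            return Optional.empty();
        }
        // Keep columns 8-72; the tag in columns 73-80 (e.g. PRCITXDA) is dropped.
        String text = line.substring(7, Math.min(line.length(), 72));
        Matcher m = ITEM.matcher(text);
        if (!m.matches()) {
            return Optional.empty(); // group item, 88 level, or continuation line
        }
        return Optional.of(new Item(
            Integer.parseInt(m.group(1)), m.group(2), m.group(3), m.group(4) != null));
    }

    /** Name of the Java length constant a generator might emit for this item. */
    static String lengthConstant(Item item) {
        return item.name.replace('-', '_') + "_LENGTH";
    }

    /** Builds a fixed-format line so the column positions in examples are exact. */
    static String fixedFormat(String sequence, String text, String id) {
        return String.format("%-6s %-65s%s", sequence, text, id);
    }

    public static void main(String[] args) {
        String line = fixedFormat("002600",
            "    05  AU-CARDHOLDER-NBR               PIC S9(17)     COMP-3.", "PRCITXDA");
        parse(line).ifPresent(item ->
            System.out.println(lengthConstant(item) + " for PIC " + item.picture));
    }
}
```

A runtime parser would keep the resulting items in a map keyed by field name, while a code generator would walk them once and write out constants and accessors.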

Parser advantages

I decided to go with the code generation approach for these reasons:

Copybook formats don't change very often. If we do get a new format that is not completely compatible with the previous version, we get compile-time errors instead of runtime errors. Since we were using a statically typed language (Java), we could use the compiler to help find any problems with the new spec.
Code completion. Using an IDE which offers code completion means that the copybook field names can pop up when we press the dot key. This is quicker than thumbing through the copybook spec to find the fields that you're interested in.
Speed. This wasn't really a concern on our project, but the compiled generated code will almost always be faster than a runtime equivalent.

Once the parser was in place and reliably generating code, it was an easy replacement for the hand-coded classes. Using a parser to parse the copybooks and generate Java code provided these benefits:

Kept the code [DRY]. The definition of the copybook format was kept in one place - the file which the vendor gave us.
Made the common case easy, the difficult case possible. One of the record formats contained over 2,000 fields. Calculating field lengths and offsets by hand would have been nearly impossible in this case.
Kept the knowledge in the code. Copybooks have some quirks, such as redefines. Another is the set of rules for calculating field lengths. For example, one developer might think a PIC 9(11)V99 is 14 bytes long, but in fact it's only 13 bytes, because the V marks an implied decimal point and takes up no storage. The rules for calculating these lengths were in code, and in one place. If there was a mistake, it could be fixed in one place as well.
Was fun. Writing code is generally more stimulating than performing a repetitive task. Computers are quite happy to do repetitive tasks, so keeping the programmer from doing such keeps him/her happy.
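As an illustration of the length rules mentioned above, here is a minimal sketch. It is not the code from the download. It assumes only the picture symbols that appear in the sample copybook (X, 9, S, V), no SIGN IS SEPARATE clause, and the usual packed-decimal layout for COMP-3: two digits per byte, with the last half-byte holding the sign. `PicLength` and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PicLength {
    // One picture symbol (X or 9) with an optional repeat count: X(04), 9(11), 9.
    private static final Pattern SYMBOL = Pattern.compile("([X9])(?:\\((\\d+)\\))?");

    /** Counts character and digit positions; S and V are skipped because they add none. */
    static int positions(String picture) {
        Matcher m = SYMBOL.matcher(picture);
        int count = 0;
        while (m.find()) {
            count += (m.group(2) == null) ? 1 : Integer.parseInt(m.group(2));
        }
        return count;
    }

    /** Storage size in bytes for a display field or, when packed is true, a COMP-3 field. */
    static int byteLength(String picture, boolean packed) {
        int n = positions(picture);
        // COMP-3: n digits plus a sign nibble, rounded up to whole bytes.
        return packed ? n / 2 + 1 : n;
    }

    public static void main(String[] args) {
        System.out.println(byteLength("9(11)V99", false));  // 13, not 14
        System.out.println(byteLength("S9(17)", true));     // 9
        System.out.println(byteLength("S9(15)V99", true));  // 9
    }
}
```

Applied to the sample copybook, these rules give 99 bytes for the fields of AU-CR-INT-DETAIL-REC and 99 bytes for AU-CR-INT-TRAILER-REC, the same as the X(01) plus X(98) record they redefine. A generator can assert that kind of agreement automatically.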

Download

http://www.theeggeadventure.com/2007/copybookParser.zip
