Reading badly structured input data



  • What do you think of the flow control in this code? 

    The input data consists of a header record, then a bunch of sets of records; each set consists of an "A" record (that starts with "GMA"), then an optional "B" record, an optional "1" record, an optional "2" record, and an optional "3" record.  Then another "A" record, and so on.  Then a trailer record that starts with "EOF".

    The B and 1, 2, and 3 records logically go with the A record -- the subroutine (whose guts are omitted here) processes the whole group, and uses defaults if the optional records aren't present.

    I want to know if this is the best flow control, or if there's a better way.  This is VB.NET 2002 (or 2005, same syntax).  Variable declarations and the InputReader initialization are omitted, but the vars are all strings.  (Don't worry about no error checking on the CDate; it should always work, since the file has earlier been verified to be a correctly formatted input file.)  Thanks.

    <FONT color=#008000 size=2>' To detect the first set of records</FONT>

    LineA = "First"

    <FONT size=2></FONT><FONT color=#008000 size=2>' Read and display header line, and get file date

    </FONT><FONT size=2>

    HeaderLine = InputReader.ReadLine

    Console.WriteLine("File Header: " & Left(HeaderLine, 120))

    </FONT><FONT color=#0000ff size=2>Dim</FONT><FONT size=2> FileDate </FONT><FONT color=#0000ff size=2>As</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>Date</FONT><FONT size=2> = </FONT><FONT color=#0000ff size=2>CDate</FONT><FONT size=2>(Mid(HeaderLine, 47, 10))

    </FONT><FONT color=#008000 size=2>' The NextLine var starts as the header, and is updated in the loop

    </FONT><FONT size=2>

    NextLine = HeaderLine

    </FONT><FONT color=#0000ff size=2>Do</FONT><FONT size=2> </FONT><FONT color=#008000 size=2>' Until we process the last set of records and then see an EOF in NextLine.

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>    ' On an A record or the EOF record, process the previous set of records.

    </FONT><FONT size=2>

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "GMA" </FONT><FONT color=#0000ff size=2>Or</FONT><FONT size=2> Left(NextLine, 3) = "EOF" </FONT><FONT color=#0000ff size=2>Then

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>        ' See if it's the very first A record

    </FONT><FONT size=2>

    </FONT><FONT color=#0000ff size=2>        If</FONT><FONT size=2> LineA <> "First" </FONT><FONT color=#0000ff size=2>Then

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>            ' It's not the first A record, so we have a set of records to process.

    </FONT><FONT size=2>

    </FONT><FONT color=#0000ff size=2>            If</FONT><FONT size=2> 0 <> ProcessTheData(LineA, LineB, Line1, Line2, Line3, FileDate) </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>GoTo</FONT><FONT size=2> EndSub

    </FONT><FONT color=#008000 size=2>        </FONT><FONT size=2></FONT><FONT color=#0000ff size=2>End</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>If</FONT><FONT size=2> </FONT><FONT color=#008000 size=2>' Not first time through the loop

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>        ' Since NextLine starts with GMA, assign it to LineA, and clear the rest of the variables (whether first time through or not).

    </FONT><FONT size=2>

            LineA = NextLine

            LineB = ""

            Line1 = ""

            Line2 = ""

            Line3 = ""

    </FONT><FONT color=#0000ff size=2>        End</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>If</FONT><FONT size=2> </FONT><FONT color=#008000 size=2>' NextLine starts with GMA or EOF

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>    ' Not an A record -- assign the line to the correct variable. When we hit the next A record, or the EOF record, we'll call the subroutine.   (The author could have used a Select Case here; that's not critical.)

    </FONT><FONT size=2>

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "GMB" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> LineB = NextLine

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "GM1" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> Line1 = NextLine

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "GM2" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> Line2 = NextLine

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "GM3" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> Line3 = NextLine

    </FONT><FONT color=#008000 size=2>    ' See if we're done

    </FONT><FONT size=2>

    </FONT><FONT color=#0000ff size=2>    If</FONT><FONT size=2> Left(NextLine, 3) = "EOF" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>Exit</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>Do

    </FONT><FONT size=2>

    </FONT><FONT color=#008000 size=2>    ' Read the next line from the file</FONT>

    <FONT size=2>    NextLine = InputReader.ReadLine

    </FONT><FONT color=#0000ff size=2>    Loop</FONT><FONT size=2> </FONT><FONT color=#008000 size=2>' Until we exit at the EOF</FONT>

    <FONT color=#008000><FONT color=#008000>

    <FONT size=2>    NextLine = InputReader.ReadLine</FONT>

    <FONT size=2>' The rest of the program...

    </FONT></FONT></FONT>


  • @DWalker59 said:

    Don't worry about no error checking ... it should always work ...

    The last words of many a programmer.



  • This looks like a legacy app.  It appears to be designed for the IBM 407 Accounting Machine.  (Maybe you can find the original plugboard and take the logic from it.)

    I used to have to write programs like this in COBOL.  (They were designed by promoted 407 programmers who never quite made the transition from accounting machines to computers.)

    You don't have to worry about cards in the wrong order.  The operator runs them through the card sorter first and then puts on the header and EOF cards.

    You also don't have to worry about an empty data set.  If there are no cards to process, the operator doesn't run the job.

    I'd use "Else If" (or whatever it is in VB) for some of those If's, because once you know it's "GMB" you don't have to check if it's "GM1".  Also you could initialize LineA to "" and test it against "", or use a separate flag to detect the first time through.

    If the header line is not one of the data lines, you don't have to send it through the loop, so instead of "NextLine = HeaderLine" I'd say "NextLine = InputReader.ReadLine".  Better yet, take it out and move "NextLine = InputReaderReadLine" from the bottom to the top of the loop.  Then, after the End If just after ProcessTheData(), test for "EOF" and exit the loop at that point.  (Just above the comment "Since next line starts with GMA".)

    Other than that, I'd leave it like this.  It will work.  You'll probably get submissions that run faster, like looking at only the third character of the record which has to be "A", "B", "1", "2", "3" or "F" (until someone invents a "GMF" record and breaks it), but I say if the design looks like a COBOL program from 1960, then the implementation should look like a COBOL program from 1960.  That way, whoever designed the data will be able to read your code.

    And by the way, if it were 1960, it would look something like this: 

    01 HEADER-LINE.
       03 FILLER PICTURE X(46).
       03 FILE-DATE PICTURE X(10).
       03 FILLER PICTURE X(64).

    01 CONSOLE-LINE.
       03 FILLER PICTURE X(13) VALUE "FILE HEADER: ".
       03 LINE-DATA PICTURE X(120).

    01 INPUT-LINE.
       03  CARD-CODE PICTURE XXX.
           88 END-OF-FILE VALUE IS "EOF".
           88 CARD-A VALUE IS "GMA".
           88 CARD-B VALUE IS "GMB".
           88 CARD-1 VALUE IS "GM1".
           88 CARD-2 VALUE IS "GM2".
           88 CARD-3 VALUE IS "GM3".
       03  FILLER PICTURE X(117).


    77 LINE-A PICTURE X(120).
    77 LINE-B PICTURE X(120).
    77 LINE-1 PICTURE X(120).
    77 LINE-2 PICTURE X(120).
    77 LINE-3 PICTURE X(120).

    77 FLAG PICTURE X(5) VALUE "FIRST".
        88 FIRST-TIME VALUE IS "FIRST".


    START-UP.
        OPEN INPUT FILE INFILE.
        READ INFILE INTO HEADER-LINE.
        MOVE HEADER-LINE TO LINE-DATA.
        DISPLAY CONSOLE-LINE UPON CONSOLE.
     

     

    READ-LINE.
        READ INFILE INTO INPUT-LINE.

    TIME-TO-PROCESS.
        IF NOT END-OF-FILE AND NOT CARD-A GO TO MOVE-DATA.
        IF FIRST-TIME
            MOVE SPACES TO FLAG
        ELSE
            PERFORM PROCESS-THE-DATA USING LINE-A LINE-B LINE-1 LINE-2 LINE-3.
        IF END-OF-FILE GO TO ALL-DONE.
        MOVE SPACES TO LINE-A.
        MOVE SPACES TO LINE-B.
        MOVE SPACES TO LINE-1.
        MOVE SPACES TO LINE-2.
        MOVE SPACES TO LINE-3.

    MOVE-DATA.
        IF CARD-A
            MOVE INPUT-LINE TO LINE-A
            GO TO READ-LINE.
        IF CARD-B
            MOVE INPUT-LINE TO LINE-B
            GO TO READ-LINE.
        IF CARD-1
            MOVE INPUT-LINE TO LINE-1
            GO TO READ-LINE.
        IF CARD-2
            MOVE INPUT-LINE TO LINE-2
            GO TO READ-LINE.
        IF CARD-3
            MOVE INPUT-LINE TO LINE-3.
        GO TO READ-LINE.

    ALL-DONE.
        CLOSE INFILE.
        STOP RUN.

       
               

     



  • I'm gunna agree with newfweiler here. It does look like a legacy app...

    The flow control is jumping out of blocks... Does ProcessTheData() return 0 if it fails and 1 if it doesn't? the if statement can be simplified to exclude "0 <>"

    The do loop can be write as:

    While (NOT Left(NextLine, 3) = "EOF") ... End While

    And if that's the case you won't need the "<FONT color=#0000ff>If</FONT><FONT size=2> Left(NextLine, 3) = "EOF" </FONT><FONT color=#0000ff size=2>Then</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>Exit</FONT><FONT size=2> </FONT><FONT color=#0000ff size=2>Do"</FONT>

    let's see here... I have a really hard time with app logic that has a different "start" state than "run" state...

    I usually would form applications that do that like this: Psuedo-do:

    //do header logic

    //lineA = "first" , etc.

    // Perform all processing on the first line

    while("EOF" <> left(nextline,3))

         //do other record logic

         nextline = inputreader.readline

    End While

    //whatever else.

     

    Usually how i do it, especially if there's some special case first... it cuts down on the branching and conditionals inside the loop, which should speed things up a tiny bit.

    I'm not sure why you'd need to exit in the middle if you just switch around the order of reads and calls. I dunno, maybe i didn't help at all :-)

    As a quick aside, the whole start/run state thing is the fault of my C++ instructor, that said that if you're going to loop, all information needs to be in the loop, with none outside. that's fine for simple data loops, but when you have a special start and end case, sometimes it's better to drop those out of the loop for clarity, especially if the data is handled differently (which it looks like it is for the start case here).

     

     

     



  • @newfweiler said:

    I'd use "Else If" (or whatever it is in VB) for some of those If's, because once you know it's "GMB" you don't have to check if it's "GM1".  Also you could initialize LineA to "" and test it against "", or use a separate flag to detect the first time through.

    In fact, you could opt for a Case Select, since all you're doing is checking for a value x or y or z.



  • As for the comment about "it will always work", if the CDate fails, I would just as soon see the native error message as opposed to a custom error message.  A custom error message won't give me any more information than the native error message here; that's why it is not tested and trapped.  If it fails, it's because the whole record, or the whole data file, is trashed, and there's nothing for the code to do except give the native error message, quit, and I'll make a phone call to ask for the data file to be re-sent.  This has happened once in 7 years.

    Yes, the data comes from a legacy system.  Yes, I could have (and probably will) change the sequential IFs to a Select Case statement.  But overall, is there any high-level, better way to structure this?

    Thanks.

    David Walker



  • GeneWitch:  With the WHILE statement you suggest, it would not process the last batch of records -- it would jump out of the loop as soon as NextLine started with an EOF... and the last set of records to process are stranded in the variables.

    I understand your "usually" code, but in this screwy set of data, I can't process LineA by itself -- if there's a LineB, LineC, etc, I need to gather those values and process the whole bunch at once.  If there's not a LineB, LineC, THEN I can process LineA and default the values that would have been in LineB and LineC.   But I don't know that there's not a LineC for this set of records until I see another LineA.

    Thanks.



  • Thanks, newfweiler.  You probably hit it about right.  I just wanted to make sure I wasn't missing something obvious. 



  • @DWalker59 said:

    Thanks, newfweiler.  You probably hit it about right.  I just wanted to make sure I wasn't missing something obvious. 

    Yup.  Since the data was designed for a particular (although obsolete) coding style, you might as well handle it in that style.

     


Log in to reply