QA for Fortran written in C++

seebs

So, there was this place. And I worked there. And I was in the QA department, although I did get to read the code.

It was a bit exciting.

The place did jet engine test cell software; that is to say, software for accumulating real-time sensor data and applying the insanely complex formulae provided by a jet engine vendor to determine whether a jet engine had been properly repaired. (It is important to know that each test was hugely expensive, possibly a five-figure sum, so it was fairly catastrophic for this software to crash or otherwise abort one.)

Well, the software had all originally been written in Fortran; by the time I showed up, everything was being redone in C++, and most of it already had been.

I lost count of WTFs while I was there; indeed, it wasn't until years later that I was quite sure of whether they were crazy or I was.

Let us start with an example, my own personal favorite: the IS_TRUE macro.

#define IS_TRUE(x) ((x) & 0x100)

This macro was used to determine whether a value coming out of a Fortran function (much of the C++ code still called into old Fortran) was the Fortran value .TRUE. Needless to say, it broke during a compiler upgrade. I still don't think they understood how I figured out that removing the IS_TRUE macro from a conditional test would fix the no-longer-functioning program.
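Here is a minimal sketch of why a bit-mask test like this is fragile. It assumes, hypothetically, that the upgraded compiler returns .TRUE. as the integer 1. The post doesn't say what representation either compiler used, and Fortran compilers have differed here (1 versus all bits set, for example). `is_fortran_true` is a hypothetical replacement that treats any nonzero value as true. That is only safe if the compiler produces 0 and a single canonical true value, so check its documentation.

```cpp
#include <cassert>

// The macro from the story: it tests only bit 8 (0x100) of the value.
#define IS_TRUE(x) ((x) & 0x100)

// Hypothetical replacement. Assumes the Fortran compiler returns 0 for
// .FALSE. and one nonzero value (such as 1 or -1) for .TRUE.
inline bool is_fortran_true(int logical_value) {
    return logical_value != 0;
}
```

If .TRUE. comes back as 1, `IS_TRUE(1)` evaluates to 0 (false) while `is_fortran_true(1)` is true. If a compiler stored .TRUE. as all bits set (-1 in two's complement), `IS_TRUE(-1)` would be nonzero. That is one way a macro like this could work purely by accident.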

Well, let's move on. Maybe it was just problems with their compatibility shims.

Calculations were handled through an elaborate language which let you write variable names and simple expressions, in pairs. You could have non-nested if statements. Nested conditionals were impossible because supporting them would take six weeks of work. (See, the value of the most recent conditional was being stored through a pointer. Someone would have to figure out a way to allocate an array of objects, then increment and decrement the pointer. This was, of course, impossible.) Now, the thing about these calculations is that literally thousands of them ran on every update, and the goal was multiple updates a second for real-time monitoring. The language also very thoughtfully provided case-insensitive value names. Well, sort of: parameter names were compared against the list using a case-insensitive compare.
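The "six weeks of work" for nested conditionals amounts to replacing the single stored condition with a stack. You push when entering an `if`, pop at the matching end, and run a statement only if every enclosing condition holds. This is a minimal sketch using a hypothetical `ConditionStack` type. The post doesn't describe the calculation language's actual syntax or internals, and `else` handling is left out for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: tracks nested conditionals with a stack instead of a
// single "most recent condition" value.
class ConditionStack {
public:
    // Entering an if: the new block is active only if the enclosing block
    // is active AND the new condition is true.
    void push(bool condition) {
        frames_.push_back(active() && condition);
    }

    // Leaving an if (the matching end of the block).
    void pop() {
        assert(!frames_.empty() && "end of if without matching if");
        if (!frames_.empty()) {
            frames_.pop_back();
        }
    }

    // Each frame already folds in its parents, so checking the innermost
    // frame is enough. Outside any if, statements always run.
    bool active() const {
        return frames_.empty() || frames_.back();
    }

private:
    std::vector<bool> frames_;  // the "array of objects", grown and shrunk
};
```

Storing the combined result in each frame keeps `active()` constant-time, which matters when thousands of calculations run per update.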

Of course, like every large program, they wrote their own case-insensitive string compare. Rather than relying on some weirdness like tolower(), they went with the simpler expedient of finding the length of each string, allocating space for a copy, copying the letters into it (smashing them to lowercase along the way), comparing the resulting copies, and then freeing them. This was in the early 90s, and I think doing this several thousand times a second may have been a bit ambitious.
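For contrast, a case-insensitive compare needs no allocation at all: it can walk both strings once and lowercase one character at a time. On POSIX systems `strcasecmp()` from `<strings.h>` already does this. The sketch below shows the idea in standard C++, and `ci_compare` is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>

// Compares two NUL-terminated strings case-insensitively without copying
// them. Returns <0, 0, or >0 like strcmp. Uses std::tolower, so results
// depend on the current C locale (plain ASCII in the default "C" locale).
int ci_compare(const char* a, const char* b) {
    for (;; ++a, ++b) {
        // Cast to unsigned char: passing a negative char to tolower is undefined.
        int ca = std::tolower(static_cast<unsigned char>(*a));
        int cb = std::tolower(static_cast<unsigned char>(*b));
        if (ca != cb || ca == '\0') {
            return ca - cb;  // 0 only when both strings ended together
        }
    }
}
```

Each call touches each character at most once and never calls `malloc`/`free`, which is the cost the hand-rolled version paid on every lookup.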

One thing you have to grant them: every effort was made to follow good object-oriented methodology. Members were hidden, exposed only through methods. For instance, if you had a class "Param" with a member "value" of type double, you would have two methods: double GetParamValue(void) and void PutParamValue(double). Well, more than two; rather than rely on C's type promotions, they had a bevy of PutParamValue functions, each of which cast its argument to double. There was only one exception. A parameter could, in some rare cases, hold a generic pointer (void *) in a field called aux. So, in case you didn't like PutParamAux(), as a special favor, PutParamValue(void *) filled that field too.
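This is a rough reconstruction of the accessor pattern described, for illustration only. `GetParamValue`, `PutParamValue`, `PutParamAux`, `value`, and `aux` come from the post. The exact overload set and the `GetParamAux` getter are assumptions, and the modern syntax (`nullptr`, member initializers) is not what a 90s codebase would have used. Note how the `void*` overload silently writes a different member than every other `PutParamValue`.

```cpp
#include <cassert>

class Param {
public:
    double GetParamValue() const { return value; }
    void* GetParamAux() const { return aux; }  // hypothetical getter

    // One overload per numeric type, each casting to double instead of
    // relying on implicit conversions. Which types existed is a guess.
    void PutParamValue(double v) { value = v; }
    void PutParamValue(float v) { value = static_cast<double>(v); }
    void PutParamValue(int v) { value = static_cast<double>(v); }
    void PutParamValue(long v) { value = static_cast<double>(v); }

    void PutParamAux(void* p) { aux = p; }
    // The "special favor": same name as the setters above, but it writes
    // aux and leaves value untouched.
    void PutParamValue(void* p) { aux = p; }

private:
    double value = 0.0;
    void* aux = nullptr;
};
```

Because any object pointer converts implicitly to `void*`, a call like `p.PutParamValue(&x)` compiles without complaint and changes `aux`, not `value`.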

The List class was inevitable. Of course, there was no STL; this was right around the time when templates were first being experimented with. Still, I do question the wisdom of a virtual void Draw() routine that was defined for only one List in the entire program and could only be called under very specific circumstances. There were no other virtual functions, so every List instance in the whole program paid the virtual table pointer overhead for that one routine.
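A quick way to see the cost being described: adding even one virtual function to a class typically adds a hidden vtable pointer to every object. The C++ standard doesn't mandate vtables, so exact sizes are implementation-specific. On mainstream compilers the virtual version is larger by about `sizeof(void*)`. `PlainList`, `DrawableList`, and `Node` are hypothetical stand-ins for the post's List class.

```cpp
#include <cassert>
#include <cstddef>

struct Node;  // list node; its contents don't matter here

// Without virtual functions: just the data members.
struct PlainList {
    Node* head = nullptr;
    std::size_t count = 0;
};

// Same data plus a single virtual function, mirroring the story's lone
// virtual Draw(). Common implementations store a hidden vtable pointer
// in every instance to support dynamic dispatch.
struct DrawableList {
    Node* head = nullptr;
    std::size_t count = 0;
    virtual void Draw() {}
};

// Per-object overhead of that one virtual function, in bytes
// (implementation-specific; typically sizeof(void*)).
constexpr std::size_t kVirtualOverhead = sizeof(DrawableList) - sizeof(PlainList);
```

On a typical 64-bit build this is 8 extra bytes per list, multiplied by every List in the program, all to support one function that was only ever called in one place.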

But that's all technical details. The key question was the corporate environment. Remember what I said about the expense of crashes? Well, the thing is, if a crash didn't occur while following the book precisely, it didn't count. For instance, buttons had callbacks. The interface spawned multiple threads; each callback would run independently. This meant that you had to be very careful to make sure your callbacks were thread safe. Here's how you make a callback thread safe:

void
callback(mumble mumble) {
	static int already_running = 0;
	if (already_running)
		return;
	already_running = 1;
	/* do stuff */
	already_running = 0;
}

The astute reader may suspect that this did not always work. Worse, in the development environment used (not sure what it was, but it came with HP-UX), buttons didn't depress until the UI thread noticed them, and then the events would show up. So, if you clicked a button, and nothing happened, and you clicked again, both events would execute pretty close to simultaneously. One bug report on this got posted prominently in a developer's cubicle, with the notation "QA hard at work!!" -- needless to say, that one wasn't fixed.
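The static-flag guard fails because the check and the set are separate steps. Two threads can both read `already_running` as 0 before either writes 1. (In modern C++ an unsynchronized shared `int` is also a data race outright.) Here is a minimal sketch of a guard that does the test-and-set in one atomic step with `std::atomic<bool>::exchange`. Like the original, it silently drops an overlapping click rather than queueing it, which may not be what a user expects. `callback` and `work_done` are hypothetical stand-ins for the post's `mumble mumble` and `/* do stuff */`.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counter standing in for "do stuff".
static int work_done = 0;

// Returns true if this call did the work, false if another call was
// already running and this one was skipped.
bool callback() {
    static std::atomic<bool> already_running{false};
    // exchange() stores true and returns the previous value in a single
    // atomic operation, so only one caller at a time can see "false".
    if (already_running.exchange(true)) {
        return false;
    }
    ++work_done;  // do stuff
    already_running.store(false);
    return true;
}
```

If the work can throw, resetting the flag from a destructor (RAII) keeps one exception from disabling the button for good. A `std::mutex` with `try_lock()` is an equivalent alternative.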

The key defense was to simply respond "NAP" -- "not a problem". So, for instance, when we submitted a report saying "Configuring a display with this gauge type causes crashes", the response was "NAP". When QA resubmitted, saying "$LARGE_CUSTOMER, who pays us untold amounts of cash, is complaining", the response was "The gauge type in question has been removed from the 6.0 release."

Finally, we get to error checking. The set of error checks performed was { }. Now, the null set is technically a set... But the problem is, the program that generated configuration files for runtime probably should have had some error checking. With huge log files and the tiny disks available at the time, what tended to happen was this: the log files filled the disk, and the next time you ran the program that generated the configuration and layout files, it systematically tried to save all of them to disk. It re-opened each existing file, truncating it, failed to write any data, and repeated this until it ran out of files. Sometimes a file would get a few bytes written. (The log files could exceed the non-root-user capacity of the disk, see...)
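The failure described (truncate the existing file, fail to write, move on) is what unchecked writes produce. A common remedy is to write the new contents to a temporary file and check every write, the flush, and the close (buffered write errors often only surface there). Only then do you rename the temporary file over the original. This is a minimal sketch using a hypothetical `save_config` helper and an assumed `.tmp` naming scheme. On POSIX file systems `rename()` replaces an existing target atomically, so a full disk leaves the old file intact. The C standard leaves renaming onto an existing file implementation-defined, and durability across a power loss would also need an `fsync`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns false, leaving any existing file at `path`
// untouched, if any step fails (for example because the disk is full).
bool save_config(const std::string& path, const std::string& contents) {
    const std::string tmp = path + ".tmp";  // assumed naming scheme
    std::FILE* f = std::fopen(tmp.c_str(), "wb");
    if (f == nullptr) {
        return false;
    }
    bool ok = std::fwrite(contents.data(), 1, contents.size(), f) == contents.size();
    // fflush pushes buffered data out, so a full disk is reported here
    // instead of being silently lost.
    ok = (std::fflush(f) == 0) && ok;
    // fclose can fail too; its result must be checked as well.
    ok = (std::fclose(f) == 0) && ok;
    if (!ok || std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());  // clean up the partial temp file
        return false;
    }
    return true;
}
```

A caller that checks the return value can stop after the first failure, rather than marching on to truncate every remaining file.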

All in all, a wonderful environment. I sorta wish they'd given me a copy of the code as a going-away present so I could share more details, but those above are the ones that have stayed with me lo these many years.

deathkrush

@seebs said:

The key defense was to simply respond "NAP" -- "not a problem". So, for instance, when we submitted a report saying "Configuring a display with this gauge type causes crashes", the response was "NAP".

In other words, it works as coded!