Testing code - bit by bit



  • Last week a colleague of mine was doing some validation work on an industrial system where I had refactored some code.


    The gist of the system is that various fault conditions each set a bit in a large bit field. The code scans the bit field (using a standard built-in function) and extracts the address of the first bit that has been set, which is designated as the "current fault". The code then uses the address of the current fault as an index into two arrays: the indexed element of the first array gets incremented each time a new current fault is detected, while the indexed element of the second array gets incremented once per second. Thus the first array logs the number of instances of each fault, while the second array logs the total time spent in each fault condition (roughly the logic sketched at the end of this post).


    At first my colleague was confused by the code, as he couldn't find the explicit increments for each possible fault condition, whereupon I had to point out the use of indirect addressing - for at least the 100th time.


    Later on I looked over his shoulder as he was proving that each fault condition acted on its associated locations in each array. Rather than looking at the code and understanding the use of indirect addressing in setting the values in the arrays, he was manually setting each bit in turn and noting whether or not the array values were changed. Of course the setting of any bit in the bit field would result in the associated array values being modified, but he was all set to manually test the 250+ possible conditions.


    When I saw what he was doing I kept my mouth shut and quietly walked away as I could see that my colleague was happy with his testing solution.
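
    For anyone curious, here is roughly what the refactored logic does. This is a hedged C-style sketch rather than the actual ladder logic - the names (fault_bits, fault_counts, fault_seconds) and the scan/tick flags are made up for illustration:

        #include <stdint.h>

        #define NUM_FAULTS 256                        /* the real system has 250+ fault bits */

        static uint32_t fault_bits[NUM_FAULTS / 32];  /* large bit field of fault flags */
        static uint32_t fault_counts[NUM_FAULTS];     /* instances of each fault        */
        static uint32_t fault_seconds[NUM_FAULTS];    /* seconds spent in each fault    */

        /* Return the index of the first set bit, or -1 if no fault is active.
           The PLC uses a built-in bit-scan function; this loop stands in for it. */
        static int first_set_bit(void)
        {
            for (int i = 0; i < NUM_FAULTS; i++)
                if (fault_bits[i / 32] & (1u << (i % 32)))
                    return i;
            return -1;
        }

        /* Called once per scan; new_fault is true on the scan where a fault first
           appears, one_second_tick is true once per second. */
        void log_current_fault(int new_fault, int one_second_tick)
        {
            int current = first_set_bit();
            if (current < 0)
                return;
            if (new_fault)
                fault_counts[current]++;   /* indirect addressing: one increment ...   */
            if (one_second_tick)
                fault_seconds[current]++;  /* ... covers every fault bit via the index */
        }

    The point being that there is a single pair of increments, indexed by whatever the bit scan returns, rather than one explicit increment per fault condition.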



  • The joke will be on you when your colleague disproves binary arithmetic



  • Some of us like to make sure our testing code doesn't use the same algorithm as the code being validated - otherwise, the test only shows that the algorithm behaves consistently.  Admittedly, I suspect there may have been a better way to do it - and if there wasn't better code to do the validation, at *least* one could write a program to write the test code, and then simply review the generated test code to ensure correctness...
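
    As a rough illustration of what I mean, a test like that can enumerate the expected result for each bit independently instead of re-using the production code's indirect addressing. A minimal C sketch, where set_fault_bit(), clear_all_faults(), run_one_scan() and read_count() are hypothetical hooks for whatever the test environment actually exposes:

        #include <stdio.h>

        #define NUM_FAULTS 256

        /* Assumed test-harness hooks - not real PLC APIs. */
        extern void clear_all_faults(void);
        extern void set_fault_bit(int bit);
        extern void run_one_scan(int new_fault, int one_second_tick);
        extern unsigned read_count(int index);

        /* For each fault bit: set only that bit, trigger one "new fault" scan,
           then verify that exactly the counter at the same index changed. */
        int check_fault_counters(void)
        {
            unsigned before[NUM_FAULTS];
            int failures = 0;

            for (int bit = 0; bit < NUM_FAULTS; bit++) {
                clear_all_faults();
                for (int i = 0; i < NUM_FAULTS; i++)
                    before[i] = read_count(i);

                set_fault_bit(bit);
                run_one_scan(1, 0);              /* new fault, no one-second tick */

                for (int i = 0; i < NUM_FAULTS; i++) {
                    unsigned expected = before[i] + (i == bit ? 1u : 0u);
                    if (read_count(i) != expected) {
                        printf("bit %d: counter %d is %u, expected %u\n",
                               bit, i, read_count(i), expected);
                        failures++;
                    }
                }
            }
            return failures;
        }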



  •  I don't get the problem. Isn't it better to prove that the code works the way you expected it to, rather than assuming it does what it's supposed to? Sure, what he did could have been done quicker, but after he's finished he has proven that there are no quirks or gotchas whatsoever. Compilers can generate some strange code, especially when you change the various optimisation settings. For this reason, after our internal "code review" we then have to prove it does what we think it does. I'd say the job time ends up being about 25% coding and 75% testing. As a bonus however, we have never had any released code do something we didn't expect it to do. It can get a little tedious at times retesting the same functions over and over just because the module itself has been recompiled, but it's a necessary evil.



  • The issue is that this is not code running on some generic x86, compiled from an arbitrary high level language where you have a lot of choice. This is ladder logic code running in a PLC. First of all, there is typically no "test" machine where you can build a special version of the program; you effectively work on the production machine using the production code.


    Secondly, ladder logic was designed to keep electricians happy by mimicking the physical layout of electrical relays. The programming environment is more akin to point-and-click connecting of predefined blocks, a la LabVIEW, than it is to writing high level code. So once the correct algorithm is defined and verified, it is going to work the same for all inputs - and the only possible errors will be at edge cases. The only thing that repeated testing can do is prove over and over again that the built-in library functions work for all cases.

    In this case that amounts to testing over and over again that a[i] + 1 = a[i] + 1



  • @OzPeter said:

    In this case that amounts to testing over and over again that a[i] + 1 = a[i] + 1

    If he had written his test code using the same indirect addressing that you used, this would be true.  He didn't do that.

    Given what you've said in this latest response, it's possible that the environment didn't give him any option to write the test in a quicker manner, without using the indirect addressing you used.  If that is the case, TRWTF is your code environment.



  • @tgape said:

    Given what you've said in this latest response, it's possible that the environment didn't give him any option to write the test in a quicker manner, without using the indirect addressing you used.  If that is the case, TRWTF is your code environment.

    There was no test code. The test method consisted of manually setting the bits in the PLC which was running the production code and then noting whether the arrays were affected. This is SOP for a PLC type system. That may be a WTF to you, but in my mind stepping through all 200+ bits is a bit retarded.



  • @OzPeter said:

    @tgape said:
    Given what you've said in this latest response, it's possible that the environment didn't give him any option to write the test in a quicker manner, without using the indirect addressing you used.  If that is the case, TRWTF is your code environment.

    There was no test code. The test method consisted of manually setting the bits in the PLC which was running the production code and then noting whether the arrays were affected. This is SOP for a PLC type system. That may be a WTF to you, but in my mind stepping through all 200+ bits is a bit retarded.

    Not nearly as retarded as having a system where that's the only testing mechanism available.

    The few people I know who do embedded systems that don't have test capabilities available due to system size have emulators, which allow them to provide programmatic input/output, so they can at least test them in an emulated environment.  I don't know how feasible that would be, but it would at least be something.



  • @tgape said:

    Not nearly as retarded as having a system where that's the only testing mechanism available.

    It's kinda expensive to replicate a manufacturing process just for the sake of testing code changes every so often. In general a lot of industrial processes are tested on the fly, on live manufacturing systems. The systems are designed for online code changes and are pretty well crash-proof - however that doesn't protect you from logic errors!


    If the changes are major enough, then you can justify taking down the manufacturing process for testing (although the production managers won't like you upsetting their quotas), but you are still working on the production systems.



  • @OzPeter said:

    It's kinda expensive to replicate a manufacturing process just for the sake of testing code changes every so often. In general a lot of industrial processes are tested on the fly, on live manufacturing systems. The systems are designed for online code changes and are pretty well crash-proof - however that doesn't protect you from logic errors!

    If the changes are major enough, then you can justify taking down the manufacturing process for testing (although the production managers won't like you upsetting their quotas), but you are still working on the production systems.

     

    That sounds stupid to me. Why not use an emulator first and only deploy changes on the live systems once they work and are tested in the emulator?



  • @dtech said:

    That sounds stupid to me. Why not use an emulator first and only deploy changes on the live systems once they work and are tested in the emulator?

    First of all there is no such animal as an emulator for these PLC systems. Think MS lock-in that includes hardware and software (oops .. maybe that's Apple), plus you can't control the firmware on which the runtime libraries are based - so you have to use the manufacturer's dedicated systems to run your programs. This does have the advantage that PLC systems tend to be rock solid in terms of uptime. There is nothing a user can accidentally do to screw around with the equivalent of the OS.


    Second, at the plant I was at last week there are 11 machines producing the same part. But all 11 machines differ in some way (mechanical, electrical or software), and that can change from time to time depending on maintenance and other projects that are going on. So creating an accurate emulator for a machine would be near impossible - a major project in itself.


    Third, assuming an emulator was desired, it would easily consume between $50,000 and $100,000 in pure hardware costs. Not something that any plant manager wants sitting on his books. Repeat that for each type of processing line in the plant - and this plant had at least 4 separate sections using different classes of hardware.


    So working on the production machines is a given when it comes to maintenance work on PLCs. In a greenfield project you can have the luxury of basic simulation of the hardware environment - you bench test the actual hardware that will be installed in the plant, so there is minimal extra hardware cost. However the simulations are only ever 1st or 2nd order approximations at best, so the real debugging will always occur in the field during commissioning.


    I know that this seems like the antithesis of modern commercial software development - but there you can encapsulate the complete software environment, and the code can be deployed on generic hardware for testing. Embedded design is slightly different, but there the deployment will occur in large numbers of systems in which it is hard to make a change after the fact - so investment in emulation technology makes sense. PLC coding for industrial processes is bespoke, typically only deployed to a small number of systems and fairly easy to change after the fact (although I can also cite exceptions).


    Just to scare you some more, the concept of PLC program version control is only just starting to enter the field of play, and it is actually not an easy sell. And one of the products I work with is actually layered over the top of VSS!



  • @OzPeter said:

    First of all there is no such animal as an emulator for these PLC systems.

    <snip>
    @OzPeter said:
    Second, at the plant I was at last week there are 11 machines producing the same part. But all 11 machines differ in some way (mechanical, electrical or software), and that can change from time to time depending on maintenance and other projects that are going on. So creating an accurate emulator for a machine would be near impossible - a major project in itself.


    Third, assuming an emulator was desired, it would easily consume between $50,000 and $100,000 in pure hardware costs. Not something that any plant manager wants sitting on his books. Repeat that for each type of processing line in the plant - and this plant had at least 4 separate sections using different classes of hardware.

    Wow - what a flash from the past.  Back when I was in college, I worked at a shop that had something similar to this over a couple summer breaks.

    Now, at that shop, there were three classes of machines. Each machine could be configured like any of the other machines in its class as far as the software was concerned, but they were sufficiently different physically that they couldn't necessarily do each other's actual jobs.  For example, one of them might be twice the width of another - allowing it to make product twice as wide, with the disadvantage of having control problems making narrower product (so they couldn't all be wide machines).

    What my boss did (long before I got there) was to salvage the control circuits from one of the bigger (and vastly more complicated) machines that had managed to reach effective end of life (aka totaled in a mishap).  He then built a small proof-of-concept test machine out of cheap materials.  The end construction was probably over 50% wood, and at least half of the rest of the parts were refurbished parts deemed irreparably damaged by the machine vendor.

    What it lacked in visual appeal, it made up for in savings.  The fact that they could actually test new programming with reject material and find problems with the software repaid the cost of the machine enough times over that his management allowed him to actually *buy* a tiny machine of the same class as the other complex machines.  I came on board a few months before they were able to buy a real machine replacement for the original test machine (it happened while I was at college, however.)

    When downtime costs the plant tens of thousands of dollars per hour, it doesn't take a lot of hours of downtime to justify spending $50k on a test machine.

    Of course, the trick in doing this is to figure out how to present to management the savings actually realized by the test equipment.  Since it's downtime that doesn't happen, most management have a difficult time understanding it - and how one needs to convey it varies between managers.

    In this case, part of what convinced management to buy the second real test machine (and, in fact, build an addition onto the building to house it, as they were out of space) was that the first tiny machine did, on rare occasion, actually run some production material.  Since it was production-quality hardware, it really *could* do the job; it just wasn't big enough to be a normal production machine.  However, every so often, some of their larger customers would want a custom, small order of something - below their normal minimum order size, but since it was a customer that represented in excess of 10% of their business, they didn't want to turn them away.  Having that test equipment let them take those orders without much extra expense.  Having production-quality hardware also allowed them to investigate some process issues on the test machine - probably a bigger bucket of savings, because they were much more common, but it didn't get management attention quite as much as being able to say, "We'll get that undersized order shipped this afternoon - and we don't need to interrupt a production run to do it."

    So, on that basis, I still find it a WTF you have no test capability.



  • @tgape said:

    So, on that basis, I still find it a WTF you have no test capability.

    Your boss's solution required all of the following:

    1. End of life/scrapped machine to provide basic setup for free
    2. A production process that would suit cheap materials like wood
    3. Various other parts that were refurbished and provided at no cost
    4. Free time in which to perform all of the labour - yet still do his real job
    5. Undersized plant that required additional capacity to meet overflow orders

    Compare this with

    1. No broken down machinery lying about
    2. Custom machinery producing precise parts that requires a solid platform and cannot simply be handcrafted from odds and ends
    3. No spares inventory from which to draw control system parts/sensors/actuators etc. just for the hell of it
    4. A company that downsized their entire engineering staff leaving only electricians and management - with nothing in between
    5. A plant that is not struggling for capacity
    6. A product that costs around 10 cents to make, but sells for $$

    So there is no business case for what you propose.

    And yes it would be nice to have a plant mockup in order to do testing on, but in all my years of working in Steel, Paper, Aluminium, Pipe and other manufacturing plants I have yet to see such a setup for PLC level controls (*). Including plants where downtime was in the $60,000+ per hour range. So your situation is almost mythical from my experience, no matter how common sense it is. If you are not in a breakdown situation then you plan for working during scheduled downtime, and if your changes aren't up and running before a set time then you probably revert back to the previous version.

    * Where the computing can be abstracted to general purpose hardware, i.e. calculating numerical setups that will be sent to PLC level systems, then yes, I have used simulation systems - but they were C++ programs running under Windows so it was fairly easy to do in comparison.



  • OzPeter... at least your naive colleague was willing to be honestly thorough in his testing. I was reviewing a design late last year. It consisted of a single set of togglable switches that could be passed to a business system to adjust its behavior. There were 32 switches. I said to the designer, "You realize that you are creating two to the 32nd settings?" Without blinking they said, "Yes, it is what we need." I replied, "Do you intend to test all those settings?" They looked disdainfully at me and said, "Yes, QA will do full regression testing."

    I had similar conversations with management and developers. All of them gave the same sort of reply, telling me not to be concerned. Despite my efforts, developers coded the logic, QA approved it, and it was implemented by authority of management.

    I learned two things: (1) These folks can't do basic arithmetic; and (2) they have no clue what "full regression testing" means.

    Perhaps the third thing I learned is that I am clearly not communicating with these folks! :-(
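
    For scale, here is the arithmetic those folks were waving away, as a back-of-the-envelope C sketch (the one-second-per-case figure is purely an assumption):

        #include <stdio.h>

        int main(void)
        {
            unsigned long long combos = 1ULL << 32;  /* 32 independent on/off switches */
            double seconds_per_case = 1.0;           /* assumed: one second per test   */
            double years = combos * seconds_per_case / (60.0 * 60 * 24 * 365);

            printf("combinations: %llu\n", combos);  /* 4,294,967,296 */
            printf("at %.0f s per case: about %.0f years\n", seconds_per_case, years);
            return 0;
        }

    Even at one test per second, exercising every one of the 2^32 settings works out to roughly 136 years of "full regression testing".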



  • @OzPeter said:

    1. A company that downsized their entire engineering staff leaving only electricians and management - with nothing in between

    Enough said.  <facepalm>



  • @DocTrinsOGrace said:

    Perhaps the third thing I learned is that I am clearly not communicating with these folks! :-(

    I've had encounters with people like you describe.  Specifically, I've had long discussions with them after finding out their comprehension issue.

    After about an hour of discussion, which included several demonstrations of how features can interact with each other in unexpected ways, one out of five of them had a moment of comprehension, in which his face quickly went from confused to amazed to horrified at the implications.  I'm fairly confident that this was not at all a representative sample.



  • @tgape said:

    ...one out of five...

    But even fewer people will, right off the bat, understand the magnitude when you say "32 binary switches." In relating this story in classes, I've tried to augment the explanation by including the phrase "over 4 billion combinations." But I become more discouraged at how few immediately grasp the chronological magnitude.

    Sheesh... 36 years as an engineer, and I'm more amazed all the time that we humans have succeeded with so small a thing as placing one rock on top of another!



  •  Try explaining it in steps. "With 10 switches, there are 1024 possible states. Add another switch and you double the number of states, so for 11 switches there are 2048 states (perhaps also mention 15 ~ 32768). For 20 switches there are a little over 1 million states (1M is in my experience about the upper limit of what people can imagine). Now for 30 switches, there are over one THOUSAND MILLION states!"

    One thousand million sounds way more impressive than one billion to most people.
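
    If anyone wants the exact numbers for that walkthrough, a trivial C loop generates them (nothing here is specific to the system being discussed):

        #include <stdio.h>

        int main(void)
        {
            /* The number of distinct on/off states doubles with each switch added. */
            for (int switches = 1; switches <= 32; switches++)
                printf("%2d switches -> %llu states\n", switches, 1ULL << switches);
            return 0;
        }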



  • @Ilya Ehrenburg said:

    Try explaining it in steps. "With 10 switches, there are 1024 possible states.

    Personally, I've had better success starting even smaller.  "With two switches, there are 4 possible states - both on, both off, the first on, second off, and the first off with the second on.  Add another switch, and you double the number of states."  It takes a bit longer, but it seems to work better to me.

    I also find it tends to work to throw in some anecdotes about situations where certain combinations have failed - that's what takes the most time, however.  In the hour-long discussion I mentioned above, I tried doing that at the four-switch stage.  It may be quicker to do it at the three-switch stage.


  • Garbage Person

    @Mole said:

    we have never had any released code do something we didn't expect it to do

    Am I the only person who views this sentence with extreme suspicion? Or perhaps I'm the only person who considers component interaction bugs "something we didn't expect it to do".



  • Weng: The only code that I have seen do exactly what I expected it to do was code that wasn't running anywhere. I think I remember once finding ten bugs in nine lines of code. :-)

    With regard to the particular statement you were commenting on, we do well to keep in mind that our use of human language tends to be buggy too.

