(:warning: 2013) Toyota Unintended Acceleration and the Big Bowl of “Spaghetti” Code



  • Toyota Unintended Acceleration and the Big Bowl of “Spaghetti” Code
    Posted on Thursday, Nov 7th, 2013

    Last month, Toyota hastily settled an Unintended Acceleration lawsuit – hours after an Oklahoma jury determined that the automaker acted with “reckless disregard,” and delivered a $3 million verdict to the plaintiffs – but before the jury could determine punitive damages.

    What did the jury hear that constituted such a gross neglect of Toyota’s due care obligations? The testimony of two plaintiff’s experts in software design and the design process gives some eye-popping clues. After reviewing Toyota’s software engineering process and the source code for the 2005 Toyota Camry, both concluded that the system was defective and dangerous, riddled with bugs and gaps in its failsafes that led to the root cause of the crash.

    Continued at SafetyResearch.net

    The story reads as quite a clusterfuck of WTFs. It makes me wonder what was going on in the heads of Toyota's engineers as they were designing the software. Did they never anticipate failure?



  • Jeez, it's from 2013. They actually taught this to us in my intro to engineering college class. At this point it's ancient history. But yeah, it's still a :wtf:


  • area_deu

    Added a warning for people who are too lazy to read the OP enough to notice that this is a story from 2013.


  • BINNED

    Other egregious deviations from standard practice were the number of global variables in the system. (A variable is a location in memory that has a number in it. A global variable is any piece of software anywhere in the system can get to that number and read it or write it.) The academic standard is zero. Toyota had more than 10,000 global variables.

    Interesting.

    @LB_ said:

    it's from 2013. They actually taught this to us in my intro to engineering college class.

    You are young. Makes me wonder if I am too old ...



  • @LB_ said:

    Jeez, it's from 2013.

    I kinda missed that. It popped up in my Facebook feed this morning and I thought some people here might enjoy it... :smile:

    They actually taught this to us in my intro to engineering college class. At this point it's ancient history. But yeah, it's still a :wtf:

    Absolutely. I hope Toyota have stepped up their game by now.



  • While we're timepodding about automotive safety failures...

    Recall in 2014 for a problem they were aware of from 2005.



    I remember someone telling me that if your code has an SLA of "4 nines" or above, you're not allowed to declare variables dynamically in procedures or functions. Every variable must be static, to avoid bugs caused by unintended "freeing" of variable data.

    I must be too old to have heard about this.





  • @fbmac said:

    Sadly, I don't believe that anything is critical enough for businesses to care anymore.

    Money. Money has always been far more critical than lives.

    Mind you, if you're saying card processing code is as bad as automotive safety critical code, the world's definitely gone to hell in a handbasket.


  • Discourse touched me in a no-no place

    @cheong said:

    I remember someone told me that if your code has SLA of "4-nines" or above, you're not allowed to declare variables dynamically in procedures or functions.

    That's BS. What you've got to actually do is prove that the code has a bounded memory footprint. Dynamic allocation is fine, provided it doesn't exceed the pool size (which you can determine). It's approximately equivalent to proving that there are no memory leaks and that no data structure grows in an unbounded fashion. Usually trivial stuff.

    Once you've done that, you can pretty much pretend to the no-allocation people that you're not doing allocation: you just grab the memory arena you need at the start of the program and never let it go. You then do your “allocations” out of that. :smile:
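The "grab the arena at the start and never let it go" trick can be sketched in a few lines of C. This is a hypothetical illustration, not code from any real safety-critical system; `ARENA_SIZE`, `arena_alloc`, and the 8-byte alignment choice are all assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE 4096u

static uint8_t arena[ARENA_SIZE];   /* the whole memory budget, claimed up front */
static size_t  arena_used = 0;

/* Bump-pointer "allocation" out of the fixed arena.  Returns NULL once
 * the budget is exhausted, so the bounded-footprint property is trivially
 * checkable: total usage can never exceed ARENA_SIZE. */
void *arena_alloc(size_t n)
{
    size_t aligned = (n + 7u) & ~(size_t)7u;   /* round up to 8-byte alignment */
    if (aligned > ARENA_SIZE - arena_used)
        return NULL;                           /* budget exceeded: fail loudly */
    void *p = &arena[arena_used];
    arena_used += aligned;
    return p;
}
```

To the no-allocation crowd this looks like one big static array, which is exactly the point.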



    Does anybody know whether the authorities, at least somewhere, learned their lesson and put up some certification requirements for automotive software that affects control? Something like the way aviation certification requires satisfying the design assurance levels from DO-178C?



  • @Bulb said:

    Does anybody know whether the authorities, at least somewhere, learned their lesson and put up some certification requirements for automotive software that affects control?

    Safety integrity level (SIL) and the IEC/ISO/others derived from it is about the closest you're going to get.





  • What they should have done is sell all those recalled ignition switches to Toyota, for use as retrofits. That way, if your Toyota's throttle locks on, all you need to do is bump the ignition switch to kill the motor.


  • Discourse touched me in a no-no place

    @fbmac said:

    I'm interested in knowing more about how you usually do it, because I work in a project where my coworkers won't let me malloc.

    Start by proving that the code won't allocate more than some fixed amount of memory. If you can't prove that, your code isn't safe in this sort of situation anyway. You can determine how much space is required experimentally.

    Once you've done that, write your own memory allocator! It's not hard to do (especially if you're unthreaded) as long as you're targeting a specific platform and not worrying about just how to write completely portable code. You need to peek beneath that abstraction in order to create code to get the allocated block metadata from the block's pointer.

    That done, you just initialise your memory allocator to use a chunk of memory allocated at startup, the memory arena. You know how large it has to be, and you can allocate it in a way that is acceptable to local practice (e.g., big honking static array). Since the arena is fixed size and nothing ever goes outside it, you've got your “guaranteed” memory limit. Finish up by renaming your allocation functions so people don't notice you're really allocating and deallocating behind the scenes. ;)

    Congratulations. That shouldn't take you more than a few months, just to work around some idiots' over-strictly-applied rules. You might be able to find an existing allocator library that you can use (I don't know any offhand, but I don't study the area) which will shave a few weeks off the time, but the real costs will be in proving that your memory use is bounded to start out with. Which is the actual thing that you need to do to be safe; you don't get out of doing stuff properly easily.

    Exhaustive testing is highly recommended.
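A minimal version of the "write your own allocator" step might be a fixed-size block pool threaded onto a free list. All names (`pool_init`, `pool_get`, `pool_put`) and sizes here are invented for illustration; a real one would need per-size pools, alignment guarantees, and double-free detection:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE  32u
#define POOL_BLOCKS 64u

typedef union block {
    union block *next;             /* free-list link while the block is unused */
    uint8_t payload[BLOCK_SIZE];   /* caller data while the block is in use */
} block_t;

static block_t pool[POOL_BLOCKS];  /* the fixed arena, sized up front */
static block_t *free_list;

void pool_init(void)
{
    free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        pool[i].next = free_list;  /* push every block onto the free list */
        free_list = &pool[i];
    }
}

void *pool_get(void)               /* "malloc", bounded by POOL_BLOCKS */
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;                      /* NULL when the pool is exhausted */
}

void pool_put(void *p)             /* "free": push the block back */
{
    block_t *b = p;
    b->next = free_list;
    free_list = b;
}
```

Since the pool is a static array and nothing ever goes outside it, the memory limit is exactly `POOL_BLOCKS * sizeof(block_t)`, which is the property the proof needs.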



    Static analysis, mainly. All of which will depend very much on the language you're using. For C, you might find a partial list of useful links here: http://saturn.stanford.edu/pages/relatedindex.html



  • @Original article said:

    The accepted, albeit voluntary, industry coding standards were first set by Motor Industry Software Reliability Association (MISRA)...
    For every 30 rule violations, you can expect on average three minor bugs and one major bug.
    Barr checked the source code against MISRA’s 2004 edition and found 81,514 violations.

    Fuck.
    I know that these "measurements" of code quality are open to debate and abuse, but the figures in these (abridged) quotes are staggering.

    It kind of reads as if Toyota viewed embedded software engineering as being the lowest point of entry for developers, slightly below the folks that are hired to program the display screens in reception.
    Given that concerns were raised in 2007, where are all the whistleblowers at?





  • I honestly don't know. My interest in static analysis is compilers for functional languages, haven't written any C in ages.

    I doubt you're gonna get all of "free, easy to use, and no code changes". You'll probably be lucky to get two of the three.

    That said, you're obviously writing code that's required to be solid, and you're not using any static analysis tools? WTF?



  • @AlexMedia said:

    I hope Toyota have stepped up their game by now.

    A former employer of mine sold general-purpose chips for automotive use, among other uses. I worked on that project but was not directly involved in the FA of this problem.

    A handful of infotainment systems in high-end Toyotas bricked themselves. Months of FA led to two separate problems. I don't think I ever heard what the second one was, but one was that flash could become irrecoverably corrupted if power was interrupted at just the wrong moment, while the firmware was updating non-volatile memory. We were aware of this problem, and the app notes for the chip described a procedure to avoid it. Toyota didn't use the procedure (although TBH, I'm not sure whether the fault lay with Toyota or an intermediary, a company that is a major supplier to all the Japanese automakers) and apparently went out of their way to avoid using it; if they'd used our software to configure the chip, they'd have gotten the procedure automatically.



  • I hope Tesla's autopilot code is more robust...



    Actually, you'll also have to prove there will not be any buffer overrun problem where your instruction pointer (or whatever it's called in the processor family they use) may land on user data.

    With static allocation, everything you need is in main memory/CPU cache, never on the stack/heap. If there's a buffer overrun, the most it can do is corrupt other data. So in no case should your program crash, and the driver will never lose control of their steering wheel and brakes.

    These are the major points of static allocation, or at least that's what I was told.

    Depending on whether they're using a home-brew operating system to run their code, they may not have access to modern facilities that we take for granted nowadays.



    Static and dynamic allocation will be in the same area (especially if you are doing dynamic allocation from a pre-defined static block anyway).

    The buffer overrun hitting the instruction pointer is a specific problem of stack-allocated, i.e. local, buffers.

    Of course some problems can be caused by a buffer overrun in any kind of buffer, so you have to guard against them everywhere anyway too.



  • Hey n00b this article is from 2013 lol!!! :trollface:

    fake edit: Highlight some text. Reply with Quote. Go to the top of the reply and put in a new line (so that [quote is on a new line). Enter an emojicon, using ENTER to autocomplete it.

    Discourse will destroy the new line, and the opening square bracket of the quote block.

    SHOULD I FILE META.DERP?

    @Bulb said:

    Does anybody know whether at the authorities, at least somewhere, learned their lesson and put up some certification requirements for automotive software that affects control?

    If anyone was serious about saving lives in cars, we'd all have self-driving cars by now. :trolleybus:

    @skotl said:

    Given that concerns were raised in 2007, where are all the whistleblowers at?

    They left to go work at a more ethical car company. Volkswagen. :diesel_powered_trolleybus:

    @skotl said:

    I hope Tesla's autopilot code is more robust...

    It detects engine problems, gives a heads up to the driver, then eventually pulls the car safely off the road.



    The point is just to ensure the error handling code can always get triggered as needed and can recover the situation as if there had never been anything wrong... but anyway, yes, it's more important to just guard against buffer overruns everywhere.



  • @Lorne_Kates said:

    Volkswagen. :diesel_powered_trolleybus:

    :trolleybus::dash:



  • Wow.

    In that design I adopted certain basic principles that I
    believe to be as valid today as they were back then. The first
    principle was security: The principle that every syntactically
    incorrect program should be rejected by the compiler and that every
    syntactically correct program should give a result or an error message
    that was predictable and comprehensible in terms of the source language
    program itself. Thus no core dumps should ever be necessary. It was
    logically impossible for any source language program to cause the
    computer to run wild, either at compile time or at run time.

    A consequence of this principle is that every occurrence of every
    subscript of every subscripted variable was on every occasion checked at
    run time against both the upper and the lower declared bounds of the
    array. Many years later we asked our customers whether they wished us to
    provide an option to switch off these checks in the interest of
    efficiency on production runs. Unanimously, they urged us not to—they
    already knew how frequently subscript errors occur on production runs
    where failure to detect them could be disastrous. I note with fear and
    horror that even in 1980, language designers and users have not learned
    this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law.

    -- The Emperor's Old Clothes, C. A. R. Hoare. (Turing Award lecture)

    It's good to see real legal consequences getting handed down for bad software engineering. Decades late, to be sure--we've known since at least the Morris Worm that sloppy software development can have serious real-world consequences--but better late than never.



  • @cheong said:

    With static allocation, everything you need is in main memory/CPU cache, never on the stack/heap.

    That's not a guarantee you can make unless your compiler ensures it's the case, or if you're running on an architecture that separates code and data. In general, statically allocated data can, and does, end up on the stack, and it's often very advantageous for it to do so. In most cases, the only data that should be allocated off-stack is that which escapes its creation scope / stack frame; that's exactly what malloc() is intended for.

    @cheong said:

    If there's a buffer overrun, the most it can do is corrupt other data. So in no case should your program crash, and the driver will never lose control of their steering wheel and brakes.

    That does not hold. Imagine, for example, that my steering wheel and brake system is controlled by a state machine. I have statically allocated the data for that state machine, being careful to ensure it's not on the stack. Unfortunately, I'm a fucking idiot, and have a buffer overrun on a piece of data immediately preceding the state machine data in memory. This gets triggered and zaps the state machine into a nonsensical state. What happens?

    1. My program bugs out / locks / crashes.
    2. Being in a nonsensical state, my program is unable to determine what state should be triggered by a given user input.
    3. I've been careful and it resets itself to some "reasonable" default state, but which does not necessarily represent the correct state.

    All of these could be fatal. None of them are caused by direct or indirect smashing of the IP on the stack, nor could they be particularly resolved by changing the way memory is allocated except by saying "don't allocate this next to that".

    Buffer overruns are a serious issue, but they are orthogonal to memory allocation strategies.



    No. If the buffer overrun does not overwrite code, the error handling (assuming it is written properly, which all coding guidelines assume, because otherwise no guideline can hold) can try to recover the state, e.g. instruct a refresh from the sensors into the state variables.

    Of course, like the Titanic, if there is a huge amount of garbage data filling the data region, the error handling code may not be able to recover the system to a healthy state, but at least the system can set some warning light, trigger the brakes, etc., without crashing hard on the road.

    The bottom line is that if you cannot ensure the code segment will never be overwritten, the system is not to be trusted. No error handling is reliable if the error handling code itself can be overwritten.



  • @cheong said:

    assume it is written properly

    If you can buttume the error handling is written properly, you can also buttume that the rest of the code is written properly. As such there will be no buffer overruns, and there is no need for error handling.

    @cheong said:

    Of course, like Titanic, if there is huge amount of garbage data filling in the data region the error handling code may not be able to recover the system to healthy state, but at least the system can set some warning light, trigger the brake, etc. Without crashing hard on the road.

    Also, this is nonsense. If you have garbage data, you are in an error situation. You do not know what data is in error, and you do not know the source of the erroneous data. You no longer know what the current situation is, and therefore, you do not know what action is safe to carry out.



  • After reviewing Toyota’s software engineering process and the source code for the 2005 Toyota Camry, both concluded that the system was defective and dangerous, riddled with bugs and gaps in its failsafes

    Well, gee. Because nobody here would've seen that coming, right?

    @HardwareGeek said:

    flash could become irrecoverably corrupted if power was interrupted at just the wrong moment

    Sounds familiar. I had not one but two (same exact type) USB flash drives that did just that (permanently bricked themselves) when they were plugged into a loose USB port. Windows popped up the unrecognized/malfunctioning USB device message, and no other OS could access the drive either. The first one was under warranty, so I got it replaced, and the second one did the same thing. And it wasn't some knock-off; it was a high-quality memory stick from a best-selling manufacturer. I'm not going to name & blame, but it looked just like this:

    Also, the computers were lab computers, so there was no shortage of idiots to abuse them and no good way to tell how snug the USB ports were until you plugged something in.



    That's assuming that doesn't violate any of the other rules. I could see most pointer arithmetic, and all but certain types of cast, being ruled out as well, severely limiting the possibility of writing a malloc().

    I mean, if all memory allocation is static anyway, then any address not known at compile time is automatically in a static array. A typed array, because that's not much of an extra restriction. We're ultrasafe, so you must bounds-check that access.

    So instead of

    struct forble *f = malloc(sizeof(struct forble));
    f->frazzle = 3;
    

    you get, by necessity:

    int i = assign_forble();
    if( i < 0 || i >= ARRAY_SIZE(forbles) ) {
        ASSERT(0, "Got invalid forble index!!");
        goto horrible_error;
    }
    forbles[i].frazzle = 3;
    

    By which time we are probably close enough to what is desired that we don't need to bother cloaking it.



  • Ok the article might be interesting, but its use of the phrase "eye-popping" has filled me with an urge to strangle its writer to death.


  • BINNED

    @tufty said:

    Also, this is nonsense. If you have garbage data, you are in an error situation. You do not know what data is in error, and you do not know the source of the erroneous data. You no longer know what the current situation is, and therefore, you do not know what action is safe to carry out.

    So what? Yes, this should not happen, but it did. Should the car start crying, or try to recover from the crash?

    @dkf said:

    Start by proving that the code won't allocate more than some fixed amount of memory. ...

    Use all those fancy and CS-y tools that are written by grad students and professors and surely will not have any bugs. Or just do not malloc, and test your program under many different loads for months. I choose the simpler, proven methods over state-of-the-art academic methods when it comes to safety-critical code.



    You can if your program follows strict memory allocation rules: put fixed-size system internal state at the beginning, then fixed-size variables for storing trusted input (like that from sensors), then string constants (literals), then normal string variables for storing output generated by your program, then variable-size system data, and finally variable-size external input.

    In this case, when a buffer overflow happens, you can be sure most of the internal state is not affected, and the data affected is always the more volatile data that you can recover by just forcing a refresh/reload.

    Of course, you'll also need to turn off compiler optimization, or use some compiler directive to nail the variables in place, because otherwise the compiler may decide the memory could be better packed another way, defeating the purpose.


  • BINNED

    It might make practical sense to lay them out like that (you can add that to your list of requirements), but relying on compiler optimization, or the lack thereof, in a safety-critical application is insane!
    I am with you on the not-using-malloc thing: turn off overcommit, allocate all your memory at the beginning of your program's life cycle (can use something like sbrk too), just to make sure you have all the memory or none at all.
    But to harden against buffer overflow, you will need a more robust method: maybe run the application on 2 separate isolated units at once and back off if they disagree (randomize memory access and use cookies so a buffer overflow will result in a detectable event). It is risky to handle an exception without a full reset; the only correct way is to increase redundancy and refresh/resync everything every once in a while.
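The "cookies" idea can be sketched as a guard word placed right after a buffer and checked before the data is trusted. This is purely illustrative (`GUARD_VALUE`, `guarded_buf`, and the helper names are made up), and a real system would combine it with redundancy and a reset policy rather than rely on it alone:

```c
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE 0xDEADC0DEu

struct guarded_buf {
    char     data[16];
    uint32_t guard;     /* canary placed immediately after the buffer */
};

void gb_init(struct guarded_buf *b)
{
    memset(b->data, 0, sizeof b->data);
    b->guard = GUARD_VALUE;
}

/* Returns 1 if the canary is intact, 0 if something overran data.
 * A clobbered guard means the contents of data can no longer be trusted,
 * which is the trigger for the back-off/resync path described above. */
int gb_intact(const struct guarded_buf *b)
{
    return b->guard == GUARD_VALUE;
}
```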



    Well, the same can be said for relying on the linker to pin driver code as non-pageable, but I bet lots of people are going to disagree.

    I'm just asking for a non-relocatable memory block in a specific layout; not too much to ask from the compiler, right?


  • BINNED

    If you can do it via a pragma, all is fine (it is in the code and part of the code), but never rely on compiler switches for program correctness.


  • Discourse touched me in a no-no place

    @dse said:

    Use all those fancy and CS-y tools that are written by grad students and professors and sure will not have any bugs.

    I was actually thinking of doing it by hand, but that works too. ;)



  • @dse said:

    Or just do not malloc

    Not using malloc saves you from one issue, where your code allocates more memory than is available on the system, and eventually locks up. In 99% of cases, that allocation / lockup cycle is merely a side effect, not the underlying bug.

    So, by not using malloc, all you can say is "my code has a bounded memory usage in all cases". This is something that, with formal verification, you can say even if you use malloc.

    @dse said:

    and test your program under many different loads for months

    Testing ≇ formal verification.

    @cheong said:

    trusted input (like those from sensors)

    I don't think you were paying attention earlier, so I'll say it again. Your program has hit an error state due to inconsistent data. As such, you do not know what data is invalid and what is valid. Nor do you know why your data is inconsistent, merely that it is. Therefore, you cannot trust sensor data as it may be the sensors that are lying to you. You are in an absolutely unknown state, which may even be outside the design parameters of the system.

    So what can you do? The only way that doesn't involve a complete shutdown and restart (and even that might not be a viable option even in system terms if the system is in use) is to poll all your sensors and try to get a consistent view of what's going on, or, at least, determine where the problem is most likely to lie, and then work around that. But that might take longer than you have time for (consider an aircraft on the point of rotation), and still leaves the possibility that you still won't be able to tell what state you're in. Which means you have to fail safe. But, as you don't know what state you're in, how do you fail safe? For an aircraft on the point of rotation, shutting the engines down would most certainly not be safe. But if it's being told its inspection covers are off and someone has their hands in the fan, running them up to max would not be safe. And you might not be able to reliably discern between the two.

    Which is not to say that the questions are not soluble, merely that it's not nearly as simple as you seem to think.



  • @tufty said:

    Which is not to say that the questions are not soluble, merely that it's not nearly as simple as you seem to think.

    The only way the question can be soluble is to ensure the error handling code can run; you can't handle errors if your error handling function can be overwritten with something else. The data allocation pattern I suggest is just there to make sure that, in case of a buffer overrun, the code is left intact and you still have access to the "variables", and nothing beyond that.

    I by no means suggest that doing this alone can cure everything. Don't overstretch what I said.

    Back on topic, I'm just trying to convey the idea that, while Toyota's code may contain lots of :wtf:, the fact of having 20,000+ global variables may not itself be one if it follows the standard practice of trying to preallocate everything.


  • Discourse touched me in a no-no place

    @cheong said:

    when a buffer overflow happens

    When an index into a buffer occurs and the result points outside the buffer, you've already lost. One of the key things you must do is always range-check every buffer access, and ensure that when an out-of-range access is tried, an error case is indicated and that you handle that case correctly. For example, you will need to set strict limits on the sizes of inputs, and just throw things away when they're too large.

    But that's not a practice to do with whether you use memory allocation. It's a practice to do with whether you use memory buffers correctly at all. It's also much more important than avoiding malloc(). It's at the core of why Java and C# are used in security-critical applications more than C and C++ are; those checks are pervasive on array and reference access, and are tied into the exception system so that a failure is handled sanely (by throwing an exception, which can be handled normally).
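The range-check-every-access rule described above might look like this in C; `buf_read` and its error convention are invented for the example:

```c
#include <stddef.h>

#define BUF_LEN 8u

static int buf[BUF_LEN];

/* Checked read: returns 0 and writes *out on success, -1 on an
 * out-of-range index.  The caller is forced to handle the error case
 * explicitly instead of reading wild memory.  An input that would
 * index past BUF_LEN is simply rejected, never clamped or wrapped. */
int buf_read(size_t i, int *out)
{
    if (i >= BUF_LEN)
        return -1;      /* indicate the error case */
    *out = buf[i];
    return 0;
}
```

This is exactly the check that managed languages insert on every array access; in C you have to write (and never skip) it yourself.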


  • Discourse touched me in a no-no place

    @cheong said:

    the fact of having 20000+ global variables itself may not be one if it follows standard practice of trying to preallocate everything.

    That depends in part on the scope of those variables. If the entire program can see all of those 20k globals all the time, it's :wtf:y. If most of them are more like file-statics, the scale of the problems will be smaller. (Compilers convert those to sort-of globals behind the scenes, but with names that prevent them from being accessed outside their originating scope.)
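The global-vs-file-static distinction above fits in a few lines; the names here are made up. A file-scope `static` has the same lifetime and storage as a global, but other translation units cannot see it, so the rest of the program cannot scribble on it:

```c
/* True global: any other file can write it by declaring
 *     extern int visible_everywhere;
 * which is why 10,000+ of these is frightening. */
int visible_everywhere = 0;

/* "Sort-of global": same lifetime and storage, but only this file can
 * touch it, so every write goes through a controlled access point. */
static int visible_here_only = 0;

int bump(void)
{
    return ++visible_here_only;
}
```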



    Embedded systems may not have infrastructure like "secure CRT" equivalents, guard pages, or the like. And even if they do, there is a tax accumulated from the old code base that doesn't use them and may have its own way of manually checking boundaries. That makes it difficult for code analysis tools to verify whether those checks are done properly. And that's why we need a fallback strategy in case some code that misses a check goes unspotted.

    Remember that in 2013, Toyota was ordered to recall 1 million cars because their brake belts would lose function on overheating. You'd think they must be tight on money, and cannot afford a multi-year rewrite like what Microsoft did for Vista.


  • Discourse touched me in a no-no place

    @cheong said:

    That make code analysis tools difficult to verify if those are properly checked.

    So they've done it wrong before and are refusing to do it right now?



  • @skotl said:

    Original article:
    The accepted, albeit voluntary, industry coding standards were first set by Motor Industry Software Reliability Association (MISRA)... For every 30 rule violations, you can expect on average three minor bugs and one major bug. Barr checked the source code against MISRA’s 2004 edition and found 81,514 violations.

    I think the 81,514 violations explain a lot. Either they didn't check the code against these tools, or they simply ignored the output.

    I have no idea whether they'll do it after being fined, but given that this September they were ordered to recall cars due to a window-switch overheating fire hazard, it seems the chance that they'll put big money into improving the codebase is slim.



  • @tufty said:

    Testing ≇ formal verification.

    This. It happens too often that some software runs for years on millions or even billions of machines without any problem, and then some bad guy does a formal falsification (or a clever guess in case of closed source software), and voilà - the next Zero Day Exploit. Or Mother Nature just does as she always does - test every possible combination of variable settings to see what'll happen.



  • @cheong said:

    there are tax accumulated from the old code base that don't use them

    Indeed. I had some code sent to me years back by a friend who had just picked up a job doing embedded work for one of the bigger players in the medical field. It was terrifying. One "big ball of mud" code base, "configured" on a per-platform basis with #ifdefs. Nested #ifdefs. Nested #ifdefs inside other nested #ifdefs. Pretty much all variables were global. Buffer sizes were fixed, but, of course, nothing checked for overflows.

    This was code running (amongst other things) dialysis machines.

    @cheong said:

    You'll think they must be tight on money usage, and cannot afford multiple-year rewrite like what Microsoft did for Vista.

    If they were serious about safety, that's exactly what they'd do (a full rewrite). Not what MS did, which was to piss away 3 years of development time and then reskin an existing codebase.


  • (The first time I tried to post this, it gave a 500 Internal Server Error and didn't post. Now it's saying my repost attempts are too similar to what I just posted, even though it didn't post. Adding this up top to hopefully work around that.)

    @dse said:

    But to harden against buffer overflow, you will need...

    ...to stop using C and related languages.

    @Francisco Alvarez said:

    Dennis Ritchie's true legacy is the buffer overflow.

    It's really that simple. The problem is that it's a bad language in which doing the wrong thing is both intuitive and easy, and getting it right is tricky and all too easy to screw up. So why in the world are we still using it?!?


  • BINNED

    Testing ≇ formal verification, correct, but neither one substitutes for the other. Formal verification itself is a form of testing, by higher-order static analysis. Any testing can miss a few cases, so it is best to use as many methods as your time budget allows. Select a combination: test with static analyzers, and even better with fuzzers. Since your time is limited, you have to select.

    @dkf said:

    I was actually thinking of doing it by hand, but that works too.

    I happen to mistake negative/positive signs a lot on math exams. More human involvement is going to help... not.

    @PWolff said:

    It happens too often that some software runs for years on millions or even billions of machines without any problem, and then some bad guy does a formal falsification (or a clever guess in case of closed source software), and voilà - the next Zero Day Exploit

    Fuzzers happen to do a great job with those, and they are testing.

    @Mason_Wheeler said:

    ...to stop using C and related languages.

    Embedded systems, and no Rust yet.

