Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching)


  • area_can

    Impact

    An attacker can read x87/MMX/SSE/AVX/AVX-512 register state belonging to another vCPU previously scheduled on the same processor. This can be state belonging to a different guest, or state belonging to a different thread inside the same guest. Furthermore, similar changes are expected for OS kernels. Consult your operating system provider for more information.

    Vulnerable Systems

    Systems running all versions of Xen are affected. Only x86 processors are vulnerable. ARM processors are not known to be affected. Only Intel Core based processors (from at least Nehalem onwards) are potentially affected. Other processor designs (Intel Atom/Knights range), and other manufacturers (AMD) are not known to be affected.



  • Ffs


  • area_can

    Dan Luu called it back in 2015/16:

    We've seen at least two serious bugs in Intel CPUs in the last quarter, and it's almost certain there are more bugs lurking. Back when I worked at a company that produced Intel compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who'd previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we'll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it's much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that. In part, that's because "unpredictable system behavior" has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it's mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort.



  • @bb36e They already confirmed all 8 "Spectre-NG" bugs; the only news is that 3 of them have now been publicly disclosed.

    At least 5 more will be published this year.



  • @bb36e simple solution: don't use these cloud servers



  • Guess they really opened a can of worms when they found out (and told everybody) that we essentially have a bunch of things in our CPU that can have race conditions.

    Which is a bit funny, because all this speculation stuff and so on essentially exists to make the lives of developers easier.


  • Banned

    @cvi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Which is a bit funny, because all this speculation stuff and so on essentially exists to make the lives of developers easier.

    Lowering security generally makes developers' lives easier.


  • Impossible Mission - B

    @gąska said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Lowering security generally makes developers' lives easier.

    And everyone else's. Right up until it doesn't.



  • @masonwheeler
    How do I delete someone else's post?


  • area_can

    @masonwheeler too soon


  • BINNED

    @dfdub said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    How do I delete someone else's post?

    Click the three dots next to the downvote button and choose 'Flag'.
    The results may vary because of mods and :kneeling_warthog:



  • @gąska said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Lowering security generally makes developers' lives easier.

    Perhaps true. But in this case the goal really is to allow developers (and compilers, to some degree) to pretend that they are targeting something with sequential in-order execution, while it's actually running something highly parallel and out of order. The bugs are just a result of that - turns out highly parallel things are difficult (and more so if you want them to be efficient), even (or especially?) if you're an Intel HW engineer.


  • BINNED

    At least you could hope that FPU contents aren't quite as sensitive, you probably won't like passwords for example. Still, sounds like we'll be up to a lot more of this.



  • @topspin said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    At least you could hope that FPU contents aren't quite as sensitive, you probably won't like passwords for example.

    AVX registers are also affected and AES encryption uses those registers, according to another article I read.
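
    To make that concrete (a hedged sketch of my own, not from the article): with AES-NI, both the data block and the expanded round keys sit in xmm registers while the encryption runs, so leaking SSE/AVX state can mean leaking key material. Something roughly like this, assuming AES-128 and pre-expanded round keys:

    ```cpp
    #include <immintrin.h>

    // Sketch: AES-128 block encryption via AES-NI intrinsics (compile with -maes).
    // Both 'block' and every entry of 'round_keys' live in xmm registers
    // for the duration of the computation.
    __m128i aes128_encrypt_block(__m128i block, const __m128i round_keys[11]) {
        block = _mm_xor_si128(block, round_keys[0]);         // initial AddRoundKey
        for (int i = 1; i < 10; ++i)
            block = _mm_aesenc_si128(block, round_keys[i]);  // rounds 1-9
        return _mm_aesenclast_si128(block, round_keys[10]);  // final round
    }
    ```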


  • BINNED

    @dfdub Well, shit, so much for that.
    Also, apparently my brain has done some sort of context-switching or speculative execution while I typed that WTF of a sentence.



  • @topspin said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    At least you could hope that FPU contents aren't quite as sensitive, you probably won't like passwords for example. Still, sounds like we'll be up to a lot more of this.

    It leaked the SSE/AVX register state. AVX has a lot of functions for dealing with vectors of integers (down to single bytes) too. A lot of people are looking into using AVX to accelerate stuff like string processing.

    But, hell, on my system, even memcpy is implemented using SSE/AVX, touching essentially all of the xmm registers. So, I would guess that quite a bit of sensitive information has a good chance of ending up in there.
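
    To illustrate (a minimal sketch, not glibc's actual memcpy, and assuming AVX2 with 32-byte-aligned buffers whose size is a multiple of 32): any copy loop in this style necessarily parks the copied bytes in the vector registers.

    ```cpp
    #include <immintrin.h>
    #include <cstddef>

    // Sketch of an AVX2 copy loop. Every byte copied (password buffers
    // included) transits the ymm registers on its way through.
    void avx2_copy(void* dst, const void* src, std::size_t n) {
        auto*       d = static_cast<__m256i*>(dst);
        const auto* s = static_cast<const __m256i*>(src);
        for (std::size_t i = 0; i < n / 32; ++i) {
            __m256i v = _mm256_load_si256(s + i);   // data now sits in a ymm register
            _mm256_store_si256(d + i, v);
        }
    }

    int main() {
        alignas(32) char secret[64] = "hunter2";    // pretend this is sensitive
        alignas(32) char copy[64];
        avx2_copy(copy, secret, sizeof secret);     // 64 bytes, two ymm-sized chunks
    }
    ```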



  • @cvi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    pretend that they are targeting something with sequential in-order execution, while it's actually running something highly parallel and out of order.

    This is why I really can't wait for projects like the Mill CPU to finally come into the realm of physically existing things, because they do away with the whole "let's pretend processors still work just like 50 years ago" shit.


  • 🚽 Regular

    @dfdub said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    @masonwheeler
    How do I delete someone else's post?

    Depends if they're running an Intel CPU I guess...


  • Discourse touched me in a no-no place

    @ixvedeusi There's several ways of doing that, but the real key is going to be getting software programmers to stop pretending that everything is just one sequential CPU. We simply cannot speed that up. As programmers, we must learn to use far more threads than we do now, and we must learn to share less between those threads (to limit the cost of synchronisation and coherency).



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    getting software programmers to stop pretending that everything is just one sequential CPU

    I agree with you on that, but this is in addition to fixing CPU multiple personality disorder and clandestine out-of-order execution, which is where the OP's bugs have their root. Because as long as the CPUs themselves lie to us and pretend that everything runs in a separate, sequential CPU, what's the programmer gonna do about it?


  • Discourse touched me in a no-no place

    @ixvedeusi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    but this is in addition to fixing CPU multiple personality disorder and clandestine out-of-order execution, which is where the OP's bugs have their root

    Yes. But there's several proposed ways to fix this, including some very interesting stuff out of Intel's research side (I know of a system with 128 CPU cores per chip). What the approaches seem to have in common is rejecting the current techniques of mainstream systems.

    The thing that worries me about all these systems is that they've not been designed for security. There's no real consideration of how to keep things working correctly when the author of the (non-OS) code is an outright adversary. I know why we've not done that in our experimental CPU, but someone needs to bite that bullet and consider how to force hardware message separation by sender and so on; you can't do it by a fancy compiler precisely because it's important to remember that the object code emitted isn't secured (and we don't want it to be).



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    @ixvedeusi There's several ways of doing that, but the real key is going to be getting software programmers to stop pretending that everything is just one sequential CPU. We simply cannot speed that up. As programmers, we must learn to use far more threads than we do now, and we must learn to share less between those threads (to limit the cost of synchronisation and coherency).

    EPIC was a good idea, but most programmers are stupid and even more programmers are lazy.



  • @masonwheeler Srsly?


  • Discourse touched me in a no-no place

    @mott555 said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    EPIC was a good idea, but most programmers are stupid and even more programmers are lazy.

    They're also usually very greedy for speed. If they need to learn to parallelise to get things to go fast, they'll bite the bullet. And make a total hash of it and we'll have a good laugh at them here, so it's not all bad…


  • Impossible Mission - B

    @blakeyrat said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    @masonwheeler Srsly?

    Yes srsly. That's admittedly an extreme example, but it's a very valid one. The 9/11 attacks could not have been successful if the airlines had implemented a few common-sense security precautions that pretty much everyone who had been researching this stuff had been telling them for years they needed to take in order to secure planes against malicious hijackers. But they preferred the convenience (and of course the cost savings!) of not having to do so, and we all know how that ended...



  • @masonwheeler I'm not saying you're wrong, I'm saying you apparently have zero conception of the concept of "tact".


  • Impossible Mission - B

    @blakeyrat So what you're saying is, I fit right in here? 🚎



  • @masonwheeler No, I am not.


  • Considered Harmful

    @blakeyrat I would take that seriously, if it were anyone but you saying it. Well, maybe not Lorne.



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    but the real key is going to be getting software programmers to stop pretending that everything is just one sequential CPU

    And the key to that is to get programming language designers to make their languages parallel-first.

    As a programmer that's not particularly into the "linguistics" of programming, I won't start making parallel stuff until languages start agreeing on how I'm supposed to do it.


  • Discourse touched me in a no-no place

    @anonymous234 said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    I won't start making parallel stuff until languages start agreeing on how I'm supposed to do it.

    Message passing. That works, scales up, and won't drive you completely insane.

    The content of that message and the exact delivery semantics…
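
    As a minimal sketch (names invented; just a mutex and a condition variable under the hood), this is about all the machinery a basic channel needs, and the queue is the only thing the two threads share:

    ```cpp
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // Sketch of a simple blocking channel: threads communicate only by
    // passing messages through it, and keep everything else private.
    template <typename T>
    class Channel {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void send(T msg) {
            { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(msg)); }
            cv_.notify_one();
        }
        T receive() {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            T msg = std::move(q_.front());
            q_.pop();
            return msg;
        }
    };

    int main() {
        Channel<std::string> ch;
        std::thread consumer([&] { std::cout << ch.receive() << '\n'; });
        ch.send("hello from the producer");   // hand the data off, keep no reference
        consumer.join();
    }
    ```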



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    won't drive you completely insane.

    I dunno. Maybe if you implement it in a sensible way. But I definitely blame parts of my insanity on having had to deal with MPI.

    But ... not everything needs to scale infinitely. For the time being, I quite like the model GPUs have going. As a programmer, you're responsible for giving the GPU enough jobs, many more than it can execute each clock cycle, and the GPU has a scheduler that will select instructions from whichever jobs are ready to execute something. Essentially Hyperthreading on steroids.

    You don't need as much cache, because latency to/from memory is hidden by executing other jobs while waiting for the results. You don't need funky speculative execution and out-of-order stuff to keep the EUs busy, because, again, you just execute instructions from a different job.

    Maybe it doesn't scale as far as some of the funky message passing things. But then again, the number of threads/jobs you need to keep a GPU busy hasn't significantly increased in the last decade.



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    you can't do it by a fancy compiler precisely because it's important to remember that the object code emitted isn't secured (and we don't want it to be)

    You can make all code go through a transformation stage before running it if that's necessary.
    We could deliver code in "almost x86" format, then when loading check that it doesn't contain certain instructions or replace some placeholders with other instructions. Obviously this doesn't work if you want things like self-modifying code.

    A more extreme approach would be to make an OS that only runs application code in some intermediate language (like .NET). That would certainly mitigate most CPU vulnerabilities.


  • Discourse touched me in a no-no place

    @cvi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    For the time being, I quite like the model GPUs have going.

    GPUs are very close to being SIMD units, especially when being used efficiently. They're power hungry as hell, but good at some types of task. General compute is more along the lines of MIMD, where the execution units are essentially just doing their own thing. Message passing is a lot simpler for working with them than trying to use shared memory; I've seen systems with 1000 cores with fully shared memory between them, but they were monster (for the time) supercomputers that obviously weren't going to scale up any further. Message passing scales up enormously further.

    Arguably, the whole internet is a kind of massively parallel message passing system…



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    GPUs are very close to being SIMD units, especially when being used efficiently.

    Not as a whole. Unfortunately the terminology is a bit screwy (what NVIDIA calls a "thread" has rather little in common with a CPU thread). Either way, the hierarchy of executable things looks as follows (using the CUDA terminology):

    • You have CUDA threads
    • 32 CUDA threads are grouped into a "warp"
    • One to 32 (I think, depends a bit on the GPU) warps are grouped into a "block"
    • An ~arbitrary number of blocks (up to some upper limit) are grouped into a "grid"

    You can see a warp as being a SIMD unit with a width of 32. So, the 32 threads of a warp will always execute the same instruction. You also have a per-thread mask that can cause the effects of an instruction to be ignored for that thread (branches are a bit special, though -- you can have divergence within a warp, but that's a bit costly as it will cause threads to do nothing).

    Different warps are not forced in any way to execute the same instructions, so at warp level and above, GPUs are MIMD rather than SIMD.

    Warps that belong to a certain block are guaranteed to execute on the same (multi-)processor, so they can access some shared resources, and can do basic synchronization (like barriers). But even within a block, warps may diverge their execution freely.

    Blocks in a grid simply belong to the same kernel launch, so essentially they all start at the same entry point (the kernel's entry point). It's not guaranteed that all blocks are resident at the same time, so you can launch as many blocks as is convenient to your problem. Blocks that don't fit initially are swapped in as earlier blocks finish and resources become available.

    Other GPUs behave similarly; the numbers vary a bit. I think AMD has (or at least had) groups of 64 as the smallest unit.

    Either way, even if you execute the same kernel with no real branches, the GPU won't run as a single SIMD unit. Latency/scheduling will cause some warps to be delayed, and others will get ahead. It's also very useful to do things like having warps pull jobs from a list dynamically and so on, at which point the whole thing is definitively not running SIMD-like.
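
    To make the warp-level behaviour concrete, here's a tiny sketch (kernel and numbers invented for illustration): the branch diverges within each warp, so the two halves of the warp get serialized against each other, while separate warps remain free to be at completely different instructions.

    ```cpp
    #include <cstdio>

    // Sketch: divergence is a per-warp (32 threads) affair, not per kernel.
    __global__ void divergeDemo(int* out) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;     // position within the warp

        if (lane < 16) {
            out[tid] = tid * 2;          // half the warp takes this path...
        } else {
            out[tid] = tid * 3;          // ...the other half runs afterwards
        }                                // -> both halves serialized within the warp
    }

    int main() {
        const int n = 256;               // 8 warps in total, scheduled independently
        int* out = nullptr;
        cudaMallocManaged(&out, n * sizeof(int));
        divergeDemo<<<2, 128>>>(out);    // 2 blocks of 128 threads = 4 warps each
        cudaDeviceSynchronize();
        printf("out[0]=%d out[16]=%d\n", out[0], out[16]);
        cudaFree(out);
        return 0;
    }
    ```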

    They're power hungry as hell, but good at some types of task.

    Last time I checked, they compared rather favourably in FLOPS/Watt, at least compared to CPUs.



  • @dkf but I'm not sure about most of the messages being sent...



  • @anonymous234 said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    get programming language designers to make their languages parallel-first

    I'd say a good place to start with this would be to have sensible defaults for what C++ calls "storage duration". In most programming languages it's easy to create a non-synchronized global variable (it's the default for anything with global scope), more difficult to create a thread-local variable, and even more difficult to create a global with serialized access. This is all backwards; IMHO, it should be (see the sketch after the list):

    • By default, everything with global scope is thread-local.
    • You need to explicitly state that you want global rather than thread-local storage. By default, access to such a variable is serialized.
    • You need to jump through some hoops to get a non-serialized global.
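
    For contrast, a rough C++ sketch (names invented) of what those three tiers look like when you have to ask for each of them by hand today; note how the effort is exactly inverted relative to the list above:

    ```cpp
    #include <mutex>
    #include <thread>

    // 1. Thread-local "global": each thread gets its own copy (opt-in today).
    thread_local int per_thread_counter = 0;

    // 2. Truly global with serialized access: you have to build it yourself.
    struct SerializedCounter {
        void add(int n) { std::lock_guard<std::mutex> lock(m_); value_ += n; }
        int  get()      { std::lock_guard<std::mutex> lock(m_); return value_; }
    private:
        int value_ = 0;
        std::mutex m_;
    };
    SerializedCounter shared_counter;

    // 3. Non-serialized global: the zero-effort default, and a data race
    //    waiting to happen the moment two threads touch it.
    int racy_counter = 0;

    int main() {
        std::thread t([] {
            per_thread_counter++;    // this thread's copy only
            shared_counter.add(1);   // serialized
        });
        per_thread_counter++;        // main thread's own copy, no race
        shared_counter.add(1);
        t.join();
    }
    ```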


  • @ixvedeusi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Mill CPU

    I just hope they have an efficient backend for at least one of gcc or llvm so they don't fall into the same trap that killed off Itanium.


  • Discourse touched me in a no-no place

    @cvi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Last time I checked, they compared rather favourably in FLOPS/Watt, at least compared to CPUs.

    It depends on the exact operation mix that you're using. There's a whole subtle set of issues involved, and for some things you're actually much better off with an ASIC (provided you can build on a suitable process technology). The usual rule is that if you have very branch-heavy code (or are working with doubles) you should use a CPU, if you have a need for more parallelism but fewer branches (and are more float heavy) then you go with a GPU, and if you're really certain about what you're doing then you go with an ASIC and stamp as many parallel execution units out as your budget allows. The key to why ASICs are better for specific tasks is that you've got a choice in how fast you make your transistors and can spend them in ways that optimise your real task, which is harder to do in more generic systems.

    I look forward to when I can report the FLOPS/Watt figures for our next-gen chip. The old version was great at ops/watt, especially given how old the design was, but terrible at FLOPS as it had no hardware float support at all. And its real specialty was in federated multicast comms, which FLOPS/Watt doesn't even get close to mentioning… 😐


  • Discourse touched me in a no-no place

    @ixvedeusi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    • By default, everything with global scope is thread-local.
    • You need to explicitly state that you want global rather than thread-local storage. By default, access to such a variable is serialized.
    • You need to jump through some hoops to get a non-serialized global.

    👍

    Did you know that there are programming languages which do this? Except that global state is a bit trickier than that; you need to be able to expand the scope of synchronisation beyond a single operation, as not all changes can be easily expressed in simple atomic ops. It turns out that reentrant critical sections (however implemented) are pretty important for keeping programming models with shared state sane… unless you instead send the code for the modification to the single thread that owns the state…
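
    A minimal sketch of that last model (all names invented): one worker thread owns the state outright and drains a queue of closures, so the state itself never needs a lock; only the job queue does.

    ```cpp
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Sketch: post the modification to the thread that owns the state.
    class StateOwner {
        int state_ = 0;                              // touched only by worker_
        std::queue<std::function<void(int&)>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
        std::thread worker_;

        void run() {
            for (;;) {
                std::function<void(int&)> job;
                {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                    if (jobs_.empty()) return;       // done and drained
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job(state_);                         // no lock needed here
            }
        }
    public:
        StateOwner() : worker_(&StateOwner::run, this) {}
        ~StateOwner() {
            { std::lock_guard<std::mutex> lock(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }
        void post(std::function<void(int&)> job) {
            { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    };

    int main() {
        StateOwner owner;
        owner.post([](int& s) { s += 41; });
        owner.post([](int& s) { s += 1; std::cout << "state = " << s << '\n'; });
    }   // destructor drains remaining jobs, then joins
    ```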



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    There's a whole subtle set of issues involved, and for some things you're actually much better off with an ASIC (provided you can build on a suitable process technology).

    Then again, making an ASIC is perhaps just a slight bit out of reach for most people. Not to mention things like finding people who can design ASICs, the turnaround time for a single iteration, and the fact that few codes are that stable.

    Compare this to a GPU, which you can buy in $local_store, or of which you find a pile in most compute clusters these days. And, hell, even MATLAB can use a GPU occasionally.

    or are working with doubles

    Maybe. If you run on a compute cluster, you probably have access to the fancy expensive versions that do very well in the double department (again with a favourable FLOPS/Watt rating, compared to other general-purpose devices).

    There's a reason new high-end clusters are stuffed full of GPUs. For example, Summit has 3 GPUs for each CPU. And the 200 petaflop rating is likely for that standardized double-precision benchmark that everybody compares to (they mention reaching exaflop levels in lower precision -- but that might even be at half-precision).



  • @ixvedeusi said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    • By default, everything with global scope is thread-local.

    Perl has that model, but other languages didn't follow. I suspect it is too confusing for programmers.

    Rust instead has the rule that everything with global scope has to be synchronised, which is trivially true for constants (constness is transitive) and by definition true for synchronisation containers. And false for anything else.


  • Java Dev

    @dkf I do a lot of parallel stuff with message passing (in C) and even that can get fiddly.

    Not so much my primary working state; messages only affect state on their own session, so I just queue everything on the same session to the same thread.
    More so things like configuration, which is essentially static, but I do sometimes need to publish a new version which is picked up atomically.
    Or processing statistics, which I keep per thread (fiddly, with a pthreads key) and then later need to aggregate over all threads (unsynchronised reads; I don't care enough to spend performance on keeping it accurate/consistent).
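
    The atomic configuration publish looks roughly like this (a loose C++ sketch rather than my actual C/pthreads code; names invented): readers take one snapshot per operation, the writer never mutates in place, and old versions die when the last reader drops its reference.

    ```cpp
    #include <memory>
    #include <string>

    // Sketch: immutable config snapshots, swapped in atomically.
    struct Config {
        int max_sessions;
        std::string log_level;
    };

    std::shared_ptr<const Config> g_config =
        std::make_shared<const Config>(Config{100, "info"});

    // Reader threads: one atomic load, then work off the snapshot lock-free.
    std::shared_ptr<const Config> current_config() {
        return std::atomic_load(&g_config);
    }

    // Writer thread: builds a whole new snapshot and replaces the pointer.
    void publish_config(int max_sessions, std::string log_level) {
        auto next = std::make_shared<const Config>(
            Config{max_sessions, std::move(log_level)});
        std::atomic_store(&g_config, std::move(next));
    }
    ```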


  • Considered Harmful

    @ixvedeusi Rust does something close to that. Basically it's an unsafe operation to access a static mut variable, so you have to wrap it in a Mutex or similar.


  • Discourse touched me in a no-no place

    @bulb said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Perl has that model, but other languages didn't follow.

    Some others use that model too.



  • @dkf Which ones? I admit I never looked at how threading is done in Tcl.


  • Discourse touched me in a no-no place

    @bulb That's the one I'm thinking of, where each “global” context is strictly thread-bound and substantial hoops have to be jumped through to access cross-thread state. The implementation isn't entirely lock-free, but most code doesn't need to care about the details. There's none of the bizarre behaviour associated with Python's GIL fuckery. (A good production Tcl program will be using worker thread pooling, passing fairly coarse-grained messages around, and will leave any fine-grained shared stuff to the database. Or it might spin off one interpreter per session; they're light enough that that's entirely practical and I know some people who use that.) I was amused a month or two ago when I compared an identical algorithm (admittedly without threads) in Python (various versions) and Tcl and got the result that Tcl beat Python in all cases; I didn't expect that as I'm sure Python's got a lot more developer effort available. I also know that Python's threading has major problems with starvation when there's a CPU-bound thread about (it's caused us significant headaches at work, both in 2.7 and 3.6).

    I believe that Go is also quite a bit along that road as well, as it makes messaging at least as easy as global shared state. (It's the legacy of the theoretical work in CCS/CSP/π-calculus, which is all message-based.) I don't know it as well though.

    Of course, all this isn't the only way to slice up the problem. The programming models that most people assume don't work so effectively once the number of processors starts to really get large. (We've got a million-core monster system at work, and it's downright weird to work with. Especially as each core is a slow old ARM with not much fast RAM nor vast amounts of slow RAM.)



  • @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    Tcl beat Python in all cases; I didn't expect that

    I'd definitely expect that. CPython tends to be one of the slowest in most benchmarks, whatever they are actually measuring.

    @dkf said in Intel Confirms 'CPU Bugs' Episodes 4, 5, 6 (FPU Switching):

    I believe that Go is also quite a bit along that road as well, as it makes messaging at least as easy as global shared state.

    It does make message passing easy, but as far as I could tell it does nothing to prevent sharing state, either global state or state created by passing references in the messages.

