Is clean architecture controversial?


  • Discourse touched me in a no-no place

    @Bulb said in Is clean architecture controversial?:

    Makes me wonder what Ruby does; I never looked at that one in more detail.

    I just looked it up. They provide a standard Thread class that runs a block with standard (for Ruby) captured variables, and so must have some sort of global interpreter lock to protect the memory management code. But they probably take the approach of just holding the lock when they need it, instead of the idiotic Python approach (which only ever worked on single-core machines or in very I/O-bound code).

    The acid test is to run CPU-bound and I/O-bound threads at the same time and see what happens. If things are done right (whatever that means in terms of code) then the threads won't unnecessarily block each other. If that isn't the case, the gross priority inversion will prove that their code must have the same blunder as Python.
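
    A minimal sketch of that acid test, written in Java purely as the control case (class name hypothetical; Java has no GIL, so the sleeper should wake on time): one thread spins CPU-bound while another sleeps briefly in a loop and measures how late it wakes up. Port the same structure to Ruby or Python; if the overshoot balloons while the CPU thread runs, that's the priority inversion.

    ```java
    import java.util.concurrent.atomic.AtomicBoolean;

    public class GilAcidTest {
        public static void main(String[] args) throws InterruptedException {
            AtomicBoolean stop = new AtomicBoolean(false);

            // CPU-bound thread: spins on arithmetic, never yields voluntarily.
            Thread cpu = new Thread(() -> {
                long x = 1;
                while (!stop.get()) { x = x * 31 + 7; }
            });

            // I/O-bound stand-in: sleeps 10 ms and measures wake-up lateness.
            Thread io = new Thread(() -> {
                for (int i = 0; i < 50; i++) {
                    long start = System.nanoTime();
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                    long overshootMs = (System.nanoTime() - start) / 1_000_000 - 10;
                    System.out.println("wake-up overshoot: " + overshootMs + " ms");
                }
            });

            cpu.start();
            io.start();
            io.join();
            stop.set(true);
            cpu.join();
        }
    }
    ```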


  • BINNED

    I know that Python has this fucked up threading implementation that makes it completely useless due to the GIL. But, being at home mostly in unmanaged languages, I just realized I never really thought about what a non-stupid threading model in managed languages looks like.

    From what I can tell:
    In Python everything is locked behind the GIL and performance is non-existent. In JS, threading itself is non-existent. In C/C++ it's a free-for-all: you can get fast threading, but it's also up to you to avoid UB from race conditions and random crashes. In Rust, I figure you need to be careful in unsafe code and are otherwise forced to avoid race conditions mostly by not having shared mutable state to begin with?
    What do other languages like Java/C#/etc. do? I think they don't protect you from fucking up with race conditions, but being managed, they still need to restrict the fallout in a way that doesn't violate the memory safety guarantees. So I guess that means all memory accesses are atomic? Is that even enough?


  • Discourse touched me in a no-no place

    @topspin said in Is clean architecture controversial?:

    What do other languages like Java/C#/etc. do? I think they don't protect you from fucking up with race conditions, but being managed, they still need to restrict the fallout in a way that doesn't violate the memory safety guarantees. So I guess that means all memory accesses are atomic? Is that even enough?

    Speaking about Java only, references are the largest things that can normally be written pseudo-atomically (no memory barrier, but your thread either sees a whole write or doesn't see it at all; no half-writes). It isn't guaranteed for primitive types unless the volatile keyword is used, and you don't have atomic increment or decrement. (Or rather, you have to use a special class that does the low-level shenanigans for you.) You are strongly encouraged not to have two threads writing to a variable without some sort of locking or atomicity control. You never ever have misaligned reads and writes (except for binary buffers, which can handle that stuff for you with copies), so that's a whole class of trouble that simply isn't there.
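
    A minimal Java sketch of those two points (hypothetical class, for illustration only): volatile buys you tear-free reads and writes of 64-bit fields, and the "special class" for atomic increment lives in java.util.concurrent.atomic.

    ```java
    import java.util.concurrent.atomic.AtomicLong;

    class Counters {
        // Without volatile, a plain long may legally be written in two
        // 32-bit halves (JLS 17.7), so another thread could see a torn value.
        volatile long lastTimestamp;

        // ++ is a read-modify-write and is not atomic even on a volatile
        // field; AtomicLong does the low-level shenanigans instead.
        private final AtomicLong hits = new AtomicLong();

        void recordHit(long now) {
            lastTimestamp = now;      // single tear-free write
            hits.incrementAndGet();   // single atomic increment, no lock
        }

        long total() { return hits.get(); }
    }
    ```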

    I'm not quite sure how they avoid the half write problem for references, but they have enough dev effort that they could even do something like aligning things with awareness of processor cache layouts, and that would be totally transparent to higher-level code. I know they don't guarantee visibility across threads unless you do things the right way, which tells me there's no memory barrier.



  • @dkf said in Is clean architecture controversial?:

    I'm not quite sure how they avoid the half write problem for references,

    I only have a vague memory from somebody hacking on the internals of the JVM, but from that it seemed like object references were word-sized (e.g., 64 bits on 64-bit systems), which at least on common machines would be written atomically. (The 64-bit values were some sort of compressed pointer, with additional bits used for other nefarious purposes.)

    Another vague memory is that early .NET struggled on ARM, due to having/assuming multithreading guarantees that were valid on x86 but required additional memory barriers on, e.g., ARM (which was costly). I don't know if/how they resolved that...

    Still, those are a different set of problems from the ones dynamic languages have. If you've ever written an interpreter, there are additional problems, like looking up a name in some scope and then accessing the value. If that requires a (possibly) shared data structure, you're in for some fun (example below).

    That's one of the places where I think something like transactional memory (even Intel's primitive extension) seemed like a good match...
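
    To make the name-lookup race concrete, here's a hypothetical Java sketch of an interpreter scope (illustrative only): the existence check and the read are two separate steps, so another thread can unbind the name in between, and a plain HashMap isn't even safe against concurrent structural modification.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical interpreter scope shared between threads.
    class Scope {
        private final Map<String, Object> vars = new HashMap<>();

        Object lookup(String name) {
            if (vars.containsKey(name)) {   // step 1: the name exists...
                return vars.get(name);      // step 2: ...but may be unbound by now
            }
            throw new IllegalStateException("NameError: " + name);
        }

        void unbind(String name) {
            vars.remove(name);              // racing this breaks lookup()
        }
    }
    ```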


  • Discourse touched me in a no-no place

    @cvi The approach I know best involves an apartment memory model, where data structures are essentially never shared between threads (yes, that implies a deep copy on any inter-thread messaging). That has the advantage of making almost all code not need to deal with memory consistency or locks or priority inversion or any of that shit. That approach has the consequence of making threads relatively heavy to use, but does mean that code that uses threads can scale well (up to the level supported by the hardware) and transitioning to using multiple processes is pretty easy too.
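
    A rough Java imitation of that apartment style (names hypothetical, just a sketch): each thread owns its data outright, and messages are copied into a queue rather than shared.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ApartmentDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<List<Integer>> mailbox = new ArrayBlockingQueue<>(16);

            Thread worker = new Thread(() -> {
                try {
                    List<Integer> mine = mailbox.take(); // sole owner from here on
                    mine.replaceAll(n -> n * n);
                    System.out.println("worker sees: " + mine);
                } catch (InterruptedException ignored) { }
            });
            worker.start();

            List<Integer> data = new ArrayList<>(List.of(1, 2, 3));
            mailbox.put(new ArrayList<>(data)); // copy on send; sender keeps its own
            data.add(4);                        // safe: the worker never sees this

            worker.join();
        }
    }
    ```

    (A real apartment model deep-copies arbitrary object graphs; here the Integer elements are immutable, so copying the list is enough.)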

    On the other point...

    The memory model of the x86 and its descendants works nicely for small writes, provided those writes are contained within a single cache line. If your write crosses cache lines, you need to be brave to assume that things will work. With ARM, the story is complex because it depends on what optional modules have been added to the CPU core. If there is no MMU, there is no cache, writes have to be aligned, and you have no memory consistency at all, so you must write your code accordingly (this is what we do with our SoC designs, FWIW; locks use a special custom hardware unit and are extremely limited in number). But there are multiple MMU options, each with associated costs and benefits; greater inter-CPU consistency tends to involve more energy per write. I don't know other platforms well enough to comment.



  • @dkf said in Is clean architecture controversial?:

    The approach I know best involves an apartment memory model, where data structures are essentially never shared between threads (yes, that implies a deep copy on any inter-thread messaging). That has the advantage of making almost all code not need to deal with memory consistency or locks or priority inversion or any of that shit. That approach has the consequence of making threads relatively heavy to use, but does mean that code that uses threads can scale well (up to the level supported by the hardware) and transitioning to using multiple processes is pretty easy too.

    That would also imply (essentially) not having any objects shared between threads? (E.g. the global scope practically becomes a thread-local global scope.)

    This was where TSX seemed like a good match. No locks required (however, a few additional instructions are), and the HW tells you if any of the memory you accessed was modified by a different core. On x86 where TSX existed, that information was kinda around anyway, courtesy of the cache coherency protocols on the die.

    With ARM, the story is complex because it depends on what optional modules have been added to the CPU core. If there is no MMU, there is no cache, writes have to be aligned, and you have no memory consistency at all, so you must write your code accordingly (this is what we do with our SoC designs, FWIW; locks use a special custom hardware unit and are extremely limited in number). But there are multiple MMU options, each with associated costs and benefits; greater inter-CPU consistency tends to involve more energy per write. I don't know other platforms well enough to comment.

    Yeah, I tend to forget about that. :-)
    ARM for me is usually the kind of cores that you find in smartphones, tablets, and ARM-based pseudo-laptops (I actually should know better, on account of using stripped-down ARM pseudo-Arduino boards too). The smartphone/tablet/laptop ones, however, all have an MMU (plus they seem to be relatively consistent about some other options, e.g., vector units).

    I wasn't even aware that you could have different MMUs. What's the key difference between them?


  • Discourse touched me in a no-no place

    @cvi said in Is clean architecture controversial?:

    I wasn't even aware that you could have different MMUs. What's the key difference between them?

    Don't really know. I'm a software guy, even if occasionally a very low-level one! I know that whether non-aligned writes are supported depends on having an MMU, and that means that module must also be handling the cache (makes sense; the cache needs to be aware of the virtual/physical mapping and is intimately involved in making non-aligned reads and writes work; main memory doesn't do anything non-aligned). Cache coherency depends on the configuration of the communications between cache controllers (Wikipedia lists a few options), and that means there will probably be options to do it sold by ARM (because that's totally their business model; if a big enough customer wants it, they'll sell it).

    I've not explored the MMU part of the ARM catalogue precisely because I knew we hadn't bought one. We use a different memory model entirely (cacheless, so no consistency to worry about).



  • @topspin For C# and Java, references are atomic, container classes have special thread-safe versions, and there are locking mechanisms like semaphores and mutexes.
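
    For instance, in Java (hypothetical snippet, illustration only): ConcurrentHashMap is one of those thread-safe container versions, and java.util.concurrent.Semaphore is the counting-semaphore flavour of those locking mechanisms.

    ```java
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    class Downloads {
        // Thread-safe container: per-bin locking/CAS inside, no external lock.
        private final ConcurrentHashMap<String, Integer> completed = new ConcurrentHashMap<>();

        // Counting semaphore: at most 4 threads in the critical region at once.
        private final Semaphore slots = new Semaphore(4);

        void fetch(String url) throws InterruptedException {
            slots.acquire();
            try {
                // ... do the slow work here ...
                completed.merge(url, 1, Integer::sum); // atomic per-key update
            } finally {
                slots.release();
            }
        }
    }
    ```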

    Ruby, as far as I've googled, is the same as Python, with a GIL.



  • @dkf That kinda makes sense, especially w.r.t. cache coherency. From what I understand, it's relatively expensive. Or at least it was; from people working on that at the HW level, I heard it was what kept x86 at relatively modest core counts for a long time. (But I'm mainly software as well, so don't quote me on this.)

    We use a different memory model entirely (cacheless, so no consistency to worry about).

    GPUs (especially the early general-purpose-capable ones) would also mostly bypass caches, or only use them for specialized purposes. Instead they had a hyperthreading-like mechanism to hide latency and avoid stalling cores until data was available (e.g., execute other 'threads' until data arrives, and keep a ton of 'threads' around so there is hopefully always other work to do). How do you deal with this issue in your HW?


  • Discourse touched me in a no-no place

    @cvi said in Is clean architecture controversial?:

    GPUs (especially the early general-purpose-capable ones) would also mostly bypass caches, or only use them for specialized purposes. Instead they had a hyperthreading-like mechanism to hide latency and avoid stalling cores until data was available (e.g., execute other 'threads' until data arrives, and keep a ton of 'threads' around so there is hopefully always other work to do). How do you deal with this issue in your HW?

    We have (local to each CPU core) a relatively small amount of memory with the same speed as cache; (simple) memory operations on it can be done within the same clock cycle as the rest of the instruction. (OK, the clock speed isn't hugely fast.) We then have another larger memory that is shared between all cores on the die; access to that is slow unless we use either a buffered write (the memory controller does it in the background) or the DMA engine (which can do clever hardware pipelining tricks).

    We don't run code out of the large memory... but that limits us to 32kB of instruction memory per core (and 32kB of fast data memory). The big memory is used in practice in a way that is almost analogous to a disk/filesystem, except it still is all memory mapped by default. It can be used for inter-core comms, but we can't do locking that way (we've a separate hardware module for that, but that's reserved to communication with the OS) so working in that mode requires adhering to TDMA coding; as long as your writes are finished 10–15 cycles before the end of your timeslot, the other core will see the changes just fine from the start of its timeslot.

    We have plenty of CPU cores per chip though (we normally dedicate one CPU core per chip to the OS and another to being an extended streaming IO controller — there isn't room to pack that into the OS core), and a fast comms fabric for (very) short multicast messages that can be extended from internal to the chip to the whole system (quite a few absolutely full rack cabinets; our stuff packs tighter than most cluster hardware because we have a better thermal envelope).

    It's an unusual architecture that is part embedded system and part funky supercomputer. It has the habit of invalidating lots of higher-level programming languages' assumptions.

