Am I the dumb? Why are we rolling our own Destroy logic?


  • BINNED

    @steve_the_cynic said in Am I the dumb? Why are we rolling our own Destroy logic?:

    TR:wtf:, which is a bunch of people who don't know about locking and other synchronisation funtimes, but are writing threaded code anyway.

    It actually looks like they don't know about lifetimes/ownership, either. ☹



  • @cvi said in Am I the dumb? Why are we rolling our own Destroy logic?:

    assuming that the TNode constructor does the ~~necessary~~ needful things



  • @topspin said in Am I the dumb? Why are we rolling our own Destroy logic?:

    It actually looks like they don't know about lifetimes/ownership, ~~either~~ anything.

    I think we can all agree that whoever wrote that shouldn't continue writing C++ without further training.



  • @steve_the_cynic said in Am I the dumb? Why are we rolling our own Destroy logic?:

    Coherency between cores/vCPUs in different sockets is significantly harder than between different cores of the same socket, and there are substantial performance hits when locks are contested between threads on different sockets.

    Yeah. Any situation where there's a lot of contention is problematic, because it kicks the cache coherency protocols into full overdrive. That is, if threads A and B on different cores repeatedly write to the same memory, they will keep evicting each other's corresponding cache lines. I think contested locks do the same thing, since the typical implementation involves spinning with a CAS on a single memory location (assuming CAS doesn't bypass the L1$) for a while before eventually handing it off to the OS.

    I actually don't know how e.g. Intel does coherency across different chips/sockets. Seems messy and expensive, especially when they are in different sockets.
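
    That spin-with-a-CAS-then-yield pattern can be sketched roughly like this (a minimal test-and-test-and-set spinlock, not any real OS's futex path; the spin limit of 1000 is an arbitrary illustrative value):

    ```cpp
    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    // A toy spinlock: spin with a CAS on one memory location for a bounded
    // number of attempts, then yield to the OS scheduler. Under contention,
    // every CAS attempt drags the cache line holding `locked` into the
    // spinning core in exclusive state, evicting it from the other cores.
    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            for (int spins = 0; ; ++spins) {
                // Test-and-test-and-set: read first, so waiters spin on a
                // shared copy of the line instead of hammering it with writes.
                if (!locked.load(std::memory_order_relaxed)) {
                    bool expected = false;
                    if (locked.compare_exchange_weak(expected, true,
                                                     std::memory_order_acquire))
                        return;
                }
                if (spins >= 1000)             // done spinning for now,
                    std::this_thread::yield(); // hand the CPU back to the OS
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };

    int main() {
        SpinLock lock;
        long counter = 0;
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([&] {
                for (int i = 0; i < 100000; ++i) {
                    lock.lock();
                    ++counter;   // contended writes to shared state
                    lock.unlock();
                }
            });
        for (auto& th : threads) th.join();
        std::cout << counter << "\n"; // 400000
    }
    ```

    A real implementation would park the thread in the kernel (e.g. a futex wait) instead of just yielding, but the cache-line ping-pong during the spin phase is the same.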



  • @cvi I worked on a project doing just that, except "system" memory was decentralized — each CPU had a few GB of RAM physically connected to that CPU. A thread running on CPU 1 could "own" a chunk of memory that was physically connected to CPU 7. Also, the virtual address of that memory might be different for each CPU, different again for a DMA request from a PCIe peripheral, different again for a DMA request from a legacy PCI device, and all of those different from the RAM's physical address. And we were intentionally creating high-contention situations, where every device that could conceivably access a chunk was accessing overlapping chunks repeatedly. Following a transaction through multiple address translations in each direction to figure out why a device got stale data, or data it shouldn't have seen yet, or data from the wrong location was ... not fun ... to debug.