Does this circular buffer need a mutex?

kazitor

inb4 Betteridge

One thread produces items fairly quickly and needs to dump them in short order so it can get on with the next one. Another thread consumes items and is liable to be blocked (file I/O).

I have a circular buffer / queue, where the first thread checks that end+1 (mod length) ≠ start, then inserts an item and increments end if so. The second thread pulls items as long as start ≠ end, incrementing start after each one. The first never changes start, and the second never changes end.

Being a good citizen I wrapped both sets of operations with a mutex / lock without a second thought. But now I’m wondering whether it’s actually possible for the two threads to do something Inconsistent, because my intuition says no. The increments happen after the buffer is modified, and each thread won’t disturb the other’s position until it makes that increment, right?

No crystal clear answer from some quick interweb searches, so, can any experts here guarantee the situation as described would be safe without locks? It’d be great if the producer thread doesn’t ever have to wait on a lock, although the consumer doesn’t actually do anything blocking while it holds it.

The essence in program form:

#define LENGTH 64
int start = 0;
int end = 0;
int buffer[LENGTH];
pthread_mutex_t mutex;

/* invoked by thread A reguarly; should return quickly */
void addItem(int item)
{
  pthread_mutex_lock(&mutex);

  if((end + 1) % LENGTH != start) {
    buffer[end] = item;
    end = (end + 1) % LENGTH;
  }
  /* if there’s no space, do nothing */

  pthread_mutex_unlock(&mutex);
}

/* invoked by thread B periodically, should only strive to not hold the lock for long */
void removeItems()
{
  pthread_mutex_lock(&mutex);

  while(start != end) {
    process(buffer[start]);
    start = (start + 1) % LENGTH;
  }

  pthread_mutex_unlock(&mutex);
}

dkf

@kazitor You need some sort of memory barrier so that the action that marks an item as free definitely happens after you've stopped using it, and equivalent for the other end too; once an item is enqueued conceptually, you really mustn't be handling it at the producer side. It's particularly important that the compiler respects that ordering, which is one of the reasons you need memory barriers. You might also need to make the start/end variables volatile.

OTOH, if you just make the circular buffer big enough then it won't matter too much. Also, make sure that the average rate of the consumer is faster than the average rate of the producer. It sounds obvious but is really critical. (If you like math, you can use statistical estimates of the rates to work out the projected optimal queue length. I just overallocate unless I've got a good reason not to.)

If you go to having two or more consumers, put a lock in.

cvi

tl;dr: Look for lock-free data structures for designing a queue/buffer without mutexes. There are some well-known methods for this, and some of them even work in practice (and not only theory).

The longer answer is that there's more to this than just mutexes. You want to make sure things happen in the right order (both on the level of the compiler/optimizations and in hardware). The conceptually right way of doing this is with atomics (e.g., even just atomic load and store); if you have a decent implementation, those will reduce just to normal loads/stores, if your hardware makes sufficient guarantees (e.g., x86, but arm not as much).

Edit: Seeing @dkf's post. Barriers help but are tricky, there are quite a few different ones. IMO: use atomics instead, unless you're writing assembly-level code and know the hardware well enough. Atomics should generate the correct set of barriers in addition to making sure that loads/stores/etc are atomic. (In some HW, you can get the i = (i+1)%L as a single operation.)

See Memory Reordering Caught in the Act.

kazitor

Right, so even with a sane algorithm, it needs assurances that the compiler and processor won’t be playing silly buggers. I’ll leave my own silly buggers for ‘another time,’ like procrastination.

remi

I don't have enough experience to answer your overall question but I can see that you explicitly discard items that you can't write:

/* if there’s no space, do nothing */

Is that really intended? You're going to (silently) drop items if the buffer is full.

Intuitively, I would not want to randomly loose items (but of course it depends on what those items are, if they're e.g. graphical frames of a movie it might be OK to drop some if they come in too fast compared to displaying them).

I would therefore do some sort of wait/loop if that happens to try again a bit later, and I can't see how that one would work without some sort of synchronisation between the threads?

(also, apart from the interesting theoretical question, does it matter in practice? I know that in my code I tend to put mutexes every time I have the slightest doubt and only look more closely if there is an actual performance impact...)

topspin

@dkf said in Does this circular buffer need a mutex?:

You might also need to make the start/end variables volatile.

I'm not sure how much the C memory model differs from the C++ one, but assuming it's not too much, I'm going with volatile is not the right answer for this. You need the correct barriers for synchronization and atomics for, well, atomicity.

the situation as described would be safe without locks?

Regarding the original question: if you mean can you just remove the mutex completely and have it work fine, I don't think so. If you want to replace it with something lighter-weight than mutexes, see what's already been written: possible, but an expert-only rabbit-hole you probably want to avoid.

E: Of course a hanzo appeared while I was typing and you already decided on the correct solution of .

dkf

@remi said in Does this circular buffer need a mutex?:

Is that really intended? You're going to (silently) drop items if the buffer is full.

With experience from dealing with realtime code, if things get dropped then increment a counter saying that you've done so. From time to time, you check the counter: if it's still 0 then you know that everything's working as it should. If it's non-zero but fairly small then you had a spike of load (not great, but happens). If it's large, you've made a bad mistake somewhere.

kazitor

@remi said in Does this circular buffer need a mutex?:

I don't have enough experience to answer your overall question but I can see that you explicitly discard items that you can't write:
/* if there’s no space, do nothing */
Is that really intended? You're going to (silently) drop items if the buffer is full.

Definitely intentional; it needs to finish in time for the next item, and waiting on space could take any amount of time. The real comment might even use the ‘drop’ word. But it does print a warning, and if it happens occasionally once I run this at full speed (and not persistently, which indicates larger problems), I’ll just bump up the buffer size. There are timestamps so it should be somewhat apparent if something was missed; I’ve also had thoughts on keeping a more explicit record of how many are missed sequentially.

(also, apart from the interesting theoretical question, does it matter in practice? I know that in my code I tend to put mutexes every time I have the slightest doubt and only look more closely if there is an actual performance impact...)

Dunno. I’ve only tested and debugged with a few orders of magnitude less data than targeted, but this threading and buffering implementation is already considerably more responsive than when there wasn’t any .

cvi

@kazitor said in Does this circular buffer need a mutex?:

Dunno. I’ve only tested and debugged with a few orders of magnitude less data than targeted, but this threading and buffering implementation is already considerably more responsive than when there wasn’t any .

From my experience with profiling this kind of stuff: mutexes are typically quite reasonable in this regard.

Getting the lock if there is no contention is cheap, it's essentially just an atomic compare and swap. If there's contention, most mutexes have a short spin-lock that will busy wait for the lock to be released. Since you have a very small amount of code that's locked and only two threads, it's likely that the spin lock will wait long enough for the other end to release the mutex. This is good, because it keeps the mutex from suspending the thread and having the kernel wait for the mutex to become released before scheduling it again (IIRC, look up "futex" on linux for details ... I believe Win32 CRITICAL_SECTIONS work in a similar way). This means that most of the time you won't have to make a system call for locking and only execute relatively cheap userland code.

dkf

@topspin said in Does this circular buffer need a mutex?:

I'm not sure how much the C memory model differs from the C++ one, but assuming it's not too much, I'm going with volatile is not the right answer for this. You need the correct barriers for synchronization and atomics for, well, atomicity.

The barriers are absolutely critical. And volatile in both C and C++ says "you can't see all that is going on with this, Mr. Compiler, so do what you're told"; in particular, it can't remove, merge or reorder reads or writes. (The barriers stop other work from being moved across that.)

You only need the full power of an atomic if you've got multiple writers of a variable. With a single writer of a particular variable, you don't need a full bus synch and can handle x86/x64 and ARM-A just fine; ARM-M is something else, and normal atomics might be broken entirely on that. I'm not sure what memory models occur in RISC-V systems, and anything else these days is pretty obscure. Note that the timing of when the event happens is formally when the write reaches main memory, not when the writer thread initiates it, but that isn't too much of a problem provided you're also doing real work and have the circular buffer large enough.

PleegWat

@dkf said in Does this circular buffer need a mutex?:

@topspin said in Does this circular buffer need a mutex?:

I'm not sure how much the C memory model differs from the C++ one, but assuming it's not too much, I'm going with volatile is not the right answer for this. You need the correct barriers for synchronization and atomics for, well, atomicity.

The barriers are absolutely critical. And volatile in both C and C++ says "you can't see all that is going on with this, Mr. Compiler, so do what you're told"; in particular, it can't remove, merge or reorder reads or writes. (The barriers stop other work from being moved across that.)

You only need the full power of an atomic if you've got multiple writers of a variable. With a single writer of a particular variable, you don't need a full bus synch and can handle x86/x64 and ARM-A just fine; ARM-M is something else, and normal atomics might be broken entirely on that. I'm not sure what memory models occur in RISC-V systems, and anything else these days is pretty obscure. Note that the timing of when the event happens is formally when the write reaches main memory, not when the writer thread initiates it, but that isn't too much of a problem provided you're also doing real work and have the circular buffer large enough.

You do need to make sure you are using atomic types for start and end. This will ensure read or write operations never see a partially updated field, and that writes to other nearby variables never affect this field. Full-on atomic instructions are not required as long as you only have a single producer and a single consumer.

As an extreme counter example, the following struct (in C) would not be safe, but there are other structs which are less obviously still unsafe.

/* NOT THREAD SAFE */
struct {
    char flag;
    unsigned readidx;
    unsigned writeidx;
    int buffer[LENGTH];
} __attribute((packed));

Note the packed structure causes the read and write indexes to not be aligned to a 4-byte boundary. The exact alignment requirements vary based on hardware architecture and possibly on operating system.

dkf

@PleegWat Yes, you can run into that sort of thing if you play games with packed structs (and you overrun a cache boundary) but you have to work at it to get into trouble there. Alignment requirements for basic writes to be atomic are almost always "(32-bit) word boundary" because word-sized writes are very very common and making them expensive would make the hardware look slower in benchmarks.

As I mentioned above, std::atomic doesn't solve all your problems in embedded code. In particular, ARM-M (very common in embedded!) may not have a standard option for the compiler to ask for atomicity in the bus access or a standard way to implement a lock. (There's a standard ARM module for those things, but it's usually only present on ARM-A cores; ARM-M may well have something entirely custom if they bother at all.) OTOH, on ARM-M you really don't want to use unaligned word accesses at all as they're slow (or just cause a crash; the platform takes alignment seriously).

Zerosquare

I'm very late, but I'd like to address the in the room: why are you rolling your own solution in the first place?

As multiple people mentioned above, on modern architectures, there are hidden gotchas. Unless you're doing low-level embedded stuff, using the FIFO primitives provided by your OS* or language is both easier and less likely to have subtle bugs.

* Windows has plenty of functions to handle that kind of situations. I believe Linux does as well, but I'm less familiar with it.

kazitor

@Zerosquare said in Does this circular buffer need a mutex?:

I'm very late, but I'd like to address the in the room: why are you rolling your own solution in the first place?

Because it’s fast and it’s simple and it’s lightweight and it’s worked faultlessly

Hey, it’s using real-time scheduling policies. Let’s just set affinities and let the scheduler work it out!

cvi

@Zerosquare said in Does this circular buffer need a mutex?:

FIFO primitives provided by your [...] language

*laughcry*

Well, there's always the moodycamel concurrent queue.

But, I don't remember there being a great many "standard" concurrent FIFOs out there. Nothing springs into mind for Win32, except if you want to abuse the message pump mechanism for UI stuff. Neither for POSIX.

dkf

@cvi said in Does this circular buffer need a mutex?:

But, I don't remember there being a great many "standard" concurrent FIFOs out there. Nothing springs into mind for Win32, except if you want to abuse the message pump mechanism for UI stuff. Neither for POSIX.

Pipes, named or otherwise.