Question arising from "Sprite Threading"



  • Ok, I'd actually like to learn something here, as well as have a laugh! :)<Gasp!></Gasp!>

    <Gasp!>I'm a young programmer - I'm not the world's most advanced, but I understand threading and synchronization concepts. Now, here's the question...

    There is an obvious problem with the Sprite Threading WTF (of course I understand that), but I'm wondering at what point we consider that we have "too many threads." I mean, if I have many tasks that can be executed in parallel and don't rely on each other, is there a problem with creating a Worker Thread for each one?  Provided I clean everything up a the end of execution, isn't the only overhead the creation of the thread?</Gasp!>

    The other option I can see in this case would be to use N threads that constantly loop to handle the tasks.  But in this case, I, as the programmer, would have to decide on how many of these threads ("N") there should be.  Isn't this less efficient?  If I use worker threads, surely the OS decides how many can be handled at once and notifies the threads to execute as required.  (To be specific, I'm thinking here about .Net 2.0 and ThreadPool.QueueWorkItem but I think this is a general threading concepts question!)

    Cheers to anyone who can give me a point of view on this.  I find that college often gives a foundation on the technical aspects and not enough on practical, real-world design.

    Aryah
     



  • @Aryah said:

    Ok, I'd actually like to learn something here, as well as have a laugh! :)<Gasp!></Gasp!>

    <Gasp!>I'm a young programmer - I'm not the world's most advanced, but I understand threading and synchronization concepts. Now, here's the question...

    There is an obvious problem with the Sprite Threading WTF (of course I understand that), but I'm wondering at what point we consider that we have "too many threads." I mean, if I have many tasks that can be executed in parallel and don't rely on each other, is there a problem with creating a Worker Thread for each one?  Provided I clean everything up a the end of execution, isn't the only overhead the creation of the thread?</Gasp!>

    The other option I can see in this case would be to use N threads that constantly loop to handle the tasks.  But in this case, I, as the programmer, would have to decide on how many of these threads ("N") there should be.  Isn't this less efficient?  If I use worker threads, surely the OS decides how many can be handled at once and notifies the threads to execute as required.  (To be specific, I'm thinking here about .Net 2.0 and ThreadPool.QueueWorkItem but I think this is a general threading concepts question!)

    Cheers to anyone who can give me a point of view on this.  I find that college often gives a foundation on the technical aspects and not enough on practical, real-world design.

    Aryah
     

    The main problem in this case is that Java threads map to OS threads, which generates expensive contextual switches (switching between threads for execution is expensive) and high memory footprint.

    This is why it's usually a good idea not to use *too many* OS threads, thus you'll usually want to use a threadpool indeed. Yes you do have to decide how many threads should live in your pool, but in the end you'll move to 0 the overhead of creating/destroying OS threads, and will reduce the amount of context switching.

    Any language that maps its threads to OS threads will have this issue and will probably have to use this solution, and this is an implementation issue more than a "general threading" one:

    In other languages such as Erlang you can use threads for everything (Erlang gurus say that the primary mistake of erlang beginners is not using enough erlang processes), but that's because it uses extremely lightweight non-OS processes (an "empty" Erlang process, right at creation before the initialisation of local objects and such, weights around 300 bytes, and the cost of context switching is kept extremely low by the VM) and was built with concurrency in mind.



  • Basically, the major problem with spawning too many threads is synchronization (as mentioned in that Sprite article). I'd say, if you do not have an immediate problem, don't worry about threads. As far as sprite drawing is concerned: There is an event dispatcher thread, which is continuously running. It's supposed to be the spot where all GUI drawing should be done, and all events should be handled. Creating a separate event dispatching thread for each object is like using a funnel, then punching holes in its side to "increase its throughput", and catching the spilling liquid with buckets to pour it through the funnel again. Or something like that.



  • @Aryah said:

    <Gasp!> I mean, if I have many tasks that can be executed in parallel and don't rely on each other, is there a problem with creating a Worker Thread for each one?</Gasp!>

    If you spawn a lot of threads and all threads are concurrently working (really CPU-intensive working), it most likely hurts performance (for all the reasons already stated).

    On the other hand, on a computer with n cores, running n threads most likely makes better utilization of the hardware. If the threads are most of their time just waiting (for user input, network packets, slow hardware etc.), a relatively high number of threads probably doesn't hurt as long as you have enough memory. 



  • Thanks for the responses - these have been great.  In particular, I was thinking that if the threads are waiting on I/O (or doing non-intensive work, as ammoQ said) there wouldn't be much harm in this approach.

    Maybe I'll clarify a particular situation I was thinking of.  I had an application that spawned a large number of worker threads using .Net 2.0's ThreadPool.QueueWorkItem, using the default pool size.  The threads all made a network request and waited on a response.  The main goal here (a requirement) was that the requests are sent as quickly as possible.  I should state that the server was capable of handling multiple inbound and outbound requests and would send responses.

    I created many worker threads because my thinking was that while one thread is waiting on a response, the others can be sending their requests.  All seemed to work fine and I had enough memory for the operations.

    The slight problem (similar to what Skurry mentioned) comes in the thread behaviour after the network reponse is received.  The response must be logged, which involved synchronization on a common object.  The method called to do this was as lightweight as possible, simply storing the response in a data structure and releasing the lock.  (Responses were written to a database later, after the network requests had finished.  I avoided database access until the end as I wanted to maximise the speed of the network requests.)

    Sorry for the long-winded explanation, but I'm wondering if this is a good approach, as you can see how there are a lot of threads in the application.  If not, how should this problem be approached?  In my (perhaps naive) view, this seems to be making good use of the processor.  I suppose I should also mention that it worked, but I'm seeking clarification on the approach that should ideally have been taken or validation of the current approach.

    Thanks again,

    Aryah

    P.S.  Masklinn, thanks for the clarification on Erlang threads.  I have no experience with it but I've looked into it out of my own interest in concurrency - I'm impressed by the concept!
     



     



  • Sounds good to me. Things to consider:

    - That data structure you're using may already be synchronizing access to itself (at least in Java, there are synchronized versions of containers), so you could simplify your code further.

    - Is there a fixed set of requests per use case? What happens when one request times out?



  • Cool, thanks.

    The data structure isn't automatically synchronized, but it does provide an object (SyncRoot) to allow simple synchronization, which I'm using to prevent multiple simultaneous accesses.  I see what you mean - there might be synchronized versions available to me that could simplify things...

    Regarding your second question, there are not a fixed number of requests to be made, but the number of requests to be made is read from the database on a polling basis.  I know a "batch" of requests has finished processing when I have received the same number of responses (and I suspend polling while processing a batch).  I have some pretty extensive exception handling within the network access threads so that I either receive a response, or can indicate the exception and allow the database updating to commence when the network requests complete (exception or no exception).

    Again, thanks very much for the help.  The only remaining problem I have is when to use the method I have used (ThreadPool.QueueWorkItem) and when to have a set number of looping threads?  My current worker threads will "die" straight after execution, meaning there's an overhead in creating the next thread.  However, with the "set number of looping threads" method, the inefficiency is in choosing the number of threads - too low and we aren't as fast as possible, too high and there may be too many threads and context switches!

    :-S

    Aryah



  • The only advice I can give you regarding the size of the thread pool is to measure performance with different settings. Of course, the result will be depending on the hardware the tests are run on, and maybe even the data the tests are run with, so the "sweet spot" will not be universal, but at least give you a clue about the magnitude (e.g. will 100 threads perform better than 10 threads). It also depends on whether your goal is effectiveness (i.e. the highest throughput possible given a certain configuration) or efficiency (highest throughput relative to CPU usage). The latter only matters if you need to worry about other processes running on the target machine.



  • Brilliant, thanks.  So sometimes 100 threads is ok?! :)

    I got worried about my approach when I read about the 50 threads in the WTF being a problem.  Of couse, in that case I think the overall number of threads executing was the least of his problems! :-D

    Cheers again!
     



  • My basic rule of thumb is one thread for everything that is CPU-intensive and one thread for everything that has to wait independently of the others.

    If you have two tasks that each use only the CPU and memory, it will take longer to run them on two concurrent threads than to run one to completion and then run the other.  This is a "worker" thread and you usually have only one.  (You could make an exception for this if each task runs for a very long time and returns useful partial results from time to time.)  You can use a queue to schedule tasks for the worker thread.

    The user input task must wait for the user to do something like press a key or move the mouse.  Usually (for instance in Windows and Java Swing) a single task runs the display and user input.  Each keystroke or mouse movement is packaged into an event that is put into a queue for the UI thread.  This allows you to draw directly from code that processes a mouse event without worrying about synchronizing with the video code.  There is always a way by which any thread can package tasks to be executed on the UI thread.  In Windows this is done with SendMessage() and PostMessage().  In Java Swing this is done with SwingUtilities.InvokeAndWait() and SwingUtilities.InvokeLater().

    Disk I/O spends most of its time waiting, and so it usually requires a separate thread.  If you run it on the UI thread, user input is locked out while the I/O takes place.  If you run it on the worker thread, worker tasks are locked out.  You need to have a way for the UI thread to communicate with the I/O thread so that the user can cancel a task, for instance if he needs to insert a blank disk but is all out of blank disks.

    Network I/O can use one thread for each port.  If there are many ports being handled, there is a technique of waiting on multiple ports that allows you to efficiently use one thread for all of them.

    That is my general guideline, but there are times when a different approach is better.

    A web server can handle hundreds of requests at once.  Each request is independent of the others (with occasional exceptions, as when the requests are updating a database); each request involves some waiting (fetching things from the database); and each request involves a number of steps that can be done one after the other.  The web server will have a pool of hundreds of threads.  Each request is assigned to a single thread.  The program that handles the request is written as a single-threaded task (but aware that it is sharing resources with other single-threaded tasks).

     



  • @Aryah said:

    Brilliant, thanks.  So sometimes 100 threads is ok?! :)

    I got worried about my approach when I read about the 50 threads in the WTF being a problem.  Of couse, in that case I think the overall number of threads executing was the least of his problems! :-D

    Cheers again!
     

    Sure, sometimes it's okay.  Odds that you will actually encounter that case...  not so high.

    Edit: That's assuming today's widely available consumer hardware, and most server hardware.



  • Ignoring for the moment languages with weird-arse threading semantics like Erlang, here's a good rule of thumb to stick to:

    Use no more than one thread unless you've got a really, really good reason. Only create a new thread if it would be seriously inconvenient to do otherwise.

    Most programs that you are likely to write have absolutely no need for threading, and it will merely complicate the design and increase the resource consumption of the program if you attempt to use them. Threads get a lot of air time in the computer science community because threading creates interesting new problems in analysing program behaviour. In practice, very few applications have any use for them. Far too many applications use them for no good reason (probably because they're seen as "hardcore" and "enterprisey"). Threading without a good reason is just a whole pile of new race condition and deadlock bugs for you to deal with.

    (If you're using Java and Swing, you're stuck with threading because Swing utterly sucks - please stop using it. Any decent GUI toolkit should not require you to mess with threads yourself, although it may use a thread internally)



  • @newfweiler said:

    Disk I/O spends most of its time waiting

    You're doing it wrong. Every meaningful platform supports asynchronous IO, which does not block the caller. Putting synchronous IO calls into separate threads is abysmally slow compared to using the operating system's own asynchronous IO mechanism (and in Java it's just a joke - somewhere between 10 and 100 times slower under some loads).

     

    Network I/O can use one thread for each port.  If there are many ports being handled, there is a technique of waiting on multiple ports that allows you to efficiently use one thread for all of them.

    IO is IO, there's no difference between the disk and the network. The whole lot should be run asynchronously (possibly with one thread for event dispatch, but only if it can't be packed into the main thread for some reason). 

     

    A web server can handle hundreds of requests at once.  Each request is independent of the others (with occasional exceptions, as when the requests are updating a database); each request involves some waiting (fetching things from the database); and each request involves a number of steps that can be done one after the other.  The web server will have a pool of hundreds of threads.  Each request is assigned to a single thread.  The program that handles the request is written as a single-threaded task (but aware that it is sharing resources with other single-threaded tasks).

    That tends to make the server crawl due to rapid thrashing between hundreds of threads (a lot of J2EE stuff takes this approach and ends up using far more hardware than it ought to because of it - it's enterprisey but hugely wasteful).

    Systems like this, which take the form of a pile of tasks that follow the "read a little data... do a little work... write a little data... wait for more data" pattern, are far more efficiently built using a coroutine/task model instead of threading.

    The ideal number of threads for any given program is equal to the number of CPUs it is running on. If you're running hundreds of threads, and you don't have hundreds of CPUs, you're doing something wrong. (It's okay to have a few more threads than CPUs, but you shouldn't be off by several orders of magnitude)



  • asuffield, I can see that thread creation and switching overheads are a problem with a very high number of threads and I appreciate your comments.

    But in my case (mentioned above), I am required to make network requests in very quick succession, as fast as possible.  All my threads do is make the request, wait for a response and store the response in a data structure.  I'm trying to minimise the time between network requests by ensuring that while one thread is waiting for a response, another is sending a request.

    What I'm trying to say is, if this approach is not the best, then what alternatives should I investigate?
     



  • @Aryah said:

    I can see that thread creation and switching overheads are a problem with a very high number of threads

    Thread creation is actually very cheap (on a properly implemented operating system). Creating many threads is not intrinsically a problem. The problem is when, at any given time, the number of threads in existence is significantly larger than the number of CPUs. Designs which fail to limit the number of threads to something similar to the number of CPUs are therefore flawed - in this case, the problem originates from the fact that the number of requests to be made has got nothing to do with the number of CPUs, so "one thread per request" can't be right. That works as a general test: figure out an expression that describes how many threads will be running, and if the number of CPUs is not a factor in that expression, you've gone wrong somewhere.

     

    But in my case (mentioned above), I am required to make network requests in very quick succession, as fast as possible.  All my threads do is make the request, wait for a response and store the response in a data structure.  I'm trying to minimise the time between network requests by ensuring that while one thread is waiting for a response, another is sending a request.

    What I'm trying to say is, if this approach is not the best, then what alternatives should I investigate?

    This one is amenable to a task-oriented design, that's probably the best way.

    Each 'task' has three elements: send_request, receive_response, and store_response. The objective is to run this sequence as efficiently as possible:

    send_request()
    res = receive_response()
    store_response(res)
    

    We can model this task directly as data, rather than as a thread. Consider this loop:

    while(1) {
      task = get_next_ready_response()
      res = task->receive_response()
      task->store_response(res)
    }
    

    Filling in the other parts should be straightforward. The interesting function here is get_next_ready_response(): it somehow figures out which of the currently running tasks is ready to receive a response, and returns one of them. Exactly how you do this depends on the specifics of what your task is waiting for - but for network sockets, you're going to want something based on the operating system's event-driven IO system. I have no idea how that works in .NET, but in C it could be a select() or poll() call, and in java it would be java.nio.channels.Selector. Something along these lines:

    while (1) {
      poll_for_ready_fds()
      for each ready fd:
        task = get_task_for_fd(fd)
        task->read_data_into_buffer()
        if (task->is_response_complete()) return task
    

    This sort of structure is called an 'event loop', and it's the correct way to implement most non-trivial networking code - any system that deals with asynchronous messages coming in over the network should have one of these somewhere.



  • @asuffield said:

    That works as a general test: figure out an expression that describes how many threads will be running, and if the number of CPUs is not a factor in that expression, you've gone wrong somewhere.

    IMO, this is a too general rule. Sounds like "premature optimization". If a program for a certain task can be elegantely written by using exactly 6 threads, the overhead might be negligible even on a single-CPU machine. For example, a emulator for a game console might want to use one thread for the CPU simulation, one for the GPU, one for sound, one for the CD-Rom, one for the input devices and one as the main thread that controlls all that stuff. Maybe it can all be done in one thread, but taht would make it all much more complicated...



  • Doesn't using that model mean that I must wait for a response before sending the next request?  Or are each of these tasks threads, yeah?  (Apologies if that's a silly question - I'm off to investigate task-oriented design).



  • @Aryah said:

    Doesn't using that model mean that I must wait for a response before sending the next request?  Or are each of these tasks threads, yeah? 

    There's just one thread here. You send off requests whenever you like, inform the event loop about them (so that it knows to monitor their network socket), and then forget about them. The event loop kicks off the response handler when some data arrives.

    Normally, you either send all your requests at the start (if there's a reasonably small number of them), or you maintain a count of "active tasks" and send off one new request every time a task completes - much like a thread worker pool, only without any threads. 



  • @ammoQ said:

    @asuffield said:

    That works as a general test: figure out an expression that describes how many threads will be running, and if the number of CPUs is not a factor in that expression, you've gone wrong somewhere.

    If a program for a certain task can be elegantely written by using exactly 6 threads, the overhead might be negligible even on a single-CPU machine.

    There's exceptions to every rule - but to have an exception to this particular rule, you have to really understand what is going on and why having too many threads is a problem. Anybody who knows that much wouldn't need to consider rules like this, they'd get it right from the outset.

     

    For example, a emulator for a game console might want to use one thread for the CPU simulation, one for the GPU, one for sound, one for the CD-Rom, one for the input devices and one as the main thread that controlls all that stuff. Maybe it can all be done in one thread, but taht would make it all much more complicated...

    More complicated perhaps, but almost certainly worth it in this scenario. Console emulators typically need to run as fast as they can (unless we're talking about an NES emulator or something), and several of those threads would be running flat out - they'd be running on the host CPU pretty much all the time. Probably the worst thing you can possibly do with threads is to have N of them running like that, where N is larger than the number of host CPUs. The performance penalty for doing that is atrocious, as the CPU constantly thrashes between them. You can get away with having more threads than CPUs only when some of the threads don't need to use a CPU all the time.

    Games and scientific work are the two areas where this becomes really, really important to get right. They will usually go out of their way to ensure that the number of threads is equal to the number of CPUs available (the optimal configuration), no matter how complicated that is to implement.

    (I'm not sure if there are any hardware emulators that use threading for any purpose other than emulating SMP machines on SMP host machines - but I've never heard of anybody finding a way to do it. Threading is usually a poor approach to hardware emulation, because of timing and scheduling issues; an emulator usually has to do its own scheduling, rather than relying on the host kernel to do it)



  • @asuffield said:

    For example, a emulator for a game console might want to use one thread for the CPU simulation, one for the GPU, one for sound, one for the CD-Rom, one for the input devices and one as the main thread that controlls all that stuff. Maybe it can all be done in one thread, but taht would make it all much more complicated...

    More complicated perhaps, but almost certainly worth it in this scenario. Console emulators typically need to run as fast as they can (unless we're talking about an NES emulator or something), and several of those threads would be running flat out - they'd be running on the host CPU pretty much all the time. Probably the worst thing you can possibly do with threads is to have N of them running like that, where N is larger than the number of host CPUs. The performance penalty for doing that is atrocious, as the CPU constantly thrashes between them. You can get away with having more threads than CPUs only when some of the threads don't need to use a CPU all the time.

    Games and scientific work are the two areas where this becomes really, really important to get right. They will usually go out of their way to ensure that the number of threads is equal to the number of CPUs available (the optimal configuration), no matter how complicated that is to implement.

    (I'm not sure if there are any hardware emulators that use threading for any purpose other than emulating SMP machines on SMP host machines - but I've never heard of anybody finding a way to do it. Threading is usually a poor approach to hardware emulation, because of timing and scheduling issues; an emulator usually has to do its own scheduling, rather than relying on the host kernel to do it)

    AFAIK, playstation emulators (i.e. ePSXe) resp. their plugins are multithreaded.



  • I'm surprised nobody came with the story slashdot reported recently

    http://it.slashdot.org/article.pl?sid=07/03/22/1218238 

    There is a video of a demo on playstation 3, where each chicken has its thread. Unfortunately the file is quite slow to download and the recording is terrible.
     Still, the rooster simulation is quite funny.



  • The sprite threading was maybe a bad example.

    Sprite engines really shouldn't be multi-threaded anyway. Why? Because if each sprite corresponds to a thread, then each thread has to cooridinate with a large slice of the other threads, asking "Have I collided with or occluded you?". Whether that's handled with sockets or with shared data structures, mutexes and semaphores, that's a lot of conditional logic to run through, data structures to trundle through, cache to evict, and potentially; context switches to incur.

     But a bit of code that uses (any) well-known collision algorithm that walks the entire tree or uses scanlines or whatever can be chosen to run efficiently over the in-memory representation of your sprite world.

    If you want, you can group sprites into thread owner collections that correspond to regions of the screen (especially if they belong to seperate viewports); that makes sense since those sprites won't need to send messages to each other (or will do so rarely). That is a parallel task that is suitable for multithreading.

    There is no hard and fast rule.

    You must look at your problem, and decide: how much communication overhead am I incurring for needlessly abstracting my problem? How can I use procedural or dynamic programming techniques to make the solution scale better? Where is my problem very parallel and how do I architect my solution to take advantage of that?

    You have to ask those questions no matter what kind of framework or programming language you use. 



  • When optimizing for processing speed, the optimal number of threads is the number of cpus.

    When optimizing for turnaround time in IO control, the optimal number of threads is the number of independent external data streams.

    If you are not tuning for either of these, don't bother. 



  • Don't spawn a bunch of threads!  I have seen cases of someone writing "one thread per sprite".  That's not good.  It works ok when you have a dozen sprites but it becomes non-working when you have more than that.  Remember when threads need to pass information to each other, that often means that separate CPUs have to empty their caches, negating any performance benefit.

    Use one thread to move all your sprites, one thread to access net, etc.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.