Galaxy Zoo



  • If only he weren't so honest about it. But still a good WTF... [link]http://arfon.org/confessions_of_a_zoonometer_addict/[/link]



  • At least two WTFs here

     First: there is no locking around the method.  If you have a global object (which the counter is) and multiple threads (which you do, because this is a web app), then you need locking.  Without locking, lots of bad things can happen, including, in this case, multiple threads updating the counter at the same time.  You're almost certain to end up with a gummed-up value.

    Second: the MySQL query takes 6 seconds to query the database.  That is an eternally long time to fetch a single value from a single record.
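The first point - unsynchronized updates to a shared counter - can be sketched in a few lines. This is a toy illustration, not the Galaxy Zoo code: `ClickCounter` and `hammer` are made-up names, and the "unsafe" variant spells out the read-modify-write sequence that a bare `count += 1` hides.

```python
import threading

class ClickCounter:
    """Toy in-memory counter shared across request-handling threads."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write with no lock: two threads can both read the
        # same value and both write back value + 1, losing one click.
        current = self.count
        self.count = current + 1

    def increment_safe(self):
        # Holding the lock makes the read-modify-write atomic.
        with self._lock:
            self.count += 1

def hammer(counter, method, n_per_thread, n_threads=8):
    """Run `method` n_per_thread times on each of n_threads threads."""
    threads = [
        threading.Thread(target=lambda: [method() for _ in range(n_per_thread)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count
```

With the lock, 8 threads doing 1,000 increments each always land on exactly 8,000; without it, updates can silently vanish under load.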




  • Am I the only one who doesn't think every mundane bug is worth posting as a WTF? Maybe it's a good sign that no one can find WTF-worthy code anymore.

    If I were to dig up some code where a buffer overflow possibly occurs when a monkey dances on the keyboard, someone will invariably reply with "ZOMG, wut a n00b, buffer overflow llolz WTF".

    No, it's a BUG, not a WTF. Code contains bugs, people forget synchronization, threads might deadlock, null pointers are dereferenced. There are real WTFs out there that need to be exposed, and not drowned out in non-WTFs. The boy who cried wolf comes to mind.

    Also for the literacy-impaired:

    "Without really thinking I decided that rather than count the total number of clicks each time we wanted to update the Zoonometer™ (a MySQL query that takes about 6 seconds)"

    means that it takes 6 seconds to grab every single row and sum up the clicks, and because of that, a global counter was used instead. 6 seconds to sum a multi-million-entry column seems pretty reasonable IMHO, but I'll be the first to admit I'm wrong if someone shows me a database table with more than a million rows which can do better. Either way, it was a non-WTF since the article explicitly mentions the (possibly overestimated) 6 seconds to re-fetch the entire table every update would be too long, so that a separate counter is used instead.



  • @dogbrags: There does not need to be locking in the code.  The database will handle it - assuming the query is written correctly.  A lock in the code would only cover threads ON THE SAME SERVER.  At the number of transactions being reported, you can bet they have >1 web server...

    Secondly, it sounds like the 6 seconds is to query a large table and add up the actual clicks - he was trying to make that query more efficient by storing a single number (and failed).  Six seconds is not unreasonable when querying a very large table.

    Note: A more correct way to handle it would be:

    BEGIN TRANSACTION;
    UPDATE rec SET counter = counter + 1;
    COMMIT;

    This is assuming that the counter can be updated outside of any other transaction occurring... I'm sure that they brought the old value into memory and incremented it, then saved, rather than just letting the db do its job.  The above SQL should be enclosed in a retry block in case the update times out due to excessive load.  I believe that MySQL also does implicit transactions by default, so the BEGIN/COMMIT are not strictly necessary ... but should be included for completeness (and just in case someone changes the default configuration of your MySQL instance).
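The difference between the two approaches can be sketched with SQLite standing in for MySQL (the `rec` table name comes from the snippet above; everything else is an assumption for illustration):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can issue
# BEGIN/COMMIT ourselves, mirroring the explicit transaction above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE rec (counter INTEGER NOT NULL)")
conn.execute("INSERT INTO rec (counter) VALUES (0)")

def bump_in_database(conn):
    # The database evaluates counter = counter + 1 itself, so the
    # read and write happen atomically within one statement.
    conn.execute("BEGIN")
    conn.execute("UPDATE rec SET counter = counter + 1")
    conn.execute("COMMIT")

def bump_in_memory(conn):
    # The anti-pattern suspected above: pull the value into the
    # application, add one there, and write it back -- two concurrent
    # writers can read the same value and lose an increment.
    (value,) = conn.execute("SELECT counter FROM rec").fetchone()
    conn.execute("UPDATE rec SET counter = ?", (value + 1,))

for _ in range(5):
    bump_in_database(conn)
```

The in-database form stays correct under concurrency without any application-level lock; the in-memory form needs external synchronization to avoid lost updates.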

     



  • @Whoa314 said:

    Also for the literacy-impaired:

    "Without really thinking I decided that rather than count the total number of clicks each time we wanted to update the Zoonometer™ (a MySQL query that takes about 6 seconds)"

    means that it takes 6 seconds to grab every single row and sum up the clicks, and because of that, a global counter was used instead. 6 seconds to sum a multi-million-entry column seems pretty reasonable IMHO, but I'll be the first to admit I'm wrong if someone shows me a database table with more than a million rows which can do better. Either way, it was a non-WTF since the article explicitly mentions the (possibly overestimated) 6 seconds to re-fetch the entire table every update would be too long, so that a separate counter is used instead.

    I would assume that you would insert one row per click, in which case you would only need a SELECT COUNT(*) FROM clicks, which for me executes in 0.0006 seconds with 2,617,570 rows.

    (Just for the curious, SUMming 2,617,570 rows takes 1.7281 seconds for me)

    ((Admittedly those were the only queries running at the time, making the timings pretty artificial))
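The one-row-per-click schema brightemo describes is easy to sketch with SQLite (the `clicks` table and `n_clicks` column are assumed names, and the row count here is tiny compared to the real thing):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (id INTEGER PRIMARY KEY, n_clicks INTEGER)")
conn.executemany(
    "INSERT INTO clicks (n_clicks) VALUES (?)",
    ((1,) for _ in range(100_000)),
)

# COUNT(*) only has to count rows...
(row_count,) = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()

# ...while SUM must read the column value out of every row, which is
# why it is noticeably slower than COUNT(*) on large tables.
(total,) = conn.execute("SELECT SUM(n_clicks) FROM clicks").fetchone()
```

This matches brightemo's timings in spirit: counting rows is cheaper than summing a column, and with one row per click the two answer the same question.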



  • @brightemo said:

    I would assume that you would insert one row per click, in which case you would only need a SELECT COUNT(*) FROM `clicks`, which for me executes in 0.0006 seconds with 2,617,570 rows.

    (Just for interested, SUMing 2,617,570 rows takes 1.7281 seconds for me)

    Now try that with a criterion.  The database keeps a counter of the number of rows in each table, so no matter how big the table gets, the count(*) will be returned in O(1) time.  However, they probably have other rows in the db which should not be counted - e.g. maybe from their first test.  So they're counting all rows where insert date > "some date".

    See how long that takes ya!  (And it's no fair cheating and using cached results.  Make sure you change the "some date" each time, or restart your db between timings.)
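The criterion case hinges on an index, which can be shown with a small SQLite sketch (the `clicks` table and `clicked_at` column are assumed names; integers stand in for timestamps):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (id INTEGER PRIMARY KEY, clicked_at INTEGER)")
conn.executemany(
    "INSERT INTO clicks (clicked_at) VALUES (?)",
    ((t,) for t in range(200_000)),
)
# Without this index, the WHERE clause forces a full table scan;
# with it, the count is answered from the index range alone.
conn.execute("CREATE INDEX idx_clicks_time ON clicks (clicked_at)")

start = time.perf_counter()
(n,) = conn.execute(
    "SELECT COUNT(*) FROM clicks WHERE clicked_at > ?", (100_000,)
).fetchone()
elapsed = time.perf_counter() - start
```

So a filtered count isn't automatically slow - it depends on whether the filter column is indexed, which is exactly what the replies below report.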



  •  I would not put transactions around the database update; at least not if I'm incrementing the count one at a time.  If you do that, you serialize your updates (meaning only one thread at a time can update the database).  In this context, an update occurs on each postback.  You're limiting yourself to responding to clicks only as fast as the database can update. 

    You're fine as long as each database update takes less time than the interval between postbacks; for this site the original poster said 30-40 clicks per second.  So each update needs to finish in under 1/40th of a second, or you can't process all the clicks and your system bogs down.  Granted, for most databases this is just fine (1/40th of a second is a LONG time for a computer)

     Still, I'd rather see some code like this: count the clicks in a variable in memory; every X clicks, update the database.  If X = 40, then you have an update every second; close enough to real-time that no one would notice.  But if you're updating a variable in memory, then you need some sort of protection around that variable, because your web server is processing many postbacks in parallel.  So locking the in-memory variable is required.

     If you have just one server, then the database update occurs only once every X clicks; and that can occur on only one thread at a time (since the variable is locked during update) so no database transaction is required.  If there are multiple servers, then each could update the database and so you'd want transactions.

     Why do you want to know about individual clicks?  

    Yes, I'd agree this is not a "WTF" because when you read the blog post, the reaction is "rolling eyes" rather than "LOL", but still worthy of a little discussion. 
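dogbrags' batching scheme can be sketched in a few lines. All names here are made up for illustration, and the `flush` callable stands in for the actual database UPDATE:

```python
import threading

class BatchedClickCounter:
    """Count clicks in memory and persist every `batch_size` clicks,
    as dogbrags describes: one DB write per batch, not per click."""

    def __init__(self, flush, batch_size=40):
        self._flush = flush            # callable(n) -> persists n clicks
        self._batch_size = batch_size
        self._pending = 0
        self._lock = threading.Lock()  # postbacks are handled in parallel

    def click(self):
        flush_now = 0
        with self._lock:
            self._pending += 1
            if self._pending >= self._batch_size:
                # Hand the full batch to exactly one thread and reset.
                flush_now, self._pending = self._pending, 0
        if flush_now:
            # Only one thread carries a given batch out of the locked
            # section, so on a single server no DB transaction is needed.
            self._flush(flush_now)
```

The lock is held only for the in-memory bookkeeping; the (slow) database write happens outside it, so clicks keep flowing while a flush is in progress.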



  • @dogbrags said:

     I would not put transactions around the database update; at least not if I'm incrementing the count one at a time.  If you do that, you serialize your updates (meaning only one thread at a time can update the database).  In this context, an update occurs on each postback.  You're limiting yourself to responding to clicks only as fast as the database can update. 

    Of course you are limited to that speed - I never said this was the most efficient method, but 40/sec should be well within reach of any database system.  It all depends upon how accurate your count has to be - in this case your solution (below) would be appropriate if you didn't have to ABSOLUTELY make sure that all clicks are counted (e.g. system crash/power loss).

    If they've got problems updating a counter - then whatever other updates they're doing in real-time probably can't keep up either and they've got far bigger issues.

    @dogbrags said:

    You're fine as long as each database update takes less time than the interval between postbacks; for this site the original poster said 30-40 clicks per second.  So each update needs to finish in under 1/40th of a second, or you can't process all the clicks and your system bogs down.  Granted, for most databases this is just fine (1/40th of a second is a LONG time for a computer)

     Still, I'd rather see some code like this: count the clicks in a variable in memory; every X clicks, update the database.  If X = 40, then you have an update every second; close enough to real-time that no one would notice.  But if you're updating a variable in memory, then you need some sort of protection around that variable, because your web server is processing many postbacks in parallel.  So locking the in-memory variable is required.

     If you have just one server, then the database update occurs only once every X clicks; and that can occur on only one thread at a time (since the variable is locked during update) so no database transaction is required.  If there are multiple servers, then each could update the database and so you'd want transactions.

     Why do you want to know about individual clicks?  

    Yes, I'd agree this is not a "WTF" because when you read the blog post, the reaction is "rolling eyes" rather than "LOL", but still worthy of a little discussion. 

    I have no idea why they'd want individual clicks, or even what those clicks actually represent.  Obviously some task is being performed for identifying galaxies.  Maybe they're trying to avoid the heat death of the universe, like in the Doctor Who episode: http://tardis.wikia.com/wiki/Entropy



  • @Auction_God said:

    Now try that with a criterion.  The database keeps a counter of the number of rows in each table, so no matter how big the table gets, the count(*) will be returned in O(1) time.  However, they probably have other rows in the db which should not be counted - e.g. maybe from their first test.  So they're counting all rows where insert date > "some date".

    See how long that takes ya!  (And it's no fair cheating and using cached results.  Make sure you change the "some date" each time, or restart your db between timings.)

    ~0.05 seconds with an indexed integer column for the criteria.

    Besides, if they did have rows from other tests, it would simply be a matter of recording the count before starting the new test, and doing a subtraction from COUNT(*). (This of course would not work when you have multiple concurrent tests).



  • @brightemo said:

    @Auction_God said:
    Now try that with a criterion.  The database keeps a counter of the number of rows in each table, so no matter how big the table gets, the count(*) will be returned in O(1) time.  However, they probably have other rows in the db which should not be counted - e.g. maybe from their first test.  So they're counting all rows where insert date > "some date".

    See how long that takes ya!  (And it's no fair cheating and using cached results.  Make sure you change the "some date" each time, or restart your db between timings.)

    ~0.05 seconds with an indexed integer column for the criteria.

    Besides, if they did have rows from other tests, it would simply be a matter of recording the count before starting the new test, and doing a subtraction from COUNT(*). (This of course would not work when you have multiple concurrent tests).



    First, .05 seconds would already fail: around 40 clicks per second came in, and 40 queries would take 2 seconds given your .05-second benchmark. Secondly, there's no way it's as simple as selecting on an integer-column criterion; the real, functional database would likely involve an "AND date > $startdate" criterion, not to mention possibly requiring a table join between a list of unique "galaxy pictures" and a list of "categorizations" which include "galaxy_id"s and "user_id"s, etc.

    In any case, if a new row were added for every click in the approximately 92 weeks between July 2007 and April 2009, the table would have 92 million rows. It's also cheating if your table only has a single column with that integer column as its data; the size of records is important, because the file size and layout on disk determines the rate of page misses and disk accesses are SLOW. With more columns, the desired data may end up more fragmented, and memory locality will suck. Even a very nice server can only cache the top couple levels of the database B-Tree, and there is a sharp performance hit once you need to traverse beyond the cached levels.

    Granted, none of this is really relevant to the article since the author simply throws out the idea of recounting from the database as how not to do it. I guess the nuances of database performance are relevant in the grand scheme of things, though.



  • @Auction_God said:

    Now try that with a criterion.  The database keeps a counter of the number of rows in each table, so no matter how big the table gets, the count(*) will be returned in O(1) time.  However, they probably have other rows in the db which should not be counted - e.g. maybe from their first test.  So they're counting all rows where insert date > "some date".

    SELECT COUNT(*) FROM Products.Items
    WHERE ChangedOn > '2009-03-15' AND ChangedOn < '2009-04-01'
    -----------
    93232

    It takes 2 ms to run this query. Total row count is 1,409,172.

    @Auction_God said:

    See how long that takes ya!  (And it's no fair cheating and using cached results.  Make sure you change the "some date" each time, or restart your db between timings.)

    Why would you change the date? The timeframe of the submissions was clearly defined. Also, why would anyone restart the DB in a production environment?

    Now, granted, the database probably contains more than a million rows (more like 100 million), but even if a query takes 6 seconds you could still use it and cache the results for a short time. This would be reasonable IMHO.
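The "cache the results for a short time" idea is a few lines of code. This is a hedged sketch - `CachedCount` and its parameters are invented names, and the callable stands in for the slow COUNT query:

```python
import time

class CachedCount:
    """Serve a cached total and rerun the expensive COUNT query at most
    once per `ttl` seconds, as suggested above."""

    def __init__(self, query, ttl=10.0):
        self._query = query      # callable returning the current count
        self._ttl = ttl
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self._query()   # the ~6 second query runs here
            self._expires = now + self._ttl
        return self._value
```

Even a 10-second TTL means the heavy query runs six times a minute regardless of traffic, while every page view still sees a total that is at most a few seconds stale.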

