MySQL cronjob optimization



  • Oh hey guys, I heard some of you know stuff about databases. I have a few questions.

Currently, I have a cronjob running every ten minutes that extracts a list of names from an HTML page and "reverse engineers" when these names appeared on or vanished from that list.

    First of all, my tables:

    characters

    • columns id, name, last_seen
    • indices id (primary), name (unique)
• has one row for every name the job has come across so far, plus the last time it saw that name.

      This table currently has 6352 rows (running since ~October)

sessions

• columns id, character_id, start_time, end_time
• index id (primary)
• has one row for every contiguous period during which the cronjob found the character's name in the HTML page.

      This table has 61209 rows so far.

    I have two questions / concerns.

First, sessions that haven't ended yet are identified by end_time=NULL. So I have many calls to ... WHERE end_time IS NULL, and I have no index on end_time. An index sounds like overkill since I only need the NULL entries quickly available. So I wondered, is there a better way? My only idea would be to add another (boolean) column session_ended and create an index on that. Also, since NULL requires special WHERE clause syntax, it crossed my mind that NULL values might already be optimized for this automatically, but I couldn't find anything about it...
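If an index does turn out to be the way to go after all, I guess it would just be this (untested sketch; idx_end_time is a name I made up):

    ALTER TABLE sessions ADD INDEX idx_end_time (end_time);

    -- EXPLAIN should then show the IS NULL query using idx_end_time instead of a full scan:
    EXPLAIN SELECT id, character_id, start_time FROM sessions WHERE end_time IS NULL;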

The second concern is my cronjob. Most of the looping is done in PHP. I wondered if there's a better way. Yay for pseudocode:

$start_session = $db->prepare('INSERT INTO sessions SET character_id=:id, start_time=:time');
$find_running_sessions = $db->prepare('SELECT s.id AS session_id, c.id AS character_id, c.name FROM sessions AS s JOIN characters AS c ON c.id = s.character_id WHERE s.end_time IS NULL');
$end_session = $db->prepare('UPDATE sessions SET end_time=:time WHERE id=:id');
$end_open_sessions = $db->prepare('UPDATE sessions SET end_time=? WHERE end_time IS NULL');
// VALUES(last_seen) avoids binding :seen twice (PDO native prepares reject duplicate
// named placeholders), and id=LAST_INSERT_ID(id) makes lastInsertId() return the
// existing row's id when the name is already in the table.
$add_character = $db->prepare('INSERT INTO characters SET name=:name, last_seen=:seen ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), last_seen=VALUES(last_seen)');
$update_character = $db->prepare('UPDATE characters SET last_seen=:seen WHERE id=:id');
$time = time();
// $html = file_get_contents() done before connecting to DB
if ($html) {
    $list = extract_name_list_from_html($html);
    $list = array_flip($list); // flip so the names become keys: array_key_exists() below is O(1)
    $db->beginTransaction();
    try {
        $find_running_sessions->execute();
        // Loop through every session that has not ended
        foreach ($find_running_sessions->fetchAll() as $row) {
            list($session_id, $character_id, $name) = $row;
            // If the name is not in the list any more, end the session
            if (!array_key_exists($name, $list)) {
                $update_character->execute(array(':id' => $character_id, ':seen' => $time));
                $end_session->execute(array(':id' => $session_id, ':time' => $time));
            // Else we know at least that the... oh wait, this is a small bug. But it doesn't matter now...
            } else {
                unset($list[$name]);
            }
        }
        // Now, start a new session for every character left in the list
        // (and add the character to the DB if it isn't in there yet)
        foreach (array_keys($list) as $name) {
            // Yeeeaaaah, surely TRWTF is the regex. (This does some late checking that the HTML
            // was really parsed right / contained the expected content... definitely misplaced,
            // but not the point (I think))
            if (!preg_match('/^[A-Z][a-z]*$/', $name)) continue;
            $add_character->execute(array(':name' => $name, ':seen' => $time));
            $character_id = $db->lastInsertId();
            $start_session->execute(array(':id' => $character_id, ':time' => $time));
        }
        $db->commit();
    } catch (Exception $e) {
        $db->rollBack();
    }
} else {
    // Occurs when the HTML page couldn't be downloaded.
    // (assuming server unreachable => all characters offline)
    $end_open_sessions->execute(array($time)); // execute() takes an array, even for a single parameter
}


So... I guess something must be wrong with this. But I've tested SELECT * FROM sessions WHERE end_time IS NULL, and it gave me a result rather quickly. Sooo... no idea. So I thought I'd ask you guys :D Thanks in advance for any help.


(There might be some WTFs in there, but hey, I'm a student, and not even a CompSci student; also this is not something I do for school OR money (OR the FOSS community), so they won't count as WTFs, right?)



  • For 61k rows? An index isn't going to do shit. Don't worry about it until you actually have problems... maybe when you hit 61m rows.

    Your PHP looks fine to me, but I'm no expert in PHP.



  • @blakeyrat said:

    Don't worry about it until you actually have problems...

Well, I was once in phpMyAdmin while the cronjob must have been running, and it didn't load any pages until the cronjob was done. But that might be because, at the time, the script connected to MySQL before fetching the HTML page, and maybe their server was slow and mine has a limit of 1 connection. Thanks for your estimate!



  • @derula said:

    @blakeyrat said:
    Don't worry about it until you actually have problems...

Well, I was once in phpMyAdmin while the cronjob must have been running, and it didn't load any pages until the cronjob was done. But that might be because, at the time, the script connected to MySQL before fetching the HTML page, and maybe their server was slow and mine has a limit of 1 connection. Thanks for your estimate!

How long does the process take? That transaction is probably just locking all your tables. Not an issue unless you need concurrency for some reason, and considering what this script does, I highly doubt you do.
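If you want to see the blocking as it happens, plain old SHOW PROCESSLIST while the job runs is enough (no extra tooling assumed; blocked connections show up as rows stuck in a waiting state until the transaction commits):

    SHOW FULL PROCESSLIST;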



  • @blakeyrat said:

    @derula said:
    @blakeyrat said:
    Don't worry about it until you actually have problems...

Well, I was once in phpMyAdmin while the cronjob must have been running, and it didn't load any pages until the cronjob was done. But that might be because, at the time, the script connected to MySQL before fetching the HTML page, and maybe their server was slow and mine has a limit of 1 connection. Thanks for your estimate!

How long does the process take? That transaction is probably just locking all your tables. Not an issue unless you need concurrency for some reason, and considering what this script does, I highly doubt you do.

Well, I haven't checked how long it really is. I'd estimate somewhat over 30 seconds? There is some use of concurrency, though I wouldn't describe it as a need. Apart from the cronjob, the script also generates small images saying whether a specified character is currently online, which people can put in their forum signatures (guess it'll look funny if 5 sigs simultaneously refuse to show up for 30 seconds xD). Also, you can view a list of active sessions. But hmm, it's really more of a useless toy. I guess the greatest use of fixing it is that I'll have learned how to do it right. Like a future investment. Or something.

I was also wondering if a pure SQL version of the cronjob could bring a significant improvement (of course I don't mean a literal translation, but something that takes advantage of SQL features that are currently worked around in PHP). A rough sketch of what I'm imagining is below.
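Something like this, maybe (completely untested sketch; it assumes a scratch table current_names(name) that the PHP part would refill from the freshly parsed list on every run, plus the same :time parameter as before):

    -- 1. End open sessions whose character is no longer on the page:
    UPDATE sessions AS s
    JOIN characters AS c ON c.id = s.character_id
    LEFT JOIN current_names AS n ON n.name = c.name
    SET s.end_time = :time
    WHERE s.end_time IS NULL AND n.name IS NULL;

    -- 2. Upsert every name currently on the page:
    INSERT INTO characters (name, last_seen)
    SELECT name, :time FROM current_names
    ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);

    -- 3. Open a session for every listed character without a running one:
    INSERT INTO sessions (character_id, start_time)
    SELECT c.id, :time
    FROM characters AS c
    JOIN current_names AS n ON n.name = c.name
    LEFT JOIN sessions AS s ON s.character_id = c.id AND s.end_time IS NULL
    WHERE s.id IS NULL;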



  • Ok, I've changed my mind. If that takes over 30 seconds, something is wrong.



• Well... it was also only a one-time observation, really; I didn't check if it's reproducible. I just thought that maybe it's something really obvious that a more experienced person would notice immediately. I'm going to do some local testing.



  • Definitely be sure whether the problem is the query, the server, or the PHP. Don't guess. Dump timers everywhere.

I've been in situations where I optimized the string-building bit of the script, others where I optimized the query, and one where we just threw better hardware at it.



• No speed problems in testing... (it does take a fraction of a second to download the HTML, but actual database access seems instant)

Couldn't this have caused the problem?

1. Cronjob connects to the database
2. Cronjob starts downloading the external HTML
3. External server is busy / lazy or just crashed (it does have a habit of doing that; however, then it would have given me an empty list, I guess) and takes half a minute to respond
4. During this time, the database (or maybe the PHP script execution*) is locked to the cronjob
5. Other script no happy.

* hmmm, that would then not be fixable. Not on my shared webhost, at least.

Never mind, I'm happy enough that my code doesn't appear to be especially wrong.

(Though the local testing showed me a really dumb mistake. Why does PHP coerce previously unused variables to an "empty" value of whatever type seems appropriate while only giving me a notice? I think that should at least be a warning... yeah, I know the real WTF is me not testing with notices enabled xD)



  • @dhromed said:

I've been in situations where I optimized the string-building bit of the script, others where I optimized the query, and one where we just threw better hardware at it.

    I thought the latter would always work! Also, next time consider dropping a zero from the speed-up loop.



  • @dhromed said:

    Definitely be sure whether the problem is the query, the server, or the PHP. Don't guess. Dump timers everywhere.

Given that it worked well in my local test, what remains is only that my shared host is slow (which it isn't... but since it is shared, maybe someone else on the server did something very stupid at the time), or maybe it's some interaction between simultaneous executions (I don't know, can mod_php actually handle simultaneous execution? if yes, it might still be limited by my hoster). Or it could be due to different PHP/MySQL versions?!?

(I just called the cronjob on the server manually, and it finishes as quickly as the local version. So what could it be, other than a one-time effect or possibly weird things that are out of my control or debuggability?)



  • @derula said:

    what remains is only that my shared host is slow

    Is your MySQL instance shared? You might just get unlucky if you hit it while someone else is doing a huge reporting query. You could call them up and see if you can get your own MySQL instance. My host, Dreamhost, does stuff like that if you're a good customer and pay your bills on time, even if you don't have SQL-heavy apps.

    @derula said:

    (I don't know, can mod_php actually handle simultaneous execution? if yes, it might still be limited by my hoster)

    It'd be a huge WTF if it couldn't! "One person at a time hitting the website, please."

    @derula said:

(I just called the cronjob on the server manually, and it finishes as quickly as the local version. So what could it be, other than a one-time effect or possibly weird things that are out of my control or debuggability?)

What I usually do is wrap each database call (in my case, stored procedures; in yours, the prepared statements) in a simple timer that keeps track of a list of start/end times. Then when your script is done you can have it write the times (or just the average) into a log file, which you can look at later. I usually write a Q&D "statistics" page along with my web apps so I can check the DB performance in real time (for example, "Last 100 Queries Average Duration (milliseconds): 32.61"), but again, that's overkill for your use here.
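In PHP it could be as dumb as this (sketch only; the helper name, the $timings array, and the log file are all made up, not something your code already has):

    $timings = array();

    function timed_execute(PDOStatement $stmt, array $params, array &$timings) {
        $start = microtime(true);
        $ok = $stmt->execute($params);          // the actual DB call being measured
        $timings[] = microtime(true) - $start;  // record the duration in seconds
        return $ok;
    }

    // Call timed_execute($stmt, $params, $timings) wherever you now call ->execute(),
    // then dump an average at the end of the run:
    $avg_ms = count($timings) ? array_sum($timings) / count($timings) * 1000 : 0;
    file_put_contents('cronjob_times.log', date('c') . ' avg ' . round($avg_ms, 2) . " ms\n", FILE_APPEND);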

    @derula said:

(I just called the cronjob on the server manually, and it finishes as quickly as the local version. So what could it be, other than a one-time effect or possibly weird things that are out of my control or debuggability?)

    Add some logging code and know for sure. If you wanted to complain to your host, you'd need some evidence to provide them anyway. Nothing's out of your control.



  • @blakeyrat said:

    but again, that's overkill for your use here.

    Yeah, probably. But thanks for all the hints!



• This thread prompted me to look into some of my own sprocs that were running slow... I had one that did a string compare against a 500k+ row table (for the purpose of inserting URLs without duplicating any), and I benchmarked it at 67 seconds per 100 runs.

After musing over it for a bit, I realized comparing against the entire table's worth of strings is stupid. Instead, I added a column ("urlChecksum"), populated it with the CHECKSUM() function on the "url" column, and modified my sproc to look at only the url records that match the checksum:

    declare @matchingChecksums table ( urlid int, url nvarchar(max), urlChecksum int )

    insert into @matchingChecksums
    select urlID, url, urlChecksum from url
    where urlChecksum = CHECKSUM(@url)
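The duplicate test itself then only has to string-compare within that handful of candidates (sketch; @url is the incoming value, and I'm assuming the same url table columns as above):

    -- Only insert when no candidate actually matches the full string:
    if not exists (select 1 from @matchingChecksums where url = @url)
    begin
        insert into url (url, urlChecksum)
        values (@url, CHECKSUM(@url))
    end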

    A quick and dirty check:

    select urlChecksum, COUNT( urlChecksum )
    from url
    group by urlChecksum
    order by COUNT( urlChecksum ) desc

    shows that the maximum number of URLs that share a checksum in the entire 500k+ row set is 4.

    The benchmark that was taking 67 seconds now completes so fast that SSMS' timer doesn't even hit 1 second before the query returns. Badasssssss. Morning well-spent.



  • I inspired someone, yay!



  • @derula said:

    I inspired someone, yay!

Well, I did have to set the locking to "OMGZORS" (aka Serializable). But it's fast enough that that's not an issue; the webservers already know to retry when they hit deadlocks.



• This is the kind of shit that makes me happy.



  • @dhromed said:

    Filed under: happy happy joy joy

Jubel Jubel Freu Freu (roughly: "cheer cheer, joy joy")

    Also I just noticed my random tag selector really seems to like the "I am Spartacus" tag.

