T-SQL storing URLs



  • We have a web analytics product that's running really well, and for the most part it's scalable as crap. But there's one particular sticking point that's been driving me nuts, because I can't work out how to improve the performance.

    The system tracks what URL the analytics call is coming from, and the referring URL, and these URLs are normalized into their own table. (Which is necessary, because we do a lot of reporting based on the values.) When a new URL enters the system, we need to figure out whether that URL is already stored in the table, and either return the ID or create a new entry for it. Here's the sproc I've been using to do this:


    -- =============================================
    -- Author: blakeyrat
    -- Description: Inserts a URL into the url table if not already present,
    -- returns the urlID
    -- =============================================
    create procedure [dbo].[urlINS]
    @url nvarchar(max)
    as
    begin

    set nocount on;

    declare @returnID int

    if (select count(*) from url where url = @url) = 0
    begin
      insert into url( url )
      values ( @url )

      set @returnID = @@identity
    end
    else
    begin
      set @returnID = (select top 1 urlID from url where url = @url)
    end

    return @returnID

    end

    Note that it's crap. Determining whether the URL already exists in the table is a long operation, and after this analytics product has been running a few weeks, the execution time is sitting at about 500 milliseconds. I've already added all the application-level caching I can, so at least the sproc doesn't have to run too often, and database inserts are done in a thread, so at least it's not stalling the web server's response or anything, but I'd really like to get this running faster.

    If this were normal alphabetical data, I'd store the first couple characters as kind of a psuedo-hash, so I could add a WHERE clause to the above query and shave a ton of time off of the execution. I've done that before in other situations and it works great. But since these are URLs, the first couple characters are (nearly) always going to be "http://www.theclientdomain.com/", so... that solution doesn't work in this case. Also they can be as short as http://ow.ly/5ZIWl or as long as 2k.

    Any ideas? This is for Microsoft SQL Server 2008. The tracking web server app is in ASP.net, if that matters.



  • Quick follow-up:

    Is there a performance difference between using nvarchar(4000) and using nvarchar(max)? We use "max" because it's technically correct (the best kind of "correct"), but in reality the longest URLs we've seen are about 1.5k


  • ♿ (Parody)

    I hope there's an index on the url column... if so, this query should be fine. It could be tweaked a little, though:

    <font face="Courier New">-- =============================================
    -- Author: blakeyrat
    -- Description: Inserts a URL into the url table if not already present,
    -- returns the urlID
    -- =============================================
    create procedure [dbo].[urlINS]
    @url nvarchar(max)
    as
    begin

    set nocount on;

    declare @returnID int
    </font><font face="Courier New">set @returnID = (select urlId from url where url = @url)

    if (@returnID is null)
    begin
        insert into url( url )
        values ( @url )
        set @returnID = scope_identity() -- use that instead of @@
    end

    return @returnID

    end</font> 

    DB connections have very litte overhead -- are you sure your app caching isn't slowing down? I've seen that happen a lot

    @blakeyrat said:

    If this were normal alphabetical data, I'd store the first couple characters as kind of a psuedo-hash, so I could add a WHERE clause to the above query and shave a ton of time off of the execution.

    What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.



  • Well, the first thing I would say is that you're potentially querying the table twice - once to get a count and once to get the id if it exists.  And more than likely, after the application is running for a while; the most common branch will be when the url exists which is where you are querying the table twice.

    DECLARE @FoundId int

    SELECT @FoundId = Id FROM Urls WHERE url = @url

    IF @FoundId IS NULL BEGIN

       INSERT INTO urls ...

       SET @FoundId = Scope_IDENTITY()

    END 

    RETURN @FoundId

    Beyond that, how many rows are we talking about and are there any indexes on the url column?  If there is an index on the url column, does it have id as a covered column?

     



  • @Alex Papadimoulis said:

    I hope there's an index on the url column... if so, this query should be fine.

    The entire table schema is just:

    urlID int autonumber(1,1) PK
    url nvarchar(max)

    It's been awhile since I designed this schema... there's no index on url, and I assume I had a reason for that? But maybe I'm just an idiot.

    @Alex Papadimoulis said:

    DB connections have very litte overhead -- are you sure your app caching isn't slowing down? I've seen that happen a lot

    I'm using the ASP.net built-in Cache class, if that's slower than hitting the DB then that's a huge WTF. The cache "hit" rate is about 75% on this query.

    @Alex Papadimoulis said:

    What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.

    That is a good idea, I'll look into it.

    @lpope187 said:

    Well, the first thing I would say is that you're potentially querying the table twice - once to get a count and once to get the id if it exists.  And more than likely, after the application is running for a while; the most common branch will be when the url exists which is where you are querying the table twice.

    DECLARE @FoundId int

    SELECT @FoundId = Id FROM Urls WHERE url = @url

    IF @FoundId IS NULL BEGIN

       INSERT INTO urls ...

       SET @FoundId = Scope_IDENTITY()

    END 

    RETURN @FoundId

    Der, that's an obvious improvement.



  • @Alex Papadimoulis said:

    What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.

    Hahaha, I just looked back at the last project that had this performance issue, and it turns out I'd already done exactly that! ... I wonder how come those changes were never put in the model DB.

    So I guess the real problem here is me having amnesia, and also me not rolling optimizations from previous projects into the model DB.



  • Oh for FUCK'S SAKE.

    I DID put the fix in the model DB.

    My co-worker decided that instead of using the model DB for his project, he'd copy an ancient DB from a project we did years ago and truncate all the data. It's actually remarkable this thing is working at all, considering the schema changes we've made since the customer's last project.

    Wow. This whole thread, and my whole morning at work, has all been the result of a co-worker's WTF, and my own amnesia. Sorry all.


  • ♿ (Parody)

    @blakeyrat said:

    I'm using the ASP.net built-in Cache class, if that's slower than hitting the DB then that's a huge WTF.

    I guess I was referring to the caching of computed/derived data. The case I'm thinking of off the top of my head involved a join of about 6 tables and a dozen-line WHERE clasue, so the developer assumed it would be slow to look up a single row. So, he cached results on the web server based on the row search paramters.Seemed logical, except the cache hit was like 10% and the memory footprint grew to hundreds of megs.



  • @blakeyrat said:

    Wow. This whole thread, and my whole morning at work, has all been the result of a co-worker's WTF, and my own amnesia. Sorry all.

    Who hasn't had a morning like this?



  • The round out the thread, I talked to him about it, explained the problem (he didn't realize there had been schema changes, and wanted to keep his lookup tables), and we've decided that it's not worth taking the project offline for the hour or so it would take to update the DB schema. So... resolved. The DB server will just have to do more chugging than normal for the next month or so.


  • ♿ (Parody)

    @blakeyrat said:


    if (select count(*) from url where url = @url) = 0

    This sort of a query is potentially a lot more expensive than it needs to be. Probably better off with something like "SELECT TOP 1....FROM URL WHERE URL = @URL". Some of the others suggested similar improvements to get rid of count.



  • @boomzilla said:

    This sort of a query is potentially a lot more expensive than it needs to be. Probably better off with something like "SELECT TOP 1....FROM URL WHERE URL = @URL". Some of the others suggested similar improvements to get rid of count.

    Yeah, but that's minor tweaking compared to the checksum() function usage. I shoved it on my issue tracker, and I'll change that when/if I get around to it.

    But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.


  • ♿ (Parody)

    @blakeyrat said:

    Yeah, but that's minor tweaking compared to the checksum() function usage. I shoved it on my issue tracker, and I'll change that when/if I get around to it.

    IT'S THE SORT OF THING THAT YOU MIGHT NOT NOTICE UNTIL THE TABLE GETS REALLY BIG. ALSO DEPENDS ON LOTS OF IMPLEMENTATION DETAILS. BUT IT'S A CODE SMELL THAT I WIPE OUT WHENEVER I SEE IT.

    @blakeyrat said:

    But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.

    AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.



  • @boomzilla said:

    AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.

    @T-SQL said:

    Msg 102, Level 15, State 1, Line 1

    Incorrect syntax near '!'.

    Sorry, no SQL, no drowned cats.



  • @blakeyrat said:

    @boomzilla said:
    select 'AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.'

    @T-SQL said:

    Msg 102, Level 15, State 1, Line 1

    Incorrect syntax near '!'.

    Sorry, no SQL, no drowned cats.

     

    FTFY



  •  @locallunatic said:

    @blakeyrat said:

    @boomzilla said:
    select 'AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.'

    FTFY

    Guess which part is NOT in all-caps?


  • ♿ (Parody)

    @blakeyrat said:

    Sorry, no SQL, no drowned cats.

    Sorry...well played....

    BEGIN
        FOR CAT IN CATS
            INSERT CAT INTO SACK;
            DROP SACK FROM BRIDGE;
        END LOOP;
    END;
    /
    


  • @PSWorx said:

    Guess which part is NOT in all-caps?
     

    That's cause unlike boomzilla I'm human, and thus a fan of cats.



  • @blakeyrat said:

    But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.

    That's an awful strong opinion for something as trivial as space-vs-tab.  FWIW, Joe Celko recommends uppercase.  I always uppercase keywords and type object and column names exactly as they are defined.



  • @Jaime said:

    That's an awful strong opinion for something as trivial as space-vs-tab.

    Spaces are retarded.

    @Jaime said:

    FWIW, Joe Celko recommends uppercase.

    Well, whoop-de-shit. I don't know who that is, nor do I care what he recommends. Blakeyrat recommends typing queries so they're not hard as fuck to read. And I like Blakeyrat's opinion a lot more than Joe Celko's.

    @Jaime said:

    I always uppercase keywords and type object and column names exactly as they are defined.

    You can do whatever you like. I won't strangle any kittens until you paste your skank-ass capitalized SQL here.



  • Ok so the co-worker who pulled this on me:

    1) is going out-of-town for a few days, meaning I'm temporarily in charge of his project

    2) ramped up the sampling rate to 100%, meaning it's getting a heck of a lot more URLs

    So I went ahead and shut down data collection for the 30 seconds or so it took to run the query to fix this issue. (Well, not all the fixes in this thread, but bringing it in-line with what's in our model DB.) To give you an idea of how retarded the non-checksummed table was, the average URL insertion time went from 500ms to 1.5ms, so about a 333x improvement in performance.



  • @blakeyrat said:

    Spaces are retarded.
     

    Still on the fence.

    You use tabs with a width of 3 ( = perfect for me), then open it up in Notepad for a quicky, andeverything's 8-width tabs and OH MY GOD THE INDENTS DO NOT STOP.

    But spaces have other drawbacks. They nearly cancel out for me with a slight win for spaces; so I uses spaces.

    @blakeyrat said:

    Well, whoop-de-shit. I don't know who that is, nor do I care what he recommends. Blakeyrat recommends typing queries so they're not hard as fuck to read. And I like Blakeyrat's opinion a lot more than Joe Celko's.

    +1



  • @dhromed said:

    Still on the fence.

    Well then you're wrong, because spaces are retarded.

    @dhromed said:

    You use tabs with a width of 3 ( = perfect for me), then open it up in Notepad for a quicky, andeverything's 8-width tabs and OH MY GOD THE INDENTS DO NOT STOP.

    Why would you ever open up a code file in Notepad? Why would you decide what to use based on what Notepad does? Notepad is not an IDE. Notepad is not even a good program. The only reason Notepad even ships with the OS is for backwards compatibility.

    You are doing it wrong.

    @dhromed said:

    But spaces have other drawbacks. They nearly cancel out for me with a slight win for spaces; so I uses spaces.

    The only reason spaces are even slightly tolerable is that modern IDEs treat spaces like tabs. Which is also retarded.

    Look:
    1) Tabs are the way to align text to tab stops in ASCII. Ignoring that feature of ASCII and then simulating it (badly) using spaces is stupid. This use is what tabs exist for
    2) With tabs, each developer can set their tab stops where they are comfortable without reformatting the code, so one Nazi developer can't ruin the code for everybody else
    3) Like I said above, the only reason spaces are even slightly acceptable is that IDEs have been modified so that they treat spaces like tabs! They basically use spaces in "tab emulation mode", because so many idiot devs don't just use fucking tabs in the first place!
    4) If you really, really like spaces (and you're not so fucking retarded that you edit code in Notepad! Notepad!!!) you can just have your IDE convert the tabs to spaces, then convert back before checking-in. That way you're happy and you're not ruining the code for everybody else

    To me, spaces are so inferior for the task compared to tabs, I can't even figure out why this debate exists! I think the debate really should be framed "old Unix geezers who want spaces because They've Always Done It That WayTM" vs. "people who actually critically analyze which solution is best for all parties".

    Plus Mac Classic developers always used tabs, and that's a group of hoopy froods who really know where their towels are.



  • <font face="Courier New" size="2"><font color="blue">if</font> <font color="maroon">(</font> <font color="blue">not</font> <font color="blue">exists</font> <font color="maroon">(</font><font color="blue">select</font> <font color="#8000FF">@returnID</font> <font color="silver">=</font> <font color="maroon">urlid</font>
                     <font color="blue">from</font>   <font color="maroon">url</font>
                     <font color="blue">where</font>  <font color="maroon">url</font> <font color="silver">=</font> <font color="#8000FF">@url</font><font color="maroon">)</font> <font color="maroon">)</font>
      <font color="blue">begin</font>
          <font color="blue">insert</font> <font color="blue">into</font> <font color="maroon">url</font>
                      <font color="maroon">(</font><font color="maroon">url</font><font color="maroon">)</font>
          <font color="blue">values</font>      <font color="maroon">(</font> <font color="#8000FF">@url</font> <font color="maroon">)</font>

          <font color="blue">set</font> <font color="#8000FF">@returnID</font> <font color="silver">=</font> <font color="fuchsia">@@identity</font>
      <font color="blue">end</font> </font>

    Ofcourse the performance of this query strongly depend on a index on the url row (or indeed, on a checksum, as you can't index varchar(max)). Anyway, i find exists a little more clear and performant



  •  Oops, looks like subqueries can't set variables, sadly. Stupid online sytax checker didn't spot that, and didn't have access to sqlserver before.



  •  Yeah, max edit time expired agian, i still got something fun to add:

     If you are going to resort to 2 queries, you do want to add some locking:

    <font face="Courier New" size="2"> <font color="blue">create</font> <font color="blue">procedure</font> <font color="#FF0080">Urlins</font> <font color="#8000FF">@url</font> <font color="black">nvarchar</font><font color="maroon">(</font><font color="blue">max</font><font color="maroon">)</font>
    <font color="blue">as</font>
      <font color="blue">begin</font>
          <font color="blue">begin</font> <font color="blue">transaction</font>

          <font color="blue">declare</font> <font color="#8000FF">@returnID</font> <font color="black">int</font>

          <font color="blue">select</font> <font color="#8000FF">@returnID</font> <font color="silver">=</font> </font><font face="Courier New" size="2"><font color="maroon">urlid</font></font>
    <font face="Courier New" size="2">      <font color="blue">from</font>   <font color="maroon">url</font> <font color="maroon">(</font><font color="maroon">updlock</font><font color="maroon">)</font>
          <font color="blue">where</font>  <font color="maroon">url</font> <font color="silver">=</font> <font color="#8000FF">@url</font>

          <font color="blue">if</font> <font color="maroon">(</font> <font color="#8000FF">@returnID</font> <font color="blue">is</font> <font color="blue">null</font> <font color="maroon">)</font>
            <font color="blue">begin</font>
                <font color="blue">insert</font> <font color="blue">into</font> <font color="maroon">url</font>
                            <font color="maroon">(</font><font color="maroon">url</font><font color="maroon">)</font>
                <font color="blue">values</font>      <font color="maroon">(</font> <font color="#8000FF">@url</font> <font color="maroon">)</font>

                <font color="blue">set</font> <font color="#8000FF">@returnID</font> <font color="silver">=</font> <font color="fuchsia">Scope_identity</font><font color="maroon">(</font><font color="maroon">)</font>
            <font color="blue">end</font>

          <font color="blue">select</font> <font color="#8000FF">@returnID</font>

          <font color="blue">commit</font> <font color="blue">transaction</font>
      <font color="blue">end</font>   </font>



  • updlock isn't good enough.  You would need to set the transaction isolation level to SERIALIZABLE in order to prevent a double-insert on a race condition.



  •  Ah, i only tested it by putting a delay before the <font face="Courier New" size="2"><font color="blue">if</font></font>, is it possible to force this double-insert on a race condition to happen by smartly adding locks or delays? Or can we only get by running the proc in 10 threads 10.000x times?

     

    *Goes off to run the proc 10,000 times in 10 threads.



  • Yeah, uh. I solved the problem.

    But go ahead and do whatever.



  • @blakeyrat said:

    Yeah, uh. I solved the problem.

    But go ahead and do whatever.

    You should know better than anyone else that this site isn't about helping the OP solve a problem.  We live for off-topic posts.


  • @Jaime said:

    @blakeyrat said:

    Yeah, uh. I solved the problem.

    But go ahead and do whatever.

    You should know better than anyone else that this site isn't about helping the OP solve a problem.  We live for off-topic posts.

    If you guys worked for me, I'd so fire you.

    "You're still working on that? It's already fixed! We have a million things to do!"



  • @Dorus said:

     Ah, i only tested it by putting a delay before the <FONT size=2 face="Courier New"><FONT color=blue>if</FONT></FONT>, is it possible to force this double-insert on a race condition to happen by smartly adding locks or delays? Or can we only get by running the proc in 10 threads 10.000x times?

     

    *Goes off to run the proc 10,000 times in 10 threads.

    If you are going to test it empirically, then be careful that you can duplicate the issue.  Personally, I would add some WAITFOR DELAY statements to the procedure to guarantee a race condition rather than assuming that 10,000 runs in 10 threads would cause it.

    Besides, the documentation for the serializable isolation level says that it is for exactly this issue.  It prevents inserts that would have been returned in a query.  It's the best way to prevent a different connection from adding the url that you previously verified to be not there.  You could also use (tablock holdlock), but that would create performance problems.



  •  Been playing a bit with it, refusing to use SERIALIZABLE ofcourse. Did try holdlock. I've managed to crash management studio, make variouse deathlocks, made queries seem to freeze before they even run the proc, and once it even looked like it deathlocked with itself (but it's also possible a canceled querie was still running in the background.

    Oh and haha, canceling a proc that has begin transaction in it is grand fun too. I could swear that would automaticly result in rollback tran, but it doesnt.

     

     Still unable to make it fail cleanly.



  • @blakeyrat said:

    @Jaime said:

    @blakeyrat said:

    Yeah, uh. I solved the problem.

    But go ahead and do whatever.

    You should know better than anyone else that this site isn't about helping the OP solve a problem.  We live for off-topic posts.

    If you guys worked for me, I'd so fire you.

    "You're still working on that? It's already fixed! We have a million things to do!"

    Hey, i'm at work now, having fun with sqlserver (that's completely unrelated to anything i do, i don't even get near a database lately). Sec, need to get that important email out i guess.

     



  • @Dorus said:

    Hey, i'm at work now, having fun with sqlserver (that's completely unrelated to anything i do, i don't even get near a database lately). Sec, need to get that important email out i guess.

    Don't get me wrong, I'm all for slacking off at work. But I do it via. surfing the web, or watching YouTube, or going upstairs to play ping-pong or Xbox.



  • @blakeyrat said:

    But fair warning: people who type SQL in all-caps make me kill kittens.
    +1 for hatred of SQL statements in all-caps. I find them far harder to read that way and always use lowercase for my queries.



  • @Dorus said:

    make variouse deathlocks,
     



  • YOUNGBLOOD, BLOODSHOT, DEATHMATE RED, THIS BLOOD'S FOR YOU!


Log in to reply