T-SQL storing URLs

blakeyrat

We have a web analytics product that's running really well, and for the most part it's scalable as crap. But there's one particular sticking point that's been driving me nuts, because I can't work out how to improve the performance.

The system tracks what URL the analytics call is coming from, and the referring URL, and these URLs are normalized into their own table. (Which is necessary, because we do a lot of reporting based on the values.) When a new URL enters the system, we need to figure out whether that URL is already stored in the table, and either return the ID or create a new entry for it. Here's the sproc I've been using to do this:

-- ============================================= -- Author: blakeyrat -- Description: Inserts a URL into the url table if not already present, -- returns the urlID -- ============================================= create procedure [dbo].[urlINS] @url nvarchar(max) as begin set nocount on; declare @returnID int if (select count(*) from url where url = @url) = 0 begin insert into url( url ) values ( @url ) set @returnID = @@identity end else begin set @returnID = (select top 1 urlID from url where url = @url) end return @returnID end

Note that it's crap. Determining whether the URL already exists in the table is a long operation, and after this analytics product has been running a few weeks, the execution time is sitting at about 500 milliseconds. I've already added all the application-level caching I can, so at least the sproc doesn't have to run too often, and database inserts are done in a thread, so at least it's not stalling the web server's response or anything, but I'd really like to get this running faster.

If this were normal alphabetical data, I'd store the first couple characters as kind of a psuedo-hash, so I could add a WHERE clause to the above query and shave a ton of time off of the execution. I've done that before in other situations and it works great. But since these are URLs, the first couple characters are (nearly) always going to be "http://www.theclientdomain.com/", so... that solution doesn't work in this case. Also they can be as short as http://ow.ly/5ZIWl or as long as 2k.

Any ideas? This is for Microsoft SQL Server 2008. The tracking web server app is in ASP.net, if that matters.

blakeyrat

Quick follow-up:

Is there a performance difference between using nvarchar(4000) and using nvarchar(max)? We use "max" because it's technically correct (the best kind of "correct"), but in reality the longest URLs we've seen are about 1.5k

apapadimoulis

I hope there's an index on the url column... if so, this query should be fine. It could be tweaked a little, though:

-- =============================================
-- Author: blakeyrat
-- Description: Inserts a URL into the url table if not already present,
-- returns the urlID
-- =============================================
create procedure [dbo].[urlINS]
@url nvarchar(max)
as
begin

set nocount on;

declare @returnID int
set @returnID = (select urlId from url where url = @url)

if (@returnID is null)
begin
 insert into url( url )
 values ( @url )
 set @returnID = scope_identity() -- use that instead of @@
end

return @returnID

end

DB connections have very litte overhead -- are you sure your app caching isn't slowing down? I've seen that happen a lot

@blakeyrat said:

If this were normal alphabetical data, I'd store the first couple characters as kind of a psuedo-hash, so I could add a WHERE clause to the above query and shave a ton of time off of the execution.

What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.

lpope187

Well, the first thing I would say is that you're potentially querying the table twice - once to get a count and once to get the id if it exists. And more than likely, after the application is running for a while; the most common branch will be when the url exists which is where you are querying the table twice.

DECLARE @FoundId int

SELECT @FoundId = Id FROM Urls WHERE url = @url

IF @FoundId IS NULL BEGIN

INSERT INTO urls ...

SET @FoundId = Scope_IDENTITY()

END

RETURN @FoundId

Beyond that, how many rows are we talking about and are there any indexes on the url column? If there is an index on the url column, does it have id as a covered column?

blakeyrat

@Alex Papadimoulis said:

I hope there's an index on the url column... if so, this query should be fine.

The entire table schema is just:

urlID int autonumber(1,1) PK
url nvarchar(max)

It's been awhile since I designed this schema... there's no index on url, and I assume I had a reason for that? But maybe I'm just an idiot.

@Alex Papadimoulis said:

DB connections have very litte overhead -- are you sure your app caching isn't slowing down? I've seen that happen a lot

I'm using the ASP.net built-in Cache class, if that's slower than hitting the DB then that's a huge WTF. The cache "hit" rate is about 75% on this query.

@Alex Papadimoulis said:

What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.

That is a good idea, I'll look into it.

@lpope187 said:

Well, the first thing I would say is that you're potentially querying the table twice - once to get a count and once to get the id if it exists. And more than likely, after the application is running for a while; the most common branch will be when the url exists which is where you are querying the table twice.
DECLARE @FoundId int
SELECT @FoundId = Id FROM Urls WHERE url = @url
IF @FoundId IS NULL BEGIN
INSERT INTO urls ...
SET @FoundId = Scope_IDENTITY()
END
RETURN @FoundId

Der, that's an obvious improvement.

blakeyrat

@Alex Papadimoulis said:

What'd the distribution of data? You could just add a new column (urlCrc) and use the CHECKSUM() method. It's pretty cheap to call and is a quick way to have a quasi-hash index. mabe that's an index option in 2008, or even 2005. Haven't done that in a while.

Hahaha, I just looked back at the last project that had this performance issue, and it turns out I'd already done exactly that! ... I wonder how come those changes were never put in the model DB.

So I guess the real problem here is me having amnesia, and also me not rolling optimizations from previous projects into the model DB.

blakeyrat

Oh for FUCK'S SAKE.

I DID put the fix in the model DB.

My co-worker decided that instead of using the model DB for his project, he'd copy an ancient DB from a project we did years ago and truncate all the data. It's actually remarkable this thing is working at all, considering the schema changes we've made since the customer's last project.

Wow. This whole thread, and my whole morning at work, has all been the result of a co-worker's WTF, and my own amnesia. Sorry all.

apapadimoulis

@blakeyrat said:

I'm using the ASP.net built-in Cache class, if that's slower than hitting the DB then that's a huge WTF.

I guess I was referring to the caching of computed/derived data. The case I'm thinking of off the top of my head involved a join of about 6 tables and a dozen-line WHERE clasue, so the developer assumed it would be slow to look up a single row. So, he cached results on the web server based on the row search paramters.Seemed logical, except the cache hit was like 10% and the memory footprint grew to hundreds of megs.

Scarlet_Manuka

@blakeyrat said:

Wow. This whole thread, and my whole morning at work, has all been the result of a co-worker's WTF, and my own amnesia. Sorry all.

Who hasn't had a morning like this?

blakeyrat

The round out the thread, I talked to him about it, explained the problem (he didn't realize there had been schema changes, and wanted to keep his lookup tables), and we've decided that it's not worth taking the project offline for the hour or so it would take to update the DB schema. So... resolved. The DB server will just have to do more chugging than normal for the next month or so.

boomzilla

@blakeyrat said:



if (select count(*) from url where url = @url) = 0

This sort of a query is potentially a lot more expensive than it needs to be. Probably better off with something like "SELECT TOP 1....FROM URL WHERE URL = @URL". Some of the others suggested similar improvements to get rid of count.

blakeyrat

@boomzilla said:

This sort of a query is potentially a lot more expensive than it needs to be. Probably better off with something like "SELECT TOP 1....FROM URL WHERE URL = @URL". Some of the others suggested similar improvements to get rid of count.

Yeah, but that's minor tweaking compared to the checksum() function usage. I shoved it on my issue tracker, and I'll change that when/if I get around to it.

But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.

boomzilla

@blakeyrat said:

Yeah, but that's minor tweaking compared to the checksum() function usage. I shoved it on my issue tracker, and I'll change that when/if I get around to it.

IT'S THE SORT OF THING THAT YOU MIGHT NOT NOTICE UNTIL THE TABLE GETS REALLY BIG. ALSO DEPENDS ON LOTS OF IMPLEMENTATION DETAILS. BUT IT'S A CODE SMELL THAT I WIPE OUT WHENEVER I SEE IT.

@blakeyrat said:

But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.

AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.

blakeyrat

@boomzilla said:

AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.

@T-SQL said:

Msg 102, Level 15, State 1, Line 1

Incorrect syntax near '!'.

Sorry, no SQL, no drowned cats.

locallunatic

@blakeyrat said:

@boomzilla said:
select 'AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.'

@T-SQL said:
Msg 102, Level 15, State 1, Line 1

Incorrect syntax near '!'.

Sorry, no SQL, no drowned cats.

FTFY

PSWorx

@locallunatic said:

@blakeyrat said:
@boomzilla said:
select 'AWESOME! HOW MANY WILL THIS POST GET RID OF? BECAUSE CATS ARE TRWTF.'
FTFY

Guess which part is NOT in all-caps?

boomzilla

@blakeyrat said:

Sorry, no SQL, no drowned cats.

Sorry...well played....

BEGIN
    FOR CAT IN CATS
        INSERT CAT INTO SACK;
        DROP SACK FROM BRIDGE;
    END LOOP;
END;
/

locallunatic

@PSWorx said:

Guess which part is NOT in all-caps?

That's cause unlike boomzilla I'm human, and thus a fan of cats.

Jaime

@blakeyrat said:

But fair warning: people who type SQL in all-caps make me kill kittens. I'm drowning 3 of them right now.

That's an awful strong opinion for something as trivial as space-vs-tab. FWIW, Joe Celko recommends uppercase. I always uppercase keywords and type object and column names exactly as they are defined.

blakeyrat

@Jaime said:

That's an awful strong opinion for something as trivial as space-vs-tab.

Spaces are retarded.

@Jaime said:

FWIW, Joe Celko recommends uppercase.

Well, whoop-de-shit. I don't know who that is, nor do I care what he recommends. Blakeyrat recommends typing queries so they're not hard as fuck to read. And I like Blakeyrat's opinion a lot more than Joe Celko's.

@Jaime said:

I always uppercase keywords and type object and column names exactly as they are defined.

You can do whatever you like. I won't strangle any kittens until you paste your skank-ass capitalized SQL here.

blakeyrat

Ok so the co-worker who pulled this on me:

1) is going out-of-town for a few days, meaning I'm temporarily in charge of his project

2) ramped up the sampling rate to 100%, meaning it's getting a heck of a lot more URLs

So I went ahead and shut down data collection for the 30 seconds or so it took to run the query to fix this issue. (Well, not all the fixes in this thread, but bringing it in-line with what's in our model DB.) To give you an idea of how retarded the non-checksummed table was, the average URL insertion time went from 500ms to 1.5ms, so about a 333x improvement in performance.

dhromed

@blakeyrat said:

Spaces are retarded.

Still on the fence.

You use tabs with a width of 3 ( = perfect for me), then open it up in Notepad for a quicky, andeverything's 8-width tabs and OH MY GOD THE INDENTS DO NOT STOP.

But spaces have other drawbacks. They nearly cancel out for me with a slight win for spaces; so I uses spaces.

@blakeyrat said:

Well, whoop-de-shit. I don't know who that is, nor do I care what he recommends. Blakeyrat recommends typing queries so they're not hard as fuck to read. And I like Blakeyrat's opinion a lot more than Joe Celko's.

+1

blakeyrat

@dhromed said:

Still on the fence.

Well then you're wrong, because spaces are retarded.

@dhromed said:

You use tabs with a width of 3 ( = perfect for me), then open it up in Notepad for a quicky, andeverything's 8-width tabs and OH MY GOD THE INDENTS DO NOT STOP.

Why would you ever open up a code file in Notepad? Why would you decide what to use based on what Notepad does? Notepad is not an IDE. Notepad is not even a good program. The only reason Notepad even ships with the OS is for backwards compatibility.

You are doing it wrong.

@dhromed said:

But spaces have other drawbacks. They nearly cancel out for me with a slight win for spaces; so I uses spaces.

The only reason spaces are even slightly tolerable is that modern IDEs treat spaces like tabs. Which is also retarded.

Look:
1) Tabs are the way to align text to tab stops in ASCII. Ignoring that feature of ASCII and then simulating it (badly) using spaces is stupid. This use is what tabs exist for
2) With tabs, each developer can set their tab stops where they are comfortable without reformatting the code, so one Nazi developer can't ruin the code for everybody else
3) Like I said above, the only reason spaces are even slightly acceptable is that IDEs have been modified so that they treat spaces like tabs! They basically use spaces in "tab emulation mode", because so many idiot devs don't just use fucking tabs in the first place!
4) If you really, really like spaces (and you're not so fucking retarded that you edit code in Notepad! Notepad!!!) you can just have your IDE convert the tabs to spaces, then convert back before checking-in. That way you're happy and you're not ruining the code for everybody else

To me, spaces are so inferior for the task compared to tabs, I can't even figure out why this debate exists! I think the debate really should be framed "old Unix geezers who want spaces because They've Always Done It That Way^TM" vs. "people who actually critically analyze which solution is best for all parties".

Plus Mac Classic developers always used tabs, and that's a group of hoopy froods who really know where their towels are.

Dorus

if ( not exists (select @returnID = urlid
 from url
 where url = @url) )
 begin
 insert into url
 (url)
 values ( @url )

 set @returnID = @@identity
 end

Ofcourse the performance of this query strongly depend on a index on the url row (or indeed, on a checksum, as you can't index varchar(max)). Anyway, i find exists a little more clear and performant

Dorus

Oops, looks like subqueries can't set variables, sadly. Stupid online sytax checker didn't spot that, and didn't have access to sqlserver before.

Dorus

Yeah, max edit time expired agian, i still got something fun to add:

If you are going to resort to 2 queries, you do want to add some locking:

create procedure Urlins @url nvarchar(max)
as
 begin
 begin transaction

 declare @returnID int

 select @returnID = urlid
 from url (updlock)
 where url = @url

 if ( @returnID is null )
 begin
 insert into url
 (url)
 values ( @url )

 set @returnID = Scope_identity()
 end

 select @returnID

 commit transaction
 end

Jaime

updlock isn't good enough. You would need to set the transaction isolation level to SERIALIZABLE in order to prevent a double-insert on a race condition.

Dorus

Ah, i only tested it by putting a delay before the if, is it possible to force this double-insert on a race condition to happen by smartly adding locks or delays? Or can we only get by running the proc in 10 threads 10.000x times?

*Goes off to run the proc 10,000 times in 10 threads.

blakeyrat

Yeah, uh. I solved the problem.

But go ahead and do whatever.

Jaime

@blakeyrat said:

Yeah, uh. I solved the problem.
But go ahead and do whatever.

You should know better than anyone else that this site isn't about helping the OP solve a problem. We live for off-topic posts.

blakeyrat

@Jaime said:

@blakeyrat said:
Yeah, uh. I solved the problem.

But go ahead and do whatever.
You should know better than anyone else that this site isn't about helping the OP solve a problem. We live for off-topic posts.

If you guys worked for me, I'd so fire you.

"You're still working on that? It's already fixed! We have a million things to do!"

Jaime

@Dorus said:

Ah, i only tested it by putting a delay before the if, is it possible to force this double-insert on a race condition to happen by smartly adding locks or delays? Or can we only get by running the proc in 10 threads 10.000x times?

*Goes off to run the proc 10,000 times in 10 threads.

If you are going to test it empirically, then be careful that you can duplicate the issue. Personally, I would add some WAITFOR DELAY statements to the procedure to guarantee a race condition rather than assuming that 10,000 runs in 10 threads would cause it.

Besides, the documentation for the serializable isolation level says that it is for exactly this issue. It prevents inserts that would have been returned in a query. It's the best way to prevent a different connection from adding the url that you previously verified to be not there. You could also use (tablock holdlock), but that would create performance problems.

Dorus

Been playing a bit with it, refusing to use SERIALIZABLE ofcourse. Did try holdlock. I've managed to crash management studio, make variouse deathlocks, made queries seem to freeze before they even run the proc, and once it even looked like it deathlocked with itself (but it's also possible a canceled querie was still running in the background.

Oh and haha, canceling a proc that has begin transaction in it is grand fun too. I could swear that would automaticly result in rollback tran, but it doesnt.

Still unable to make it fail cleanly.

Dorus

@blakeyrat said:

@Jaime said:
@blakeyrat said:
Yeah, uh. I solved the problem.
But go ahead and do whatever.
You should know better than anyone else that this site isn't about helping the OP solve a problem. We live for off-topic posts.

If you guys worked for me, I'd so fire you.

"You're still working on that? It's already fixed! We have a million things to do!"

Hey, i'm at work now, having fun with sqlserver (that's completely unrelated to anything i do, i don't even get near a database lately). Sec, need to get that important email out i guess.

blakeyrat

@Dorus said:

Hey, i'm at work now, having fun with sqlserver (that's completely unrelated to anything i do, i don't even get near a database lately). Sec, need to get that important email out i guess.

Don't get me wrong, I'm all for slacking off at work. But I do it via. surfing the web, or watching YouTube, or going upstairs to play ping-pong or Xbox.

Scarlet_Manuka

@blakeyrat said:

But fair warning: people who type SQL in all-caps make me kill kittens.

+1 for hatred of SQL statements in all-caps. I find them far harder to read that way and always use lowercase for my queries.

dhromed

@Dorus said:

make variouse deathlocks,

blakeyrat

YOUNGBLOOD, BLOODSHOT, DEATHMATE RED, THIS BLOOD'S FOR YOU!