Probably a simple T-SQL question but my brain is fried



  • I have a bunch of tables in T-SQL which are designed like this:

    CREATE TABLE [dbo].[Event] (
        [EventId]         UNIQUEIDENTIFIER CONSTRAINT [DF_Event_EventId] DEFAULT (newsequentialid()) NOT NULL,
    

    Basically, the IDs for them are auto-generated from NewSequentialId() defaults when you insert rows. So far so good.

    Now the problem is I'm going to have X webservers hammering this database server, so I need to ensure I have safe upserts using the Merge syntax in a stored procedure. This stored procedure should return the DB row that was either edited or updated in its entirety.

    Is the only option here to use:

    WHEN MATCHED THEN
      UPDATE SET @Id = table.id, SET @blah = table.blah etc for every single goddamned column
    

    ?

    Is there a cleaner way to do this, that still obtains the columns from the table schema itself (like a SELECT TOP 1 *) but doesn't require me to manually list each column in the sproc? Note that I can't use SCOPE_IDENTITY() as my primary keys are filled via. a default value and not an IDENTITY.

    I also just noticed, what do you do if the record doesn't match?

    Is there a clean way to get what I want without having to put two separate queries in the same sproc?

    Thanks.


  • Garbage Person

    @blakeyrat said in Probably a simple T-SQL question but my brain is fried:

    Now the problem is I'm going to have X webservers hammering this database server, so I need to ensure I have safe upserts using the Merge syntax in a stored procedure. This stored procedure should return the DB row that was either edited or updated in its entirety.

    If i am understanding, merge should be unnecessary because of locking.



  • @weng said in Probably a simple T-SQL question but my brain is fried:

    If i am understanding, merge should be unnecessary because of locking.

    I'd rather not have to lock the DB for multiple queries. (You're suggesting I should do the old fashioned "SELECT (table contents) with UPDLOCK, if result = 0 rows, then INSERT else SELECT") logic, yes?

    I need to be cognisant that this table is going to be hammered (10 million + rows a day) and GUID keys are most costly than int keys.


  • Fake News

    I was going to mumble something about rowversion to compare with the previous value of that in your UPDATE ... WHERE , but reading your post again has me all confused what you intend to do.

    Don't bother explaining though as I might not be able to help any further anyway.



  • @jbert I want to upsert with MERGE, then get the entire row that was just upserted as output. To put it in one sentence.

    The rest of the stuff is mostly fluff and "stuff that means your naive solution might not work" like the inability to use SCOPE_IDENTITY().



  • If you have an ID, isn't that pretty much a guarantee that the row either exists or that the ID is invalid?


  • Garbage Person

    Maybe I'm confused a bit.

    Newsequentialid() should return a duplicate GUID a grand total of approximately never (a unique constraint on the column and having the client retry for a constraints violation would be the thing to do there, assuming you have influence over the client)

    Do you have an optional parameter where the client supplies a GUID and you upsert on that instead of inserting with the default?

    Is this question not about GUID collisions at all?



  • @ben_lubar said in Probably a simple T-SQL question but my brain is fried:

    If you have an ID, isn't that pretty much a guarantee that the row either exists or that the ID is invalid?

    I do have an ID when I call this sproc, but it's not in the database and it's not suitable for using as a primary key in the database (it's non-sequential, which would make indexing performance utterly shit. It's also not fixed-length, at least I don't think it is).

    Look, explained another way:

    I need to ask the database if it's seen this BullshitID before. If it has, I want it to update the rest of its info with the info I gave. If it hasn't, I want it to insert the BullshitID along with all the rest of the info, and also create a new database ID to identify the row. In either case, I want the entire row as output. And ideally, this happens in a single atomically-locked SQL statement without requiring any manual locking hints.



  • @weng said in Probably a simple T-SQL question but my brain is fried:

    Newsequentialid() should return a duplicate GUID a grand total of approximately ne

    Irrelevant to the question.

    @weng said in Probably a simple T-SQL question but my brain is fried:

    Do you have an optional parameter where the client supplies a GUID and you upsert on that instead of inserting with the default?

    I can't parse that sentence.

    @weng said in Probably a simple T-SQL question but my brain is fried:

    Is this question not about GUID collisions at all?

    Nope.

    I'm honestly not sure where I'm going wrong in explaining this. It's really simple in my head:

    I want to upsert a row, and get the entire row back (including the database's ID) as output.



  • @blakeyrat
    I think this is probably one of the few areas where MySQL does it better than SQL Server, providing syntactic sugar for INSERT ... ON DUPLICATE KEY UPDATE

    Given that T-SQL doesn't have that construct, your four main options are:

    1. Have the application determine whether it needs to do an INSERT or UPDATE itself
    2. Have the application call a stored procedure that determines whether it needs to do an INSERT or an UPDATE and fire the appropriate call from within an IF branch in the stored procedure.
    3. Use a MERGE statement
    4. Use an INSTEAD OF trigger for either INSERT or UPDATE to override the application's query as appropriate (so if you use INSTEAD OF INSERT, then it checks to see if the key already exists and changes to an UPDATE).

    From a performance & reliability standpoint, option #1 or #2 are the best, though both suffer from code duplication. And, obviously, in the stored procedure, you'd have to update the schema of the stored procedure whenever the schema of the underlying table changed.

    I wouldn't recommend the MERGE statement (option #3) because of your expected concurrency & volume requirements. There have been a number of race condition / concurrency bugs around the MERGE statement, some of which have been closed as "won't fix" in certain versions of SQL Server. To the point that prevailing wisdom is to take out locks when using a MERGE statement for this sort of work (see the accepted answer to https://stackoverflow.com/questions/41882471/ms-sql-equivalent-of-on-duplicate-key-update-upsert and the question it links to).

    The INSTEAD OF trigger (option #4) can have the advantage of making the INSERT / UPDATE choice transparent to the application without as much code duplication as the stored procedure, but you still have the hard dependency on updating it whenever the table's schema changes, and now that's hidden in a trigger that's a lot harder for most Schema Compare tools and developers to discover. So I feel like this option is less good than the two-branch stored procedure, since it's less obvious what's going on and much more likely to be a future time bomb.

    So, overall, my recommendation would be to have a single stored procedure that the application can call, which then makes the appropriate INSERT or UPDATE decision and calls the necessary operation on the underlying table.


  • Garbage Person

    Ah. I see. The ID you're upserting on ISN'T the newsequentialid.

    Makes sense now.

    Having consulted the SQL internals bible, merge is the way to go.

    As for 'not having to painfully enumerate every column in the table', you could exec some dynamic SQL based on the table schema system tables, but that won't be peformant and would see you hanged in my department.


  • Garbage Person

    @blakeyrat said in Probably a simple T-SQL question but my brain is fried:

    I also just noticed, what do you do if the record doesn't match?

    WHEN NOT MATCHED THEN....

    and WHEN NOT MATCHED BY SOURCE THEN....

    clauses.



  • @izzion said in Probably a simple T-SQL question but my brain is fried:

    Have the application determine whether it needs to do an INSERT or UPDATE itself

    Not really an option due to race conditions.

    @izzion said in Probably a simple T-SQL question but my brain is fried:

    Have the application call a stored procedure that determines whether it needs to do an INSERT or an UPDATE and fire the appropriate call from within an IF branch in the stored procedure.
    Use a MERGE statement

    Looks like I need to do a combination of these two, followed by a SELECT. Sigh.

    The "good" news is I can at least pull the database ID out of the MERGE statement. So the SELECT can be simple.

    @izzion said in Probably a simple T-SQL question but my brain is fried:

    I wouldn't recommend the MERGE statement (option #3) because of your expected concurrency & volume requirements. There have been a number of race condition / concurrency bugs around the MERGE statement, some of which have been closed as "won't fix" in certain versions of SQL Server.

    Seriously? Jesus.

    @izzion said in Probably a simple T-SQL question but my brain is fried:

    The INSTEAD OF trigger (option #4) can have the advantage of making the INSERT / UPDATE choice transparent to the application without as much code duplication as the stored procedure, but you still have the hard dependency on updating it whenever the table's schema changes, and now that's hidden in a trigger that's a lot harder for most Schema Compare tools and developers to discover. So I feel like this option is less good than the two-branch stored procedure, since it's less obvious what's going on and much more likely to be a future time bomb.

    Agreed on that. Sprocs are easy to write and understand.

    @izzion said in Probably a simple T-SQL question but my brain is fried:

    So, overall, my recommendation would be to have a single stored procedure that the application can call, which then makes the appropriate INSERT or UPDATE decision and calls the necessary operation on the underlying table.

    I guess I have to if MERGE is unreliable, but it's shocking to me that there's no better way.



  • @blakeyrat
    You should be able to use OUTPUT INSERTED at the end of the stored procedure to avoid potential contention from doing a SELECT along with the write operation.



  • @weng said in Probably a simple T-SQL question but my brain is fried:

    As for 'not having to painfully enumerate every column in the table', you could exec some dynamic SQL based on the table schema system tables, but that won't be peformant and would see you hanged in my department.

    I built a replication service that does something similar to this, which we've used for about half a decade. It's great because it's tolerant of schemata that don't exactly match, which is helpful when you have multiple servers with unreliable network connectivity. And apart from a couple tables with millions of rows with inconvenient PKs, it's pretty performant. If the cost of a recompile from the dynamic SQL is too much to bear, I suppose you could implement DDL triggers that bake the results to stored procedures every time there's a significant schema change. Do I still get hanged?


  • Garbage Person

    @groaner said in Probably a simple T-SQL question but my brain is fried:

    @weng said in Probably a simple T-SQL question but my brain is fried:

    As for 'not having to painfully enumerate every column in the table', you could exec some dynamic SQL based on the table schema system tables, but that won't be peformant and would see you hanged in my department.

    I built a replication service that does something similar to this, which we've used for about half a decade. It's great because it's tolerant of schemata that don't exactly match, which is helpful when you have multiple servers with unreliable network connectivity. And apart from a couple tables with millions of rows with inconvenient PKs, it's pretty performant. If the cost of a recompile from the dynamic SQL is too much to bear, I suppose you could implement DDL triggers that bake the results to stored procedures every time there's a significant schema change. Do I still get hanged?

    You get hanged thrice.

    Once for dynamic SQL.
    Once for mismatched schema.
    Once for DIY replication.


  • Garbage Person

    @izzion said in Probably a simple T-SQL question but my brain is fried:

    I wouldn't recommend the MERGE statement (option #3) because of your expected concurrency & volume requirements. There have been a number of race condition / concurrency bugs around the MERGE statement, some of which have been closed as "won't fix" in certain versions of SQL Server. To the point that prevailing wisdom is to take out locks when using a MERGE statement for this sort of work (see the accepted answer to https://stackoverflow.com/questions/41882471/ms-sql-equivalent-of-on-duplicate-key-update-upsert and the question it links to).

    I know Merge had WONTFIX bugs in the past (they will never introduce a change in a dot release that causes a query plan to compile differently, except occasionally as opt-in flags when there's no real workaround) but those almost always get taken care of in the next major release. I kind of assumed they were done by 2014.



  • @weng said in Probably a simple T-SQL question but my brain is fried:

    I know Merge had WONTFIX bugs in the past (they will never introduce a change in a dot release that causes a query plan to compile differently, except occasionally as opt-in flags when there's no real workaround) but those almost always get taken care of in the next major release. I kind of assumed they were done by 2014.

    Well at this point I'm 75% done writing the queries the old fashioned way, so I'll keep doing that and add a backlog item to check into switching them to MERGE later.



  • @weng said in Probably a simple T-SQL question but my brain is fried:

    You get hanged thrice.

    Sweet!

    Once for dynamic SQL.

    Everything comes from the system catalogs and is QUOTENAME'd. What's the harm? If you outlaw dynamic SQL, you basically outlaw dynamic search conditions, which were a hard requirement at one of the places I've worked. How else do you allow users to filter, for example, for all open orders that have 3 or more items, with a total greater than $100, initiated by users who are currently Gold Members?

    Once for mismatched schema.

    We have multiple customers (each of which is sort of like a sub-department) with differing requirements that feed shared data into a central database for reporting. Changes to each application (and underlying database) go through a change control process that must be ratified by the respective customer. So something approved by sub-department A may not necessarily have been approved by sub-department B, yet the group above A and B still wants to see their aggregate data. As you might expect, this leads to some divergence, and the only person for whom it could really be an issue is me. And I've fixed that issue. Otherwise, I'd have to update my application every time any developer adds a column to any table that is replicated.

    Once for DIY replication.

    SQL Server replication was ruled out due to our requirements (and the network partitioning between various nodes). There are complex business rules involving ownership of records (for example, one node is allowed to work on a record once it's in a certain status, and others aren't). If not built into the replication process, you'd have to encode all that same logic and complexity into an SSIS package or something.



  • @weng
    Maybe it is, maybe it isn't... I could swear I saw a recentish post on Brent Ozar's blog detailing still open issues with MERGE in 2014+, but my search-fu isn't finding it tonight, just articles and SO questions from the 2012-2014 era when the HOLDLOCK hint was de facto required because of all the "fun" with MERGE in 2008. I know that I've seen reproductions with current versions of SQL how all of your rows still wind up "going through" all of the legs of your merge, from a ROWCOUNT / trigger perspective, so there's certainly potential for ugly lurking there that you just won't get with a manually separated INSERT and UPDATE.

    Edit to add: So I don't know if it's just one of those features that wasn't ready for prime time when it first went in, and it's just been stuck with that label ever since despite improvements, or whether it's actually in fact still busted.


Log in to reply