SQL Server transaction deadlocking with itself?

Bulb

I have a weird problem with Azure SQL Database here. I have a bunch of upgrade SQL scripts (they have some SMSS/sqlcmd meta-commands in them, so they have to be run with that), and a wrapper script that finds which are new, concatenates them, wraps them in

set xact_abort on
go
begin transaction release_some_unique_id_bflm_psvz
go

and

rollback transaction release_some_unique_id_bflm_psvz
go

and feeds that into sqlcmd. There is rollback, because this is a dry-run, executed against a copy of the production database, to make sure the scripts actually work before they are handed over to the admin of the production environment, who will execute them similarly, but with commit at the end.

The scripts do contain nested transactions, but that should be supported by SQL Server, and usually works fine. But this time around I'm getting

Msg 1205, Level 13, State 55, Server bflmpsvz, Line 3
Transaction (Process ID 666) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

which makes no eFfing sense. It is a copy of the database. No other queries are running against it, and this query, though it contains some nested transactions, is just one process, so it shouldn't be trying to do anything in parallel either. But the error is reproducible. Every time we tried to run it against that clone, it failed at the same place with the same error!

Does anybody have an idea how the eF is that possible?

And it gets better! On a newer snapshot—that might have a bit of data changes, but no schema changes—it works fine. It is that one snapshot with which it throws that deadlock error. Reproducibly over the two weeks since that snapshot was made!

boomzilla

@Bulb is there some other open transaction there on the old clone? Or at least, does it think there's one?

Way beyond my skis here. My Sql Server experience is thin and never used the Azure version.

Bulb

@boomzilla said in SQL Server transaction deadlocking with itself?:

@Bulb is there some other open transaction there on the old clone? Or at least, does it think there's one?

I am fairly confident there is not. It seems to think there is, but I don't know enough about it to pick its brain and find out what it thinks.

Kamil Podlesak

Just an idea: maybe some of the (sub-)queries are actually run on the original database? I am pretty sure that running queries in remote db is possible, and I would not be surprised if the cloning process somehow created a database link....

cheong

Try take a look at locks if able.

Depending on what you're trying to do, the locked resource could be file as well.

iKnowItsLame

@Bulb

".. a wrapper script that finds which are new, concatenates them, wraps them .."
are you quite sure that if you use a different -working- snapshot this thing picks up the same scripts?

-) some trigger on some_Table which causes a loop?
e.g. script#123:

declare myCur cursor for select ...
open myCur
fetch next from myCur into...
while @@fetch_status = 0 begin
 --some doing here which causes the _cursor_ to re-run (*)
fetch next from myCur into...
end

(*) basically this one, scroll down to the repro-answer: https://stackoverflow.com/questions/62045845/cursor-in-sql-runs-multiple-times-when-there-is-only-1-record-in-it

Anyways: I do not think that this has something to do with Azure but with some fuckery within this one Snapshot.

Therefore my advice (c) some random guy via Internet:
-) re-test with as many snapshots as possible, if it runs fine except with this one...
-) depending on your downtime-window / script-runtimehave PROD-Admin run this thing with

rollback release_some_unique_id_bflm_psvz

first

Bulb

@iKnowItsLame said in SQL Server transaction deadlocking with itself?:

@Bulb

".. a wrapper script that finds which are new, concatenates them, wraps them .."
are you quite sure that if you use a different -working- snapshot this thing picks up the same scripts?

Yes, I compared the assembled scripts.

-) some trigger on some_Table which causes a loop?
e.g. script#123:
declare myCur cursor for select ...
open myCur
fetch next from myCur into...
while @@fetch_status = 0 begin
 --some doing here which causes the _cursor_ to re-run (*)
fetch next from myCur into...
end
(*) basically this one, scroll down to the repro-answer: https://stackoverflow.com/questions/62045845/cursor-in-sql-runs-multiple-times-when-there-is-only-1-record-in-it

The schema doesn't differ between the working and non-working copy of the database, which means triggers don't either. And the script itself does not use any cursors, and does not switch contexts.

Anyways: I do not think that this has something to do with Azure but with some fuckery within this one Snapshot.

Therefore my advice (c) some random guy via Internet:
-) re-test with as many snapshots as possible, if it runs fine except with this one...

We have 3 copies with which it does not work so far and at least 3 copies with which it does work. The ones with which it does not work are from the same time.

-) depending on your downtime-window / script-runtimehave PROD-Admin run this thing with
rollback release_some_unique_id_bflm_psvz
first

You mean put a dummy rollback at the beginning … we can try. It's a copy, so it has no uptime to worry about.

All the documentation suggests there must be some other session to cause a deadlock. Do you know where we should look for which session that is? The message only lists the ID of the session it kills, not the ID of the session that it stepped on.

Mason_Wheeler

@Bulb said in SQL Server transaction deadlocking with itself?:

Do you know where we should look for which session that is?

Look in the SQL Server Profiler. Run a trace and it should show you what's going on.

iKnowItsLame

@Bulb said in SQL Server transaction deadlocking with itself?:

You mean put a dummy rollback at the beginning … we can try. It's a copy, so it has no uptime to worry about.

No. At some time in the future this will run against PROD and -depending on your usecase- this might require downtime + it is "hands off 'cos it's done by admin".

If I read this correctly at the moment it's 50:50 (3 "good / upgradeable" snapshots, 3 are bad) but now my question is:
even when your upgrade-script doesn't use any cursors -> what about already existing DB-Triggers which could step/fall over their toes?
/disable all triggers on the DB -> run scripts -> enable all triggers might fix this but also can open a completely different can of worms. but hey, on a snapshot: why not?
-> just remember to really really check everything afterwards; ask me why I know this.

Anyway: I would open up to admin re "we've got no clue what is going on here so please make sure that your backups are working & be prepared to restore if..."
Of course you always can pad it with/blame Azure ("three snapshots from the same time -> NOK, all others OK so duh...") but still...

@Mason_Wheeler

Look in the SQL Server Profiler. Run a trace and it should show you what's going on.

Maybe. Depending on how much is going on here + Profiler-Settings you either drown in line noise, see nothing of consequence or the Profiler barfs out after 10mins.
IDK: is it even possible to run a Profiler against Azure? All my MSSQL-DBs are onPrem and ...

Mason_Wheeler

@iKnowItsLame said in SQL Server transaction deadlocking with itself?:

Maybe. Depending on how much is going on here + Profiler-Settings you either drown in line noise, see nothing of consequence or the Profiler barfs out after 10mins.

It's been a while since I had to do this, so I don't remember exactly how it's done, but there is a way to take a profiler trace, save it to a database table, and run SQL queries against it. That makes "drowning in noise" easy to deal with.

IDK: is it even possible to run a Profiler against Azure? All my MSSQL-DBs are onPrem and ...

Apparently you can't use the standard SQL Server profiler, but there is a different tool for it:

erinstellato-ms

SQL Server Profiler extension - Azure Data Studio

Learn how to install and use the SQL Server Profiler extension. An easy-to-use SQL Server tracing solution similar to the SQL Server Management (SSMS) Profiler.

Bulb

Colleague found a “fix”. And it's even more surprising:

In the middle of the long upgrade script, there is a MERGE statement that inserts or replaces all the 6 values in a simple code table (it has a numeric key, string key and a key to translations table). That MERGE is where the deadlock is reported.

The values are already set in all the relevant copies of the database, so that MERGE does not change the content of the table, though it does SET all the rows to the same values they were before. And doing that fixes the database and then the script works fine.

The merge looks like:

MERGE INTO [pa].[Tejbl] AS T
USING (
    VALUES 
        (N'Nejm',   N'Tejbl.Nejm',   N'Blah blah blah.'),
        -- 5 more rows
)
AS S ([Key], [LocKey], [Description])
ON (T.[Key] = S.[Key])
WHEN MATCHED THEN 
    UPDATE SET T.[LocKey] = S.[LocKey], T.[Description] = S.[Description]
WHEN NOT MATCHED BY TARGET THEN 
    INSERT ([Key], [LocKey], [Description]) VALUES (S.[Key], S.[LocKey], S.[Description])
WHEN NOT MATCHED BY SOURCE THEN DELETE;

So the query forces the exact content of the table except the integer primary keys (that are autoincrement) are left up to the database engine, and the content was already filled, so the UPDATE branch runs for all 6 rows and overwrites the values with the same values. I queries both the database where the script works and the one where it does not and the content is the same. But running it still fixes the database.

Colleague found somewehre something about wrong indices possibly causing deadlocks and thought maybe a broken index could be restored from that backup. Makes no sense, but the whole symptom doesn't…

iKnowItsLame

@Bulb
ahh... MERGE.

Merge IME seems to be a solution looking for a problem (https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/), the umpteen times where I thought "hey that is exactly what I'm looking for!" it either simply didn't work (cause: most likely me) or could not work (e.g. need to audit ..action.. somewhere and one can't simply fire "exec createAudit @blah=blubb, @frob = bar" inline) and so on.

just don't. "use a cursor and ~~have fun~~get fucked somewhere else"

Summary: good catch, these errors are the hardest to squash.

Bulb

@iKnowItsLame This merge is just an INSERT OR REPLACE of other databases. It's a bit verbose for the purpose, but seems sound.

Anyway, I just tried to make yet another copy and run that MERGE in a rollbacked transaction! And … the whole script started working afterwards. And no, the script does not touch that table before the point where it fails.

And since the MERGE will write the values, even if they are still the same, it's still being written the next time around.

The table is somehow messed up, but the fact it comes messed up on restoring from a backup bothers me.

iKnowItsLame

@Bulb

This merge is just an INSERT OR REPLACE of other databases.

Just don't try to compare [any database] with MSSQL (or any database with any database, same difference); they seem to behave the same way until they aren't.

The table is somehow messed up, but the fact it comes messed up on restoring from a backup bothers me.

That might be a different topic altogether: restore from the same backup or from one "just ..minutes later.. where the content of this table could be different?

Jaime

@Bulb BTW, I see that you already found the cause, but here's how to get SQL Server to tell you the cause:

WilliamDAssafMSFT

Use the system_health session - SQL Server

The system_health Extended Events session is included with SQL Server. This session collects system data to troubleshoot database engine performance.

Bulb

@iKnowItsLame said in SQL Server transaction deadlocking with itself?:

@Bulb

This merge is just an INSERT OR REPLACE of other databases.

Just don't try to compare [any database] with MSSQL (or any database with any database, same difference); they seem to behave the same way until they aren't.

Yeah, MSSQL has its quirks.

And, well, I tried to do just update to the same values, because that's what the merge is supposed to do on all rows anyway, but no, that does not fix the table, only the merge does. So yeah, merge seems to be magic.

The table is somehow messed up, but the fact it comes messed up on restoring from a backup bothers me.

That might be a different topic altogether: restore from the same backup or from one "just ..minutes later.. where the content of this table could be different?

All the instances where it fails ultimately come from the same backup. The names suggest they come from two backups, but colleague said they don't. Copying a broken database results in another equally broken database though.

Hm, I believe we already had some non-working copies in the past that worked on another attempt…

@Jaime said in SQL Server transaction deadlocking with itself?:

@Bulb BTW, I see that you already found the cause, but here's how to get SQL Server to tell you the cause:

We didn't. We found a dead chicken, which, when waved around the database, makes the problem go away. But we have no eFing clue what's actually wrong, or even how to detect it next time it happens.

WilliamDAssafMSFT

Use the system_health session - SQL Server

The system_health Extended Events session is included with SQL Server. This session collects system data to troubleshoot database engine performance.

is included by default with SQL Server and Azure SQL Managed Instance.

wrong version. This is Azure SQL Database. It does not have it.

Jaime

@Bulb The last paragraph is titled "The system_health session in Azure SQL"

Jaime

@Bulb If this weren't Azure, I'd have just said "Turn on trace flag 1222", because that's been the standard way to get to the bottom of deadlocks since forever. However, trace flags are system-wide and you don't own the system on Azure... so that's not an option.

In short... the cloud sucks.

TwelveBaud

@Jaime You could at least quote the relevant portion for @Bulb...

There's no built-in system_health Extended Event session in Azure SQL Database, but you can use the sys.fn_xe_file_target_read_file() function to read from Extended Event sessions you create yourself and store in Azure Storage. For a walkthrough, see Event File target code for Extended Events in Azure SQL Database.

Mason_Wheeler

@Jaime said in SQL Server transaction deadlocking with itself?:

In short... the cloud sucks.

That thread is

Kamil Podlesak

@Bulb said in SQL Server transaction deadlocking with itself?:

@iKnowItsLame This merge is just an INSERT OR REPLACE of other databases. It's a bit verbose for the purpose, but seems sound.

Obligatory note: MERGE originates from DB2.

Which actually makes the point stronger, because "Microsoft trying to emulate IBM" is a special class of... "fun".

Watson

@Kamil-Podlesak said in SQL Server transaction deadlocking with itself?:

@Bulb said in SQL Server transaction deadlocking with itself?:

@iKnowItsLame This merge is just an INSERT OR REPLACE of other databases. It's a bit verbose for the purpose, but seems sound.

Obligatory note: MERGE originates from DB2.

Which actually makes the point stronger, because "Microsoft trying to emulate IBM" is a special class of... "fun".

It's also been adopted into standard SQL, which just means everyone else is going to have to emulate IBM as well.

Arantor

@Watson there was a time when IBM set actual standards for quality that didn’t mean dogshit. I feel like DB2 is a relic of that era.