Surrogate vs natural primary keys



  • @tufty said in Surrogate vs natural primary keys:

    … and also a potential security / privacy hole. Keys should not be guessable.

    Especially don't use them if you're building military equipment.



  • Ok, surrogate keys have it.

    I also like to slap on a surrogate key first and ask questions later. But occasionally there's a situation like this, where it seems just superfluous to maintain two sets of keys.

    Remember, this makes all sorts of things more complicated. You need more code to cover fetches by both keys. You often need an additional select to convert one key to another. Or an extra join, in more involved queries.
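
    For example (imagining a countries table with both a surrogate id and a natural code, referenced from orders):

    -- Surrogate FK: fetching orders by the natural key costs an extra join
    SELECT o.order_id
    FROM orders o
    INNER JOIN countries c ON c.id = o.country_id
    WHERE c.code = 'DE';

    -- Natural key as PK/FK: the join disappears
    SELECT order_id
    FROM orders
    WHERE country_code = 'DE';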

    So there's always the temptation to just use a natural key. Sometimes I do it and it works out well. E.g. if your table is a glorified enum, I'd go with a natural key. Country codes, phone areas, genders, etc. All good candidates for natural keys.

    But yeah, in general I agree. If there's any doubt at all, go with the surrogate key and you're safe.



  • @cartman82 said in Surrogate vs natural primary keys:

    Country codes,

    Change all the fucking time. Residents of South Sudan hate your broken shit.

    @cartman82 said in Surrogate vs natural primary keys:

    phone areas

    Here in the US, those don't exist anymore. Phone numbers are completely portable, so the 3-digit "area code" is meaningless. Phone numbers are simply 10 digits long now.

    @cartman82 said in Surrogate vs natural primary keys:

    genders

    You'd create a table that's just male/female/unknown? Weirdo.



  • @cartman82 said in Surrogate vs natural primary keys:

    if your table is a glorified enum

    … you give it a surrogate key and a field containing its natural data.

    You don't use an enumerated data type or similar at the database level, either. Especially not if you're using MySQL.
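
    I.e. a lookup table along these lines (names are illustrative) instead of an ENUM column:

    CREATE TABLE order_status (
        id   INT NOT NULL PRIMARY KEY,      -- surrogate key
        name VARCHAR(20) NOT NULL UNIQUE    -- the natural data
    );

    CREATE TABLE orders (
        order_id  INT NOT NULL PRIMARY KEY,
        status_id INT NOT NULL,
        FOREIGN KEY (status_id) REFERENCES order_status (id)
    );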



  • @tufty said in Surrogate vs natural primary keys:

    You don't use an enumerated data type or similar at the database level, either. Especially not if you're using MySQL.

    I have to agree with you, though frankly the reasons why I agree piss me the hell off. As with more general sorts of user-defined domains, the 'standard' SQL approach to enumeration domains sucks donkey's balls; most SQL implementations don't handle user-defined domains worth shit, including enumeration domains, and when they do have something that is semi-usable, they are so incompatible with each other and with the POS they call a language standard that it is ludicrous.

    It should be possible to use enumerated domains. It should be possible to use them in a way that dovetails from one implementation to another with only minor hiccups. The fact that it isn't shows that both the standards committee and the commercial implementors are fucking morons who think a normalized relation is what their aunt became after her crazy second husband died and the doctors got her on Cymbalta and HRT.


  • Discourse touched me in a no-no place

    @BaconBits said in Surrogate vs natural primary keys:

    Now, think about GUIDs. They're essentially random. Indeed, the NEWID() function in SQL Server generates a v4 GUID, which is almost entirely random (6 bits are static). So if your uniqueidentifier column is the primary key and has the clustered index, then inserting data in your table means the server will essentially always have to move other data out of the way in the index. You can work around it using NEWSEQUENTIALID(), but I tend to just avoid them because GUIDs are kind of a pain in the ass to work with.

    I wonder how true this is--is it actually true or just in the abstract?

    A Progress database will allocate space in the database in "clusters" which are usually, say, 8K or so. It separates data and index clusters. What this means is that, in your example, there will usually be free space in the cluster, so the new entry will go in the empty space. You only move stuff rarely, because the engine will prefer to allocate a new cluster if it needs space. (I don't know what the details are on how individual keys in the index are stored.)



  • @RaceProUK said in Surrogate vs natural primary keys:

    The only times I've found order of insertion to be important, there's been a timestamp column that serves a business-relevant purpose already

    A timestamp column might work most of the time but it's still incorrect if the order of events really matters.
    It's possible for multiple records to be inserted at the exact same second (or whatever the smallest unit of time is in your favorite database).
    If you try to order those records by insertion time then some databases (e.g. PostgreSQL) may return them in an arbitrary, unstable order.
    So if you have record X at the bottom of page 1 then you might find that record X is also at the top of page 2 and record Y is nowhere to be found.
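
    If there's a surrogate key available, the usual fix is to add it as a tiebreaker when paging (events/created_at/id are made-up names):

    -- Timestamp alone gives an unstable order when values tie;
    -- the key column makes the paging deterministic
    SELECT *
    FROM events
    ORDER BY created_at, id
    OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;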


  • Discourse touched me in a no-no place

    @julmu said in Surrogate vs natural primary keys:

    It's possible for multiple records to be inserted at the exact same second (or whatever the smallest unit of time is in your favorite database).

    I've seen timestamps with greater than microsecond precision. If you're inserting more than a million records per second, you've got awesome hardware.


  • kills Dumbledore

    @julmu said in Surrogate vs natural primary keys:

    's possible for multiple records to be inserted at the exact same second

    CREATE TRIGGER hurdur ON stupidTable INSTEAD OF INSERT
    AS BEGIN
        -- "inserted" is read-only in a trigger, so copy the pending rows somewhere mutable
        SELECT * INTO #pending FROM inserted;
        WHILE EXISTS (SELECT * FROM #pending p INNER JOIN stupidTable s ON p.timestamp = s.timestamp)
        BEGIN
            UPDATE #pending SET timestamp = DATEADD(ms, 1, timestamp)
        END
        INSERT INTO stupidTable SELECT * FROM #pending
    END
    

  • Discourse touched me in a no-no place

    @cartman82 said in Surrogate vs natural primary keys:

    genders

    LOL. No.

    [image: survey form listing the gender options]

    That's

    • 22 (can choose more than one if they wish)
    • plus 'not sure' (possibly in conjunction with the 22 above and the one below)
    • 'mind your own business'
    • freeform text field.

  • FoxDev

    @PJH When did 'tomboy' become a gender? I always thought that was just a term for a girl who dresses and behaves in a boyish manner.

    Also, that's far too many options; all you need is 'Male', 'Female', 'Other', with a space to write in the 'Other'. Keeps the form simple without excluding any gender type.



  • From the image above:

    • Bi-Gender
    • Tri-Gender
    • All genders
    • Non-binary

    I'm confused, how many genders are there? Two, three, four, or twenty-two?


  • Discourse touched me in a no-no place

    @RaceProUK said in Surrogate vs natural primary keys:

    @PJH When did 'tomboy' become a gender?

    Pass.

    @Salamander said in Surrogate vs natural primary keys:

    I'm confused, how many genders are there? Two, three, four, or twenty-two?

    Yes. HTH, HAND etc.


  • BINNED

    @Salamander said in Surrogate vs natural primary keys:

    I'm confused, how many genders are there?

    About 7.5 × 10⁹ by the last count, apparently.

    I agree with @RaceProUK, just put in a text field next to Other, saves everyone's time.


  • ♿ (Parody)

    @PJH said in Surrogate vs natural primary keys:

    That's

    22 (can choose more than one if they wish)
    plus 'not sure' (possibly in conjunction with the 22 above and the one below)
    'mind your own business'
    freeform text field.

    I like how they included age in gender. I mean...I get that they say they're repeating what kids have said.

    👴 What gender are you?
    👶 Well, my gender fluid is a bit low. Top me off and let's find out.



  • @dkf said in Surrogate vs natural primary keys:

    I've seen timestamps with greater than microsecond precision. If you're inserting more than a million records per second, you've got awesome hardware.

    You don't have to insert more than a million records per second. You just have to have two users who happen to press a button at the same time.


  • BINNED

    @boomzilla said in Surrogate vs natural primary keys:

    Top me off and let's find out.

    Just when I got the hang of this 'Chill and Netflix' thing, there is a new phrase ...


  • FoxDev

    @julmu said in Surrogate vs natural primary keys:

    You just have to have two users who happened to press a button at the same time

    The chances of that are astronomical, especially when timing to the millisecond


  • @RaceProUK said in Surrogate vs natural primary keys:

    @julmu said in Surrogate vs natural primary keys:

    You just have to have two users who happened to press a button at the same time

    The chances of that are astronomical, especially when timing to the millisecond

    I just checked some tables in a database that has been used by 27000 users during the last 2 years.
    About 0.1% of records have been inserted at the exact same microsecond by different users.
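
    (Something like this, if you want to check your own data; table and column names are made up:)

    -- Find insertion timestamps shared by more than one record
    SELECT created_at, COUNT(*) AS collisions
    FROM records
    GROUP BY created_at
    HAVING COUNT(*) > 1;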


  • FoxDev

    @julmu said in Surrogate vs natural primary keys:

    0.1% of records

    An insignificant amount


  • ♿ (Parody)

    @RaceProUK said in Surrogate vs natural primary keys:

    @julmu said in Surrogate vs natural primary keys:

    0.1% of records

    An insignificant amount

    Says you. :moving_goal_post:


  • FoxDev

    @boomzilla said in Surrogate vs natural primary keys:

    :moving_goal_post:

    Oi! Bring those back! The game's not finished yet! :P


  • kills Dumbledore

    @RaceProUK Tell that to the user who just submitted their tax return and lost it because of some other inconsiderate jerk doing the same


  • FoxDev

    @Jaloopa That system clearly has more serious issues than matching timestamps


  • kills Dumbledore

    @RaceProUK Making the timestamp issue 🐄. I see



  • I'll throw my two cents in, because my day-to-day job involves managing billions of new rows daily over ~30 tables, one being a primary insert-only table.

    An identity field (int/bigint, serial/auto-increment) should be used when the data in a row may be altered, but you need to keep a stable reference to the row from another table. So in the case of a user table, having a serial field lets you maintain a user fact table which says that between date a and b (or on a given date) the user changed name from Jim to Bob. The row impacted is the user table, id 5867.

    Why is this important? It depends. If you don't care what a user is called, it doesn't matter. The serial field is then garbage, and actually worse for the table because it fucks your clustering. This can be mostly mitigated with a unique index, but it will never perform as well. Depending on the size of your table, it might not matter. (A few million rows with a good unique index means basically instant access.)

    If, on the other hand, you have a group that monitors user orders and audits purchase history and address changes, they may want to review the metadata for a specific user to look for fraud. Knowing when a name changed, and being able to associate it with a single person in a coherent, unbroken chain, an id field will let you query for a user easily, instead of tracking each name the user has ever had in your system.

    An auto-increment id is the throwaway solution to tracking things across the database and simplifying foreign keys, but if you don't use it for that purpose it's a dead field. It will still let you build out the system later (probably), so unless you have a specific reason not to use them, you should.
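
    A rough sketch of that pattern (table and column names invented for illustration):

    CREATE TABLE users (
        id   BIGINT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, stable across renames
        name NVARCHAR(100) NOT NULL
    );

    CREATE TABLE user_name_history (
        user_id    BIGINT NOT NULL REFERENCES users (id),
        name       NVARCHAR(100) NOT NULL,
        valid_from DATETIME2 NOT NULL,
        valid_to   DATETIME2 NULL                -- NULL = current name
    );

    -- "Between date a and b, user 5867 was called Jim" survives any rename,
    -- because the reference is to the id, not to the name.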

    Reasons to not use them:

    • Your data is massive. Specifically, your row count approaches or exceeds INT.MAXVALUE
      -- and you need to frequently query the raw, unaggregated data, potentially in multiple looks, for (raisins)

    tl;dr;

    Id fields are there to give system creators easy access to data, but when your primary use is reads and the data is enormous, identity fields are no longer the correct method, and you should use a derived key instead.


  • Discourse touched me in a no-no place

    @julmu said in Surrogate vs natural primary keys:

    You just have to have two users who happened to press a button at the same time.

    I was thinking about keeping nanosecond resolution (which works well enough with 64-bit integers for timestamps). That's theoretically enough precision to distinguish commits within a few CPU instructions of each other. Given that you're actually dealing with commits, i.e., writes to persistent storage, I think that's not a big problem even with SSDs.
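
    (For scale: a signed 64-bit counter holds 2^63 ≈ 9.2 × 10^18 nanoseconds, which is roughly 292 years, so an epoch-based 64-bit nanosecond timestamp has range to spare.)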



  • @dkf said in Surrogate vs natural primary keys:

    @julmu said in Surrogate vs natural primary keys:

    It's possible for multiple records to be inserted at the exact same second (or whatever the smallest unit of time is in your favorite database).

    I've seen timestamps with greater than microsecond precision. If you're inserting more than a million records per second, you've got awesome hardware.

    CREATE TABLE #Dates
    (
        Id INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
        TestDate DATETIME2(7) NOT NULL DEFAULT SYSDATETIME(),
        TestValue INT NOT NULL
    )
    
    INSERT INTO #Dates (TestDate, TestValue)
    SELECT TOP 5 SYSDATETIME(), 0
    FROM sys.objects;
    
    INSERT INTO #Dates (TestValue)
    SELECT TOP 5 object_id
    FROM sys.objects;
    
    SELECT Id, CONVERT(VARCHAR(30), TestDate, 126) TestDate, TestValue 
    FROM #Dates;
    
     Id     TestDate                     TestValue    
     -----  ---------------------------  ------------ 
     1      2016-04-06T10:54:01.7282540  0            
     2      2016-04-06T10:54:01.7282540  0            
     3      2016-04-06T10:54:01.7282540  0            
     4      2016-04-06T10:54:01.7282540  0            
     5      2016-04-06T10:54:01.7282540  0            
     6      2016-04-06T10:54:01.7332545  3            
     7      2016-04-06T10:54:01.7332545  5            
     8      2016-04-06T10:54:01.7332545  6            
     9      2016-04-06T10:54:01.7332545  7            
     10     2016-04-06T10:54:01.7332545  8            
    

    I got me some killer hardware, I guess.

    Edit: Apologies, I mistakenly used GETDATE() in my initial post. I should have used SYSDATETIME(), which returns a DATETIME2. I have corrected my code.

    You could also do something like:

    INSERT INTO #Dates (TestDate, TestValue)
    VALUES (SYSDATETIME(), 100),
    (SYSDATETIME(), 110),
    (SYSDATETIME(), 120);
    


  • @dkf said in Surrogate vs natural primary keys:

    I was thinking about keeping nanosecond resolution (which works well enough with 64-bit integers for timestamps). That's theoretically enough precision to distinguish commits within a few CPU instructions of each other. Given that you're actually dealing with commits, i.e., writes to persistent storage, I think that's not a big problem even with SSDs.

    Are there any databases that have nanosecond resolution timestamps by default?

    If I have a job to do then I would rather use the database's recommended method of guaranteeing a unique and sequential number than waste my time trying to find some other way to achieve the same thing.
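
    In SQL Server terms that would be an IDENTITY column or a SEQUENCE, e.g. (sequence name made up):

    -- A sequence hands out unique, monotonically increasing values
    CREATE SEQUENCE dbo.EventSeq AS BIGINT START WITH 1 INCREMENT BY 1;

    SELECT NEXT VALUE FOR dbo.EventSeq;  -- 1, then 2, then 3, ...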


  • kills Dumbledore

    @julmu said in Surrogate vs natural primary keys:

    Are there any databases that have nanosecond resolution timestamps by default?

    SQL Server's DATEADD and DATEDIFF functions support a Datepart as small as nanoseconds, but I doubt there's that much precision in the DATETIME datatype. IIRC, it's implemented internally as a float representing the number of days since the epoch


  • FoxDev

    @Jaloopa SQL Server datetimes are only accurate to ~3ms; datetime2 is accurate to 100ns



  • @FrostCat said in Surrogate vs natural primary keys:

    @BaconBits said in Surrogate vs natural primary keys:

    Now, think about GUIDs. They're essentially random. Indeed, the NEWID() function in SQL Server generates a v4 GUID, which is almost entirely random (6 bits are static). So if your uniqueidentifier column is the primary key and has the clustered index, then inserting data in your table means the server will essentially always have to move other data out of the way in the index. You can work around it using NEWSEQUENTIALID(), but I tend to just avoid them because GUIDs are kind of a pain in the ass to work with.

    I wonder how true this is--is it actually true or just in the abstract?

    A Progress database will allocate space in the database in "clusters" which are usually, say, 8K or so. It separates data and index clusters. What this means is that, in your example, there will usually be free space in the cluster, so the new entry will go in the empty space. You only move stuff rarely, because the engine will prefer to allocate a new cluster if it needs space. (I don't know what the details are on how individual keys in the index are stored.)

    If you really want to know the nitty-gritty, it's well documented on MSDN.



  • @Salamander said in Surrogate vs natural primary keys:

    • Bi-Gender
    • Tri-Gender
    • All genders
    • Non-binary

    I'm confused, how many genders are there? Two, three, four, or twenty-two?

    It's a continuum in multiple dimensions, actually (no, I am not joking, this really is the only sensible way of approaching it in cases like mine), but do you really want to go there when talking to kids? Especially when 98% of them will be hugging one of two poles on a single dimension of that continuum?

    Yeah, even I find this poll absurd, if only because people with significant gender variance are so uncommon, and the potential variance is so high, that once you find an edge case, you almost always need to deal with it uniquely anyway. Have an 'Other' option to trap the exceptions, but for fuck's sake don't try re-throwing that exception once you've caught it.

    The problem is that when you are talking about either gender identity or gender expression (especially when you have to zoom in on it close enough that differentiating them is meaningful - psychology, as a field, has a very high fractal dimension), there are biological, emotional, and social dimensions to the problem. A small part of it is genetic (more as a matter of tendencies than anything), much more of it is congenital (from biological but non-inherited effects due to hormone balance during gestation and similar factors), and all of it is colored and redirected by social norms and upbringing in ways that can only be described as chaotic, at least in the mathematical sense (and often in a more general sense as well).

    TL;DR - the whole thing is just TOCOTOX



  • @Matches said in Surrogate vs natural primary keys:

    The serial field is actually garbage, and is actually worse for the table because it fucks your clustering.

    Quite. That's another good reason not to use "serial / auto-increment" fields as surrogate keys, although if you're worrying about that you're probably committing premature optimisation.

    @BaconBits said in Surrogate vs natural primary keys:

    In SQL Server, usually the primary key is a clustered index. That means they're stored in sorted order on disk.

    Actually, no it doesn't mean that. What it means is that the index b-tree nodes contain the row data rather than a pointer to the data; the index is the data. Within a page, which may contain many b-tree nodes, the nodes themselves might not be in order.

    That said, inserting a non-sequential element into a clustered index may result in a page split, which will slow insertions very slightly.



  • @Jaloopa said in Surrogate vs natural primary keys:

    SQL Server's DATEADD and DATEDIFF functions support a Datepart as small as nanoseconds, but I doubt there's that much precision in the DATETIME datatype.

    DATETIME2 is pretty goddamned precise. I don't remember the actual value.

    The normal DATETIME is only accurate to 3 milliseconds or so.



  • @BaconBits said in Surrogate vs natural primary keys:

    Now, think about GUIDs. They're essentially random. Indeed, the NEWID() function in SQL Server generates a v4 GUID, which is almost entirely random (6 bits are static). So if your uniqueidentifier column is the primary key and has the clustered index, then inserting data in your table means the server will essentially always have to move other data out of the way in the index. You can work around it using NEWSEQUENTIALID(), but I tend to just avoid them because GUIDs are kind of a pain in the ass to work with.

    Interestingly, Oracle has the opposite problem: if you use sequential identifiers for very high data insertion scenarios, there is locking contention around the right-hand side of the index, since every new key lands in the same 'hot' leaf block. The problem (that we were, in fact, running into [or so my DBAs tell me]) is solved by switching to GUIDs (that aren't Oracle generated, because Oracle only generates sequential GUIDs...).



  • @blakeyrat said in Surrogate vs natural primary keys:

    The normal DATETIME is only accurate to 3 milliseconds or so.

    And only accurate for years after 1752.



  • @Groaner said in Surrogate vs natural primary keys:

    And only accurate for years after 1752.

    Which, ironically enough, is the year Oracle became shit.



  • @Groaner said in Surrogate vs natural primary keys:

    And only accurate for years after 1752.

    That's because it's explicitly a date in the Gregorian calendar, which wasn't generally adopted before then.

    You can't take any calendar system back before it existed with any rational meaning, so Microsoft made the right decision there. (The one exception is the Julian date count, which is specifically designed for archaeology. But very inconvenient for basically anything else.)



  • @julmu said in Surrogate vs natural primary keys:

    Are there any databases that have nanosecond resolution timestamps by default?

    Is there any common hardware with a nanosecond resolution clock? Most stuff runs on VMs nowadays which can have very interesting clock behavior.



  • @blakeyrat said in Surrogate vs natural primary keys:

    @Groaner said in Surrogate vs natural primary keys:

    And only accurate for years after 1752.

    That's because it's explicitly a date in the Gregorian calendar, which wasn't generally adopted before then.

    You can't take any calendar system back before it existed with any rational meaning, so Microsoft made the right decision there. (The one exception is the Julian date count, which is specifically designed for archaeology. But very inconvenient for basically anything else.)

    It gets even more complicated when you consider that different countries adopted the Gregorian calendar in different centuries. I once tried explaining the rationale behind 1753 as the epoch in SQL Server, and my boss said, "You know, if you ever go on a date, don't talk about that."



  • @Groaner I like how he said "if" and not "when".



  • @blakeyrat This guy was very classy and professional. He once rejected one of my proposals on the grounds that "it's more likely that you'll be married next year than that this will work out." There are a few reasons I no longer work for him.


  • BINNED

    @Groaner So did you get married? 🚎



  • @tufty said in Surrogate vs natural primary keys:

    @BaconBits said in Surrogate vs natural primary keys:

    In SQL Server, usually the primary key is a clustered index. That means they're stored in sorted order on disk.

    Actually, no it doesn't mean that. What it means is that the index b-tree nodes contain the row data rather than a pointer to the data; the index is the data. Within a page, which may contain many b-tree nodes, the nodes themselves might not be in order.

    A B-tree is sorted, and the underlying problem is due to the physical placement on disk. The specific details aren't really important, and most people, even programmers, don't really care.

    However, if you do want to see the details, you can do something like this:

    CREATE TABLE [dbo].[tblGUID_test] (
    	[id] [UNIQUEIDENTIFIER] NOT NULL
    	,[name] [CHAR](969) NOT NULL
    	,date_stamp DATETIME NOT NULL
    	,CONSTRAINT [PK_tblGUID_test] PRIMARY KEY CLUSTERED (id) WITH (PAD_INDEX = OFF,FILLFACTOR = 100)
    );
    GO
    
    CREATE TABLE [dbo].[tblBIGINT_test] (
    	[id] [BIGINT] NOT NULL
    	,[id2] [BIGINT] NOT NULL
    	,[name] [CHAR](969) NOT NULL
    	,date_stamp DATETIME NOT NULL
    	,CONSTRAINT [PK_tblBIGINT_test] PRIMARY KEY CLUSTERED (id,id2) WITH (PAD_INDEX = OFF,FILLFACTOR = 100)
    );
    GO
    
    TRUNCATE TABLE [dbo].[tblGUID_test];
    GO
    
    TRUNCATE TABLE [dbo].[tblBIGINT_test];
    GO
    
    SET NOCOUNT ON;
    GO
    
    DECLARE @n BIGINT = 1;
    
    WHILE @n < 80000
    BEGIN
    	INSERT INTO tblGUID_test (id, name, date_stamp)
    	VALUES (NEWID(), 'FOO', getdate());
    
    	SET @n = @n + 1;
    END
    GO
    
    DECLARE @n BIGINT = 1;
    
    WHILE @n < 80000
    BEGIN
    	INSERT INTO tblBIGINT_test (id, id2, NAME, date_stamp)
    	VALUES (@n, @n, 'FOO', getdate());
    
    	SET @n = @n + 1;
    END
    GO
    

    Then we can look at the index stats:

    SELECT 'tblGUID_test' AS tableName
    	,ixP.index_level
    	,CAST(ixP.avg_fragmentation_in_percent AS DECIMAL(5, 2)) AS avg_frag_pct
    	,ixP.page_count
    	,ixP.avg_record_size_in_bytes AS avg_bytes_per_record
    	,CAST(ixP.avg_page_space_used_in_percent AS DECIMAL(5, 2)) AS avg_page_fill_pct
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tblGUID_test'), 1, NULL, 'detailed') ixP
    INNER JOIN sys.indexes I
    	ON I.index_id = ixP.index_id
    		AND I.object_id = ixP.object_id;
    GO
    
    SELECT 'tblBIGINT_test' AS tableName
    	,ixP.index_level
    	,CAST(ixP.avg_fragmentation_in_percent AS DECIMAL(5, 2)) AS avg_frag_pct
    	,ixP.page_count
    	,ixP.avg_record_size_in_bytes AS avg_bytes_per_record
    	,CAST(ixP.avg_page_space_used_in_percent AS DECIMAL(5, 2)) AS avg_page_fill_pct
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('tblBIGINT_test'), 1, NULL, 'detailed') ixP
    INNER JOIN sys.indexes I
    	ON I.index_id = ixP.index_id
    		AND I.object_id = ixP.object_id;
    GO
    

    The results are not pretty:

    tableName    index_level avg_frag_pct                            page_count           avg_bytes_per_record   avg_page_fill_pct
    ------------ ----------- --------------------------------------- -------------------- ---------------------- ---------------------------------------
    tblGUID_test 0           99.24                                   14702                1000                   67.34
    tblGUID_test 1           100.00                                  65                   23                     69.84
    tblGUID_test 2           0.00                                    1                    23                     20.05
    
    tableName      index_level avg_frag_pct                            page_count           avg_bytes_per_record   avg_page_fill_pct
    -------------- ----------- --------------------------------------- -------------------- ---------------------- ---------------------------------------
    tblBIGINT_test 0           0.67                                    10000                1000                   99.01
    tblBIGINT_test 1           3.03                                    66                   23                     46.77
    tblBIGINT_test 2           0.00                                    1                    23                     20.36
    

    The key size is the same, which is expected with 2 BIGINTs and 1 GUID both being 128 bits. You can tell because the bytes per record at level 1 and 2 are the same. Index layer 0 is where the row data actually is, including the 969 length character field. Now, yes, I cheated and made it as bad as possible with FILLFACTOR = 100 and PAD_INDEX = OFF, but I'm just exacerbating the problem.

    The thing to look at is avg_frag_pct: 99.24% for GUIDs, and 0.67% for BIGINTs. For reference, Microsoft recommends that indexes should be reorganized at 5% fragmentation, and rebuilt completely at 30%. An index that's 99% fragmented is going to have severely degraded performance.

    ALTER INDEX ALL ON dbo.tblGUID_test REBUILD WITH (PAD_INDEX = OFF, FILLFACTOR = 100);
    ALTER INDEX ALL ON dbo.tblBIGINT_test REBUILD WITH (PAD_INDEX = OFF, FILLFACTOR = 100);
    

    And now they're about the same:

    tableName    index_level avg_frag_pct                            page_count           avg_bytes_per_record   avg_page_fill_pct
    ------------ ----------- --------------------------------------- -------------------- ---------------------- ---------------------------------------
    tblGUID_test 0           0.03                                    10000                1000                   99.01
    tblGUID_test 1           9.38                                    32                   23                     96.50
    tblGUID_test 2           0.00                                    1                    23                     9.86
    
    tableName      index_level avg_frag_pct                            page_count           avg_bytes_per_record   avg_page_fill_pct
    -------------- ----------- --------------------------------------- -------------------- ---------------------- ---------------------------------------
    tblBIGINT_test 0           0.05                                    10001                1000                   99.00
    tblBIGINT_test 1           6.25                                    32                   23                     96.51
    tblBIGINT_test 2           0.00                                    1                    23                     9.86
    

    But doing a REBUILD will take the index offline (unless you're running Enterprise) and is also pretty memory and disk intensive. It's really a nightly maintenance type of operation.
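
    (On Enterprise you can keep it online, e.g.:)

    ALTER INDEX ALL ON dbo.tblGUID_test REBUILD WITH (ONLINE = ON);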

    [Note: I stole most of these queries, like any good developer. However, the results are from my Dev system.]



  • @BaconBits said in Surrogate vs natural primary keys:

    I stole most of these queries, like any good developer

    Good. I wouldn't have wanted you to spend too much time on them, as they aren't measuring what you think they are measuring, or proving what you think they are proving.

    Back to what I originally took exception with,

    @BaconBits said in Surrogate vs natural primary keys:

    In SQL Server, usually the primary key is a clustered index. That means they're stored in sorted order on disk.

    And then said again

    @BaconBits said in Surrogate vs natural primary keys:

    the underlying problem is due to the physical placement on disk

    Now, consider your clustered index. Each row in a clustered index is a node in the index b-tree. A number of those nodes are packed into a logical block called an extent, and those extents are stored in a file. Addressing is therefore File:Extent:Node.

    Sybase^H^H^H^H^H^HSQL Server is relatively sane in how it allocates index nodes (regardless of whether the index is clustered or not); it keeps adjacent nodes in the same extent as far as possible. So an extent covers a range on index keys, and has (potentially) a pointer to the preceding and successive extents in the index. There is no need for nodes in the same extent to be stored "in order"; it's quite possible (and acceptable, performance-wise) to simply blat a new node on the end of the extent and fix up the pointers on the preceding and successive nodes (and before you say "ah, but what about the first and last nodes, surely you'd need to fix up stuff in the preceding or successive extents", there's perfectly obvious ways around that as well). And even if you do decide (for space reasons) to keep your nodes ordered in each extent, maintaining that invariant is an in-memory operation and massively dwarfed by the fact you're going to need to write the extent to storage in any case.

    Now, if we go inserting data into an index (randomly or sequentially), eventually we're going to hit the case where an extent is full. So we have to make a new extent for our data to go into. What generally happens (and certainly used to happen in Sybase, regardless of how we were inserting) is that the old extent is split pretty much 50/50 and a decision taken as to where to insert the new data. It's possible that this has been optimised in SQL Server for data with automatically-assigned incrementing keys, I don't know and can't be arsed to find out.

    So the problem with inserting random data into an index is absolutely not a case of …

    @BaconBits said in Surrogate vs natural primary keys:

    inserting data in your table means the server will essentially always have to move other data out of the way in the index.

    … because data only has to be moved when splitting, and that is only going to happen approximately every n inserts, where n is the capacity, in nodes, of an extent. Indeed, without optimisation you get more splits on a sequential insert, as you only get to insert n/2 nodes before having to split once more.

    On the other hand, inserting a lot of random data will (in the absence of deletions) tend to leave a lot of extents with only n/2 < x < n nodes in them, and the fragmentation measure you were using above measures exactly that - how much empty space is floating about in the nodes. Worst case from insertions only, obviously, is having an index that is 50% empty, and therefore being twice as big as it should be on disk (the only measure that matters, bandwidth-wise). Even rebuilding your index won't necessarily get to the "optimum" situation where all but one extent is full, either; to do so guarantees splitting down to maximum insert-only fragmentation if you ever insert out of order.
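
    (To put numbers on that using the figures upthread: ~1000-byte rows on 8 KB pages give n ≈ 8 rows per page. The 80,000 sequential inserts packed into ~10,000 pages at ~99% full, while the random GUID inserts left ~14,702 pages at ~67% full; that's pages sitting roughly two-thirds full after splits, not data being moved out of the way on every insert.)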

    So what about that

    physical placement on disk

    That's nothing to do with the database server. For starters, extents aren't necessarily allocated sequentially within a file, and underneath the file is the file system, which doesn't necessarily allocate sectors to the file in sequential order, or even from the same media (and may well not want to), and underneath that you have the media firmware which serves up physical sectors as logical sectors to the filesystem.

    TL;DR version : you're talking cobblers.



  • Also, forgot to mention : better hope there's nothing else touching your storage at all, and you're not trying to access more than one index or database at a time, 'cause if you are, it doesn't matter how scattered your extents are - if you're using spinning media you will be seeking anyway.



  • @BaconBits said in Surrogate vs natural primary keys:

    The thing to look at is avg_frag_pct. 99.24% for GUIDs, and 0.67% for BIGINTs. For reference, Microsoft recommends that Indexes should be reorganized at 5% fragmentation, and rebuilt completely at 30%. An index that's 99% fragmented is going to have severely degraded performance.

    That only happened because you truncated the table and inserted all of your data at once. You just simulated how a system will behave if it goes from completely empty to pretty damn big in one maintenance cycle. In the real world, you don't delete all of your data every day. Very few real world tables grow by more than a few percent per day.

    It turns out that if your usage is more typical - grow a bit - rebuild indexes - grow a bit more, then clustering on something well distributed combined with a modest fillfactor almost completely eliminates fragmentation.
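
    Something like this as the nightly job (the fill factor here is arbitrary, for illustration):

    -- Leave ~20% free space per leaf page so random inserts land in existing pages
    ALTER INDEX ALL ON dbo.tblGUID_test REBUILD WITH (FILLFACTOR = 80);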



  • @tufty said in Surrogate vs natural primary keys:

    Also, forgot to mention : better hope there's nothing else touching your storage at all, and you're not trying to access more than one index or database at a time, 'cause if you are, it doesn't matter how scattered your extents are - if you're using spinning media you will be seeking anyway.

    Yup. Most of us aren't in a scenario where we have platters dedicated to a single index. Hell, most of us aren't in a scenario where we even have a predictable mapping between drive letters and physical disks, due to server and storage virtualization.

