I'm no DBA....



  • @RaspenJho said:

    Can you have two of the same email addresses at the same time?

    Yes.  If you have two different hosts with the same name, they can provide the same email address to different users.  And if your DNS lists A records for both of those IPs, then sometimes email to that address will go to one user, and sometimes to another.

    Incorrect, broken, and undesired, but I've seen it happen.  The same setup also bounced nearly half of the email for that domain as 'user unknown', because we were talking to the wrong host at the time.

    This can also happen for more nefarious reasons, but I wouldn't expect those addresses to make it to the general Internet.



  • Some more shenanigans from the same database.  I have been finding more and piling them up as time goes on, but here are some gems:

     

    We see a lot of disk and CPU bottlenecking at times of heavy query load on our production server.  I was curious as to why, so I went and looked at the RFLog table again to start (the table mentioned at the beginning).  I had not yet examined the datatypes, and the first thing I see is absolutely horrific.  Two of the three text fields are char(30).  These are populated by an IP address and a name.  Most of these values are 5-15 characters long (maybe they wanted to be IPv6-compatible? :P)

    Further down, it gets worse.  The log field that describes the error message is a char(200) field.  Most messages are less than 50 characters; no message is longer than 100.  Suspecting foul play, I began looking at all the other tables and found that there is no varchar usage at all.  Char is used for almost all numbers, big and small, and for every text field.  There are a few char(1000) fields.  In one table, 95% of the values in the char(1000) field are no more than 10 characters.

     

    If I had to guess as to why it's this way, I would guess it's because of the way formatting is done on the front end; they likely couldn't figure out how to pad there and just decided to pad with the database.  No joke.
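
    For illustration, the layout described would look something like this (a reconstruction; only the RFLog table name comes from the post, the column names are guessed):

    -- Sketch of the table as described above
    CREATE TABLE dbo.RFLog (
        ip      char(30),     -- actual values are 5-15 characters
        name    char(30),
        message char(200)     -- no message is longer than 100 characters
    );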


  • No DBA either, but....

    @Volcanon said:

    Two of the three text fields are char(30).  These are populated by an IP address and a name.  Most of these values are 5-15 characters long (maybe they wanted to be IPv6-compatible? :P)

    Is this really an issue?  In any DB?  I was under the (possibly wrong) impression that char was preferable to varchar.  OK, an IP address may be 4 or 16 bytes, so use char(16).

    My point being, would varchar be any better?



    @Volcanon said:
    Further down, it gets worse.  The log field that describes the error message is a char(200) field.  Most messages are less than 50 characters; no message is longer than 100.  Suspecting foul play, I began looking at all the other tables and found that there is no varchar usage at all.

    Again, would varchar actually gain you anything in this particular situation?  Either in time or space?



    @Volcanon said:
    Char is used for almost all numbers, big and small, and for every text field.

    I see a potential problem there, if all numbers are being stored as text instead of in their 'natural' format.  But that's more related to the code sorting out the crap it gets back from the database.




  •  @PJH said:

    No DBA either, but....
    @Volcanon said:
    Two of the three text fields are char(30).  These are populated by an IP address and a name.  Most of these values are 5-15 characters long (maybe they wanted to be IPv6-compatible? :P)
    Is this really an issue?  In any DB?  I was under the (possibly wrong) impression that char was preferable to varchar.  OK, an IP address may be 4 or 16 bytes, so use char(16).
    My point being, would varchar be any better?

    @Volcanon said:
    Further down, it gets worse.  The log field that describes the error message is a char(200) field.  Most messages are less than 50 characters; no message is longer than 100.  Suspecting foul play, I began looking at all the other tables and found that there is no varchar usage at all.
    Again, would varchar actually gain you anything in this *particular* situation?  Either in time or space?

    @Volcanon said:
    Char is used for almost all numbers, big and small, and for every text field.
    I see a potential problem there, if all numbers are being stored as text instead of in their 'natural' format.  But that's more related to the code sorting out the crap it gets back from the database.

    A general rule of thumb is to use char when most of the data is at or just below the set size, and varchar when the data varies widely in size.  The varchar datatype has overhead to record how much data is in the field in a given row, but if the data is well below the maximum size of the field, it still takes less space than a fixed-length char.

    I'll pass on the IP address size comment as a bad attempt at humor :P

    Basically, these are the general rules I follow for character datatypes:

    char - simple short string
    varchar - short description or message
    text - Oh my god, he's writing a book in the database!

    (and of course their unicode variants)
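
    A quick T-SQL sketch of the difference (hypothetical table and column names):

    -- char pads to the declared width; varchar stores roughly a 2-byte
    -- length overhead plus only the data actually inserted
    CREATE TABLE dbo.PadDemo (
        fixed_val char(30),
        var_val   varchar(30)
    );

    INSERT INTO dbo.PadDemo VALUES ('10.1.2.3', '10.1.2.3');

    -- DATALENGTH returns bytes stored: 30 for the space-padded char,
    -- 8 for the varchar
    SELECT DATALENGTH(fixed_val) AS fixed_bytes,
           DATALENGTH(var_val)   AS var_bytes
    FROM dbo.PadDemo;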



  • Yeah, I tend to follow arbour's guidelines.  If you don't think a char(1000) field with an average of 50-100 characters in it is wrong, one of us must be very confused, because I think it's very wrong. lol




  • *cough* fixed length *cough* *cough* stringtegers *cough* hack *cough* COBOL or something *choke* *cough* *wheeze*



  • @Volcanon said:

    Suspecting foul play, I began looking at all the other tables and found that there is no varchar usage at all.  Char is used for almost all numbers, big and small, and for every text field.  There are a few char(1000) fields.  In one table, 95% of the values in the char(1000) field are no more than 10 characters.

    Disclaimer: not an RDBMS DBA, but maybe the closest to one in this thread; I am an LDAP DBA, and I've worked with RDBMSes on the side.

    RDBMSes apparently take a performance hit if any column in a table is variable-length.  So simply converting all of those tables to varchar could get you a net performance loss.

    That having been said, if some of those columns really never use more than 50% of their space, it could be worthwhile altering the table to use more appropriate column sizes (keeping in mind that any data that does not fit within the new columns will be lost; the database trusts its administrators to add any needed fudge factor, and just does what it's told).  Note, though, that most RDBMSes (if not all) implement this kind of ALTER TABLE by locking the table, copying it to a new table with the new layout, dropping the old table, renaming the new table to the old name, and then unlocking it.  That is a time-consuming operation on a large database, during which the table is effectively down.
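
    For what it's worth, a resize along those lines might look like this in T-SQL (the RFLog table name comes from the thread; the column name is guessed, and whether oversized data truncates or the statement fails depends on the DBMS and its settings):

    -- Find the longest actual value first; LEN ignores the trailing
    -- spaces that char padding adds
    SELECT MAX(LEN(message)) FROM dbo.RFLog;

    -- If everything fits, shrink and switch to varchar
    ALTER TABLE dbo.RFLog ALTER COLUMN message varchar(100);

    -- Converted char values may keep some of their old padding
    UPDATE dbo.RFLog SET message = RTRIM(message);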



  • AFAIK a varchar column can be compared to an int describing the length plus a char* of that length, while a char column is like a char[length].

    So the disk overhead for varchar is almost nothing, but searching through a list of varchar values will be a lot harder than searching through a list of char.



  • @smbarbour said:

    @morbiuswilters said:

    Another thing to take into account is performance: large PKs in certain databases can slow things down.
     

    That's true.  PKs are typically a clustered index.  An auto-increment PK would place the new record at the end of the index, whereas using a natural PK will incur additional performance penalties on insert while the index is rearranged to accommodate the new entry.

    PKs should rarely be a clustered index.  Clustered indexes are most useful when they help return a group of related data using a small number of disk reads.  Foreign keys are usually the best candidates for clustered indexes.  Most DBMSs will use a physical or virtual pointer as a "stand-in" clustered index if you don't define one, and in most cases this automatic clustered index is actually better than explicitly creating a clustered index on the primary key.

    In your example, it would likely be better to simply not create any clustered index than to create one on the PK (read up on heaps in Microsoft SQL Server for an explanation).  Also, in 99.9% of cases, the performance lost on a thousand INSERTs is gained back the first time the data is read using the clustered index.
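
    A hedged T-SQL illustration of that choice (all names invented): the PK is enforced with a nonclustered index, so the table stays a heap.

    -- Uniqueness is still enforced; the rows just aren't sorted by log_id
    CREATE TABLE dbo.AppLog (
        log_id  int IDENTITY(1,1) NOT NULL,
        host_id int NOT NULL,
        message varchar(100),
        CONSTRAINT PK_AppLog PRIMARY KEY NONCLUSTERED (log_id)
    );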



  • @Volcanon said:

    Believe it or not, I did suggest we just do away with the PK on that table, because I agree; though it is not in the best-practices book, I don't think we need a PK for that table at all.  It is never edited, simply added to.

    I tend to have a sequence number, or some other key, to enforce order on all log tables.  If enough events happen in the same time slice, they will resist sorting attempts by date/time.  Keep in mind that most DBAs will enforce some kind of trimming on the table, which could affect (effect? some grammar nazi will have to flame me on that) insert order, since deleted rows would get reused; that means even ordering by a stored sequence will not be trustworthy.

    Rule of thumb I use: if order matters, make your own key.
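
    A minimal sketch of that rule in T-SQL (column names are made up; an IDENTITY column is one way to get such a key, with the caveats about trimming above):

    CREATE TABLE dbo.EventLog (
        seq       bigint IDENTITY(1,1) NOT NULL,  -- monotonic insert order
        logged_at datetime NOT NULL DEFAULT GETDATE(),
        detail    varchar(200)
    );

    -- Ties within the same time slice are broken by the sequence number
    SELECT * FROM dbo.EventLog ORDER BY logged_at, seq;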


  • @Jaime said:

    @smbarbour said:

    @morbiuswilters said:

    Another thing to take into account is performance: large PKs in certain databases can slow things down.
     

    That's true.  PKs are typically a clustered index.  An auto-increment PK would place the new record at the end of the index, whereas using a natural PK will incur additional performance penalties on insert while the index is rearranged to accommodate the new entry.

    PKs should rarely be a clustered index.  Clustered indexes are most useful when they help return a group of related data using a small number of disk reads.  Foreign keys are usually the best candidates for clustered indexes.  Most DBMSs will use a physical or virtual pointer as a "stand-in" clustered index if you don't define one, and in most cases this automatic clustered index is actually better than explicitly creating a clustered index on the primary key.

    In your example, it would likely be better to simply not create any clustered index than to create one on the PK (read up on heaps in Microsoft SQL Server for an explanation).  Also, in 99.9% of cases, the performance lost on a thousand INSERTs is gained back the first time the data is read using the clustered index.

    You may want to check your references.  Microsoft SQL Server sets the PK index as clustered by default, and you can have only one clustered index on a table (which controls how the records are actually sorted within the database).  Setting a primary key adds a constraint to prevent duplicates as well as creating an index.  If the best practice were not to set the PK as clustered, why would it be the default, where you have to explicitly remove the clustered flag from the index?


  • @smbarbour said:

    @Jaime said:

    @smbarbour said:

    @morbiuswilters said:

    Another thing to take into account is performance: large PKs in certain databases can slow things down.
     

    That's true.  PKs are typically a clustered index.  An auto-increment PK would place the new record at the end of the index, whereas using a natural PK will incur additional performance penalties on insert while the index is rearranged to accommodate the new entry.

    PKs should rarely be a clustered index.  Clustered indexes are most useful when they help return a group of related data using a small number of disk reads.  Foreign keys are usually the best candidates for clustered indexes.  Most DBMSs will use a physical or virtual pointer as a "stand-in" clustered index if you don't define one, and in most cases this automatic clustered index is actually better than explicitly creating a clustered index on the primary key.

    In your example, it would likely be better to simply not create any clustered index than to create one on the PK (read up on heaps in Microsoft SQL Server for an explanation).  Also, in 99.9% of cases, the performance lost on a thousand INSERTs is gained back the first time the data is read using the clustered index.

    You may want to check your references.  Microsoft SQL Server sets the PK index as clustered by default, and you can have only one clustered index on a table (which controls how the records are actually sorted within the database).  Setting a primary key adds a constraint to prevent duplicates as well as creating an index.  If the best practice were not to set the PK as clustered, why would it be the default, where you have to explicitly remove the clustered flag from the index?
    Primary keys are clustered by default for compatibility with 20-year-old database creation scripts.  That in no way implies it is a good idea.


  • You have the actual reason not to cluster a primary key in your own post.  You can only have one clustered index, so it has to be used where it will provide the most benefit.  Putting a clustered index on a primary key works best in those situations where a group of records retrieved in one query will have contiguous primary keys.  This is fairly rare, and usually only happens when selecting rows by "date added" where the primary key is a constantly increasing value.  There is usually a more effective "main grouping value".  For example, order detail data may benefit from a clustered index on ProductID if sales-by-product queries are the heaviest load on that table.  Putting the clustered index on the primary key is usually useful, but putting it somewhere else is usually more useful; see the sketch below.
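
    A sketch of that order-detail example in T-SQL (hypothetical table; ProductID is the only name taken from the post above):

    -- Nonclustered PK; cluster on the column that groups range queries
    CREATE TABLE dbo.OrderDetail (
        order_detail_id int IDENTITY(1,1) NOT NULL,
        product_id      int NOT NULL,
        order_id        int NOT NULL,
        quantity        int NOT NULL,
        CONSTRAINT PK_OrderDetail PRIMARY KEY NONCLUSTERED (order_detail_id)
    );

    -- Sales-by-product scans now read mostly contiguous pages
    CREATE CLUSTERED INDEX CIX_OrderDetail_Product
        ON dbo.OrderDetail (product_id);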



  • Typically, your primary key is the field used most often in queries and is therefore the most efficient choice for clustering.  Of course, not every application will have its heaviest query load on the primary key, and in those cases you would move the clustered index to another field (don't forget that the clustered index key is the row locator for all other indexes on the table, and if the clustered index is on a non-unique field, the server has to generate a unique value to append to each index entry).  About 90% of the queries I have written use the primary key of the tables.  I would even conjecture that in a given purpose-built database, there will typically be only one table, or a very small handful of tables, whose primary keys are not foreign keys in other tables.  The remainder of the tables in the database are lookup tables (which are typically queried by their primary key, and thus benefit most from being clustered on the primary key).



  • But.... A clustered index on a primary key will usually return a single record in two or three page reads.  The same query will do three or four page reads if a non-clustered index is used.  However, a query that returns 100 records will typically take four or five page reads using a clustered index; the same 100 rows would typically take 95 or so page reads using a non-clustered index.

    A clustered index makes a minor improvement to a point query, but a major improvement to a range query.  So I don't claim that a clustered index won't help a single-row lookup.  But I do claim that your one clustered index should be used where it will make a huge improvement, rather than where it will make a small improvement.  Even if 95% of your queries are primary key lookups, it is still a bad idea to spend your clustered index there.
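
    To make the two query shapes concrete (reusing the hypothetical OrderDetail table sketched earlier):

    -- Point query: a couple of page reads through the PK index
    SELECT * FROM dbo.OrderDetail WHERE order_detail_id = 42;

    -- Range query: the clustered index on product_id returns the whole
    -- group from a handful of contiguous pages
    SELECT * FROM dbo.OrderDetail WHERE product_id = 7;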

