MySQL indexes/uniques/primary keys

Felix_C

Hey there.

I'm a 17-years-old developer-wannabe, and I've been toying around with PHP and MySQL for a few years. (inb4 "php/mysql is TRWTF", they're free and widely used in free/cheap web hosts, and well, you have to learn somehow.)

So I pretty much got the concept of primary keys and their uses for unique identifiers a long time ago, which I believe is a good start.

However, MySQL also offers 2 other types of indexes: INDEX and UNIQUE.

I think it's a safe guess to assume these two allow faster access when searching for something that includes them. However, to what extent? When should you use an INDEX or an UNIQUE key? What do they do? What's their tradeoff? What are the main differences between the three?

And when googling it, I mostly find places explaining how to do them, not what they do, so I'm pretty much at loss there.

Anyone? I don't want to make WTF tables that could end up here, so I thought here could be a good place to ask.

belgariontheking

Man, this thread has been up for 2 1/2 hours with no morb to answer. I think he may have starved to death due to lack of ketchup.

Felix_C

Don't worry, I had my ketchup dose. I think I'll be able to hold on a few more hours, perhaps even see the light tomorrow; please hurry, though.

vr602

OK, slightly against the tradition of TDWTF, I'll give you a bit of an answer.

In general, table indexes give you much much faster access to the data IF you use the indexed column(s) in your selection critiera. If you have a telephone directory, then it is assumed that you are going to use last name & initials to do the lookup, which you can probably do in say 20 seconds. If someone asked you to find an entry based on street number and street name, then the only way to do it would be to read every single entry in the book until you found it.

So in your index-less database, every query will be scanning all the data in the table. This isn't a problem in a teeny database, but when the tables get big it's a disaster. So you need to work out what key columns are going to be used as selection criteria, and create indexes using those columns. This will improve performance many hundreds of times, as with the phone directory.

Typically, you should have one unique index (e.g. customer ID number) which both ups performance and also documents what the unique key of the table is, and optionally other indexes just for performance. So, in a customer table, you might also put an index (non-unique) on State/Town, if you were expecting to select records by State, or by State/Town combination.

The tradeoff is that indexes take up space and have to be kept updated (internally). They are essentially a sorted list of the values of the columns in your index, and a pointer to the page where that entry is found. So, in your phone directory example, you *could* create an index of every entry sorted by street name, pointing you to the pages where those entries are found, but it would need a new book to keep the list in - small overhead. In addition, when you insert a new record into your table, all the indexes have to be updated, which may involve some shuffling around, so insert/update performance will be a bit slower. This is why you don't put an index on every single column.

Understanding indexes and the query optimiser for your particular dbms is crucial; it's really the whole 'point' of databases - fast access by key.

Here endeth the lesson. Everyone stop yawning please.

Felix_C

Yay! I wasn't doing it wrong after all. Thanks.

savar

Nothing wrong with PHP/MySQL. Just make sure you understand their strengths and weaknesses.

You already understand what a PRIMARY KEY is.

A index is just an ordering of data that allows for faster lookups. Essentially you are spending more time and space up front pre-ordering the data, and in return you get faster retrieval time later. This is a tradeoff that makes sense if you search for data far more often than you create or modify data.

A unique key is an index that has the additional constraint of not allowing the same value to exist twice. The difference between a UK and a PK is that the PK is not nullable, and you can only have one PK per table. A UK is nullable (and multiple records can be null, since null is not a value) and you can have several on the same table.

Unique fields are almost obvious to spot -- user names in a database, for example. Anything that has the unique requirement should be a UK.

Indexes only need to be created in response to specific application needs. For example, you are planning to write a query which joins together several large tables in non-obvious ways. You might need to create an index on the join key, if you know that this query could be a performance bottleneck.

Notice that the indexes must match the search query in order to have the most benefit:

SELECT * FROM employees WHERE age=50 AND experience=10

If you create separate indexes on age and experience, then the query will not be able to use them efficiently. You need a single index on (age, experience) combined. But the more indexes you create, the more overhead you create also. So it's a careful balancing game that requires knowledge and experience.

If you want to play around with, set up a few fictional tables and write a script to populate them each with 1 million records. Now write some queries to select from those tables using various joins, with and without indexes. MySQL will tell you how long a query takes so you can do an informal experiment pretty easily.

I recently took a 4-table join that ran in 8 or 9 minutes and reduced it to several 10ths of a second by adding 2 indexes.

It gets complicated pretty quickly. Being self-motivated is great, but part of learning is to first figure out what you don't know. Try taking a college class. You really need to understand how the database works. Read up on heaps, binary search trees, and query plans. This would be a good start.

@savar said:

Notice that the indexes must match the search query in order to have the most benefit:

SELECT * FROM employees WHERE age=50 AND experience=10

If you create separate indexes on age and experience, then the query will not be able to use them efficiently. You need a single index on (age, experience) combined. But the more indexes you create, the more overhead you create also. So it's a careful balancing game that requires knowledge and experience.

This is not true across all database engines and queries.The database engine could use a hash join across the two indexes to get the intersection of the two conditions. I made SQL Server 2000 do that just yesterday in fact.

sootzoo

@tster said:

@savar said:

SELECT * FROM employees WHERE age=50 AND experience=10

This is not true across all database engines and queries.The database engine could use a hash join across the two indexes to get the intersection of the two conditions. I made SQL Server 2000 do that just yesterday in fact.

Not to mention using the * operator screws you out of using covering indexes, another concept the OP should probably bone up on. You could quite possibly get by with a list of employee IDs that satisfy the where clause, where the IDs are also stored in a covering index of (employeeId, age, experience).

Also, if employeeId were a primary key, you could get away with just (age, experience) depending on the DB engine - though probably not with MyISAM, unless things have changed significantly since the last time I took a look at it.

@sootzoo said:

@tster said:

@savar said:

SELECT * FROM employees WHERE age=50 AND experience=10

This is not true across all database engines and queries.The database engine could use a hash join across the two indexes to get the intersection of the two conditions. I made SQL Server 2000 do that just yesterday in fact.

Not to mention using the * operator screws you out of using covering indexes, another concept the OP should probably bone up on. You could quite possibly get by with a list of employee IDs that satisfy the where clause, where the IDs are also stored in a covering index of (employeeId, age, experience).
Also, if employeeId were a primary key, you could get away with just (age, experience) depending on the DB engine - though probably not with MyISAM, unless things have changed significantly since the last time I took a look at it.

Yes, but you can also trick some engines into using them. Consider SQL Server.

select EmployeeID into #temp from employees WHERE age=50 AND experience=10 order by EmployeeID

select * from Employee where EmployeeID in (select EmployeeID from #temp)

I got my query to run 5 times faster this way than with a straight select. Pissed me off to no end though that the database engine wouldn't make that optimisation.

Felix_C

Well, now things got interesting. I didn't know about indexes covering multiple fields.

I am taking courses in college, too. Actually, I'll be starting tomorrow. I asked this question by email to my soon-to-be teacher and she didn't answer, and I guess I didn't want to wait until then.

My tables use InnoDB. MyISAM doesn't have foreign keys nor transactions, so I avoid it.