What the Daily WTF?

Dorus

That, you know, screen readers might be able to read.

Dorus

Actualy, the null check is wrong too. As the comparable javadoc shows:

Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.

As for equals, overiding equals automaticly means you need to overide hashcode too, but that's not all. A correct equal method most likely needs a handfull of helper methods and pitfalls to look out for. That makes compareTo seems easy.

As far as the NullPointerException goes, i can kinda agree that If you insist to send my methods a null, i'll insist to give you a nullpointer!

Nothing is more ugly then methods that start with

If (param1 == null) {
return null;
}

Rather, i use the wonderfull ability java has to throw a nullpointerexception for you when you write param1.method() if param1 is null.

Dorus

Must... run ... unittest .. . locally.

...

Must ... pass alll ... test before ... commit.....

....

</zombie mode>

Dorus

Public void save() { while (autosaving) { Wait(1); } [....] }

That took months to come up with!

Dorus

@The_Assimilator said:

@Rick said:
@The_Assimilator said:
We suggested that all deployments be kept in sync in the future. Let's see if that happens.

Can I suggest a table in your database called version, and that your application check this before starting up?

Has already been proposed by multiple people multiple times. Unfortunately, client is not interested in paying for such functionality, and our regular deploys to dev/qa mean version number is yet another thing that has to be updated. And no, we can't check if the app was compiled in debug or release mode and disable/enable the version check accordingly, because the app only works when compiled in debug mode (we didn't write it, but we have to maintain it).

You are not selling a version number in the database, you are selling a error-free update system. Seems trivial to me to claim that the next update cost 1-2 hours additional development to prevent update errors that took 3 people a full day work last update.

Dorus

delete from InboundMessages where isDev = 'Y'
* Hits F5

WIN!

Dorus

My mistake, you should be abe to get a long way with something like:

public class PageTree extends List<IPage> Implements IPage { }

Then again, XmlSerializer.Serialize probably doesn't support half the stuff i used above, considering i'm mostly writing Java lately. Multilpe seperate lists might still be a problem, but im not sure what you mean with new elements or attributes, the class above should still be able to contain other members.

Glad I've been using XSLT for my XML manupilation mostly, and lately some SOAP if we must talk in XML. The former allows every random xml repesenation you can think off, and the later support both Serialize and Deserialize, so you never have to look or think about the resulting XML.

Dorus

Let me gues: You do have to use it.

Yeah, that's what they must have been thinking too.

Dorus

Ofcourse, your pagetree contains a List, the List contains pages. Your container element indicates the List is there.

I'm not familiar with Serialize, but you might want to try this instead:


public class PageTree extends List<Page>
{

}

Serializes to:

<PageTree> <Page/> <Page/> </PageTree>

Arguably this is a cleaner desing too. If extending List is too much, extend or implement a supperclass or the interfaces of List instead.

Ps. Use the html button to add code tags :P

Dorus

@blakeyrat said:

@Jaime said:
@blakeyrat said:
Yeah, uh. I solved the problem.
But go ahead and do whatever.
You should know better than anyone else that this site isn't about helping the OP solve a problem. We live for off-topic posts.

If you guys worked for me, I'd so fire you.

"You're still working on that? It's already fixed! We have a million things to do!"

Hey, i'm at work now, having fun with sqlserver (that's completely unrelated to anything i do, i don't even get near a database lately). Sec, need to get that important email out i guess.

Dorus

Been playing a bit with it, refusing to use SERIALIZABLE ofcourse. Did try holdlock. I've managed to crash management studio, make variouse deathlocks, made queries seem to freeze before they even run the proc, and once it even looked like it deathlocked with itself (but it's also possible a canceled querie was still running in the background.

Oh and haha, canceling a proc that has begin transaction in it is grand fun too. I could swear that would automaticly result in rollback tran, but it doesnt.

Still unable to make it fail cleanly.

Dorus

Ah, i only tested it by putting a delay before the if, is it possible to force this double-insert on a race condition to happen by smartly adding locks or delays? Or can we only get by running the proc in 10 threads 10.000x times?

*Goes off to run the proc 10,000 times in 10 threads.

Dorus

Yeah, max edit time expired agian, i still got something fun to add:

If you are going to resort to 2 queries, you do want to add some locking:

create procedure Urlins @url nvarchar(max)
as
 begin
 begin transaction

 declare @returnID int

 select @returnID = urlid
 from url (updlock)
 where url = @url

 if ( @returnID is null )
 begin
 insert into url
 (url)
 values ( @url )

 set @returnID = Scope_identity()
 end

 select @returnID

 commit transaction
 end

Dorus

Oops, looks like subqueries can't set variables, sadly. Stupid online sytax checker didn't spot that, and didn't have access to sqlserver before.

Dorus

@Sutherlands said:

No, no noooooo! Like I said above, please use container elements:[...]
This makes it so much easier to serialize.

Can you be a bit more specific on what part of the serialize process is made easier? I've always found container elements pretty retard, it's not like i don't know /child/child is a child of /child if it's not wrapped in /child/childeren/child.

Personally the only rule i use for elements vs attributes is that attributes have a one-one relation with the elemnt, while elements can have a one-many relation.

If you need a many-many relation, resort to id and ref.

Dorus

if ( not exists (select @returnID = urlid
 from url
 where url = @url) )
 begin
 insert into url
 (url)
 values ( @url )

 set @returnID = @@identity
 end

Ofcourse the performance of this query strongly depend on a index on the url row (or indeed, on a checksum, as you can't index varchar(max)). Anyway, i find exists a little more clear and performant

Dorus

@SQLDave said:

So you had monthly crashes that took a month, or more, to fix?

He also wrote they had 3 servers, and on average 2 where running. Looks like those numbers add up.

Dorus

Yeah, i think i'm going to stick to that as my final answer.

Dorus

LinkedHashMap is a implementation of a Map<K,V>.

Some map implementations, like the TreeMap class, make specific guarantees as to their order; others, like the HashMap class, do not.

I know this spesific implementation does gurantee the order. But there is no Map interface that enforces this. Theres is one that enforce some iteration, there is one that enforce a sorted iteration, there is none that enforce a stable iterator.

Dorus

I'm talking about java here, not his function. I'm missing a OrderedMap or something like that. Some way to enforce in the contract of a method that you expect a map with a order.

I just realize there are ordered collections, namely List<E>. ListMap<K,V> however does not exists.

Edit: Looks like i wasn't the first one that thought of a OrderedMap, wonder why the IterableMap didn't make it into java 1.6.

Dorus

Reading.. reading. I'm reading now. I'm working on understanding. Give me some time on that.

Now I'm now wondering why there is no interface in java that gurantees iteration order by instertion order. Would be nice to be able to write <? extends InsertionSortedIterable> to prevent these problems by default. Seems crazy you can make a function misbehave if you enter a hashmap into it instead of a linkedhashmap. I know there is a SortedSet, but that one is based on the set sorting the data instead of keeping the insertion order.

See, i thought about it.

Dorus

I don't read, i write. Beside, he wrote his testcases did not test his (incorrect) implementation troughly enough for the differnce between HashMap and LinkedHashMap to show up. In this case he couldn't trust HashMap to work correctly in java. Oops.

Dorus

If you use the LinkedHashMap only to make your testcases happy, something seems wrong. I've used mocking before to prevent these problems (after all, you want to test your class, we trust HashMap to work correctly in java).

I might be a nitpick, but i do not accept codechanges on the behalf of adding testcases, unless my code actualy improves from these changes (witch it does 90% of the time, other 10%, powermock to the rescue).

Dorus

@Monomelodies said:

@Dorus said:
Please be more specific, at what step in eliminating redundant data, enforcing referential integrity and keeping data consistent do you have to add ~~substitute~~ surrogate keys?

Either you're kidding, or I need you very badly to stay far away from any database I happen to be working on.
CREATE TABLE user (
firstname text NOT NULL,
lastname text NOT NULL,
CONSTRAINT PRIMARY KEY(firstname, lastname)
);
INSERT INTO USER VALUES ('john', 'smith');
INSERT INTO USER VALUES ('john', 'smith'); -- error
And don't even get me started on why a multifield primary key is NOT practical in almost any fathomable case where you refer to it as a foreign key.

Either your data model is bogus, or you might need a substitute key. However that's related to your domain, not your database desgn. It's not something you as a databasdesinger can solve. Adding keys still isn't part of normalization.

And damn, i'm confusing substitute with surrogate again.

Dorus

Q: Did they specify what order this time?, and added it to any desing? Or do they just want "the random order we had last 3 year"?

Dorus

Please be more specific, at what step in eliminating redundant data, enforcing referential integrity and keeping data consistent do you have to add substitute keys?

Edit: and okay, finding the correct key is not the goal of normalization, it's the process.

Dorus

The entire point of normalization is to find a unique key, not to create one (ok, you create one by moving colums to other/new tables and by splitting the data in one colum into multiple columns). Nowhere during the normalization process do you add new data that wasn't there before.

Dorus

I lacked time to write up another responce in that topic, but at the very least, normalization absolutely does not say anything about adding id fields. It doesn't say anything about them at all.

At design time, you should not add id fields or other unnatrual keys. I know you add them during implementation to get questionalbe performance gains, and other add them for lazyness or lack of understanding of database design. However, when you desing a database, i *really* hope you try to apply normalization, not adding surrogate key's all over the place.

Dorus

@snoofle said:

Most of the tables are exceedingly normalized. So you have table A with some values and an Id. Ditto for table B.

As far as i know, id fields and normalization doesn't come together.

@snoofle said:

Then I find one table with a column: listOfCustomers, which contains a comma separated list of hundreds of customer Ids; sometimes with spaces after the commas, sometimes before, sometimes both; sometimes with multiple commas in a row (sometimes with intervening spaces but no Ids).

Was to be expected after ths id madness.

Dorus

@dhromed said:

For trivial stuff like this, you, as implementer, always add some sort of ordering, even if it's just A-Z or id desc, even if it's not explicitly in the spec.You are human, yes?

SQLServer will tell you it's a lot faster when you do not use `order by`. In fact, in some situations `order by` can seriousely slow down a query.

Dorus

=-(A1)^2
=-A1^2
=-(A1^2)
=0-A1^2

This result in 25, 25, -25, -25. The first one does in fact confuse me, the other 3 pretty much give what i had expected.

And i'm not even sure if the first one is that unexpected. Basicly Excell sees a - as a minus, unless the - is written in a position where it would otherwise be a syntax error (try to replace it with a * or /). In that case it takes it as a unary function.

Dorus

When you run into Chinese/Japanese/math sings on rechapta, simply enter "???" since that's what 90% if the population does, and rechapta accept what most people do.

Dorus

@too_many_usernames said:

As I said originally, I think it's a matter of personal style and the way I perceive notational consistency. It bothers me that in Excel

=-5^2

has a different result than

=0-5^2

When basic arithmetic says they should be the same thing. I know that it has to do with the way Excel parses expressions, but I find it annoying.

Just like

= (-5)^2

has a different result than

=0(-5)^2

The last one is a syntax error, one that has been anoying me since the first time i used excell,since this obviouse should translate to 0*(-5)^2 using common math rules. At least excell 2007 corrects this exmample correctly.

Dorus

What is the big deal with the order of evaluerenatiom of operands anyway? Do you functions have... eeeek.... side effects?

Dorus

@PedanticCurmudgeon said:

@Sutherlands said:
If you have trouble memorizing the precedence rules for any language, here is an easy pneumonic to remember them by:

Multiply/Divide comes before Add/Subtract.

Use parenthesis for everything else.

That's good advice even if you can remember the precedence rules.

Oh i know all the operator precedence rules by hearth, and i loved to write code with as little parenthesis as possible.

Now, just 2 years later, my preference is to leave no ambiguity anywhere, and wrap pieces of calculations in functions instead of warpping the functions inside the calculations.

Dorus

As for java, && is the lazy version of &, || is the lazy version of |.

Beginning programmers understand neither, lazy boolean operators are difficult, and the non-lazy versions happily work on non-boolean variables, performing the bitwise operation.

Not to mention i'm now starting to wonder why we shouldn't use the XOR ^ in java instead of !=. Perhaps i'll start to do that when i want to mess around with people :D

I find it more stunning that i can use the bitwise XOR as unequal operator on booleans, then using != for XOR operations.

Dorus

Now i'm wondering why i install k-light codec pack with MPC if i can also get MPC directly.

edit: And i wonder why i still can't spell after all these years.

Dorus

VLC isn't known for it's perfect performance. They do update there decoding regular. I use VLC mostly because i prefer there interface and settings (mostly subtitles), the program runs pretty much everything anyway, so i hardly ever need something else to play video.

Last time i couldn't run something with VLC is quite some time ago. If i remember correctly it was a 720p or 1020p movie that used to much CPU on vlc. I usually resort to the k-light codec pack, it ships with 101 codecs and MPC.

Doesn't MPC require codecs to be installed? I know from VLC that it uses it's own decoding

Dorus

Oh i've run a project where we code in C. It was my resit semester, me a 4th year student together with 3 second year students. Of course i was massive outnumbered, so getting any of the engineering techniques i learned in the past 2 yeas in practice was impossible. One of these guys even skipped the first year for reasons that are beyond me.

We used eclipse to write our code. However these guys swore to makefile to compile there code. Not to mention complicated stuff like memory management was too difficult for these guys. Somewhere around the first day of the project eclipse failed to even build the project. There was some makefile magic going around that made the thing compile.

Going forwards 2 weeks, i got eclipse to compile the code, using different compiler parameters. Debugging was still impossible for some reason. At this point it almost made me cry, obviouse i was the only one that even needed a debugger "You can add printf to your code right?". And also the only one that spend half his time fixing eclipse setting nobody else used.

Beside, i was now getting random crashes when i ran the application from eclipse, but building with the makefile it was still solid stable.

Not to mention the project was evolving into a beast where every C file included every possible header in the project. All so called modules knew all other modules, all module's wrote (should be-private) vars in other module's. And we where still slugging forwards.

Around the end of the project, i figured out you could check for memory-leaks with tools like valgrind. Fixing 2 dozen of memory leaks later (including one that let us leak 5kb/msg send to the server) the eclipse build started to be more stable, and after fiddling a bit more with compiler settings, i could actualy debug the code. Some more searching showed there was a buffer overflow, or freed memory (or both not 100% sure) somewhere in the code that was still used. This made the eclipse build die horrible, but the makefile build didn't mind it if you kept using this memory address.

Dorus

That's what i said right?

And yes, if you don't you have people problems :D

Dorus

@boomzilla said:

This is a false choice. It's not always up to you.

Depend, if you are asked to make something functional, then you should just explain why it's hard to make it that way. If you fail to explain that you either had the wrong arguments, or your boss might actualy have better ideas about it then you (if he accept the disadvantages, those disadvantages migth actualy be wortht the benefits). When it comes to cost vs benefits, your boss/customer will always be smarter then you.

If you are asked to implement technical requirements that you know won't work that way, or will work bad, I'll stick with it you're a amateur if you make it anyway. You should be able to come up with a better alternative solution, or make good arguments why it doesn't work.

Dorus

On a professional level, I think it's professional to point out when problems can not be solver with technology, instead of beating your code in all kind of impossible shapes to come up with solutions that in the end have much more disadvantages then usefulness.

If you have the mindset that your manager/customer knows everything better and you should make whatever they ask sounds pretty amateurish to me. You are the professional, and you where hired to warn them about problems beforehand. They don't need to know anything about computers, if they did know everything, why did they need to hire you in the first place?

Dorus

@Jaime said:

You are asking us to take your advice on a much-discussed topic, but you don't even know the platform you are using very well. You don't have to query the table to find out the last identity value inserted. When I started using SQL Server in 1996, it had @@IDENTITY which works but has some subtle issues; those were fixed a long time ago with the SCOPE_IDENTITY function. I find it hard to believe that you have ever written a large scale application on SQL Server and you aren't aware of these.

Interesting. No, i did not know about SCOPE_IDENTITY, i'm a student and the only database application i worked on was for a small company and they did now know about SCOPE_IDENTITY. I do wonder if SCOPE_IDENTITY was available in the SQL Sever version they used at that time.I think they worked with 2005 at that time, but the code that used @@IDENTITY was written against 2000 i think.

@Jaime said:

Are you basing all of your knowledge on one book? Not even a book written by someone respected in the industry, a book written by a career academic? At least you could have cited Joe Celko.

If you want to know my book, it's

"Volledig communicatiegeoriënteerde informatiemodellering FCO-IM" by Guido Bakema, Jan Pieter Zwart and Harm van der Lek. And i wasn't really citing it, more reading it any interpreting there opinions. It takes half a academic degree to read that book and understanding what they write (even trough it's very well written Dutch), with me quickly reading it on my free Sunday i need to be a bit carefull about what i claim they wrote :P

Edit: Found the book online. See page 157 (it's 171 in the dutch version).

Dorus

@havokk said:

I disagree with your use of "always". Whether id fields hurt performance or not depends on the database product you are using. Denormalisation is per-product - a denormalisation step that makes sense in Microsoft SQL Server does not necessarily make sense in Oracle Database. Additionally, OLTP and OLAP databases are different again.

Like i said before, with denormalization you might optimize for one action, but lose other ground (more work for constraints, or less contraints checked, read performance better, but write performance worse etc.) When you design your database your first goal should always be normalization. And last time i ran a benchmark to see if denormalization was worth it, i saw it rarely was. You need like 1:1000 reads vs writes, and even then the difference is VERY small, compared to a lot more work to hang onto your business rules. Ofcoure the small difference can mean a lot when a read action happens extremely often, but it's not a very usual situation. Also make that benchmark! More often then not the results will surprise you (this is true for any form of performance optimization, benchmark or it doesn't count!). I agree with you some database systems can gain more from certain techniques then others. Also, even a normalized database allows many different solutions that are all normalized, but offer a great variate in performance.@havokk said:

Besides, sometimes there is no natural key. For example, what natural key would you choose to uniquely identify a person?

As for ID fields (i don't remember normalization dictates the use or non-use of ID fields and the like), you should read this page about surrogate PK and substitute PK he makes a number of valid points.

• There is at least one candidate key (before the surrogate is added),

<p class="bodyind">• the table is a parent in at least one association, and</p>

<p class="bodyind">• there is no candidate key small enough for its values to
   be copied many times into the child table.</blockquote></p><p class="bodyind">I would like to add to that that substitute keys and surrogate keys should always be avoided, if no other solution is present, give your surrogate keys a real world meaning (like customer number printed on letters to the customer) so they become a substitute key instead.</p><p class="bodyind">As for what i use for a person. It strongly depend on the domain. For a webshop, username is fairly unique. For the hospital administration, you are going to have to discuss this with the <span id="result_box" class="short_text" lang="en"><span title="Klik voor alternatieve vertalingen" class="hps">domain</span> <span title="Klik voor alternatieve vertalingen" class="hps">expert, you can't get away with using a new unique number for every patient, you also need something that matches real patients with the rows in your database. Most likely the hospitals has a procedure how they identify patients. Anyway, you can't decide for yourself what's unique in this. The domain expert has to give you that information, and once established you make that information unique in your database.</span></span> Odds are high a substitute key is already present in the form of a patient number or anything the like. Notice this shouldn't be a ID field (by default), but something the domain expert agrees with. You are going to see this number a lot in the hospital, probably on many forms. Perhaps he wants it to start with a p, (p1234) or patient numbers should be unique across different hospitals nearby so patient data can be exchanged easily (even when they use a different database)</p><p class="bodyind">Using identity fields yields many different problems too, like, after a insert in 1 table, you need to do a lookup before you know what number to use in the next table. Not to mention what happens when you need to import multiple lines at the same time. Using a self made number is much easier then a identity field in that case. (I'm talking about sql server now, trough i'm pretty sure identity fields work the same in other databases). </p><p>@havokk said:<blockquote>Again, I disagree. If denormalisation&nbsp; is required by a system then that system, by definition, will not be in 3NF (or BCNF or whatever you were at).</blockquote></p><p>What crazy requirements do you use that dictate implementation details? As far as i know requirements and implantations details are completely separate. There is nothing normalization forbid you to do in a database, other then dictate good practice. This is like saying the requirements forbid you to use code standards or unit testing. </p><p>&nbsp;</p><p>@blakeyrat said:<blockquote>What is this magical database system you're

using where joins affect performance? Joins only slow you down if:

1) Your PKs are ridiculously huge (first_name, last_name, DOB, user_name)

2) You're joining in tables that are irrelevant to the query
(joining in "car_owned", then never referring to it in another clause)

Have you actually benchmarked any of this, or are you just echoing some bearded Unix geek who was an expert in OS/400 databases in 1986, and just assuming that's still relevant?

Using user_name alone should be more then enough as a PK. I'm not sure if this is the second or 3th NF, but one of those dictate shortest possible PK's. Also, as long as first_name column hold a fair amount of unique values, a join won't be slowed down by the other 3 fields.If it is the same 90% of the time, you better switch the order of columns in the PK so that another one that has a higher level of uniqueness comes first. As soon as the first column only result in 1 unique value, the database won't look at the other 3.

BTW, if user_name is not unique i'de like to see the login form of your website:

Username*:
First name*:
Last name*:
DOB*:
Password*:
*Required.

And yes, i benchmark my stuff before i make claims about performance. I've never seen a join that was faster then looking up another column in the same table.

@blakeyrat said:

EDIT: this thread is ridockulous. Is it just full of trolls? I'm starting to think so.

Been wondering if you where trolling me too, i decided i should just make fun of the ridiculous argument, and reply serious to the serious one's.

@boomzilla said:

I'm generally interested in some balance of write and read performance, depending on the circumstances. If all you need is just that simple reference value, then I guess your comment makes sense. But assuming you need anything else (which is probably pretty often--at least it is for me), you're going to have to do the join anyways. Also, what could / should be updating a single column in a single row can suddenly balloon into updating thousands or millions of other rows. Not to mention all of the locks that would create. I don't hate my users that much (yet).

You would be surprised how often you do not need that additional data. Also, if you are going to have to join anyway, that id fields add 1 more field the database has to retrieve and load into memory. As long as your natural key does not contain any ridiculous dataset like 90% of the time only the last 4 chars of a 450 char string differ, a join on a id field isn't going to be much faster then a join on a string field. I've seen plenty of times a join the string field was in fact faster.

Even when you do have that ridiculous dataset where only the last 4 chars differ (I've seen them), my solution was to add a calculated column that had the data of the first column backwards. Next i included that backwards column into my index. (Wasn't a PK index, but could have been).

@boomzilla said:

You're all over the place here. I guess it's possible that you deal with situations where the reference key is all that you need. Then yeah, I guess I can understand how the performance would be better for you. But if you need that data, then you need that data. Sometimes you need further joins, too, in order to wade through the hierarchy of objects / tables / data / whatever. My point is that in this case, you've probably made the DB use a lot more data in its indices (due to composite keys vs a simple key), which also means it has to look at more information to do the join.

And an ON CASCADE UPDATE is not a magical device. If you don't see the performance implications of that, you're not paying attention.

I don't see why the DB uses more data in my case, i got 1 ID field less to worry about, also, columns not included in the select are offten not even read by the database engine.

I do agree ON CASCADE UPDATE is not magic. It has it's disadvantages, but you won't run into them in most general usecases. Don't confuse it with ON CASCADE DELETE, that one more often then not results in entire databases running empty on 1 deleted row, and are not considered very safe to implement in most situations.

ON CASCADE UPDATE only hits tables that have a FK directly to the current table (or indirectly trough a composite key). You won't have that many rows that are referred my thousands or others. The fact alone that those other row that refer to the first one will not have many references them self, means that at best 0.1% of your dataset has this characteristic. For that 0.1% to also be changing daily is unlikely. If you do need to optimize for that case, i prefer to either look for a alternate PK, or put changes to this table in a queue and handle them at night. This might be a acceptable solution if you use company_name to identify your customers, and sometimes company's change there name. You don't want to put your database on hold for 15 mins during the day when one of them does this, but it's perfectly acceptable to run a queue like that at night.

@Jaime said:

@Dorus said:
As for performance, the more data you put in 1 table, the less joins you need. Thuss more performance. This is especially nice if you remember normalization just allowed you to put more data in one single table.
Head asplode.

Normalization almost never creates more columns, it almost always pushes data out to other tables. You aren't talking about the benefits of normalization, you are talking about the benefits of natural keys. The debate between intelligent and surrogate keys is far from over and far from one-sided. I don't have a problem with intelligent keys when they are appropriate, but I do have a problem with your insistence that surrogate keys are never the answer. Your arguments also reek of premature optimization.

I think this happens somewhere around the 4th - 7th normal form, but the goal of those is to have less tables. There are normalization techniques that pushes data back into tables. One of these techniques is to use natural keys. Another one is that you can sometimes eliminate a table when the data it contains can already be found in another because it has a 1-1 or many-1 relation.

Imagine a class only exists when it has at least 1 student. All students have 1 class. Also teachers are assigned to multiple classes. It's perfectly acceptable to make this scheme:

Teacher		Teaches			Student
FK1	<-	FK1	FK2	->	FK2
Name		Teacher	Class		Class	Name
PK		pk	pk			pk

No need for a table class, since "Select distinct Class from Student" retrieves this already.

Edit: I looked up my book. They denormalize at the end of the process, to make queries simpler and spare out joins. Notice this is at the END of the process, around the time where you can decide if adding additional constraints cost less then less tables. denormalizated tables should still be closely guarded by constraints so no bogus data makes it in.

Dorus

@Jaime said:

@Sutherlands said:
As long as the database keys that have no meaning outside the database don't get outside the database... *wishful thinking*

Use GUIDs for surrogate keys. That tends to stop the use outside of technical scenarios.

We once used GUID to make sure user's coulnd't guess the ID of the next img in our image service. Oops.

@Jaime said:

... bracing for GUID religious war ...

Oh, but i love GUID, other then identity fields, a GUID is unique in space and time. You can do fun things with that :D

Dorus

@Nelle said:

i've worked on projects where databases were in 3NF, all indexes were optimised for the executing queries, tables partitioned, servers top-notch, and still we had to resort to denormalisation in order to create reports for managers who want to see EVERY FIELD from EVERY database on ONE BLOODY SCREEN. (and preferably formatted as a spreadsheet, but NOT exported to Excel, because then it is not REAL TIME, although they would be happiest if we would connect the Excel to the database, which I know is possible by the way, but please don't tell THEM that)
and judging from some stories, this is not at all uncommon ...

Glad to hear the managers are the most important users of the database, and they need to be updated on the ms. Too bad the end users will be delayed a bit more, especially with all the counting going on in the background by the manager display. This means the end users will be delayed in there action, most likely also delaying the report again. Not to mention you can't have to many users running around at the same time, those pesky guys only slow down the database.

Are you bankrupt yet?

Dorus

@boomzilla said:

Oh, yeah, let's inject performance issues just to keep to a notion of natural key purity. Yes, I was talking about having to update all of the FK references to your PK. So, in fact I was talking about both.

So now you want fast write performance instead of read performance? Joining all those tables each time you need to do a read is a lot more work then updating 2-3 values with on cascade update when you change one.

@boomzilla said:

Sorry, but this is still stupid. A database is a tool to be used by an application. If using arbitrary keys improves the development, maintenance, performance or function of the database, then forcing the use of a natural key seems silly. What if your natural key is composite? Now, so are your FKs. That certainly can have performance implications. Theoretical purity is nice, but it rarely survives the real world.

Can you explain how this improves development or maintenance? Last time i checked normalization especialy helped development and maintenance. As for performance, the more data you put in 1 table, the less joins you need. Thuss more performance. This is especially nice if you remember normalization just allowed you to put more data in one single table.

Dorus

@Nelle said:

yes you are right, that is why data warehouses, olap cubes and such are so unpopular

Data for a date warehouse always comes from a primary system that does enforce constraints and data integrity. Usually a normalized database. The whole point of data warehouse is to skip those steps in favour of (read) performance. Even a data warehouse will have a certain degree of normalization. Most of the performance gains come from ignoring business rules and constraints because those are already enforced by the first layer

Even then, this is not a very common as you can also get good report data from a normalized database, a view and a index will dramaticly speed this up without losing normalization. (It does slow down write actions, just like denormalization would, but at least you won't need any insane data integrity checking).

@boomzilla said:

So...if we enforce natural keys whenever possible, what happens when it changes

Last time i checked there was no reason for a PK to be unable to change. It only had to be unique.

@boomzilla said:

There are many other situations that are more domain specific. But either way, if that was your PK, then you have to update all references to it when it changes.

Because "ON CASCADE UPDATE" is sooo difficult to use. You are talking about a FK btw, not PK.

@__moz said:

@Dorus said:
Using non-meaningful keys, like ID fields is always wrong. A database contains data about reality and a ID field never matches with something real.

I'm not sure if you only mean this in the context of snoofle's situation, but the second sentence suggests not. ID fields come into their own if you need to store a references to a few rows in other tables (say, table Z) temporarily for some reason, as Z only needs a table name field and a number field, rather than needing fields for the primary key of each table it can store rows from.

I'm advocating that reference fields should store "meaningful data".

Beside, i'm not exactly sure what you try to say. Table Z needs a FK that refers to a table, certainly not a column with a table name (WTF). Also, what's wrong with needing the fields of the primary key of the referred tables? Unless your database designer is notepad, most of them let you drag a arrow from one table to the other, and add the required columns for you. Not to mention the data in those PK columns contain data that directly refers to data in table Z. If you store a ID instead of the PK, you will need to join table Z and the other's every time you use it. If you store the PK instead, this can be avoided half the time.

@PJH said:

@boomzilla said:
you use people's social security numbers as your PK?
As the federal government has no doubt found out by now, SSNs are not good candidates for primary keys, not because there's a one-many connection between people and SSNs that you imply, but because there's sometimes a many-many connection between them. The most well known, I think, being Hilda Schrader Whitcher's when lots of people decided to use hers.

If storing people that use fake SSNs in your database has a high priority, using SSN as a PK might not fit. However, storing many-many relations is hardly a issue in databases, you just need a wider PK.

Dorus

@Scarlet Manuka said:

If he's standing in front of you, you can look to see what differences there are between the records in your database, and then ask questions based on that. Harder to do on a webform, I admit. But you seem to be arguing against your own point here - this is why an ID field is good, because it guarantees that there will be something unique about that user's record, and furthermore you know which field will be unique. So now you just have to ask him for his ID, and you know which record to update

Good, so we agree that rows in a database should be unique based on something that's unique in the real world.

Now i chalange you to ask mister datetime what's his id. If you can't, then it's probably not a good idea to add a ID to your event's date.

@Scarlet Manuka said:

If he's standing in front of you, you can look to see what differences there are between the records in your database, and then ask questions based on that.

Yes, but there is no gurantee there's even a diference between those records since you did not put a constraint on your database that enforces unique records other then the ID field. And how does mister John Doe knows his ID? It was generated for him in the background for him when he first ordered a marble from your company, to lazy to fill in his real name in the request form.

Or are you going to comunicate him his ID after he placed his first order? Seems like you are now building your business logic around your database instead of the other way around.

@boomzilla said:

@Dorus said:
A good PK would be a username, or, for example my school resort's to student numbers. A student number is, other then a identity field, something that exists in the real world. My student number is printed on my student ID, all letters the school send me, and also a valid username when i log into the school website.
OK, so we can't create our own id fields unless we print them on cards?

Well, if that's how you decide to identify your rows: Yes. Most websites use usernames to have a unique idenitfier, but we all know the fun with helpdesk support numbers, customer numbers, tracking numbers etc. All these are perfectly valid solutions, and very much printed somewhere.

Dorus

@Jaime said:

Speaking of normalization and performance in the same breath is a bit weird. Normalization has always been about data integrity. Good performance often requires denormalization.

I have yet to see a example of this. Good performance always require stuff like good indexes, it hardly ever require denormalization. Poorly normalized databases either suffer data integrity problems, or have so much checking going around that it hurt performance so much that normalization wouldn't had made a difference.

Imagine you store your employees living address both in the employee table and the pay-check table, to spare inner join. Now you need to add a check that these 2 address columns always update at the same time, or you run into trouble when a employee moves and only 1 field is update (thus your data is bogus).

@Jaime said:

How do you propose a database handle situations where there is no natural key, like a list of people? I've found that only only statements that are always wrong are statements that contain the word "always". Unfortunately, this rule nullifies itself.

How do know what person belongs to what row if you have 5 times mister John Doe in your table, and 1 of them stands in front of you and ask you to change his name to Karl Doe? You are probably going to ask his birthday, but then still 2 John Doe's remain, and there are no guarantees anything else you ask him will make him unique in your database. Not to mention how much fun it is to make a web form for this. "If name is not unique, ask for birthday, then ask for ..., then ask for ...".

First and last name can hardly be considered as a candidate key, since multiple people can have the same name. Some organizations rely on birthday, but even that fails at times. In the end you will need to add something to a user that also exists in the real world. A good PK would be a username, or, for example my school resort's to student numbers. A student number is, other then a identity field, something that exists in the real world. My student number is printed on my student ID, all letters the school send me, and also a valid username when i log into the school website.

Posts made by Dorus