I don't wanna create a WTF.. database question



  • Okay so I'm not exactly a database expert, and I have a problem that I can only come up with solutions that seem WTF-worthy.  I'm wanting to create a table that has a list of items, that are in a specific order (but the order is not necessarily related to any attributes of the data).  Further, this order should be adjustable.  For example, I may have "A, B, C, D", but decide that I'd rather move "C" to the front (i.e. "C, A, B, D").  I'm trying to figure out the best schema.  I have two ideas, neither one really seems elegant.  I feel like this has to be a common problem with a better solution than the ones I've come up with.  Any suggestions?


    Shema 1:  Sort of like a linked list

    id:  A unique identifier for the row.
    next_id:  The id of the "next" row of data.
    data:  All the data (obviously, this would actually be several rows)

    Pros:  Adjusting the location of an item only requires at most 3 queries (to update a few next_id's).
    Cons:  Cannot (with any method I can think of) get a range of items (i.e. items number 15-25).  You must query for all and then perform the count programmatically.  You also have to keep track of which item is the head of the list.


    Schema 2:  Indexed list

    id:  A unique identifier for the row.
    index:  The index of this row of data (independent of the id).
    data:  All the data (obviously, this would actually be several rows)

    Pros:  Easy to get a range of items (i.e. items number 15-25) with just an SQL query.
    Cons:  Difficult to rearrange list (i.e. to move the last item to the front, you'd have to increment the index of every row).


    Another thought I had is to use Schema 2, but have the indexes initially spaced out (instead of 1,2,3, do 10, 20, 30).  That way to move the last item (say, 990) to the front, I'd just have to set its index to something between 0 and 10 (say, 5).  This is ideal if the data is not going to be rearranged often.  But I'm afraid that's not the case.  It also requires a lot of work ahead of time, I feel like an elegant solution would only require a database query.



  • Schema 2 seems better to me -- that's how I'd implement it, at least.

    I don't necessarily agree that the most elegant solution involves only a database query, because it seems like this index really belongs to some other piece of the system (for example, it might be a way to compare your domain objects, or a sort order in a GUI). By that assumption, the indexes are manipulated in (hopefully) a real language somewhere (where it's much easier to make an elegant solution), then just written back to the database.


  • I have this kind of problem regulary, and choose schema 2. Schema 1 is a pain to query.

    Rearranging the list is two update statements. How many records do you plan to have in this table?



  • Schema 2 is better than schema 1.  At least #2 allows you to retrieve a sorted list from the database, rather than manually jumping around the list following "next" pointers.

    With schema 2, swaps are simple, just two updates.  You could actually use swaps for all inserts, removals, and moves, bubble-sort style, but it would be rather inefficient.  A better option for inserts would be to move all the affected rows first, and then insert.  Then it's still just two updates (though obviously more than two rows may be affected).  e.g. "UPDATE table SET index=index+1 WHERE index > $insertpoint; INSERT INTO table ..."  The same scheme would work in reverse for removals.  The scheme also work for moves within the list.  You'd just move all items within a specified range up (or down) by one, and then update the moving item to be in the now-open spot.  You could actually do this to swap, too, rather than making swaps a special case.

    I think skipping indices is just going to make your code more complex, because you're still going to have to build in the case where there's no room to insert.

    If your DB supports transactions, use them.  You really need this to be atomic.



  • In schema 2 you could avoid the problem of having no numbers between n
    and n+1 by using strings instead. You can always find a string that
    sorts between two other strings, e.g. if you want an index between A
    and B, use AM (M is the middle letter of the alphabet).



  • @Goplat said:

    In schema 2 you could avoid the problem of having no numbers between n
    and n+1 by using strings instead. You can always find a string that
    sorts between two other strings, e.g. if you want an index between A
    and B, use AM (M is the middle letter of the alphabet).

    Um.  Insert between A and AA?



  • @Jörg said:

    @Goplat said:
    In schema 2 you could avoid the problem of having no numbers between n
    and n+1 by using strings instead. You can always find a string that
    sorts between two other strings, e.g. if you want an index between A
    and B, use AM (M is the middle letter of the alphabet).

    Um.  Insert between A and AA?




    sorry, wasn't thinking. Just never create anything ending in A or Z.



  • @Jörg said:

    @Goplat said:
    In schema 2 you could avoid the problem of having no numbers between n
    and n+1 by using strings instead. You can always find a string that
    sorts between two other strings, e.g. if you want an index between A
    and B, use AM (M is the middle letter of the alphabet).

    Um.  Insert between A and AA?


    Right. Just use integer indexes. Like someone said before, a couple of "update table set index=index+1 where index>POINT_WHERE_MOVED_ROW_SHOULD_GO" will do the trick. Been using this for years, always works. Linked lists are a pain because - well, they're linked lists. They're not what you intend to do! Sure, there are many examples when linked lists are the best thing since sliced bread, but to store sorting orders isn't one of them...



  • actually ending in Z is ok.

    Why does this forum have an edit button if it doesn't actually let you edit?



  • Rearrangement in Schema 2 can even be done in one statement (important if a unique index prevents a two-step solution).



    For example, swapping 3 and 4 is

    update list set id=7-id where id in (3,4);

    Vendor-specific SQL extension let you do more complex rearrangements easily, e.g.

    update list set id = decode(id, 7, 2, id+1) where id between 2 and 7;

    moves an item from position 7 to position 2.
    Even without that, some applied math does the same trick:

    update list set id = -1.5*id*id+5.5*id-2 where id in (1,2,3);

    moves an item from position 3 to position 1 (shifting 1->2 and 2->3).


  • Unrelated to your actual question, but: consider carefully whether you need a standalone 'id' column.  Most data has a field that uniquely identifies it already; make that the primary key instead if possible.  It makes the database that much less prone to duplicate data.



  • @Angstrom said:

    Unrelated to your actual question, but: consider
    carefully whether you need a standalone 'id' column.  Most data
    has a field that uniquely identifies it already; make that the
    primary key instead if possible.  It makes the database that much
    less prone to duplicate data.




    It also wastes space when you need to reference the row from other
    tables, makes JOINs slower, and makes it more complicated to change
    that field. Duplicate data isn't a problem if you make it a unique key.



  • I've found a number of times that it bites me when I don't create an id key.  The fundamental problem with using unique data in the table as the primary key is that any useful or "real" data in the table is subject to change, including the primary key.  Change the primary key, and now any other tables which reference that key also need to be updated.

    e.g. Let's create a table of users.  The obvious primary key is the username.  So other tables reference that key to show ownership/authorship/whatever.  But Suzie Random (srandom) gets married, becoming Suzie Random Hacker, and wants to change her username (to srhacker).  It's possible, but it's a hassle, because tons of things point to her username as a foreign key.  If they pointed, instead, to an id which has no meaning outside the database, then nothing would need to change except the one entry in the user table.

    I think using meaningful data as a primary key can typically be viewed as data replication.  I almost always create id keys.  It just normally makes things simpler.  Worrying about the efficiency of joins/etc during design falls under premature optimization, generally (but possibly not always).  Of course, I always make the "logical" primary key (username in this case) unique.



  • I'd like to introduce you to my friend, SQL F701 ON UPDATE
    CASCADE.  At that point it becomes more of a practicality versus
    purity discussion: on update cascade keys are marginally less portable
    and, unless used CAREFULLY and documented well, can lead to updates
    taking an unexpectedly long time and to hard-to-comprehend database
    schemas.



  • @ammoQ said:

    Rearrangement in Schema 2 can even be done in one statement (important if a unique index prevents a two-step solution).

    For example, swapping 3 and 4 is
    update list set id=7-id where id in (3,4);

    Vendor-specific SQL extension let you do more complex rearrangements easily, e.g.

    update list set id = decode(id, 7, 2, id+1) where id between 2 and 7;

    moves an item from position 7 to position 2.
    Even without that, some applied math does the same trick:

    update list set id = -1.5idid+5.5*id-2 where id in (1,2,3);


    moves an item from position 3 to position 1 (shifting 1->2 and 2->3).

    Hmm...  I wonder what these things would do in the presence of a unique constraint on the id column (that you use here - or the index column from the original post).  It seems to me that the way unique constraints are applied (I think row by row on most databases for an UPDATE) this would be very likely to fail.  In such a case, you'd need to disable the constraint prior to doing the update and then reenable the constraint after completing the update.  In a DB like Oracle, I think this would be problematic, as if the reenabling of the constraint failed, you'd have no way to rollback the transaction (at least back when I was using Oracle, DDL always caused an implicit COMMIT).



  • Definitely Schema 2 is the way to go, but why hasn't anyone pointed out the use of LIMIT?!



    eg - items 15-25:



    select some-fields from some-table order by index limit 15,10;



    While we're at it - think about how the re-ordering will take place -
    it'll probably be an interactive process with the user, so it'll be
    easy to put in 'up, down, to-top, to-bottom' style controls on each
    record, which allows you to easily calculate the indexes.



    Should the indexes get too crowded you have a simple housekeeping job which runs through and re-indexes the records.




  • Hi Kipthegreat,

    I agree with others: use schema #2. It's the common choice for these cases since it's the easiest to understand. You should only resort to #1 if the specific needs of your application are such that it will perform much better (and I doubt if that'll be the case for any application).

    Some people have recommended to use character codes as the index in schema #2. Don't. You'd be making yourself way too dependant on the collation settings.

    If you want to be able to insert a row without having to change the index of all rows "after" that, use a floating point datatype for the index. The new index is then simply the average of the indexes of the entries "before" and "after" the desired location. Set up a scheduled job to renumber indexes during off hours. The frequency has to depend on update frequency. (Remember that with the maximum precision of floating point columns, it will take an awful lot of updates to get to the point where the average of two values is rounded back to one of those values - but it will happen eventually, unless you renumber the indexes from time to time).

    --
    Hugo Kornelis, SQL Server MVP



  • how come no one suggested Schema 2 but with real's as datatype for the order column
    if you have 1 and 2, and you need to insert, take 1.5, then 1.25, then 1.125

    That's how our 3rd party ERP does it, and it works fine



  • @kellyleahy said:

    Hmm...  I wonder what these things would do in the presence of a unique constraint on the id column (that you use here - or the index column from the original post).  It seems to me that the way unique constraints are applied (I think row by row on most databases for an UPDATE) this would be very likely to fail.  In such a case, you'd need to disable the constraint prior to doing the update and then reenable the constraint after completing the update.  In a DB like Oracle, I think this would be problematic, as if the reenabling of the constraint failed, you'd have no way to rollback the transaction (at least back when I was using Oracle, DDL always caused an implicit COMMIT).



    If you do it in one statement, it's no problem. The constraint is checked after the whole statement has been processed, not on a per-row base. No need to disable the constraint.



  • @t-bone said:

    how come no one suggested Schema 2 but with real's as datatype for the order column
    if you have 1 and 2, and you need to insert, take 1.5, then 1.25, then 1.125

    That's how our 3rd party ERP does it, and it works fine


    No one suggetsted that because it's not a good idea. After some rearrangements, you might reach the limit of the accuracy of that column, leading to unpredictable results.



  • Have another column, "SortOrder".  Then do SELECT * FROM MyTable ORDER BY SortOrder.




  • @kierenj said:

    Have another column, "SortOrder".  Then do SELECT * FROM MyTable ORDER BY SortOrder.

    Depends on the rest of the database. If the table is referenced from other tables, seperating the key and the sortorder column is highly recommended. If it is a unreferenced table, and the key is an artificial number anyway, doing that is probably useless.



  • Wow, I didn't realize so many people actually read these side-column posts.. thanks for all the advice.  It sounds like the linked list method is definitely a bad idea, and that was where I was initially leaning so I'm glad I asked.  I was really hoping there was some built-in way of modelling ordered
    lists that I didn't know about, but I'll settle for schema 2.  I actually won't be reordering the items very often after an initial
    ordering, but I wanted to know what was generally the best way to do
    this type of thing.

    The database won't be used too very often, it's just for my personal php/mysql website.  I have a lot of photos on the website, and now I store the timestamp and the description of the file as well as some other things in the filename, which is passed in the URL.  So I have really ugly URL's like this:

    photo_view.php?dir=photos%2F%7E2%7EHawai%27i%2F%7E3%7EThursday+%28O%27ahu%29%2F%7E3%7EPearl+Harbor&image=2005.04.07+-+10.35.32+-+USS+Arizona+Memorial.jpg&bgcolor=ffffff

    I'd like to be able to have a url that just looks more like "photo_view.php?image=1189", and get all the other information from the database rather than the file name.



  • @Angstrom said:

    I'd like to introduce you to my friend, SQL F701 ON UPDATE
    CASCADE.  At that point it becomes more of a practicality versus
    purity discussion: on update cascade keys are marginally less portable
    and, unless used CAREFULLY and documented well, can lead to updates
    taking an unexpectedly long time and to hard-to-comprehend database
    schemas.

    Don't go confusing things with your facts and logic.



  • @ammoQ said:

    @t-bone said:
    how come no one suggested Schema 2 but with real's as datatype for the order column
    if you have 1 and 2, and you need to insert, take 1.5, then 1.25, then 1.125

    That's how our 3rd party ERP does it, and it works fine


    No one suggetsted that because it's not a good idea. After some rearrangements, you might reach the limit of the accuracy of that column, leading to unpredictable results.


    true but in this case it's a case of inserting for example order lines and making sure it's always the same sequence on packing slips and invoices etc, there are not a huge number of records (with the same foreign key to a header table), and there are not a huge number of rearranges. It's to retain sequence after entry



  • @t-bone said:

    true but in this case it's a case of inserting for example order lines and making sure it's always the same sequence on packing slips and invoices etc, there are not a huge number of records (with the same foreign key to a header table), and there are not a huge number of rearranges. It's to retain sequence after entry

    The index method is still safer than the float method.  It's rarely safe to assume that number of changes will be small.  If the number of items (order lines in your example) isn't expected to change much, then there's no real reason to make it dynamic at all.  Put it in a config file.  Don't mess with the complexities of a database unless you need them, and if you do need the complexities of a database, then use it correctly (safely).  If you build in the capability to easily rearrange items, expect that people will use, and even abuse, that capability.  Just because it seems unreasonable to make a lot of changes doesn't mean it won't happen.

    I believe that the insert-in-the-middle method with floats could actually run out of precision pretty quickly, too.



  • @versatilia said:

    Definitely Schema 2 is the way to go, but why hasn't anyone pointed out the use of LIMIT?!


    LIMIT is specific to MySQL, isn't it? 



  • @lizardfoot said:

    @versatilia said:
    Definitely Schema 2 is the way to go, but why hasn't anyone pointed out the use of LIMIT?!


    LIMIT is specific to MySQL, isn't it? 


    Nope.  It exists in most databases.  LIMIT is a reserved word in SQL99, although it doesn't seem to exist as a predicate to SELECT in that standard.

    I wouldn't be surprised to find that MySQL does implements it differently to every other database in terms of syntax, performance and results, though ;-)

    Simon



  • @tufty said:

    @lizardfoot said:
    @versatilia said:
    Definitely Schema 2 is the way to go, but why hasn't anyone pointed out the use of LIMIT?!


    LIMIT is specific to MySQL, isn't it? 


    Nope.  It exists in most databases.  LIMIT is a reserved word in SQL99, although it doesn't seem to exist as a predicate to SELECT in that standard.

    I wouldn't be surprised to find that MySQL does implements it differently to every other database in terms of syntax, performance and results, though ;-)

    Simon

    Doesn't exist in Oracle, I don't think, unless it's a recent (9 or later) addition.



  • @kellyleahy said:

    @tufty said:

    [quote
    user="lizardfoot"]@versatilia said:
    Definitely Schema 2 is the
    way to go, but why hasn't anyone pointed out the use of LIMIT?!


    LIMIT is specific to MySQL, isn't it? 


    Nope. 
    It exists in most databases.  LIMIT is a reserved word in SQL99,
    although it doesn't seem to exist as a predicate to SELECT in that
    standard.

    I wouldn't be surprised to find that MySQL does
    implements it differently to every other database in terms of syntax,
    performance and results, though ;-)

    Simon

    Doesn't exist in Oracle, I don't think, unless it's a recent (9 or later) addition.

    [/quote]

    in Oracle, it's the ROWNUM pseudocolumn..

    select top 3 bla from blubb; 

    ->

    select bla from blubb where rownum<=3;

    but beware of the combination of rownum and "order by":

    select top 3 bla from blubb order by blublu;

    ->

    select bla from (select bla from blubb order by blublu) where rownum<=3;

    ( select bla from blubb order by blublub where rownum<=3 gives you 3 arbitrarily choosen records, nicely sorted)


  • @ammoQ said:

    @kellyleahy said:

    @tufty said:

    @lizardfoot said:
    @versatilia said:
    Definitely Schema 2 is the way to go, but why hasn't anyone pointed out the use of LIMIT?!


    LIMIT is specific to MySQL, isn't it? 


    Nope.  It exists in most databases.  LIMIT is a reserved word in SQL99, although it doesn't seem to exist as a predicate to SELECT in that standard.

    I wouldn't be surprised to find that MySQL does implements it differently to every other database in terms of syntax, performance and results, though ;-)

    Simon

    Doesn't exist in Oracle, I don't think, unless it's a recent (9 or later) addition.



    in Oracle, it's the ROWNUM pseudocolumn..

    yup... that's what I meant.  I was just commenting on the "exists in most databases" comment.  LIMIT, per se, does not exist in Oracle, though Oracle provides a way that this can be done through ROWNUM (with the appropriate caveats applied, as you so carefully noted in your post).



  • @kellyleahy said:

    yup... that's what I meant.  I was just commenting on the "exists in most databases" comment.  LIMIT, per se, does not exist in Oracle, though Oracle provides a way that this can be done through ROWNUM (with the appropriate caveats applied, as you so carefully noted in your post).



    You're right, it seems.  Been a while since I used Oracle, and Sybase has a totally different way of doing things.  Postgres has LIMIT ... OFFSET, I'm not sure what MySQL, DB2, Ingres or Informix have.

    anyway, I apologise for the confusion.

    Simon


Log in to reply