Poll: how often do programs have collections sorted/hashed by pointer/reference value?



  • If you're in C++, a collection like std::set<Foo*> or std::unordered_set<Foo*> could store the elements in different orders between runs because the "same" allocations could get different pointer addresses. The same thing can happen if you have a TreeSet<Foo> in Java or C# if you don't define hash()/equals() for Foo for almost the same reason. (It's not quite a memory address in that case, of course.)

    So... how often do you think this happens in "real-world" programs?

    • Very few programs do this (say... < 20%)
    • Not many do this (20-50%)
    • Many do this (50-70%)
    • Almost all do this (70-110%)

    For purposes of this poll, let's restrict to a somewhat C-ish language; e.g. not Python. (I think my dividing line here is something like whether there's a hash-map literal...)

    Edit: To be clear, I am only asking about cases where the hash or sort order is the object identity itself. If you have a std::set<Foo*, FooComparator> or something that looks into the objects, that doesn't count.



  • SORT_NOT_FOUND

    Body is invalid; try to be a little more descriptive


  • Discourse touched me in a no-no place

    Good question.

    This sort of thing has been one of the banes of my life at times, trying to get a bunch of things stored in a hash table and with interlinks as well to tear themselves apart correctly when I pulled the plug. At least twice I thought I had it cracked, only to find that changing the hash table iteration order would mess things up (with a nasty crash; this was in C).



  • I think the answer to this is, "who gives a shit? Start thinking at a higher level and solve actual practical problems instead of dicking around with bullshit like this."



  • @EvanED said:

    To be clear, I am only asking about cases where the hash or sort order is the object identity itself. If you have a std::set<Foo*, FooComparator> or something that looks into the objects, that doesn't count.

    Why are you asking? Because whyever you're asking, you're up to do something wrong. Either you're going to rely on order on enumeration in a hashtable (don't do that), compare pointers to different object (don't do that), or serialize hashes (why would you even want to do that).

    As long as you adhere to the contracts and remember that pointers are effectively noncomparable entities,

    @blakeyrat said:

    who gives a shit?



  • PHP has actually had security vulnerabilities in itself because of this sort of issue where the unsetting of a member of such a set wasn't cleaned house properly. And PHP is technically 'C-like' so that counts, right?



  • @blakeyrat said:

    I think the answer to this is, "who gives a shit? Start thinking at a higher level and solve actual practical problems instead of dicking around with bullshit like this."

    Maybe you haven't noticed, but even high-level programmers have to deal with things like hashed and sorted collections. And doing that can affect high-level behavior. The company I work for a few years ago went through and put some deliberate effort into eliminating collections like this because they made run-to-run comparisons and hence debugging very difficult -- even in collections where the order didn't matter in terms of the final results, it affected how the program got to those results, and non-deterministic allocators and such could introduce non-trivial nondeterminism into the program's internal behavior. That's something that could affect programmers in most languages and at almost any layer of abstraction.

    @Maciejasjmj said:

    Why are you asking? Because whyever you're asking, you're up to do something wrong. Either you're going to rely on order on enumeration in a hashtable (don't do that), compare pointers to different object (don't do that), or serialize hashes (why would you even want to do that).
    I don't want to do that. I was trying to come up with scenarios where programs actually care about the concrete memory addresses of objects because, ideally, I'd like the answer to be "rarely"... but I suspect it's closer to "commonly", and this seems like the most dangerous cause.

    (Loosely speaking, I work on bug-hunting software -- think like FindBugs but, uh, probably better and on other languages -- and a "rarely" answer would simplify something I'm thinking about now a lot. Basically, for how many programs would that assumption fail on.)



  • @EvanED said:

    Maybe you haven't noticed, but even high-level programmers have to deal with things like hashed and sorted collections.

    Not... really?

    @EvanED said:

    The company I work for a few years ago went through and put some deliberate effort into eliminating collections like this because they made run-to-run comparisons and hence debugging very difficult -- even in collections where the order didn't matter in terms of the final results, it affected how the program got to those results, and non-deterministic allocators and such could introduce non-trivial nondeterminism into the program's internal behavior.

    Seriously? Man, you guys should try debugging some SQL queries without ORDER BY clauses, your heads would explode.



  • They would just add an ORDER BY clause. Easy.

    [size=9]I'm not even objecting to that. The only downside would be any performance penalty, but real high level programmers evidently don't deal with performance, so clearly there's no downside.[/size]


  • Discourse touched me in a no-no place

    @created_just_to_disl said:

    They would just add an ORDER BY clause. Easy.

    I'm not even objecting to that. The only downside would be any performance penalty

    In cases where that was a concern, I've been in the habit of "read the records in the order they come, and copy into a temp-table with an appropriate index. Then read the temp-table, which is nicely sorted." Usually works pretty well, even if it's not the most clean thing in the world.



  • @blakeyrat said:

    without ORDER BY clauses

    Aren't all queries ordered by the PK of the table by default?

    I'm leaving that for future junior WTF'ers because this only happens when you're running your queries in one table or view. When you start joining results, this isn't guaranteed. Also, I'm not sure every RDBMS does this either.

    Ok, definitely not.



  • @EvanED said:

    I don't want to do that. I was trying to come up with scenarios where programs actually care about the concrete memory addresses of objects because, ideally, I'd like the answer to be "rarely"... but I suspect it's closer to "commonly", and this seems like the most dangerous cause.

    Then you're asking the wrong question. Because if you stuff things into an unordered_set (i.e. a hash table), it's fine to hash by pointers. And if you then rely on the order of enumeration, you're doing it wrong not because of pointers, but because you rely on the order of an unordered set in the first place!

    Now stuffing pointers into ordered collections is a Bad Idea anyway - as I said, they're not contractually comparable.


  • Discourse touched me in a no-no place

    @Eldelshell said:

    Aren't all queries ordered by the PK of the table by default?

    I'm leaving that for future junior WTF'ers because this only happens when you're running your queries in one table or view.

    Uh huh... here - have a black swan from our very own Discourse:

    #select * from user_visits limit 10;
     id  | user_id | visited_at | posts_read 
    -----+---------+------------+------------
       1 |       1 | 2014-02-03 |          0
       2 |       1 | 2014-02-05 |          0
     100 |     185 | 2014-05-19 |          4
       3 |       1 | 2014-02-06 |          2
       4 |       3 | 2014-02-06 |          2
       5 |       4 | 2014-02-06 |          1
       6 |       1 | 2014-02-07 |          0
       7 |       3 | 2014-02-07 |          0
       8 |       4 | 2014-02-07 |          0
       9 |       4 | 2014-02-08 |          0
    (10 rows)
    
    #\d user_visits
                               Table "public.user_visits"
       Column   |  Type   |                        Modifiers                         
    ------------+---------+----------------------------------------------------------
     id         | integer | not null default nextval('user_visits_id_seq'::regclass)
     user_id    | integer | not null
     visited_at | date    | not null
     posts_read | integer | default 0
    Indexes:
        "user_visits_pkey" PRIMARY KEY, btree (id)
        "index_user_visits_on_user_id_and_visited_at" UNIQUE, btree (user_id, visited_at)
    


  • @Maciejasjmj said:

    And if you then rely on the order of enumeration, you're doing it wrong not because of pointers, but because you rely on the order of an unordered set in the first place!

    There are three levels of "relying on the order of an unordered set" (though the division between the first two is somewhat qualitative):

    1. It affects the sequence of internal states only, but all output is the same.
    2. The order selects among multiple correct (or "correct") outputs.
    3. The order determines whether the output is correct.

    Unless your position is that you should never iterate an unordered set at all, (1) will always apply. Even (2) can be fine.

    For example, suppose you have a set of Employee objects, and you want to find the person who has been at the company the longest but you don't have a key so have to do a linear scan. (I'm making this example up as I go.) If you know that you only have one Employee object for each person, it really isn't unreasonable to just store a set of those pointers (say if you don't have an employee ID number -- I don't claim this example is good style or realistic). However, if your allocator returns different memory blocks from run-to run (which some do) or if your program gets a different heap base address (which will almost certainly happen), then even with the exact same input your program will store things in the set in a different order. That won't affect the "who has been here the longest" answer, but it will affect how it got to it.

    Contrast that to if you have a key that is deterministic based on your input, like an actual employee ID, and you hash on that instead. Now the order isn't predictable if all you know is what's in it, but if you feed the program the same input it will not just come up with the same final answer (who has been there the longest) but it will actually do the same computations in the same order.

    That difference matters for what I'm thinking about, but it also can matter for debugging, especially for memory corruption type bugs.



  • I know an application that for ten years relied on the ordering predicate of MySQL to order its data correctly, where queries explicitly got ORDER BY NULL to prevent them from being reordered otherwise except for a table scan as the data is expected to be in the correct order at table level. (And avoids sorting the data)



  • @EvanED said:

    The same thing can happen if you have a TreeSet<Foo> in Java or C# if you don't define hash()/equals() for Foo for almost the same reason.

    TreeSet uses comparisons, not hashing, and there's no default comparison in those languages. Did you mean HashSet?



  • Stupid SQL ordering trick:

    select 4
    union
    select 1
    union
    select 2
    union
    select 3
    


  • BTW, "what is the default order for a SQL query with no ORDER BY clause?" is a great interview question, to weed-out people like Eldelshell from fucking up all your queries.



  • @blakeyrat said:

    BTW, "what is the default order for a SQL query with no ORDER BY clause?" is a great interview question, to weed-out people like Eldelshell from fucking up all your queries.

    Correct answer: "WTF-ery"



  • @blakeyrat said:

    BTW, "what is the default order for a SQL query with no ORDER BY clause?" is a great interview question, to weed-out people like Eldelshell from fucking up all your queries.

    Although, do you even care for most of the queries you do?

    Filed under I should write a database adapter that randomly permutes the results of queries without an ORDER BY in them



  • @tarunik said:

    Although, do you even care for most of the queries you do?

    You'd be surprised how many programs assume an order. (Generally, time-of-insertion order.)

    If I ran the universe of SQL, I'd code SQL DBMSes so that if no order was requested, one in every 20 or so queries would swap two result lines. Just so developers don't get complacent when writing their code.



  • @blakeyrat said:

    You'd be surprised how many programs assume an order. (Generally, time-of-insertion order.)

    If I ran the universe of SQL, I'd code SQL DBMSes so that if no order was requested, one in every 20 or so queries would swap two result lines. Just so developers don't get complacent when writing their code.

    Why every 20? I think every query would drive the point home. Granted, I have no idea what perf penalty that would incur.



  • @powerlord said:

    Why every 20?

    Well let's have a long boring debate about a hypothetical proposed feature that'll never exist!!!!!

    Look, WHEN you work on the SQL Server team, and WHEN you convince your bosses to add this feature, and WHEN you get buy-off on it and are reasonably sure it'll ship, then come back here and ask, "why 20?"


  • FoxDev

    not that i know the answer but without a group by or order by clause i'm going to guess:

    • Bu row order in the database unless it uses some sort of key field to resolve the query in which case it would be by row order of the first used key metadata table thingy.

    Have i mentioned i'm not a DBA type person and really only know enough SQL to be super dangerous with it?



  • Pretty sure the standard doesn't guarantee an order when no ORDER BY is present, so the DB server is free to return rows in whatever order it wants.



  • Yay for undefined bahaviour!



  • @Maciejasjmj said:

    effectively noncomparable entities,

    Not true, you can compare to make sure it actually is a pointer!

    But no. If you're doing any type of work where you're trying to sort hashes in the way you're insinuating, you are most certainly doing it wrongtm.


  • FoxDev

    @powerlord said:

    Pretty sure the standard doesn't guarantee an order when no ORDER BY is present, so the DB server is free to return rows in whatever order it wants.

    well yes, there is no guarantee without orderby, but there is a consistent ordering of some kind. i was taking a guess at what it might be.

    however:

    IF YOU CARE ABOUT ORDER THEN SPECIFY A DAMNED *ORDER BY* CLAUSE!



  • SQL is typically 'sorted by magic' where the magic is things like memory cache, and disk location. (Still a significant simplification) - but ultimately, while you're more likely to get things returned in order because primary keys are generally stored clustered on disk, you can still get some sillyness in the return due to page sizes and disk seek.



  • Just throw this in your stored proc:

    order by newid()



  • @accalia said:

    not that i know the answer but without a group by or order by clause i'm going to guess:

    You're wrong.

    If there's no ORDER BY clause, the resulting data is in NO PARTICULAR ORDER AT ALL. (It may appear to be in a particular order, but that's an illusion and exactly why I think SQL should juggle the results around in this case.)

    @accalia said:

    Have i mentioned i'm not a DBA type person and really only know enough SQL to be super dangerous with it?

    That's fine, know your limitations and all.

    @aliceif said:

    Yay for undefined bahaviour!

    It's not undefined. You get the exact rows of results you asked for. It just doesn't sort them in any way.

    @accalia said:

    IF YOU CARE ABOUT ORDER THEN SPECIFY A DAMNED ORDER BY CLAUSE!

    If someone answered that to the interview question, you may hire them.

    @Matches said:

    Just throw this in your stored proc:

    order by newid()

    That doesn't help people out there writing sprocs that assume wrong behavior.


  • FoxDev

    @blakeyrat said:

    You're wrong.

    If there's no ORDER BY clause, the resulting data is in NO PARTICULAR ORDER AT ALL. (It may appear to be in a particular order, but that's an illusion and exactly why I think SQL should juggle the results around in this case.)

    i'm not surprised. I got asked about joins in my interview for this job.

    as i recall my answer was: "well a cross join is almost always a bad idea as it's the cartesian product of the tables; every row in a joined to every row in b.... as for the others the differece is how non matching records in a and b are handled, but it's have to look it up which option does what"

    i still got the job (/me is app developer working with DBAs)



  • Be concerned when people say cross joins are fine when 'distinct' clause is included as part of the select.


  • FoxDev

    i get concerned when people idiomatically put distinct in their queries.

    if you need distinct that much you probably have a bigger issue with your schema or query....



  • I never thought of that because I always use order by.

    Anyway, how many @blakeyrat there are? Because it must be exhausting to be a dickhead in so many different threads at the same time.



  • There are reasons to use it. For instance, if I have a table that I use to log whenever users modify particular records in another table and want to see which users have ever modified a specific record...

    SELECT DISTINCT u.username FROM users u INNER JOIN task_log tl ON u.user_id = tl.user_id WHERE tl.task_id = ?

    That query just needs a few adjustments to actually run on the dev DB for the app I'm currently working on.


  • Discourse touched me in a no-no place

    @blakeyrat said:

    Yay for undefined bahaviour!

    It's not undefined.

    Yeah, it's actually (or so I assume) unspecified, which means "we know but we're not going to tell you."


  • FoxDev

    there are uses for it yes.

    i'm talking about the people who when ask to write a select query will, before doing anything else or hearing further, write:

    SELECT DISTINCT

    that's just doing it wrong. right, @CodingHorrorBot?


  • 🔀

    @accalia Is Doing It Wrong™



  • It's not even that. It's

    Every person who supports the DB is free to implement what is most optimized for their database engine. It's the DB engine's 'secret sauce' if you will.

    It has everything to do with optimizing reads/writes.



  • @Eldelshell said:

    Anyway, how many @blakeyrat there are?

    I'm the wind, baby.


  • Discourse touched me in a no-no place

    @blakeyrat said:

    I'm the wind, baby.

    Yeah, the nether wind.



  • #ORDER_NOT_FOUND

    Body is invalid; try to be a little more descriptive


  • Discourse touched me in a no-no place

    @accalia said:

    IF YOU CARE ABOUT ORDER THEN SPECIFY A DAMNED ORDER BY CLAUSE!

    It's better than that. If the results of the query are going to be showed to the user (well, to a normal user) perhaps via some sort of web interface, you'd better care about the order. Why? Because otherwise you cannot possibly implement paging right.

    We have a site where we've Done It Wrong. If you're looking at a collection of things (i.e., the result of a DB query) and that collection is bigger than one page worth, when you go to the second page, what you get is actually just another random slice through the data. It's possibly to go through the entire collection and not see some members, simply because the random selection never picked it. And this happened for us in the middle of a public demonstration where we were trying to find some useful stuff that we knew we'd packed inside the application specifically for use in the demo. (The details of the collection don't matter.) It was a bit embarrassing, being unable to find an item that we knew we'd put in there and which had been trivially locatable during testing.

    Always order the results of a query when it's fuelling users' views, and make sure that the ordering is guaranteed to be consistent (in case you're ordering by a non-unique column). No exceptions. You can order by any column by default; the row-id number would work (the DB has one). You just need to be sure that if you do the query twice (without intervening writes) then you'll get everything in the same order.


  • Discourse touched me in a no-no place

    @chubertdev said:

    Stupid SQL ordering trick:

    select 4
    union
    select 1
    union
    select 2
    union
    select 3
    ```</blockquote>
    
    OK - what's the trick? I'm missing something and I don't know what it is...
    
    ```sql
    postgres=# select 4
    postgres-# union
    postgres-# select 1
    postgres-# union
    postgres-# select 2
    postgres-# union
    postgres-# select 3;
     ?column? 
    ----------
            1
            2
            4
            3
    (4 rows)
    
    Time: 67.952 ms
    
    mysql> select 4
        -> union
        -> select 1
        -> union
        -> select 2
        -> union
        -> select 3;
    +---+
    | 4 |
    +---+
    | 4 |
    | 1 |
    | 2 |
    | 3 |
    +---+
    4 rows in set (0.04 sec)
    

    @tarunik said:

    Although, do you even care for most of the queries you do?

    Your thoughts on how to do the one presented in this post in a different way without relying on ordering would be appreciated. If it'll be quicker all the better (post there if you're answering seriously please):

    http://what.thedailywtf.com/t/5-posters-unite/4069/16?u=pjh

    @blakeyrat said:

    one in every 20 or so queries would swap two result lines

    This. Only more frequently. Like every single time. And with the whole result set.

    @aliceif said:

    Yay for undefined bahaviour!

    It is well defined. "Results returned in the absence of an ORDER BY clause may be returned in any order" is a definition. It stems from Set Theory of which database tables are one instance. Sets are inherently unordered, and asking a set for its members will result in an unordered answer.

    In reality, and where the problem stems from, is that database software typically produces the answer in a consistent manner.

    "A query missing an ORDER BY clause will result in undefined behaviour [i.e. returning nothing, what you asked for or random shit, pick one of the three every time you query]" is undefined behaviour.



  • I guess another solution would be to make the ORDER BY clause required. But I'm guessing that would have a lot more resistance than just juggling the results occasionally. Also it goes against the SQL language spec that virtually everybody ignores anyway.


  • ♿ (Parody)

    @accalia said:

    i get concerned when people idiomatically put distinct in their queries.

    I have been fighting this here. There are a couple of perpetrators...one I have a bit of respect for, the other has no redeeming qualities that I've been able to find.



  • I have no objections to ORDER BY being compulsory.


  • Discourse touched me in a no-no place

    No. The solution is to randomize the result set in the absence of ORDER BY, and have a short sharp shock.

    But we all know that won't happen. Because people will whine.

    Same thing happens with UB on C compilers and source that actually relies on UB to work. They moan that the current compiler's UB is different from last time - the fix is usually not fixing the broken code but changing the compiler to the old UB...


  • Discourse touched me in a no-no place

    I do. As mentioned, there is a not insignificant number of occasions where it's not necessary.


Log in to reply