Poll: how often do programs have collections sorted/hashed by pointer/reference value?

boomzilla

There are reasons to use [SELECT DISTINCT]

Yes, of course. But some people use it to mask inefficient and possibly retarded queries. They don't understand what they're doing or why they're getting duplicates, but DISTINCT seems to solve the problem.

blakeyrat

@PJH said:

No. The solution is to randomize the result set in the absence of ORDER BY, and have a short sharp shock.

But we all know that won't happen. Because people will whine.

The type of SQL DBAs who are like, "swapping those rows adds 12 machine code instructions to the execution time? UNACCEPTABLE!"

Meanwhile all of their queries take 4 times longer than they should because they turned-off all of SQL's "unnecessary" features. "I'm too good at SQL to need statistics tracking!" "Resultset caching? Our developers would never be so stupid as to request the same resultset twice!" etc.

I've known people like that. DBA "pros" who have the database servers over-configured to the point where a fucking stock install with stock settings would actually perform twice as fast.

boomzilla

@blakeyrat said:

I guess another solution would be to make the ORDER BY clause required. But I'm guessing that would have a lot more resistance than just juggling the results occasionally.

Yes. There are plenty of times where I don't care about the order, and making it sort just slows stuff down. Or it's working with a large amount of stuff, and I'd rather start getting results as they come in instead of waiting for all of the results so they can be sorted.

Matches

Except the WHOLE POINT of how it works today, is to get the results faster - any modification to the sorting return, or order return impact performance noticeable and directly.

People just need to learn the right way, the database shouldn't enforce this particular item since it literally changes based on memory size, hard disk size, disk paging, and tons of other factors.

PJH

OK - I'm probably entertaining my own sense of hyperbole there, but something like the randomization forced in a development environment would smoke out the egregious reliance on order. In production it could be turned off.

But we all know what would happen in real life.

Matches

And sometimes you just need a 'select distinct' hack to make a problem go away until you have proper time to address it - though if you don't actually schedule the time to fix it, you're an asshole.

Matches

Randomizing results after the dataset page is returned is one thing, reordering the entire query is another. (I'm for your general idea, and against PJHs)

boomzilla

@Matches said:

And sometimes you just need a 'select distinct' hack to make a problem go away until you have proper time to address it - though if you don't actually schedule the time to fix it, you're an asshole.

This person is just stupid. I've given up on trying to train her (7 years later). It's easier to just fix her shit.

Matches

I wasn't arguing for or against, I was just stating (one of) the valid reasons to use distinct arantor might have been referring to.

chubertdev

@PJH said:

OK - what's the trick? I'm missing something and I don't know what it is...

The best RDBMS there is.

chubertdev

@Matches said:

And sometimes you just need a 'select distinct' hack to make a problem go away until you have proper time to address it - though if you don't actually schedule the time to fix it, you're an asshole.

And that's why every professional developer is an asshole.

blakeyrat

@PJH said:

OK - I'm probably entertaining my own sense of hyperbole there, but something like the randomization forced in a development environment would smoke out the egregious reliance on order.

Right, well when I proposed this feature, which we're now talking about despite it being purely fucking hypothetical, that's why I said it should just do it like one in every 20 queries or so. A low-enough number that the DBAs won't whine about the additional CPU usage, but a high-enough number that it'll be guaranteed to come up during any reasonable QA period.

But whatever, it's hypothetical.

boomzilla

@Matches said:

I wasn't arguing for or against, I was just stating (one of) the valid reasons to use distinct arantor might have been referring to.

I understand. I was just pointing out that "scheduling time to fix it" isn't going to help with some people.

Matches

I don't know, sometimes fixing people is good for the long term survival of the race.

antiquarian

@blakeyrat said:

Meanwhile all of their queries take 4 times longer than they should because they turned-off all of SQL's "unnecessary" features. "I'm too good at SQL to need statistics tracking!" "Resultset caching? Our developers would never be so stupid as to request the same resultset twice!" etc.

"DBA"s who turn off statistics updates (unless they schedule their own updates) deserve what they get.

PleegWat

I know we're basically doing 3 kinds of SELECTs:

To insert into another (aggregate) table - don't care about ordering
To turn DB configuration into file configuration. Each selected record becomes a file, so ordering is irrelevant.
To be shown to the user - always sorted

Number 1 is where the most DB time is spent. Ordering would just slow it down.

Arantor

It depends, actually. Virtually everything I do is #3 on your list.

PleegWat

Yeah, I think that was my point. Was that my point?

Arantor

I don't know, I don't have the mind-reading patch installed.

For use cases where #1 is the primary use, sure, you don't want ORDER BY - which is the primarily OLAP use case. For use cases where #3 is the primary case, you will need it all the time - which is the primarily OLTP use case.

PleegWat

Ah, I see already. That we was intended as a 'our product' we, not a generic we. That list was intended as 'what we do in our product'.

I don't know if OLAP vs. OLTP is the key differential - it's more what you're doing with the result set. Our product is definitly OLAP, but when showing to the end user ordering is still important. It's just when you're selecting for batch processing that order may not be important.

Arantor

@PleegWat said:

Ah, I see already. That we was intended as a 'our product' we, not a generic we. That list was intended as 'what we do in our product'.

I don't know if OLAP vs. OLTP is the key differential - it's more what you're doing with the result set. Our product is definitly OLAP, but when showing to the end user ordering is still important. It's just when you're selecting for batch processing that order may not be important.

It's not the key differentiator as such, it's more the style of workload you're doing - and thus which tool you use for the job.

M_Adams

+@

Jaime

@FrostCat said:

Yeah, it's actually (or so I assume) unspecified, which means "we know but we're not going to tell you."

No, it means, we don't know because we are going to pick the best performing method. It's possible that a query will appear to have ordered results one time due to high CPU pressure causing the query to run on only one core and appear unordered five minutes later due to lower CPU pressure allowing the query to be split across all of the core on the box.

@Arantor said:

I have no objections to ORDER BY being compulsory.

That's stupid. That's as arbitrary as "the primary key should always be included in the result set". You ask for what you want - if you want the results ordered, ask for it. Then, the server can do its thing and get you what you asked for as quickly as possible.

Arantor

Yes, but I've had to clear up clusterfucks where the application explicitly relied on the server's behaviour. Where the actual table order on disk was managed specifically to avoid having to ORDER BY in the query.

I only wish I were kidding.

Jaime

@chubertdev said:

Stupid SQL ordering trick:

select 4
union
select 1
union
select 2
union
select 3
```</blockquote>
When this trick works, it works due an implementation detail. Union filters out rows that are exact duplicate but are returned from different selects. In order to do this, sometimes a database engine will store the working results in a temp table or in memory in order because it's much faster to find a duplicate in an ordered set than in an unordered set. Other database engines will return the results in 4th, 1st, 2nd, 3rd order because they dump the first three result sets into the temp table, then stream the fourth directly to the client - filtering out duplicates on the fly. Then the three stored result sets follow.

Jaime

@Arantor said:

Yes, but I've had to clear up clusterfucks where the application explicitly relied on the server's behaviour. Where the actual table order on disk was managed specifically to avoid having to ORDER BY in the query.

But you'd only be trading one problem for another. The problem you have can be solved by training, experience, and code reviews. The problem you would be trading for is solved by buying more hardware. Besides idiots are very creative. I'm sure if you enforced ORDER BY, the idiot would have just written ORDER BY 1 and relied on disk ordering anyways.

Arantor

@Jaime said:

But you'd only be trading one problem for another. The problem you have can be solved by training, experience, and code reviews. The problem you would be trading for is solved by buying more hardware. Besides idiots are very creative. I'm sure if you enforced ORDER BY, the idiot would have just written ORDER BY 1 and relied on disk ordering anyways.

Nope. I got them to do it properly by way of an ORDER BY.

I can occasionally be a force of antiWTF.

Jaime

That's because there was a person involved. The parser would have a lot less luck un-supidifying people.

Arantor

Indeed.

ben_lubar

package main

import "fmt"

func main() {
    m := make(map[int]int)
    for i := 0; i < 100; i++ {
        m[i] = i
    }
    for k, v := range m {
        fmt.Println(k, v)
        break
    }
}

@blakeyrat what does this code print?

Jaime

Who cares? You're iterating through something which is designed to be optimized for random access at the cost of strange ordering and pointing out that it orders strangely.

The OPs original point about making sure that there is always the same code path every time the program executes is not a common issue and is easily addressed by using collections that have deterministic ordering.

Hashing by a pointer is fine because you are not supposed to care about the order, just that you touch every item exactly once. Sorting by a pointer is not a good idea since the sort is meaningless. Note that most implementations of a HashMap won't sort by the pointer, they will sort by a hash of the pointer - possibly with some magic thrown in like multi-level has buckets.

ben_lubar

My point was that it doesn't print the same thing every time because the Go devs noticed people assuming things about ordering that they shouldn't assume so they made map iteration order random.

"No specified order" is great until someone assumes a reason for it and starts depending on that (probably incorrect) reason.

ben_lubar

@Jaime said:

Note that most implementations of a HashMap won't sort by the pointer, they will sort by a hash of the pointer

Okay, let's say I have a 64-bit pointer and I want to convert it to a 64-bit hash.

Oh look, my function has 0 instructions if you inline it.

Jaime

@ben_lubar said:

Oh look, my function has 0 instructions if you inline it.

If your compiler inlines something that you have a stored reference to, it's broke.

ben_lubar

... or I have an optimizing compiler?

Jaime

No, a functioning optimizing compiler only inlines things when it can verify that inlining them won't break the program. If it can account for all of the references and handle them properly, then it can be inlined. A compiler that blindly inlines function and hopes for the best should be taken out back and shot.

blakeyrat

@ben_lubar said:

@blakeyrat what does this code print?

I don't know. I don't even know what language that is, though I'll assume Go. I also don't care. Not one whit.

So the answer is, it prints, "go fuck yourself".

toon

Also sometimes you know an order by is going to incur you with a significant performance hit and you don't care about the order. Having order by be mandatory is going to look real annoying in those cases. That admittedly doesn't happen often but I've been in a few situations like that.

VinDuv

@ben_lubar said:

My point was that it doesn't print the same thing every time because the Go devs noticed people assuming things about ordering that they shouldn't assume so they made map iteration order random.

I think it’s because they randomize the hash function to prevent hash collision DoS attacks. It’s probably not to make developers do the right thing.

PleegWat

That doesn't make sense. A compiler can very well inline some invocations but not others.

Jaime

@PleegWat said:

A compiler can very well inline some invocations but not others.

I never implied it couldn't. The compiler is not required to inline everything that it can, but it is required to only inline things than can be inlined.

Arantor

@ben_lubar said:

My point was that it doesn't print the same thing every time because the Go devs noticed people assuming things about ordering that they shouldn't assume so they made map iteration order random.

"No specified order" is great until someone assumes a reason for it and starts depending on that (probably incorrect) reason.

They listened to @blakeyrat?!?!?

boomzilla

@Jaime said:

if you want the results ordered, ask for it.

ORDER BY ROWNUM

Jaime

SELECT * FROM Developers WHERE thoughts LIKE '%ORDER BY should be required%' AND smart = 1 AND ROWNUM > 1

Arantor

I see the Arantor row has been excluded

Maciejasjmj

@ben_lubar said:

Okay, let's say I have a 64-bit pointer and I want to convert it to a 64-bit hash.

Oh look, my function has 0 instructions if you inline it.

Then it's a shitty hash. If you bucket from the left, you're pretty much guaranteed to get a linked list for any reasonable bucket size. If you bucket from the right, you'll end up with a lot of unused buckets due to alignments.

(actually scratch that, it might just work when taken modulo prime; but it's still shitty)

Besides, it's a moot point, since hash-based sorting ranges from "bad idea" (when your hash is deterministic enough to guarantee some order) to "horrible idea" (when it isn't).

Also, treating a pointer as a hash itself already makes a difference in the contract.

int x = 42;
int y = x*3;

makes sense.

int* x = (int*)42;
int* y = x * 3;

does not, even though you have the same bytes (for any reasonable architecture on which sizeof(int) == sizeof(int*)).

ben_lubar

Does your hashtable magically take an int* for some reason?

PleegWat

I know mine takes void*. Other functions take:

uint64_t(*)(void*)
int(*)(void*,void*)
int(*)(void*,void*,void*)

and possibly some others I'm forgetting.

smallshellscript

@blakeyrat said:

it prints, "go fuck yourself".

All undefined behaviour should do this.

antiquarian

Good ideas thread?