Mixed-endianness

mott555

I was reading up on the Shapefile (.shp) format since there is a good chance I'll have to write some code to parse it. I see one really strange WTF:

@wikipedia said:

The .shp and .shx files have various fields with different endianness, so an implementor of the file formats must be very careful to respect the endianness of each field and treat it properly.

The file header for .shp files has three big-endian values followed by several little-endian values. Each record header has big-endian data, but then the actual record is little-endian. The endianness of the coordinate data is not listed, at least not in Wikipedia.
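To make the mix concrete, here is a minimal Python sketch of decoding the 100-byte main header. The layout is my reading of the ESRI Shapefile Technical Description (file code 9994, five unused ints and the file length in 16-bit words, all big-endian, then version, shape type and eight bounding-box doubles, all little-endian), so check it against the spec before relying on it. `parse_shp_header` and the dictionary keys are names invented for this example.

```python
import struct

HEADER_SIZE = 100  # bytes; assumption taken from the ESRI spec


def parse_shp_header(data: bytes) -> dict:
    """Decode the fixed-size .shp main header (hypothetical helper)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 100 bytes for the header")
    # Big-endian part: file code, five unused ints, length in 16-bit words.
    file_code, *_unused, length_words = struct.unpack(">7i", data[:28])
    # Little-endian part: version and shape type, then eight doubles.
    version, shape_type = struct.unpack("<2i", data[28:36])
    bbox = struct.unpack("<8d", data[36:100])
    return {
        "file_code": file_code,
        "file_length_bytes": length_words * 2,  # words -> bytes
        "version": version,
        "shape_type": shape_type,
        "bbox": bbox,  # Xmin, Ymin, Xmax, Ymax, Zmin, Zmax, Mmin, Mmax
    }
```

The only thing that changes between the two halves is the `>` or `<` prefix in the format string, which is about as painless as a mixed-endian format gets.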

I can't imagine why the format authors decided to do this.

dhromed

Utter insanity.

Xyro

It certainly prevents naive developers from making any assumptions. That's almost a good thing!

lethalronin27

Never heard of shapefiles before, has the format been around a long time? There MUST be some outdated/obscure reason for this. Though, I'd hate to imagine some convoluted system where mixing endianness makes the data MORE convenient to work with.

mott555

@lethalronin27 said:

Never heard of shapefiles before, has the format been around a long time? There MUST be some outdated/obscure reason for this. Though, I'd hate to imagine some convoluted system where mixing endianness makes the data MORE convenient to work with.

I believe the format is at least a couple decades old. I assumed it had to do with legacy systems, but I can't fathom how.

lethalronin27

@Xyro said:

It certainly prevents naive developers from making any assumptions. That's almost a good thing!

Unless he's so naive that he doesn't know what endianness is, and just figures it's not important. Unfortunately I know a couple "developers" fresh out of college that would probably do exactly that.

pure

I think I can hazard a guess...

Classically, "network" endianness is "big endian." I think this makes most sense when reading a stream byte-by-byte off a socket. In Shapefile, the big-endian values at the beginning (file code and length, record number and length) are fairly common in messages passed over a network (e.g. message type + length equivalent, field type + length, etc.). This may be why they are big-endian.

The rest of the values are probably little-endian because x86 systems are little endian and therefore don't require a "flip" from the actual data read (or received in network terms).

Like I said it's just a guess, but it seems plausible. You read 28 bytes off the socket (out of the file), ntohl() the first 4 to get the message code, then ntohl() the 25th-28th and multiply by 2 to get the file length in bytes. Then read that file length (minus header) by casting the remaining values directly to, say, int32_t (assuming you're on x86, otherwise flipping them).
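A rough Python rendering of those steps, as a sketch only. It uses `socket.ntohl` from the standard library in place of the C `ntohl()`, and the native-order `"=I"` format stands in for a raw C cast. `read_code_and_length` is a made-up name, and the "length is in 16-bit words" detail is the assumption behind the multiply by 2.

```python
import socket
import struct


def read_code_and_length(first_28: bytes):
    """Return (file code, file length in bytes) from the first 28 bytes."""
    if len(first_28) != 28:
        raise ValueError("expected exactly 28 bytes")
    # "=I" reads in the host's native byte order, like a raw cast in C.
    raw_code = struct.unpack_from("=I", first_28, 0)[0]
    raw_length = struct.unpack_from("=I", first_28, 24)[0]
    # ntohl converts network (big-endian) order to host order; it is a
    # byte swap on little-endian hosts and a no-op on big-endian ones.
    file_code = socket.ntohl(raw_code)
    length_words = socket.ntohl(raw_length)
    return file_code, length_words * 2  # 16-bit words -> bytes
```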

It seems dumb, but it's the only explanation I can think of...

blakeyrat

@Xyro said:

It certainly prevents naive developers from making any assumptions. That's almost a good thing!

I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

My idea is good. Changing the endianness per-field, especially now that every piece of hardware on Earth* is little endian, is stupid.

The real WTF.

*) Yes pedantic dickweeds, I am aware that PPC CPUs are still in wide use, I didn't mention it purely to wind you up

bedouin98

If you're working in .NET, save yourself some time and take a look at this library. I recently had to parse some shapefiles and it saved me a lot of time.

I did have to add an implementation of the PolygonZ shape type to the library, but that was pretty trivial, the ESRI shapefile specs aren't too hard to figure out.

mott555

@bedouin98 said:

If you're working in .NET, save yourself some time and take a look at this library. I recently had to parse some shapefiles and it saved me a lot of time.

I did have to add an implementation of the PolygonZ shape type to the library, but that was pretty trivial, the ESRI shapefile specs aren't too hard to figure out.

Good find. I wonder if it works with Silverlight.

antiquarian

@blakeyrat said:

I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

A better solution would be using snoofle's shiny new cluebat on any programmer that would make such an assumption.

pjt33

@blakeyrat said:

I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

Why only 1/20th of the time?

blakeyrat

@pjt33 said:

@blakeyrat said:
I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

Why only 1/20th of the time?

You don't want to do it enough that it's a performance hit, but you want to do it often enough so that it's a "fail early" thing and happens during dev/testing.

Edit: you also don't want bad devs thinking results ALWAYS come out in random order when you don't add an ORDER BY because that assumption is also false.

mott555

@blakeyrat said:

@pjt33 said:
@blakeyrat said:
I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

Why only 1/20th of the time?

You don't want to do it enough that it's a performance hit, but you want to do it often enough so that it's a "fail early" thing and happens during dev/testing.

Edit: you also don't want bad devs thinking results ALWAYS come out in random order when you don't add an ORDER BY because that assumption is also false.

Then snoofle would be writing about his team's SQL queries that most of the time run in 6 hours, but every twentieth query takes 3 days.

topspin

@blakeyrat said:

The real WTF.

What? The??

I can't be arsed to read the whole article, but WHY THE FUCK is that in there? That definitely is the real WTF you found there.

boomzilla

@topspin said:

@blakeyrat said:
The real WTF.

I can't be arsed to read the whole article, but WHY THE FUCK is that in there? That definitely is the real WTF you found there.

Big vs Little Endian was first applied in relation to eating eggs, you filthy Yahoo.

Xyro

@topspin said:

@blakeyrat said:
The real WTF.
What? The??
I can't be arsed to read the whole article, but WHY THE FUCK is that in there? That definitely is the real WTF you found there.

Yeah! That's totally upside-down! WTF!!!

StephenCleary1

@blakeyrat said:

... especially now that every piece of hardware on Earth is little endian ...

As someone who has done a ton of development for custom hardware, I must take issue with that statement.

The CPUs used most commonly in desktop computers use a little endian memory interface, and serial ports and USB usually have configurable endianness, that is all. Cell phones, tablets, and most hardware components I can think of (Ethernet, CAN, SPI, FPGAs) are big-endian by nature.

If you work at the hardware level (e.g., one-bit-at-a-time transfers), big-endian makes a lot more sense because it's consistent. A few months ago I had to explain to an EE (programming an FPGA) that I would prefer the data in little-endian bytes with big-endian bits (which is what "little endian" really is). His "WTF?!" response was priceless.

-Steve (a .NET dev who became a firmware guy last year)
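A small Python sketch of the ordering described above: bytes emitted least significant first, but each byte written most significant bit first. It only illustrates that one combination next to the fully big-endian one and makes no claim about what any particular bus does. `wire_bits` is a hypothetical helper and the value is arbitrary.

```python
def wire_bits(value: int, num_bytes: int, byteorder: str = "little") -> str:
    """Render value as a bit string: chosen byte order, MSB-first bits."""
    data = value.to_bytes(num_bytes, byteorder)
    # format(b, "08b") always prints the most significant bit first.
    return " ".join(format(b, "08b") for b in data)


# 0x0102 with little-endian bytes and big-endian bits:
print(wire_bits(0x0102, 2, "little"))  # 00000010 00000001
# The same value with big-endian bytes and big-endian bits:
print(wire_bits(0x0102, 2, "big"))     # 00000001 00000010
```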

Mcoder

@pure said:

I think I can hazard a guess...

Classically, "network" endianness is "big endian." I think this makes most sense when reading a stream byte-by-byte off a socket. In Shapefile, the big-endian values at the beginning (file code and length, record number and length) are fairly common in messages passed over a network (e.g. message type + length equivalent, field type + length, etc.). This may be why they are big-endian.

As old as they are, I don't think shape files precede TCP.

blakeyrat

The good news is I don't have to call Steve a pedantic dickweed because I already did it preemptively.

boomzilla

@Mcoder said:

@pure said:

I think I can hazard a guess...

Classically, "network" endianness is "big endian." I think this makes most sense when reading a stream byte-by-byte off a socket. In Shapefile, the big-endian values at the beginning (file code and length, record number and length) are fairly common in messages passed over a network (e.g. message type + length equivalent, field type + length, etc.). This may be why they are big-endian.
As old as they are, I don't think shape files precede TCP.

On a similar note, I don't think that steam engines precede windmills. But probably dildos precede vibrators.

boomzilla

@blakeyrat said:

The good news is I don't have to call Steve a pedantic dickweed because I already did it preemptively.

But he dickweeded you in a different direction than you preempted! Anyways, I thought his comment was interesting, because I didn't know all that stuff (in addition to PPC, which I did know) was big endian.

Are you against discussions, or just replies to your posts?

PJH

@boomzilla said:

But probably dildos precede vibrators.

Well, yes. Dildos were around 28,000 years ago. Vibrators were originally invented as a piece of medical equipment in the late 1800's to help doctors treat women with hysteria, since manual manipulation was so tedious and time consuming.

zelmak

@blakeyrat said:

*) Yes pedantic dickweeds, I am aware that PPC CPUs are still in wide use, I didn't mention it purely to wind you up.

Actually, you forgot to mention SPARC ...

@boomzilla said:

Are you against discussions, or just replies to your posts?

The answer to that question is obviously "yes".

boomzilla

@El_Heffe said:

@boomzilla said:
Are you against discussions, or just replies to your posts?
The answer to that question is obviously "yes".

Actually, proper punctuation says that it should be:

Yes.

Zemm

@boomzilla said:

Actually, proper punctuation says that it should be:
Yes.

Actually, British English lets you put punctuation outside quotes. Which makes more sense in many cases. Log in and type "rm -rf folder ." - if you include the dot in that command it will delete the current directory, not just "folder".

boomzilla

@Zemm said:

@boomzilla said:
Actually, proper punctuation says that it should be:

Yes.

Actually, British English lets you put punctuation outside quotes. Which makes more sense in many cases. Log in and type "rm -rf folder ." - if you include the dot in that command it will delete the current directory, not just "folder".

Well, since I'm not British (and neither are you), you're wrong to imply that's proper punctuation.

Ben L.

An egg in an egg cup with the little-endian portion oriented upward.

THIS JUST IN: EGGS HAVE BYTE ORDER

ASheridan

@boomzilla said:

@Zemm said:
@boomzilla said:
Actually, proper punctuation says that it should be:
Yes.

Actually, British English lets you put punctuation outside quotes. Which makes more sense in many cases. Log in and type "rm -rf folder ." - if you include the dot in that command it will delete the current directory, not just "folder".

Well, since I'm not British (and neither are you), you're wrong to imply that's proper punctuation.

I think the clue is in the name: "English", which comes from England, which is part of Britain...

boomzilla

@ASheridan said:

I think the clue is in the name: "English", which comes from England, which is part of Britain...

They should have thought of that before they lost the war.

dhromed

@blakeyrat said:

The real WTF.

Oh my god this is awesome.

pure

@Mcoder said:

@pure said:
I think I can hazard a guess...

Classically, "network" endianness is "big endian." I think this makes most sense when reading a stream byte-by-byte off a socket. In Shapefile, the big-endian values at the beginning (file code and length, record number and length) are fairly common in messages passed over a network (e.g. message type + length equivalent, field type + length, etc.). This may be why they are big-endian.

As old as they are, I don't think shape files precede TCP.

I once glanced at a hedge. What's your point?

ochrist

@boomzilla said:

@topspin said:
@blakeyrat said:
The real WTF.
I can't be arsed to read the whole article, but WHY THE FUCK is that in there? That definitely is the real WTF you found there.
Big vs Little Endian was first applied in relation to eating eggs, you filthy Yahoo.

I believe Jonathan Swift is to blame:

Lilliput and Blefuscu - Wikipedia

Xyro

@Ben L. said:

THIS JUST IN: EGGS HAVE BYTE ORDER

Bite order, yes.

ZPedro

I… have seen that occur. The format was in fact a format inside a meta-format. Consider XML. It's not so much a format by itself as a mechanism for defining formats; it's a meta-format. In my case the meta-format was little endian due to its Windows origins, but the format inside it had its own fields in big endian byte order as that format was defined on the Mac and repackaged in the meta-format for porting purposes. There may not be one format author (or one authoring team), there may be two at different points in time.

MiffTheFox

@Zemm said:

Actually, British English lets you put punctuation outside quotes. Which makes more sense in many cases. Log in and type "rm -rf folder ." - if you include the dot in that command it will delete the current directory, not just "folder".

Well that's why you end every sentence telling people what to type with "(without the quotes)".

Then again I always use backticks to quote code/shell commands because you never know when you need to have syntactically relevant quotes inside.

ZPedro

What are we calling "every piece of hardware", anyway? There is a big difference between "every piece of hardware a mainstream programmer can target", and "every kind of processor diffused in a fab". I for one did not forget how to code in an endian-independent fashion, so that my code would also work on a big-endian machine, when Apple switched to Intel procs then released the iPhone. It is true, though, that ARM as used in the applications processors in all mobile phones and tablets and x86 are little-endian and most mainstream programmers will only know that for the foreseeable future.
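For what it's worth, "endian-independent" mostly means never reinterpreting raw memory and instead assembling values from individual bytes with shifts, so the result depends only on the format's byte order and not the host's. A minimal sketch of that idea in Python; `read_u32_le` and `read_u32_be` are hypothetical helpers, not from any library.

```python
def _four_bytes(buf: bytes, offset: int) -> bytes:
    """Slice out exactly four bytes or fail loudly."""
    chunk = buf[offset:offset + 4]
    if len(chunk) != 4:
        raise ValueError("need 4 bytes at the given offset")
    return chunk


def read_u32_le(buf: bytes, offset: int = 0) -> int:
    """Unsigned 32-bit value stored least significant byte first."""
    b = _four_bytes(buf, offset)
    return b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24)


def read_u32_be(buf: bytes, offset: int = 0) -> int:
    """Unsigned 32-bit value stored most significant byte first."""
    b = _four_bytes(buf, offset)
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]
```

The same shifts translate directly to C, where they avoid both the byte-order problem and unaligned casts.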

The question is, do you want to limit yourself to being a mainstream programmer? You never know when you might need to do code that will run on a console (all big-endian), or do network programming, or work with existing formats, etc.

(bit order? Every machine I can think of is, at best, byte addressed, so programming the software to be little-endian only forces the ordering down to the byte, and if the hardware addresses down to the bit, it can be designed to do so in a little-endian fashion as well, so I don't see how little-endian is less consistent).

Mcoder

@pure said:

I once glanced at a hedge. What's your point?

Hardly anybody passes a shape file through a bare network stream; people pass files through TCP or UDP.

Now, knowing that, your guess is either out, or the mixed endianness is a WTF on a completely new level.

boomzilla

@ZPedro said:

What are we calling "every piece of hardware", anyway?

That was just blakey trolling by mentioning his limited experience / world he cares about. He just likes to flame people who point that sort of thing out to him, or who have a different set of experiences.

Vila_Restal

Cry for the endians
Die for the endians
Cry for the endians
Cry, cry, cry for the endians

WARDANCE!!!

Vila_Restal

@boomzilla said:

@ASheridan said:
I think the clue is in the name: "English", which comes from England, which is part of Britain...
They should have thought of that before they lost the war.

Which war was that precisely?

Vietnam? Oh no we weren't involved in that - that one you managed to lose (didn't have help you see).

Nope, nothing last century, or the century before.

Oh you mean the "War" of Independence. That little skirmish. We just had bigger fish to fry at the time.

It's good to see our colony is managing ok on its own.

boomzilla

@Vila Restal said:

@boomzilla said:
@ASheridan said:
I think the clue is in the name: "English", which comes from England, which is part of Britain...

They should have thought of that before they lost the war.

Which war was that precisely?

Vietnam? Oh no we weren't involved in that - that one you managed to lose (didn't have help you see).

Nope, nothing last century, or the century before.

Oh you mean the "War" of Independence. That little skirmish. We just had bigger fish to fry at the time.

It's good to see our colony is managing ok on its own.

I guess it's probably for the best that your rationalization skills have not gone the way of your empire.

I'm not sure what your point is about Vietnam (maybe you need to learn more about the English language in order to get your point across?). You have an odd conception of a lost war. When we left, the South still existed. Of course, leave something important (like supporting allies) to traitors like Ted Kennedy, and, well...

pjt33

@blakeyrat said:

@pjt33 said:
@blakeyrat said:
I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

Why only 1/20th of the time?

You don't want to do it enough that it's a performance hit, but you want to do it often enough so that it's a "fail early" thing and happens during dev/testing.

What's wrong with it being a performance hit? The whole point is to discourage people from using it.

Edit: you also don't want bad devs thinking results ALWAYS come out in random order when you don't add an ORDER BY because that assumption is also false.

Fair enough. But you should raise the frequency to something like 2/5 of the time, because bad devs probably won't run the query 20 times before they commit it to revision control and move on to the next task.

Hatshepsut

@Vila Restal said:

Which war was that precisely?

Vietnam? Oh no we weren't involved in that - that one you managed to lose (didn't have help you see)

Erm... Do you really think the US lost that war?

boomzilla

@pjt33 said:

@blakeyrat said:
@pjt33 said:
@blakeyrat said:
I've always maintained that 1/20th of the time, SQL queries without an "ORDER BY" should return the results in randomized order, to prevent programmers from assuming that a query lacking an ORDER BY would always return results in a set order.

Why only 1/20th of the time?

You don't want to do it enough that it's a performance hit, but you want to do it often enough so that it's a "fail early" thing and happens during dev/testing.

What's wrong with it being a performance hit? The whole point is to discourage people from using it.

From using the database? Do you really always care about the order of things? I sure don't. The point is not to rely on the arbitrary ordering that one finds at any given point. I think this would be a good setting to be able to turn on in a DB in some development or testing modes. I doubt I would ever want this turned on in a production environment.
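As far as I know no database ships such a switch, but the idea can be faked in test code. A rough sketch using Python's built-in `sqlite3`: a wrapper that, with some probability, shuffles the rows of any query lacking an ORDER BY. `fetch_all_dev` and `shuffle_probability` are made up for this example, and the substring check is deliberately crude (a real version would need to parse the statement).

```python
import random
import sqlite3


def fetch_all_dev(conn, sql, params=(), shuffle_probability=0.05, rng=random):
    """Run a query; sometimes shuffle the rows if it has no ORDER BY."""
    rows = conn.execute(sql, params).fetchall()
    # Crude assumption: any "order by" text means the order is intended.
    has_order_by = "order by" in sql.lower()
    if not has_order_by and rng.random() < shuffle_probability:
        rng.shuffle(rows)  # shuffles the list in place
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 11)])

# Default 1/20 chance of a shuffled result for the unordered query.
print(fetch_all_dev(conn, "SELECT n FROM t"))
# Queries with ORDER BY are never touched.
print(fetch_all_dev(conn, "SELECT n FROM t ORDER BY n"))
```

Only call it from a dev or test build; in production you would swap in a plain `fetchall()`.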

pure

@Mcoder said:

Hardly anybody passes a shape file through a bare network stream; people pass files through TCP or UDP.

Troll?

lethalronin27

@Vila Restal said:

@boomzilla said:
@ASheridan said:
I think the clue is in the name: "English", which comes from England, which is part of Britain...
They should have thought of that before they lost the war.

Which war was that precisely?

Vietnam? Oh no we weren't involved in that - that one you managed to lose (didn't have help you see).

Nope, nothing last century, or the century before.

Oh you mean the "War" of Independence. That little skirmish. We just had bigger fish to fry at the time.

It's good to see our colony is managing ok on its own.

I'm not sure how or why this thread started on UK vs USA, but our square footage is much higher than yours, so clearly we win. We can just pick ourselves up and squash your little island like a pancake any time we like...and the same thing goes for the people who live here!

DaveK1

@Hatshepsut said:

@Vila Restal said:
Which war was that precisely?
Vietnam? Oh no we weren't involved in that - that one you managed to lose (didn't have help you see)

Erm... Do you really think the US lost that war?

As everyone knows, Vietnam has been a multi-party capitalist democracy ever since the triumphant American victory in the Vietnam War eliminated Communism from the Indochinese Peninsula forever and reunited both halves of Vietnam under the leadership of the South.

boomzilla

@DaveK said:

As everyone knows, Vietnam has been a multi-party capitalist democracy ever since the triumphant American victory in the Vietnam War eliminated Communism from the Indochinese Peninsula forever and reunited both halves of Vietnam under the leadership of the South.

I'm trying to find a point somewhere in there, but I don't think there's any point in making the attempt.