UCS-2/UTF-16 decoding in PHP
-
take the address of i, cast it to a char pointer and take the value, this will be the same as i cast directly to a char. What does that have to do with endianness?
-
...... what the frack does that do?
It's casting int to char - an old man's modulo 256.
take the address of i, cast it to a char pointer and take the value, this will be the same as i cast directly to a char. What does that have to do with endianness?
In big endian, casting int pointer to char pointer would make it point at the most significant byte instead of least significant byte, resulting in byteshift instead of modulo.
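A rough way to see the difference, sketched in Python since it has no pointer casts: `int.to_bytes` stands in for the in-memory layout of `i`, a 32-bit int is assumed, and `first_byte_at_address` is a made-up helper name, not anything from the thread.

```python
def first_byte_at_address(i: int, byteorder: str) -> int:
    """Return the byte a C `*(unsigned char *)&i` would read for a 32-bit i."""
    layout = (i & 0xFFFFFFFF).to_bytes(4, byteorder)  # memory image of i
    return layout[0]  # the byte stored at the lowest address

i = 0x12345678
low = first_byte_at_address(i, "little")  # least significant byte, same as i % 256
high = first_byte_at_address(i, "big")    # most significant byte, same as i >> 24
print(hex(low), hex(high))
```

So the same cast gives a modulo on a little-endian machine and a shift on a big-endian one.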
-
@accalia, can I borrow a do not want jpg?
-
i think it's warranted.
-
@accalia, can I borrow a do not want jpg?
No because it's been moved.
Filed under: when Rust finally takes over the world, people will make more jokes about borrowing and less about nasal demons.
-
In big endian, casting int pointer to char pointer would make it point at the most significant byte instead of least significant byte, resulting in byteshift instead of modulo.
.... and why is that something we want?
-
.... and why is that something we want?
We don't. But in the '70s it made sense, and backwards compatibility is important... for someone... I think...
-
We don't. But in the '70s it made sense, and backwards compatibility is important... for someone... I think...
not seeing that. is it related to smoking hash or something?
i get the backwards compatibility argument. i don't like it but i get it.
-
we math by hand in bigendian
Multi-digit addition, subtraction and multiplication all begin at the little end; only division begins at the big end.
Bigendian computing machinery, to my way of thinking, simply perpetuates TRWTF, which is the way the RTL Arabic numeration system got spliced as-is into LTR languages.
Little-endian architectures put the least significant word of a multi-word number - by far the most common word for arithmetic processing to begin with - at the same memory address as the number overall. This considerably simplifies the addressing machinery required for multi-word arithmetic.
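As a sketch of why that helps, here is multi-byte addition done the way a simple 8-bit machine would do it, walking upward from the number's own address. This is only an illustration in Python; `add_little_endian` is a hypothetical name and the operands are assumed to be the same length.

```python
def add_little_endian(a: bytes, b: bytes) -> bytes:
    """Add two equal-length little-endian numbers one byte at a time."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = bytearray()
    carry = 0
    for x, y in zip(a, b):           # index 0 is the least significant byte
        total = x + y + carry
        result.append(total & 0xFF)  # keep the low 8 bits
        carry = total >> 8           # carry into the next, more significant byte
    return bytes(result)             # a final carry is dropped (wraps around)

a = (0x01FF).to_bytes(2, "little")
b = (0x0001).to_bytes(2, "little")
total = int.from_bytes(add_little_endian(a, b), "little")
print(hex(total))
```

The loop starts at offset 0 and only ever increments, which is the addressing simplification in question. With big-endian storage the same loop would have to start at the last byte and count down.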
As an example, have a look at the performance difference between a couple of very simple 8-bit processor designs: the big-endian 6800, and the little-endian 6502 (which shared many of the 6800's designers). The 6502 is consistently one cycle faster on comparable 16-bit address calculations because little-endianism combined with simple advancement of the program counter often gives it the opportunity to overlap the LSB half of effective-address calculations with fetching the MSB of the address.
Little-endian byte numbering is also consistent with the way we usually number bits, where bit n has weight 2ⁿ. In a little-endian multi-byte number, byte n has weight 256ⁿ, and contains bits 8n .. 8n+7.
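The byte and bit weights can be checked with a few lines of Python. This is only a sketch; `value_from_little_endian` and `bit_of` are illustrative names.

```python
def value_from_little_endian(data: bytes) -> int:
    """Sum each byte times its weight: byte n counts for 256**n."""
    return sum(byte * 256**n for n, byte in enumerate(data))

def bit_of(data: bytes, k: int) -> int:
    """Bit k (weight 2**k) lives in byte k // 8, at position k % 8."""
    return (data[k // 8] >> (k % 8)) & 1

data = bytes([0x78, 0x56, 0x34, 0x12])  # little-endian image of 0x12345678
value = value_from_little_endian(data)
print(hex(value))
```

Byte index and bit index run in the same direction, so no reversal is needed when going from one to the other.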
The only advantage of bigendian addressing, as far as I can tell, is that it makes it slightly easier for a human being to read multi-byte values out of a conventionally formatted raw hex dump. Personally I would much rather see hex dump tools that support RTL dumps for little-endian formats than deal with hardware that contorts the addressing machinery to support the useless fluff of big-endianism.
Filed under: Lilliput forever, down with Blefuscu
-
TRWTF, which is the way the RTL Arabic numeration system got spliced as-is into LTR languages.
That.
The Unicode RTL rules crown it by making Arabic digits LTR to compensate for the fact that software is usually written primarily for Latin input and thus formats numbers most significant digit first.
-
MS did exactly what MS usually does: a non-standard extension that follows MS' rules and to hell with everyone else.
Yeah, it does seem a bit abusive. Sounds like a front-page TDWTF-worthy article, almost. :)
-
look at the performance difference between a couple of very simple 8-bit processor designs: the big-endian 6800, and the little-endian 6502
This is enough to explain everything. If you have a chance, read the story of the design of the 6502. Back then, chips were laid out by hand and getting as many features as possible into a given footprint made for a better product at that price. This decision bubbled up into library design and OS design and is now ingrained there for backwards compatibility purposes.
BTW, many file formats (notably all the MS Office formats before ODF/OOXML) were traditionally on-disk dumps of memory structures. That's how endianness made the leap from the CPU to file formats.
-
Except it would not be a complaint about going UTF-8 with one thing but about not going UTF-8 with everything else. So, not much would change.
Yeah, that's what I meant: Any way you slice it, people would scream about MS Doing It Wrong™.
-
MS jumped on the Unicode bandwagon a bit too eagerly and had too much programmer time on their hands, so they just went and duplicated all their APIs in UCS-2 (surrogates did not exist yet). And then the other software vendors who didn't have the manpower came looking for a solution that would not require them to modify everything and picked UTF-8.
And then the Unicode committee realized 2¹⁶ codepoints is not enough for everybody and were forced to create the abomination that is UTF-16, because Windows would not be able to switch to 32-bit wchar_t as it was already baked into everything with 16 bits.
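For anyone wondering what the surrogate mechanism amounts to in practice, here is a minimal sketch in Python of turning 16-bit code units into code points. `decode_utf16_units` is a made-up name, and unpaired surrogates are simply passed through here, which a strict decoder would reject.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # 10 bits from each half, offset past the 16-bit range
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)  # plain UCS-2 unit (or unpaired surrogate)
            i += 1
    return code_points

points = decode_utf16_units([0x0041, 0xD83D, 0xDE00])
print([hex(p) for p in points])
```

Anything that treats the data as UCS-2 sees two separate 16-bit "characters" where UTF-16 means one code point above U+FFFF.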
-
Yeah, that's what I meant: Any way you slice it, people would scream about MS Doing It Wrong™.
And the reason would be the same - lack of UTF-8 support.
-
If you have a chance, read the story of the design of the 6502
Yeah, that's all familiar territory.
Also worth noting that the 6502's contemporary contender for world's most popular 8 bit processor - the Zilog Z-80 - featured an extension of Intel's 8080 architecture, which was itself heavily influenced by Intel's earlier 8008, and that all three of these little-endian designs came after Intel's earlier and big-endian 4004. Little-endian addressing was an optimization that good designers put into their 8 bit chips because it made sense.
-
Wonder if you can get a version of Internet Explorer 5 that runs on it
-
Yeah, that's all familiar territory.
Hilariously, one of the links on that page is to a blog post, with a comment that links back to the old forums here.
-
I am surprised. As far as I can tell, and I've been looking through the meaning of various platform defines in Visual Studio documentation, no Microsoft software (or rather no software using anything based on Win API) was ever natively big endian. So I would expect them to assume little endian.
Well, aside from the small detail that IIRC Xbox 360 is BE and some of the non Intel architecture for WinNT was also BE, you're largely correct.
However... when you have a file that itself indicates whether BE or LE and Windows will be inserting the data into it, it generally (except for some odd versions, usually older stuff) actually follows what the file itself indicates. So a file that originated on a BE architecture having data inserted into it may or may not also be in BE depending on Windows.
At least this was the result of my not-great data sample.
-
and some of the non Intel architecture for WinNT was also BE, you're largely correct
I thought that might be true, so I looked it up. MIPS is bi-endian and NT only ran with the processor in LE mode. Alpha is LE. NT only ran on PPC variants that were bi-endian, not the BE only ones.
-
However... when you have a file that itself indicates whether BE or LE and Windows will be inserting the data into it, it generally (except for some odd versions, usually older stuff) actually follows what the file itself indicates.
But this is a Microsoft extension and the strings don't have marks…
-
Well, aside from the small detail that IIRC Xbox 360 is BE and some of the non Intel architecture for WinNT was also BE, you're largely correct.
The CPU in this Xbox supports either mode. I'm not sure which the OS uses, but I'm guessing it's little-endian.
As for this thread in general, the spec says vendors can add their own EXIF tags, and so Microsoft did. It seems stupid to bitch at Microsoft-- the real problem here seems to be whoever determined the Exif format never thought about text encodings.
It's the same as people bitching at Microsoft for implementing ActiveX. Guess what? The HTML/DOM spec says it could work with any language, so Microsoft implemented a second one. Bitch at the writers of the shitty specs, not the company that follows them.
-
Yes, yes, there is that. I was more venting at the obvious (easy) target for making me jump through more hoops than necessary.
But this is a Microsoft extension and the strings don't have marks…
Yes, I know. Elsewhere in the file, the endianness is specified, either as II or MM character literals.
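A minimal sketch of using that marker, in Python. It assumes the string really does follow the file's byte order, which (per the posts above) is not guaranteed for these tags, and `codec_for_byte_order` and `decode_tag_string` are hypothetical helper names.

```python
def codec_for_byte_order(marker: bytes) -> str:
    """Map the TIFF/EXIF byte-order marker to a Python UTF-16 codec name."""
    if marker == b"II":      # "Intel": little-endian
        return "utf-16-le"
    if marker == b"MM":      # "Motorola": big-endian
        return "utf-16-be"
    raise ValueError(f"unknown byte-order marker: {marker!r}")

def decode_tag_string(marker: bytes, raw: bytes) -> str:
    """Decode a 16-bit string, assuming it follows the file's byte order."""
    text = raw.decode(codec_for_byte_order(marker))
    return text.rstrip("\x00")  # drop trailing NUL terminator(s)

raw_le = "Hi".encode("utf-16-le") + b"\x00\x00"
print(decode_tag_string(b"II", raw_le))
```

If the writer ignored the marker, the wrong codec will still "succeed" and return garbage, so a sanity check on the result is probably worth having.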
-
The HTML/DOM spec … Bitch at the writers of the shitty specs, not the company that follows them.
And they just repeated the same idiocy over again by not specifying supported formats for <video> and <audio>. Not even required ones.
-
Oh yes. I was especially annoyed that there was nothing indicated in the spec when it came to, say, providing any hint to the browser for file size.
Meaning that every browser which supports these tags does so with 'let's start at 0 and ask for the entire file using a method by which we expressly indicate we will be able to ask for arbitrary chunks of a file and might even do that shortly'. The entire chain of request/response for audio/video tags is actually irritating as heck if you ever want to, say, handle such things via PHP and still have auth in front of the request (while still fighting with the '30 second time limit' because of the way the browsers do idling on the connection)
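For reference, the 'start at 0' request is an open-ended HTTP range, `Range: bytes=0-`. Here is a rough sketch, in Python rather than PHP, of turning such a header into the byte span for a 206 response. `parse_range` and `content_range` are illustrative names, and only a single range is handled.

```python
def parse_range(header: str, file_size: int):
    """Return (start, end), inclusive, for a single 'bytes=start-end' range."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled here")
    start_text, _, end_text = spec.strip().partition("-")
    if start_text == "":                 # suffix form: the last N bytes
        start = max(file_size - int(end_text), 0)
        end = file_size - 1
    else:
        start = int(start_text)
        # an empty end means "to the end of the file"
        end = int(end_text) if end_text else file_size - 1
        end = min(end, file_size - 1)
    if start > end:
        raise ValueError("unsatisfiable range")
    return start, end

def content_range(start: int, end: int, file_size: int) -> str:
    """Build the Content-Range value that goes with a 206 response."""
    return f"bytes {start}-{end}/{file_size}"

start, end = parse_range("bytes=0-", 1000)
print(content_range(start, end, 1000))
```

The auth check would sit in front of this, and the script then streams only bytes start..end instead of the whole file.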
-
Yup. The W3C is fucking awful at their job.
I knew the CDDB used a shitty nonsense format, but this is the first time I've heard someone talk about the shitty nonsense in EXIF. It doesn't surprise me though.
-
So, TRWTF is anything that has a four-letter acronym?
-
TRWTF is a five-letter acronym.
-
WTF is a TLA
-
What's on second base.
-
It's the same as people bitching at Microsoft for implementing ActiveX.
ActiveX provided a potentially large amount of functionality to the browser, too, at a time when they were pretty useless in modern terms.
-
That would be an FETLA wouldn't it (a five-letter acronym as well)?
-
ActiveX provided a potentially large amount of functionality to the browser, too, at a time when they were pretty useless in modern terms.
No it didn't - it created a hole in the browser where you could run native code. It turned a browser into a Windows code delivery platform, which was exactly what Microsoft wanted it to be.
That's like saying that you empowered Gandhi to effect change by putting him in a tank with a crew and then claiming his victory represented the power of peace.
-
That would be an FETLA wouldn't it (a five-letter acronym as well)?
Properly an XLA I think. ETLA is reserved for 4 letters…
-
It turned a browser into a Windows code delivery platform, which was exactly what Microsoft wanted it to be.
Sure but why not? The HTML spec allowed it, even encouraged it.
-
But, we've already established that the W3C is run by morons.
-
That would be an FETLA wouldn't it (a five-letter acronym as well)?
No, that's the punchline.
Back in the day when assembly still ruled, a couple of engineers were talking about the latest language extensions in the next generation of processor.
Bob says "Yeah, we're running out of TLAs for opcodes."
Jim says "So what are we going to do?"
Bob replies "management is considering going to FLAs."
Jim asks "what are those, Four-Letter Acronyms?"
Bob answers "No, they're Field-Extended TLAs"
The joke involves things like 8->16 bit field, operand, and so on, widenings.
-
No it didn't - it created a hole in the browser where you could run native code. It turned a browser into a Windows code delivery platform, which was exactly what Microsoft wanted it to be.
What you said is true but is not a functional denial of what I said. It was never very widespread but I knew people who developed apps that used moderately complex semi-custom controls to deliver functionality browsers just could not do in the late 90s.
-
deliver functionality browsers just could not do in the late 90s.
But, the browser didn't deliver it any more than a page with a link to an exe file can claim responsibility for delivering that application's UI.
-
But, the browser didn't deliver it any more than a page with a link to an exe file can claim responsibility for delivering that application's UI.
The users thought it did, though.
-
The users thought it did, though.
That illusion was named "Embrace, Extend, Extinguish".
-
The users thought it did, though.
And the logical technology required for it supported delivery of UI that way.
-
And the logical technology required for it supported delivery of UI that way.
Actually in the specific case I'm thinking of--this is 1998 or a bit earlier--the users were using unreliable dialup at remote sites; the vendor shipped updates via floppy, which installed the package of controls, as this was deemed a better method than code download in their circumstances.
-
That's like saying that you empowered Gandhi to effect change by putting him in a tank with a crew and then claiming his victory represented the power of peace.
Isn't it exactly what the USA is doing in the Middle East, liberating all those peaceful states?
Jim asks "what are those, Four-Letter Acronyms?"
Bob answers "No, they're Field-Extended TLAs"
We need ATF-3 - Acronym Translation Format. If the acronym is LOL, it should be interpreted as WTF plus the next five letters, and if those five letters end in OMG, another seven letters are read and STFU is appended.