UCS-2/UTF-16 decoding in PHP



  • So as some of you may remember, I built a media gallery solution in PHP. Yes, yes, I know, spare me the TRWTF commentary. 😛

    So, anyway. One of the things I need to do is EXIF parsing. Fine, except for the fact there are two existing libraries in PHP code, both of which are GPL... so that's a no-go for a paid product. The third option is the exif extension that most shared hosts don't seem to support. Fine. I'll fucking well build my own.

    And I did. Three days, lots of swearing but it works. It doesn't do GPS yet, haven't decided if I want to support GPS or not, but I'll figure it out. Maybe I'll just write code to strip GPS tags out of pictures.

    Anyway, one of the folks interested in the software has a bunch of pictures that he put metadata into himself from Windows Explorer - you know, right click an image, hit properties, there's the title/subject/author/keywords/comment field.

    This is all, interestingly, in EXIF. It is, also interestingly, actually defined as standard in EXIF, as XPTitle, XPSubject, XPAuthor etc. tags in the IFD0 block.

    Here comes the fun part. EXIF defines the data type of the content it's storing. There is a discrete text type, referred to in the spec as 'ASCII', but it doesn't have to be; it can happily be UTF-8. Different manufacturers do different things, but fortunately this is mostly UTF-8 or even just 7-bit ASCII. You're expected to guess, but ASCII has proven to be a pretty safe bet.

    But not these fields. Oh no, these ones can't use ASCII or UTF-8. They are listed as streams of unsigned bytes that the client is just expected to know what to do with - and after a bit of research, turns out they're UCS-2/UTF-16 strings with null terminators.

    That is saddening but not surprising - this is, after all, introduced with Windows XP which is still UCS-2 under the hood in a lot of places.

    Fair enough, it's UCS-2. Is it little endian or big endian? There's no BOM or anything, so problem number 1 is the fact you have to kind of manually detect whether it's little endian or not. You see, EXIF data is not actually one fixed endianness. You get a two byte string early on which indicates the endianness of things. But the endianness of the data is not guaranteed to match the endianness of the EXIF tags >_< So you have to manually detect that by guessing. Fortunately it's not that hard to come up with a semi reliable detection.
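
    One way such a guess could work, as a minimal Python sketch rather than the PHP actually used here: for mostly Latin text the high byte of each 16-bit unit is zero, so counting zero bytes at even and odd offsets hints at the byte order. The function name and the tie-breaking default are assumptions for illustration.

```python
def guess_utf16_endianness(data):
    """Guess 'le' or 'be' for BOM-less UTF-16/UCS-2 bytes.

    Heuristic only: assumes mostly Latin text, where the high byte
    of each 16-bit code unit is usually zero.
    """
    # In little-endian data the zero high bytes sit at odd offsets;
    # in big-endian data they sit at even offsets.
    even_zeros = sum(1 for b in data[0::2] if b == 0)
    odd_zeros = sum(1 for b in data[1::2] if b == 0)
    # On a tie, fall back to little endian (an arbitrary default).
    return "be" if even_zeros > odd_zeros else "le"
```

    Text that is mostly non-Latin (CJK, say) has few zero bytes either way, so a real detector would want a second signal, such as the endianness declared in the EXIF header.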

    This is not the worst of it, in fact.

    I'm working in that environment which means I can expect to be using UTF-8 if I'm very lucky or ISO-8859-SOMETHING if I'm not. I can't even buttume ISO-8859-1 for sure. I'll have information available to me to know which one it is, but yeah.

    This gets better, though... PHP has the mbstring extension and access to iconv, both of which support UTF-16LE. Allegedly. Except when they don't. And there's no way to actually tell for sure without actually trying it, which is a fat lot of use to me. This of course presumes either are installed, and unsurprisingly there's a lot of crappy hosts that don't.

    Cue me, then, spending my morning with a freshly brewed cup of tea in hand... manually decoding UCS-2/UTF-16 into codepoints and then doing something useful with them. By hand with bitshifting and everything... in PHP. The most pleasant comment I had from colleagues on the matter was '... in PHP? ICKY.'

    For the record, I solved all of this by converting it back to a codepoint and then entity-encoding everything above 127. Ugly as sin, but at least it's safe in every single encoding.
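
    The entity-encoding step could look roughly like this in Python (a sketch, not the original PHP; the function name and the choice to stop at the null terminator are assumptions): anything below 128 passes through as-is, everything else becomes a numeric character reference.

```python
def entity_encode(codepoints):
    """Turn code points into ASCII-only text with numeric entities."""
    out = []
    for cp in codepoints:
        if cp == 0:
            # Treat a zero code point as the string's null terminator.
            break
        if cp < 128:
            out.append(chr(cp))
        else:
            # Decimal numeric character reference, e.g. 233 -> &#233;
            out.append("&#%d;" % cp)
    return "".join(out)
```

    For example, `entity_encode([0x43, 0x61, 0x66, 0xE9, 0])` gives `Caf&#233;`. When the text is already a Python string, the built-in `text.encode("ascii", "xmlcharrefreplace")` does the same substitution.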

    Actually maybe PHP is TRWTF because PHP can't man up enough to bundle one of these libraries into the core and never let anyone disable it.

    I still also think MS could have used UTF-8 for this instead of UCS-2 😛



  • @Arantor said:

    there are two existing libraries in PHP code, both of which are GPL..

    Do they work? Have you considered taking a peek at how this particular problem was solved there?



  • Nope, because then anything he writes will become contaminated with GPL as a "derivative work" and thus, if his company's lawyers' shoulder aliens are to be believed, "unsellable! to anyone! forever!"


  • Banned

    You hard-coded a (correct) buttumption that the string is UCS-2, but didn't hard-code a (just as much correct) buttumption it's also LE? Weirdo.



  • First up, I meant PHP bundling the mbstring or iconv libraries as core, because right now you actually have to consciously enable them at compile time. I believe this stuff should be core language functionality.

    As for the other EXIF libraries, neither of them recognises the XP* tags, so I had to do it myself.

    I hardcoded the assumption it was UCS-2 but I saw both LE and BE cases, because the entire file may be either. Canon digital cameras seem to be LE, Apple devices seem to be BE - and Windows may or may not adhere to this depending on version.

    As for shoulder aliens, actually... I'm doing this solo. I don't get a lot of sales, every sale counts and I don't have a userbase big enough to make 'you can distribute it and pay for support' work... and while I could charge for initial distribution, anyone that receives it is bound by the GPL too - and they are allowed to freely redistribute it.


  • ♿ (Parody)

    @Arantor said:

    That is saddening but not surprising - this is, after all, introduced with Windows XP which is still UCS-2 under the hood in a lot of places.

    Isn't Windows still that way?

    @Arantor said:

    I still also think MS could have used UTF-8 for this instead of UCS-2

    Have they done that anywhere?



  • Yes, Windows is still that way in a lot of places.

    I could understand Windows using UTF-16 for internal stuff like the filesystem. But we're talking about metadata being inserted into an already existing semi-standard system in a format unlike anything else. Nothing else in EXIF is UTF-16, the one real exception to 'everything is numeric or ASCII' is the MakerNote tag which is vendor specific extension data that can usually be ignored.

    Why couldn't Microsoft have used UTF-8 for inserting this metadata? Or even just straight ASCII? Or is interoperability not a thing in Microsoft's world? UTF-8 was a thing in 2001 when this was introduced in WinXP.


  • ♿ (Parody)

    @Arantor said:

    Or is interoperability not a thing in Microsoft's world?



  • @thegoryone said:

    waste time fucking about with Sharepoint

    Opening Sharepoint, then waiting for it to get its shit together and finish loading?



  • @Arantor said:

    Is it little endian or big endian? There's no BOM or anything

    That's got to be in violation of either the EXIF or UCS-2 spec, surely?
    Filed under: reality in violation of spec.



  • It's valid in terms of the EXIF spec; EXIF only defines a text field as being ASCII, Windows writes the file using the 'stream of unsigned bytes' option instead, so it's legit for EXIF.

    Here's where it gets hilarious... UCS-2 is specced out as being BE only, but most of the time it's actually LE anyway. Meanwhile the spec for UTF-16 says if the BOM is missing, assume BE... the typical case however is LE without BOM, so yes, it violates spec whether you look at UCS-2 or UTF-16.



  • By "like", I mean "that sucks".



  • It sucks but it's not impossibru to deal with.




  • Java Dev

    htons and ntohs are C functions that also exist in POSIX, but don't seem to be in PHP.

    PHP does have pack and unpack though, which could be used to convert between a UCS-2 string and an array of codepoints.
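
    For comparison, the same idea in Python with the `struct` module, as a sketch (the helper name is made up). In PHP the analogous `unpack` format codes would be `v*` for little-endian and `n*` for big-endian 16-bit values.

```python
import struct

def ucs2_code_units(data, little_endian=True):
    """Unpack raw bytes into a list of 16-bit code units."""
    count = len(data) // 2  # ignore a trailing odd byte, if any
    # "<" is little endian, ">" is big endian, "H" is an unsigned 16-bit value.
    fmt = ("<" if little_endian else ">") + "%dH" % count
    return list(struct.unpack(fmt, data[:count * 2]))
```

    For pure UCS-2 data each code unit is already a codepoint, so `ucs2_code_units(b"H\x00i\x00")` returns `[72, 105]`.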



  • Actually you couldn't.

    If it's a two byte character, sure, you could unpack it just fine. If it's a two byte character, the value will be outside the surrogate range 0xD800-0xDFFF. Except UTF-16 can be 4 byte as well as 2 byte, which is when the first value is 0xD800-0xDBFF and the next one is 0xDC00-0xDFFF. Which means you need to go through the unpacking, which involves splitting the last 10 bits off both pairs of bytes, shifting the first lot up by 10, combining them and adding 0x10000 to get your codepoint out of it.

    It's about as interesting as unrolling UTF-8 by hand.
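
    The surrogate-pair arithmetic described above, sketched in Python (not the actual 30-line PHP routine; the function name and the choice to pass lone surrogates through unchanged are assumptions):

```python
def decode_utf16_units(units):
    """Combine 16-bit UTF-16 code units into Unicode code points."""
    codepoints = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Take the low 10 bits of each half, shift the first half
            # up by 10, then add 0x10000.
            high_bits = unit - 0xD800
            low_bits = units[i + 1] - 0xDC00
            codepoints.append(0x10000 + (high_bits << 10) + low_bits)
            i += 2
        else:
            # Plain two-byte character (or a lone surrogate, kept as-is).
            codepoints.append(unit)
            i += 1
    return codepoints
```

    As a check, the pair `0xD83D, 0xDE00` comes out as `0x1F600`: `0x3D << 10` is `0xF400`, plus `0x200`, plus `0x10000`.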


  • Java Dev

    Except I said UCS2, not UTF16. I'm aware of surrogate pairs. I'm also aware UCS2 doesn't have them.



  • UCS-2 doesn't, true. But if I'm already doing this, I might as well do the UTF-16 surrogate pairs, it's not like it's a lot of work - my total routine for doing this, from start to finish including endianness, surrogate pair handling etc. amounts to 30 lines of code. It's all built and tested, I was just venting 😛

    I'm also not entirely enthusiastic about pack/unpack, just because most of the time it seems to confuse people.



  • EXIF borrowed the tag specification from TIFF. TIFF tag spec is bad, but control of the spec went to Adobe when they bought Aldus in 1994. Adobe decided to sit on the spec and not fix it to increase the value of the PDF file format.

    So, it's all Adobe's fault.



  • @Arantor said:

    Actually maybe PHP is TRWTF because PHP can't man up enough to bundle one of these libraries into the core and never let anyone disable it.

    This must be the one thing PHP doesn't bundle into its sprawling core.


  • Discourse touched me in a no-no place

    @Arantor said:

    Why couldn't Microsoft have used UTF-8 for inserting this metadata?

    Without reading the rest of the thread, I can tell you why: Because everything else is UTF-16. If they'd made this UTF-8, we'd be having the opposite conversation right now: "why is Windows all 16-bit characters except for file properties ⁉ Stupid Microsoft!"

    I'm sure you know this.



  • To a point, yes. And I totally get the argument for having consistency on that level.

    On the other hand... interoperability is also a thing - this isn't MS writing into an MS-proprietary format. I don't see where MS gets to dictate this one.

    My contention is that this is a file format that defines text as ASCII - with a semi-nod to 'UTF-8 is OK' but it's a sort of unspoken agreement. MS did exactly what MS usually does: a non-standard extension that follows MS' rules and to hell with everyone else.


  • Banned

    @FrostCat said:

    If they'd made this UTF-8, we'd be having the opposite conversation right now: "why is Windows all 16-bit characters except for file properties Stupid Microsoft!"

    Except it would not be a complaint about going UTF-8 with one thing but about not going UTF-8 with everything else. So, not much would change.

    @Arantor said:

    this isn't MS writing into an MS-proprietary format

    Guess why it's named XPTitle.



  • It's still not MS writing into an MS-proprietary format. It's just MS writing something totally outside someone else's standard and attaching a vendor prefix to it, which is thankfully the only way you'd actually know WTF you were dealing with.


  • Banned

    It's more like MS proprietary format that the standard author was persuaded to include in his standard "or else".



  • It's not like the format is so constrained it couldn't have been added any other way or anything.

    And, if I'm honest, it's still not the most retarded thing in EXIF.


  • Banned

    @Arantor said:

    It's not like the format is so constrained it couldn't have been added any other way or anything.

    But MS is asshole to everyone and had to make it the most invasive way.

    @Arantor said:

    And, if I'm honest, it's still not the most retarded thing in EXIF.

    Iunno, never had to deal with it. But I agree with whatever you just said.



  • Well, yes, that's kind of my take on it. While consistency is appreciated (and MS being mostly-UCS-2-but-a-bit-UTF-8 would violate that), having something more consistent with what everyone else does would be more appreciated, I think.

    The most retarded thing in EXIF is the MakerNote tag. A dense binary blob of vendor specific data that is not documented, nor actually all that useful.

    Most of what's in EXIF in terms of camera settings isn't actually that useful to begin with - most people don't care about most of it, and even photographers I've spoken to don't usually care beyond the general details (focal length, shutter speed, that kind of stuff)... I don't know anyone who actually cared about the ridiculous levels of detail crammed into the MakerNote tags - beyond the manufacturer. I can see it being useful for debugging the camera itself, but that's about it.


  • Banned

    Actually, "scratchpad for whatever you want" is nice to have in any metadata format. What's not nice is one particular vendor's scratchpad officially recognized as standard.

    Notice I said metadata format, not data format - because vendor-specific information that changes data is just dick move.



  • Oh, that's just it... we have a chunk of data in the file, tagged 'MakerNote'. Any vendor can shove any blob they like in it.

    Bonus WTF: some of the MakerNote tags are partially or fully encrypted and some manufacturers use the location of the tag in the file as part of the key to decrypt it meaning if you edit the other tags for any reason you can render the MakerNote broken.


  • Banned

    @Arantor said:

    Oh, that's just it... we have a chunk of data in the file, tagged 'MakerNote'. Any vendor can shove any blob they like in it.

    But the whole EXIF thing is just metadata. It doesn't change how the image looks. If MakerNote stored e.g. information for the decompressor to produce a better quality image, that would be a whole different discussion.

    @Arantor said:

    Bonus WTF: some of the MakerNote tags are partially or fully encrypted and some manufacturers use the location of the tag in the file as part of the key to decrypt it meaning if you edit the other tags for any reason you can render the MakerNote broken.

    Is whole EXIF info unusable then, or only the Maker blob?


  • Discourse touched me in a no-no place

    @Gaska said:

    Notice I said metadata format, not data format - because vendor-specific information that changes data is just dick move.

    When vendors do that, they usually call it a RAW file…

    @Arantor said:

    some manufacturers use the location of the tag in the file as part of the key to decrypt it

    WAT? That's evil.



  • No, it's only the Maker blob that's unusable in that scenario.

    MakerNote contains all kinds of data. Some vendors use it for more-specific-than-EXIF e.g. more data about the white balance bias than EXIF normally provides for (and EXIF is pretty detailed about the minutiae) though I also wouldn't be surprised if at least some of it was legacy before it was codified properly in EXIF.

    Nothing is currently known to be used that would enhance the data itself, it's still purely metadata.


  • Banned

    @dkf said:

    When vendors do that, they usually call it a RAW file…

    And they're not standardized for a reason.

    @Arantor said:

    No, it's only the Maker blob that's unusable in that scenario

    @Arantor said:
    Nothing is currently known to be used that would enhance the data itself, it's still purely metadata.

    So what's your problem then?



  • Because it seems retarded to me to have this vendor blob in a standardised file format.


  • Banned

    That blob would go somewhere anyway, so why not standardize where exactly it goes?



  • How about not making it an impenetrable blob in the first place?


  • Banned

    @Arantor said:

    How about not making it an impenetrable blob in the first place?

    It would be like lack of #pragma in C. Helps with portability, but God help you with misaligned structures.



  • Well, here's the thing. TIFF tags were already a thing long before MakerNote was, and they sort of imported the format wholesale.

    I still see no reason for current devices to be pushing MakerNote blobs, especially considering what has been decoded out of them.

    I get the deal with RAW though, and am still not a fan of this either.


  • Banned

    @Arantor said:

    I still see no reason for current devices to be pushing MakerNote blobs, especially considering what has been decoded out of them.

    Invigilation of course.



  • Well... for a lot of it, it's not really any different to security by obscurity.

    I dunno. I just see this stuff as unnecessary.


  • Banned

    I'm not talking about security - I'm talking about invigilation. In other words, anti-security. For example, encrypting MakerNote with position in file is easy test for checking if EXIF info was tampered with.



  • Oh I see. That actually makes a decent amount of sense.



  • @Arantor said:

    I don't see where MS gets to dictate this one.

    MS will always dictate anything it thinks it can get away with dictating. Nature of the beast.



  • @Arantor said:

    I can see it being useful for debugging the camera itself, but that's about it.

    That's probably the primary purpose it's used for.

    Modern digital cameras do a fair bit of work on the raw sensor data before storing it (even before storing RAW), so a built-in log of everything the camera did when it took the photo is probably very useful.

    Camera firmware updates are a thing, so they may as well leave it in the EXIF data for future debugging of customer's photos.


  • BINNED

    @flabdablet said:

    MS Any company will always dictate anything it thinks it can get away with dictating.

    FTFY



  • @Arantor said:

    I hardcoded the assumption it was UCS-2 but I saw both LE and BE cases, because the entire file may be either. Canon digital cameras seem to be LE, Apple devices seem to be BE - and Windows may or may not adhere to this depending on version.

    I am surprised. As far as I can tell, and I've been looking through the meaning of various platform defines in Visual Studio documentation, no Microsoft software (or rather no software using anything based on Win API) was ever natively big endian. So I would expect them to buttume little endian.


  • FoxDev

    @Bulb said:

    So I would expect them to buttume little endian.

    TRWTF is littlendian.

    like.... why?

    we math by hand in bigendian, we write in big endian, we read in big endian.....why would we make computers littlendian?


  • Banned

    @accalia said:

    we math by hand in bigendian, we write in big endian, we read in big endian.....why would we make computers littlendian?

    Because for any given int i the expression *(char*)&i == (char)i is true.
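
    For anyone puzzling over that expression: `(char)i` keeps the lowest byte of the value, while `*(char*)&i` reads whichever byte is stored first in memory. On a little-endian machine those are the same byte. A small Python sketch of the same point (the helper name is hypothetical, and `to_bytes` stands in for the in-memory layout):

```python
def first_byte_in_memory(value, byteorder, width=4):
    """Return the first stored byte of an integer in the given byte order."""
    return value.to_bytes(width, byteorder)[0]

value = 0x12345678
low_byte = value & 0xFF  # what a cast to a narrower type keeps

# Little endian: the first stored byte is the low byte.
assert first_byte_in_memory(value, "little") == low_byte
# Big endian: the first stored byte is the high byte instead.
assert first_byte_in_memory(value, "big") == 0x12
```

    So on little-endian hardware a narrower read at the same address sees the truncated value with no address adjustment.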


  • FoxDev

    ...... what the frack does that do?

    seriously, i look at that and see "someone needs to stop being so flapjacking clever with their pointers"

