UCS-2/UTF-16 decoding in PHP


  • SockDev

    So as some of you may remember, I built a media gallery solution in PHP. Yes, yes, I know, spare me the TRWTF commentary. 😛

    So, anyway. One of the things I need to do is EXIF parsing. Fine, except for the fact there are two existing libraries in PHP code, both of which are GPL... so that's a no-go for a paid product. The third option is the exif extension that most shared hosts don't seem to support. Fine. I'll fucking well build my own.

    And I did. Three days, lots of swearing but it works. It doesn't do GPS yet, haven't decided if I want to support GPS or not, but I'll figure it out. Maybe I'll just write code to strip GPS tags out of pictures.

    Anyway, one of the folks interested in the software has a bunch of pictures that he put metadata into himself from Windows Explorer - you know, right click an image, hit properties, there's the title/subject/author/keywords/comment field.

    This is all, interestingly, in EXIF. It is, also interestingly, actually defined as standard in EXIF, as XPTitle, XPSubject, XPAuthor etc. tags in the IFD0 block.

    Here comes the fun part. EXIF defines the data type of the content it's storing. There is a discrete text type, referred to in the spec as 'ASCII', but it doesn't have to be; it can happily be UTF-8. Different manufacturers do different things, but fortunately this is mostly UTF-8 or even just 7-bit ASCII. You're expected to guess, but ASCII has proven to be a pretty safe bet.

    But not these fields. Oh no, these ones can't use ASCII or UTF-8. They are listed as streams of unsigned bytes that the client is just expected to know what to do with - and after a bit of research, turns out they're UCS-2/UTF-16 strings with null terminators.

    That is saddening but not surprising - this is, after all, introduced with Windows XP which is still UCS-2 under the hood in a lot of places.

    Fair enough, it's UCS-2. Is it little endian or big endian? There's no BOM or anything, so problem number 1 is the fact you have to kind of manually detect whether it's little endian or not. You see, EXIF data is not actually one fixed endianness. You get a two byte string early on which indicates the endianness of things. But the endianness of the data is not guaranteed to match the endianness of the EXIF tags >_< So you have to manually detect that by guessing. Fortunately it's not that hard to come up with a semi reliable detection.
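
    A minimal sketch of one possible heuristic, not necessarily the detection described above: for mostly Latin text the high byte of every 16-bit unit is 0x00, so counting zero bytes at even versus odd offsets hints at the byte order. The function name guessUtf16Endianness and the tie-break to LE are assumptions for illustration.

    ```php
    <?php
    // Hypothetical helper: guess the byte order of BOM-less UTF-16/UCS-2 data.
    // Assumes mostly Latin-range text, where the high byte of each unit is 0x00.
    function guessUtf16Endianness(string $bytes): string
    {
        $zeroEven = 0; // zero bytes at offsets 0, 2, 4, ... (the high byte if BE)
        $zeroOdd = 0;  // zero bytes at offsets 1, 3, 5, ... (the high byte if LE)
        $len = strlen($bytes) - (strlen($bytes) % 2); // ignore a trailing odd byte
        for ($i = 0; $i < $len; $i += 2) {
            // A 00 00 unit (the terminator) says nothing about byte order.
            if ($bytes[$i] === "\0" && $bytes[$i + 1] === "\0") {
                continue;
            }
            if ($bytes[$i] === "\0") {
                $zeroEven++;
            }
            if ($bytes[$i + 1] === "\0") {
                $zeroOdd++;
            }
        }
        // Ties (including empty input) fall back to LE; that default is an assumption.
        return $zeroEven > $zeroOdd ? 'BE' : 'LE';
    }
    ```

    Text made up entirely of non-Latin characters gives this heuristic little to go on, so it is only "semi reliable" at best.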

    This is not the worst of it, in fact.

    I'm working in that environment which means I can expect to be using UTF-8 if I'm very lucky or ISO-8859-SOMETHING if I'm not. I can't even buttume ISO-8859-1 for sure. I'll have information available to me to know which one it is, but yeah.

    This gets better, though... PHP has the mbstring extension and access to iconv, both of which support UTF-16LE. Allegedly. Except when they don't. And there's no way to actually tell for sure without actually trying it, which is a fat lot of use to me. This of course presumes either is installed, and unsurprisingly there are a lot of crappy hosts that have neither.
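
    The "just try it" approach could look roughly like the sketch below. It only uses function_exists(), mb_list_encodings(), mb_convert_encoding() and iconv(); the wrapper name tryConvertUtf16le and the null-on-failure convention are assumptions, not anything from the gallery code.

    ```php
    <?php
    // Hypothetical helper: try mbstring, then iconv, and return null when neither
    // works so the caller can fall back to a hand-rolled decoder.
    function tryConvertUtf16le(string $bytes, string $target = 'UTF-8'): ?string
    {
        try {
            if (function_exists('mb_convert_encoding')
                && in_array('UTF-16LE', mb_list_encodings(), true)) {
                $out = @mb_convert_encoding($bytes, $target, 'UTF-16LE');
                if (is_string($out)) {
                    return $out;
                }
            }
            if (function_exists('iconv')) {
                $out = @iconv('UTF-16LE', $target, $bytes);
                if (is_string($out)) {
                    return $out;
                }
            }
        } catch (\Throwable $e) {
            // Any error (e.g. an unknown target encoding) counts as "not usable".
        }
        return null;
    }
    ```

    A null result still only tells you after the fact, which is the complaint: there is no reliable way to know up front.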

    Cue me, then, spending my morning with a freshly brewed cup of tea in hand... manually decoding UCS-2/UTF-16 into codepoints and then doing something useful with them. By hand with bitshifting and everything... in PHP. The most pleasant comment I had from colleagues on the matter was '... in PHP? ICKY.'

    For the record, I solved all of this by converting it back to a codepoint and then entity-encoding everything above 127. Ugly as sin, but at least it's safe in every single encoding.
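
    As a rough sketch of that last step (the function name codepointsToEntities is made up for illustration): ASCII code points are emitted as-is and everything above 127 becomes a decimal numeric character reference.

    ```php
    <?php
    // Hypothetical helper: turn a list of code points into ASCII-only text by
    // writing everything above 127 as a numeric character reference.
    function codepointsToEntities(array $codepoints): string
    {
        $out = '';
        foreach ($codepoints as $cp) {
            if ($cp < 128) {
                $out .= chr($cp);         // plain 7-bit ASCII, kept as-is
            } else {
                $out .= '&#' . $cp . ';'; // e.g. U+00E9 becomes &#233;
            }
        }
        return $out;
    }
    ```

    This sketch does not escape characters such as < or &, so that would still have to happen wherever the string is output as HTML.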

    Actually maybe PHP is TRWTF because PHP can't man up enough to bundle one of these libraries into the core and never let anyone disable it.

    I still also think MS could have used UTF-8 for this instead of UCS-2 😛



  • I still also think MS could have used UTF-8 for this instead of UCS-2

    If only...

    I've had to deal with this before, and was limited by GPL too. So I just told them it couldn't be done and forced them to manage metadata through the image upload functionality, because honestly? I'm a complete bastard. However it meant 2 weeks less work for me.

    Actually maybe PHP is TRWTF because PHP can't man up enough to bundle one of these libraries into the core and never let anyone disable it.

    Wouldn't that completely negate the point of it being an open source (ish) system?

    The most pleasant comment I had from colleagues on the matter was '... in PHP? ICKY.'

    Yeah, mine don't even act this nice. They just say "If you could actually program, you'd be a C# developer" and walk away to waste time fucking about with Sharepoint.



  • @Arantor said:

    there are two existing libraries in PHP code, both of which are GPL..

    Do they work? Have you considered taking a peek at how this particular problem was solved there?



  • Nope, because then anything he writes will become contaminated with GPL as a "derivative work" and thus, if his company's lawyers' shoulder aliens are to be believed, "unsellable! to anyone! forever!"



  • You hard-coded a (correct) buttumption that the string is UCS-2, but didn't hard-code a (just as much correct) buttumption it's also LE? Weirdo.


  • SockDev

    First up, I meant PHP bundling the mbstring or iconv libraries as core, because right now you actually have to consciously enable them at compile time. I believe this stuff should be core language functionality.

    As for the other EXIF libraries, neither of them recognises the XP* tags, so I had to do it myself.

    I hardcoded the assumption it was UCS-2 but I saw both LE and BE cases, because the entire file may be either. Canon digital cameras seem to be LE, Apple devices seem to be BE - and Windows may or may not adhere to this depending on version.
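
    For reference, the file-level byte order comes from the TIFF header that EXIF data is wrapped in: "II" means little endian, "MM" means big endian, followed by the 16-bit value 42 in that byte order. A minimal sketch of reading it (tiffByteOrder is a hypothetical name, and it assumes you have already located the start of the TIFF header):

    ```php
    <?php
    // Hypothetical helper: read the byte-order marker at the start of a TIFF
    // header. Returns 'LE', 'BE', or null if the header does not look valid.
    function tiffByteOrder(string $tiffHeader): ?string
    {
        if (strlen($tiffHeader) < 4) {
            return null;
        }
        $mark = substr($tiffHeader, 0, 2);
        if ($mark === 'II') {
            $order = 'LE';
            $format = 'v'; // unsigned 16-bit, little endian
        } elseif ($mark === 'MM') {
            $order = 'BE';
            $format = 'n'; // unsigned 16-bit, big endian
        } else {
            return null;
        }
        // The next 16-bit value must be 42 when read in the declared byte order.
        $magic = unpack($format, substr($tiffHeader, 2, 2));
        return $magic[1] === 42 ? $order : null;
    }
    ```

    As noted earlier in the thread, this only tells you how the tags are laid out; the XP* payload does not necessarily follow it.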

    As for shoulder aliens, actually... I'm doing this solo. I don't get a lot of sales, every sale counts and I don't have a userbase big enough to make 'you can distribute it and pay for support' work... and while I could charge for initial distribution, anyone that receives it is bound to the GPL too - and they are allowed to freely distribute it.



  • @Arantor said:

    That is saddening but not surprising - this is, after all, introduced with Windows XP which is still UCS-2 under the hood in a lot of places.

    Isn't Windows still that way?

    @Arantor said:

    I still also think MS could have used UTF-8 for this instead of UCS-2

    Have they done that anywhere?


  • SockDev

    Yes, Windows is still that way in a lot of places.

    I could understand Windows using UTF-16 for internal stuff like the filesystem. But we're talking about metadata being inserted into an already existing semi-standard system in a format unlike anything else. Nothing else in EXIF is UTF-16; the one real exception to 'everything is numeric or ASCII' is the MakerNote tag, which is vendor-specific extension data that can usually be ignored.

    Why couldn't Microsoft have used UTF-8 for inserting this metadata? Or even just straight ASCII? Or is interoperability not a thing in Microsoft's world? UTF-8 was a thing in 2001 when this was introduced in WinXP.



  • @Arantor said:

    Or is interoperability not a thing in Microsoft's world?



  • @thegoryone said:

    waste time fucking about with Sharepoint

    Opening Sharepoint, then waiting for it to get its shit together and finish loading?



  • @Arantor said:

    Is it little endian or big endian? There's no BOM or anything

    That's got to be in violation of either the EXIF or UCS-2 spec, surely?
    Filed under: reality in violation of spec.


  • SockDev

    It's valid in terms of the EXIF spec; EXIF only defines a text field as being ASCII, Windows writes the file using the 'stream of unsigned bytes' option instead, so it's legit for EXIF.

    Here's where it gets hilarious... UCS-2 is specced out as being BE only, but most of the time it's actually LE anyway. Meanwhile the spec for UTF-16 says if the BOM is missing, assume BE... the typical case however is LE without BOM, so yes, it violates spec whether you look at UCS-2 or UTF-16.
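
    If you want to be defensive about it anyway, a sketch along these lines honours a BOM when one happens to be present and otherwise uses whatever default the caller picks. The name detectBom and the return shape are assumptions; the 'BE' default just mirrors what the UTF-16 spec says to assume.

    ```php
    <?php
    // Hypothetical helper: returns [byte order, remaining bytes]. A leading BOM
    // (FF FE = little endian, FE FF = big endian) wins and is stripped; without
    // one, the caller-supplied default is used and the bytes are left untouched.
    function detectBom(string $bytes, string $default = 'BE'): array
    {
        $head = substr($bytes, 0, 2);
        if ($head === "\xFF\xFE") {
            return ['LE', (string) substr($bytes, 2)];
        }
        if ($head === "\xFE\xFF") {
            return ['BE', (string) substr($bytes, 2)];
        }
        return [$default, $bytes];
    }
    ```

    For the XP* fields described here, passing 'LE' as the default would match the typical case better than the spec does.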



  • By "like", I mean "that sucks".


  • SockDev

    It sucks but it's not impossibru to deal with.





  • htons and ntohs are C functions that also exist in POSIX, but don't seem to be in PHP.

    PHP does have pack and unpack though, which could be used to convert between a UCS2 string and an array of codepoints.
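
    A minimal sketch of that idea, assuming the byte order is already known: unpack() with the 'v*' format reads unsigned 16-bit little-endian values and 'n*' reads big-endian ones. The wrapper name ucs2ToCodeUnits is hypothetical, and stopping at the first 0x0000 unit is an assumption based on the null terminators mentioned above.

    ```php
    <?php
    // Hypothetical helper: split UCS-2 bytes into a list of 16-bit values.
    function ucs2ToCodeUnits(string $bytes, bool $littleEndian = true): array
    {
        $len = strlen($bytes) - (strlen($bytes) % 2); // drop a trailing odd byte
        if ($len < 2) {
            return [];
        }
        // unpack() returns a 1-indexed array, so renumber it from 0.
        $format = $littleEndian ? 'v*' : 'n*';
        $units = array_values(unpack($format, substr($bytes, 0, $len)));
        // Cut the list at the first 0x0000 unit, treating it as the terminator.
        $end = array_search(0, $units, true);
        return $end === false ? $units : array_slice($units, 0, $end);
    }
    ```

    For pure UCS-2 these values are the codepoints; for UTF-16 they are only code units and surrogate pairs still need combining.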


  • SockDev

    Actually you couldn't.

    If it's a two byte character, sure, you could unpack it just fine. If it's a two byte character, the value will be outside the 0xD800-0xDFFF range. Except UTF-16 can be 4 byte as well as 2 byte, which is when the first value is in 0xD800-0xDBFF and the second is in 0xDC00-0xDFFF. Which means you need to go through the unpacking, which involves splitting the last 10 bits off both pairs of bytes, shifting them around and adding 0x10000 to it to get your codepoint out of it.
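
    A sketch of that unpacking step, working on 16-bit values that have already been read in the right byte order. The name codeUnitsToCodepoints is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption rather than what the actual routine does.

    ```php
    <?php
    // Hypothetical helper: combine UTF-16 code units into Unicode code points.
    function codeUnitsToCodepoints(array $units): array
    {
        $codepoints = [];
        $count = count($units);
        for ($i = 0; $i < $count; $i++) {
            $unit = $units[$i];
            $next = $i + 1 < $count ? $units[$i + 1] : null;
            $isHigh = $unit >= 0xD800 && $unit <= 0xDBFF;
            $nextIsLow = $next !== null && $next >= 0xDC00 && $next <= 0xDFFF;
            if ($isHigh && $nextIsLow) {
                // Low 10 bits of each half, joined, then offset by 0x10000.
                $codepoints[] = 0x10000 + (($unit & 0x3FF) << 10) + ($next & 0x3FF);
                $i++; // the low surrogate has been consumed
            } elseif ($unit >= 0xD800 && $unit <= 0xDFFF) {
                $codepoints[] = 0xFFFD; // unpaired surrogate (assumed handling)
            } else {
                $codepoints[] = $unit;  // ordinary two byte character
            }
        }
        return $codepoints;
    }
    ```

    For example, the pair 0xD83D 0xDE00 comes out as the single code point 0x1F600.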

    It's about as interesting as unrolling UTF-8 by hand.



  • Except I said UCS2, not UTF16. I'm aware of surrogate pairs. I'm also aware UCS2 doesn't have them.


  • SockDev

    UCS-2 doesn't, true. But if I'm already doing this, I might as well do the UTF-16 surrogate pairs, it's not like it's a lot of work - my total routine for doing this, from start to finish including endianness, surrogate pair handling etc. amounts to 30 lines of code. It's all built and tested, I was just venting 😛

    I'm also not entirely enthusiastic about pack/unpack, just because most of the time it seems to confuse people.



  • EXIF borrowed the tag specification from TIFF. TIFF tag spec is bad, but control of the spec went to Adobe when they bought Aldus in 1994. Adobe decided to sit on the spec and not fix it to increase the value of the PDF file format.

    So, it's all Adobe's fault.

