Git hates UTF-16


  • Banned

    @dkf why would canonicalization preserve comments? I thought the point is to transform files in such a way that semantically identical files - and comments don't change semantics - become bitwise-equal.


  • ♿ (Parody)

    @Gribnit said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    "That chair is built of wood. It's not wood."

    "I am looking for some wood."


    In other words, to a person looking for some wood, that chair is built of wood. It is not wood in a useful form to them and they can't have it. And if they were eyeing my chair, I could tell them the thing you quoted and not be a crazy person whargarbbl

    No, you'd still be a crazy person.


  • ♿ (Parody)

    @anonymous234 said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    They are not bytes. They are implemented as bytes.

    How do you keep these things in your head at the same time?

    How do you not? A .png file is made of bytes. But an image is made of pixels (and metadata and other stuff). They are different layers. Text is exactly the same.

    Yes, they are different layers of abstraction. But they're still bytes at a particular level. Pixels, too, for that matter.


  • Banned

    @boomzilla the problem with reductionism is that abstraction prevents you from applying it. When you're at some level of abstraction, you cannot "move down" unless you're the implementor. Even if the image is made of bytes, you, being stuck at high-level abstraction, have no way of knowing it.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    the problem with reductionism is that abstraction prevents you from applying it.

    THAT'S WHAT WAS SAID ABOUT IT IN THE FIRST PLACE.

    Good lord. :wtf: is wrong with you?


  • Banned

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    the problem with reductionism is that abstraction prevents you from applying it.

    THAT'S WHAT WAS SAID ABOUT IT IN THE FIRST PLACE.

    You also kept saying "sequence of characters is sequence of bytes", which is something you cannot know in the general case, even if it is literally always true that all sequences of characters in the entire universe are implemented as sequences of bytes.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    the problem with reductionism is that abstraction prevents you from applying it.

    THAT'S WHAT WAS SAID ABOUT IT IN THE FIRST PLACE.

    You also kept saying "sequence of characters is sequence of bytes", which is something you cannot know in the general case, even if it is literally always true that all sequences of characters in the entire universe are implemented as sequences of bytes.

    No, but that's apparently something that you can't know.


  • Banned

    @boomzilla because I keep my abstractions straight, and apparently you don't.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    @boomzilla because I keep my abstractions straight, and apparently you don't.

    It's 2019, dude. Don't be so close minded. And based on this comment you still haven't grokked it, which is actually kind of funny at this point, but soon it will just be sad.


  • Banned

    @boomzilla grokked what? That abstractions don't matter because ultimately everything is silicon?


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    @boomzilla grokked what? That abstractions don't matter because ultimately everything is silicon?

    LOL...no, the opposite, that the abstractions do matter and you should pay attention to them if you want correct results.


  • Banned

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    @boomzilla grokked what? That abstractions don't matter because ultimately everything is silicon?

    LOL...no, the opposite

    The opposite is that you cannot say that characters are bytes because abstraction prevents you from knowing it. But apparently you disagree with it.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    @boomzilla grokked what? That abstractions don't matter because ultimately everything is silicon?

    LOL...no, the opposite

    The opposite is that you cannot say that characters are bytes because abstraction prevents you from knowing it. But apparently you disagree with it.

    TDEMSYR. Unless you mean to deny reality and are just doing some kind of absurdist performance art. Which is not really any better, actually.


  • ♿ (Parody)

    @Gąska so, consider one of those Magic Eye books. Depending on how you look at it you see that it's a bunch of messed up randomish gibberish or a 3D picture. You're trying to deny that people can see one of those views of it.



  • @boomzilla Well, yes, but you're not supposed to think about the bytes if you're dealing with pixels, and vice versa. It's bad design, because it will break if someone changes something.

    It's like if you have a program that stores stuff in a database: you'd have a layer that reads and writes between the database and in-memory objects, but you wouldn't expect random code outside of that layer to connect to the database by itself and run SQL queries.

    The problem is when tools like git mostly work on bytes but then decide to snoop on a different level and alter the line endings or something like that, without actually understanding the full encoding details. And the idea that you can do that comes from the ASCII days when one byte was one character and that's all there was to it. But those days are dead and they should stay dead.
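
    A minimal sketch of that failure mode, assuming a naive byte-level LF-to-CRLF rewrite. The helper names `naive_crlf` and `text_crlf` are made up for illustration, and this is not git's actual code:

```python
# Illustration only: a byte-level newline rewrite that ignores the encoding.
text = "a\nb"

utf8 = text.encode("utf-8")        # 61 0A 62
utf16 = text.encode("utf-16-le")   # 61 00 0A 00 62 00


def naive_crlf(data: bytes) -> bytes:
    """Replace every LF byte with CR LF without looking at the encoding."""
    return data.replace(b"\n", b"\r\n")


def text_crlf(data: bytes, encoding: str) -> bytes:
    """Decode first, rewrite newlines on characters, then encode again."""
    return data.decode(encoding).replace("\n", "\r\n").encode(encoding)


# UTF-8 survives, because LF is the single byte 0x0A there.
utf8_out = naive_crlf(utf8).decode("utf-8")

# In UTF-16LE the LF is the byte pair 0A 00. Inserting a lone 0D byte
# shifts every following code unit by one byte and leaves an odd length.
broken = naive_crlf(utf16)
print(broken.hex(" "))                                  # 61 00 0d 0a 00 62 00
print(repr(broken.decode("utf-16-le", errors="replace")))

# Working one layer up keeps the text intact.
fixed = text_crlf(utf16, "utf-16-le").decode("utf-16-le")
print(repr(fixed))
```

    The byte-level version only works when the encoding happens to represent LF as a lone 0x0A byte; the character-level version has to know the encoding, which is exactly the information the byte layer doesn't have.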


  • Considered Harmful

    @boomzilla but is the gibberish a sequence of bytes?


  • ♿ (Parody)

    @anonymous234 said in Git hates UTF-16:

    @boomzilla Well, yes, but you're not supposed to think about the bytes if you're dealing with pixels, and vice versa. It's bad design, because it will break if someone changes something.

    THAT'S THE POINT. Yes.


  • Banned

    @boomzilla said in Git hates UTF-16:

    Unless you mean to deny reality and are just doing some kind of absurdist performance art.

    I could say the same about your argument that recycling results in a world where everyone lives in garbage, drinks from garbage, drives garbage, wipes ass with garbage, and sees garbage all around everywhere they go. Because all those things are made from recycled garbage, and that apparently makes them garbage too.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    Unless you mean to deny reality and are just doing some kind of absurdist performance art.

    I could say the same about your argument that recycling results in a world where everyone lives in garbage, drinks from garbage, drives garbage, wipes ass with garbage, and sees garbage all around everywhere they go. Because all those things are made from recycled garbage, and that apparently makes them garbage too.

    [image] A dinosaur might have once peed out the water in your coffee this morning. [image]


  • Banned

    @boomzilla that's how I imagine you right now, yes.


  • ♿ (Parody)

    @Gąska I know. You are in full on denial of reality. You believe that abstractions are in fact concrete, dogs and cats can live together. Mass hysteria!


  • Banned

    @anonymous234 @pie_flavor I'm not talking about supposing or pretending, like humans do when working on software. I'm talking about literally being forbidden from doing otherwise by laws of mathematics and assumed axioms of the system you're operating within. It's hard to find a real-world example of that, because the physical world rarely follows abstract mathematical models, but I can think of one thing. Standard Model fundamental particles. Are these particles made from something? Probably. Do we know what it is? Not right now. Will we ever know? Some theories say we won't. That it is literally unobservable what the fermions and bosons are made of. So even if they are all made from angelic tears and hair from God's neckbeard, no human being will ever be able to say that with full confidence.

    All humans can tell that characters are made of bytes. It's because humans are unconstrained by abstraction. But in the general case of arbitrary mathematical entities that might or might not be constrained by abstraction - such as computer programs - it is not always possible to see what characters are made of. So in the general case, saying that characters are bytes, while not necessarily wrong, is definitely a logical fallacy.

    I'm probably using all the wrong terminology. I don't care. Bz isn't going to read it anyway, and it should be clear enough for everyone else.


  • ♿ (Parody)

    @Gąska said in Git hates UTF-16:

    All humans can tell that characters are made of bytes.

    :faints:

    [image]


  • Banned


  • Discourse touched me in a no-no place

    @Gąska said in Git hates UTF-16:

    All humans can tell that characters are made of bytes.

    It's just a common current implementation pattern, no more.



  • @Gąska Your post seems to boil down to "text characters, as an abstract concept, don't involve any bytes".

    Which is true, but consider this: text characters stored in a computer do involve bytes.


  • Banned

    @anonymous234 but is it enough to say that characters are bytes?


  • BINNED

    @Gribnit said in Git hates UTF-16:

    "I am looking for some wood."

    :giggity:


  • Considered Harmful

    @boomzilla said in Git hates UTF-16:

    But then @ixvedeusi says it's not even true. And then you get the weird response that says they're always implemented as bytes and therefore they're not bytes. I'm not sure what to think about a person who says that.

    It speaks of bad character. Or rather malformed character.


  • Considered Harmful

    @Luhmann said in Git hates UTF-16:

    @Gribnit said in Git hates UTF-16:

    "I am looking for some wood."

    :giggity:

    Damnyou :hanzo:



  • @pie_flavor said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Carnage Welcome to another episode of why I'd rather use INI files. Incidentally, that's apparently broken by UTF8 rather than UTF16...

    Why not use Toml then?

    Because I already have half a dozen functions around GetPrivateProfileString() that do what it does and more.


  • Considered Harmful

    @Gąska said in Git hates UTF-16:

    @anonymous234 but is it enough to say that characters are bytes?

    A UTF-16 character consists of one or two 16-bit words. On most CPUs, they can be mapped to two or four bytes. On a CDC-3600, both would fit into a single byte and leave some room, just like an ASCII character can fit into an 8-bit byte and leave some room.

    Thus: bytes or fractions or multiples thereof. Might as well say "some data kinda stuff", that would be just as enlightening.
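
    For the common case of 8-bit bytes, the "one or two 16-bit words" part can be checked directly. A small sketch, where `utf16_units` is a made-up helper and the sample characters are arbitrary:

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for a single code point."""
    return len(ch.encode("utf-16-le")) // 2


for ch in ["A", "ą", "€", "😀"]:
    units = utf16_units(ch)
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {units} UTF-16 code unit(s) = {units * 2} octets; "
          f"{utf8_len} octet(s) in UTF-8")
```

    Anything outside the Basic Multilingual Plane, like the last sample, takes a surrogate pair, so even "one character is one 16-bit word" does not hold.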


  • Banned

    @Zenith said in Git hates UTF-16:

    Because I already have half a dozen functions around GetPrivateProfileString() that do what it does and more.

    Note This function is provided only for compatibility with 16-bit Windows-based applications. Applications should store initialization information in the registry.


  • Discourse touched me in a no-no place

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    All humans can tell that characters are made of bytes.

    :faints:

    [image]



  • @Gąska

    1. That warning was there when I was developing on Windows 2000. The function still works on Windows 10.
    2. It's not like I can't write an INI parser myself. I actually did when a web app I was working on started leaking memory through calls to GetPrivateProfileString() until I figured out where/why it was leaking.
    3. XML-based configuration is garbage. I will fight it to the ends of the Earth.

  • ♿ (Parody)

    @PJH said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    All humans can tell that characters are made of bytes.

    :faints:

    [image]

    Fuckit! We're doing quadnary!


  • Discourse touched me in a no-no place

    @boomzilla said in Git hates UTF-16:

    @PJH said in Git hates UTF-16:

    @boomzilla said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    All humans can tell that characters are made of bytes.

    :faints:

    [image]

    Fuckit! We're doing quadnary!

    That's just binary². Quinary is next.


  • Banned

    @Zenith said in Git hates UTF-16:

    @Gąska

    1. That warning was there when I was developing on Windows 2000. The function still works on Windows 10.

    It's not that it doesn't work. It's that you're using WinAPI to parse INI files. With a function that is explicitly marked in documentation as "do not use". And even in Windows 2000 times, it was still huge :doing_it_wrong: to use that function.

    2. It's not like I can't write an INI parser myself. I actually did when a web app I was working on started leaking memory through calls to GetPrivateProfileString() until I figured out where/why it was leaking.

    So you have an alternative that presumably does everything you need. Why won't you switch over?

    3. XML-based configuration is garbage. I will fight it to the ends of the Earth.

    Because GetPrivateProfileString() and XML are the only things in the universe which you can use to store configuration.

    Also. Yes, XML is grossly overused, and it's almost always overkill. But it does have a few valid uses.



  • @boomzilla said in Git hates UTF-16:

    A dinosaur might have once peed out the water in your coffee this morning.

    Another reason to be glad I don't drink coffee, I guess.



  • @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska

    1. That warning was there when I was developing on Windows 2000. The function still works on Windows 10.

    It's not that it doesn't work. It's that you're using WinAPI to parse INI files. With a function that is explicitly marked in documentation as "do not use". And even in Windows 2000 times, it was still huge :doing_it_wrong: to use that function.

    Is it? The documentation doesn't say "don't use this." It says "use the registry," which Microsoft also no longer recommends.

    The real question here is why. Consider this: Microsoft makes a much bolder warning against using Office automation. You know the real problem with Office automation? One, it pops up dialogs that pause execution. Two, a failed operation not handled correctly can leave an invisible instance of the app and file open. Having read TheOldNewThing long enough, my belief is that the warning's purpose is to head off a legion of fools that jump straight to "Office is buggy shit" when they hit those issues.

    @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska
    2. It's not like I can't write an INI parser myself. I actually did when a web app I was working on started leaking memory through calls to GetPrivateProfileString() until I figured out where/why it was leaking.

    So you have an alternative that presumably does everything you need. Why won't you switch over?

    After "you don't need that," the next most common response to programming questions is "why don't you use a library?" Well, I used an OS-provided library function.

    My alternative, that I threw together in 20 minutes to get production back up, was read-only.

    @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska
    3. XML-based configuration is garbage. I will fight it to the ends of the Earth.

    Because GetPrivateProfileString() and XML are the only things in the universe which you can use to store configuration.

    JSON has a few dialects depending on the parser. It ends up loading dictionaries of dictionaries, encouraging the same too-many-layers situation as XML.
    TOML is an INI with partial JSON syntax that I don't want to deal with.
    Registry is not portable and runs afoul of rights issues.
    I could bundle SQL Compact DLLs to read an MDF and make the settings entirely opaque without another tool.

    INI is really a happy middle ground. It's fast, it's easy to read, and it's already done for me.


  • Banned

    @Zenith said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska

    1. That warning was there when I was developing on Windows 2000. The function still works on Windows 10.

    It's not that it doesn't work. It's that you're using WinAPI to parse INI files. With a function that is explicitly marked in documentation as "do not use". And even in Windows 2000 times, it was still huge :doing_it_wrong: to use that function.

    Is it? The documentation doesn't say "don't use this."

    It says "this function is provided only for compatibility with 16-bit Windows-based applications". If you don't see this as deprecation warning, then the way you handle INI files is the least of your problems.

    It says "use the registry," which Microsoft also no longer recommends.

    Yes, MS could update the texts of their deprecation warnings more often (or just have deprecated functions index, for easy lookup). It doesn't change the fact it's deprecated and you shouldn't use it. And if 15 years later its replacement also got deprecated (is it, though? Got any link? I'm genuinely curious) - it doesn't mean you should go back to the old thing; it means you should use neither.

    The real question here is why.

    Because it can cause inter-process deadlocks, among other problems - some of which you've already encountered, apparently.

    @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska
    2. It's not like I can't write an INI parser myself. I actually did when a web app I was working on started leaking memory through calls to GetPrivateProfileString() until I figured out where/why it was leaking.

    So you have an alternative that presumably does everything you need. Why won't you switch over?

    After "you don't need that," the next most common response to programming questions is "why don't you use a library?" Well, I used an OS-provided library function.

    An ancient, deprecated library function that even its authors say that it shouldn't be there in the first place.

    My alternative, that I threw together in 20 minutes to get production back up, was read-only.

    Let's say you'd have to spend 10 times longer to port the entire thing. That's still just one day (rounded up). And there are plenty of ready-made libraries available online.

    @Gąska said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    @Gąska
    3. XML-based configuration is garbage. I will fight it to the ends of the Earth.

    Because GetPrivateProfileString() and XML are the only things in the universe which you can use to store configuration.

    JSON has a few dialects depending on the parser. It ends up loading dictionaries of dictionaries, encouraging the same too-many-layers situation as XML.
    TOML is an INI with partial JSON syntax that I don't want to deal with.
    Registry is not portable and runs afoul of rights issues.
    I could bundle SQL Compact DLLs to read an MDF and make the settings entirely opaque without another tool.

    INI is really a happy middle ground. It's fast, it's easy to read, and it's already done for me.

    Then use INI. Just not through GetPrivateProfileString() and friends.
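
    As a rough illustration of what "an INI library" buys you, here is a sketch using Python's standard `configparser`. The language is picked purely for the example (the code under discussion is WinAPI/.NET), and the file name, section and keys are made up:

```python
import configparser
import os
import tempfile

# Hypothetical settings; the section and key names are invented for this sketch.
ini_text = "[database]\nhost = localhost\nname = zażółć\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "settings.ini")
    with open(path, "w", encoding="utf-8") as f:
        f.write(ini_text)

    parser = configparser.ConfigParser()
    # The encoding is stated explicitly instead of being guessed;
    # read() returns the list of files it managed to parse.
    loaded = parser.read(path, encoding="utf-8")

    host = parser.get("database", "host")
    name = parser.get("database", "name")
    # A missing key is handled by an explicit fallback.
    port = parser.getint("database", "port", fallback=5432)

print(host, name, port)
```

    Same file format, same readability, but the encoding and the missing-key behaviour are decided by the caller.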



  • @Gąska

    It says "this function is provided only for compatibility with 16-bit Windows-based applications". If you don't see this as deprecation warning, then the way you handle INI files is the least of your problems.

    The deprecation warnings in their SQL Server documentation are more explicit. I stopped worrying about GetPrivateProfileString() disappearing when it survived to Windows 8. Besides, I have a DLL that's divided between shortcuts and fixing weird/stupid library APIs, so it's easy to replace, once, when I need or want to so it's no big deal.

    An ancient, deprecated library function that even its authors say that it shouldn't be there in the first place.

    It wasn't ancient when I started building software around it. And, really, it's like a lot of things where the vendor provides a "replacement" that really isn't a pure replacement. Using the registry removes portability. Using .NET's app.config is XML clutter. What was the goal here? Both changes were effectively attempts at security through obscurity. Hide it in the registry, oh wait that didn't work, hide it in XML tag soup. I don't want to keep churning like that.


  • Banned

    @Zenith said in Git hates UTF-16:

    @Gąska

    It says "this function is provided only for compatibility with 16-bit Windows-based applications". If you don't see this as deprecation warning, then the way you handle INI files is the least of your problems.

    The deprecation warnings in their SQL Server documentation are more explicit.

    They're also much newer, created in much more civilized times, by much more experienced developers, with much more rigid procedures. I can't find any old copy of official documentation, but it's entirely possible this warning was there since the original release of Windows 95.

    I stopped worrying about GetPrivateProfileString() disappearing when it survived to Windows 8.

    Really? Windows Vista wasn't enough of a clue? Not to mention that Microsoft was well known for its legendary care for backward compatibility long before Vista. Before even Windows 98.

    But it doesn't change that it's still wrong to use it.

    Besides, I have a DLL that's divided between shortcuts and fixing weird/stupid library APIs, so it's easy to replace, once, when I need or want to so it's no big deal.

    The fact it's so easily replaceable and yet you haven't done it yet after all these years just makes it worse.

    An ancient, deprecated library function that even its authors say that it shouldn't be there in the first place.

    It wasn't ancient when I started building software around it.

    You're still maintaining software you started in 1994? I'm impressed.

    And, really, it's like a lot of things where the vendor provides a "replacement" that really isn't a pure replacement. Using the registry removes portability. Using .NET's app.config is XML clutter. What was the goal here? Both changes were effectively attempts at security through obscurity. Hide it in the registry, oh wait that didn't work, hide it in XML tag soup. I don't want to keep churning like that.

    I'll say it once again, and I'll say it for the last time: I'M NOT TALKING ABOUT REGISTRY. I'M NOT TALKING ABOUT XML. USE AN INI LIBRARY IF THAT'S WHAT YOU WANT. JUST NOT THE OLD BROKEN SYSTEM FUNCTION THAT HAS BEEN DEPRECATED FOR MULTIPLE DECADES NOW. At least for new projects.



  • @Gąska said in Git hates UTF-16:

    The fact it's so easily replaceable and yet you haven't done it yet after all these years just makes it worse.

    Hey, I get sidetracked. There's so much that doesn't work right, like controls, that I have a list that's a mile long. Is there a single fucking control in WinForms that autosizes right out of the box besides labels?

    Look, in my experience, there are two types of programs. There are those like mine that abstract away stuff like GetPrivateProfileString() and increase their utility year over year. And then there are the needful doers endlessly rewriting their shitty featureless dependency-laden framework-of-the-month web app into next year's shitty featureless dependency-laden framework-of-the-month web app. If I wanted churn for churn's sake, I'd throw out my Windows tools and jump head first into Android.

    @Gąska said in Git hates UTF-16:

    JUST NOT THE OLD BROKEN SYSTEM FUNCTION

    You haven't explained how it's broken though. All you've provided is a suggestion on MSDN. Really, what's wrong with it? That they added a bunch of kludges to make it read/write other places? Doesn't affect me because I validate paths before passing them through the interop layer.

    Not trying to be difficult here. I just have something that works and different priorities.


  • Banned

    @Zenith said in Git hates UTF-16:

    @Gąska said in Git hates UTF-16:

    The fact it's so easily replaceable and yet you haven't done it yet after all these years just makes it worse.

    Hey, I get sidetracked. There's so much that doesn't work right, like controls, that I have a list that's a mile long. Is there a single fucking control in WinForms that autosizes right out of the box besides labels?

    Look. I understand that there are always too many things to do and never enough time. I understand that functional tasks take priority over non-functional ones. I understand that technical debt is not the end of the world and it often makes no economic sense to upgrade. But it was still a bad decision to use that part of WinAPI. What's done is done, but for new projects - and I assume you've had at least a couple of new projects between now and 2000, especially since you're talking about WinForms - you shouldn't have repeated it. Or at the very least, don't repeat it for the ones you start in the future.

    Look, in my experience, there are two types of programs. There are those like mine that abstract away stuff like GetPrivateProfileString() and increase their utility year over year. And then there are the needful doers endlessly rewriting their shitty featureless dependency-laden framework-of-the-month web app into next year's shitty featureless dependency-laden framework-of-the-month web app.

    And there's no middle ground at all. None whatsoever. It's either doing things your entire life the exact same way you've done them when you were 20, or rewriting everything from scratch every week and having no time for actual features. Nothing in between.

    @Gąska said in Git hates UTF-16:

    JUST NOT THE OLD BROKEN SYSTEM FUNCTION

    You haven't explained how it's broken though. All you've provided is a suggestion on MSDN. Really, what's wrong with it?

    • Wonky Unicode support. Can cause silent data loss.
    • 32kB limit per file. I know it might very well be enough for you, but in 2019 it's pitiful. And exceeding the limit is a silent error.
    • Simultaneous access can cause deadlock, even between different processes.
    • There is no way to distinguish between a read that fills up entire buffer and a read that's too big for a buffer and got truncated.
    • The documentation describes some rather complicated registry shenanigans that happen on every read and write that I didn't fully understand. But the bottom line is, what the application sees might not always be what actually is in the file.

    I'm sure there are more, but this should be enough. Yes, you can always do a bunch of hacks to work around each of these problems, like not allowing Unicode in your app, always checking for 32kB limit before write, restricting programs so only one instance of one program can read any given file, treat full buffers as read errors unconditionally, and just put a big fat warning not to edit those files by hand - but at this point, is it still worth it?
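
    To make the buffer point concrete, here is a sketch in Python with a made-up stand-in, `fake_get_profile_string`, that only mimics the documented truncation rule (at most nSize - 1 characters are copied and that count is returned). It does not call the real API:

```python
def fake_get_profile_string(value: str, n_size: int) -> tuple:
    """Hypothetical stand-in: copy at most n_size - 1 characters (one slot
    is reserved for the terminating NUL) and return the copied text
    together with the number of characters copied."""
    copied = value[: n_size - 1]
    return copied, len(copied)


def read_whole_value(value: str, n_size: int = 8) -> str:
    """Workaround: treat a completely full buffer as 'maybe truncated'
    and retry with a bigger one until there is slack left."""
    while True:
        copied, count = fake_get_profile_string(value, n_size)
        if count < n_size - 1:
            return copied
        n_size *= 2


# A value that exactly fills the buffer and one that was cut off
# look identical to the caller.
exact = fake_get_profile_string("1234567", 8)
truncated = fake_get_profile_string("123456789", 8)
print(exact, truncated, exact == truncated)

print(read_whole_value("123456789"))
```

    The retry loop works, but it is one more workaround every caller has to remember.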


  • Java Dev

    @Zenith said in Git hates UTF-16:

    jump head first into Android.

    Be careful with that. The pool is leaky and I don't know how much water is still in it.



  • @PleegWat said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    jump head first into Android.

    Be careful with that. The pool is leaky and I don't know how much water is still in it.

    And like all public pools, there is a significant amount och piss in it.


  • Considered Harmful

    @Carnage said in Git hates UTF-16:

    och



  • @Carnage said in Git hates UTF-16:

    @PleegWat said in Git hates UTF-16:

    @Zenith said in Git hates UTF-16:

    jump head first into Android.

    Be careful with that. The pool is leaky and I don't know how much water is still in it.

    And like all public pools, there is a significant amount och piss in it.

    Strictly speaking, that's true for all bodies of water.


  • Banned

    @Rhywden well, only public bodies of water. And some private ones, depending on what lives in them.

