Git hates UTF-16

dkf

You really are channeling blakey.

Petition: rename @Gąska to @Polskyrat.

Gąska

@dkf YES PLEASE! That would really piss him off.

boomzilla

@boomzilla said in Git hates UTF-16:

But of course, in reality you might want to do just that in some cases. Copying memory, transmitting over a network, serializing to disk.

Because it just so happens that some of the operations that are valid for arbitrary sequences of bytes are also valid for character strings. Like, for example, bitwise copy of the entire string from the beginning to the very end, skipping nothing in between. Java's ArrayLists are also implemented as sequences of bytes, just like character strings, and there are also some operations that you can do on arbitrary sequences of bytes that are also valid for variable-sized array objects - but if you try to bitwise copy ArrayList, you're gonna have a bad time.

I see that despite covering this, you went ahead and said it.

In the ArrayList example, the individual elements are sequences of bytes. But as you say, the characters are not laid out (necessarily) next to each other in memory, and "neither" are the sequences of bytes. So you still have a sequence of bytes there.

You can sometimes in limited circumstances treat character strings identically to arbitrary byte sequences. You can sometimes in limited circumstances treat ArrayLists identically to arbitrary byte sequences. But despite that, ArrayLists are still not the same thing as byte sequences. And neither are character strings.

I'm not sure why you think the ability to "treat character strings identically to arbitrary byte sequences" is relevant here. Show me a sequence of characters that isn't bytes or STFU.

dkf

@ixvedeusi said in Git hates UTF-16:

In computer science, everything is represented as bytes.

Above a certain level of abstraction, sure.

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

At least until you can show an example of a sequence of characters that isn't bytes

In computer science, everything is represented as bytes.

Exactly!

So I obviously cannot show you an example where a string isn't represented with bytes.

I accept your concession.

But fundamentally, computer programming is manipulating electrical charges and nothing else. We just use these electrons to represent things so that we can reason in an abstract space which happens to have similar properties to the abstract space in which the problem we want to solve lives.

Yes, I'm being very pedantic and philosophical here, but this has many rather concrete real-life consequences we have to deal with daily, such as character encodings and leaky abstractions.

Except you're failing at pedantry by focusing on the proper way to deal with a thing and ignoring that the thing is still what it is.

ixvedeusi

@boomzilla said in Git hates UTF-16:

Except you're failing at pedantry by focusing on the proper way to deal with a thing and ignoring that the thing is still what it is.

There are many ways to represent a sequence of characters without involving any bytes (or using an entirely different sequence of bytes), and there are many ways to make a sequence of bytes represent something else than a string (or a string of entirely different characters), so I can't really see how anyone could think that one of them "is" the other (or vice versa).
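For anyone who wants to poke at the "entirely different sequence of bytes" part, here is a rough Python sketch. The sample text and the choice of UTF-8, UTF-16-LE and Latin-1 are just assumptions picked for illustration, not anything the thread settled on.

```python
# The same five characters, encoded two different ways.
text = "Gąska"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# Same characters, different byte sequences (and different lengths).
print(len(text), len(utf8), len(utf16))

# The same bytes read with a different decoder give different characters.
as_latin1 = utf8.decode("latin-1")
print(repr(as_latin1))

# Only the matching decoder gets the original characters back.
round_trip = utf8.decode("utf-8")
```

Whether that makes a string "be" bytes or merely "be represented as" bytes is the argument above; the sketch only shows that the mapping is not one-to-one in either direction.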

Gąska

@boomzilla said in Git hates UTF-16:

In the ArrayList example, the individual elements are sequences of bytes.

I wasn't talking about elements, I was talking about ArrayList itself. Is it not implemented as a sequence of bytes too, after all? A pointer and two counters? Those three are laid out in contiguous memory. Like character strings.

You can sometimes in limited circumstances treat character strings identically to arbitrary byte sequences. You can sometimes in limited circumstances treat ArrayLists identically to arbitrary byte sequences. But despite that, ArrayLists are still not the same thing as byte sequences. And neither are character strings.

I'm not sure why you think the ability to "treat character strings identically to arbitrary byte sequences" is relevant here.

Because that's what "character string is byte sequence" means. No more, no less.
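A loose illustration of the "pointer and two counters" point, in Python rather than Java. `GrowableList` is a made-up toy class, and `copy.copy` (a shallow copy) only stands in for a bitwise copy of the header, so treat it as an analogy and not as how ArrayList actually behaves.

```python
import copy


class GrowableList:
    """Toy stand-in for an ArrayList header: a reference to storage plus a count."""

    def __init__(self):
        self.storage = []  # plays the role of the pointer to the backing array
        self.size = 0      # plays the role of the element counter

    def add(self, item):
        self.storage.append(item)
        self.size += 1


original = GrowableList()
original.add("a")

# Shallow copy: the header fields are duplicated, the backing storage is shared.
clone = copy.copy(original)
clone.add("b")

# The original's counter no longer matches its (shared) storage.
print(original.size, len(original.storage))

# A flat byte string has no such problem: copying every byte gives an independent value.
data = "Gąska".encode("utf-8")
copied_bytes = bytes(bytearray(data))
```

The copied header ends up disagreeing with the storage it shares, which is roughly the "bad time" described earlier, while the end-to-end copy of the flat byte string is fine.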

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

Except you're failing at pedantry by focusing on the proper way to deal with a thing and ignoring that the thing is still what it is.

There are many ways to represent a sequence of characters without involving any bytes (or using an entirely different sequence of bytes),

Such as? I was only asking for a single one. Using an entirely different sequence of bytes is still using bytes, so I'm not sure what you're trying to say there.

and there are many ways to make a sequence of bytes represent something else than a string (or a string of entirely different characters), so I can't really see how anyone could think that one of them "is" the other (or vice versa).

I don't follow this at all. They are what they are. And what they are is a sequence of bytes.

Zenith

@Carnage Welcome to another episode of why I'd rather use INI files. Incidentally, that's apparently broken by UTF8 rather than UTF16...

boomzilla

@Gąska said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

In the ArrayList example, the individual elements are sequences of bytes.

I wasn't talking about elements, I was talking about ArrayList itself. Is it not implemented as a sequence of bytes too, after all? A pointer and two counters? Those three are laid out in contiguous memory. Like character strings.

So then it's also a sequence of bytes, you're saying?

You can sometimes in limited circumstances treat character strings identically to arbitrary byte sequences. You can sometimes in limited circumstances treat ArrayLists identically to arbitrary byte sequences. But despite that, ArrayLists are still not the same thing as byte sequences. And neither are character strings.

I'm not sure why you think the ability to "treat character strings identically to arbitrary byte sequences" is relevant here.

Because that's what "character string is byte sequence" means. No more, no less.

It most certainly does not. That's a massive leap of logic that's adding some assumptions that are irrelevant. It's simply a statement about what they are in a computer representation.

ixvedeusi

@boomzilla said in Git hates UTF-16:

Such as?

I can write them on a piece of paper, for example. No bytes involved, and it "is" still a string of characters. The concept of "character" isn't constrained to computer science, you know; that's just an arbitrary limitation you somehow came up with.

Gąska

@boomzilla is sugar made of carbon and water, or is sugar the same as carbon and water?

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

Such as?

I can write them on a piece of paper, for example.

ixvedeusi

@boomzilla said in Git hates UTF-16:

Care to elaborate?

boomzilla

@Gąska said in Git hates UTF-16:

@boomzilla is sugar made of carbon and water, or is sugar the same as carbon and water?

I see your problem now. You've injected the word, "same" and added a lot of stuff to the statement that "a sequence of characters is a sequence of bytes."

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

Care to elaborate?

I explicitly said in a computer, since that was what we were discussing.

ixvedeusi

@boomzilla said in Git hates UTF-16:

I explicitly said in a computer, since that was what we were discussing.

And as I said that's just an arbitrary limitation which isn't relevant for the discussion. It's just that this problem of conflating "is" and "is represented as" has the most obvious practical implications in computer science, but it's in no way limited to that domain. If you disagree, please explain why it's relevant.

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

I explicitly said in a computer, since that was what we were discussing.

And as I said that's just an arbitrary limitation which isn't relevant for the discussion.

No, it was the discussion. I mean, sure, we can broaden the discussion.

It's just that this problem of conflating "is" and "is represented as" has the most obvious practical implications in computer science, but it's in no way limited to that domain. If you disagree, please explain why it's relevant.

When every instance of the thing is represented by a particular thing I don't see how that can be properly described as "conflating." Saying that there are other domains when we are discussing a particular domain might be an interesting discussion but it does nothing for the actual discussion of the domain at hand.

Seriously, I did not think that @dkf's statement could be controversial, and I've been around here for a long time.

Gąska

@boomzilla said in Git hates UTF-16:

@Gąska said in Git hates UTF-16:

@boomzilla is sugar made of carbon and water, or is sugar the same as carbon and water?

I see your problem now. You've injected the word, "same" and added a lot of stuff to the statement that "a sequence of characters is a sequence of bytes."

So the whole argument was just you being overly pedantic with dictionary definitions, and actually you agree with everything me and @ixvedeusi say. Cool.

dkf

@Gąska said in Git hates UTF-16:

So the whole argument was just you being overly pedantic with dictionary definitions

It's important for a discussion that everyone uses the same definitions.

Or that they don't (but don't realise it at the beginning) and are trying to find that out.

Or that they don't and are using the differences to make cheap jokes and so on.

boomzilla

@Gąska said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

@Gąska said in Git hates UTF-16:

@boomzilla is sugar made of carbon and water, or is sugar the same as carbon and water?

I see your problem now. You've injected the word, "same" and added a lot of stuff to the statement that "a sequence of characters is a sequence of bytes."

So the whole argument was just you being overly pedantic with dictionary definitions, and actually you agree with everything me and @ixvedeusi say. Cool.

If that interpretation of reality quiets the voices in your head, then sure.

You've been focusing on the reasons for the second part of what @dkf said (about the fact being useless for using them), and thereby contradicting yourself. @ixvedeusi was talking about something else entirely.

Gąska

@boomzilla no matter how many times you repeat yourself, it doesn't make it any more real. The contradiction was only in your head. Or maybe your shoulder aliens, not sure. It certainly wasn't in any of my posts.

boomzilla

@Gąska said in Git hates UTF-16:

@boomzilla no matter how many times you repeat yourself, it doesn't make it any more real. The contradiction was only in your head. Or maybe your shoulder aliens, not sure. It certainly wasn't in any of my posts.

Sure. Repeat that enough and maybe someone will believe it.

Gąska

@boomzilla are you so out of things to say that the only thing you can think of anymore is copy-pasting my posts?

Carnage

This discussion looks like two (or more) compilers arguing.

ixvedeusi

@boomzilla said in Git hates UTF-16:

When every instance of the thing is represented by a particular thing I don't see how that can be properly described as "conflating."

Let me re-phrase your sentence:

"When, in a very specific, artificial domain which in itself has no relevance or utility, a concept which comes from outside this specific domain, and which has no particular relevance inside it, is by necessity always in one form or another represented with the only thing we have available to represent anything, then saying that it 'is' this other thing we just happen to have at hand to represent that first thing, can hardly be described as 'conflating'."

Of course, as long as you don't consider the actual outside world at all, you can do whatever you want with your bytes and call them however you want to. I apologize for assuming you had some kind of interesting point to make.

Gąska

@Carnage said in Git hates UTF-16:

This discussion looks like two (or more) compilers arguing.

And one of them getting corrupted input and accidentally writing out "never" when it meant "always" and thinking the user created an implication rule when no such thing ever took place. And who knows what else it fucked up on the way.

ixvedeusi

@Gąska said in Git hates UTF-16:

And one of them getting corrupted input and accidentally writing out "never" when it meant "always" and thinking the user created an implication rule when no such thing ever took place.

You know, if you focused a bit less on the trivial careless mistakes of your adversary and a bit more on the actual matter at hand, your arguments could sometimes actually be interesting to read.

Same goes for @boomzilla btw. Both of you often actually have interesting points of view, it's just so darn hard to make them out through all the Wharrgarble.

Just a thought. Whatever, this is TDWTF, carry on.

Gąska

@ixvedeusi I'd love to leave all the pettiness aside and have a real discussion. But that requires good will on both sides. And you're not gonna find even a drop of that in @boomzilla.

boomzilla

@Gąska said in Git hates UTF-16:

@boomzilla are you so out of things to say that the only thing you can think of anymore is copy-pasting my posts?

What?

boomzilla

@ixvedeusi said in Git hates UTF-16:

@boomzilla said in Git hates UTF-16:

When every instance of the thing is represented by a particular thing I don't see how that can be properly described as "conflating."

Let me re-phrase your sentence:

"When, in a very specific, artificial domain which in itself has no relevance or utility, a concept which comes from outside this specific domain, and which has no particular relevance inside it, is by necessity always in one form or another represented with the only thing we have available to represent anything, then saying that it 'is' this other thing we just happen to have at hand to represent that first thing, can hardly be described as 'conflating'."

I can't understand how you could be a member on this forum and make the comment about relevance or utility, but yeah, we were talking about that specific domain.

Of course, as long as you don't consider the actual outside world at all, you can do whatever you want with your bytes and call them however you want to. I apologize for assuming you had some kind of interesting point to make.

I'm not sure why you think I don't want to consider any other domains. The interesting point was that you were wrong. Admittedly, it's interesting in a pedantic dickweed kind of way, but that comes with the territory around here.

boomzilla

@Gąska said in Git hates UTF-16:

@ixvedeusi I'd love to leave all the pettiness aside and have a real discussion. But that requires good will on both sides. And you're not gonna find even a drop of that in @boomzilla.

FLAGGED FOR LIBEL.

Seriously. It wasn't me who keeps promoting contradictory ideas.

Gąska

@boomzilla if it wasn't me and wasn't you, then who was?

boomzilla

@Gąska it was you. Also, since you've gone @masonwheeler and accused me of bad faith, I started a new thread:
https://what.thedailywtf.com/topic/26259/both-true-and-utterly-unhelpfully-misleading

Rhywden

@Gąska said in Git hates UTF-16:

@boomzilla if it wasn't me and wasn't you, then who was?

Phone.

pie_flavor

@Zenith said in Git hates UTF-16:

@Carnage Welcome to another episode of why I'd rather use INI files. Incidentally, that's apparently broken by UTF8 rather than UTF16...

Why not use Toml then?

Gribnit

@boomzilla said in Git hates UTF-16:

"That chair is built of wood. It's not wood."

"I am looking for some wood."

In other words, to a person looking for some wood, that chair is built of wood. It is not wood in a useful form to them and they can't have it. And if they were eyeing my chair, I could tell them the thing you quoted and not be a crazy person whargarbbl

Gribnit

@pie_flavor Because to hell with TOML.

pie_flavor

@Gribnit If you're going to act stupid, at least be funny about it.

HardwareGeek

@ixvedeusi said in Git hates UTF-16:

conflating "is" and "is represented as"

We're arguing about the definition of "is." Slick Willie would be so proud.

Zecc

@pie_flavor said in Git hates UTF-16:

Why not use Toml then?

What are your thoughts on RON?

dkf

@Zecc said in Git hates UTF-16:

What are your thoughts on RON?

Infested with Ruby. Kill it with fire.

anonymous234

@boomzilla said in Git hates UTF-16:

@Gąska said in Git hates UTF-16:

They are not bytes. They are implemented as bytes.

How do you keep these things in your head at the same time?

How do you not? A .png file is made of bytes. But an image is made of pixels (and metadata and other stuff). They are different layers. Text is exactly the same.
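A small Python sketch of the two layers, assuming UTF-8 as the encoding and an arbitrary sample string. The same "take the first two" operation is done once on the characters and once on the bytes underneath them.

```python
text = "Gąska"
raw = text.encode("utf-8")

# Character layer: slicing by characters always yields valid text.
first_two_chars = text[:2]

# Byte layer: slicing by bytes can cut a multi-byte character in half.
first_two_bytes = raw[:2]

try:
    first_two_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The second byte is only the first half of the two-byte "ą".
    decoded_ok = False

print(first_two_chars, first_two_bytes, decoded_ok)
```

Copying all of `raw` from start to end is still fine; it is the operations that care about character boundaries that stop being valid at the byte layer.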

Zecc

@dkf said in Git hates UTF-16:

Infested with Ruby

pie_flavor

@Zecc said in Git hates UTF-16:

@pie_flavor said in Git hates UTF-16:

Why not use Toml then?

What are your thoughts on RON?

What is the point of such a thing? You've got dictionaries and lists and numbers and strings and booleans. What is the point of a tuple when you have lists? What is the point of classes when you have dictionaries? The format should represent dumb data; the code figures the rest out. I like TOML because it is indeed obvious and minimal. This is neither. We've got eighty JSON alternatives; an XML alternative would be fun but this isn't a good one.

pie_flavor

@dkf said in Git hates UTF-16:

Infested with Ruby.

dkf

@pie_flavor said in Git hates UTF-16:

This is neither.

Worse, it supports comments so the canonicalization will be non-trivial. (TOML is worse, in that it has multiple ways to write strings…) Canonical forms are vital for comparison and cryptography.

JBert

@dkf I believe we weren't discussing cryptography though...

pie_flavor

@dkf Canoniwhatnow?

Gąska

@pie_flavor autoformat, I guess.