WTF Bites



  • @Benjamin-Hall

    Measuring string lengths is ĥ̶̨͍͉̳͉̘͈̣̈́̓͆̊͒̔̆͂͑̉̓̿̽͝a̴̞̞̙͗̒̑͗̌̇͝͝ȑ̵̢̨̛̥̲̹̠̄̆͆̒̑͂̿͐̈́̀͑̚͜ͅḑ̵̜̖̤͕̝̜͕̝͎̖̣̉̆̌̈́̄̓͋́̒͗̚͘.̸̨̡̤̦͚͈̻̮̖͇̗̫́̿̆̐̃̾̕͝ͅ



  • @Watson said in WTF Bites:

    @Benjamin-Hall

    Measuring string lengths is ĥ̶̨͍͉̳͉̘͈̣̈́̓͆̊͒̔̆͂͑̉̓̿̽͝a̴̞̞̙͗̒̑͗̌̇͝͝ȑ̵̢̨̛̥̲̹̠̄̆͆̒̑͂̿͐̈́̀͑̚͜ͅḑ̵̜̖̤͕̝̜͕̝͎̖̣̉̆̌̈́̄̓͋́̒͗̚͘.̸̨̡̤̦͚͈̻̮̖͇̗̫́̿̆̐̃̾̕͝ͅ

    Also hard: measuring string heights.


  • Considered Harmful

    @Benjamin-Hall said in WTF Bites:

    And then this note (under Unicode support) which stood out more because it confirms how much of a cluster Unicode is, although I'm not vouching for Swift's implementation either.

    Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.

    The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

    Edit: Not sure if :wtf:

    UTF-16 is obviously :trwtf: It's the "sort of works most of the time for the stuff we use and it's fast" solution. Most people seem to assume "character" means "Unicode code point" though, so it sounds like a bad idea to call stuff "characters" when you mean "grapheme clusters".

    String and character comparisons in Swift are not locale-sensitive.

    That's fine. I wouldn't want, say, password hashes in a string to suddenly compare differently because someone decided to use a Shit-JIS locale or something.



  • @Benjamin-Hall said in WTF Bites:

    Trawling through the Swift (language) documentation as a refresher, this Note stood out to me:

    The Swift standard library includes tuple comparison operators for tuples with fewer than seven elements. To compare tuples with seven or more elements, you must implement the comparison operators yourself.

    If you have a tuple with 7+ elements...you're :doing_it_wrong:

    Rust has them up to twelve. It is a limitation of both languages: neither has variadic generics yet.

    For manually written code a tuple with 7+ elements should probably be a structure with named members instead, but generated or generic code might sometimes still benefit from unlimited tuple support.
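
    For illustration, this is roughly what you'd have to write yourself for a 7-tuple in Swift, and roughly what the standard library already provides for up to six elements (a minimal, untested sketch assuming Comparable elements):

    func < <A: Comparable, B: Comparable, C: Comparable, D: Comparable,
            E: Comparable, F: Comparable, G: Comparable>(
        lhs: (A, B, C, D, E, F, G), rhs: (A, B, C, D, E, F, G)
    ) -> Bool {
        // Lexicographic: the first differing element decides.
        if lhs.0 != rhs.0 { return lhs.0 < rhs.0 }
        if lhs.1 != rhs.1 { return lhs.1 < rhs.1 }
        if lhs.2 != rhs.2 { return lhs.2 < rhs.2 }
        if lhs.3 != rhs.3 { return lhs.3 < rhs.3 }
        if lhs.4 != rhs.4 { return lhs.4 < rhs.4 }
        if lhs.5 != rhs.5 { return lhs.5 < rhs.5 }
        return lhs.6 < rhs.6
    }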

    @Benjamin-Hall said in WTF Bites:

    And then this note (under Unicode support) which stood out more because it confirms how much of a cluster Unicode is, although I'm not vouching for Swift's implementation either.

    Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.

    There is a minor :wtf: in the property being called just count. You can count bytes, code units, code points or graphemes, so it should probably say which.

    Also, something expensive to calculate probably shouldn't be hiding as a property. Worse, I believe they index strings by graphemes, so the indexing operator is also expensive, and that just makes it tempting to write Schlemiel the painter's algorithm.
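
    A quick sketch of the trap in Swift (hypothetical loop, just to show the shape of it):

    let s = String(repeating: "x", count: 10_000)
    var hits = 0
    for i in 0..<s.count {
        // index(_:offsetBy:) has to walk from the start, so it is O(i) each time,
        // making the whole loop O(n²): Schlemiel the painter at work.
        let idx = s.index(s.startIndex, offsetBy: i)
        if s[idx] == "x" { hits += 1 }
    }
    // The linear way is to iterate directly: for c in s { ... }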

    I prefer the Rust approach, where the low-level indexing is by byte (so not all indices are valid) and you have iterators over code points and graphemes.

    The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

    Of course legacy NSString works differently.
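
    All of those counts are available if you ask for them explicitly; a small sketch (the values in the comments are what a current Swift should print):

    let flag = "🇨🇿"                  // one grapheme cluster (two regional indicators)
    print(flag.count)                // 1 (Characters, i.e. grapheme clusters)
    print(flag.unicodeScalars.count) // 2 (code points)
    print(flag.utf16.count)          // 4 (code units; what NSString calls length)
    print(flag.utf8.count)           // 8 (bytes)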

    Edit: Not sure if :wtf:

    String and character comparisons in Swift are not locale-sensitive.

    I'd say not a :wtf:. There are a lot of cases where you need the sorting to be stable and consistent between systems, so it's better if sorting is locale-sensitive only on explicit request. Other languages also don't make string sorting locale-sensitive by default.
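
    For illustration: Swift's default == is locale-independent (though it does apply Unicode canonical equivalence), and anything locale-aware is an explicit Foundation call. A sketch, assuming ICU's Swedish collation is available:

    import Foundation

    // Canonical equivalence, no locale involved:
    print("café" == "cafe\u{301}")  // true: precomposed é equals e + combining acute

    // Locale-aware ordering only on explicit request; Swedish sorts "ä" after "z":
    let result = "ä".compare("z", locale: Locale(identifier: "sv"))
    print(result == .orderedDescending)  // true under Swedish collation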

    @LaoC said in WTF Bites:

    @Benjamin-Hall said in WTF Bites:

    And then this note (under Unicode support) which stood out more because it confirms how much of a cluster Unicode is, although I'm not vouching for Swift's implementation either.

    Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.

    The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

    Edit: Not sure if :wtf:

    UTF-16 is obviously :trwtf: It's the "sort of works most of the time for the stuff we use and it's fast" solution. Most people seem to assume "character" means "Unicode code point" though, so it sounds like a bad idea to call stuff "characters" when you mean "grapheme clusters".

    UTF-16 is a legacy encoding of what was intended to end all legacy encodings. At the beginning of Unicode, the designers hoped they could encode all the world's scripts while maintaining the nice fixed-width encoding of ASCII strings. Then they realized the many ways they were wrong, because many scripts are simply Complicated™, but by then Microsoft, Apple and some other software vendors had already jumped on the bandwagon and splattered 16-bit wchar_t all over their APIs, so we got stuck with it.



  • @cvi said in WTF Bites:

    @Watson said in WTF Bites:

    @Benjamin-Hall

    Measuring string lengths is ĥ̶̨͍͉̳͉̘͈̣̈́̓͆̊͒̔̆͂͑̉̓̿̽͝a̴̞̞̙͗̒̑͗̌̇͝͝ȑ̵̢̨̛̥̲̹̠̄̆͆̒̑͂̿͐̈́̀͑̚͜ͅḑ̵̜̖̤͕̝̜͕̝͎̖̣̉̆̌̈́̄̓͋́̒͗̚͘.̸̨̡̤̦͚͈̻̮̖͇̗̫́̿̆̐̃̾̕͝ͅ

    Also hard: measuring string heights.

    But the worst part is, you can't know either until the font has been executed. It's a bit like depending on a third-party JavaScript library to do part of your rendering, except that the system can swap the underlying JS library at will.

    ...And now I have an intense desire to figure out how to force Qt to self-render fonts instead of going through the OS renderer.



  • @acrow said in WTF Bites:

    And now I have an intense desire to figure out how to force Qt to self-render fonts instead of going through the OS renderer.

    Since nobody has used the "OS" (meaning X11) font rendering for some 20 years on any Unix, I am sure it has that support. You'll probably need a custom build to make it use its own libfreetype on systems where that isn't the default, though.



  • @acrow Yep. Fonts are, and font rendering is, all sorts of fun.



  • @cvi especially when you get to things like right-to-left (and mixing them; numbers are still left-to-right in Arabic), initial, medial and final forms, combining characters, and the very special case of assembling syllables into boxes. Greek, Latin and Cyrillic scripts are really simple compared to the others.



  • @Bulb It's too early in the morning to be thinking about all that again. I'm only on my third coffee.


  • Discourse touched me in a no-no place

    @Benjamin-Hall said in WTF Bites:

    And then this note (under Unicode support) which stood out more because it confirms how much of a cluster Unicode is, although I'm not vouching for Swift's implementation either.

    Edit: Not sure if :wtf:

    Unicode is definitely :trwtf: even though it's conspicuously better than the mess that preceded it.


    Filed under: £define ASCII "don't make me laugh"



  • @dkf said in WTF Bites:

    Unicode is definitely :trwtf: even though it's conspicuously better than the mess that preceded it.

    :trwtf: is human writing systems, especially Arabic and Hangul (Korean). Besides the UTF-16 blunder most of the Unicode complexity is caused by the mess it is trying to represent.



  • @Bulb said in WTF Bites:

    Rust has them up to twelve. It is a limitation of both languages: neither has variadic generics yet.

    Are those the generics that reduce damage and protect you from extreme temperatures?


  • Trolleybus Mechanic

    New project.
    'We can parse and analyze your logs, but you have to convert them from your shitty format to JSON.'
    'OK'

    Today I got the result:

    {
       "timestamp": "2020-01-01 10:00:01",
       "data" : "$SHITTYFORMATSTRING"
    }
    

    But wait, there's more.
    $SHITTYFORMATSTRING sometimes includes, among other things, an inner JSON document.
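
    So a bad entry can look something like this (values made up, but note the inner JSON escaped inside the string):

    {
       "timestamp": "2020-01-01 10:00:02",
       "data": "HOST01|ERR|{\"inner\":{\"code\":42,\"msg\":\"boom\"}}|END"
    }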



  • @Bulb said in WTF Bites:

    is human writing systems, especially Arabic and Hangul (Korean).

    Hangul is actually pretty logical (it was designed to be simple to learn, after all). It's just different from what we're used to in the West.


  • Considered Harmful

    @TimeBandit said in WTF Bites:

    @TwelveBaud IOW, it's like when .NET came out, Microsoft claimed it was cross-platform, as long as the platform was Windows

    :pendant: It's eventually cross-platform. Cross-platform in some universes, just not any that exist yet.


  • Considered Harmful

    @cvi said in WTF Bites:

    @acrow Yep. Fonts are, and font rendering is, all sorts of fun.

    I'm amazed that the implementations of font rendering are usually pretty stable.

    Sure you can find counter-examples but I don't see significant text-rendering bugs frequently.


    Filed under: Count the weasel words.



  • @error said in WTF Bites:

    Cross-platform in some universes, just not any that exist yet.

    I'm sure it is cross-platform on the Earth where Discourse is an excellent piece of software.


  • Considered Harmful

    @Zerosquare said in WTF Bites:

    @error said in WTF Bites:

    Cross-platform in some universes, just not any that exist yet.

    I'm sure it is cross-platform on the Earth where Discourse is an excellent piece of software.

    No, if P is exactly 0 then no such universes can ever exist.


  • BINNED

    @error said in WTF Bites:

    @Zerosquare said in WTF Bites:

    @error said in WTF Bites:

    Cross-platform in some universes, just not any that exist yet.

    I'm sure it is cross-platform on the Earth where Discourse is an excellent piece of software.

    No, if P is exactly 0 then no such universes can ever exist.

    Also P = NP. 🍹



  • Cut to September and Microsoft offered a little more about the Teams Activity Report (since updated). Here's a sentence that's unsurprising but still a touch uncomfortable: "The table gives you a breakdown of usage by user."

    Everything from how many meetings that user organized to how many urgent messages they sent is recorded. Separate numbers are given for scheduled meetings and those that were ad hoc. Even individuals' screen-share time is there.

    It's remarkably detailed. But, I hear you cry, is it detailed enough?

    In October, then, Redmond offered "a new analytics and reporting experience for Microsoft Teams." (This was updated last week.)

    I confess that just staring at this made me swivel several times in wonder. Microsoft is measuring privacy settings, device types, time stamps, reasons why someone may have been blocked, and "the number of messages a user posted in a private chat."

    Because remote working doesn't mean you have to stop micromanaging your employees!



  • @Bulb said in WTF Bites:

    At the beginning of Unicode, the designers hoped they could encode all the world's scripts while maintaining the nice fixed-width encoding of ASCII strings. Then they realized the many ways they were wrong, ...

    ... and then they invented even more ways to be wrong, ridiculously jumping the barmy shark off the rails into the crazy morass of loopy combining emoji.



  • Email just arrived:
    5d5518d2-c874-4014-b0c9-6a4356abb499-image.png

    Um, wat? Refurbished food?


  • Considered Harmful

    @Bulb said in WTF Bites:

    @dkf said in WTF Bites:

    Unicode is definitely :trwtf: even though it's conspicuously better than the mess that preceded it.

    :trwtf: is human writing systems, especially Arabic and Hangul (Korean). Besides the UTF-16 blunder most of the Unicode complexity is caused by the mess it is trying to represent.

    Why Hangul? I wouldn't claim deep knowledge (still don't speak a word of Korean), but some time in the early 2000s I had to add Hangul support to a Natural Language Processing system that only did German, English and Spanish before that, and I had hugely overestimated the effort. As one of the very few writing systems designed from scratch, it's totally orthogonal and as logical as you get. IIRC the way to split a Hangul symbol into its constituent Jamos (vowel or consonant symbols) is just a couple of integer divisions, modulo operations and small table lookups (~20 elements). There was some C code to borrow from, and doing roughly the same thing in Java wasn't more than two days and like 200 lines.
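
    For reference, that arithmetic is spelled out in the Unicode standard's Hangul syllable decomposition; a rough sketch of the same idea in Swift (the language from up-thread) rather than Java:

    // Precomposed syllables fill U+AC00...U+D7A3, generated as
    // S = SBase + (L * VCount + V) * TCount + T, so splitting is pure arithmetic.
    let SBase = 0xAC00, LCount = 19, VCount = 21, TCount = 28

    func jamoIndices(_ s: Unicode.Scalar) -> (l: Int, v: Int, t: Int)? {
        let i = Int(s.value) - SBase
        guard i >= 0, i < LCount * VCount * TCount else { return nil }
        return (l: i / (VCount * TCount),            // leading consonant (choseong)
                v: (i % (VCount * TCount)) / TCount, // vowel (jungseong)
                t: i % TCount)                       // trailing consonant, 0 = none
    }

    if let jamo = jamoIndices("한") { print(jamo) }  // (l: 18, v: 0, t: 4) = ㅎ + ㅏ + ㄴ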


  • BINNED

    @HardwareGeek said in WTF Bites:

    Email just arrived:
    5d5518d2-c874-4014-b0c9-6a4356abb499-image.png

    Um, wat? Refurbished food?

    Considering how much money the industry spends on tracking users, this must be the result of personalized ads. :tro-pop:



  • @HardwareGeek said in WTF Bites:

    Email just arrived:
    5d5518d2-c874-4014-b0c9-6a4356abb499-image.png

    Um, wat? Refurbished food?

    Haven't you heard? https://www.perfectlyimperfectproduce.com/

    And have an article about it:



  • The idea may have some merit, but "refurbished" is not a word I'm looking for when it comes to food.



  • @Zerosquare said in WTF Bites:

    "refurbished" is not a word I'm looking for when it comes to food.

    "Refurbished" implies used (and fixed up to be newish, kinda, but still used). And food that has been used once becomes 💩. So, "NO" to refurbished food.


  • Discourse touched me in a no-no place

    @error said in WTF Bites:

    Count the weasel words.

    070ac98a-7fdf-4189-bd39-6d84fe3188ab-image.png


  • Discourse touched me in a no-no place

    @topspin said in WTF Bites:

    Also P = NP. 🍹

    Possible = Not Possible? TDEMS…



  • @dkf said in WTF Bites:

    Possible = Not Possible? TDEMS…

    Must be like "Impossible mission" where the mission is always a success 🤔


  • Discourse touched me in a no-no place

    @HardwareGeek said in WTF Bites:

    So, "NO" to refurbished food.

    Don't say refurbished, say pre-loved. Keep the "NO" though.

  • ♿ (Parody)

    @HardwareGeek said in WTF Bites:

    @Zerosquare said in WTF Bites:

    "refurbished" is not a word I'm looking for when it comes to food.

    "Refurbished" implies used (and fixed up to be newish, kinda, but still used). And food that has been used once becomes 💩. So, "NO" to refurbished food.

    Making soup using a ham bone?



  • @dkf

    Pre-digested.

    A few decades ago, there was a diet fad of "pre-digested liquid proteins" (basically, bottles of mixed amino acids). IIRC, it ended when people started dying of liver or kidney damage, or something like that.


  • @boomzilla said in WTF Bites:

    @HardwareGeek said in WTF Bites:

    @Zerosquare said in WTF Bites:

    "refurbished" is not a word I'm looking for when it comes to food.

    "Refurbished" implies used (and fixed up to be newish, kinda, but still used). And food that has been used once becomes 💩. So, "NO" to refurbished food.

    Making soup using a ham bone?

    I have a ham bone in the freezer to be used for exactly that purpose when I get around to it. But I have not refurbished it, and do not plan to.


  • Banned

    @Bulb said in WTF Bites:

    UTF-16 is a legacy encoding of what was intended to end all legacy encodings. At the beginning of Unicode, the designers hoped they could encode all the world's scripts while maintaining the nice fixed-width encoding of ASCII strings.

    And they would've been able to if they had just stuck to contemporary written languages. Even if you count every Asian language completely separately, there are still fewer than 60,000 characters in total. And none of the combination and composition things would be needed.

    The Unicode Consortium basically alternates between "how do we fit more characters without raising our arbitrary code point limit" and "now that we've upped our arbitrary limit how do we fill code points with characters", making an already hard task of representing text with bytes even harder, for the sole reason of not exceeding their arbitrary code point limit that they eventually exceeded anyway. If they just made full use of the 16 bits all the way back in 1991, we wouldn't be in this situation.

    And to add insult to injury, Aditya Mukerjee still cannot write his name correctly. Unicode just doesn't have that letter.



  • In addition to that, adding emojis (beyond the small set that originated from Japanese telecoms providers, and was imported for backwards compatibility) is probably THE worst decision the Unicode committee ever took.

    Unless the goal was to make sure their job was secure for many years to come, in which case it is probably THE best decision they ever took.


  • Banned

    @Zerosquare second worst. The absolute worst, and kind of the original sin, was Han unification. "These Asian writing systems are basically the same, let's treat them like exactly the same!"


  • Considered Harmful

    @topspin said in WTF Bites:

    @HardwareGeek said in WTF Bites:

    Email just arrived:
    5d5518d2-c874-4014-b0c9-6a4356abb499-image.png

    Um, wat? Refurbished food?

    Considering how much money the industry spends on tracking users, this must be the result of personalized ads. :tro-pop:

    Soylent Green is refurbished people.


  • Considered Harmful

    @Gąska said in WTF Bites:

    @Bulb said in WTF Bites:

    UTF-16 is a legacy encoding of what was intended to end all legacy encodings. At the beginning of Unicode, the designers hoped they could encode all the world's scripts while maintaining the nice fixed-width encoding of ASCII strings.

    And they would've been able to if they had just stuck to contemporary written languages. Even if you count every Asian language completely separately, there are still fewer than 60,000 characters in total. And none of the combination and composition things would be needed.

    No. About 21,000 of those are Chinese characters. They could have left it at 16-bit iff they had exclusively used the "unified Han characters" that Mukerjee complains about. Exactly that would have necessitated those variant selectors to avoid results that look σutragєѳuѕ to native readers. Adding all these forms as separate characters far exceeds the 16-bit range already, and then we haven't even added some 12k for Korean yet, or any other language for that matter.
    "Latin" has over 1,300 characters in Unicode where naïvely you'd think 26 would be fine.
    Besides ignoring cultural differences, they'd also have had to drop the concept of blocks, stuff characters into the code range in basically arbitrary order, and forgo any compatibility with legacy encodings, further slowing the already slow adoption.

    And to add insult to injury, Aditya Mukerjee still cannot write his name correctly. Unicode just doesn't have that letter.

    Doesn't seem to be true any more; at least his name doesn't use this ":" suffix.

    Edit:
    @Gąska said in WTF Bites:

    @Zerosquare second worst. The absolute worst, and kind of the original sin, was Han unification. "These Asian writing systems are basically the same, let's treat them like exactly the same!"

    First you argue that 16 bit Should Be Enough For Everybody, and then this? :wat:


  • Discourse touched me in a no-no place

    @Gąska said in WTF Bites:

    If they just made full use of the 16 bits all the way back in 1991, we wouldn't be in this situation.

    If they'd just said that they could go up to 2³²(-1) then we would be mostly OK. Also if they'd never added precomposed characters (though that horse had bolted long ago). Rendering would still be godawful, but that's because rendering all the writing systems of the world is definitely a quixotic task anyway.


  • Banned

    @dkf said in WTF Bites:

    @Gąska said in WTF Bites:

    If they just made full use of the 16 bits all the way back in 1991, we wouldn't be in this situation.

    If they'd just said that they could go up to 2³²(-1) then we would be mostly OK.

    Yeah, but in 1991 it would've been very wasteful. I mean, it's always wasteful, it's just not as important nowadays. And as I said earlier, 60,000 is more than enough if you don't do dumb shit and focus on just actual writing.

    Also if they'd never added precomposed characters

    Precomposed characters are okay. There's only so many precompositions that occur in written languages of the world. The problem is coexistence of precomposed characters with composition primitives.



  • @Gąska said in WTF Bites:

    @dkf said in WTF Bites:

    @Gąska said in WTF Bites:

    If they just made full use of the 16 bits all the way back in 1991, we wouldn't be in this situation.

    If they'd just said that they could go up to 2³²(-1) then we would be mostly OK.

    Yeah, but in 1991 it would've been very wasteful.

    The variable-width UTF-8 encoding is the solution for that. When there is composition, the codepoint is not a useful unit anyway so the system is inherently variable-length and another variable-length layer does not make much difference any more.
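
    A quick illustration of the variable widths, with byte counts as Swift reports them:

    let samples: [Character] = ["A", "é", "中", "🍹"]
    for ch in samples {
        // ASCII stays at 1 byte; other scalars take 2-4 bytes in UTF-8.
        print(ch, String(ch).utf8.count, "bytes")  // 1, 2, 3 and 4 respectively
    }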

    Also if they'd never added precomposed characters

    Precomposed characters are okay. There's only so many precompositions that occur in written languages of the world. The problem is coexistence of precomposed characters with composition primitives.

    Strangely enough they didn't use composition for the Han characters. That would be possible (it's how you enter them in some of the input methods) and would reduce the code space, at the expense of making individual characters longer.


  • Banned

    @Bulb said in WTF Bites:

    @Gąska said in WTF Bites:

    @dkf said in WTF Bites:

    @Gąska said in WTF Bites:

    If they just made full use of the 16 bits all the way back in 1991, we wouldn't be in this situation.

    If they'd just said that they could go up to 2³²(-1) then we would be mostly OK.

    Yeah, but in 1991 it would've been very wasteful.

    The variable-width UTF-8 encoding is the solution for that. When there is composition, the codepoint is not a useful unit anyway so the system is inherently variable-length and another variable-length layer does not make much difference any more.

    Get rid of composition and all those problems go away.

    Also if they'd never added precomposed characters

    Precomposed characters are okay. There's only so many precompositions that occur in written languages of the world. The problem is coexistence of precomposed characters with composition primitives.

    Strangely enough they didn't use composition for the Han characters. That would be possible (it's how you enter them in some of the input methods) and would reduce the code space, at the expense of making individual characters longer.

    Maybe I'm confusing Han with other Asian writing systems, but AFAIK Unicode made it an unholy mess with fully precomposed characters mixed with partially precomposed characters mixed with individual composable strokes, and to make it all even worse, there are variant modifiers that completely change how the given grapheme cluster looks?

    And to continue the list of Unicode's crimes against humanity, the third worst thing they've ever done is undoubtedly Turkish i. Why. Just why.


  • Discourse touched me in a no-no place

    @Gąska said in WTF Bites:

    Why. Just why.

    Because Turkish gotta Turkish.



  • @Gąska said in WTF Bites:

    Get rid of composition and all those problems go away.

    And how would I write f̌ or b̌ then?

    @Gąska said in WTF Bites:

    Maybe I'm confusing Han with other Asian writing systems, but AFAIK Unicode made it an unholy mess with fully precomposed characters mixed with partially precomposed characters mixed with individual composable strokes, and to make it all even worse, there are variant modifiers that completely change how the given grapheme cluster looks?

    Han is Chinese ideographs. I don't think there is any composition used with those.

    In other scripts, yes. Part of the reason is that the initially stated goal was to be able to convert from legacy encodings losslessly, so some of the mixture of precomposed characters and composable strokes exists simply because it was available in encodings in use before Unicode.

    @Gąska said in WTF Bites:

    And to continue the list of Unicode's crimes against humanity, the third worst thing they've ever done is undoubtedly Turkish i. Why. Just why.

    It's Latin unification.

    The Latin script works a bit differently when writing Turkish than it does when writing other languages. Turkish has I/ı and İ/i, unlike all other languages written in Latin script, which have I/i. It is a really similar problem to how the Hanzi script works a bit differently when adopted by Japanese as Kanji. You can then try to encode either the look of the letters or some kind of their identity, and either way you complicate something.
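
    The practical fallout shows up in case mapping, which only does the right thing for Turkish if you ask for it; a Foundation sketch (assuming the tr_TR locale identifier is available):

    import Foundation

    print("ISTANBUL".lowercased())                                  // "istanbul"
    print("ISTANBUL".lowercased(with: Locale(identifier: "tr_TR"))) // "ıstanbul"
    print("istanbul".uppercased(with: Locale(identifier: "tr_TR"))) // "İSTANBUL"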

    However, it also shows the total inconsistency of the answer to whether they are encoding the look, the meaning/identity or something else, because they unified some characters based on their look and not others.



  • @Bulb said in WTF Bites:

    @dkf said in WTF Bites:

    Unicode is definitely :trwtf: even though it's conspicuously better than the mess that preceded it.

    :trwtf: is human writing systems, especially Arabic and Hangul (Korean). Besides the UTF-16 blunder most of the Unicode complexity is caused by the mess it is trying to represent.

    You haven't learned any Indian alphabet yet. Many became simple during their evolution (by placing characters after each other), but some like Khmer still create terrible clusters (not only single consonant+vowel/diphthong, but multiple consonants with vowel/diphthongs).



  • @BernieTheBernie said in WTF Bites:

    @Bulb said in WTF Bites:

    @dkf said in WTF Bites:

    Unicode is definitely :trwtf: even though it's conspicuously better than the mess that preceded it.

    :trwtf: is human writing systems, especially Arabic and Hangul (Korean). Besides the UTF-16 blunder most of the Unicode complexity is caused by the mess it is trying to represent.

    You haven't learned any Indian alphabet yet. Many became simple during their evolution (by placing characters after each other), but some like Khmer still create terrible clusters (not only single consonant+vowel/diphthong, but multiple consonants with vowel/diphthongs).

    Arabic also has clusters with multiple consonants being considered single (extended) graphemes and the added complication of switching direction for numbers.



  • @Bulb said in WTF Bites:

    switching direction for numbers

    They don't say twenty-two, but two-and-twenty, like German zweiundzwanzig or Czech dva-a-dvacet (not sure about spelling). I.e. writing follows pronunciation. Only bad when numbers become greater than 100.


  • Considered Harmful

    @HardwareGeek said in WTF Bites:

    A few decades ago, there was a diet fad of "pre-digested liquid proteins" (basically, bottles of mixed amino acids). IIRC, it ended when people started dying of liver or kidney damage, or something like that.

    Status: concerned about my amino acid supplement intake :tinfoil-hat:

    Edit: that stuff appears to still be on the market. Source for the liver and kidney damage?


  • Considered Harmful

    @BernieTheBernie said in WTF Bites:

    Only bad when numbers become greater than 100.

    It's good that most of them aren't.

