WTF Bites



  • @Gąska said in WTF Bites:

    They didn't keep compatibility with any other encoding.

Yes, they did. With a whole lot of them. Not by code value, but by existence of the codes: for any two distinct codes in a legacy encoding there are always distinct code points in Unicode, so all the information in the original text can be preserved in transcoding. That is a clearly stated goal of the design.

The Han unification was permitted by the fact that all Chinese encodings only encoded Hanzi, while all Japanese encodings only encoded Kanji and all Korean encodings only encoded Hanja, so the same codes could be reused for all three. In contrast, Russian encodings encoded the Latin and Cyrillic characters that look the same separately, so they could not be unified. But the Turkish encodings only encoded the dotless lowercase ı and the dotful uppercase İ rather than two completely separate pairs of i's, so that's how it got encoded in Unicode too.

    I think many of the Indic scripts were not supported by any of the major operating systems at the time, if they even had any standard encoding at all, which is how they could so easily forget to encode a bunch of characters in them.
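For what it's worth, that "distinct codes stay distinct" property is trivial to sanity-check for a simple single-byte encoding. A quick Python sketch (ISO-8859-1 picked only because all 256 byte values are assigned):

```python
# Every byte value in ISO-8859-1 decodes to a distinct Unicode code point,
# so legacy text can round-trip through Unicode without losing information.
decoded = [bytes([b]).decode("iso-8859-1") for b in range(256)]
assert len(set(decoded)) == 256  # no two bytes collapse onto one code point

# And the mapping goes back the other way too, i.e. it's bijective.
for b, ch in enumerate(decoded):
    assert ch.encode("iso-8859-1") == bytes([b])
```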



  • @Gąska said in WTF Bites:

    @Bulb said in WTF Bites:

The Latin script works a bit differently when writing Turkish than it works when writing other languages. Turkish has I/ı and İ/i, unlike all other languages written in Latin script, which have I/i. It is a really similar problem to how the Hanzi script works a bit differently when adopted by Japanese as Kanji. You can then try to encode the look of the letters or some kind of their identity, and either way you complicate something.

    Unicode already has separate code points for a and а. Why not make Turkish I and i separate code points as well? But noooooo, that would be too easy, better change the fundamental property of uppercase-lowercase relationship and make it locale dependent! That's so much easier and fixes so many problems we didn't even know we had! God fucking dammit. Sometimes Unicode Consortium makes weird decisions. But sometimes they seem to turn off their thinking entirely. Seriously, what were they smoking when they decided on this solution? And it only saves two code points! Two code points, when there are now thousands allocated to various shades of poo!

I believe that this decision was made by someone else and the only thing the Unicode Consortium did was accept already existing standard practice. To be more specific: ISO-8859-3, ISO-8859-9 and Windows-1254. These encodings were actually in use at the same time as Unicode (and I am pretty sure they are still out there in the wild), so there has to be a way to keep a bidirectional mapping. Especially on Windows, because Windows' insistence on codepages was an absolute shitstorm :trwtf:
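The a/а point above is easy to see directly; the two letters look identical but are different code points, while the default case mapping (which Python's `str` methods implement, locale-independently) is exactly where the shared Turkish I bites:

```python
# Latin "a" and Cyrillic "а" look alike but are distinct code points,
# so Russian text survives round-tripping just fine.
assert ord("a") == 0x61    # LATIN SMALL LETTER A
assert ord("а") == 0x430   # CYRILLIC SMALL LETTER A

# Turkish i was not split the same way: the default, locale-independent
# case mapping pairs I with i, which is wrong for Turkish, where the
# uppercase of "i" should be İ (U+0130).
assert "i".upper() == "I"  # correct for English, wrong for Turkish
assert "ı".upper() == "I"  # dotless ı also uppercases to plain I
```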



  • Zoom providing them glorious frame rates:

    a5090323-0dcb-4988-8559-bfe83c024648-image.png

    There's supposed to be an option to tell zoom that you're sharing something that needs a reasonable frame rate. At least that's what the interwebs say - I can't find it for my life, though.



  • @sebastian-galczynski said in WTF Bites:

    New project.
    'We can parse and analyze your logs, but convert them from your shitty format to JSON'
    'OK'

    Today I got the result:

    {
       "timestamp": "2020-01-01 10:00:01",
       "data" : "$SHITTYFORMATSTRING"
    }
    

    But wait, there's more.
    $SHITTYFORMATSTRING sometimes includes, among other things, an inner JSON

Our architect is doing that to our logs right now. I told everyone that it's a bad idea. Looking forward to being ignored until it becomes a problem and I have to fix it.



  • @Kamil-Podlesak said in WTF Bites:

Especially on Windows, because Windows' insistence on codepages was an absolute shitstorm :trwtf:

1. Windows still supports applications that use the non-Unicode API, and of course nothing has changed in those APIs since the Unicode versions were introduced, so it is still as much of a shitstorm as ever; one just less often gets into the area where it rages.
2. The Windows codepages that are almost, but not completely, unlike the ISO-8859-* encodings are a :wtf::wtf::wtf:. And they don't match the ones used in DOS before that either.
3. And as far as I can tell, Windows still can't be configured to the codepage corresponding to UTF-8. System-wide it just depends on the localization, and in the places where one can choose UTF-8 (65001) it does not always work anyway.
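The "almost, but not completely, unlike" bit is checkable: Windows-1252 reassigns the 0x80-0x9F range that ISO-8859-1 leaves as C1 control characters. A quick Python demonstration:

```python
# The same byte means very different things in the two encodings.
assert b"\x93".decode("cp1252") == "\u201c"    # left double quotation mark
assert b"\x93".decode("iso-8859-1") == "\x93"  # just a C1 control character
assert b"\x80".decode("cp1252") == "\u20ac"    # the euro sign

# Outside 0x80-0x9F the two encodings do agree.
assert b"\xe9".decode("cp1252") == b"\xe9".decode("iso-8859-1") == "é"
```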

  • Considered Harmful

    @sebastian-galczynski said in WTF Bites:

    New project.
    'We can parse and analyze your logs, but convert them from your shitty format to JSON'
    'OK'

    Today I got the result:

    {
       "timestamp": "2020-01-01 10:00:01",
       "data" : "$SHITTYFORMATSTRING"
    }
    

    But wait, there's more.
    $SHITTYFORMATSTRING sometimes includes, among other things, an inner JSON

    Of course, representing the log as JSON means you have to parse the whole document to read any line. That could be painful when the log grows large.

    Edit: or is each line a separate JSON object? I've never seen that... maybe it could work?



  • @error said in WTF Bites:

    Edit: or is each line a separate JSON object? I've never seen that... maybe it could work?

Then have a look in /var/lib/docker/containers/*/*.log—lines, each formatted as a JSON object.

    Of course the

    @sebastian-galczynski said in WTF Bites:

    {
       "timestamp": "2020-01-01 10:00:01",
       "data" : "$SHITTYFORMATSTRING"
    }
    

    is not a single line, so unless it was formatted for our convenience, it's not the case. It can still be a stream of concatenated JSON objects rather than one huge JSON array though.
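A stream of concatenated JSON objects is actually easy to consume without any framing at all, using the stdlib decoder's `raw_decode`. A minimal sketch (the helper name `iter_concatenated_json` is made up for illustration):

```python
import json

def iter_concatenated_json(text):
    """Yield successive JSON values from a stream of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace between values, then decode one value in place.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

stream = '{"timestamp": "2020-01-01 10:00:01", "data": "x"} {"data": "y"}'
print(list(iter_concatenated_json(stream)))
```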



  • @Carnage said in WTF Bites:

Looking forward to being ignored until it becomes a problem and ~~I have~~ Carnage has to fix it.

    They won't care because it'll be your problem.


  • Discourse touched me in a no-no place

    @Bulb said in WTF Bites:

    And as far as I can tell, Windows still can't be configured to the codepage corresponding to UTF-8.

    I thought it can, but it's recommended to only do it on a new installation because otherwise who knows what will happen with old filenames? (Breaking weirdly is the most expected outcome.) Existing data like that is the biggest problem, and is not helped by case insensitivity getting a lot more complex with UTF-8; case-sensitive filesystems are much easier by comparison (they just pass the bytes about) and the horror of what to do changing by locale can be handed off to applications (which need to handle it elsewhere anyway).



  • @error said in WTF Bites:

    @cvi said in WTF Bites:

    ö

    It looks like a surprised face. :surprised-pikachu:

    Um ackshually it's a bear

    7a212777-da11-4b75-a8d3-1fa7306b10bd-image.png



  • @topspin Make that 5, unless I missed one that looks like mine:

    93219f1e-5955-42cd-9cb5-e194b89d826a-image.png


  • Java Dev

    @error said in WTF Bites:

    Edit: or is each line a separate JSON object? I've never seen that... maybe it could work?

    Only if you additionally require the JSON is serialised without any inner newlines.
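That requirement comes for free with the stdlib serializer, for what it's worth: `json.dumps` escapes control characters, so a serialized object can never contain a raw newline, which is exactly what makes one-object-per-line framing safe:

```python
import json

# A newline inside a string value gets escaped to the two characters \n,
# so the serialized line itself contains no raw newline.
record = {"msg": "line one\nline two"}
line = json.dumps(record)
assert "\n" not in line
assert json.loads(line)["msg"] == "line one\nline two"  # round-trips intact
```

(Pretty-printing with `indent=` would of course break this, so don't.)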


  • Considered Harmful

    @hungrier said in WTF Bites:

    a2131dfd-d7be-46d3-aa9d-c913c8248132-image.png

    :wipes smudge off screen:


  • Trolleybus Mechanic

    @error said in WTF Bites:

    Edit: or is each line a separate JSON object? I've never seen that... maybe it could work?

It's separate. Basically some devices are logging some events by posting some piece of data to an HTTP server, nothing fancy. The problem is that they have many types of devices, with different formats (some of them are JSON, but embedded in another format, probably with ill-defined escaping) and different data types. Of course none of this is documented. I'm trying to avoid reverse-engineering this crap and force them to use some predictable schema, so that it can be deserialized and put into a Kafka stream. We'll see how it goes.


  • Trolleybus Mechanic

By the way, nobody mentioned the Polish Facebook clone? That was quite a disaster.



  • @sebastian-galczynski said in WTF Bites:

    By the way, nobody mentioned the polish Facebook clone? That was quite a disaster.

    Let me guess, it wasn't polished enough :rimshot:


  • Trolleybus Mechanic

    @TimeBandit said in WTF Bites:

    Let me guess, it wasn't polished enough

Worse than that. A bit of introduction: a bunch of right-wing (that is, Law-and-Justice-adjacent) journalists/politicians got alarmed by the recent American banwave and decided to start a new platform for their folks, much like Parler. They even managed to get some government grant money. What developers did they hire? What technology did they use?

    Well, apparently it was just someone's nephew who customized a $125 PHP script.

    The result was not exactly safe, exhibiting at least the following vulnerabilities:

    • a path traversal bug exposing database credentials (the database was open to the internet)
    • user registration without email confirmation (one troll claims to have registered half a million users)
    • no limit on password length (someone put an entire epic poem there)
    • GET /delete_account doing what you suspect
• A troll set his username to 'delete_account'; anyone who clicked his profile got their account deleted

The last screens before it died showed half of the users being Pope John Paul II. Now it's dead with a 504, probably something overheated from all the hashing of Pan Tadeusz.



  • @remi said in WTF Bites:

    @topspin @dkf Since it's just one code-point and that different fonts (rendering systems?) can draw whatever they want as long as it matches the description, this is not really surprising.

Note that the codepoint U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM is an absolute oddity. It is called a ligature, and it stands for words that can be written using separate letters, but nothing says which ones those would be.

There is a whole bunch of other ligatures in the “Arabic Presentation Forms-A” block, and all of those define which characters they are composed from. They exist for the benefit of font makers, so they have space reserved to put the ligature, but they shouldn't be used in actual text. It's similar to e.g. ffi, which has U+FB03 LATIN SMALL LIGATURE FFI, but should always be written as separate f, f and i.
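That difference is visible in the Unicode data itself: the presentation-form ligatures carry a compatibility decomposition back to their constituent letters, which NFKC normalization folds away, while U+FDFD has no decomposition at all. Checkable with the stdlib `unicodedata` module:

```python
import unicodedata

# U+FB03 decomposes to f + f + i, so NFKC folds the ligature away.
assert unicodedata.normalize("NFKC", "\ufb03") == "ffi"

# U+FDFD is the oddity: no decomposition, so normalization leaves it alone.
assert unicodedata.decomposition("\ufdfd") == ""
assert unicodedata.normalize("NFKC", "\ufdfd") == "\ufdfd"
```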


  • Banned

    @LaoC said in WTF Bites:

    @Gąska said in WTF Bites:

    @LaoC said in WTF Bites:

    But anyway, let me rephrase the question. How many glyphs would be needed if the only form of composition was horizontal concatenation?

Getting difficult, I guess I'm bound to forget something important … the absolute minimum would probably be all the consonants * tone marks * diacritical vowels: 36*4*6=864, plus maybe 50 individual glyphs. We're back at my gut feeling of "lower four digits" :)

    That's good. Lower four digits is much more manageable. So let's say there are like 10 more alphabets in SEA with similar properties, that's like 20k-30k characters tops. Add another 20k from CJK, and we still have about 15000 code points of head room. So we can probably fit all written languages of the world, and all the emoji (of which there are 1300 currently) within 16 bits without too much trouble. This concludes my proof that almost every difficulty in handling Unicode text is self-inflicted by the Unicode Consortium, and not inherent to text encoding in general.

    Assuming they stuck to what you said was the absolute worst thing they ever did instead of addressing the complaints and adding a total of >92k CJK characters.

IIRC the official list of all Han characters prepared by the government of China has "only" 9000 characters, with 2000 in active use. AFAIK Japan and Korea have even fewer than that. The other 50k-80k characters are only of interest to archeologists.

    But I digress. My point is that Unicode made many things needlessly complicated and near-arbitrary character/accent composition is one of them.

    I'm just arguing against the "needlessly". If they could have started with a blank slate and brought in some experts with a reeeeally broad overview over all the world's writing systems who could have considered all the pros and cons at once, then maybe.

You're forgetting one thing - THEY ALREADY DID start with a blank slate and brought in experts on all the world's writing systems. It's actually been done. It's been the entire point of the Unicode Consortium to bring in the experts, start anew and come up with one encoding that fits all. It's just, they fucked up.

    Not "drop-in and fahgeddaboutit" compatibility but at least "we have exactly that character for you so you can have a bijective mapping to your 8-bit charset and be done with providing a new load/save function instead of writing a whole new renderer and checking every single one of your string comparison and search algorithms and waiting for all your dependencies to update, too" compatibility.

Except that the bijective mapping was already dependent on the legacy encoding anyway. There is no reason at all why 0x49 in ISO-8859-9 has to map to the same Unicode character as 0x49 in ISO-8859-1. Classic case of someone not understanding the requirements and creating a complex, overengineered solution to a problem that doesn't even exist.
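For reference, what the actual mappings look like (checkable with Python's stdlib codecs): ISO-8859-9 is ISO-8859-1 with the Icelandic letters swapped out for the Turkish ones, and 0x49 is plain I in both:

```python
# 0x49 maps to the same character in both encodings...
assert b"\x49".decode("iso8859-9") == b"\x49".decode("iso8859-1") == "I"

# ...while the Turkish letters got their own slots, reusing Icelandic bytes:
assert b"\xfd".decode("iso8859-9") == "\u0131"  # ı, where ISO-8859-1 has ý
assert b"\xdd".decode("iso8859-9") == "\u0130"  # İ, where ISO-8859-1 has Ý
```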

    @Bulb said in WTF Bites:

    I think many of the Indic scripts were not supported by any of the major operating systems at the time, if they even had any standard encoding at all, which is how they could so easily forget to encode a bunch of characters in them.

    Between 2000 and 2020, there have been NINETEEN revisions of Unicode standard. And these characters are STILL missing.



  • @Bulb said in WTF Bites:

Note that the codepoint U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM is an absolute oddity. It is called a ligature, and it stands for ~~a~~ several words that can be written by using the separate letters, but nothing says which ones those would be.

That's true, but I guess this reflects the special status of that specific sequence of words -- not the religious, or political, or cultural, or historical status, although it of course has all of these (but so do several other sequences, though arguably maybe not to the same degree), but the calligraphic or graphical status. As the image search link that I posted shows, there are tons of variations on the graphical representation of that sequence of words, and these variations are not just the "font" used by the calligrapher, but really something more. I can't really think of any other sequence of words where so much attention has been put by so many people into getting significantly different visual results from the same starting material.

    So I guess it's not so surprising that this sequence ends up being treated differently, I'd say. It's weird, but fairly understandable.


  • Banned

    @Bulb said in WTF Bites:

    There is a whole bunch of other ligatures in the “Arabic Presentation Forms-A” block and all of those define which characters they are composed from. They exist for the benefit of font makes so they have the space reserved to put the ligature, but it shouldn't be used in actual text.

Also, there are countries that don't use Arabic script that still need to write these particular words for religious purposes. For example, Pakistan, which uses the Urdu alphabet, has a law that every official government document has to start with Bismillah, written in Arabic letters. And it would be quite difficult to do with an Urdu keyboard, and copy-pasting is quite fragile too, so instead they somehow convinced the Unicode Consortium to include it as a standalone code point to make life easier.





  • Hey YouTube, stop switching my language to whatever that is (probably Catalan) :wtf_owl:

    1dcadf8a-4618-4414-9331-28f682e8424e-image.png

    91afb9bb-def9-4281-8b34-f8f6a5fe0b44-image.png

    Edit: that's not the first time it does that. And before you ask, it didn't confuse French and Catalan. It's always set to English.



  • I've just spent totally inappropriate time discussing a stupid network/configuration error. And other colleagues spent even more.

    • User at company C can't, probably since some time last month, start product X, because it tries to check a license and the network connection fails.
    • The network connection is plain old HTTPS connection to some server of company V that makes X.
    • The guy at V initially tasked with finding out what's going on is so perplexed that he writes a small test Java app T (X is in Java). (:wtf:1—why doesn't X itself have the diagnostic for this?)
    • He's still perplexed even though T logs all the necessary information, so the log gets to me (I am subcontracting for V).
• The logs make it blatantly obvious that the corporate proxy at C MITMs the connection and re-encrypts it with its own key. (:wtf:2—but it's quite common).
    • App X bundles its own JRE (:wtf:3—why?) and Java by default trusts certificates in lib/security/cacerts in its installation (:wtf:4—why? the system store exists for a reason).
• So I suggested setting the environment variable JAVA_TOOL_OPTIONS=-Djavax.net.ssl.trustStoreType=WINDOWS-ROOT and calling it a fortnight; we'll see whether it works with X, but it does with the test app.
• The funny thing? V has an exactly-as-snoopy proxy that MITMs connections just the same, so its people should know how to deal with one. Except it doesn't do this to V's own servers, so nobody noticed.
    • Of course this whole thing, costing at least a man-week by now with all the people discussing it, wouldn't happen if German companies weren't so distrustful of their employees.

  • Trolleybus Mechanic

    @Bulb said in WTF Bites:

    App X bundles its own JRE (3—why?)

    This is a recent trend. JetBrains does it too. Fits the general zeitgeist of containerization, node_modules, snap and other bloat in the name of 'avoiding dependency hell'.

    @Bulb said in WTF Bites:

    4—why? the system store exists for a reason

    Same thing, although previous iteration. Java is supposed to be portable, so it tries to assume as little as possible about the host system. That's also why it has all these weird ugly cross-platform UI toolkits, separate timezone data etc.



  • @sebastian-galczynski said in WTF Bites:

    Kafka stream

    :trwtf: is why Apache would name software after a man whose works are associated with "isolated protagonists facing bizarre or surrealistic predicaments and incomprehensible socio-bureaucratic powers.... exploring themes of alienation, existential anxiety, guilt, and absurdity"? Sure, it's truth in advertising, but really bad marketing; I certainly wouldn't choose to use anything named Kafka, no matter how good it might be.


  • Trolleybus Mechanic

    @HardwareGeek

    Our folk is accustomed to weird names. Imagine growing up with a lightbulb threatening to shit on you.



@TimeBandit It's even worse than I thought. Some parts are in English, some in Catalan :rolleyes:

    0e37d0e8-b308-4b57-bc8c-0326a33c1696-image.png



  • @Bulb said in WTF Bites:

    App X bundles its own JRE (:wtf: 3—why?)

    Recommended since Java 7, mandatory since Java 9. Oracle doesn't want to care about backwards/forwards compatibility anymore, nor deal with non-paying "customers" (the general public); theoretically, companies ship a version of the JRE that they tested with and is known good and working, and companies are responsible for shipping updates to it that they have also tested with.



  • @HardwareGeek Gregor Samsa woke up one morning to find himself transformed into a ginormous ... log entry.



  • @sebastian-galczynski said in WTF Bites:

    @Bulb said in WTF Bites:

    App X bundles its own JRE (3—why?)

    This is a recent trend. JetBrains does it too. Fits the general zeitgeist of containerization, node_modules, snap and other bloat in the name of 'avoiding dependency hell'.

    I think X always did.

    It might also be in part because Java is no longer fully backward compatible after version 8.

    @Bulb said in WTF Bites:

    4—why? the system store exists for a reason

    Same thing, although previous iteration. Java is supposed to be portable, so it tries to assume as little as possible about the host system. That's also why it has all these weird ugly cross-platform UI toolkits, separate timezone data etc.

Ugly cross-platform UI toolkits make sense if it allows keeping the API completely the same. Separate timezone data also, because the timezone data offered by different systems looks very different, and may or may not include historical data.

    But the certificate store is just a collection of certificates in always the same X509 format, and their presence or absence in the system trust store has an important meaning.



  • @Bulb said in WTF Bites:

    Note that the codepoint is U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM is an absolute oddity.

No, they will not let it go

    e: that would've been a much better response to

    @Gąska said in WTF Bites:

    For example, Pakistan, which uses Urdu alphabet, has a law that every official government document has to start with Bismillah


  • Trolleybus Mechanic

    @Bulb said in WTF Bites:

    But the certificate store is just a collection of certificates in always the same X509 format, and their presence or absence in the system trust store has an important meaning.

Nowadays - yes. But when these APIs were designed, a typical Windows box went for years without any updates and had a bunch of trojans meddling with the system certs. It was safer to just use a separate store, along with a separate update mechanism (remember those pestering Java Update pop-ups? Or are they still there?)


  • Discourse touched me in a no-no place

    @sebastian-galczynski said in WTF Bites:

    A troll set his username to 'delete_account', now anyone who clicked his profile is deleted

    That's… beautiful!


  • Discourse touched me in a no-no place

    @TwelveBaud said in WTF Bites:

    Recommended since Java 7, mandatory since Java 9.

    Actually, people use the openjdk for the most part because that's not got quite so much of Oracle's BS in it. (I've worked with people in the past who really liked that Java didn't trust the system store by default precisely because they were seriously paranoid. Why weather data needed that level of protection is a whole 'nother matter…)



  • @dkf said in WTF Bites:

    Actually, people use the openjdk for the most part because that's not got quite so much of Oracle's BS in it.

    I was under the impression that it was because you don't have to sign your soul to Oracle to get the installer 🤔


  • Discourse touched me in a no-no place

    @TimeBandit That too.

    We have a ULA with Oracle that covers, among other things, Java and it's still sometimes easier to use OpenJDK although some of that is unnecessary internal politics and not Oracle themselves.


  • Notification Spam Recipient

    @cvi said in WTF Bites:

    Zoom providing them glorious frame rates:

    a5090323-0dcb-4988-8559-bfe83c024648-image.png

    There's supposed to be an option to tell zoom that you're sharing something that needs a reasonable frame rate. At least that's what the interwebs say - I can't find it for my life, though.

    How the fuck did you get it to have that resolution? Mine is stuck at like 144p (but, admittedly, at 20 FPS)!



  • @Tsaukpaetra Screen sharing with a 1920x1080 source.

If I just use the camera, I can actually get about ~25Hz, except that the resolution maxes out at 640x380 (the 380 is not a typo). Plus there are obscene amounts of video compression that make everything washed out and produce some pretty visible compression artifacts.

    Essentially, the choice is either usable resolution but shit frame rate, or usable-ish frame rate with crap resolution.


  • Notification Spam Recipient

    @cvi said in WTF Bites:

    @Tsaukpaetra Screen sharing with a 1920x1080 source.

    Ah, yeah, even with screen sharing I can't get it to actually do the actual resolution. I guess we're just not paying enough...



  • @Tsaukpaetra said in WTF Bites:

    @cvi said in WTF Bites:

    @Tsaukpaetra Screen sharing with a 1920x1080 source.

    Ah, yeah, even with screen sharing I can't get it to actually do the actual resolution. I guess we're just not paying enough...

    To be fair, lowering the resolution to 1280x720 raises the frame rate to like 7 to 10 Hz...


  • Notification Spam Recipient

    @cvi said in WTF Bites:

    @Tsaukpaetra said in WTF Bites:

    @cvi said in WTF Bites:

    @Tsaukpaetra Screen sharing with a 1920x1080 source.

    Ah, yeah, even with screen sharing I can't get it to actually do the actual resolution. I guess we're just not paying enough...

    To be fair, lowering the resolution to 1280x720 raises the frame rate to like 7 to 10 Hz...

    I blame Javascript...



  • @Tsaukpaetra I would like to agree, but this is with their native client, and IIRC Discord can do 1280x720 at 30 Hz from the browser (and you're not paying them anything either at that point).

    Also, I sincerely hope nobody is actually insane enough to do the actual video encoding in JavaScript.


  • Notification Spam Recipient

    @cvi said in WTF Bites:

    their native client

    Isn't it just a CEF application? Colour me surprised if so...



  • @Tsaukpaetra It depends on Qt on Linux. So, there is some hope at least.

    Edit: s/depends on/ships with/ but same difference really.


  • BINNED

    @cvi said in WTF Bites:

    @Tsaukpaetra Screen sharing with a 1920x1080 source.

    If I just use the camera, I can actually get about ~25Hz, except that the resolution maxes out at 640x380 (the 380 is not a typo). Plus there are obscene amounts of video compression that make everything washed out and produces some pretty visible compression artifacts.

    Essentially, the choice is either usable resolution but shit frame rate, or usable-ish frame rate with crap resolution.

    I’ve not checked the resolution or frame rate before, but I’ve never had any issues with video quality. In fact, it’s the only (or one of few) video conferencing apps that seems to sanely do hardware acceleration so that my fans don’t blow at 100% after 5 minutes.


  • Notification Spam Recipient

    @topspin said in WTF Bites:

    @cvi said in WTF Bites:

    @Tsaukpaetra Screen sharing with a 1920x1080 source.

    If I just use the camera, I can actually get about ~25Hz, except that the resolution maxes out at 640x380 (the 380 is not a typo). Plus there are obscene amounts of video compression that make everything washed out and produces some pretty visible compression artifacts.

    Essentially, the choice is either usable resolution but shit frame rate, or usable-ish frame rate with crap resolution.

    I’ve not checked the resolution or frame rate before, but I’ve never had any issues with video quality. In fact, it’s the only (or one of few) video conferencing apps that seems to sanely do hardware acceleration so that my fans don’t blow at 100% after 5 minutes.

    Hold on, please wait, verifying Earth number....

    My fans and CPU instantly go to 100% and soon thermal throttle. Granted this is a laptop device, but still....


  • BINNED

    @Tsaukpaetra are you using the MS Teams version of Zoom? :half-trolleybus-l:
    Because when I start that my fans are ready for lift-off.


  • Notification Spam Recipient

    @topspin said in WTF Bites:

    @Tsaukpaetra are you using the MS Teams version of Zoom? :half-trolleybus-l:
    Because when I start that my fans are ready for lift-off.

    IDK, but it's really annoying how it can barely keep a 400kbps stream steady at 144p.



  • @topspin said in WTF Bites:

    I’ve not checked the resolution or frame rate before, but I’ve never had any issues with video quality.

It looks OK if you use the camera feed for ... well ... a camera. I tried using it to stream some video content and text (slides, essentially). The crappy resolution plus very hard compression was pretty noticeable at that point. (Essentially after seeing that the screen sharing feature won't work for anything that moves.)

    Although, I'm also not quite sure what's going on. There's theoretically a few options to increase resolution(?), one with something about group video, and one that just declares your camera to be HD. Neither seems to really do anything.

    Rumors say that there's a higher frame rate option for screen sharing in the Windows client. Haven't tried it. Same rumors say that said option caps resolution at some value (but there are indications that the value might actually be usable, e.g. 1280x720).

    Edit: I don't really care about HW acceleration. Just work better, dammit.

