I think I know what makes a truly competent programmer

anotherusername

@flabdablet said in I think I know what makes a truly competent programmer:

those are code units, not code points.

Irrelevant. Javascript treats them exactly like it treats all of the other 16-bit code points.

@flabdablet said in I think I know what makes a truly competent programmer:

When the code units 0xD83D and 0xDCA9 are adjacent in a Javascript string, and that string is interpreted as UTF-16-encoded text in the way recommended by the spec, they collectively represent the single Unicode character 💩 whose code point is 0x1F4A9.

Not to Javascript, they don't.

When it's displayed, then yes: it's interpreted as UTF-16. And if your program in Javascript needs to handle UTF-16 then yes, your code needs to compensate for this. But if Javascript supported UTF-16 natively, you wouldn't need to compensate for anything.

It's doing exactly what you'd expect for a UCS-2 system when you give it a string of UTF-16 text: Ignore the illegal characters and treat everything like it's 16-bit code points.

flabdablet

@anotherusername said in I think I know what makes a truly competent programmer:

if Javascript supported UTF-16 natively, you wouldn't need to compensate for anything.

I will agree that if the ECMAScript spec defined strings primarily as (possibly encoded) sequences of Unicode characters, and if its string indexing and substring operations referred to the encoded characters themselves rather than the implementation details of their underlying encoding (so that '💩'.length was 1, and 'ham 💩 and eggs'.charCodeAt(4) was 0x1F4A9), then you wouldn't need to compensate for anything.

But it doesn't do that, so you do. And in order to do that compensation correctly, you need to know that the natively supported underlying encoding is UTF-16. If you assume it's UCS-2, your code runs a strong risk of being incorrect when it has to deal with text containing code points outside the BMP, because such text is simply not encodable in UCS-2.
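As a rough illustration of that compensation, here is a minimal sketch assuming an ES2015 or later engine (for `codePointAt` and the string iterator). The variable names are made up for the example; it only shows that indexing and length work on UTF-16 code units, while `codePointAt` and iteration decode surrogate pairs.

```javascript
// U+1F4A9 lies outside the BMP, so it is stored as two UTF-16 code units.
const poo = '\u{1F4A9}';            // 💩
const sentence = 'ham \u{1F4A9} and eggs';

// length and charCodeAt count and return code units, not characters.
const unitCount = poo.length;
const firstUnit = sentence.charCodeAt(4);   // high surrogate
const secondUnit = sentence.charCodeAt(5);  // low surrogate

// codePointAt decodes a surrogate pair that starts at the given index.
const codePoint = sentence.codePointAt(4);

// The string iterator walks code points, so Array.from splits on them.
const codePoints = Array.from(sentence);

console.log(unitCount);                                          // 2
console.log(firstUnit.toString(16), secondUnit.toString(16));    // d83d dca9
console.log(codePoint.toString(16));                             // 1f4a9
console.log(sentence.length, codePoints.length);                 // 15 14
```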

If you're not fully aware that an ECMAScript character is not guaranteed to represent a single Unicode character - as it would be if the coding really were UCS-2 - you will be at a loss to explain how /[💩]/.test('💪') could be true when /💩/.test('💪') is so obviously false.
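A small sketch of why those two tests disagree, again assuming an ES2015 or later engine; the `u` flag is shown only for comparison and is not something the post above relies on. 💩 is the code units 0xD83D 0xDCA9 and 💪 is 0xD83D 0xDCAA, so they share a high surrogate.

```javascript
const flex = '💪';

// Without the u flag, the class [💩] holds two separate code units, and the
// high surrogate 0xD83D matches the first code unit of 💪.
const classWithoutU = /[💩]/.test(flex);

// As a plain sequence, both code units must match in order, so this fails
// on the differing low surrogate.
const sequenceWithoutU = /💩/.test(flex);

// With the u flag, the pattern is read as code points, so the class holds
// exactly one code point, U+1F4A9, which is not U+1F4AA.
const classWithU = /[💩]/u.test(flex);

console.log(classWithoutU, sequenceWithoutU, classWithU); // true false false
```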

Incidentally, there's nothing stopping you from writing perfectly correct Javascript code that handles all your Unicode text as strings encoded in UTF-8. If you wanted even less help from the standard library, you could even pack that UTF-8 into Javascript strings with two code units per Javascript character.
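A minimal sketch of that idea, assuming an environment where `TextEncoder` and `TextDecoder` are available as globals (browsers and recent Node.js). `packUtf8` and `unpackUtf8` are hypothetical helpers invented for this example, and the byte length is carried separately because an odd byte count gets padded.

```javascript
// TextEncoder always emits UTF-8; TextDecoder defaults to UTF-8.
const original = 'ham 💩';
const utf8Bytes = new TextEncoder().encode(original);

// Hypothetical helper: pack two UTF-8 bytes into each 16-bit string element.
function packUtf8(bytes) {
  let packed = '';
  for (let i = 0; i < bytes.length; i += 2) {
    const high = bytes[i];
    const low = i + 1 < bytes.length ? bytes[i + 1] : 0; // pad an odd tail with 0
    packed += String.fromCharCode((high << 8) | low);
  }
  return packed;
}

// Hypothetical helper: recover the bytes, given the original byte length.
function unpackUtf8(packed, byteLength) {
  const bytes = new Uint8Array(byteLength);
  for (let i = 0; i < byteLength; i++) {
    const unit = packed.charCodeAt(i >> 1);
    bytes[i] = i % 2 === 0 ? unit >> 8 : unit & 0xff;
  }
  return bytes;
}

const packed = packUtf8(utf8Bytes);
const roundTripped = new TextDecoder().decode(unpackUtf8(packed, utf8Bytes.length));

console.log(utf8Bytes.length, packed.length);  // 8 4
console.log(roundTripped === original);        // true
```

Nothing in the standard library understands the packed form, so `length`, indexing and regexes on it are meaningless as text operations.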

dkf

@flabdablet I think the key take-home from this is that a programmer should only tackle Unicode strings after they have mastered more trivial things like rounding and dates…

RaceProUK

@dkf There are two hard things about programming:

Text
Dates and times
Rounding

flabdablet

@dkf Maybe. I think the take-home is that the names charAt and charCodeAt in Javascript are every bit as misleading as char in C when it comes to dealing with strings containing encoded Unicode text.

cheong

@wft said in I think I know what makes a truly competent programmer:

If you call yourself a senior programmer but your software has problems working in different timezones, across timezones, and on leap years, or if your software has problems working with charsets other than ASCII or Latin-1, or you don't quite understand how rounding errors accumulate, you should hand your "senior developer" badge over immediately.

For the record, in the .NET Framework v4.5 runtime, if you try to get DateTime.Now at certain times when the machine is set to certain time zones, it'll throw an ArgumentOutOfRange exception because of a bug in the DST adjustment.

tufty

@dkf Unicode is not particularly hard to deal with, but you need to throw away the old assumptions that a string consists of a sequence of characters that can be encoded as

a fixed number of bytes
a single codepoint

Unicode strings are linearly encoded tree structures.

flabdablet

@tufty Even setting aside the issues involved in dealing with the mapping of encoding units to code points, the fact that some Unicode code points render as glyphs that ȏ̭̣̜̝͎ͣ̇̍̎̈̚v̷͉̩̲͙͌ė̝͎͇͇̎͌͋ͣ͐r̘̖͈̙ͬ̌͛l̺̬͔̖̦̠̭ͤ̓͟a̓̊̆̊͑y̲͎̗ͨ̾̋ others breaks a whole bunch of usually-implicit assumptions about what a "character" actually is.
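A short sketch of how combining marks break the "one code point is one character" assumption. It assumes an engine with `String.prototype.normalize` and Unicode property escapes in regexes (ES2018 or later); the sample strings are made up for the example.

```javascript
const composed = '\u00E9';     // é as a single code point
const decomposed = 'e\u0301';  // e followed by COMBINING ACUTE ACCENT

// Both render as the same glyph, but they are different strings.
const sameString = composed === decomposed;
const sameAfterNfc = decomposed.normalize('NFC') === composed;

// One base letter with three combining marks stacked on it:
// dot below, dot above, vertical line above.
const stacked = 'o\u0323\u0307\u030D';
const stackedCodePoints = Array.from(stacked).length;

// \p{M} matches combining marks, so stripping them leaves the base letter.
const base = stacked.normalize('NFD').replace(/\p{M}/gu, '');

console.log(sameString, sameAfterNfc);                    // false true
console.log(composed.length, decomposed.length);          // 1 2
console.log(stackedCodePoints, base);                     // 4 o
```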

anonymous234

@flabdablet Unicode makes programming more interesting!

(you forgot bidirectional text too)

flabdablet

@anonymous234 Bidi text isn't really new with Unicode, though. I'm pretty sure there are Windows codepages for RTL languages.

cheong

@flabdablet said in I think I know what makes a truly competent programmer:

@anonymous234 Bidi text isn't really new with Unicode, though. I'm pretty sure there are Windows codepages for RTL languages.

For real fun, try mixing text with inline RTL text (separated by the RTL-mark control character) in a web textbox that has dir="rtl" and CSS direction: rtl specified, and figure out what the final rendered output is.

dkf

@tufty said in I think I know what makes a truly competent programmer:

Unicode is not particularly hard to deal with

That really depends on what you are doing; if you're just passing the bytes around and actually only looking for ASCII when parsing things, it is indeed easy. Beyond that… well, you'd think it would be easy, except almost nobody at all gets it right.

Rendering Unicode correctly (especially so that it can be edited) is another level of nasty entirely.

flabdablet

@dkf I had a little parsing problem so I decided to use regexes in Javascript. Now I have two surrogate pairs of problems.

wft

@blakeyrat said in I think I know what makes a truly competent programmer:

... seriously? That function's in literally every software product ever made that deals with moneys.
(Except usually in a better language.)

I must have had shit bosses. They were pissed that our software did not calculate shit exactly like those reports the chief contractor's beancounters produced. And they produced a simple Excel sheet which didn't do this dance. Go figure.

Also I have learned there that it's probably better to spend a week cleaning public toilets than ever ask an accountant how to do figures in a way that doesn't fuck you up in the end; I think you feel less like a shit after that first activity. There was once an issue with some item progress being reported as negative (again, for teh reasons) which totally screwed our software, and it took me bloody two weeks to squeeze from them an answer to a simple question why they even need to do that, so I can deliver a proper fix.

Lurking in local accounting-related forums and Q&A sites gave me nothing. Also, I have learned that most accountants have more ignorance about their own stuff and unwarranted self-importance than even programmers themselves.

That's why having this explained by a fellow programmer in a simple paragraph of prose is what I value most.

But maybe you won't get it anyway, being grateful is not what happens to you often, if at all.

darkmatter

Here's an interesting thing that happened to me a few years ago...

A user created an account on a website that only allows numeric account numbers that have to match to reference code #s. Their account number as created was 1⁴3498. Surprisingly enough, the SQL server lookup not only allowed it but validated that it matched to an existing reference code that it had to match for the account to actually get created. Using isnumeric() on it in mssql returns true (well, 1) but attempting to convert it to an integer fails, so I can't figure out how the hell that worked. My only guess is that because it is parameterized as an int from C#, the ⁴ must have gotten stripped from the value when it did the lookup.

Unsurprisingly, they never logged in to that account again later....

(they used the Unicode superscript ⁴ character, not "<sup>4</sup>")

tufty

@darkmatter So, a bug converting from string to int, then?

Digit checking for Unicode should return true for any characters which have the Unicode properties Numeric_Type = Digit or Numeric_Type = Decimal. That's the "normal" numbers in the ASCII set, plus stuff like the superscripts. So anything properly handling Unicode attributes could check if the string 12³₄⑤𝟞 (digit 1, digit 2, superscript digit 3, subscript digit 4, circled digit 5, mathematical double-struck digit 6) consists entirely of digits and quite happily return true.

You might even rightfully expect that string to be considered to have an integer value of 123456, but in most cases that's rather too much to expect. You could also (and far more realistically) expect an error. What you shouldn't expect, though, is 12.

For additional fun, there's also Numeric_Type = Numeric, which indicates characters which have a value but aren't digits or decimals. Fractions and Roman numbers, for example.
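For comparison, a sketch of how Javascript treats those strings, assuming an engine with Unicode property escapes (ES2018 or later). Javascript regexes do not expose Numeric_Type directly, so the General_Category classes `\p{N}` (any number) and `\p{Nd}` (decimal digit) stand in here as an approximation: `\p{N}` is broader than Digit plus Decimal, since it also covers fractions and Roman numerals.

```javascript
// 12³₄⑤𝟞 written with escapes: superscript 3, subscript 4, circled 5,
// mathematical double-struck 6.
const mixed = '12\u00B3\u2084\u2464\u{1D7DE}';
const accountNumber = '1\u20743498';   // 1⁴3498

// \d only ever matches the ASCII digits 0-9.
const asciiDigitsOnly = /^\d+$/.test(mixed);

// \p{N} matches any character in the Number general category.
const anyNumberCategory = /^\p{N}+$/u.test(mixed);

// \p{Nd} is decimal digits only; ³, ₄ and ⑤ are category No, so this fails.
const decimalDigitsOnly = /^\p{Nd}+$/u.test(mixed);

// parseInt stops at the first character it cannot use and keeps the prefix.
const parsedMixed = parseInt(mixed, 10);
const parsedAccount = parseInt(accountNumber, 10);

// Number() rejects the whole string instead.
const strictMixed = Number(mixed);

console.log(asciiDigitsOnly, anyNumberCategory, decimalDigitsOnly); // false true false
console.log(parsedMixed, parsedAccount, strictMixed);               // 12 1 NaN
```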

Tsaukpaetra

@tufty said in I think I know what makes a truly competent programmer:

what you shouldn't expect, though, is 12.

I must be MS Jaded, that's exactly what I expected... :/

ben_lubar

@blakeyrat have you never seen a picture of Blergo? He has Synergetic Tempered Spinal Blades (Infused).

Also, today I randomly found a box with some very powerful socks in it, so I guess I don't have to craft that.

M_Adams

@wft said in I think I know what makes a truly competent programmer:

Also, you know this entire class of fuckups when a new accounting department adopts a new rounding approach, and the historical data is recalculated for any stupid reason (or it is always recalculated on the fly), and there is a substantial mismatch between actual numbers invoiced and the recalculated ones?

This is hardly an unusual situation, in re rounding and related issues: foreign exchange rates, coupons (bearer certificates), stocks, bonds, fractional investments and indexed annuities, etc.
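A toy sketch of how a change of rounding approach alone produces such a mismatch. The amounts and both helper functions are invented for the example; amounts are kept as integer thousandths of a currency unit so that floating-point error does not muddy the comparison.

```javascript
// Amounts in integer thousandths (mils): 1.005, 2.015, 3.025.
const amountsInMils = [1005, 2015, 3025];

// Round half up to whole cents (non-negative inputs only).
function roundHalfUp(mils) {
  return Math.floor((mils + 5) / 10);
}

// Round half to even to whole cents (non-negative inputs only).
function roundHalfEven(mils) {
  const cents = Math.floor(mils / 10);
  const remainder = mils % 10;
  if (remainder !== 5) return remainder > 5 ? cents + 1 : cents;
  return cents % 2 === 0 ? cents : cents + 1;
}

const sum = (values) => values.reduce((total, value) => total + value, 0);

// Originally invoiced with half-up, later recalculated with half-even.
const invoicedCents = sum(amountsInMils.map(roundHalfUp));        // 101 + 202 + 303
const recalculatedCents = sum(amountsInMils.map(roundHalfEven));  // 100 + 202 + 302

console.log(invoicedCents, recalculatedCents, invoicedCents - recalculatedCents); // 606 604 2
```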

GAAP in the US and IFRS in Europe explicitly state how such a "Change in Accounting Methods" is to be handled, and how it is to be reported and explained in the Financials.

Well… GAAP/IFRS and your local IRS equivalent.

The true-ups are obviously people who don't understand the rules under which they should be operating.

PleegWat

@M_Adams said in I think I know what makes a truly competent programmer:

GAAP

Dutch for YAWN. Makes me chuckle.