Am I the only one who knows what "strongly-typed" even means anymore???



  • They use int() to convert the result of the subtraction to an integer. Which already makes the whole thing silly, because the result of 'c' - 'a' isn't 2, it's \x02.

    My gripe with the concept is that calculating distance isn't strictly equivalent to subtraction, and I think in case of characters it shouldn't be expressed with a minus operator - we generally want to keep the types closed under most arithmetic operations, but then we end up with results that are characters in type, but make no sense as characters.

    Also because subtraction should be a binary operator with no dependence on a global state like current locale. And what if you want to calculate the distance under different locale? You need to pass it somehow, and you can't do that with a minus operator.

    It's a rather purist argument, but hey, if @Gaska can argue that enums and flags shouldn't have an implicit numeric representation, then it holds for characters for pretty much the same reasons.



  • According to this, then, @Gaska's way is more a misuse of the Eliza effect than a legitimate use. Calling something addition when it isn't, or even defining such an operation on characters, hell, even allowing character variables to be instantiated without specifying what alphabet you're talking about, is misleading specifically because of the Eliza effect. Because people are gonna think they're looking at one thing when actually it's something completely different.

    And if we take that philosophy one more level meta, there's no reason to even call the above application of it ‘strong typing’ in the first place. Obviously it is internally consistent—just as plus is overloaded to mean something indescribably different, so is ‘strong typing’ given a new meaning, one that conveniently allows us to treat JavaScript or Perl as no different than C# or Ada—but it's significantly less useful than a definition that allows us to actually speak about strongly-or-weakly typed languages, or characters and numbers, in a way that makes intuitive sense to human observers, or other intelligent systems.



  • To be honest, I agree that it is a misuse of the term; I was trying to explain what I interpreted him to be saying, even if I disagreed with him.

    As a side note, though, it should be mentioned that strong typing != explicit typing, which is a confusion a lot of people have. There are strongly typed languages which use implicit typing (Haskell is the best known example); more to the point, most explicitly typed languages are not strongly typed. Explicit typing is about typing the variables, while strong typing is about typing the actual objects (in a general sense, not necessarily an OOP sense). Since most older languages carry the assumption that variable == object (that is, that the variable is a value rather than a reference), explicit typing was often confused with strong typing, even in languages which were definitely not strongly typed (e.g., C).
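
    To put that distinction in concrete terms, here's a small C# illustration (C# chosen only because it's handy): the variable's declared type and the type the object itself carries are two different things, and strong typing is about the latter.

        using System;

        class StrongVsExplicit
        {
            static void Main()
            {
                // Explicit typing: the *variable* is declared as object.
                object boxed = "hello";

                // Strong typing: the *object* still knows it is a string,
                // no matter what the variable says.
                Console.WriteLine(boxed.GetType());   // System.String

                // The runtime refuses to reinterpret the string as a number;
                // C, by contrast, would happily let you reinterpret the bits.
                try
                {
                    int n = (int)boxed;
                    Console.WriteLine(n);
                }
                catch (InvalidCastException)
                {
                    Console.WriteLine("objects carry their own types");
                }
            }
        }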



  • @blakeyrat said:

    Stop and think about it for a second and you'll realize Unicode code points are virtually useless for the purpose of sorting already. Optimizing for that use-case is a dumb idea.

    Collating is a very specialized algorithm with narrow usefulness that should be implemented in exactly one library function. Sorting is a lot more general. For example, I could be sorting strings in an array to be searched by dichotomy (binary search), and code points are perfect for that.
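
    For instance (a minimal C# sketch, with arbitrary example strings): sort and binary-search with the same ordinal comparer, and save locale-aware collation for the one place that actually presents strings to humans.

        using System;

        class OrdinalSearch
        {
            static void Main()
            {
                string[] words = { "zebra", "Ähre", "apple", "Mueller", "Müller" };

                // Code-point order looks odd to a human ("Ähre" lands after
                // "zebra"), but it's cheap, locale-independent, and perfectly
                // consistent - which is all a binary search needs.
                Array.Sort(words, StringComparer.Ordinal);
                int index = Array.BinarySearch(words, "Müller", StringComparer.Ordinal);

                Console.WriteLine(string.Join(", ", words));
                Console.WriteLine($"Found at index {index}");
            }
        }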



  • @RaceProUK said:

    True, but you can't store a character without having some form of encoding, otherwise, how do you interpret the bits and bytes?

    Well, you can have an encoding, but sealing it away allows it to be an implementation concern.
    And as long as you store strings outside of the program using an encoding, you can abstract that concern away from the character type.

    Then 'a' is always 'a'. It might be represented by 110110101110101, but it's always 'a'.

    When you do that, character math using ints no longer makes sense.

    Now, you can make a CharacterSpan class (think of the relationship between DateTime and TimeSpan), and now you've put it back into an implementation concern.

    With Character and CharacterSpan classes, you can now do math-like operations and have + and - overridden in a way that preserves strong typing (a rough sketch follows at the end of this post). And it's still an implementation concern.

    That way, if you negate a character, it's stored as a negative character, and adding it to a positive character gives the same result as subtraction; all of that is still an implementation concern. It could be stored as strings, ints, floats, images, for all we care, as long as it's consistent.

    And if you choose to offer a cast operation to int, and casting back results in the previous value, that's ok too.

    But the key here is type-safety.

    Treating chars as ints destroys type-safety.

    But I'm not sure that's what this language is doing, looking just at the OP.

    Implicit casts make type-safety ambiguous to the programmer. They don't destroy type-safety, just make it ambiguous.
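
    A rough sketch of that Character/CharacterSpan pairing in C# (the names and the backing representation are made up; the point is only that the operators keep the types straight, the way DateTime and TimeSpan do):

        using System;

        readonly struct Character
        {
            private readonly int _codePoint;   // the representation stays private
            public Character(char c) { _codePoint = c; }

            // Character - Character yields a span, never another Character.
            public static CharacterSpan operator -(Character a, Character b)
                => new CharacterSpan(a._codePoint - b._codePoint);

            // Character + span yields a Character, mirroring DateTime + TimeSpan.
            public static Character operator +(Character a, CharacterSpan s)
                => new Character((char)(a._codePoint + s.Offset));

            public override string ToString() => ((char)_codePoint).ToString();
        }

        readonly struct CharacterSpan
        {
            public int Offset { get; }
            public CharacterSpan(int offset) { Offset = offset; }

            // Negating a span gives a "negative" offset; adding it back undoes the subtraction.
            public static CharacterSpan operator -(CharacterSpan s)
                => new CharacterSpan(-s.Offset);
        }

        class Demo
        {
            static void Main()
            {
                var a = new Character('a');
                var c = new Character('c');

                CharacterSpan distance = c - a;      // a span, not a character
                Console.WriteLine(a + distance);     // c
                Console.WriteLine(c + (-distance));  // a
                // int i = a + c;                    // does not compile: no such operator
            }
        }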



  • Yessss! Pedantry about what the exact definition of strong typing is or isn't! Now that's what I came to this thread for.

    For my part I'd say that ideally those two things (typing an object vs typing a variable) should be identical; any place where they're not (casts, dynamic vars) is known as a ‘hole’ in the type system. I wouldn't say having holes like that disqualifies a language from being strongly typed, but it certainly does affect the strength rating.


  • Banned

    @Buddy said:

    hell, even allowing character variables to be instantiated without specifying what alphabet you're talking about, is misleading specifically because of the Eliza effect

    What if we define characters to always exist in the Unicode alphabet? It solves this particular problem.


  • FoxDev

    @Gaska said:

    Unicode alphabet

    :wtf:



  • Reading through this topic, one thing is quite clear to me. Whatever the "thing" is, whatever the "issue" is, and whatever the "solution" is: the fact that the "thing" has evolved, and that past "solutions" have added to the "issue", should be indication enough that any new solution will just become part of the next issue. Any new "solution" has to cope with old "things" and "issues", ideally with just a little tinkering - this word is important (in my English, anyway) - unless the decision is made to dispense with any thought of backwards compatibility. That has been done before, by various organisations, for various "things", and it does result in outrage, confusion and rejection: but here we are talking about the ultimate V2.00, one that affects everything and everybody at the same time.

    Now, who would want to do that? Bearing in mind that whatever the justification, by definition (because it had to be done), it will need doing again. I appreciate that I have not offered a solution, but then I believe that there is not a solution (in the context of my argument).

    But there are "organisations" that are prepared and willing to do something like this, with the attitude of "to hell with the issues you may have because of it":


  • Banned

    @RaceProUK said:

    @Gaska said:
    Unicode alphabet

    :wtf:

    You know, the one defined in ISO/IEC 10646:2014, which has 120,000 characters?


  • Discourse touched me in a no-no place

    Ah, you mean the Unicode charset. Alphabets are something else; Unicode maps characters from several alphabets. (Alphabets also don't necessarily prescribe an order on their characters. The Latin-based ones do, but it's not mandatory.)



  • A simple question (just using you @dkf as a launching platform :) ) But is it not the alphabet, whatever language / character set, that defines the sort order?


  • Discourse touched me in a no-no place

    @loose said:

    But is it not the alphabet, whatever language / character set, that defines the sort order?

    As I understand it, there are several different orders possible for Chinese characters, depending on how exactly you go about prioritising some types of strokes over others. It's all really rather complicated and I know I don't understand it. 😉 If you've got several possible sort orders, you obviously can't have a single sort order defined by it (since that depends on having a single transitive relation).


  • FoxDev

    @dkf said:

    Chinese characters

    Are logographs, not letters, so don't form an alphabet. Your point is valid, but I think your example isn't the best illustration. A better example would be katakana, which isn't strictly an alphabet either, but can be treated as one for our purposes; it has two defined sortings, gojūon and iroha (hiragana can also be sorted these same two ways).


  • Discourse touched me in a no-no place

    @RaceProUK said:

    Your point is valid, but I think your example isn't the best illustration.

    All the CJK stuff is just moonrunes to me, though I can often guess which set of moonrunes we're dealing with in a particular case.


  • FoxDev

    Ah, but can you tell the difference between kanji and hanzi? 🚎


  • Banned

    @dkf said:

    Ah, you mean the Unicode charset. Alphabets are something else; Unicode maps characters from several alphabets. (Alphabets also don't necessarily prescribe an order on their characters. The Latin-based ones do, but it's not mandatory.)

    Why exactly can't Unicode charset be treated just like yet another alphabet?



  • Hmmm. Language constraints are becoming a considerable :barrier: to the communication of ideas here. Sometimes there is a need to use the concept behind a word, rather than the meaning of it (which is derived by consensus).

    • A collection of scratchings in the dirt has an order, if only because each has to be separate from the others so that their individual identity can be preserved.
    • The creator can ascribe an importance, if any, to the order of them, thereby creating the concept of rank.
    • Such a construct of symbols can then be used to transfer ideas.
    • Until another "set" is offered as an alternative. Because of raisons - and the nature of the entities constructing the symbols means there will always be raisons.
    • It could be that the second set contains all and only those symbols in the first set, because they are the "only" concepts. It's just that a different rank of importance has been applied.
    • Or it could contain new concepts and the removal of old ones.

    The point is, the "value" of any given symbol and its relative "distance" from any other symbol only has validity within the set of symbols from which it comes, irrespective of its "concept", even if it is the same in both sets. Therefore, any "sorting" of symbols can only be performed with any validity with respect to the set within which they are all defined. Otherwise you have a new hybrid set, as yet unordered.


  • FoxDev

    @Gaska said:

    Why exactly can't Unicode charset be treated just like yet another alphabet?

    Two reasons:

    1. Unicode encodes multiple alphabets, as well as lots of symbols that aren't parts of alphabets
    2. Two languages may share an alphabet, but have different sorting rules

  • Java Dev

    'A' is a letter. 65 is a number. The two may share an equivalence relation, and it is perfectly sensible for an Ord() function to return 65 when passed 'A', or an Asc() function to return 'A' when passed 65. But treating them as the same thing, to be used interchangeably, as if there is no difference between the two at all, is absurd. Even if every major C-like language since C itself has done so.
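
    In C# terms (which, admittedly, already allows the implicit char-to-int conversion being objected to here), those named conversions might look like this; Ord and Asc are hypothetical helpers, not framework methods:

        using System;

        class OrdAsc
        {
            static int Ord(char c) => c;        // 'A' -> 65, as an explicit, named step
            static char Asc(int n) => (char)n;  // 65 -> 'A', likewise explicit

            static void Main()
            {
                Console.WriteLine(Ord('A'));    // 65
                Console.WriteLine(Asc(65));     // A
                // The equivalence exists, but nothing here forces 'A' and 65
                // to be interchangeable everywhere.
            }
        }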


  • Banned

    @RaceProUK said:

    Unicode encodes multiple alphabets

    If we rather treat it as "has symbols that also appear in multiple other alphabets", I see no reason why Unicode cannot be an alphabet itself.

    @RaceProUK said:

    as well as lots of symbols that aren't parts of alphabets

    Do those symbols have some inherent property that prevents them from being a part of any alphabet?

    @RaceProUK said:

    Two languages may share an alphabet, but have different sorting rules

    I'm not aware of any alphabet like this. But even if it's true, it's irrelevant to the issue of why Unicode cannot be an alphabet.


  • FoxDev

    @Gaska said:

    Do those symbol have some inherent property that prevents them from being a part of any alphabet?

    An alphabet is a collection of letters, but there are writing systems that don't use letters; hanzi, kanji, katakana, and hiragana to name but four. Plus numerals and punctuation aren't letters, yet are in Unicode.


  • Banned

    OK, fair enough. Still, defining a character variable to always exist within Unicode space solves the issue highlighted by @Buddy.



  • @Maciejasjmj said:

    we generally want to keep the types closed under most arithmetic operations, but then we end up with results that are characters in type, but make no sense as characters.

    That desire to keep the types closed strikes me as arbitrary and troublesome. For example, wouldn't it be nice if the result of multiplying an int32 by an int32 was automatically an int64 so that overflow couldn't happen?

    @Maciejasjmj said:

    what if you want to calculate the distance under different locale?

    Then you either need a function that accepts two characters and a locale as arguments, or you might want to derive some locale-specific character types.
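
    A minimal sketch of the former, assuming a deliberately naive notion of "distance" (position difference within an explicitly supplied alphabet, standing in for a real locale and real collation data):

        using System;

        static class CharDistance
        {
            // Both characters and the "locale" (here just an alphabet string)
            // are explicit arguments - nothing depends on ambient global state.
            public static int Distance(char a, char b, string alphabet)
            {
                int ia = alphabet.IndexOf(a);
                int ib = alphabet.IndexOf(b);
                if (ia < 0 || ib < 0)
                    throw new ArgumentException("Character not in this alphabet.");
                return ib - ia;
            }

            static void Main()
            {
                const string english = "abcdefghijklmnopqrstuvwxyz";
                const string swedish = "abcdefghijklmnopqrstuvwxyzåäö";

                Console.WriteLine(Distance('a', 'c', english)); // 2
                Console.WriteLine(Distance('z', 'å', swedish)); // 1 - only meaningful in this alphabet
            }
        }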



  • @Buddy said:

    even allowing character variables to be instantiated without specifying what alphabet you're talking about

    I see nothing wrong with specifying the alphabet for the most general character type in the language definition, and believe that that most general of alphabets ought to be Unicode.



  • @xaade said:

    I'm not sure that's what this language is doing, looking just at the OP.

    Looks to me as if it's treating char as a signed 8-bit type, which is easy to implement but has all kinds of conceptual wrongness.

    I have no problem with languages providing signed 8-bit types and arithmetic to operate on them, but calling them char or character is misleading and unhelpful in a post-ASCII world.


  • FoxDev

    @flabdablet said:

    For example, wouldn't it be nice if the result of multiplying an int32 by an int32 was automatically an int64 so that overflow couldn't happen?

    No: what if you're going to store the result in a DB field that can only hold a 32-bit number?



  • Then some kind of explicit truncation is required, which calls attention to the potential overflow and gives you an opportunity to deal with it sanely. Do you want to drop the most significant 32 bits? Do you want to pin an overflowed result to the maximum representable value? Better to do that kind of thing in an explicit int64 to int32 conversion step than bake assumptions about it into the arithmetic implementation.
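
    In C#, for example, each of those policies can be spelled out at the Int64-to-Int32 conversion instead of being baked into the multiplication:

        using System;

        class TruncationPolicies
        {
            static void Main()
            {
                long product = (long)int.MaxValue * 3;   // widened first, so no overflow here

                // Policy 1: explicitly drop the most significant 32 bits.
                int wrapped = unchecked((int)product);

                // Policy 2: explicitly pin (saturate) to the representable range.
                int pinned = product > int.MaxValue ? int.MaxValue
                           : product < int.MinValue ? int.MinValue
                           : (int)product;

                // Policy 3: explicitly demand an exception if the value doesn't fit.
                try
                {
                    int strict = checked((int)product);
                    Console.WriteLine(strict);
                }
                catch (OverflowException)
                {
                    Console.WriteLine("overflow caught at the conversion step");
                }

                Console.WriteLine($"{wrapped} {pinned}");
            }
        }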



  • @Gaska said:

    What if we define characters to always exist in the Unicode alphabet?

    1. Unicode contains dozens of alphabets

    2. Which one? Unicode has like... 6 or 7 different versions in itself. (Now it's true that we've basically standardized on UTF-8, but older systems such as Windows still use Unicode versions that existed before UTF-8 and aren't likely to be updated in the near future. Since UTF-16 is "good enough" for Microsoft's purposes. And before you go into your tedious "Windows is teh suckz!" speech, remember Windows made this decision before UTF-8 even existed.)



  • @Gaska said:

    Why exactly can't Unicode charset be treated just like yet another alphabet?

    Because it doesn't work the way human beings expect it to.

    For example, the one we've already discussed ad nauseam: it doesn't work for English, where Unicode contains two characters (at minimum) for each letter. A lot of operations (such as alphabetization) require letters-- Unicode doesn't have letters, it only has characters.



  • @blakeyrat said:

    1) Unicode contains dozens of alphabets

    So what?

    @blakeyrat said:

    2) Which one? Unicode has like... 6 or 7 different versions in itself. (Now it's true that now we've basically standardized on UTF-8...

    UTF-8 is an encoding, not Unicode.



  • @RaceProUK said:

    @flabdablet said:
    For example, wouldn't it be nice if the result of multiplying an int32 by an int32 was automatically an int64 so that overflow couldn't happen?

    No: what if you're going to store the result in a DB field that can only hold a 32-bit number?

    We could apply an explicit type cast.

    And this is the reason I wouldn't like it - most integer multiplications I encounter are with indices of one kind or another, and I wouldn't want to typecast every time I use one.

    We would even need a bigger type when multiplying an int8 with an int32. And even when adding two int32s.

    There is no way to ensure that overflows can't happen - that would require cutting down to Turing-incomplete languages, AFAIK.



  • @blakeyrat said:

    Unicode doesn't have letters, it only has characters.

    Letters are a strict subset of characters, so Unicode does have letters. It has lots of other things as well. That's because the idea of a character is a generalized abstraction of the idea of letters.



  • @flabdablet said:

    That desire to keep the types closed strikes me as arbitrary and troublesome. For example, wouldn't it be nice if the result of multiplying an int32 by an int32 was automatically an int64 so that overflow couldn't happen?

    As I posted, the problem here is that you have to specify the bits in the first place. Why should any human give a fuck how many bits the compiler decided to use? Why is this something I, as a programmer, ever have to waste neurons on?

    (Actually I personally just use int64/bigint for everything and call it good, since I virtually never work on something outside of that range.)



  • @PWolff said:

    There is no way to ensure that overflows can't happen

    Sure there is. You just make the receiving type of any arithmetic operation big enough to contain all possible results given the sizes of the operand types, and then you apply explicit truncation rules as and when you need to. The overflows, if any, will occur during those explicit truncation steps.


  • FoxDev

    @PWolff said:

    We could apply an explicit type cast.

    And this is the reason I wouldn't like it


    You don't like having code that makes doing unusual things explicit?

    If you have two Int32s and you want to multiply without worrying about overflow, then it's much better to be explicit about it. If you see an unchecked section or explicit up-casts and truncations, then you know something funny's going on, and can deal with it accordingly; it minimises assumptions and the chance of a surprise odd result.
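
    For what it's worth, C# already has the vocabulary for exactly this kind of explicitness; a small illustration:

        using System;

        class ExplicitOverflow
        {
            static void Main()
            {
                int a = 1_000_000;
                int b = 3_000;

                // Explicit up-cast: the multiplication itself cannot overflow.
                long wide = (long)a * b;

                // Explicitly checked: overflow raises an exception instead of wrapping.
                try
                {
                    int narrow = checked(a * b);
                    Console.WriteLine(narrow);
                }
                catch (OverflowException)
                {
                    Console.WriteLine($"int * int overflowed; the widened result is {wide}");
                }

                // Explicitly unchecked: "yes, I know this wraps, and I mean it."
                Console.WriteLine(unchecked(a * b));
            }
        }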



  • @blakeyrat said:

    As I posted, the problem here is that you have to specify the bits in the first place.

    I have no objection to the idea that the compiler should keep track of that stuff implicitly to the greatest extent possible. For systems programming where you often need to be talking to hardware, or interfacing with external APIs like databases, it would be good to have explicitly sized types available as well.



  • @flabdablet said:

    UTF-8 is an encoding, not Unicode.

    That's a distinction without a difference. UTF-8 contains characters that can't be represented in UTF-16.

    In any case, the second reason is a better one for why treating Unicode as an alphabet doesn't make any sense.

    @flabdablet said:

    Letters are a strict subset of characters, so Unicode does have letters. It has lots of other things as well. That's because the idea of a character is a generalized abstraction of the idea of letters.

    Maybe; but there's nothing inherent in Unicode that tells you 'a' and 'A' should be treated the same when sorting. To find that information, you need to go outside of Unicode. So using Unicode as some sort of "canonical alphabet for all languages" doesn't work, not even for English which is an incredibly simple language, relatively-speaking.

    Anyway, whatever. You know Gaska's point is stupid, you're just debating with me because you want to debate with me. These arguments are weaksauce and you know it. And if anybody ever drilled into them, like they did Thursday, they'd just find out-- oh hey, you actually agree with me after all.



  • @flabdablet said:

    For systems programming, where you often need to be talking to hardware, it would be good to have explicitly sized types available as well.

    Did I ever propose getting rid of them? No I didn't.

    Why are you posting this shit? Just like the sounds of your own keyboard? Why not go type in your blog and stop wasting my time on this.



  • @blakeyrat said:

    there's nothing inherent in Unicode that tells you 'a' and 'A' should be treated the same when sorting.

    That's because that shouldn't happen unless you're sorting against a locale that resembles US English, in which case you need to tell your sort function what locale to use.
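
    Which is exactly what the .NET comparers let you do; a quick illustration with arbitrary words:

        using System;
        using System.Globalization;

        class LocaleSort
        {
            static void Main()
            {
                string[] words = { "apple", "Banana", "cherry" };

                // Ordinal: 'B' (66) sorts before 'a' (97), so "Banana" jumps ahead.
                Array.Sort(words, StringComparer.Ordinal);
                Console.WriteLine(string.Join(", ", words));   // Banana, apple, cherry

                // Locale-aware and case-insensitive: the caller names the locale
                // instead of relying on ambient global state.
                var enUs = StringComparer.Create(new CultureInfo("en-US"), ignoreCase: true);
                Array.Sort(words, enUs);
                Console.WriteLine(string.Join(", ", words));   // apple, Banana, cherry
            }
        }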



  • @flabdablet said:

    That's because that shouldn't happen unless you're sorting against a locale that resembles US English, in which case you need to tell your sort function what locale to use.

    RIGHT!

    See!? You agree with me completely.

    You just made the exact same argument I made days ago.

    That's it, muting this shit.



  • @flabdablet said:

    Sure there is.

    In a way, yes. There are ways to handle integers of arbitrary length. And it is really only in very special cases that you run out of space with them. Running out of performance is another matter.

    @RaceProUK said:

    You don't like having code that makes doing unusual things explicit?

    If you have two Int32s and you want to multiply without worrying about overflow, then it's much better to be explicit about it. If you see an unchecked section or explicit up-casts and truncations, then you know something funny's going on, and can deal with it accordingly; it minimises assumptions and the chance of a surprise odd result.


    That's a matter of "philosophy", I think.

    Strong typing cranked up to 1110 would indeed take an int64 as the result of a product of two int32s, and an int33 as the result of a sum of two int32s. Or just an integer class that can hold an arbitrary amount of bits.

    If I were to choose, I'd take arbitrary integers for high-level programming, and machine words for cases where performance is paramount.

    I'll rethink my position, but for the moment, I'll stick with int * int -> int, and exception handling.


  • FoxDev

    @PWolff said:

    There are ways to handle integers of arbitrary length.

    And a tiny number of situations where that's required.

    Considering the likelihood of integer overflow in general, I'll stick with what's faster.



  • @blakeyrat said:

    @flabdablet said:
    UTF-8 is an encoding, not Unicode.

    That's a distinction without a difference. UTF-8 contains characters that can't be represented in UTF-16.

    No it's not and no it doesn't. Every Unicode code point can be encoded as one or two 16-bit values using UTF-16. UTF-16 is not the same thing as UCS-2.

    Note that this is not the same thing as saying that every 21-bit value has a corresponding UTF-16 encoding. This is because Unicode code points and arbitrary 21-bit values are not the same type.

    The bit manipulation method that UTF-8 defines to convert Unicode code points to byte codes can encode any arbitrary 21-bit value. This doesn't mean that all such values are valid Unicode code points either.

    Am I the only one who knows what "strongly-typed" even means anymore??
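
    A quick C# demonstration of both halves of the UTF-16 claim above (supplementary code points encode as surrogate pairs, and the surrogate range itself is not a valid code point):

        using System;

        class SurrogatePairs
        {
            static void Main()
            {
                // U+1F600 lies outside the Basic Multilingual Plane, yet UTF-16
                // represents it fine - as two 16-bit code units (a surrogate pair).
                string emoji = char.ConvertFromUtf32(0x1F600);
                Console.WriteLine(emoji.Length);                  // 2
                Console.WriteLine(((int)emoji[0]).ToString("X")); // D83D (high surrogate)
                Console.WriteLine(((int)emoji[1]).ToString("X")); // DE00 (low surrogate)
                Console.WriteLine(char.ConvertToUtf32(emoji, 0).ToString("X"));  // 1F600

                // And not every 21-bit value is a code point: the surrogate
                // range is excluded, so this throws.
                try { char.ConvertFromUtf32(0xD800); }
                catch (ArgumentOutOfRangeException) { Console.WriteLine("not a valid code point"); }
            }
        }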



  • @RaceProUK said:

    Considering the likelihood of integer overflow in general, I'll stick with what's faster.

    Which is: the type of the product of an int_n and an int_n is int_n, too.

    Filed under: Just noticed I'm about to adopt the debate tone of blakeyrat, FrostCat, and others in this thread.



  • @PWolff said:

    If I were to choose, I'd take arbitrary integers for high-level programming, and machine words for cases where performance is paramount.

    Sounds good to me.


  • Banned

    @blakeyrat said:

    Why should any human give a fuck how many bits the compiler decided to use?

    There are several reasons, but I don't want to waste time explaining them to you because I know you've already decided I'm wrong.


  • FoxDev

    @flabdablet said:

    Am I the only one who knows what "strongly-typed" even means anymore??

    No, but unlike @blakeyrat, you recognise the difference between UTF-16 and UCS-2



  • @powerlord said:

    In fact, what language is there where char isn't a number type?

    Ada



  • @Gaska said:

    I'm not aware of any alphabet like this. But even if it's true, it's irrelevant to the issue of why Unicode cannot be an alphabet.

    ... even after I gave you an example of different sort order in the same language?

    @Jaime said:

    1. Treat them like their base characters, as if the umlaut was not present (DIN 5007-1, section 6.1.1.4.1). This is the preferred method for dictionaries, where umlauted words ("Füße", feet) should appear near their origin words ("Fuß", foot). In words which are the same except for one having an umlaut and one its base character (e.g. "Müll" vs. "Mull"), the word with the base character gets precedence.
    2. Decompose them (invisibly) to vowel plus e (DIN 5007-2, section 6.1.1.4.2). This is often preferred for personal and geographical names, wherein the characters are used unsystematically, as in German telephone directories ("Müller, A.; Mueller, B.; Müller, C.").

    The second also shows that sorting has to be applied at the string level, not the character level.
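
    To see both points in running code (the dictionary-style DIN 5007-1 ordering is roughly what the stock de-DE culture gives you; the phone-book variant would need a different collation, which is rather the point):

        using System;
        using System.Globalization;

        class GermanSorting
        {
            static void Main()
            {
                string[] names = { "Müller", "Muster", "Mueller", "Müll", "Mull" };

                // Code-point order: 'ü' (U+00FC) compares greater than every
                // ASCII letter, so the umlauted names sink to the bottom.
                Array.Sort(names, StringComparer.Ordinal);
                Console.WriteLine(string.Join(", ", names));

                // German (dictionary-style) collation: 'ü' is ordered with 'u',
                // so "Müll" lands next to "Mull" and "Muster" drops to the end.
                Array.Sort(names, StringComparer.Create(new CultureInfo("de-DE"), false));
                Console.WriteLine(string.Join(", ", names));
            }
        }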

