Am I the only one who knows what "strongly-typed" even means anymore???



  • Ok one more attempt at replying here, then I'm done.

    @Scarlet_Manuka said:

    The situation with characters is similar to the situation with dates. It makes sense to subtract two dates, with the result being not a date but an interval (which we can represent as a number of some smaller unit).

    Ok; so you're proposing we should have Char and "CharSortDistance", which work similarly to DateTime and TimeSpan in C#.

    I could see that.
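
    Something like this hypothetical C# sketch, mirroring the DateTime/TimeSpan pattern (all of these names are invented for illustration):

    struct CharSortDistance
    {
        public readonly int Steps; // signed distance within some collation
        public CharSortDistance(int steps) { Steps = steps; }
    }

    struct CollatedChar
    {
        public readonly char Value;
        public CollatedChar(char value) { Value = value; }

        // Subtraction yields a distance, never another character; the
        // code-point difference here is only a placeholder for a real collation.
        public static CharSortDistance operator -(CollatedChar a, CollatedChar b)
            => new CharSortDistance(a.Value - b.Value);
    }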

    But that's not what Koka is doing, so you're agreeing with me that Koka is in the wrong here.

    @Scarlet_Manuka said:

    In the same way, you can take two characters in some defined collation and subtract them to find a number representing how far apart they are in that collation.

    But Koka isn't asking for a localization (or "collation" if we're going to use that term) when it does the subtraction, so, again, you're agreeing with me that Koka is wrong here.

    So... why are you posting as if I'm an idiot when you fundamentally agree with me entirely? As far as I can tell.

    @dkf said:

    e.g., not all alphabets define an ordering of their characters, but Unicode does.

    Unicode code points aren't necessarily in alphabetical order. And even if they were, alphabetizing via them would alphabetize capital and lowercase characters separately, which is bad and wrong.

    So the fact that Unicode characters have numeric representations isn't very useful when you're trying to alphabetize a list.

    @Planar said:

    Some languages even define what it means to add strings together, for example.

    Strongly-typed languages shouldn't do that, either.

    @loose said:

    How else do you expect a sort, of any type, to be performed? By spreading them out on a table (preferably wooden for best results), and doing it by "eye"?

    Stop and think about it for a second and you'll realize Unicode code points are virtually useless for the purpose of sorting already. Optimizing for that use-case is a dumb idea.

    This is a forum full of developers, right? There's a lot of "I can't believe I had to say that" going on in this thread.

    @aliceif said:

    Subtracting codepoints is not.

    Maybe not, but:

    1. it shouldn't be implemented via silently converting chars to ints, and
    2. I personally still don't see what it's good for.

    @Gaska said:

    Has anyone noticed that @blakeyrat's complaint doesn't have anything to do with strong typing

    Yes it does.

    @Gaska said:

    and he's simply discontent with the set of operations defined for char type?

    That's a different but related complaint.

    @flabdablet said:

    I'm guessing that the 30 or so characters with explicit order that you're talking about, in the context of US English, would be the letters?

    Letters, numbers, and some pieces of punctuation. I guess more than 30, but not a lot more. The point is, Unicode contains hundreds of thousands of characters, of which 99.5%, when sorted in US English, have no meaningful sort order whatsoever.

    @flabdablet said:

    The notion of alphabetic distance has enough in common with numeric subtraction to make the use of the - operator a natural way to express it: 'M' - 'E' == 8.

    Right; but since 'M' - 'e' != 8, what good is that? It can't be used for sorting.
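
    For the concrete numbers, here's what code-point subtraction gives you in C# (which allows it through an implicit char-to-int conversion):

    using System;

    Console.WriteLine('M' - 'E'); // 8   -- looks like alphabetic distance
    Console.WriteLine('M' - 'e'); // -24 -- case destroys the illusion
    Console.WriteLine('e' - 'E'); // 32  -- a case difference, not an alphabetic one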

    @flabdablet said:

    it's completely reasonable to generalize the concept of letters to that of characters,

    No, it's not. There are at least two characters for each (English) letter. If you include legacy stuff, the letter combination ae has: 'a', 'e', 'A', 'E', 'æ', 'Æ'. Other languages are far more complex, and you're laying out rules that don't even work in English.

    You... you people speak and write languages, right? Do you not understand at all how they work?

    Seriously, I can't believe I actually had to say any of this.

    @flabdablet said:

    It certainly makes no sense to add one character to another, and it makes no sense to expect the result of subtracting one character from another to be a character.

    I concede that that is an operation that can be performed; I still don't know what good it is.

    @flabdablet said:

    My current belief is that you derive some kind of masochistic pleasure from being endlessly shown to be wrong.

    Says the guy who thinks the letter 'a' is represented by only one character in Unicode.



  • @Gribnit said:

    What, you never heard of duck typing?

    Of course I've heard of it. Who said I hadn't?

    @powerlord said:

    In fact, what language is there where char isn't a number type? Hint: If you immediately thought of C#, you're wrong.

    I know C# is wrong. I've complained about that many times in the past. It's one of my big pet peeves about the language.

    "We've always done it that way" is not a good reason for a language designed in 2012 to make this same mistake.

    @flabdablet said:

    I also claim that requiring such casts to be made explicitly when working with alphabetic distances is both ugly and unnecessary.

    SOMEONE EXPLAIN TO ME WHAT GOOD IS WORKING WITH ALPHABETIC DISTANCES IN THE FIRST PLACE!

    What is the use of this concept!?

    @flabdablet said:

    The Unicode alphabetic distances between characters that do not share a common sub-alphabet are indeed not often of much use.

    Oh, so you know it's useless, but you're still going to fall on your sword calling me an idiot because I said it's not worth implementing. Great. Thanks for restoring my faith in humanity.

    @Medinoc said:

    Of course, in a very strongly-typed language, I'm not even sure "raw" integers would exist, rather than all be some unit (or unitless factor) or other.

    Probably true.

    I'd argue that they would exist, but:

    1. They should be aliased to another type describing the actual thing being measured

    2. The distinction of how many bits they use up should be completely discarded (again: the fact that the "fastest" integer on a particular CPU is 32 bits is purely an implementation detail and should not be exposed outside of the implementation.)

    @flabdablet said:

    Requiring casts and pseudo-casts like .toUnicodeCodePoint() just adds noise.

    Wait, didn't you just a few posts ago say:

    @flabdablet said:

    Requiring alphabetic distance calculations to be done using explicit casting of code points to general integers looks like a missed opportunity to me.

    ?

    So... do you support explicit casting or not? Now I'm just confused.

    @RaceProUK said:

    C# arrived in 2000 [/pendant]

    Ok, here come the mod complaints but whatever:

    We're not talking about C#; we're talking about Koka, which was created in 2012.

    @Jaime said:

    If you are manipulating characters as numbers, you are almost certainly doing something that will not globalize well.

    Exactly, which is why any conversion from char to integer should at minimum be explicit, and possibly even warrant a compiler warning.

    @flabdablet said:

    I agree with everything you say, and note that supporting alphabetic distance arithmetic is not the same thing as providing an implicit cast to int.

    But it's also useless for any purpose.

    And even if it weren't, the distance between two characters isn't an int; it's a brand-new type.


  • Considered Harmful

    Let's say I declare that a set of symbols exists in a ring. I may express the distances between those symbols, in a given direction, as an integer, and nothing about the math breaks. Getting more concrete, I see problems once I get into variable-length encoding schemes; however, I can see perfect sense in allowing retrieval of the distance between two characters in a given collation scheme as an integer. That seems to be what I'd implement "difference of characters" as: distance within a collation. The current numeric representation lets you take a distance without reference to a collation definition, so why do you mind?
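
    A hypothetical C# sketch of that reading, where the collation is just an ordered string of symbols standing in for a real collation table:

    static int CollationDistance(string collation, char a, char b)
    {
        int ia = collation.IndexOf(a), ib = collation.IndexOf(b);
        if (ia < 0 || ib < 0)
            throw new System.ArgumentException("symbol not in this collation");
        return ib - ia; // signed distance in the given direction
    }

    // CollationDistance("abcdefghijklmnopqrstuvwxyz", 'e', 'm') == 8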



  • Right; but as I just posted, what's the use of doing that? Any operation involving the integer difference between two chars is almost certainly implementing a bug.

    But hey, forgive me, maybe I'm too ignant for your brilliant mind.


  • Considered Harmful

    Most people are. What I see in the original example is consistent with modeling as described above, and could be considered strongly typed (although an accusation of dwimmery would probably stick). It is also consistent with being modeled entirely numerically, although in that case the int conversion would be an identity, so why bother. More likely, then, it is consistent, strongly typed, and merely involves some object classes you were unable to postulate for yourself.

    Maybe someday though, with your evident interest in the field, you could be a computer programmer...



  • Well I'm too ignant to understand words like dwimmery so I guess you sure showed me.

    Maybe someday I will be a computer programmer. *Looks wistfully towards stars*



  • @Gribnit said:

    I can see perfect sense in allowing retrieval of the distance between two characters in a given collation scheme as an integer.

    However, the real world disagrees with you. As one of many examples, in German, "Ö" sometimes sorts the same as the two-character combination "OE". Given this, the distance between Ö and O may be zero (if the O is followed by an E) or non-zero.
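
    To make that concrete, culture-aware comparison in C# already can't be reduced to a fixed per-character distance (the exact results depend on the platform's collation data):

    using System;
    using System.Globalization;

    var de = CultureInfo.GetCultureInfo("de-DE");
    // Culture-aware comparison applies German collation rules...
    Console.WriteLine(string.Compare("Mull", "Müll", false, de));
    // ...while ordinal comparison just compares code points.
    Console.WriteLine(string.Compare("Mull", "Müll", StringComparison.Ordinal));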



  • @blakeyrat said:

    Says the guy who thinks the letter 'a' is represented by only one character in Unicode.

    You don't really do "context", do you?

    Whatever, I guess.



  • Maybe I misunderstood what you were trying to communicate in that one instance.

    At the same time, you've been doing nothing in this thread except calling me a moron while fundamentally agreeing with everything I've said. So you'll excuse me if I don't shed any tears.


  • Discourse touched me in a no-no place

    @blakeyrat said:

    Unicode code points aren't necessarily in alphabetical order. And even if they were, alphabetizing via them would alphabetize capital and lowercase characters separately, which is bad and wrong.

    So the fact that Unicode characters have numeric representations isn't very useful when you're trying to alphabetize a list.

    That's very true indeed. You need to know about the language that the two items are written in in order to collate them correctly, and that really messes up when you're dealing with multilingual work. The problem is that there's no consistent order in the first place. So we use Unicode because at least then we get an agreed amount of wrong.
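
    A small C# illustration of that language dependence; the actual orderings come from whatever collation data the platform ships:

    using System;
    using System.Globalization;

    var words = new[] { "zebra", "Ähre", "apple" };

    // Swedish collation traditionally puts Ä after Z...
    Array.Sort(words, StringComparer.Create(CultureInfo.GetCultureInfo("sv-SE"), true));
    Console.WriteLine(string.Join(", ", words));

    // ...while German treats it like A, so the same data sorts differently.
    Array.Sort(words, StringComparer.Create(CultureInfo.GetCultureInfo("de-DE"), true));
    Console.WriteLine(string.Join(", ", words));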

    It gets even worse when you start wanting to do case changing.


  • FoxDev

    @Gribnit said:

    I can see perfect sense in allowing retrieval of the distance between two characters in a given collation scheme as an integer

    But what use does that have? Knowing the distance between 陰 and 陽 achieves nothing; let's face it, the only reason characters are stored as numbers is because it's easy (and Unicode assigns a number to every code point).


  • Banned

    @blakeyrat said:

    Yes it does.

    Assuming you don't break the abstraction with your prior knowledge of the encoding used internally by Koka, how is having "character arithmetic" relevant to strong typing at all?



  • Gaska, I'm not going to answer your stupid questions. If you disagree with me, fine. Disagree with me. Then go the fuck away and read some other thread.


  • Considered Harmful

    Hi - in that case, the 'Ö' character wasn't the whole symbol; the symbol was OE - as indicated, variable-length encodings play happy hell with character arithmetic. Also, Unicode has more than one conceivable ring in it - I was actually expecting to get flamed for that, not for the already-mentioned scheme breakdowns. Feel free to use something like "Your blinkered American outlook blinds you to the fact that..." as a starter.


  • area_pol

    @blakeyrat said:

    SOMEONE EXPLAIN TO ME WHAT GOOD IS WORKING WITH ALPHABETIC DISTANCES IN THE FIRST PLACE!

    Very good point. I can think of no case where subtracting/adding ints to characters would be needed.

    Even removing the char type completely does not hurt the expressiveness of a language.
    For example, Python has no char objects; a character is a string of length 1.
    Functions like

    isalpha islower isupper isdigit upper lower capitalize ...

    provide the operations one would usually like to perform on characters (and are surely more elegant than comparing the numerical values).
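
    The C# equivalents live as static methods on char, so ordinary character handling needn't touch code points there either; a small sketch:

    using System;

    char c = 'ä';
    Console.WriteLine(char.IsLetter(c)); // True
    Console.WriteLine(char.IsLower(c));  // True
    Console.WriteLine(char.ToUpper(c));  // Ä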


  • Discourse touched me in a no-no place

    @Adynathos said:

    For example, Python has no char objects; a character is a string of length 1.

    Tcl also does this, with characters being Unicode characters and having no mention of what the encoding of those characters is; they're just characters in a string, and you can have length-1 strings easily enough. The length is the number of characters in it, not the number of bytes. (It also rejects making strings into objects in the first place.)



  • Koka is a "strongly typed" language. A string was expected but some idiot used something else, and hence everything you see is "undefined".



  • @blakeyrat said:

    Stop and think about it for a second and you'll realize Unicode code points are virtually useless for the purpose of sorting already. Optimizing for that use-case is a dumb idea.

    Now, I don't recall mentioning anything about Unicode code points, or any other code points for that matter. All I asked was: how is the sorting achieved, irrespective of the type and nature of the sort? Ultimately, one substring needs to be compared with another in order to determine the "rank" of one over the other. I offered TableTopTechnology™ as a possible alternative; I now offer another: Massive Fuck Off sized mapped arrays.

    I do feel for you @blakeyrat (and it don't help that every time I see your @mention, I cannot get past : http://www.onthebusesfanclub.com/sitebuildercontent/sitebuilderpictures/blakeyarrrghh.jpg - too horrible even for SpoilerTech™ )

    It cannot be a nice experience to always be moaned at, but sometimes.....

    Can I ask? if it is not too personal, [spoiler]Is it your time of the month?[/spoiler]


  • Considered Harmful

    A well-known counterexample? Although the usefulness of the ROT13 algorithm can be called into question, it does heavily involve addition to and subtraction from character values (in implementations not based on lookup tables). If I should like to know whether a letter comes before another in a collation sequence, I am wanting at least sgn(distance). If I should like to do some kinds of searching or sorting, it is useful to know how far off target I think I may be, in which case I would want the actual distance.
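
    A minimal C# sketch of that arithmetic-based approach, valid only for the 26 unaccented ASCII letters (which happens to be exactly ROT13's domain):

    static char Rot13(char c)
    {
        if (c >= 'a' && c <= 'z') return (char)('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return (char)('A' + (c - 'A' + 13) % 26);
        return c; // everything else passes through unchanged
    }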



  • @Gribnit said:

    Hi - in that case, the 'o' character wasn't the whole symbol, the symbol was OE - as indicated, variable-length encodings play happy hell with character arithmetic.

    You're going to have to talk to all of the German-speaking people and get them to agree with your scheme.

    Here's Wikipedia's sorting rules for umlauts in German:

    1. Treat them like their base characters, as if the umlaut was not present (DIN 5007-1, section 6.1.1.4.1). This is the preferred method for dictionaries, where umlauted words ("Füße", feet) should appear near their origin words ("Fuß", foot). In words which are the same except for one having an umlaut and one its base character (e.g. "Müll" vs. "Mull"), the word with the base character gets precedence.
    2. Decompose them (invisibly) to vowel plus e (DIN 5007-2, section 6.1.1.4.2). This is often preferred for personal and geographical names, wherein the characters are used unsystematically, as in German telephone directories ("Müller, A.; Mueller, B.; Müller, C.").
    3. Treat them like extra letters, placed either
       1. after their base letters (Austrian phone books have ä between az and b, etc.), or
       2. at the end of the alphabet (as in Swedish or in extended ASCII).

    It's common for a program to offer both of the first two as options in its international settings.
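
    Rule 2 can be sketched in C# as a normalization pass before an ordinary comparison; the helper below is hypothetical, and the replacement table is the whole rule:

    static string Din5007Var2(string s) => s
        .Replace("ä", "ae").Replace("ö", "oe").Replace("ü", "ue")
        .Replace("Ä", "Ae").Replace("Ö", "Oe").Replace("Ü", "Ue")
        .Replace("ß", "ss");

    // Din5007Var2("Müller") == "Mueller", so the phone-book entries interleave.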

    Come back to this thread when you're done, I'll have more examples for you. You might as well start learning Turkish so you can prep for the boss level.


  • Discourse touched me in a no-no place

    @Jaime said:

    Come back to this thread when you're done, I'll have more examples for you. You might as well start learning Turkish so you can prep for the boss level.

    There's also capitalization. That's really fun in Dutch…


  • Considered Harmful

    So, I suppose I'd need a context-sensitive collation scheme, or I can let more than one symbol occupy the same position in the ring. There's no way the symbol encoding is going to be universally useful for collation, of course, and there are also languages for which a symbol-based encoding breaks down. Luckily, there are already collation schemes for those languages, in languages I'll be using.

    In the case where I am not sorting but am performing meaningless arithmetic, of course it doesn't matter if the ring is actually useful for collation.



  • @Gribnit said:

    In the case where I am not sorting but am performing meaningless arithmetic, of course it doesn't matter if the ring is actually useful for collation.

    ... but the closest anyone has come to a useful application of "alphabetic distance" is sorting. Once you admit it will be buggy to use it this way, there's no reason left for it to exist. You sort strings, not characters. If you are performing a character-by-character sort, you are creating bugs.

    @Gribnit said:

    In the case where I am not sorting but am performing meaningless arithmetic

    That seems like a great reason for the feature to exist.


  • Considered Harmful

    Encryption, and encoding for protocols, tend to be, relative to the semantics of the message, meaningless arithmetic, and I am glad that you did notice that.



  • Encoding is an implementation detail - you don't need to expose anything in the API to get it right.

    Encryption requires a string to first be encoded into a byte stream. All of the issues will be part of the internal implementation of the encoder. So, neither are good reasons for providing a language feature to determine alphabetic distance.


  • Discourse touched me in a no-no place

    @Jaime said:

    If you are performing a character-by-character sort, you are creating bugs.

    Only semi-relevant, but I couldn't resist it when I found it in GIS…


  • Considered Harmful

    So, let's say I'm making a directed acyclic word graph for the purpose of solving Scrabble, and I'd like to binary search at each given level. I'd like to be able to compare characters in order to do this.

    Or, let's say that I feel that I really must binary search within each tier of directed acyclic word graph for whatever purpose, probably for speed reasons. Let's say it's holding a million POSIX paths, or something, that I need to recognize, and I want that little bump from linear to log(n), expecting large tiers.

    Please provide a way to do this without comparing the characters.



  • @Gribnit said:

    Please provide a way to this without comparing the characters.

    Answered in the form of what other languages have done in the past:

    ASC(mychar)
    Encoding.ASCII.GetBytes(new[] { mychar })[0]

    There's a ton of ways to get what you want without implementing character subtraction.
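
    And comparing characters only needs an ordering, not subtraction; here's a sketch of the binary search over one tier's sorted edge labels:

    static int FindEdge(char[] sortedLabels, char target)
    {
        int lo = 0, hi = sortedLabels.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int cmp = sortedLabels[mid].CompareTo(target); // ordering, no char arithmetic
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1; // no edge with that label
    }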

    @Gribnit said:

    Let's say it's holding a million POSIX paths, or something, that I need to recognize, and I want that little bump from linear to log(n), expecting large tiers

    You need to optimize. Maybe storing the value as byte instead of char is for you. Please don't introduce bugs into everyone else's char so your edge case will be easier.



  • @Gribnit said:

    So, let's say I'm making a directed acyclic word graph for the purpose of solving Scrabble, and I'd like to binary search at each given level. I'd like to be able to compare characters in order to do this.

    Clearly, this is what language designers should optimize for.



  • @powerlord said:

    In fact, what language is there where char isn't a number type?

    Haskell. Chars are just tokens. If it didn't require a bit of syntactic sugar to do this, they could have been defined as:

    data Char = a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
    

    etc. Instead, they're defined by the compiler/runtime system internally, as Unicode blah blah blah.

    They are, however, instances of Ord and Enum, so you could abuse those facts to do arithmetic with them if you really, really wanted. That would look something like:

    ghci> toEnum (fromEnum 'a' + fromEnum 'b') :: Char
    

    With the :: Char annotation that actually type-checks: fromEnum 'a' + fromEnum 'b' is 97 + 98 = 195, so you get 'Ã' back, which shows just how meaningless the result is. But that's the gist of it anyway.


  • Discourse touched me in a no-no place

    @Captain said:

    If it didn't require a bit of syntactic sugar to do this

    It would also be boring as heck for the whole space of possible characters.



  • But imagine your CLOC productivity!



  • @RaceProUK said:

    the only reason characters are stored as numbers is because it's easy

    Characters aren't stored as numbers. They're stored as bit patterns. Re-interpreting those bit patterns as numbers may or may not be useful.


  • Considered Harmful

    @Jaime said:

    There's a ton of ways to get what you want without implementing character subtraction.

    So, you're okay with Asc() existing? But what would anyone ever use it for?



    I have no problem with methods and properties that allow you to do all sorts of intimate things. This thread is about the language bubbling implementation details up to the surface, by either implicit conversion to a numeric type or by overloading arithmetic operators for char.

    So, obscure method that is of minimal use = OK with me. Implicit conversion to type that is of minimal use = not OK with me. Operator overloading in such a way as to encourage users to implement their own sort operation = not OK with me.
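
    For a concrete example of the implicit-conversion problem: C# lets this compile with no cast, no warning, and no hint that it is almost certainly a bug:

    int n = 'a' + 'b'; // 195 -- two letters "added" into a meaningless int
    char c = (char)n;  // 'Ã' -- and cast back into an unrelated character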



  • @Gribnit said:

    Let's say it's holding a million POSIX paths, or something, that I need to recognize, and I want that little bump from linear to log(n), expecting large tiers.

    You probably don't realize this, but POSIX paths have no encoding, and thus there's no reasonable way to deal with them as anything but byte arrays. POSIX paths are explicitly not characters.

    That is, incidentally, one of the major reasons Linux is a really, really shitty OS.

    @Gribnit said:

    so, you're okay with Asc() existing? but what would anyone ever use it for?

    Nobody's saying you should never convert a char to an int. What people are saying is you should never silently, implicitly convert a char to an int.



  • Banned

    @blakeyrat said:

    If you disagree with me, fine. Disagree with me. Then go the fuck away and read some other thread.

    I wish you followed that rule yourself sometimes!


  • Considered Harmful

    That's weird, they seem to have some kinda implied encoding, per your helpful link:

    "For a filename to be portable across implementations conforming to POSIX.1-2008, it shall consist only of the portable filename character set as defined in Portable Filename Character Set. Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments."

    Now, I admit that I skimmed over the lengthy, ranting piece, so this may be out of context. Perhaps I should be substituting byte for character, to help your argument.

    @blakeyrat said:

    What people are saying is you should never silently, implicitly convert a char to an int.

    Ah, it seemed they were saying there was no point in being able to determine the difference between two characters, as an integer. Your "those should also be characters" proposal is much saner. Hail hail.



  • Considered Harmful

    Certainly. In the original example, for instance, they use what appears to be an int() function - so the conversion was not implicit, and it seems the usage was already, arguably, strongly typed. Not sure what more common type you'd want than an integer type, for a difference between characters, tho.


  • Considered Harmful

    @blakeyrat said:

    Why do I post here? It's just dealing with this kind of garbage all the time. Goddamned.

    You enjoy this, is the reason. To fix this, find something that immediately strikes you as stupid, and then actually figure out why it is and is not stupid, and then do that 1000 times or so, until you stop enjoying finding things stupid.



  • @blakeyrat said:

    Why do I post here?

    Because you are the OP and have a duty / obligation to do so?



  • @‍Gaska - Days Since Last Discourse Bug: 0


  • Banned

    That was fast.


  • FoxDev

    Methinks the operator did some manual piloting…
    Not me; it's not one of mine


  • Discourse touched me in a no-no place

    No, I started it up before pulling the version which ignores old summons, so it picked the old summon up.



  • @blakeyrat said (quoting @Gaska):

    > Has anyone noticed that @blakeyrat's complaint doesn't have anything to do with strong typing

    Yes it does.

    To be fair, this is a matter of perspective. I believe that Ga̧ska is seeing this from the perspective of operators taking a set of operand types, and is arguing that it is perfectly acceptable for an operator to be defined to take any specified operand types, so long as the result is defined for the given types. From this view, it is irrelevant whether the operator '+' represents the operation 'add int a to int b and return an integer' in one context, and 'add int a to the character order of char b and return a char' in another. From this perspective, the values are strongly typed, but the operators do not need to be, because the operator itself is no more or less arbitrary than the name of a function: using '+' as 'add' exploits the Eliza effect to make it more mnemonic and easier to type, but does not bear a direct relationship to the concept of addition.

    @blakeyrat, OTOH, apparently sees the operator as part of the type interface, and as such, a part of the conceptual framework of the type. From this perspective, the problem is not with the use of the operator to perform this operation, but that the operation itself violates the type contract.

    He is further pointing out that the whole idea that there is an intrinsic ordering of characters is an assumption, not a part of the type contract: not all languages use the modern Latin alphabet, those which do don't all use all of it or use it unmodified, and most tellingly, not all of them order it the way English does. Conversely, it is also an assumption - and one known to be invalid in some cases - that the representation of the characters is in the same order as the English-language ordering of the Latin alphabet (he specifically mentioned EBCDIC, where this was indeed not the case), and conflating the two is invalid in a strongly typed language which handles character data in a generalized manner.

    But these are side points: his real argument rests on the position that operand typing is part of the type contract and should follow the principle of least astonishment - which in this case means that if a given operation has no obvious meaning that can be inferred from the type's analogy to what it represents (the Eliza effect, again), then it should not be treated as an intrinsic operation at all.
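
    The first perspective is easy to demonstrate in C#, which happily lets '+' mean something with no numeric content at all (the type below is invented for illustration):

    readonly struct PathSegment
    {
        public readonly string Value;
        public PathSegment(string value) { Value = value; }

        // '+' as "join", purely by convention -- the Eliza effect at work
        public static PathSegment operator +(PathSegment a, PathSegment b)
            => new PathSegment(a.Value.TrimEnd('/') + "/" + b.Value.TrimStart('/'));
    }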


  • Winner of the 2016 Presidential Election

    @another_sam said:

    > s.encode(3) does not select the encode method from the string object, but it is simply syntactic sugar for the function call encode(s,3) where s becomes the first argument.

    GTFO.

    Uniform call syntax. It's going to make future C++ even more unreadable.

