Shame on the majority of the internet for not building in support for the character ‪‫‬‭‮‪‫‬‭‮҉



  • Even google is affected by the character ‪‫‬‭‮‪‫‬‭‮҉

     Out of curiosity, what exactly IS this thing anyway? I figured I'd ask here, because this is the place where I'm the most likely to encounter somebody that knows the story behind this thing. Is this the product of some massive WTF on microsoft's end? I hear it only affects windows computers.

     
    Here are a few links to affected webpages

     
    http://en.wikipedia.org/w/index.php?title=‪‫‬‭‮‪‫‬‭‮҉&action=submit
     



  • I was able to discover that the symbol is used for some language, and it's called a "Combining Cyrillic millions sign," but that still doesn't explain why it messes up every webpage that touches it. Just try to type something after typing that symbol. The following is an example.

     

    ‪‫‬‭‮‪‫‬‭‮҉ I am not writing this backwards. I swear. 



  • My suspicion is that this character is only used in right-to-left languages, and so in order to properly render, the input field has to be switched to reverse text direction when this is encountered.  Such things become important when you use a computer in such languages as Arabic or Hebrew, and I've gathered that Windows is pretty good about supporting right-to-left text just about everywhere.

     

    Perhaps someone with experience with computing in right-to-left languages could correct or elaborate? 



  • @prophet6 said:

    My suspicion is that this character is only used in right-to-left languages, and so in order to properly render, the input field has to be switched to reverse text direction when this is encountered.  Such things become important when you use a computer in such languages as Arabic or Hebrew, and I've gathered that Windows is pretty good about supporting right-to-left text just about everywhere.

     

    Perhaps someone with experience with computing in right-to-left languages could correct or elaborate? 

    This would be consistent with my past experiences with unicode bidi support (it's pretty rough), although I don't know anything about Cyrillic in particular. It tends to fall over badly when confronted with a non-trivial mixture of text running in both directions.

    I'm pretty sure that it was a really bad idea to try to cram layout features into unicode - the result is a half-working kludge that is difficult to integrate with smarter, higher-level layout methods, while appearing to work just well enough that Americans who don't actually rely on the features can dismiss any complaints.



  • @prophet6 said:

    My suspicion is that this character is only used in right-to-left languages, and so in order to properly render, the input field has to be switched to reverse text direction when this is encountered.  Such things become important when you use a computer in such languages as Arabic or Hebrew, and I've gathered that Windows is pretty good about supporting right-to-left text just about everywhere.

     

    Perhaps someone with experience with computing in right-to-left languages could correct or elaborate? 


    Possibly, but IIRC Cyrillic is not a right-to-left script in any language. (Cyrillic is a script, not a language, and it certainly isn't RTL in any language I'm aware of -- it's descended from the Greek script, which is left-to-right like the Latin script used for English.) So if your diagnosis is correct, there are actually [b]two[/b] bugs: incorrect handling of RTL text, and incorrect treatment of this character as RTL (the latter being a data entry error).

    Actually, you aren't really supposed to use combining characters by themselves anyway -- a combining character is a signal to the computer that a nearby glyph should be modified. A little testing suggests that this character clumps together with any Cyrillic character immediately to its left, and Cyrillic letters are allowed to be used as numerals, so perhaps the bug arises because Unicode doesn't specify behavior when the character immediately to the left is not one that can be interpreted as a number, and your computer is winging it.
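
    For what it's worth, Python's unicodedata module (a quick sketch, nothing more authoritative than the Unicode character database it wraps) backs up the "combining character" reading -- and shows that the character itself isn't even right-to-left:

```python
import unicodedata

ch = "\u0489"  # the character under discussion

# Its official name confirms the "Combining Cyrillic millions sign" guess.
print(unicodedata.name(ch))           # COMBINING CYRILLIC MILLIONS SIGN

# General category "Me" = enclosing mark: it wraps around the preceding
# base character rather than standing on its own.
print(unicodedata.category(ch))       # Me

# Its bidirectional class is NSM (non-spacing mark), not R or AL,
# so this character on its own is NOT right-to-left.
print(unicodedata.bidirectional(ch))  # NSM
```

    So whatever flips the text direction, it isn't this mark by itself.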



  • Interestingly enough, the "combining Cyrillic hundred thousands sign" doesn't cause any such issues ( <font size="14">҈</font> ) so... I'm not quite sure what to say.

     (The original, for reference:) <font size="14">‫‬‭‮‪‫‬‭‮҉
     </font>



  • This is the coolest thing ever. It just flips everything on the line following it. Go those crazy Russians ;-)

    ‪‫‬‭‮‪‫‬‭‮҉ My brain hurts

     



  • @pyro789x said:

    Is this the product of some massive WTF on microsoft's end? I hear it only affects windows computers.

     

    It's not Windows-only. I see it on Iceape (Mozilla Seamonkey) on Debian... 



  • @random_garbage said:

    @pyro789x said:

    Is this the product of some massive WTF on microsoft's end? I hear it only affects windows computers.

     

    It's not Windows-only. I see it on Iceape (Mozilla Seamonkey) on Debian... 

     

    Seconded; firefox on Ubuntu does it as well...

    At first I thought it was just something the owners did for fun... Then I looked in the source and noticed <a/>[something]<"[link]"=ferh a>.

    It took me a while to realize that normally that shouldn't even be replaced by a valid hyperlink. So I telnet'ed to the server; it displayed it normally.

     

    Odd...

     

    PS: It even works on IRC :D
     



  • @pyro789x said:

    http://www.livejournal.com/interests.bml?int=‪‫‬‭‮‪‫‬‭‮҉‪‫‬‭‮‪‫‬‭‮҉

    That's not just the Cyrillic symbol in question, it's the Cyrillic symbol surrounded by a whole mess of "left-to-right embedding", "right-to-left embedding", "pop directional formatting", "left-to-right override" and "right-to-left override" control characters - no wonder it behaves oddly.  The symbol itself, "    ҉", behaves perfectly sensibly, other than, as someone already mentioned, the fact that it's a combining character. (I added some extra spaces to try to prevent it from overlapping the previous text, but that probably depends on the font it's being displayed in.)
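
    Assuming the pasted "symbol" really is those five directional controls plus the mark, a paste handler could reveal and strip the invisible part quite easily; a Python sketch:

```python
import unicodedata

# Hypothetical reconstruction of the paste: five directional formatting
# characters (LRE, RLE, PDF, LRO, RLO) followed by the visible mark.
pasted = "\u202a\u202b\u202c\u202d\u202e\u0489"

# The directional controls are general category "Cf" (format, invisible);
# filtering them out leaves only the enclosing mark itself.
visible = "".join(ch for ch in pasted if unicodedata.category(ch) != "Cf")

print(len(pasted))           # 6
print(len(visible))          # 1
print(visible == "\u0489")   # True
```

    Stripping category-Cf characters from untrusted input is a blunt instrument (it also removes legitimate bidi controls), but it would defuse this particular prank.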



  • So if you add spaces before it, nothing goes wrong? I gotta try that out. "     ҉" Hmm, you appear to be correct. How strange.

     

    " ҉" test. test ҉ test. test҉ test. Well now that's just weird, now it's not doing anything wrong for me. ҉҉҉hello internet!

    shrug I don't know what to say. The only odd thing I notice now is that all my <enter> presses come out double-spaced instead of single-spaced.



  • Alright, now that's strange. When I copy-pasted the version from your post, nothing happened. But when I copy-paste the original version from my posts before yours, it works just as I outlined above. Maybe it's a quirk of the way I copy-pasted mine from the original source (some run-down internet forum). Does this mean it's possible to hack fonts now? :/ "   <font size="14">‫‬‭‮‪‫‬‭‮҉" test
    </font>



  • It has absolutely nothing to do with the "Combining Cyrillic millions character", the payload is all in the invisible control characters before it, which are carried over into the clipboard in most browsers apparently.  The "Combining Cyrillic millions character" was likely chosen just because it looks neat.  For example, I've endowed the following capital letter F with the property:

     

    ‪‫‬‭‮‪‫‬‭‮F



  • ‪‫‬‭‮‪‫‬‭‮Fthat's so not true. Hmm. I see. Alright then, you make a valid argument.‪‫‬‭‮‪‫‬‭‮F‪‫‬‭‮‪‫‬‭‮F‪‫‬‭‮‪‫‬‭‮F

     

    Hmmm.. alright, I give, how did you do that? =P Is there any sort of table or something out there that explains how this works and lists the various character values that cause different effects? =P 



  • @pyro789x said:

    Hmmm.. alright, I give, how did you do that? =P Is there any sort of table or something out there that explains how this works and lists the various character values that cause different effects? =P

    The main reference is the Unicode standard - (most of?) the special characters are listed in chapter 16, with details of the left-to-right/right-to-left rules in Annex #9 (UAX #9).  There might be a tutorial somewhere that would be more digestible than the standard itself, but I don't know of one.  I'm sure Mr Google will help you if you ask nicely. ;-)



  • Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on an XP Home SP1 system with the eastern language pack installed. I'm usually able to view web pages with Chinese, Japanese, Cyrillic, etc. characters in them.

    top of the first post as I see it



  • @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed.


    It's a non-displaying character, so it has no glyph. Firefox (all versions, Windows, Mac, and Linux) displays a question mark if it can't find a glyph for a character in any installed font.



  • @The Vicar said:

    @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed.


    It's a non-displaying character, so it has no glyph. Firefox (all versions, Windows, Mac, and Linux) displays a question mark if it can't find a glyph for a character in any installed font.

    Actually, it does display if you have the proper font installed. 



  • @bugmenot1 said:

    @The Vicar said:
    @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed.


    It's a non-displaying character, so it has no glyph. Firefox (all versions, Windows, Mac, and Linux) displays a question mark if it can't find a glyph for a character in any installed font.

    Actually, it does display if you have the proper font installed. 


    Sounds like a bug to me -- a combining character has no glyph of its own, so it should never display by itself.



  • @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed. I'm usually able to view web pages with Chinese, Japanese, Cyrillic, etc. characters in them.

    top of the first post as I see it

    It should look like some sort of poof symbol, except it's built with commas.

    PS.
    Firefox 2+ has been out for quite a while now. 



  • @The Vicar said:

    @bugmenot1 said:
    @The Vicar said:
    @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed.


    It's a non-displaying character, so it has no glyph. Firefox (all versions, Windows, Mac, and Linux) displays a question mark if it can't find a glyph for a character in any installed font.

    Actually, it does display if you have the proper font installed.


    Sounds like a bug to me -- a combining character has no glyph of its own, so it should never display by itself.

    The control characters aren't displayed - what should be visible is the weird Cyrillic symbol, which actually has no side-effects of its own. You can see the magic in action if you copy the character in the topic title and paste it into an editor that only supports ASCII. You will get a whole load of "?" symbols instead of just the two that a single unicode char is supposed to cause.
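
    The row of "?" is easy to reproduce; a small Python sketch, assuming the title carries the five directional controls plus the mark as described earlier in the thread:

```python
# Hypothetical reconstruction of the title's character run: five invisible
# directional controls plus the combining Cyrillic millions sign.
payload = "\u202a\u202b\u202c\u202d\u202e\u0489"

# An ASCII-only viewer has no representation for any of the six code
# points, so each one degrades to a "?":
ascii_view = payload.encode("ascii", errors="replace").decode("ascii")
print(ascii_view)                     # ??????

# A byte-wise viewer sees even more garbage: the UTF-8 encoding is
# 17 bytes long (3 bytes per control character, 2 for the mark).
print(len(payload.encode("utf-8")))   # 17
```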

    How it's supposed to look:

     



  • @PSWorx said:

    @The Vicar said:
    @bugmenot1 said:
    @The Vicar said:
    @joemck said:

    Is it just me or does this symbol look like a ? [question mark] to anyone else? I'm viewing this in Firefox 1.5.0.12 on a XP Home SP1 system with the eastern language pack installed.


    It's a non-displaying character, so it has no glyph. Firefox (all versions, Windows, Mac, and Linux) displays a question mark if it can't find a glyph for a character in any installed font.

    Actually, it does display if you have the proper font installed.


    Sounds like a bug to me -- a combining character has no glyph of its own, so it should never display by itself.

    The control characters aren't displayed - what should be visible is the weird Cyrillic symbol that actually has no side-effects of its own. You can see the magic in action if you copy the character in the topic title and paste it into an editor that only supports ASCII. You will get a whole load of "?" symbols instead of just the two a single unicode char is supposed to cause. 


    I went and looked it up in the Unicode spec -- the Code Charts section at http://www.unicode.org/charts/ (I'd link that, but the forum software doesn't work very well on Safari) lets you see a sample version of every defined character -- and you're right. They [b]do[/b] define a glyph for it as a standalone. Well, I'll be dipped.

    (And by the way: I'm on a Mac. I'm not even sure if there [b]are[/b] any editors on the Mac these days that only support ASCII. There are plenty that can restrict to ASCII on save, but all of them seem to use Apple's built-in editing support for the intermediate stage, which is Unicode-based, and then strip out the unmappable characters after the fact.)

    (As long as I'm doing parenthetical asides, I looked up the characters for keystrokes, after that other thread that erroneously claimed that Apple was using the key combination "option-power" in XCode. Somebody had claimed, I forget who and I'm too lazy to go look it up, that the symbol they used for "Escape" is not standard, but it appears in Unicode in the "Miscellaneous Technical" code table, and the Code Chart has a note saying "= escape" after the name. So it looks like Apple was doing the standard thing after all.)



  • @The Vicar said:


    (And by the way: I'm on a Mac. I'm not even sure if there [b]are[/b] any editors on the Mac these days that only support ASCII. There are plenty that can restrict to ASCII on save, but all of them seem to use Apple's built-in editing support for the intermediate stage, which is Unicode-based, and then strip out the unmappable characters after the fact.)

    You don't need a full-fledged editor, just something where you can paste text. How about the command line? Works like a charm in windows. Or just use a clipboard viewer.



  • Speaking of Windows:

     

     

    The unicode support is indeed mesmerizing... 



  • firefox, Kubuntu feisty fawn:

     

     

    See how the browser title is broken, as is the source file. BTW, I tested on my brother's PC, a Windows XP machine with Firefox, and the page displays properly. 



  • Why venture so far? 
    @this very page said:
    <title>Worse Than Failure - Shame on the majority of the internet for not building in support for the character ‪‫‬‭‮‪‫‬‭‮҉</title>
    <meta name="GENERATOR" content="CommunityServer 2.1 SP2 (Build: 61129.2)" />
    <link rel="shortcut icon" type="image/ico" href="/favicon.ico" />
    <link rel="alternate" type="application/rss+xml" title=""Side Bar" WTF (RSS 2.0)" href="http://forums.worsethanfailure.com/forums/rss.aspx?ForumID=18&Mode=0" />
    <link rel="alternate" type="application/rss+xml" title="Shame on the majority of the internet for not building in support for the character ‪‫‬‭‮‪‫‬‭‮҉ (RSS 2.0)" href="http://forums.worsethanfailure.com/forums/rss.aspx?ForumID=18&PostID=128877" />

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    This thing is the ultimate source code obfuscator! Combine that with base64 and we finally have the image protection the other thread was looking for! </irony>



  • @pyro789x said:

    <font size="14">҈</font>


     

    That character should only be used when referring to a really, really happy spider... ^^

     



  • If we break down the UTF-8 sequence in the original URL, it turns out to be:

    U+202A U+202B U+202C U+202D U+202E U+0489

    The last character, U+0489, is the combining Cyrillic millions sign. The character immediately before it, U+202E, is a non-printing Right-to-Left Override character, which causes any following text to be printed as right-to-left text.

    So iwpg and seaturnip are correct. There's nothing unusual about the combining Cyrillic millions sign. In fact, I applaud browsers for rendering the bidirectional overrides correctly.
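
    That breakdown can be verified mechanically; a Python sketch (the byte string below is just the UTF-8 encoding of the six code points listed above):

```python
import unicodedata

# UTF-8 bytes for U+202A U+202B U+202C U+202D U+202E U+0489
raw = b"\xe2\x80\xaa\xe2\x80\xab\xe2\x80\xac\xe2\x80\xad\xe2\x80\xae\xd2\x89"
text = raw.decode("utf-8")

# Name each code point to expose the invisible controls.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+202A  LEFT-TO-RIGHT EMBEDDING
# U+202B  RIGHT-TO-LEFT EMBEDDING
# U+202C  POP DIRECTIONAL FORMATTING
# U+202D  LEFT-TO-RIGHT OVERRIDE
# U+202E  RIGHT-TO-LEFT OVERRIDE
# U+0489  COMBINING CYRILLIC MILLIONS SIGN
```

    The U+202E right-to-left override at the end of the control run is the one doing all the damage.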



  • @VGR said:

    In fact, I applaud browsers for rendering the bidirectional overrides correctly.

    And I'd just like to restate my desire for the unicode consortium to be repeatedly kicked in the head for making them exist in the first place. 



  • @VGR said:

    If we break down the UTF-8 sequence in the original URL, it turns out to be:

    U+202A U+202B U+202C U+202D U+202E U+0489

    The last character, U+0489, is the combining Cyrillic millions sign.

     

    Which means if you want to use this sign as part of  a҉  web page, you would put "&#1161;" immediately AFTER the character you want to appear "inside" the sign.



  • @asuffield said:

    @VGR said:

    In fact, I applaud browsers for rendering the bidirectional overrides correctly.

    And I'd just like to restate my desire for the unicode consortium to be repeatedly kicked in the head for making them exist in the first place. 

    I'm trying to fix a Wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (the percent signs below stand in for Hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    (the article is http://en.wikipedia.org/wiki/Chronogram, by the way)
     



  • This thing is the ultimate source code obfuscator! Combine that with base64 and we finally have the image protection the other thread was looking for! </irony>

     

    Yeah, too bad all the formatting goes away if you just copy-paste it without the reverse character =P



  • @Random832 said:

    I'm trying to fix a wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (percent signs are hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    Try using the Left-to-Right Mark and Right-to-Left Mark characters:

    [&rlm;%%%%%%%&lrm; = 391; together 471]

    (or: [&#8207;%%%%%%%&#8206; = 391; together 471])

    (Details on why this works are here.)



  • @VGR said:

    @Random832 said:

    I'm trying to fix a wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (percent signs are hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    Try using the Left-to-Right Mark and Right-to-Left Mark characters:

    [&rlm;%%%%%%%&lrm; = 391; together 471]

    (or: [&#8207;%%%%%%%&#8206; = 391; together 471])

    (Details on why this works are here.)

    As it is, I'm using unicode directional embedding characters, but I was hoping to avoid using any unicode "markup" at all - is there a way to do this in pure html? 



  • @Random832 said:

    @VGR said:
    @Random832 said:

    I'm trying to fix a wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (percent signs are hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    Try using the Left-to-Right Mark and Right-to-Left Mark characters:

    [&rlm;%%%%%%%&lrm; = 391; together 471]

    (or: [&#8207;%%%%%%%&#8206; = 391; together 471])

    (Details on why this works are here.)

    As it is, i'm using unicode directional embedding characters, but i was hoping to avoid using any unicode "markup" at all - is there a way to do this in pure html? 

    Looks like the source of the problem is that the equals sign has "neutral" directionality. Forcing its directionality seems to work:

    [%%%%%%% <span dir="ltr">=</span> 391; together 471]
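
    The "neutral directionality" claim checks out against the Unicode database itself; a Python sketch using unicodedata:

```python
import unicodedata

# Bidi classes: L = strong left-to-right, R = strong right-to-left,
# EN = European number, ON = other neutral. Neutrals like "=" take their
# direction from the surrounding text, which is exactly the ambiguity the
# dir="ltr" span is pinning down.
print(unicodedata.bidirectional("A"))       # L
print(unicodedata.bidirectional("\u05d0"))  # R  (Hebrew alef)
print(unicodedata.bidirectional("3"))       # EN
print(unicodedata.bidirectional("="))       # ON - the neutral in question
```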



  • @VGR said:

    @Random832 said:
    @VGR said:
    @Random832 said:

    I'm trying to fix a wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (percent signs are hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    Try using the Left-to-Right Mark and Right-to-Left Mark characters:

    [&rlm;%%%%%%%&lrm; = 391; together 471]

    (or: [&#8207;%%%%%%%&#8206; = 391; together 471])

    (Details on why this works are here.)

    As it is, i'm using unicode directional embedding characters, but i was hoping to avoid using any unicode "markup" at all - is there a way to do this in pure html? 

    Looks like the source of the problem is that the equals sign has "neutral" directionality. Forcing its directionality seems to work:

    [%%%%%%% <span dir="ltr">=</span> 391; together 471]

    Note that if unicode had none of the botched bidi stuff at all, then this would be the correct solution:

    [<span dir="rtl">%%%%%%</span> = 391; together 471]

    Which is exactly what you would expect it to be. This is why I hate unicode bidi support: it breaks what would otherwise be a perfectly straightforward system to use on the html level. 



  • @asuffield said:

    @VGR said:
    @Random832 said:
    @VGR said:
    @Random832 said:

    I'm trying to fix a wikipedia article so that it does not need bidirectional overrides to display properly. I haven't found any combination of <span dir="rtl">, <span dir="ltr"> that will do this (percent signs are hebrew text)

    ...and seek aid [%%%%%% = 391; together 471] in the sixth millennium."

    It always renders as [391 = %%%%%%; together 471]

    Try using the Left-to-Right Mark and Right-to-Left Mark characters:

    [&rlm;%%%%%%%&lrm; = 391; together 471]

    (or: [&#8207;%%%%%%%&#8206; = 391; together 471])

    (Details on why this works are here.)

    As it is, i'm using unicode directional embedding characters, but i was hoping to avoid using any unicode "markup" at all - is there a way to do this in pure html? 

    Looks like the source of the problem is that the equals sign has "neutral" directionality. Forcing its directionality seems to work:

    [%%%%%%% <span dir="ltr">=</span> 391; together 471]

    Note that if unicode had none of the botched bidi stuff at all, then this would be the correct solution:

    [<span dir="rtl">%%%%%%</span> = 391; together 471]

    Which is exactly what you would expect it to be. This is why I hate unicode bidi support: it breaks what would otherwise be a perfectly straightforward system to use on the html level. 

    Wait - you think unicode should have NO directionality information? The only way for your solution to be correct is if = doesn't have "neutral" directionality, and the only reason the Hebrew needs a span in that case is if _it_ doesn't have "right-to-left" directionality. That is, are you saying that in the simple case, where I have a run of English text that mentions a Hebrew word, the Hebrew word would have to be either visually encoded or put in an explicit RTL span?

    That's absurd.
     



  • @Random832 said:

    @asuffield said:

    Note that if unicode had none of the botched bidi stuff at all, then this would be the correct solution:

    [<span dir="rtl">%%%%%%</span> = 391; together 471]

    Which is exactly what you would expect it to be. This is why I hate unicode bidi support: it breaks what would otherwise be a perfectly straightforward system to use on the html level. 

    Wait - you think unicode should have NO directionality information? (the only way for your solution to be correct is if = doesn't have "neutral" directionality, and the only reason the hebrew needs a span in that case is if it doesn't have "right to left" directionality) - that is, are you saying that in the simple case, if i have a run of english text that mentions a hebrew word, the hebrew word would have to be either visually encoded or put in an explicit RTL span?

    Right, more or less. Layout is a task properly solved at the level of the layout engine - in this case, html. Attempting to solve it at a lower level generates the kind of stupidity that we can see here, because you can't actually solve it sensibly at that level, and it interferes with the layout engine's ability to do the sane thing. If you have text which mixes different directions, then you're always going to have to explain this to the layout engine, so you might as well explain what you want the text to do, rather than how you want the layout engine to work around unicode's braindamaged heuristics.

    There is no single correct answer for the way in which any text should be rendered based on the character values of that text. This information must be supplied by the author, and it should be supplied as markup, not as magic character sequences. Unicode's massive failing is in breaking both these rules.

    Note that if unicode/html/css had been properly designed in this fashion, then you would specify the direction of text rendering in the stylesheet, and you would just tag the relevant parts with a suitable style. (One could even speculate about a markup system where you could specify that in this document, words using hebrew characters would default to right-to-left display - that doesn't fit within the css model, but nobody ever accused css of being an ideal model)

    Also note that some languages are properly written top-to-bottom, and unicode doesn't even try to get those right.



  • And how the hell do you mix hebrew or arabic with english in a text file? (let me guess your answer: "visual encoding")

    This information must be supplied by the author, and it should be supplied as markup, not as magic character sequences.

    How is '<span dir="rtl">' not a magic character sequence? How are the unicode bidi control characters not markup?

    Regardless, there is NO reason that hebrew/arabic words within a run of english text, or numbers within a run of hebrew/arabic text, should have to be tagged.

    And it is the layout engine's responsibility. Unicode's bidirectionality algorithm, ignoring for the moment the control characters, is simply a standard saying that A) layout engines should figure out directionality based on context without requiring everything to be explicit and B) guidelines on how to accomplish this consistently across different layout engines.
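
    As a concrete example of "figure out directionality based on context": the paragraph-level rule in UAX #9 (P2/P3) is essentially a first-strong-character scan. A simplified Python sketch (the real rule also skips over isolate runs, which this ignores):

```python
import unicodedata

def paragraph_direction(text: str) -> str:
    """Simplified UAX #9 rules P2/P3: the first character with a strong
    bidi class (L, R or AL) decides the paragraph direction."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # P3: default to left-to-right if nothing strong is found

print(paragraph_direction("aid \u05d0\u05d1 471"))  # ltr - starts with English
print(paragraph_direction("\u05d0\u05d1 = 391"))    # rtl - starts with Hebrew
```

    Nothing in the text had to be tagged for this to work - which is exactly the point being argued here.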

    (One could even speculate about a markup system where you could specify that in this document, words using hebrew characters would default to right-to-left display - that doesn't fit within the css model, but nobody ever accused css of being an ideal model)

    How about a markup system where you specify that in the entire universe, words using hebrew characters default to right-to-left display, by publishing a unicode standard? While documents that are intended to display hebrew characters left-to-right do exist, they are widely accepted to be a bad idea.



  • @asuffield said:

    There is no single correct answer for the way in which any text should be rendered based on the character values of that text. This information must be supplied by the author, and it should be supplied as markup, not as magic character sequences. Unicode's massive failing is in breaking both these rules.

    Note that if unicode/html/css had been properly designed in this fashion, then you would specify the direction of text rendering in the stylesheet, and you would just tag the relevant parts with a suitable style. (One could even speculate about a markup system where you could specify that in this document, words using hebrew characters would default to right-to-left display - that doesn't fit within the css model, but nobody ever accused css of being an ideal model)

    What you are in effect saying is that there should be no such thing as a plain text document. I think an explicit goal of Unicode was to allow plain text documents, without any markup.

    @asuffield said:

    Also note that some languages are properly written top-to-bottom, and unicode doesn't even try to get those right.

    I'm not sure what you mean. Unicode has at least three whole blocks that exist only to support vertical writing.



  • @Random832 said:

    And how the hell do you mix hebrew or arabic with english in a text file? (let me guess your answer: "visual encoding")

    Same way that you build a skyscraper out of cheese: you don't. This is not a sane thing to attempt. If you want to mix text with different non-trivial layout requirements, you need a layout engine.

    This information must be supplied by the author, and it should be supplied as markup, not as magic character sequences.

    How is '<span dir="rtl">' not a magic character sequence? How are the unicode bidi control characters not markup?

    The difference is the existence of this thread.

    Regardless, there is NO reason that hebrew/arabic words within a run of english text, or numbers within a run of hebrew/arabic text, should have to be tagged.

    There is a reason, which I already stated. 

    How about a markup system where you specify that in the entire universe, words using hebrew characters default to right-to-left display, by publishing a unicode standard? While documents that are intended to display hebrew characters left-to-right do exist, they are widely accepted to be a bad idea.

    We have also already covered why that is hopelessly broken, complete with examples.

    I do not think that you have read the thread. 



  • @VGR said:

    What you are in effect saying is that there should be no such thing as a plain text document. I think an explicit goal of Unicode was to allow plain text documents, without any markup.

    Right, mostly. The goal was to allow plain text documents to have features of a layout engine without actually having a layout engine. I'm saying that this was a bad idea. The sane thing to do would have been to say that a plain text document would either be entirely displayed left-to-right or entirely right-to-left, and if you want anything more than that, use something smarter than plain text.

     

    @asuffield said:

    Also note that some languages are properly written top-to-bottom, and unicode doesn't even try to get those right.

    I'm not sure what you mean. Unicode has at least three whole blocks that exist only to support vertical writing.

    It doesn't have an analog of the bidi stuff - there's no way to blend horizontal and vertical writing in a "plain text" file.



  • @asuffield said:

    @Random832 said:

    And how the hell do you mix hebrew or arabic with english in a text file? (let me guess your answer: "visual encoding")

    Same way that you build a skyscraper out of cheese: you don't.

    That's not a reasonable answer. That hasn't been a reasonable answer for going on ten years now.

    Providing transparent bidirectional layout for text files is a legitimate need. UAX 9 meets that need. The fact that the control characters it provides to let authors clarify edge cases can be abused doesn't mean it's a bad idea. Lots of things can be abused.

    @asuffield said:

    This is not a sane thing to attempt. If you want to mix text with different non-trivial layout requirements, you need a layout engine.

    Right, and UAX 9 provides a spec for a layout engine to be used with unicode text files. Your main objection seems to be that unicode is more than just a numbers-to-glyphs mapping (I assume that you would equally object to arabic character shaping, combining diacritical marks, collation rules beyond mere bytestring (wordstring?) comparison, and [at least here you wouldn't be alone] unified CJK ideographs). Maybe you even object to their recommendations on which non-ascii characters should be allowed in identifiers in unicode-supporting programming languages. I bet you object to the soft hyphen in ISO 8859-1, too.

    You keep saying "you need a layout engine, you need a layout engine" - yeah. Every text editor (for that matter, every program that DISPLAYS text), or at least every text editor used by someone who writes in a language that is written right-to-left, needs a layout engine. So what's wrong with providing a standard for the behavior of such layout engines, so that people won't be surprised moving from one program to another?
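    As a sketch of why untagged mixing mostly works: every code point already carries a Bidi_Class property that a UAX 9 implementation reads, so Hebrew letters, Latin letters, and digits each declare their own directionality with no markup. Python's stdlib exposes the property:

```python
import unicodedata

# Bidi_Class for a mix of Latin, Hebrew, and digits: 'L', 'R', 'EN'.
# A UAX 9 layout engine orders these correctly without any tagging.
for ch in "a\u05d01":  # 'a', HEBREW LETTER ALEF, '1'
    print(ch, unicodedata.bidirectional(ch))
```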

    @asuffield said:
    @Random832 said:
    @asuffield said:

    This information must be supplied by the author, and it should be supplied as markup, not as magic character sequences.

    How is '<span dir="rtl">' not a magic character sequence? How are the unicode bidi control characters not markup?

    The difference is the existence of this thread.

    Your answer is incoherent - the only thing relevant to the existence of this thread is that the unicode control characters are invisible [though they can be made visible with a specialized editor].

    @asuffield said:
    @Random832 said:

    Regardless, there is NO reason that hebrew/arabic words within a run of english text, or numbers within a run of hebrew/arabic text, should have to be tagged.

    There is a reason, which I already stated. 

    Your reason is nonsense.

    @asuffield said:

    Right, mostly. The goal was to allow plain text documents to have features of a layout engine without actually having a layout engine. I'm saying that this was a bad idea. The sane thing to do would have been to say that a plain text document would either be entirely displayed left-to-right or entirely right-to-left, and if you want anything more than that, use something smarter than plain text.

    No, the goal was, given that (by consensus) all text editors/viewers need a layout engine, to standardize the behavior of those layout engines. Even when something displays subjectively "wrong", it's predictable and a control character used to fix it up will have the same effect everywhere.

    The goal was to make plain text smarter. Just like the soft hyphen twenty years ago. The only difference is that the bidirectional algorithm is more widely supported today than the soft hyphen was.

    Hell, forget the soft hyphen - even the non-breaking space (which is supported everywhere) makes text smarter than, by your argument, it should be. Breaking lines requires a layout engine; without one, there's no difference between a "non-breaking" space and a plain space. By your logic, we should abandon nbsp, and any text that needs to not break on spaces should be enclosed in <span style="white-space:nowrap">, which would be an instruction to the "layout engine" to abandon its default behavior of breaking lines on spaces. Plain text should just have an explicit CR and/or LF wherever a line break is needed, and if there shouldn't be a line break between words, just put an ordinary space and no CR/LF. That is, after all, the "sane" thing to do.

    But if allowing text files to wrap, to contain soft hyphens, to mix directionality/combining diacritics/arabic, is insane... I don't want to live in a "sane" world.
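    The nbsp and soft hyphen the argument leans on are ordinary Latin-1 code points, not exotica - a small illustration using Python's stdlib:

```python
import unicodedata

# NO-BREAK SPACE (U+00A0) and SOFT HYPHEN (U+00AD): "smart" plain-text
# characters that predate the bidi controls, both present in ISO 8859-1,
# so both round-trip through a plain Latin-1 text file.
nbsp, shy = "\u00a0", "\u00ad"
print(unicodedata.name(nbsp), nbsp.encode("latin-1"))
print(unicodedata.name(shy), shy.encode("latin-1"))
```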



  • @asuffield said:

    It doesn't have an analog of the bidi stuff - there's no way to blend horizontal and vertical writing in a "plain text" file.

    That's because they're not often mixed in the real world - typically if text that can only be written horizontally occurs in a vertically-written document, it is simply rotated 90°.



  • @Random832 said:

    The goal was to make plain text smarter. Just like the soft hyphen twenty years ago. The only difference is that the bidirectional algorithm is more widely supported today than the soft hyphen was.

    Hell, forget the soft hyphen - even the non-breaking space (which is supported everywhere) makes text smarter than, by your argument, it should be. Breaking lines requires a layout engine; without one, there's no difference between a "non-breaking" space and a plain space. By your logic, we should abandon nbsp, and any text that needs to not break on spaces should be enclosed in <span style="white-space:nowrap">, which would be an instruction to the "layout engine" to abandon its default behavior of breaking lines on spaces. Plain text should just have an explicit CR and/or LF wherever a line break is needed, and if there shouldn't be a line break between words, just put an ordinary space and no CR/LF. That is, after all, the "sane" thing to do.

    P.S. Take it a step further - even the line break is too "smart". Just pad the line out to 80 characters; then you don't need any special control characters like "line feed" or "carriage return". For CJK 'wide' characters, just encode the left and right halves separately; then every single code point refers to a glyph that goes in a single cell on a rectangular grid, and text files will need no "layout engine" at all.
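    The 'wide' cells the joke relies on are a real Unicode property (East_Asian_Width) - a sketch using Python's stdlib:

```python
import unicodedata

# East_Asian_Width: 'Na' = narrow, 'F' = fullwidth, 'W' = wide.
# Wide and fullwidth characters occupy two cells on a terminal grid.
for ch in "A\uff21\u6f22":  # LATIN A, FULLWIDTH A, CJK ideograph 漢
    print(ch, unicodedata.east_asian_width(ch))
```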



  • @asuffield said:

    @VGR said:

    What you are in effect saying is that there should be no such thing as a plain text document. I think an explicit goal of Unicode was to allow plain text documents, without any markup.

    Right, mostly. The goal was to allow plain text documents to have features of a layout engine without actually having a layout engine. I'm saying that this was a bad idea. The sane thing to do would have been to say that a plain text document would either be entirely displayed left-to-right or entirely right-to-left, and if you want anything more than that, use something smarter than plain text.

    That may be okay for Latin languages, but right-to-left languages commonly embed Latin characters. The idea that they cannot make use of plain text files probably doesn't sit well with technical people who use those languages.

    @asuffield said:

    @VGR said:
    @asuffield said:

    Also note that some languages are properly written top-to-bottom, and unicode doesn't even try to get those right.

    I'm not sure what you mean. Unicode has at least three whole blocks that exist only to support vertical writing.

    It doesn't have an analog of the bidi stuff - there's no way to blend horizontal and vertical writing in a "plain text" file.

    I don't know any Asian languages, so I can't be sure, but the "square" Latin characters in the 3300-33FF block appear to be aimed at doing just that.



  • @VGR said:

    @asuffield said:

    It doesn't have an analog of the bidi stuff - there's no way to blend horizontal and vertical writing in a "plain text" file.

    I don't know any Asian languages, so I can't be sure, but the "square" Latin characters in the 3300-33FF block appear to be aimed at doing just that.

    Actually, they're strictly for compatibility with the rather ill-considered "full width" characters in various asian character sets, which were never intended for vertical writing, but just to get latin characters into the DBCS standards independent of the ones in ascii [despite the fact that most encodings also allow ascii characters] - the same, in reverse, goes for the halfwidth katakana. But his point is that there's no actual _algorithm_ for what a program is supposed to do if you mix horizontal characters with vertical ones.
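    That "strictly for compatibility" status is visible in normalization: NFKC folds the fullwidth latin and halfwidth katakana compatibility characters back to their ordinary forms. A quick sketch in Python:

```python
import unicodedata

# NFKC folds compatibility characters to their canonical counterparts:
# fullwidth 'ＡＢＣ' becomes plain ASCII 'ABC', and halfwidth
# katakana 'ｶ' becomes the ordinary katakana 'カ'.
print(unicodedata.normalize("NFKC", "\uff21\uff22\uff23"))
print(unicodedata.normalize("NFKC", "\uff76"))
```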



  • @Random832 said:

    @VGR said:

    I don't know any Asian languages, so I can't be sure, but the "square" Latin characters in the 3300-33FF block appear to be aimed at doing just that.

    Actually, they're strictly for compatibility with the rather ill-considered "full width" characters in various asian character sets, which were never intended for vertical writing, but to just get latin characters into the DBCS standard independent of the ones in ascii [despite that most encodings also allow ascii characters] - same, in reverse, by the way, for the halfwidth katakana. But his point is there's no actual _algorithm_ for what a program's supposed to do if you mix horizontal characters with vertical ones.

    This seems to provide at least recommendations for an algorithm. I hardly ever see vertical writing on computers (though I'm well aware that it exists), so it's hard for me to visualize the things that document describes. But it seems like there are at least some rendering rules, even if they're not as strict as the bidi rules.



  • I don't care how it affects Unicode rendering, what does the character MEAN?  We'll never know, apparently.



  • @VGR said:

    @Random832 said:
    @VGR said:

    I don't know any Asian languages, so I can't be sure, but the "square" Latin characters in the 3300-33FF block appear to be aimed at doing just that.

    Actually, they're strictly for compatibility with the rather ill-considered "full width" characters in various asian character sets, which were never intended for vertical writing, but to just get latin characters into the DBCS standard independent of the ones in ascii [despite that most encodings also allow ascii characters] - same, in reverse, by the way, for the halfwidth katakana. But his point is there's no actual _algorithm_ for what a program's supposed to do if you mix horizontal characters with vertical ones.

    This seems to provide at least recommendations for an algorithm. I hardly ever see vertical writing on computers (though I'm well aware that it exists), so it's hard to me to visualize the things that document describes. But it seems like there's at least some rendering rules, even if they're not as strict as Bidi rules.

    That's not related to vertical writing; it's about how to deal with the hankaku/zenkaku problem. Basically, the eastern ideograms are the wrong shape to work neatly with latin characters in monospaced mode, so you either need two sets of latin glyphs in your font, or you get big white gaps between the letters. It just happens to contain a few recommendations about how to treat the full-width characters when rotating text in your layout engine.

