Unicode emojis are stupid
-
Does anyone know of a way to get the number of glyphs (not simply characters) in a string in C#?
For instance, take the string "👨‍👨‍👧‍👧". The character count is 11, or 7 if you take surrogate pairs into account. But there's only one glyph there.
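To make those numbers concrete, a quick C# sketch of where they come from; the \u200D escapes are the ZERO WIDTH JOINERs that glue the family together:
using System;

// Two MAN and two GIRL emoji joined by three ZERO WIDTH JOINERs (U+200D).
string s = "👨\u200D👨\u200D👧\u200D👧";

Console.WriteLine(s.Length); // 11 UTF-16 code units: 4 surrogate pairs + 3 ZWJs

int codePoints = 0;
for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    codePoints++;
Console.WriteLine(codePoints); // 7 code points: 4 emoji + 3 ZWJs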
-
The number of glyphs also depends on the font.
-
@ben_lubar said in Unicode emojis are stupid:
The number of glyphs also depends on the font.
For now, let's just say whatever Windows 10 renders by default in an en_US locale. I don't know the name of the default font, but whatever that is.
-
@jaloopa said in Unicode emojis are stupid:
@pie_flavor said in Unicode emojis are stupid:
Unicode ~~emojis are~~ is stupid
FTFY
-
Bytes, code units and glyphs are different concepts. Also, before counting glyphs, you need to define them in a way a computer can understand (i.e., a completely literal and unambiguous way - good luck).
-
@jaloopa said in Unicode emojis are stupid:
@pie_flavor said in Unicode emojis are stupid:
Unicode ~~emojis are~~ is stupid
FTFY
If the Emoji Movie turns out to be the flop of the year (animation-wise, at least), it will restore some of my faith in mankind.
-
@khudzlin said in Unicode emojis are stupid:
Bytes, code units and glyphs are different concepts.
QFT. Also there are combining characters, and if there's UTF-16 in the mix then you've also got surrogates.
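Combining characters in miniature, as a sketch:
string s = "e\u0301";        // 'e' + U+0301 COMBINING ACUTE ACCENT
Console.WriteLine(s);        // renders as é: one glyph
Console.WriteLine(s.Length); // 2: two code points, two code units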
-
@dkf said in Unicode emojis are stupid:
if there's UTF-16 in the mix then you've also got surrogates.
Yeah, UTF-16 is :wtf:, especially because most people who claim to use UTF-16 instead use the even more obsolete UCS-2 (i.e., not dealing with surrogates correctly).
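The UCS-2 trap in one sketch: indexing by char hands you half a surrogate pair without complaint:
string s = "👨";             // U+1F468 MAN: one code point
Console.WriteLine(s.Length); // 2: UTF-16 stores it as a surrogate pair
Console.WriteLine(char.IsHighSurrogate(s[0])); // True; s[0] alone means nothing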
-
@khudzlin said in Unicode emojis are stupid:
Bytes, code units and glyphs are different concepts. Also, before counting glyphs, you need to define them in a way a computer can understand (i.e., a completely literal and unambiguous way - good luck).
I am fully aware of these things. As I understand it, a combination of Windows and the current font decides which characters / surrogate pairs make up a glyph. I am not necessarily saying that it has anything to do with characters; I'm just asking how to get the total number of glyphs in a string, with default everything.
-
@khudzlin said in Unicode emojis are stupid:
@dkf said in Unicode emojis are stupid:
if there's UTF-16 in the mix then you've also got surrogates.
Yeah, UTF-16 is :wtf:, especially because most people who claim to use UTF-16 instead use the even more obsolete UCS-2 (i.e., not dealing with surrogates correctly).
Sending the actual glyph instead of the text shortcut (👨‍👨‍👧‍👧 vs. :family_mwgb:) is :wtf:. But I can't control my platform, just what I do with it.
-
@pie_flavor Implement a good chunk of a font renderer.
Signed,
The guy who spent 2 hours yesterday trying to teach the vagaries of font rendering to the intellectual equivalent of a box turtle
-
Glyphs are beyond the scope of Unicode; they're a rendering concept.
-
@khudzlin said in Unicode emojis are stupid:
Glyphs are beyond the scope of Unicode; they're a rendering concept.
But there has to be some way to do this, yes?
-
@pie_flavor said in Unicode emojis are stupid:
@khudzlin said in Unicode emojis are stupid:
Glyphs are beyond the scope of Unicode; they're a rendering concept.
But there has to be some way to do this, yes?
Define a glyph in a way a computer can understand, then.
-
@pie_flavor said in Unicode emojis are stupid:
@khudzlin said in Unicode emojis are stupid:
Glyphs are beyond the scope of Unicode; they're a rendering concept.
But there has to be some way to do this, yes?
Converting to glyphs requires both the font and the string because of the complexity of handling ligatures and other forms of combining. I believe that font engines often provide a way for their clients to get the glyph sequence, but it is not a commonly needed operation (by comparison with “measure size of string when rendered in a particular font” and “render a string in a particular font”).
-
@dkf said in Unicode emojis are stupid:
“measure size of string when rendered in a particular font”
This is all I've ever needed. Haven't found a case yet where I care about the glyph count.
-
@dkf said in Unicode emojis are stupid:
@pie_flavor said in Unicode emojis are stupid:
@khudzlin said in Unicode emojis are stupid:
Glyphs are beyond the scope of Unicode; they're a rendering concept.
But there has to be some way to do this, yes?
Converting to glyphs requires both the font and the string because of the complexity of handling ligatures and other forms of combining. I believe that font engines often provide a way for their clients to get the glyph sequence, but it is not a commonly needed operation (by comparison with “measure size of string when rendered in a particular font” and “render a string in a particular font”).
Hence the thread: how is this done in C#?
-
Apropos of nothing other than the title of the thread…
-
@pie_flavor In the standard library, it isn't.
I've got a copy of the PDF spec at work that probably covers the necessary details. Probably.
-
Huh. So that's possible.
I didn't even realize it was a thing.
-
Just out of curiosity, does this do anything like what you're asking?
static string ConvertTextToIndices(string text, string fontUri)
{
    // Build a WPF Glyphs element and let it map the string onto the
    // font's glyph indices.
    var glyphs = new Glyphs
    {
        UnicodeString = text,
        FontUri = new Uri(fontUri),
        FontRenderingEmSize = 12.0 // ToGlyphRun requires a positive em size
    };
    return String.Join(" ", glyphs.ToGlyphRun().GlyphIndices);
}
That should give a list of the glyph indices in the string... getting the count instead would be what you're trying to do, no? The only question is which font it's using to display the emoji characters by default, since it (presumably?) wouldn't work in any other font.
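If that compiles for you, the count would just be the number of indices. Segoe UI Emoji is, as far as I know, what Windows 10 falls back to for emoji, so presumably that's the font to pass; the path below is an assumption:
// Hypothetical usage; assumes Segoe UI Emoji at its usual Windows 10 path.
var indices = ConvertTextToIndices("👨\u200D👨\u200D👧\u200D👧",
                                   @"C:\WINDOWS\Fonts\seguiemj.ttf");
int glyphCount = indices.Split(' ').Length;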
-
@anotherusername Stuck with .NET Core; no System.Windows.Media here.
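Side note for anyone landing here on a newer runtime: .NET 5 changed System.Globalization.StringInfo to count extended grapheme clusters per UAX #29, which is about the closest the base library gets to a glyph count without consulting a font. A minimal sketch, assuming .NET 5 or later:
using System;
using System.Globalization;

// On .NET 5+ this counts extended grapheme clusters, so the ZWJ family
// sequence is one text element; older runtimes split on the ZWJs.
string s = "👨\u200D👨\u200D👧\u200D👧";
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 on .NET 5+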
-
@pie_flavor Here's a really dumb answer:
- Count the number of code points in the string.
- Subtract double the number of ZWJ characters.
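In C#, that heuristic might look something like this (method name mine; assumes well-formed UTF-16):
static int RoughGlyphCount(string s)
{
    int codePoints = 0, zwj = 0;
    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        codePoints++;
        if (char.ConvertToUtf32(s, i) == 0x200D) zwj++; // ZERO WIDTH JOINER
    }
    // Each ZWJ removes itself and one joined neighbour from the count,
    // so the family sequence (7 code points, 3 ZWJs) comes out as 1.
    return codePoints - 2 * zwj;
}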
-
@ben_lubar said in Unicode emojis are stupid:
@pie_flavor Here's a really dumb answer:
- Count the number of code points in the string.
- Subtract double the number of ZWJ characters.
... It works. I feel dirty for writing such stupid code, but it works.
-
@pie_flavor said in Unicode emojis are stupid:
@ben_lubar said in Unicode emojis are stupid:
@pie_flavor Here's a really dumb answer:
- Count the number of code points in the string.
- Subtract double the number of ZWJ characters.
... It works. I feel dirty for writing such stupid code, but it works.
Now we just wait for someone to submit a string consisting entirely of ZWJ characters...
-
@ben_lubar hah. "Hello, $DATABASE_ERROR! How are you today?"
-
@pie_flavor said in Unicode emojis are stupid:
@ben_lubar hah. "Hello, $DATABASE_ERROR! How are you today?"
Hello, [server crashes and reboots]
-
@ben_lubar said in Unicode emojis are stupid:
Subtract double the number of ZWJ characters
What about 🇦🇺 - what does that look like? In some fonts that should render as a flag (one glyph), but if the renderer isn't aware of that, it'll be separate A and U characters...
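Concretely: 🇦🇺 is U+1F1E6 and U+1F1FA, two regional indicator code points with no ZWJ between them, so the subtract-ZWJs trick counts it as two glyphs. A sketch of a patch for that case; this extension is mine, not ben_lubar's:
static int RoughGlyphCountWithFlags(string s)
{
    int count = 0;
    bool pendingRegional = false; // saw a regional indicator awaiting its pair
    for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
    {
        int cp = char.ConvertToUtf32(s, i);
        count++;
        if (cp == 0x200D) count -= 2; // ZWJ merges itself and a neighbour away
        bool regional = cp >= 0x1F1E6 && cp <= 0x1F1FF;
        if (regional && pendingRegional) { count--; pendingRegional = false; }
        else pendingRegional = regional;
    }
    return count;
}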
-
@zemm
I can't select just half of it; it's all-or-nothing.
-
@lb_ said in Unicode emojis are stupid:
@zemm
I can't select just half of it; it's all-or-nothing.
I can.
-
@dcon said in Unicode emojis are stupid:
I can.
I can't, but if I then copy and paste the two-character glyph, my cursor ends up in the middle of the pasted sequence. And if I'm at the beginning of the sequence after that, pressing Right moves the cursor to the middle, but if I'm at the end, pressing Left moves the cursor to the beginning.
-
@lb_ said in Unicode emojis are stupid:
I can't select just half of it; it's all-or-nothing.
Me neither.
-
Must be varying degrees of Unicode support. If you can't select the characters individually then it knows it has to join them, but it can't render the flag for some other reason. I do see 🇦🇺 as a flag on my phone, but my desktop computers all show the two characters.
-
@zemm said in Unicode emojis are stupid:
What about 🇦🇺 - what does that look like?
Well, like that. I wondered what the question was about until I read the replies.