C strings



  • @Gaska said:

    Okay, I'll try. TRWTF is non-Latin alphabets?

    ... says the guy who complains about not being able to use an ogonek in his username 😛


  • Java Dev

    Yup, that's variable. But there's very little in our code that cares - we handle strings. As long as you've got your truncation handled correctly, there's no problem. Everything else on the string level just works - even functions like strstr() work correctly on UTF-8 strings.

    And if we really must know a character's length, like when we're walking over a string to case-fold it, we've got helper functions to determine the length of a specific character.
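
    Roughly like this, minus the validation - a minimal sketch, not our production code. UTF-8 stores a character's byte length in its lead byte, so no lookahead is needed:

    /* Sketch: byte length of a UTF-8 character, judged from its lead byte. */
    #include <stddef.h>

    static size_t utf8_char_len(unsigned char lead)
    {
        if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII */
        if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
        if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
        if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
        return 0;                            /* continuation or invalid lead byte */
    }

    (strstr() can get away with byte-wise matching because UTF-8 is self-synchronizing: a valid encoded character never starts in the middle of another one.)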


  • Banned

    @OffByOne said:

    ... says the guy who complains about not being able to use an ogonek in his username

    Whoosh much?


  • ♿ (Parody)

    @blakeyrat said:

    Who invented it is irrelevant to my point.

    This should be another blakeylaw: Things that contradict blakey's "point" are automatically irrelevant.


  • ♿ (Parody)

    @Gaska said:

    Okay, I'll try. TRWTF is non-Latin alphabets?

    Now you're talking sense!



  • @Gaska said:

    Whoosh much?

    No. I saw your joke and decided to run with it. I also wanted to show off my knowledge of non-Latin letter decorations.



  • @blakeyrat said:

    At the time these systems were developed, UTF-8 did not exist and hadn't even been thought up.

    Timeline:

    UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993.

    Win32 was introduced with Windows NT, first released in July 1993.

    Java was first released in 1995.

    So you're full of shit as usual.

    The most likely reason for both Windows and Java to choose a 16-bit internal character representation is that 16 bits was the original design width for Unicode. The "let's duplicate all our emoji in assorted colours" school of encoding design didn't start winning the engineering war until 1996.

    @blakeyrat said:

    Linux is UTF-8 all over because they were user-hating assholes who didn't even bother to think about supporting non-ANSI languages until long after everybody else.

    More revisionist @blakeyrat asspull bullshit.

    Unix and derivatives have had standardized provision for multiple character encodings since 1987. UTF-8 has become the most commonly used of these because it works better than any of the others do.

    The reason you don't find 16-bit character encodings in any of the Linux kernel APIs is that there's no need for them.

    Linux, unlike Windows, allows any character except / (ASCII code 0x2F) and NUL (ASCII code 0) in a filename. Windows, like DOS before it, prohibits all of the following:

    0x00 - 0x1F
    "  0x22
    *  0x2A
    /  0x2F
    :  0x3A
    <  0x3C
    >  0x3E
    ?  0x3F
    \  0x5C
    |  0x7C

    In the pre-Unicode Shift-JIS character encoding, any value between 0x40 and 0xFC can occur as the second byte of a two-byte character code. This means that simply storing filenames as an uninterpreted byte array and relying on userland for interpretation, an approach that's always worked just fine for Unix and Linux filesystems, won't work for Windows.
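
    To illustrate with the classic case (a well-known example, not from any particular codebase): the Shift-JIS encoding of 表 (U+8868) is 0x95 0x5C, and that second byte is ASCII backslash - the very character DOS and Windows use as a path separator:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Shift-JIS bytes for the single character 表 (U+8868) */
        const unsigned char hyou[] = { 0x95, 0x5C, 0x00 };

        /* A naive byte-wise scan mistakes the trail byte for '\' */
        if (memchr(hyou, '\\', 2) != NULL)
            puts("path separator found inside one character");
        return 0;
    }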

    Microsoft's solution was NTFS (with native UCS-2 filenames), long-filename support wedged into the existing FAT filesystems, and a bunch of UCS-2 APIs added to the kernel to suit. None of that stuff was available until Win32, which arrived with NT in 1993.


  • Banned

    @OffByOne said:

    I also wanted to show off my knowledge of non-Latin letter decorations.

    Yeah, but saying that letters have tails makes you sound silly.



  • @boomzilla said:

    This should be another blakeylaw: Things that contradict blakey's "point" are automatically irrelevant.

    I support this law, on the understanding that blakey's "point" is whatever shit he chooses to make up taking no account of what he wrote in the first place.



  • @Gaska said:

    Yeah, but saying that letters have tails makes you sound silly.

    My Polish language teacher said it looked silly if I didn't write the little tails when appropriate...

    Now you have me 😕


  • Banned

    @OffByOne said:

    My Polish language teacher said it looked silly if I didn't write the little tails when appropriate...

    Writing it wrong is one thing (because ą is a distinct letter that's actually closer to o than a; though ę is so close to e that in some cases it should be pronounced as plain e); making up silly names for things is another (strange quark, anyone?).

    Fun fact: the Alt key is too hard to reach for most users, so the Polish internets aren't a particularly nice place for spellar/gramming nazis. Especially since "e" instead of "ę" changes first-person verbs into third-person.



  • @Gaska said:

    Writing it wrong is one thing (because ą is a distinct letter that's actually closer to o than a; though ę is so close to e that in some cases it should be pronounced as plain e); making up silly names for things is another (strange quark, anyone?).

    I know that there's a difference between a, ą, e and ę. I did take Polish language lessons 😄

    AFAIK it's you Polish people who decided to call it a little tail, so the silly is on you. On the other hand, naming things is hard, and not only in CompSci.



  • @cvi said:

    char* a = 0;
    signed char* b1 = a;
    unsigned char* b2 = a;

    produces the following two warnings:

    warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
    signed char* b1 = a;
                      ^
    warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
    unsigned char* b2 = a;
                        ^

    Strictly speaking all you've done here is demonstrate that there are three distinct pointer-to-char types.



  • @Gaska said:

    making up silly names for things (strange quark anyone?)

    Yeah, they should totally have called that the unstaged quark.



  • @flabdablet said:

    Strictly speaking all you've done here is demonstrate that there are three distinct pointer-to-char types.

    In an earlier post, I had already included an example with three function overloads based on the three different (non-pointer) char types. Although I'm not sure why you'd think that the types magically become the same when you make them non-pointers...
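
    For what it's worth, plain C can demonstrate the same thing without overloads - a quick C11 sketch (mine, not the earlier example):

    #include <stdio.h>

    /* _Generic compiles here only because char, signed char and unsigned
       char are three distinct types - duplicate associations are an error. */
    #define kind(x) _Generic((x),            \
        char:          "plain char",         \
        signed char:   "signed char",        \
        unsigned char: "unsigned char")

    int main(void)
    {
        char c = 0; signed char sc = 0; unsigned char uc = 0;
        puts(kind(c));   /* plain char    */
        puts(kind(sc));  /* signed char   */
        puts(kind(uc));  /* unsigned char */
        return 0;
    }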

    Filed under: Am I getting trolled?



  • Hey guys.

    In Go, string is directly convertible to both []byte and []rune, so you can get bytes or Unicode codepoints. rune is the same type as int32 and byte is the same type as uint8. string uses UTF-8 encoding, but there are functions in the standard library to get a string from a []byte containing UTF-16 or UTF-32 and vice versa.


  • kills Dumbledore

    @ben_lubar said:

    rune

    Item #12342341232 in the Great Big List Of Reasons Not To Take Go Seriously


  • Fake News

    @powerlord said:

    Oh, and before I forget, EBCDIC was also around when C was created and has always used code points past 127... heck, lowercase letters start at code point 129.

    But then again, EBCDIC is evil so any sane language designer would prefer to ignore it...



  • @OffByOne said:

    uint8_t

    +8



  • @Gaska said:

    letters have tails

    gjpqy


  • I survived the hour long Uno hand

    @tar said:

    gjpqygiggity

    T1kTFY


  • I survived the hour long Uno hand

    This post is deleted!


  • @flabdablet said:

    On machines without byte addressing, this is not necessarily so. POSIX says that char is 8 bits, but POSIX is not the C standards committee.

    No, a char is defined to be a byte, but a byte is not defined to be 8 bits. It's 8 bits minimum, but it may be more. There exist some such processors, but I don't think they're really being used anymore.

    @OffByOne said:

    Or you could eliminate all confusion with #include <inttypes.h> and use uint8_t for your byte-sized variables.

    uint8_t might not exist, if the byte size is not 8 bits. This type is defined to exist only if the processor supports it. A char, however, always exists. So no, you can't eliminate all confusion that way.
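
    For example (a sketch; C99 guarantees that UINT8_MAX is defined exactly when uint8_t exists):

    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        printf("bits per byte: %d\n", CHAR_BIT); /* at least 8; not necessarily exactly 8 */
    #ifdef UINT8_MAX
        puts("uint8_t is available here");
    #else
        puts("no exact 8-bit type on this platform");
    #endif
        return 0;
    }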

    @OffByOne said:

    A \0 may even be a valid part of the encoding of a character.

    True for UTF-16 or UTF-32, but not for UTF-8: a \0 there is always the end of the string. Using character arrays could be fine, but only for UTF-8.
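
    For instance (a sketch with hand-written UTF-16LE bytes):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AB" in UTF-16LE: 0x41 0x00 0x42 0x00, plus a two-byte terminator */
        const char utf16le_AB[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };
        const char utf8_AB[]    = "AB";

        printf("UTF-16LE: %zu\n", strlen(utf16le_AB)); /* 1 - cut short by the high byte of 'A' */
        printf("UTF-8:    %zu\n", strlen(utf8_AB));    /* 2 - correct */
        return 0;
    }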



  • @cvi said:

    an example with three function overloads

    I'm talking about that other language, the one without all Bjarne's dark little jokes.

    @cvi said:

    I'm not sure why you'd think that the types become magically the same when you make them non-pointers...

    I would never dream of casting a pointer type to a non-pointer.



  • @flabdablet said:

    I would never dream of casting a pointer type to a non-pointer.

    Not even uintptr_t?


  • Banned

    @OffByOne said:

    AFAIK it's you Polish people who decided to call it a little tail, so the silly is on you.

    Yes, but we more often refer to those letters as "Polish symbols" - partially because it's a more general term that also includes ó, ż etc. But mostly because "little tails" sounds silly.

    @ben_lubar said:

    In Go, string is directly convertable to both []byte and []rune, so you can get bytes or magicka.

    FTFY



  • @tar said:

    Not even uintptr_t?

    Not even uintptr_t.

    It's not one of the good parts.



  • @Gaska said:

    Ok, I'll stop being a dick and say exactly why I don't see dynamic types as beneficial. In the case of XML, this is just simple deserialization. At any given position, there are three possible cases - either the data matches your expectations and you save it in an appropriate strongly-typed structure, or it doesn't and you error out, or you don't care and save it as a string.

    Sometimes I just wanna do XML to JSON conversion on the fly and just deal with it on the front end.

    Sometimes I just wanna serialize some data and send it to the client. Making a new type really isn't necessary and is sometimes undesirable.

    When building some SPAs I gotta do most of the logic on the client anyway. So again static typing really doesn't help me ... and it is likely to increase my dev time.

    @Gaska said:

    My main argument against dynamic types is, if you operate on some object, you must know what this object looks like - otherwise, how could you even think of saving this character field as a character?

    The thing is that sometimes it just adds boilerplate, when all I am really doing is ripping values outta XML or JSON data and plonking them somewhere else. In those situations I don't think static typing helps much, and it makes the code overly verbose.

    For a lot of webdev stuff static typing is overkill - on the client it's all treated as strings anyway.


  • Banned

    @lucas said:

    Sometimes I just wanna do XML to JSON conversion on the fly and just deal with it on the front end.

    You need a specialized routine anyway because XML has attributes in addition to child elements and JSON has value types.

    @lucas said:

    Sometimes I just wanna serialize some data and send it to the client. Making a new type really isn't necessary and is sometimes undesirable.

    If you're only making flat structs, that's your problem. I would make struct A that contains struct B, and send either a or a.b depending on what I need. Also, "isn't necessary" applies to anything ever made since there's always another way to do the same thing.

    @lucas said:

    When building some SPAs I gotta do most of the logic on the client anyway. So again static typing really doesn't help me ... and it is likely to increase my dev time.

    Only because JavaScript sucks.

    @lucas said:

    The thing is that sometimes it just adds boilerplate, when all I am really doing is ripping values outta XML or JSON data and plonking them somewhere else. In those situations I don't think static typing helps much, and it makes the code overly verbose.

    Static typing requires boilerplate by definition, so don't be surprised. Also, in this case I would first reduce the amount of data I need to process to only the data I actually need, either by grepping through the input data or by using a generic stringly-typed XML parser and extracting the child element I'm interested in from it.



  • @flabdablet said:

    Not even uintptr_t.

    It's not one of the good parts.

    I would be fascinated to know why...



  • Because it doesn't do anything useful.

    If you need to store a pointer, store it in a pointer type. Use a void* if you want to be a bit generic about it. Casting a pointer to a uintptr_t, doing arithmetic on it and casting it back has undefined effects so that's not useful either.

    I have never seen code employing uintptr_t that wouldn't be made clearer by getting rid of it.
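
    The byte-level tricks people usually reach for uintptr_t to do work just as well on a pointer type. A quick sketch (mine, not from any codebase) - a member's byte offset via plain pointer subtraction:

    #include <stdio.h>

    struct widget { int a; double b; };

    int main(void)
    {
        struct widget w = { 1, 2.0 };
        unsigned char *base = (unsigned char *)&w;

        /* Byte offset of member b, via pointer subtraction - no
           round-trip through an integer type required. */
        printf("offset of b: %td\n", (unsigned char *)&w.b - base);
        return 0;
    }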



  • So, pointer arithmetic using uint8_t*? Or just no pointer arithmetic at all, ever?


  • Discourse touched me in a no-no place

    @blakeyrat said:

    Your post is gibberish.

    Even if that were true, which it's not, at least I'm not ugly, unlike you.


  • Discourse touched me in a no-no place

    @flabdablet said:

    Windows, like DOS before it, prohibits all of the following:

    I would suspect that's a filesystem limitation, not an OS one, but I'm not going to bother to create an ext2fs partition on a USB drive to test it.


  • Discourse touched me in a no-no place

    @flabdablet said:

    Yeah, they should totally have called that the unstaged quark.

    No, the unstaged index.


  • Discourse touched me in a no-no place

    @ben_lubar said:

    In Go

    TL;DR



  • @tar said:

    no pointer arithmetic at all, ever?

    What useful pointer arithmetic did you have in mind that you can't do directly on a pointer type?


  • Discourse touched me in a no-no place

    cough Did you miss the fact that you missed a close-quote? (...oh yeah, I see you edited the original post.) I'm too lazy to post "I'm celebrating that you found your mistake" on the cupcakes thread, but what kind of cupcake would that get?



  • @FrostCat said:

    I would suspect that's a filesystem limitation, not an OS one

    I would suspect the opposite, since all the prohibited characters (apart from the control characters) are pathname separators or shell metacharacters.



    You did a nice job of making UTF-8 (or perhaps the decision to use it) a good idea in Windows but a bad one in Linux.

    I did a nice job of making UTF-8 a good idea in Windows.

    I don't remember doing this, but you say your post makes sense, so I guess I did.



  • @flabdablet said:

    What useful pointer arithmetic did you have in mind that you can't do directly on a pointer type?

    I actually didn't have anything in mind, and was only talking in the general case - so I would've accepted "No" as a response.

    But, please excuse me while I go off and read the source code for dlmalloc to see what kind of casting goes on in the bowels of a memory allocator...



  • @blakeyrat said:

    you say your post makes sense

    Everybody always claims this.



  • @Gaska said:

    You need a specialized routine anyway because XML has parameters in addition to child elements and JSON has value types.

    I just use a library that serializes between the two. I don't care how it works as long as it has decent performance and isn't a WTF to use.

    @Gaska said:

    If you're only making flat structs, that's your problem. I would make struct A that contains struct B, and send either a or a.b depending on what I need. Also, "isn't necessary" applies to anything ever made since there's always another way to do the same thing.

    I don't care how the data gets to the client as long as it is sensibly structured JSON. Static typing doesn't really help me when my web service is just an XML-to-JSON layer because a supplier won't give me a JSON API.

    @Gaska said:

    Only because JavaScript sucks.

    In your opinion. There are tons of stuff in JS that I wish were in other languages. Sorry, but blind JS hatred is usually because a programmer has no fucking clue how to actually write it well. Yeah, there are plenty of things that suck ... but you don't have to use them.

    It's kinda the same argument as "PHP has loads of sucky legacy crap, therefore it is crap". Nobody is forcing you to use that stuff, soo ... don't.

    @Gaska said:

    Static typing requires boilerplate by definition, so don't be surprised.

    It increases my dev time and doesn't add any real value for a lot of stuff I tend to be working on.

    @Gaska said:

    Also, in this case I would first reduce the amount of data I need to process to only the data I actually need, either by grepping through the input data or by using a generic stringly-typed XML parser and extracting the child element I'm interested in from it.

    That is what I am doing, but I don't need to define two types; I define one and create a new dynamic object with only the stuff I need, to reduce the payload being sent to the client.

    As I've said before, there are places where I will insist on things having proper types. But a lot of the work I do these days doesn't require it and it doesn't provide any tangible benefits. As per usual it depends on what you are trying to achieve; taking an ideological standpoint on it is ridiculous.


  • Banned

    @lucas said:

    I just use a library that serializes between the two.

    So what's your problem?

    @lucas said:

    I don't care how the data gets to the client as long as it is sensibly structured JSON

    What are you criticizing here, exactly?

    @lucas said:

    In your opinion. There are tons of stuff in JS that I wish were in other languages. Sorry, but blind JS hatred is usually because a programmer has no fucking clue how to actually write it well. Yeah, there are plenty of things that suck ... but you don't have to use them.

    Yeah, I totally don't have to use GC, or the this keyword, or deal with both null and undefined virtually everywhere... JavaScript has plenty of fuckups, and there's no way around it - except maybe using another language that compiles to JS (but this still doesn't solve all issues).

    @lucas said:

    That is what I am doing, but I don't need to define two types; I define one and create a new dynamic object with only the stuff I need, to reduce the payload being sent to the client.

    So, the only difference is that you don't have to declare the struct. The gain is less code to write and maintain; the loss is that you don't notice as easily if you fuck something up.

    @lucas said:

    As I've said before, there are places where I will insist on things having proper types. But a lot of the work I do these days doesn't require it and it doesn't provide any tangible benefits

    Flying doesn't require wings either. This is a metaphor. Listing all the things that can generate lift besides wings would take too long for a short rebuttal. So don't bring up helicopters or anything like that, pretty please.

    @lucas said:

    As per usual it depends on what you are trying to achieve; taking an ideological standpoint on it is ridiculous.

    Ideology is the only way to make a rational choice between two alternatives that are just as good and just as appropriate for any situation (where resources aren't constrained; otherwise, any interpreted, hence dynamically-typed, language is out of the question).


  • ♿ (Parody)

    @Gaska said:

    Flying doesn't require wings either. This is a metaphor. Listing all the things that can generate lift besides wings would take too long for a short rebuttal. So don't bring up helicopters or anything like that, pretty please.

    They're referred to as rotary-wing craft, anyways, so the best example is probably lighter-than-air stuff (e.g., hydrogen, helium, hot air).



    Oh FFS, you are an idiot, aren't you? I was responding within the context of my defense of using dynamic types in C# when appropriate, and you seemed to forget about the context ... REALLY?

    @Gaska said:

    Yeah, I totally don't have to use GC, or the this keyword, or deal with both null and undefined virtually everywhere... JavaScript has plenty of fuckups, and there's no way around it - except maybe using another language that compiles to JS (but this still doesn't solve all issues).

    You are a fucking tool, aren't you:

    1. The recommended way is to use undefined rather than null (it's been mentioned in the JS community 1000s of times; do some fucking reading or watch a JSConf video)
    2. GC is a good thing.
    3. Even the this keyword really isn't that difficult to deal with if you spend 5 seconds reading the docs.

    As I said, the most vocal opponents of JS are the ones that really don't understand anything about the language or the community. You've just proved my point.

    I know what your reply is going to be: "oh it is bad language design and it is still there" etc etc etc. Well, for fucking obvious reasons they can't throw the crap stuff out overnight, just like PHP, C# and any other language that has been around for a while.



  • @lucas said:

    GC is a good thing.

    Not universally - elevating memory above other resources can create its own can of worms.

    Filed under: finally is evil


  • I survived the hour long Uno hand

    @FrostCat said:

    Did you miss the fact that you missed a close-quote?

    Yes, and when I found it, I deleted the post asking WTF happened, but now it's weird.... it's not deleted I guess?


  • ♿ (Parody)

    @Yamikuronue said:

    Yes, and when I found it, I deleted the post asking WTF happened, but now it's weird.... it's not deleted I guess?

    I was confused until I noticed your edit.



    Well, that's kinda the thing, isn't it? Everything has its place.

