The most idiotic idea about types ever



  • Suppose you're a galaxy-spanning civilization running on some kind of subspace internet.
    You have grown from what we are now to hundreds of star systems.

    And you need a crapload of IDs so that every device in the whole universe has a UID you can send messages to.

    You can keep increasing the word width of processors/default bit width of integer or whatever, or you can keep increasing the maximum-id-string-length constraint, piling forward-compatibility bandages on top of the backwards-compatibility ball of mess, or...

    how would you future-proof this requirement, if you could build it from zero with no regard for compatibility, and deploy it instantly to everyone?

    anyway, my most idiotic idea about types ever, the one which sparked this question, is:
    every literal value has built-in bit resolution:
    10 - boolean false. first bit: "there's one bit of info". second bit: "this is the one bit of info".
    11 - bool true
    1100 - zero of 2bit variable
    1111 - 3 of 4bit var.

    This doubles every datatype size, yes, but... kinda makes every datatype variable size/resolution by default?

    pointer jumps to an int, sees 32 1s in a row, naturally concludes the int now has 32 bits of resolution.

    if it jumps to a float, it knows it's a 32-bit float.

    jump to a float, read 64 1s, know 64 bits of data follow - a C# double.

    now your number types are infinitely extendable, regardless of how many IoT orc dildos your civilization absolutely needs to plug into the interstellar information highway.

    comments? shoot-downs? i'm stoned, so i'm aware this is stupid, don't worry.
    better ideas?


  • BINNED

    First off, you can probably address every single atom in the universe with 256 bits. If not, add a few bits more. Second, if you really want it dynamic you can use your favorite implementation of Big Integer, which should give you arbitrary length integers, with only small overhead.

    Technically, those will store one word-sized length entry, e.g. 64 bits, describing the size of the variable-length integer in bytes. Obviously, that is not truly arbitrary length, but it's good enough for all practical purposes, since it can address all your memory. If you're going for purely theoretical problems, then you need to come up with something else, but there should still be more efficient encodings than yours.
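
    A rough sketch of that length-prefixed Big Integer layout (a fixed word-sized length field followed by the magnitude bytes); the function names and the 8-byte prefix are just illustrative, and it only handles non-negative values:

    ```python
    import struct

    def encode_bigint(n: int) -> bytes:
        """Length-prefixed encoding: an 8-byte size field, then the magnitude bytes."""
        payload = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
        return struct.pack(">Q", len(payload)) + payload

    def decode_bigint(buf: bytes) -> int:
        (size,) = struct.unpack_from(">Q", buf, 0)
        return int.from_bytes(buf[8:8 + size], "big")

    assert decode_bigint(encode_bigint(2**300 + 7)) == 2**300 + 7
    ```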


  • Discourse touched me in a no-no place

    @sh_code said in The most idiotic idea about types ever:

    better ideas?

    For transport encodings, strings work extremely well and don't in themselves have any arbitrary lengths to get in the way. For computation, we have bignums and they can very easily work with numbers in the region of hundreds of thousands of bits without getting anywhere close to their limits; that's more than enough to address anything in any sane scheme (or in NFTs if you prefer insane schemes).


  • Java Dev

    @sh_code said in The most idiotic idea about types ever:

    You can keep increasing the word width of processors/default bit width of integer or whatever

    A quick google indicates there are fewer than 10^82 atoms in the observable universe.

    Assigning an address to every single one of them requires about 272 bits. We're already using 128 bits per address in IPv6.
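
    Back-of-the-envelope check of that figure (taking 10^82 as the upper-end estimate):

    ```python
    import math

    atoms_exponent = 82                      # "fewer than 10^82 atoms"
    print(atoms_exponent * math.log2(10))    # ~272.4, i.e. roughly 272 bits per unique address
    ```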



  • @sh_code said in The most idiotic idea about types ever:

    10 - boolean false. first bit: "there's one bit of info". second bit: "this is the one bit of info".
    11 - bool true
    1100 - zero of 2bit variable

    Here is your first problem. How do you differentiate between bool true and the zero of a 2-bit variable? You need to make boolean false 100 and boolean true 101 so that you can allocate 110 as the prefix for a 2-bit variable (consuming a total of 5 bits from the stream).

    This is kind of the thing UTF-8 does in terms of identifying whether your byte is the first byte of a multibyte sequence and which format you're dealing with: 0xxxxxxx bytes are always code points 0-127 (1 byte), a first byte of 110xxxxx means a 2-byte sequence, 1110xxxx means 3 bytes, and 11110xxx means 4 bytes.

    Then 10xxxxxx is reserved for all subsequent bytes in a character, meaning that if a byte is lost mid-sequence you know you can always resynchronise at the start of the next character.

    Secondly, in a world where UTF-8 exists, why would we not use it for literally all the things - and why would UTF-32 still exist beyond legacy applications? The answer is performance: with UTF-8 you have no idea where the Nth character starts, you're guaranteed to have to walk the stream byte by byte, but in UTF-32 you can seek much faster because you know each character is exactly 4 bytes. (Putting aside combining characters, anyway.)
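
    For illustration, a toy classifier for UTF-8 lead bytes along those lines (just a sketch; a real decoder also has to reject overlong and otherwise invalid sequences):

    ```python
    def utf8_sequence_length(lead: int) -> int:
        """Sequence length implied by a UTF-8 lead byte, or 0 for a continuation byte."""
        if lead & 0b1000_0000 == 0b0000_0000:
            return 1                  # 0xxxxxxx: code points 0-127
        if lead & 0b1110_0000 == 0b1100_0000:
            return 2                  # 110xxxxx
        if lead & 0b1111_0000 == 0b1110_0000:
            return 3                  # 1110xxxx
        if lead & 0b1111_1000 == 0b1111_0000:
            return 4                  # 11110xxx
        return 0                      # 10xxxxxx: continuation byte, resync point

    for ch in "aé€😀":
        encoded = ch.encode("utf-8")
        assert utf8_sequence_length(encoded[0]) == len(encoded)
    ```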



  • @sh_code said in The most idiotic idea about types ever:

    every literal value has built-in bit resolution:
    10 - boolean false. first bit: "there's one bit of info". second bit: "this is the one bit of info".
    11 - bool true
    1100 - zero of 2bit variable
    1111 - 3 of 4bit var.

    How will my app tell the difference between 1111: two bits that are both boolean true and 1111: 3 of a 4-bit variable? Just based on what it expects, or will additional padding etc. be added to signify beginning and end of actually relevant parts?



  • @Gurth I think the example in the OP is b0rk - possibly explained by OP's claims of stonedness.

    Either way, with 2n bits you should be able to encode the length of an n-bit value as well as said value. You'd also be wasting increasingly many bits with that particular coding of the length, but whatever...



  • You seem to have invented a prefix code. It's a good idea, when done right, widely used from some compression algorithms (see Huffman codes) to character encodings (UTF-8).

    (Probably :hanzo:'d above)



  • @cvi said in The most idiotic idea about types ever:

    Either way, with 2n bits you should be able to encode the length of a n-bit value as well as said value. You'd also be wasting increasingly many bits with that particular coding of the length, but whatever...

    My lack of theoretical background is probably showing, but after reading that, my question remains: how would my program know which bits are the length indicator and which the actual data? As far as I can tell, you would need some way to signify at least for the very first set of bits how long its length indicator will be. Unless you already know the length you need to read beforehand, but then the length indicator would be unnecessary.



  • @Gurth Well, you could do that by always using two paired bits. The first one is always the "type" bit and the second one the "data" bit.

    If the type bit is 0 then the data bit is part of the length, and if it is 1 the data bit is part of the content.

    I.e. 01 00 01 would encode a length of 101 (= 5), 10 11 11 10 11 would contain the data 01101, and the complete sequence would be 01 00 01 10 11 11 10 11.

    Why you would want to do this is above my paygrade. This, of course, does not exactly make for fast seeking, as you still don't know upfront how long the "length" part will be...
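
    A toy decoder for that pairing, assuming well-formed input where the length bits actually add up to the number of content bits (the function name is made up):

    ```python
    def decode_paired(bits: str) -> str:
        """Pairs starting with 0 carry length bits, pairs starting with 1 carry content bits."""
        length_bits, data_bits = [], []
        for i in range(0, len(bits), 2):
            kind, value = bits[i], bits[i + 1]
            (length_bits if kind == "0" else data_bits).append(value)
        length = int("".join(length_bits), 2)
        assert len(data_bits) == length, "length field disagrees with content bits"
        return "".join(data_bits)

    # The example above: length field 101 (= 5), content 01101
    assert decode_paired("01" "00" "01" "10" "11" "11" "10" "11") == "01101"
    ```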



  • @Gurth Easy way with the 2n bits: given a string of bits, search for the first zero. The number of ones preceding the zero (plus 1) is the length of the data. So 0x => zero ones preceding +1 = 1 bit of data (the following bit). 10xx => length 2, etc etc.

    As said, this isn't the most efficient coding, since it practically always doubles the number of bits. If you look for variable length integer encodings you can find some more efficient ones. UTF-8 uses such a scheme (albeit they cap it at a max length, but you could extend their method for a bit).
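
    A sketch of that decode step, returning the data bits plus whatever is left of the stream (toy code, assumes a well-formed stream):

    ```python
    def decode_unary_prefixed(bits: str) -> tuple[str, str]:
        """(n-1) ones, a terminating zero, then n data bits."""
        ones = bits.index("0")              # number of 1s before the first 0
        n = ones + 1                        # length of the data
        start = ones + 1                    # data begins right after the 0
        return bits[start:start + n], bits[start + n:]

    assert decode_unary_prefixed("01") == ("1", "")                   # 1-bit value
    assert decode_unary_prefixed("10" + "11" + "01") == ("11", "01")  # 2-bit value, rest of stream follows
    ```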



  • @sh_code Not the most idiotic idea ever, no. There have been at least two worse: "duck typing" and "there is only one type, and it is this here particular processor's native word". But you do raise two good questions: (A) how should reflection be implemented for "low level" types? (B) what price bits? -- in other words, how much computing resource (memory space, memory bandwidth, processor overhead) are we, or should we be, willing to pay for semantic integrity? Everyone who has attempted to answer these questions (typically without fully and rigorously understanding them) has come up with a different answer: and some of the answers have been quite silly.



  • I remember Knuth playing around with this sort of idea at one point. My google-fu is not delivering so I'll go from memory.

    Prefix the bitstring with its length in binary. But, like @Gurth asks, how do you know where the length stops and the data itself begins? Easy: you prepend the length of the length, so you can read how many bits to read to get the length of the bitstring.

    But how...? Well, you recurse. Eventually you get down to a length indicator of '11'. ('10' would have a length indicator of '10' and we don't want to get into an infinite loop.) In fact, since the '11' is guaranteed to be there (we aren't encoding really short bit strings for some reason, possibly due to padding) we could make it implicit: we know to start by reading the first three bits and interpreting them as the length of what to read next.
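
    The closest standard construction I know of is the Elias omega code, which recursively prefixes the length of the length in exactly this spirit; no claim that this is the precise scheme Knuth described, but it shows the recursion terminating:

    ```python
    def elias_omega_encode(n: int) -> str:
        assert n >= 1
        code = "0"                      # terminator
        while n > 1:
            group = bin(n)[2:]          # binary of n, always starting with a 1
            code = group + code
            n = len(group) - 1          # next round encodes the length-of-length
        return code

    def elias_omega_decode(bits: str) -> tuple[int, str]:
        n, i = 1, 0
        while bits[i] == "1":           # a leading 1 means "another length group follows"
            group = bits[i:i + n + 1]
            i += n + 1
            n = int(group, 2)
        return n, bits[i + 1:]          # skip the terminating 0

    assert elias_omega_encode(16) == "10100100000"
    assert elias_omega_decode(elias_omega_encode(1_000_000)) == (1_000_000, "")
    ```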



  • @cvi said in The most idiotic idea about types ever:

    @Gurth Easy way with the 2n bits: given a string of bits, search for the first zero. The number of ones preceding the zero (plus 1) is the length of the data. So 0x => zero ones preceding +1 = 1 bit of data (the following bit). 10xx => length 2, etc etc.

    That, though, is different from the method proposed by the OP. This one depends on adding as many 1 bits as there are data bits (minus 1) whereas the OP seems to use the “length” bits to actually store a number instead.



  • @Gurth Fair - I was mainly extrapolating off OP's statement on "This doubles every datatype size, yes, but" since the examples didn't seem to work.

    I would like to :pendant: here, though, since what I described actually does encode a number, just in an utterly lame way. 0 => 1, 10 => 2, 110 => 3, 1110 => 4 etc. Lame because you need n bits to encode the number n.

    You can do a bit better if you let the encoded length mean something other than the number of bits. E.g., if you say prefix 0 => 7 bits of data, 10 => 14 bits of data, 110 => 21 bits of data, you get something that scales much better. (The prefix encodes the total length of the payload in bytes, so e.g. 21 bits of data come from three bytes = 24 bits minus the 3 bits used by the length prefix. This is very similar to UTF-8, except that UTF-8 uses the prefix 10 as a tag for subsequent bytes.)

    If you really want to store the length as a binary number, you could also use the above scheme to encode the "length of the length". So, first a number of bits that tell you how long the length field is in bits, and then that many bits storing the length in standard binary representation. E.g. 0 => special case, let's have it mean length 1; 10xx => length-of-length 2 => xx is one of {00, 01, 10, 11}. Since a data length of 1 has already been handled, you'd map those to {2, 3, 4, 5}. With this, encoding a length of n data bits takes roughly 2*ceil(log2(n)) bits.
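
    A quick sketch of the first variant (the "0 => 7 data bits, 10 => 14, 110 => 21, ..." byte-count prefix), as I read it:

    ```python
    def decode_byte_prefixed(bits: str) -> tuple[str, str]:
        """Leading 1s plus one give the total payload size in bytes; the rest of those
        bytes, after the prefix's terminating 0, is data. Toy decoder, well-formed input assumed."""
        ones = 0
        while bits[ones] == "1":
            ones += 1
        total = (ones + 1) * 8                     # payload occupies whole bytes
        return bits[ones + 1:total], bits[total:]

    assert decode_byte_prefixed("0" + "1010101") == ("1010101", "")           # 1 byte, 7 data bits
    assert decode_byte_prefixed("10" + "1" * 14) == ("1" * 14, "")            # 2 bytes, 14 data bits
    assert decode_byte_prefixed("110" + "0" * 21 + "01") == ("0" * 21, "01")  # 3 bytes, 21 data bits
    ```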

    Also see what @Watson said.

    Either way, for a good scheme, look for proper references.


  • Java Dev

    @cvi Another approach I've seen is to use the high bit of the byte to indicate whether there are additional data bytes. So 3467=0b110110001011 would be encoded as 0b00011011 0b10001011.
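
    For reference, a sketch of the MIDI-style convention (as used by the variable-length quantities mentioned a couple of posts down): big-endian 7-bit groups, with the high bit set on every byte except the last. Note the example above puts the marker on the final byte instead, so the bytes come out differently for the same value:

    ```python
    def encode_vlq(n: int) -> bytes:
        """MIDI-style variable-length quantity for a non-negative integer."""
        groups = [n & 0x7F]                     # last byte: high bit clear
        n >>= 7
        while n:
            groups.append((n & 0x7F) | 0x80)    # earlier bytes: high bit set
            n >>= 7
        return bytes(reversed(groups))

    def decode_vlq(buf: bytes) -> int:
        n = 0
        for b in buf:                           # assumes buf holds exactly one value
            n = (n << 7) | (b & 0x7F)
        return n

    assert encode_vlq(3467) == bytes([0b10011011, 0b00001011])   # 27*128 + 11 = 3467
    assert decode_vlq(encode_vlq(3467)) == 3467
    ```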



  • @dkf said in The most idiotic idea about types ever:

    For transport encodings, strings work extremely well

    If by "well" you mean "inefficiently", sure.

    @PleegWat said in The most idiotic idea about types ever:

    @cvi Another approach I've seen is to use the high bit of the byte to indicate whether there are additional data bytes. So 3467=0b110110001011 would be encoded as 0b00011011 0b10001011.

    MIDI files use this method for encoding time differences and sizes. It works well for that use case.



  • @Zerosquare said in The most idiotic idea about types ever:

    MIDI files use this method for encoding time differences and sizes. It works well for that use case.

    Main drawback is that you don't know the length upfront. Not necessarily a big deal, though.

    Filed under: Hollerith string encoding.



  • @cvi said in The most idiotic idea about types ever:

    I would like to :pendant: here, though, since what I described actually does encode a number, just in an utterly lame way. 0 => 1, 10 => 2, 110 => 3, 1110 => 4 etc. Lame because you need n bits to encode the number n.

    Yes, that’s why I said “actually store a number” :) You’re storing the number as if you’re counting with pebbles, rather than using binary arithmetic to store the number in them.


  • Discourse touched me in a no-no place

    @Zerosquare said in The most idiotic idea about types ever:

    If by "well" you mean "inefficiently", sure.

    Inefficiency means that there's redundancy, and redundancy tends to mean that you can do framing and error correction easily. If you don't have redundant bits and you lose data, you garble the whole of the rest of the message and might not even know it.



  • That's faulty reasoning. A format can have a lot of redundancy and be poor at error detection/correction.

    If you need to transmit a bunch of values, a properly-designed binary encoding with CRC is faster to parse, uses less space, and provides better protection against accidental corruption than XML.
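
    As a minimal illustration of the kind of thing meant by "binary encoding with CRC" (the framing and field widths here are made up for the example):

    ```python
    import struct
    import zlib

    def pack_with_crc(values: list[int]) -> bytes:
        """Count, fixed-width 32-bit values, then a CRC-32 over everything before it."""
        body = struct.pack(">I", len(values)) + b"".join(struct.pack(">i", v) for v in values)
        return body + struct.pack(">I", zlib.crc32(body))

    def unpack_with_crc(frame: bytes) -> list[int]:
        body, crc = frame[:-4], struct.unpack(">I", frame[-4:])[0]
        if zlib.crc32(body) != crc:
            raise ValueError("CRC mismatch: frame corrupted")
        (count,) = struct.unpack_from(">I", body, 0)
        return [struct.unpack_from(">i", body, 4 + 4 * i)[0] for i in range(count)]

    assert unpack_with_crc(pack_with_crc([1, -2, 30000])) == [1, -2, 30000]
    ```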



  • @dkf said in The most idiotic idea about types ever:

    @Zerosquare said in The most idiotic idea about types ever:

    If by "well" you mean "inefficiently", sure.

    Inefficiency means that there's redundancy, and redundancy tends to mean that you can do framing and error correction easily. If you don't have redundant bits and you lose data, you garble the whole of the rest of the message and might not even know it.

    More accurately: if you don't have a mechanism for error detection and/or correction, you can't detect/correct errors, but you don't need redundant bits for error detection. In fact, if you're using them for error detection/correction, they aren't redundant. They just aren't part of the message.


  • Discourse touched me in a no-no place

    @Zerosquare said in The most idiotic idea about types ever:

    a properly-designed binary encoding with CRC

    Except then the other side needs to know that's what you are sending, and a bit lost or corrupted means that (everything else working correctly) you lose the whole message. For long distance one-way sending (the scenario with interstellar comms when light-speed restricted), you're better off with a message that has more redundant bits (i.e., one that could be compressed more) but where losing chunks of the message doesn't mean that you lose everything. Error correcting codes help, but they're designed to correct up to a certain error rate, detect a somewhat higher rate, and don't work beyond that. (CRCs are a kind of error-detecting code: they detect, but don't correct, errors in the message section they cover.)

    Information redundancy is not really redundancy; it's information space for noise to exist without destroying the message.



  • @Steve_The_Cynic said in The most idiotic idea about types ever:

    you don't need redundant bits for error detection. In fact, if you're using them for error detection/correction, they aren't redundant.

    From an information point-of-view, they are redundant, since you can compute them using only the original message. But they're redundant in a more efficient way than, say, sending the same message twice and hoping one of the copies is correct.

    @dkf said in The most idiotic idea about types ever:

    For long distance one-way sending (the scenario with interstellar comms when light-speed restricted)

    Please tell me you're not sending XML to aliens. Please.


  • Discourse touched me in a no-no place

    @Zerosquare said in The most idiotic idea about types ever:

    @dkf said in The most idiotic idea about types ever:

    For long distance one-way sending (the scenario with interstellar comms when light-speed restricted)

    Please tell me you're not sending XML to aliens. Please.

    HFY!



  • @dkf But (just) stringifying data isn't that great either. A single flipped bit can corrupt the data in a way you can't recover or even detect.

    Edit: And needlessly bloating the data that you are going to send over your interstellar network is a bad idea too. Needless bloat also means there's more chance for an error to occur during transmission. Not to mention that you have limited bandwidth (and limited opportunities for the links in the first place). Plus, chances are that you are buffering messages at each long-range link (instead of just at the peers, like TCP et al. typically do), so you pay a lot for the additional intermediate storage. In essence, tailor-made error correction is likely more efficient.



  • Absolutely. Deep-space communications is one of the use cases for custom protocols that are both efficient and robust (example).


  • Discourse touched me in a no-no place

    @cvi said in The most idiotic idea about types ever:

    A single flipped bit can corrupt the data in a way you can't recover or even detect.

    Only if you insist on using some form of data communication that isn't robust against it. Digits aren't robust, but words are much more so; this is because the bit-level Levenshtein distance between semantically-reasonable substituands is typically much larger.

    Of course, you'd only really do it that way if you had lots of power to transmit with, but past a certain point you don't care as much about whether you have the sustained power to transmit so much as the ability to overcome a lot of noise. One of the best ways of overcoming noise is having a much less dense encoding, one where a great many bits have to become corrupted for the message to become unrecoverable. (Also consider that it's about making the actual message recoverable, and not necessarily the fine detail.) This is very different from how we tend to do computer codes for storage right now, where compactness is a much bigger issue than framing and the error rate is generally low anyway.



  • Again, if you want something that's robust against noise, there are far better solutions than using text.

    "But text is human-readable, easy to use and doesn't require custom tools!"

    Yes. So are Lego blocks. Yet for some reason, most things are not built out of Lego blocks.

    (Note: that comparison is unfair. Actual Lego blocks are much better designed than most text-based protocols.)



  • @Zerosquare said in The most idiotic idea about types ever:

    @Steve_The_Cynic said in The most idiotic idea about types ever:

    you don't need redundant bits for error detection. In fact, if you're using them for error detection/correction, they aren't redundant.

    From an information point-of-view, they are redundant, since you can compute them using only the original message. But they're redundant in a more efficient way than, say, sending the same message twice and hoping one of the copies is correct.

    @dkf said in The most idiotic idea about types ever:

    For long distance one-way sending (the scenario with interstellar comms when light-speed restricted)

    Please tell me you're not sending XML to aliens. Please.

    That depends. Do we want first contact to be hostile?



  • @dkf said in The most idiotic idea about types ever:

    Digits aren't robust, but words are much more so; this is because the bit-level Levenshtein distance between semantically-reasonable substituands is typically much larger.

    It's not just digits, but also stuff like delimiters. And if you actually have to send numeric data, encoding it as digits is already inefficient. Turning that into words will just blow up the data beyond any reason.

    But ultimately you're now designing a protocol that makes sure that stuff like Levenshtein distances are maximized between "words" that are semantically substitutable. Why not throw out the idea that the words are human text? This gives you many more degrees of freedom when you try to find the balance between compactness and redundancy. Plus, it's going to be much easier to figure out what constitutes a catastrophic transmission error...



  • @HardwareGeek said in The most idiotic idea about types ever:

    Do we want first contact to be hostile?

    Well, we could send them json. That way first contact will be part of an interstellar rescue mission, instigated after the galactic community takes pity on humanity.



    👽: Have you found any sign of intelligent life, Zglorb?
    👾: Nah. Just that stupid JSON signal again.



  • Let’s send them YAML instead?



  • @Arantor said in The most idiotic idea about types ever:

    Let’s send them YAML instead?

    I pity the alien earth monitoring services. We're basically inventing new data encoding formats every few days. Imagine having to keep track of those...



  • @cvi it gets better than that though.

    I was looking at Hugo recently for making static sites and it supports multiple config formats. Yes, it supports YAML but it also supports a thing called TOML.

    Which is basically a revamp of ini files, but slightly worse, and it achieves none of the (alleged) goals of being “human readable” by daring to mandate quotes around strings. (I submit that I find it easier to mentally parse TOML than I do YAML, because with YAML I'm never sure if a thing is a number or a string, even though YAML is allegedly strictly typed on the matter.)



  • @Zerosquare said in The most idiotic idea about types ever:

    Please tell me you're not sending XML to aliens. Please.

    Aha! So that’s how they win the war in Independence Day!



  • Indeed:
    [image]


  • BINNED

    @cvi said in The most idiotic idea about types ever:

    @dkf said in The most idiotic idea about types ever:

    Digits aren't robust, but words are much more so; this is because the bit-level Levenshtein distance between semantically-reasonable substituands is typically much larger.

    It's not just digits, but also stuff like delimiters. And if you actually have to send numeric data, encoding it as digits is already inefficient. Turning that into words will just blow up the data beyond any reason.

    But ultimately you're now designing a protocol that makes sure that stuff like Levenshtein distances are maximized between "words" that are semantically substitutable. Why not throw out the idea that the words are human text? This gives you many more degrees of freedom when you try to find the balance between compactness and redundancy. Plus, it's going to be much easier to figure out what constitutes a catastrophic transmission error...

    IMO, you should first run your data through the strongest available (content specific) compression algorithm so that it has maximum entropy. After that you can compute the amount of error correction required for your channel and use a generic encoding that doesn't rely on the content having any intrinsic redundancy.
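
    A toy version of that pipeline: compress first, then add back exactly as much redundancy as you decide the channel needs. (Bytewise repetition with a majority vote stands in for a real ECC like Reed-Solomon or LDPC here, purely for illustration.)

    ```python
    import zlib

    def protect(data: bytes, copies: int = 3) -> bytes:
        compressed = zlib.compress(data, 9)                          # squeeze out intrinsic redundancy
        return bytes(b for b in compressed for _ in range(copies))   # add controlled redundancy

    def recover(coded: bytes, copies: int = 3) -> bytes:
        decoded = bytearray()
        for i in range(0, len(coded), copies):
            group = coded[i:i + copies]
            decoded.append(max(set(group), key=group.count))  # majority vote per group
        return zlib.decompress(bytes(decoded))

    assert recover(protect(b"hello " * 100)) == b"hello " * 100
    ```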



  • @sh_code said in The most idiotic idea about types ever:

    my, the most idiotic idea about types ever, which sparked this question is:
    every literal value has built-in bit resolution:
    10 - boolean false. first bit: "there's one bit of info". second bit: "this is the one bit of info".
    11 - bool true
    1100 - zero of 2bit variable
    1111 - 3 of 4bit var.

    This kinda looks/sounds like ASN.1. Might be worth glancing at/considering.



  • @Circuitsoft said in The most idiotic idea about types ever:

    This kinda looks/sounds like ASN.1. Might be worth glancing at/considering.

    But what you should consider is whether to use a flamethrower or a GAU-8 on it.


  • Java Dev

    @Steve_The_Cynic said in The most idiotic idea about types ever:

    @Circuitsoft said in The most idiotic idea about types ever:

    This kinda looks/sounds like ASN.1. Might be worth glancing at/considering.

    But what you should consider is whether to use a flamethrower or a GAU-8 on it.

    :why_not_both:


  • Discourse touched me in a no-no place

    @Circuitsoft said in The most idiotic idea about types ever:

    ASN.1. Might be worth glancing at/considering.

    I will pray for your soul. Or not. :kneeling_warthog:



  • @PleegWat Or, indeed, both, as you say.



  • @dkf said in The most idiotic idea about types ever:

    @Circuitsoft said in The most idiotic idea about types ever:

    ASN.1. Might be worth glancing at/considering.

    I will pray for your soul. Or not. :kneeling_warthog:

    Honestly, ASN.1 makes sense for the purpose it was created. It's a very compact, but very flexible, serialization format with well-defined semantics. It's not text-based, but it's otherwise about as self-describing as it could be. I haven't actually minded working with it when I had to.


  • Discourse touched me in a no-no place

    @Circuitsoft said in The most idiotic idea about types ever:

    about as self-describing as it could be

    Yes, because everyone knows all those OIDs off by heart.



  • @dkf said in The most idiotic idea about types ever:

    @Circuitsoft said in The most idiotic idea about types ever:

    about as self-describing as it could be

    Yes, because everyone knows all those OIDs off by heart.

    Back when I worked with ASN.1 I remember the output of the b interpreter looking perfectly understandable. On the other hand, back then I could translate DAYS SINCE 1970-01-01 to calendar dates and the other way in my head as well. Among a few other slightly autistic things.



  • I was under the impression ASN.1 DER encoded only field types, not their names (or ID)... Gotta read that again.



  • @Circuitsoft said in The most idiotic idea about types ever:

    Honestly, ASN.1 makes sense for the purpose it was created. It's a very compact, but very flexible, serialization format with well-defined semantics. It's not text-based, but it's otherwise about as self-describing as it could be. I haven't actually minded working with it when I had to.

    Depends. If you're talking about PER, it's not even slightly self-describing except for string lengths.


  • Banned

    @Arantor said in The most idiotic idea about types ever:

    Which is basically a revamp of ini files, but slightly worse, and it achieves none of the (alleged) goals of being “human readable” by daring to mandate quotes around strings. (I submit that I find it easier to mentally parse TOML than I do YAML, because with YAML I'm never sure if a thing is a number or a string, even though YAML is allegedly strictly typed on the matter.)

    TOML has one advantage over JSON, YAML etc. - nested data structures don't have to be physically, syntactically nested. This has two major benefits:

    • You don't have to scroll up and down constantly to see which part of the object you're in. The INI-style section name has the full path. That also makes searching the file much easier.
    • Splitting data into multiple files/network messages is very natural and doesn't require you to create a whole new patching protocol on top of your actual protocol. You can arbitrarily concatenate files and the result will be a single object with all the data in the right places (see the sketch below).

    For anything human-writable, TOML is my go-to format. It just makes so much more sense than JSON and derivatives, once you get people over the "everything I'm not familiar with already is the worst thing ever" mentality (it's the same problem as with convincing people to try functional programming). Also, diffs are easier to read (because no syntactic nesting).
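
    A small illustration of both points, using Python's stdlib tomllib (3.11+); the keys and values here are made up:

    ```python
    import tomllib  # standard library since Python 3.11

    part_one = """
    [server]
    host = "example.org"
    """

    part_two = """
    [server.tls]
    enabled = true
    min_version = "1.3"
    """

    # Every section header carries its full path, so concatenating fragments
    # still yields one object with everything in the right place.
    data = tomllib.loads(part_one + part_two)
    assert data["server"]["host"] == "example.org"
    assert data["server"]["tls"]["enabled"] is True
    ```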

