When Giant JSON Attacks


  • Java Dev

    Last year they made me implement JSON export for it's-not-gonna-be-big-honest data. I fought, lost, and wrote a partially-streaming generator so on my end it's linear time and fixed memory.

    Then they had me implement a maximum file size, so I generate even more files per minute. Each file goes to an HTTP endpoint using libcurl. I neither know nor care whether we're hitting their rate limits.
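
    The chunked streaming export described above can be sketched roughly like this (hypothetical record source and size limit; the actual implementation and the libcurl upload are not shown):

```python
import json

def stream_records(records, max_bytes):
    # Emit records into JSON-array "files", starting a new one whenever
    # adding the next record would exceed max_bytes. Memory stays bounded
    # by one chunk; time is linear in the input.
    buf, size = [], 2  # 2 accounts for the surrounding "[" and "]"
    for rec in records:
        piece = json.dumps(rec)
        extra = len(piece) + (1 if buf else 0)  # +1 for "," separator
        if buf and size + extra > max_bytes:
            yield "[" + ",".join(buf) + "]"
            buf, size = [], 2
            extra = len(piece)
        buf.append(piece)
        size += extra
    if buf:
        yield "[" + ",".join(buf) + "]"

for chunk in stream_records(({"i": i} for i in range(5)), max_bytes=30):
    print(chunk)
```

    Each yielded chunk is a complete, valid JSON document, which is what lets the uploader treat them as independent files.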



  • @Mason_Wheeler said in When Giant JSON Attacks:

    and has all the fun of needing to define endianness

    Is there any system still around that uses big-endian encoding?

    Doesn't matter. You still need to define what endianness you want to use.

    Note: I favour big-little middle-endian for 32-bit values.



  • @Steve_The_Cynic said in When Giant JSON Attacks:

    You still need to define what endianness you want to use.

    Among millions of other properties of the format, which makes it not particularly significant. For example, protobuf uses a variable-length encoding for numbers, and specifying that takes a lot more than saying it is little-endian (which it is).
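
    For illustration, the variable-length scheme in question (protobuf's base-128 varint) can be sketched in a few lines: 7 payload bits per byte, least-significant group first, continuation bit set on all but the final byte.

```python
def encode_varint(n: int) -> bytes:
    # Protobuf base-128 varint: emit 7 bits at a time, least-significant
    # group first (little-endian), with the high bit marking "more follows".
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

print(encode_varint(300).hex(" "))  # ac 02
```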



  • @Steve_The_Cynic said in When Giant JSON Attacks:

    big-little middle-endian for 32-bit values

    That seems like a variation on Microsoft's super neat idea for encoding UUIDs in its history (or, if you're an Azure Active Directory user, currently)

    Source: https://en.wikipedia.org/wiki/Universally_unique_identifier#Encoding

    Variant 2 UUIDs, historically used in Microsoft's COM/OLE libraries, use a mixed-endian format, whereby the first three components of the UUID are little-endian, and the last two are big-endian. For example, 00112233-4455-6677-c899-aabbccddeeff is encoded as the bytes 33 22 11 00 55 44 77 66 c8 99 aa bb cc dd ee ff
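
    Python's standard uuid module actually exposes exactly this mixed-endian layout as bytes_le, which makes the difference easy to see:

```python
import uuid

u = uuid.UUID("00112233-4455-6677-c899-aabbccddeeff")
# Straight big-endian byte order, matching the textual form:
print(u.bytes.hex(" "))
# Microsoft's mixed-endian layout: first three fields byte-swapped,
# last two left big-endian:
print(u.bytes_le.hex(" "))  # 33 22 11 00 55 44 77 66 c8 99 aa bb cc dd ee ff
```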



  • @Arantor I was going to joke about doing something like this, but...yeah. Reality is post-satire and has been for a long time now.



  • @Arantor said in When Giant JSON Attacks:

    @Steve_The_Cynic said in When Giant JSON Attacks:

    big-little middle-endian for 32-bit values

    That seems like a variation on Microsoft's super neat idea for encoding UUIDs in its history (or, if you're an Azure Active Directory user, currently)

    Classically, it comes from the 32-bit integer formats for the PDP-11, where the two sixteen-bit halves were each stored little-endian, but if you viewed memory as an array of 16-bit words on even addresses only, the 32-bit values were stored in big-endian, so your number 0x12345678 would be stored as:

    0x34, 0x12, 0x78, 0x56
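
    That "PDP-endian" layout can be reproduced with a couple of lines of Python (a sketch: the value is split into two 16-bit halves kept in big-endian order, each half stored little-endian):

```python
import struct

def pdp_endian(value: int) -> bytes:
    # Split the 32-bit value into two 16-bit halves (halves in big-endian
    # order), then pack each half little-endian: the PDP-11 middle-endian layout.
    hi, lo = value >> 16, value & 0xFFFF
    return struct.pack("<HH", hi, lo)

print(pdp_endian(0x12345678).hex(" "))  # 34 12 78 56
```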
    


  • @Mason_Wheeler said in When Giant JSON Attacks:

    Is there any system still around that uses big-endian encoding?

    Positional number systems as used by humans, and by extension all text formats.

    I'm :half-trolling: here, but it does make hex dumps of little-endian binary formats more cumbersome to read.

    Also AFAIK there's still "network byte order".

    @Bulb said in When Giant JSON Attacks:

    it would have probably been “little”

    I don't think so; it would just mean that the "end" would be on the other side.



  • @ixvedeusi said in When Giant JSON Attacks:

    @Mason_Wheeler said in When Giant JSON Attacks:
    @Bulb said in When Giant JSON Attacks:

    it would have probably been “little”

    I don't think so; it would just mean that the "end" would be on the other side.

    The thing with numbers in Arabic text is that even Arabic digits (some Arab countries use Latin digits) are written left-to-right, big-endian. In other words, the most significant digit is always on the left, but the surrounding text is read from the right. This also has the weird consequence that a number before text is to the left of it and a number after text is also to the left of it, as the overall direction of the line is governed by its start.

    Now in computers, the numbers are still big-endian (most significant digit first) and the text-direction logic applies. But I assume that Arabs wouldn't actually switch their reading direction for the numbers, but would instead read the numbers least significant digit first, i.e. little-endian.


  • 🚽 Regular

    @Bulb Dammit, you've :hanzo:d me in your <abbr> just as I was typing my comment.



  • @Bulb said in When Giant JSON Attacks:

    But I assume that Arabs wouldn't actually switch the direction they are reading for the numbers, but instead would read the numbers least significant digit first, i.e. little endian.

    According to Google Translate (that font of reliable knowledge):

    Seventy-three. Seventy-four. Seventy-five. Seventy-six. Seventy-seven.
    =>
    ثلاثة وسبعون. أربعة وسبعون. خمسة وسبعون. ستة وسبعون. سبعة وسبعون.
    Pronounced "thalathat wasabeuna. 'arbaeat wasabeuna. khamsat wasabeuna. sitat wasabeuna. sabeat wasabeuna."
    But then
    Four hundred fifty-seven. Three hundred fifty-seven. => "'arbaeumiayat wasabeat wakhamsuna. thalathumiayat wasabeat wakhamsuna."



  • @Bulb - Yes, there are implementations of JSONSchema... that is not what I was really referring to (apologies for not being clearer)... Of the last 1,000 JSON files you have seen, how many have a published and validated schema?

    If there comes a point where I can reliably get a JSON payload, how do I read the schema definition for it? Say I do a simple GET: where is the schema published?



  • @TheCPUWizard Those that bother tend to include a $schema entry that points to the schema to validate against.

    Most do not bother :(
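
    For illustration, a document advertising its own schema via a top-level $schema member might look like this (the schema URL and field names are made up):

```python
import json

# Hypothetical self-describing JSON document: the "$schema" member points a
# validator at the schema this payload claims to conform to.
doc = {
    "$schema": "https://example.com/schemas/invoice.schema.json",
    "invoiceId": 17,
    "total": "42.00",
}
print(json.dumps(doc, indent=2))
```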



  • @TheCPUWizard Applications we write here do have that. Well, the older, slightly incompatible OpenAPI variant actually, but they do. And much of the … YAML¹, actually … I wrote lately had JSON Schema driving the completion. So yeah, there is still a lot that does not have a schema – and I got burnt by one not having it, just because of the author's :kneeling_warthog:, just this morning – but it's getting better.

    OpenAPI, formerly Swagger, is quite useful. Colleagues around here use it: they write the web application using a framework that supports it (usually Spring Boot), and the framework generates the schema and serves it on a well-known path (including a simple app for viewing it in the browser), so the front-end developers have a handy reference that always matches what the backend actually accepts. It is also possible to generate TypeScript classes from it, but I don't think they do that yet.


    ¹ Well, some YAML, like various Kubernetes manifests and Azure DevOps Pipelines, some JSON when it is Azure templates. These things do tend to have schemas.



  • @Bulb said in When Giant JSON Attacks:

    @TheCPUWizard Applications we write here do have that. Well, the older slightly incompatible openapi variant actually, but they do. And much of the … yaml¹, actually … I wrote lately had json-schema driving the completion. So yeah, there is still a lot that does not have schema – and I got burnt by one not having it just because author :kneeling_warthog: just this morning – but it's getting better.

    The openapi, formerly swagger, is quite useful. Around here colleagues use it so they write the web application using a framework that supports it (usually springboot) and the framework generates the schema and presents it on a well-known path (including a simple app for viewing in browser), so the front-end developers have a handy reference that always matches what the backend actually accepts. It is also possible to generate typescript classes from it, but I don't think they do it yet.

    The problem with Swagger here is that it still does not support OpenAPI 3.1 - which is the version that added support for JSON Schema.



  • @Kamil-Podlesak OpenAPI 3.1 synchronized itself fully with JSONSchema, but the older versions are quite similar and I've seen a conversion script somewhere on the JSONSchema site.



  • @Bulb said in When Giant JSON Attacks:

    @Kamil-Podlesak OpenAPI 3.1 synchronized itself fully with JSONSchema, but the older versions are quite similar and I've seen a conversion script somewhere on the JSONSchema site.

    The older versions' "schema" is too limited. In my case, the missing patternProperties is a dealbreaker, because I need to represent localized strings: {"en":"English","de":"Deutsch"}.
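
    A sketch of what that patternProperties fragment could express (the language-tag pattern here is an assumption), with a minimal hand-rolled check standing in for a real validator:

```python
import re

# JSON Schema fragment (draft 2020-12 style): object keys must look like
# language tags, and every value must be a string.
schema = {
    "type": "object",
    "patternProperties": {"^[a-z]{2}(-[A-Z]{2})?$": {"type": "string"}},
    "additionalProperties": False,
}

def matches(doc: dict) -> bool:
    # Hand-rolled check of just this fragment, not a full validator.
    pattern = next(iter(schema["patternProperties"]))
    return all(
        re.fullmatch(pattern, k) is not None and isinstance(v, str)
        for k, v in doc.items()
    )

print(matches({"en": "English", "de": "Deutsch"}))  # True
```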



  • @Kamil-Podlesak Swagger was built primarily with the idea of describing the objects serialized and deserialized by the standard Java serializer, and I don't think that can express this kind of constraint either.

