When Giant JSON Attacks
-
https://technology.riotgames.com/news/bug-blog-esports-trade-issue
Background: League of Legends esports has been plagued by a long-standing, albeit intermittent, bug that caused failures when teams tried to trade champions between their players so that the correct player would pilot each champion regardless of when the champion was drafted (players draft in a fixed order based on their role on the team, but that order is generally unrelated to the draft priority for each role, so there are typically 3+ trades per team per game). When the trades took too long, they would overrun the limited time teams had to trade at the end of the draft and force the officials to end and recreate the game, with the teams re-drafting the same picks directly onto the designated players, which messed up the stats history for the game.
Turns out, passing around giant JSON objects and then reading through the entire object to figure out whether two people can complete a trade of champions is not the most efficient way to do it. By 2-4 orders of magnitude. (Their workaround was specific to esports, since that platform ensures all players have access to all champions in the game all the time: just set a "yup, they own everything" flag and don't bother with passing around or reading the list of what they "own".)
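A minimal sketch of what that workaround amounts to; all names here are invented, since the bug blog doesn't show code.

```python
# Hypothetical sketch of the workaround described above; all names are
# invented, since the bug blog doesn't show actual code.

def can_trade_full_scan(player_a: dict, player_b: dict,
                        champ_a: str, champ_b: str) -> bool:
    """Old approach: scan each player's full ownership list."""
    return (champ_b in player_a["owned_champions"]
            and champ_a in player_b["owned_champions"])

def can_trade_flagged(player_a: dict, player_b: dict,
                      champ_a: str, champ_b: str) -> bool:
    """Esports shortcut: everyone owns everything, so skip the scan."""
    if player_a.get("owns_everything") and player_b.get("owns_everything"):
        return True
    return can_trade_full_scan(player_a, player_b, champ_a, champ_b)

# No champion list serialized, shipped, or scanned at all:
print(can_trade_flagged({"owns_everything": True},
                        {"owns_everything": True}, "Ahri", "Jinx"))  # True
```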
-
"Esports"
-
@Applied-Mediocrity
Competitive Excel wants a word
-
@Luhmann I will now be more vigilant when recruiters mention having competitive Outlook in their pitches.
-
@Applied-Mediocrity
Competitive OneNote is more exciting since it's real time instead of the turn-by-turn play of Competitive Outlook
-
Re: thread title: So JSON is an enemy crab?
-
@izzion said in When Giant JSON Attacks:
Turns out, passing around giant JSON objects and then reading through the entire object to figure out if two people can complete a trade of champions is not the most efficient way to do it. By 2-4 orders of magnitude
Let me guess, they weren't using streaming parsers that ignored everything that wasn't needed?
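The early-exit idea looks roughly like this. This is a deliberately naive stand-in for a real streaming parser (ijson, Jackson, and similar libraries do this properly, emitting events as they scan); it assumes the key appears at most once and never inside a string value.

```python
# Naive sketch of "stop as soon as you've seen the one field you need",
# instead of json.loads() on the whole payload. A real streaming parser
# handles nesting and strings correctly; this one just illustrates the
# early exit.

def find_bool_field(chunks, key):
    """Scan an iterable of text chunks for '"key": true/false', stopping early."""
    needle = f'"{key}"'
    buf = ""
    for chunk in chunks:
        buf += chunk
        i = buf.find(needle)
        if i != -1:
            tail = buf[i + len(needle):].lstrip(": \t\r\n")
            if tail.startswith("true"):
                return True
            if tail.startswith("false"):
                return False
        # keep just enough tail to catch a needle split across chunks
        buf = buf[-(len(needle) + 16):]
    return None

doc = ('{"owns_everything": true, "owned_champions": ['
       + ",".join(f'"champ{i}"' for i in range(100000)) + "]}")
chunks = (doc[i:i + 4096] for i in range(0, len(doc), 4096))
print(find_bool_field(chunks, "owns_everything"))  # True, after one chunk
```

The giant ownership array after the flag is never even read off the generator.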
-
@dkf Such a thing exists? </>
-
@dkf (without reading TFA; that would be ) well, generating those JSONs might also have been a problem.
-
"Whatever you do, never use the JSON format. In almost all tests it proved to be the worst format to use."
-
@Mason_Wheeler I just wrote a formatter to serialize some data in a JSON-like format. It's hardly big data, though, only a couple of MB. Dumping it to the screen took a few seconds; writing to a file was so fast I couldn't even time it.
-
@HardwareGeek Yeah, at small scale JSON is great. At large scales it's terrible, and its major advantage (human-readability) is irrelevant anyway; who's going to manually read through gigabytes of JSON data?
-
@HardwareGeek Don't forget to publish it. There's always a chance it may catch on as the next industry standard, but in this case we will know exactly whom to hate for all the committed.
-
@Applied-Mediocrity It may actually be valid JSON. I'm not sure, because I haven't really worked with JSON before. This is a one-off; I need to copy some data from one Blender file to another, and this seemed like a reasonable way to do it. However much of a pain it is, it's easier than recreating the data in the new file in the GUI.
And in this case, I do actually need the human readability. I need to do some search-and-replace in the file, because some objects have different names in the destination file than they do in the source file.
-
@HardwareGeek Excellent! It passes all the requirements.
It worked for you! (for a one-off use case almost nobody else will have)
You haven't actually worked with JSON, but you've definitely invented a better one
It was fun to make!
-
@Applied-Mediocrity said in When Giant JSON Attacks:
@HardwareGeek Excellent! It passes all the requirements.
It worked for you! (for a one-off use case almost nobody else will have)
Maybe. The serialization works. I'm not yet certain the deserialization is going to work. At least it's not going to be quite as easy as I thought. The serializer identifies some dictionary items by key, which seems reasonable enough. However, internally, they appear to be stored by array index, and the "name" is a property that can be used like a dictionary key, but doesn't exist until the object is created. That's probably not hard to work around, but I didn't want to think that hard when I discovered it last night.
You haven't actually worked with JSON, but you've definitely invented a better one
I don't claim to have invented a better one. I just want something "good enough".
It was fun to make!
Maybe.
-
@HardwareGeek The thing I'm wondering is, does Blender not have JSON support built-in, through Python scripting if nothing else?
-
@Mason_Wheeler It does, but pretty much everything is a custom class, so the json library just throws up its hands unless you implement a custom JSONEncoder and JSONDecoder, at which point I'm really not saving any effort by using the library.
-
@Mason_Wheeler said in When Giant JSON Attacks:
"Whatever you do, never use the JSON format. In almost all tests it proved to be the worst format to use."
Real programmers use JSONx....
-
@CHUDbert What's a JSONx?
*looks it up*
-
JSONx is an IBM® standard format to represent JSON as XML.
-
@HardwareGeek not enough. I'm walking into the sea tonight, as we all now must.
-
@Mason_Wheeler said in When Giant JSON Attacks:
"Whatever you do, never use the JSON format. In almost all tests it proved to be the worst format to use."
Let's all just start working with ASN.1 instead.
-
@Carnage said in When Giant JSON Attacks:
ASN.1
I am (seriously) wondering how ASN.1 DER, protobuf, capnproto and flatbuffers would compare.
-
@Mason_Wheeler said in When Giant JSON Attacks:
"Whatever you do, never use the JSON format. In almost all tests it proved to be the worst format to use."
That blog was written by an idiot. For example, it describes JSON as having "poor support for special characters" despite the fact that it's UTF-8 by definition. If your character's too special to be in Unicode, then you're deeply into territory.
I'm guessing they chose a poor serialization model for their data. For block numeric data, provided you use arrays, you get serialization almost as compact as CSV.
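A back-of-the-envelope check of that "almost as compact as CSV" claim, with made-up block numeric data:

```python
import csv, io, json

# Compare the serialized size of the same numeric rows as JSON arrays
# versus CSV. Data is illustrative.
rows = [[i, i * i, i % 7] for i in range(10000)]

as_json = json.dumps(rows, separators=(",", ":"))

buf = io.StringIO()
csv.writer(buf).writerows(rows)
as_csv = buf.getvalue()

# JSON's only extra cost here is the brackets (one pair per row plus the
# outer pair), so the two sizes land within a few percent of each other.
print(len(as_json), len(as_csv))
```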
-
@Bulb said in When Giant JSON Attacks:
@Carnage said in When Giant JSON Attacks:
ASN.1
I am (seriously) wondering how ASN.1 DER, protobuf, capnproto and flatbuffers would compare.
I might actually find enough curiosity to overcome the but that's pretty unlikely. Though it would be something I could spin into a topic for a conference I guess.
Edit: Seems someone did already: https://www.diva-portal.org/smash/get/diva2:859441/FULLTEXT01.pdf
But that never prevented anyone from doing things again, in a slightly worse manner.
Edit2:
Instead of concluding the winner protocol, the results shows which of the protocols are least preferable for the measured metrics. On average, the text based protocols produce bigger messages and are therefore not optimal when size is important. XML was the least preferable protocol description language of all tests.
JSONx to the wins!
-
Is your program written in JavaScript?
Yes: JSON might be applicable.
No: Use something else
-
@TheCPUWizard the sad part is that over in PHP land, I often have a need to serialise a bunch of nice-to-have metadata, and I'd rather use JSON as a serialiser than PHP's own, because PHP's own is bigger, bloatier and slower to process in the testing I did back in the day. (May have changed now, but for a number of years JSON was also touted in the PHP world as a safer alternative, because you can't serialise PHP objects in it, unlike with PHP's serialiser, where behaviours can be triggered on unserialise.)
-
@Arantor - If (and I see it about as likely as me winning both Powerball and Mega Millions, multiple times, without buying a ticket) there is ever universal adoption of a proper Schema, I might have something nice to say about JSON.
-
@TheCPUWizard yeah, I hear that - JSON Schema is not particularly wonderful but it is at least usable for smaller things.
-
@Carnage said in When Giant JSON Attacks:
@Bulb said in When Giant JSON Attacks:
I am (seriously) wondering how ASN.1 DER, protobuf, cap'n proto and flatbuffers would compare.
Edit; Seems someone did already: https://www.diva-portal.org/smash/get/diva2:859441/FULLTEXT01.pdf
That covers ASN.1 DER and protobuf, but not the other two that I mentioned, because they have "infinitely faster parsing" (that's what the Cap'n Proto website says).
What the infinitely faster parsing means is that the deserialization (and for capnproto also serialization; not sure about flatbuffers) is implemented via accessors that manipulate the input/output buffer directly. Which helps the case of receiving a big chunk of data in which you need to find just a couple of things (like what started this thread), but it also saves some memory and saving memory also improves speed somewhat by reducing pressure on the caches.
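A toy illustration of that accessor idea, with an invented fixed layout (real capnproto/flatbuffers layouts are more involved, with pointers and vtables, but the principle of indexing straight into the buffer is the same):

```python
import struct

# Zero-copy-style access: a fixed-layout binary record read through
# accessors that index directly into the buffer; nothing is decoded
# until a field is actually read. Layout is made up: little-endian
# u32 id, u8 flag, u32 score.

RECORD = struct.Struct("<IBI")

class RecordView:
    """Accessors over a buffer; there is no separate parse step."""
    def __init__(self, buf: memoryview, offset: int = 0):
        self.buf = buf
        self.off = offset

    @property
    def player_id(self) -> int:
        return struct.unpack_from("<I", self.buf, self.off)[0]

    @property
    def owns_everything(self) -> bool:
        return self.buf[self.off + 4] != 0

    @property
    def score(self) -> int:
        return struct.unpack_from("<I", self.buf, self.off + 5)[0]

wire = RECORD.pack(42, 1, 9001)
view = RecordView(memoryview(wire))
print(view.player_id, view.owns_everything)  # 42 True
```

Reading just `owns_everything` touches a single byte of the buffer, which is exactly the "find a couple of things in a big chunk of data" case.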
Also, the performance will be fairly similar, and probably depend more on specific implementation than the formats themselves, but I am also interested in other properties like how expressive the corresponding schema languages are and to what they can be compiled.
But that never prevented anyone from doing things again, in a slightly worse manner.
Edit2:
Instead of concluding the winner protocol, the results shows which of the protocols are least preferable for the measured metrics. On average, the text based protocols produce bigger messages and are therefore not optimal when size is important. XML was the least preferable protocol description language of all tests.
Yeah. Of course binary protocols are better here.
Which leads me to say that it would also be useful to compare BSON and MessagePack, which are less efficient (than ASN.1 DER, protobuf, cap'n proto and flatbuffers) due to using text identifiers, but that makes them isomorphic to JSON (and YAML and TOML) even without a schema.
-
@TheCPUWizard said in When Giant JSON Attacks:
@Arantor - If (and I see it about as likely as me winning both Powerball and Mega Millions, multiple times, without buying a ticket) there is ever universal adoption of a proper Schema, I might have something nice to say about JSON.
JSON Schema:
- has implementations (of validation, code→schema and schema→code) for most major languages,
- is fairly stable by now, though it hasn't been accepted by the IETF as an RFC yet, and
- can express basically everything I could think of in terms of type definitions.
In fact it has implementations for more languages than XML Schema, so I'd call that as universal an adoption as you can get these days for any kind of schema.
@Arantor said in When Giant JSON Attacks:
@TheCPUWizard yeah, I hear that - JSON Schema is not particularly wonderful but it is at least usable for smaller things.
Granted, it's ugly as three fucks, and kinda, like, um, rather, how to say it, verbose, but otherwise perfectly usable for pretty big things (like the complete set of azure resource definitions).
I would welcome if … hm, OpenAPI now says it's compatible with JSON Schema on their website, so I wonder whether they did indeed resolve the slight inconsistencies that existed between them in the latest version.
-
@Bulb said in When Giant JSON Attacks:
it's ugly as three fucks, and kinda, like, um, rather, how to say it, verbose
Schemas usually are unless they leave far too much implicit.
-
@dkf JSON Schema can be as lax or strict as you want, from ad-hoc properties all the way to specifying properties that must be present if and only if some other properties have specific values. That is part of the reason it is so ugly, because it tries to cover a very wide range of use-cases, but it can do it.
Of course lazy bum programmers tend to provide uselessly lax schemas if they provide any at all, but that's the case for other formats as well and not really the fault of the schema. The ability to generate a schema from code helps in getting useful ones, as they contain at least all the information the data type definitions in the software do.
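For example, a conditional requirement of that kind can be written with JSON Schema's if/then/else keywords. Field names here are invented, and the branch check is hand-rolled for illustration; a real validator (such as the jsonschema package) would enforce the whole schema.

```python
# A schema saying: if "mode" is "esports", "owns_everything" must be
# present; otherwise "owned_champions" must be. Names are invented.
schema = {
    "type": "object",
    "properties": {
        "mode": {"type": "string"},
        "owns_everything": {"type": "boolean"},
        "owned_champions": {"type": "array", "items": {"type": "string"}},
    },
    "if": {"properties": {"mode": {"const": "esports"}}, "required": ["mode"]},
    "then": {"required": ["owns_everything"]},
    "else": {"required": ["owned_champions"]},
}

def satisfies_conditional(doc: dict) -> bool:
    """Hand-rolled check of just the if/then/else branch above."""
    if doc.get("mode") == "esports":
        return "owns_everything" in doc
    return "owned_champions" in doc

print(satisfies_conditional({"mode": "esports", "owns_everything": True}))  # True
print(satisfies_conditional({"mode": "ranked"}))  # False
```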
-
@dkf said in When Giant JSON Attacks:
@Mason_Wheeler said in When Giant JSON Attacks:
"Whatever you do, never use the JSON format. In almost all tests it proved to be the worst format to use."
That blog was written by an idiot. For example, it describes JSON as having "poor support for special characters" despite the fact that it's UTF-8 by definition. If your character's too special to be in Unicode, then you're deeply into territory.
Yes, admittedly that one's a bit odd, but I'd hardly call one flaw in their analysis evidence of idiocy.
I'm guessing they chose a poor serialization model for their data. For block numeric data, provided you use arrays, you get serialization almost as compact as CSV.
...which is still pretty awful compared to binary serialization. And unless your data structure is incredibly simplistic, there will be other fields involved, not just a big array, which means tons of overhead as it repeats field names again and again and again...
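A rough measurement of that repeated-field-name overhead, with made-up records:

```python
import json

# Same records twice: as an array of objects (keys repeated per record)
# versus a single object of parallel arrays (keys written once).
records = [{"id": i, "score": i * 3, "ok": True} for i in range(10000)]

row_form = json.dumps(records, separators=(",", ":"))
col_form = json.dumps(
    {
        "id": [r["id"] for r in records],
        "score": [r["score"] for r in records],
        "ok": [r["ok"] for r in records],
    },
    separators=(",", ":"),
)
print(len(row_form), len(col_form))  # keys written 10,000 times vs. once
```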
-
@Mason_Wheeler said in When Giant JSON Attacks:
...which is still pretty awful compared to binary serialization.
And yet binary serialization has other problems, such as the fact that it constrains the width of the values and has all the fun of needing to define endianness. And you can't easily inspect the messages. Or resynchronize after a glitch.
-
@dkf said in When Giant JSON Attacks:
it constrains the width of the values
Not if you encode them properly.
and has all the fun of needing to define endianness
Is there any system still around that uses big-endian encoding?
And you can't easily inspect the messages.
Which is also true of JSON at any non-small scale. (Remember, I'm specifically talking about large amounts of data, and freely conceded above that, at small scales, human readability is JSON's best feature.)
Or resynchronize after a glitch.
Can you elaborate on what you mean by this and how JSON is any different?
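For reference, "encode them properly" usually means something like LEB128-style varints, which is what protobuf uses on the wire: each byte carries 7 payload bits, and the high bit says whether more bytes follow, so small numbers stay small and there is no fixed upper width. A quick sketch:

```python
# LEB128-style unsigned varint encode/decode, as used by protobuf.

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated varint")

print(len(encode_varint(5)))    # 1 byte
print(len(encode_varint(300)))  # 2 bytes
print(decode_varint(encode_varint(2**40)))  # 1099511627776
```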
-
@Mason_Wheeler UTF8 tends to have 0s every 8 bits. If bit drops change byte boundaries it is easier to identify in UTFx than binary, precisely because it is less efficient. If your bits are the only serialized bits and are maximally space-efficient they have no EC bits.
-
@Mason_Wheeler said in When Giant JSON Attacks:
@CHUDbert What's a JSONx?
*looks it up*
As seen here:
https://what.thedailywtf.com/topic/13371/ibm-s-jsonx-or-how-to-represent-json-in-xml
https://what.thedailywtf.com/topic/13991/jsonx-is-sexy-liesx
-
@Gribnit This verification can be dealt with by periodically appending a checksum in your binary stream, and you'll still come out ahead on space.
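Something like this, say (framing format invented for illustration): each frame carries a length, the payload, and a CRC32 of the payload, so a reader can verify frames and, after a glitch, discard data until a frame's checksum verifies again.

```python
import struct, zlib

# Invented framing: little-endian u16 length, payload, u32 CRC32.

def write_frame(payload: bytes) -> bytes:
    return (struct.pack("<H", len(payload)) + payload
            + struct.pack("<I", zlib.crc32(payload)))

def read_frame(buf: bytes, offset: int = 0):
    """Return (payload, next_offset); raise on a bad checksum."""
    (length,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    payload = bytes(buf[start:start + length])
    (crc,) = struct.unpack_from("<I", buf, start + length)
    if zlib.crc32(payload) != crc:
        raise ValueError("corrupt frame")
    return payload, start + length + 4

stream = write_frame(b"trade:ok") + write_frame(b"swap:Ahri/Jinx")
first, nxt = read_frame(stream)
second, _ = read_frame(stream, nxt)
print(first, second)  # b'trade:ok' b'swap:Ahri/Jinx'
```

The six bytes of framing per message are still far cheaper than repeating JSON field names.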
-
@Mason_Wheeler said in When Giant JSON Attacks:
@Gribnit This verification can be dealt with by periodically appending a checksum in your binary stream, and you'll still come out ahead on space.
Yup, and/or on the wire itself. And in binary you can tune your amount of ECC arbitrarily. And you can hit that sweet spot that was apparently worth spending 100X more to get to.
-
@Gribnit said in When Giant JSON Attacks:
And you can hit that sweet spot that was apparently worth spending 100X more to get to.
-
@Mason_Wheeler said in When Giant JSON Attacks:
@Gribnit said in When Giant JSON Attacks:
And you can hit that sweet spot that was apparently worth spending 100X more to get to.
Oh, you'll find out when you're older.
-
@Mason_Wheeler said in When Giant JSON Attacks:
Is there any system still around that uses big-endian encoding?
There are a few mainframe systems around, but they are hopefully going extinct soon.
I think the z/Architecture from IBM is big endian for instance.
-
@dkf said in When Giant JSON Attacks:
@Mason_Wheeler said in When Giant JSON Attacks:
...which is still pretty awful compared to binary serialization.
And yet binary serialization has other problems, such as the fact that it constrains the width of the values
Depends on the format. Some do, but many don't.
and has all the fun of needing to define endianness.
So does a text representation, except that there is one dominant text encoding that implies "big", and only because it was designed by Americans. If it had been designed by Arabs, it would probably have been "little".
And you can't easily inspect the messages.
You need a tool to inspect text messages just like you need one for binary messages. The benefit of text messages is that such tools are much more ubiquitous (any text editor will do). That does not really apply to compact JSON, though.
Or resynchronize after a glitch.
That depends on whether the format is designed for it. If a brace drops from a deeply nested JSON document, it will blow up just as badly as most binary formats.
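The endianness point is easy to see concretely with Python's struct notation ('<' little-endian, '>' big-endian):

```python
import struct

# The same u32 serializes to different bytes depending on byte order,
# so a binary format has to pin one down explicitly.
value = 0x01020304
little = struct.pack("<I", value)
big = struct.pack(">I", value)
print(little.hex())  # 04030201
print(big.hex())     # 01020304

# A reader that guesses wrong decodes a different number entirely:
print(struct.unpack("<I", big)[0] == value)  # False
```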
-
@Carnage some of the ARM stuff is LE; I think up to ARMv7 it was mostly big endian, and the newer stuff is either LE or switchable between the two if you want.
I discovered this the weird way, when I was working on an image processing app a few years ago and found that photos taken by Apple gear of the time all used big endian JPEGs (since the format has a built-in way to specify the endianness of the data). That was 2014 or so; not sure if it's still true.
-
@Arantor said in When Giant JSON Attacks:
@Carnage some of the ARM stuff is LE; I think up to ARMv7 it was mostly big endian, and the newer stuff is either LE or switchable between the two if you want.
ARM itself has been switchable from the start or almost so, so it depends on the OS.
Most of the early systems targeting it used big endian, and so did the early Linux builds for it. But Windows CE was always little endian, and I remember building it for ARMv5TE, and I think builds for v4T existed. Because Microsoft code was written for little endian and they didn't want to port it.
And some time around that, most Linux distros also switched to using little endian on ARM.
I discovered this the weird way, when I was working on an image processing app a few years ago and found that photos taken by Apple gear of the time all used big endian JPEGs (since the format has a built-in way to specify the endianness of the data). That was 2014 or so; not sure if it's still true.
… on the other hand, Apple code was historically big endian, because m68k is big endian, so it makes sense they'd use big endian on ARM as well (and I suppose they used it on PowerPC, which is also switchable). Until they went to x86, which required doing the porting anyway.
-
@Bulb huh, TIL, interesting.
-
@Arantor … maybe it even wasn't "most"; Wikipedia says ARM defaults to little endian. But early Linux builds for it were definitely big endian, and switched to little over time.
-
One little, two little, three little-endians...