The github for this project doesn't use this terminology, but the keyword here is 'canonicalization'. That's the process in which you transform a document which can take any number of forms to a specific document that only takes one form.
You do this to ensure that documents that contain the same "information" will look identical on the wire, not just after-the-fact once you've processed them into some execution-specific data structure.
For example, X.509 certificates use ASN.1 DER, a restricted subset of ASN.1 BER where each value takes a deterministic form, so that two certificates that contain the same information will look identical when serialized into bytes. This is a stricter application of ASN.1, while, say, when you talk to a directory service over LDAP, you can speak BER, the looser encoding, because the exact bytes by which you're making yourself understood don't really matter.
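To make the idea concrete, here is a minimal sketch in Python. The rules it applies (sorted keys, no insignificant whitespace, UTF-8 output) are assumptions chosen for illustration and are not Son's actual specification; `canonical_bytes` is a made-up helper name.

```python
import json

def canonical_bytes(value):
    # Illustrative rules only (an assumption, not Son's spec):
    # sorted keys, no insignificant whitespace, UTF-8 output.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

# Two spellings of the same information.
doc_a = json.loads('{"b": 1, "a": [1, 2]}')
doc_b = json.loads('{ "a":[1,2],\n  "b":1 }')

# After canonicalization both produce identical bytes.
assert canonical_bytes(doc_a) == canonical_bytes(doc_b)
print(canonical_bytes(doc_a))  # b'{"a":[1,2],"b":1}'
```

Identical bytes are what make hashing or signing the serialized form meaningful, which is the same reason certificates use DER rather than BER.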
Son is a canonicalized form of JSON, and the author maintains a page of JSON subsets and supersets [1] that lists two other canonicalized encodings, both of them containing silly flaws in their design. Son is a better effort.
A stricter subset of JSON doesn't interest me without more specification surrounding how numerics should be decoded.
How do I decode 18446744073709551615? What is the behaviour of a decoder that can't handle numeric types of that size? Should it cause a parse error? Should it truncate to 1.84467e+19? Should it overflow to 4294967295?
Presumably if you want a canonical textual representation, the parse error is the only acceptable solution.
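The three outcomes in the question can be reproduced in a few lines of Python. This is only a sketch: `decode_u32_strict` is a hypothetical decoder for an application that wants unsigned 32-bit values, not something any particular library provides.

```python
literal = "18446744073709551615"  # 2**64 - 1

exact = int(literal)              # arbitrary precision: the value is preserved
as_double = float(literal)        # rounds to the nearest double, which is 2**64
wrapped_u32 = exact & 0xFFFFFFFF  # what a 32-bit wraparound would yield

print(exact)        # 18446744073709551615
print(as_double)    # 1.8446744073709552e+19
print(wrapped_u32)  # 4294967295

def decode_u32_strict(text):
    # Hypothetical strict decoder: reject instead of silently changing the value.
    value = int(text)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError(f"{text} does not fit in an unsigned 32-bit integer")
    return value
```

Only the strict variant guarantees that a value which decodes at all decodes to what was written.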
For context, let's start with what JSON is. All the JSON spec really lets you do is say "this sequence of code points is JSON", or "this sequence isn't". It includes a few interoperability suggestions for how you might avoid JSON that will be hard to interpret, but it's extremely stark on parsing and generation.
You have a comment downthread about how inadequate this is, which I totally agree with:
> The behaviour of the encoder and decoder are just as important to specify as the bytes across the wire. This is the whole problem with JSON in the first place.
Unfortunately since the goal of Son is to be a subset of JSON, there's not much we can do about encoding/decoding within Son's spec. Instead I hope to build a sane foundation on which other tools can be built. Limiting the scope of Son seems like a good way to go about this. In this case, that means not specifying ranges for numbers, so that future tools such as schemas have as much flexibility as possible (for instance to include arbitrary precision numbers).
EDIT: I'm mixing together two different things in my last paragraph. I don't want Son to know about Floats, Doubles, etc since JSON doesn't. That doesn't mean that we couldn't specify a range on number size though. The reason that is left out is to allow maximum flexibility for other tools to build on Son -- I'm trying to drop extraneous parts of JSON, but as few useful parts as possible. This needs more elaboration within the project's docs.
"In this case, that means not specifying ranges for numbers, so that future tools such as schemas have as much flexibility as possible (for instance to include arbitrary precision numbers)."
If the allowed range of numbers isn't specified, then when you emit data in this format, you can't be sure it will be read correctly on the other end...
This is definitely a problem, but the place to solve it isn't Son.
Son is intended to be a very simple project that grabs all the clear wins. Do we really need both `0` and `-0`, stuff like that.
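As an illustration of the kind of redundancy in question, plain JSON allows many spellings of the same number. The sketch below uses Python's standard `json` module; which of these spellings Son actually forbids is not something this example claims.

```python
import json

# Five different JSON texts, one numeric value.
spellings = ["1.0", "1e0", "10e-1", "1.00", "0.1E1"]
values = [json.loads(s) for s in spellings]
assert all(v == 1.0 for v in values)

# `0` and `-0` are both legal JSON and compare equal after parsing.
assert json.loads("0") == json.loads("-0")
```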
There are too many different things JSON Numbers can represent for us to have a clear strategy for all of them in Son. Int64s, doubles, floats, arbitrary precision numbers (eg https://hackage.haskell.org/package/scientific-0.3.5.2), etc. That makes this a good fit for the schema layer (eg JSON Schema or whatever you're using), not the base specification layer.
> This is definitely a problem, but the place to solve it isn't Son.
But then what problem does Son actually solve? How is a canonicalisation format that's not actually canonical useful?
You take a hit (e.g. by forgoing fast existing libraries, control over pretty printing, and by being forced to sort on serialization which screws up streaming), but gain essentially nothing, not even the advertised benefit of your stuff not randomly changing as it moves through the stack. The likely main benefit of canonicalization is for security and crypto, but unless you use son without numbers (or create your own subset of son) it's kinda useless for that. Of course you can just encode all your numbers in strings or something, but at the point where you're doing your own number parsing logic (which is the hardest bit), why not just use some well-designed actually canonical format like csexp?
> What is the behaviour of a decoder that can't handle numeric types of that size?
Is this actually relevant? It sounds like an implementation detail to me and it seems to boil down to using arbitrary size bigint. If you can't handle a bigint because it's too big for your machine, it sounds like a problem with the system, not parser.
The behaviour of the encoder and decoder are just as important to specify as the bytes across the wire. This is the whole problem with JSON in the first place.
We can play this game all day. What if I pass an integer that takes 32GiB of memory to store? Also the system's fault if the resources aren't available?
I'd say so. This is kind of a security question - the data is valid from the protocol's perspective, it's the contents that are malicious. It's up to the developer to decide how to handle that - if it's super important to be able to process such big numbers, modify the parser to handle that (for example by storing them on the HDD). Same goes for another serialization format: bencoding. It supports arbitrarily nested data structures and most implementations are recursive, which can easily result in stack overflow if you mess up the implementation. But I wouldn't blame the standard - that's the price you pay for flexibility. At the end of the day, you need to sanitize your input anyway and keep your edge cases in mind. Same goes with XML bombs...
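One way to "sanitize your input" against deeply nested documents is to measure the nesting before handing the text to a recursive parser. The following is a sketch under assumptions: `max_nesting_depth`, `loads_guarded`, and the limit of 64 are all invented for illustration.

```python
import json

def max_nesting_depth(text):
    """Return the deepest array/object nesting in a JSON text without parsing it."""
    depth = deepest = 0
    in_string = escaped = False
    for ch in text:
        if in_string:
            # Brackets inside string literals must not be counted.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest

MAX_DEPTH = 64  # arbitrary limit chosen for this sketch

def loads_guarded(text):
    # Refuse suspiciously deep input before the recursive parser sees it.
    if max_nesting_depth(text) > MAX_DEPTH:
        raise ValueError("JSON nesting too deep")
    return json.loads(text)
```

The scan is iterative, so it cannot itself overflow the stack, and well-formedness is still checked by the real parser afterwards.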
I would think the JS part of JSON would preclude any massive integers. It’s still an IEEE floating point at the end of the day, so your max is (2^53)-1
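The (2^53)-1 limit is easy to check with Python floats, which are IEEE 754 doubles. This only shows what happens when a decoder maps integers to doubles; it says nothing about what JSON itself permits.

```python
MAX_SAFE = 2**53 - 1  # 9007199254740991

# Every integer up to MAX_SAFE survives a trip through a double.
assert int(float(MAX_SAFE)) == MAX_SAFE

# Beyond that, distinct integers start to collide.
assert float(2**53 + 1) == float(2**53)
assert int(float("9007199254740993")) == 9007199254740992
```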
If you managed to load a JSON string into memory that contains a number that takes 32GiB of memory to store after parsing, I don't think you're worrying much about resources.
Decode it as a number with exactly that value as the numeric type the application requests. If the number is out of range, indicate that to the application (which should already have logic to handle numbers out of the range it cares about).
If the application doesn't request that you decode it, don't. Leave it in some internal format which does not lose precision. Limit yourself only to memory limits the application requests.
That's great, but most JSON libraries don't give you that level of access... they just decode into a bunch of nested maps/dicts consisting of your language's native data types. So one day you swap out your JSON library and the behaviour of your application silently changes.
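A rough sketch of that approach using Python's standard `json` module, whose `parse_int` and `parse_float` hooks receive the number's original text. The helper names `load_lossless` and `as_int` are invented here; numbers are kept as `Decimal` until the application asks for a specific range.

```python
import json
from decimal import Decimal

def load_lossless(text):
    # Keep every number as a Decimal so nothing is lost at parse time.
    return json.loads(text, parse_int=Decimal, parse_float=Decimal)

def as_int(number, lo, hi):
    """Return number as an int if it is integral and within [lo, hi], else None."""
    if number != number.to_integral_value():
        return None
    value = int(number)
    return value if lo <= value <= hi else None

doc = load_lossless('{"id": 18446744073709551615, "count": 576, "ratio": 0.5}')

assert as_int(doc["id"], 0, 2**64 - 1) == 18446744073709551615  # fits in a u64
assert as_int(doc["id"], 0, 2**32 - 1) is None                  # too big for a u32
assert as_int(doc["count"], 0, 255) is None                     # 576 does not fit in a u8
assert as_int(doc["ratio"], 0, 255) is None                     # not an integer at all
```

The out-of-range cases are reported to the caller instead of being truncated or wrapped.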
This is what unit testing is for. Test the methods that validate and return your expected object structures. This catches other things like a structure changing (as simple as a property being modified), ensuring you can handle malformed input (even if it's valid JSON) and any other data validation rules you have.
(There are competent JSON implementations, though they're few and far between. PostgreSQL's is the only good example I can point to off the top of my head.)
It stores it as positive, mantissa and exponent, allowing you to pull it apart in that way if you need to. Meanwhile, normal usage is via JsonValue (https://docs.rs/json/0.11.12/json/enum.JsonValue.html) which allows you to attempt to convert it into an f64, i32, u8, &c. as you desire, returning `None` if it doesn’t fit inside that type (e.g. -7 won’t go into a u32, 576 won’t go into a u8).
By what’s being talked about here, that looks to be a competent JSON implementation.
Wonderful. Good to know that about Rust, I haven't used it but everything I read about it looks good. Yes that is exactly the approach I meant (and a similar mechanism to that used by my own C++ JSON library I use for side projects).
Well, not the same; it eagerly interprets it as PosInt(u64), NegInt(i64) or Float(f64) rather than storing the mantissa and exponent. This is enough for JavaScript compatibility (its Number type being essentially a 64-bit float), but it’s not quite as flexible as the json crate’s approach.
Shouldn't they return a list of warnings/errors on the parse? Just blindly accepting a machine parse of some random input data is how you crash programs with bad input.
This isn't really a true canonicalization since it doesn't define an ordering. I was actually hoping it was. The matrix project [0] invented their own json canonicalization so that they could have servers meaningfully sign it. Was hoping some external standard of json canonicalization existed I could recommend them switching to.
Have you seen the relevant reference in RFC3629, page 2? It's explicitly listed there as a feature: "The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers"
Agree that specifying the keys to be ordered >by unicode code points< instead of >lexicographically< would be less ambiguous though.
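The RFC 3629 property, and why "lexicographically" alone is ambiguous, can be checked directly. In this sketch the sample keys are arbitrary; the point is that UTF-8 byte order matches code point order, while UTF-16 code unit order does not once characters outside the Basic Multilingual Plane are involved.

```python
keys = ["\U00010000", "\uffff", "a", "é"]

by_code_point = sorted(keys)  # Python compares str values by code point
by_utf8_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))
by_utf16_units = sorted(keys, key=lambda k: k.encode("utf-16-be"))

# UTF-8 byte order agrees with code point order.
assert by_code_point == by_utf8_bytes == ["a", "é", "\uffff", "\U00010000"]

# UTF-16 puts U+10000 (surrogates D800 DC00) before U+FFFF.
assert by_utf16_units == ["a", "é", "\U00010000", "\uffff"]
```

An implementation in a language whose native strings are UTF-16 would get the second ordering unless it takes care not to.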
I definitely meant ordering by Unicode code points. Someone very helpfully opened an issue and we're trying to figure out the right wording there: https://github.com/seagreen/Son/issues/13
Looks good. I still foresee interoperability problems between implementations, though. It just is too easy to mix up the ‘sort by key’ and ‘escape various control characters’ steps (CR sorts before ascii characters, but “\n” sorts after it)
Even if the spec requires it, I fear implementations will also canonicalize strings differently, breaking sort order.
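The mix-up is easy to reproduce. In this sketch the two keys are chosen for illustration: sorting the actual key strings and sorting their escaped JSON text give different orders, because the backslash of an escape sequence (0x5C) sorts after uppercase letters while the raw control character sorts before them.

```python
import json

keys = ["\r", "A"]

sorted_raw = sorted(keys)                      # compare the decoded key strings
sorted_escaped = sorted(keys, key=json.dumps)  # compare the escaped JSON text

# Raw: CR (U+000D) sorts before "A" (U+0041).
assert sorted_raw == ["\r", "A"]

# Escaped: '"A"' sorts before '"\\r"' because 'A' (0x41) < '\\' (0x5C).
assert sorted_escaped == ["A", "\r"]
```

Two implementations that each "sort the keys" can therefore emit different bytes for the same object.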
You could argue that Matrix’s canonical JSON form /is/ an open external standard ;) Meanwhile, Mastodon looks to have been implementing its own canonical JSON form recently too for signing purposes.
JSON loses its appeal once you're not targeting browsers and/or anything else that uses Javascript. Sure, it's simple and familiar, but it has serious drawbacks such as the time and CPU resources it takes for parsing and the lack of schema.
For those of you interested, I suggest using Google's protobufs or even full blown gRPC for such endeavors.
> Sure, it's simple and familiar, but it has serious drawbacks such as the time and CPU resources it takes for parsing and the lack of schema.
Before looking for something 'better' than JSON, build a basic prototype of your app and see if your application actually needs to use something else. The data in most applications is so small and will parse so easily that actually measuring the CPU utilisation will be quite hard. 99.99% of applications aren't moving data around at the sort of speed where the format really matters.
Protocol buffers are very clever and really cool but using them where you don't need them is a huge waste of effort.
As for the lack of a schema that's not true. There is a schema. What's in the file is defined somewhere. For applications that consume JSON that's in your code rather than the file format.
JSON will easily dominate CPU usage in any modern application that uses it for its API (and is not itself something intrinsically CPU-hungry like crypto or machine learning).
We benchmarked one of our apps, written in Go, which uses JSON extensively both for the API and for internal storage, and we found that around 90% of the CPU time is spent serializing and deserializing JSON.
The JSON code itself is a tiny portion of any app, so it's one of those rare optimization cases where a small change can yield significant speed-ups for the entire app.
Depends on the way you handle JSON. You don't need to deserialize the entire received JSON string and then walk the node tree, you can walk the string as is. With this, you can process 500MB/s or more of JSON on a single core.
On the other side, you can produce JSON string directly, saving the need to produce some intermediate representation in memory.
You should be able to saturate a 10 Gbit line on a multicore server easily with this approach.
Agreed; in data-centric apps the JSON parsing time is a major time suck. But sometimes it can be heavily optimised with a good library, like fastjson for Java.
There are formats that fall between unrestricted JSON and strict predefined schemas like Protobufs or Thrift. I have used MessagePack for internal APIs and message queues where JSON was too slow but we still wanted a self-describing format. You can decode msgpack for debugging without needing the original definition (there is none), unlike a fixed schema format.
> Before looking for something 'better' than JSON, build a basic prototype of your app and see if your application actually needs to use something else.
That implies that JSON is somehow the gold standard of serialization formats. I don't see it that way. The fact that JSON is ubiquitous in some areas doesn't mean that it's the gold standard for all other uses.
In my experience, protobufs are easy to work with, concise, safe, and efficient. There's never any ambiguity to them, and both creating and parsing messages is fast and easy. That doesn't mean that there aren't cases in which JSON is a better choice, for instance when working with browsers, or needing a text based format for debugging or inspection purposes.
> Protocol buffers are very clever and really cool but using them where you don't need them is a huge waste of effort.
In my experience protobufs require very little effort, with most of the effort being put into defining the schema. It's pretty frictionless afterwards with one notable exception - when you want to manually inspect a message. That's where JSON wins hands down.
> That implies that JSON is somehow the gold standard of serialization formats.
I'm not suggesting JSON is the best format, because that depends on the project you're working on. I'm saying it's good enough for most tasks, has ubiquitous support, and is easy to work with. There are better formats for particular use cases but if your application has requirements that can work with JSON it probably isn't worth finding something else.
Protobufs do require more effort in the toolchain. Pretty much the only way to parse a .proto file is with protoc and you might need to build it from source. So you need to run the C++ compiler and depending on the build system you're using, adding a dependency on protoc at build time can be a pain.
Contrast with JSON where every decent language ecosystem has a parser readily available. That's what makes it the "gold standard" for me.
Protobuffers have one major advantage over JSON: the schemas are versioned. When your requirements for the payload inevitably change, Protobuffers can handle that with a lot less muss and fuss than JSON can.
Plus, the byte count over the wire is much smaller - something that matters even in today's world of gigabit ethernet.
You have to compile the protocol buffers so there is overhead. Also the messages aren't human readable, eg through kafka, so you need to write something to look through the messages. JSON is faster if your dev team needs to get from A to B in a hurry, but protocol buffers are generally a better long term choice if you can put in the effort up front. Both have pros and cons.
Not sure what you mean by "schemas are versioned". Protobufs have backward compatibility rules so you can add fields without changing versions. I don't think there's anything about explicit versioning in the spec?
JSON, like XML, is supported by everything, including not only general purpose languages, but also by more specialized environments like MATLAB. Also its compatibility across implementations is very good, excluding support for comments and nonfinite numbers. Contrary to YAML, I mean; there's little reason to use XML except for legacy or markup purposes.
JSON-schema implementations are also rather widely available.
In performance-sensitive machine-to-machine applications, I agree, better to use something compiled, but for general-purpose interoperation and archive data storage (where you cannot know when and with what tool you'll want to open it), I've yet to see a compelling alternative.
To be fair, MATLAB only added support for reading/writing JSON this very year. And they represent JSON objects as structs, which don't support all key names, and they have a very "idiomatic" idea of which JSON arrays are decoded as matrices or cells, and how nested arrays sometimes result in nested arrays, and sometimes result in multi-dimensional arrays. And there's an uneasy relationship between the empty array and Null. It is kind of a mess, really.
Also to be fair, although I can see some uses for supporting JSON in MATLAB, it makes a lot more sense if you're using an array language to use an array-oriented format such as HDF5, which has been supported in all the major numerical platforms for quite a while.
Honestly, that statement applies to pretty much any part of MATLAB unless the behaviour is nailed down by existing mathematical or engineering specifications - and I'm talking as someone that likes MATLAB overall. Mathworks tries a bit to polish and paint it as a coherent whole, but it's a patchwork of things built upon and stuck to the sides of the existing structure, optimizing towards a particular kind of "pragmatism" rather than towards any kind of elegance.
I remember my colleague working with JSON in MATLAB via some thirdparty library about 2 or 3 years ago. It was regex-based, it was slow as hell but at least it was available.
Fun fact, I wrote my own JSON reader/writer in pure MATLAB, and it is also regex-based, and in many cases, it is faster than MATLAB's own, Mex-based JSON reader/writer: https://github.com/bastibe/transplant
There are few things as slow as non-vectorized MATLAB code. MATLAB's regex engine is pretty decent, actually, and supports quite a few fancy tricks. For example, I use the regex engine to find the start/end of all numbers and strings ahead of time, and for replacing escape sequences (even unicode escapes!) with their literal characters. The former is particularly important because otherwise, you'd have to manually iterate over characters until you find the end of strings or numbers, which is unacceptably slow in MATLAB.
Protobuf depends on the schema definitions and encoder/decoder generation. Without the schema, you cannot read the format (theoretically). If you want your data to be read easily, JSON or CBOR or MsgPack are your best options.
I've been using cbor with python and it's been just fantastic. In particular it: Can send a "None"; Can tell the difference between bytes and unicode; and is real fast.
schemaless is never really schemaless: the code that accesses your data is implicitly assigning it a schema, so you'll have the same problem.
FWIW, protobuf schemas are designed to be friendly for back and forward compatibility; unknown fields can be ignored and "required fields are generally considered harmful", interesting read: https://github.com/google/protobuf/issues/2497
> there's little reason to use XML except for legacy or markup purposes.
There's also the ability to easily and explicitly encode data that doesn't fit to JSON's limited types. A non-exhaustive example includes: sets, hash tables that don't use string keys, tuples, imaginary numbers, fractional numbers, and precision decimal numbers.
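For a sense of what "doesn't fit" means in practice, here is how Python's standard `json` module treats some of those types. This is a sketch of one library's behaviour, not a statement about JSON libraries in general, and it says nothing about how XML would encode the same values.

```python
import json
from decimal import Decimal
from fractions import Fraction

# Some types are silently converted, losing the distinction.
assert json.dumps((1, 2)) == "[1, 2]"            # tuple becomes an array
assert json.dumps({1: "one"}) == '{"1": "one"}'  # integer key becomes a string key

# Others are rejected outright.
unsupported = []
for value in ({1, 2}, 1 + 2j, Fraction(1, 3), Decimal("0.10")):
    try:
        json.dumps(value)
    except TypeError:
        unsupported.append(type(value).__name__)

print(unsupported)  # ['set', 'complex', 'Fraction', 'Decimal']
```

Any of these can still be carried in JSON, but only through an application-level convention that both sides have to agree on.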
There's also the fact that XML explicitly supports streaming interpolation of documents, something which can speed up interactivity quite dramatically.
JSON has drawbacks, but I think you skipped one of its biggest benefits: it encodes in reasonably readable plain text.
Readable plain-text protocols remove a constant overhead from all debugging/introspection tasks. If you use, say protobufs, you'll still have to convert back into strings to see what was in individual messages. This makes reading, for example, logfiles or wireshark captures with "grep" more difficult. Sure, there are tools that you can throw before "| grep" that convert to text, but that's one more piece of cognitive/debugging overhead you have to remember, and one more thing that new developers have to know to do if they're used to the standard unix/text-oriented way.
That benefit applies even if your messages (or logs of messages) are stored/viewed in a database that speaks your binary protocol and provides tools to search/see the data as text. Whether or not you have something like that, it will always (always) be routinely necessary, or at least incredibly helpful to productivity, to "print out or search the textual content of messages sent/received by my code" when developing and testing apps that communicate over the network. A text protocol has a huge advantage in that department--or rather, it has a very small advantage that is present in a very large number of development/testing tasks.
Also, readable plain text is key. XML dialects, or e.g. Thrift's JSON format are plain text, but are not terribly readable. This is subjective, to be sure, but makes a big difference in practice.
There are great reasons to use binary protocols, but any comparison that leaves out the benefits of text content that is readable "for free" without using per-protocol tools is lacking.
So this comes up constantly and I tend to agree with it, but it has always bugged me. We are programmers and we control the machines, so why can't we code our way out of this problem? Why isn't the very first thing one does when working on a protobuf system installing a custom version of utils and tools that deals with the wire format better? E.g. grep, less and wireshark that autoconvert protobuf messages?
(I know there are good reasons for this involving the effort required and maintenance etc. it just bugs me.)
Because Unix. Unix was designed as using ASCII text as a lowest common format, and all its tooling follows that design, and nothing has displaced this. If, for example, a Lisp system or a Smalltalk system had won out, I am sure we would all now be lamenting the inadequacies of s-expressions or Smalltalk objects instead of complaining about ASCII.
While the "everything assumes unstructured text; destructure and restructure at will" paradigm is still incredibly powerful, and set the bar super high for data interoperability and uniformity, I think it lowered expectations for [meta]data complexity that could be supported by simple, small programs.
In Windows, PowerShell is a very good example of a low-ish-level (composable commandline apps and snippets) programming environment that isn't shackled by that paradigm.
To do this right you'd need a common registry with all the proto schemas you'll ever want to use. It's a solvable problem, but it's comparable to setting up a new build ecosystem and getting people to use it. (Consider TypeScript and .d.ts files.)
> Readable plain-text protocols remove a constant overhead from all debugging/introspection tasks.
As others have pointed out, text-based serialization has significant overhead. To save a few hours of debugging, you're willing to spend gigabytes of bandwidth, an uncountable number of CPU cycles and memory, and all the associated electricity and hosting costs over the lifetime of your service? Penny wise and pound foolish, to say the least.
Text-based serialization formats are just plain stupid for any type of service that handles over 100 requests per second. A good developer is not hampered in the slightest by the fact that their serialization format isn't text.
Would I switch to a more computationally expensive format to save a few hours debugging? Probably not.
Would I switch to a more computationally expensive format to save a few hours debugging each time someone makes substantial changes to my application's message format? Very possibly.
Would I switch to a more computationally expensive format to prevent bugs (the easier it is to view, or remember how to view, what your code is sending/receiving, the less likely it is that lazy people will skip bug-hunting/QA steps that would show it)? Almost certainly.
Should you use a binary format if you're sending tons of data per second? Well, it depends. It's like microservices: if you have the tooling to make dealing with that format from the dev/debug/tracing side a breeze, go for it; using binary provides substantial savings, as you point out. If you don't have the tooling/time to make it as easy as text to introspect, and I mean that in the most absolute way, there might still be persuasive reasons to switch, but know that if you do, you are also buying a lot of "debuggability debt".
I’d skip protobufs too. They’re horrible for performance. Many data center apps blow more cycles there than on compute. For actual applications where performance is necessary, binary “strict” formats are the only way to go. Protobufs are nice for ease of use, but they’re horrible otherwise.
You’d need to repackage it in some form to make it palatable for modern audiences. Let me take a quick stab.
Make “ASN.2” or something, update the fundamental data types to modern standards (think of what changed from NeXTSTEP to Mac OS X; they changed from TIFF to PNG, from PostScript to PDF, etc. but kept the overall architecture the same). Consolidate and simplify the many options (but do not eliminate options entirely). And call it… JASN or something, which sounds like both JSON and ASN, and make up some acronym where the “J” stands for “Joint” (like in JPEG, which stands for Joint Photographic Experts Group).
And, maybe most importantly, make freely licensed tooling available for both converting to and from ASN.1 and JSON, but also jgrep, jless, jcat, etc. (similar to zgrep, zless and zcat, etc.) Make these packages available for all common operating systems, readily installable using that platform’s usual mechanism.
>but it has serious drawbacks such as the time and CPU resources it takes for parsing and the lack of schema.
Ditto. And if you still prefer the schema-free approach of JSON but with faster encoding and decoding support across all common programming languages, take a look at MessagePack.
Unlike JSON, MessagePack is optimized for speed and encodes common data types efficiently. Unlike Protobuf/gRPC, it requires no centralized schema.
I say that a bit tongue in cheek, but really: BSON has a monstrous amount of basic data types, some of which make very little sense and are very difficult to round-trip through other formats; and MessagePack had crippling basic sanity issues with what-are-bytes-vs-strings in its early history and that's something you simply can't recover from without declaring game over and making a new spec.
CBOR has neither of these issues. It's almost completely isomorphic to json, plus supports byte arrays. It's very nice.
Does CBOR support lazy/partial deserialization? It's not a make-or-break feature, but the ability to pluck out individual fields from large MessagePack documents without deserializing the whole thing has been a significant performance/memory savings for me many times.
I haven't played with flatbuffers, but it probably supports this as well, given its design. Are there other binary protocols that have first-class (i.e. present with a good API in the client libraries for all major languages) support for lazy/partial deserialization?
More cbor ranting... In Python you can replace json.loads/dumps with cbor.loads/dumps with no further changes. Going back again may prove to be an issue when you discover just how much it is that json can't do.
The json schema isn't a standard but is ubiquitous in California and in advertising (IAB standards). I don't know of anyone using another, or why you might.
It's like saying ruby isn't the only ruby language because it isn't a formal standard. So what?
I don't think JSON schema is bad at all, for what it is.
Remember that the more modern binary protocols like protobuf/etc. don't serve the same niche as JSON schema at all: JSON schema validates external documents; i.e. you take a schema and make sure a document complies with it. Protobuf/flatbuffers/etc. validate internal documents, as they are encoded or decoded. If that's all you need, then the schema definitions for those products will be much friendlier than JSON-schema, but that's comparing non-equivalent things.
Another, minor quibble is that a lot of the schema-validation code in e.g. protobuf or Thrift is often generated by the framework, and the generated code can be quite hairy or confusing. There exist code generators for JSON schema as well, but writing a JSON schema by hand isn't comparable to, say, generating code from the Thrift IDL.
Apache Avro follows a rather nice hybrid approach where the schema is required for decode (it's an untagged binary encoding) but the schema can be loaded dynamically into the library.
"You can add new fields to your message formats without breaking backwards-compatibility; old binaries simply ignore the new field when parsing. So if you have a communications protocol that uses protocol buffers as its data format, you can extend your protocol without having to worry about breaking existing code."
> Example: when you add a new element to the API response, the recipient needs to update their schema.
That's not the case with protobufs. The sender can add new fields, and they will simply be invisible to receivers using the previous version of the message.
You'd need to update the receiver only if you want it to access the new fields, which is what you'd need to do regardless of serialization format.
> The contract between producer and consumer should never be fragile around additional values existing.
... it should never be fragile where fragility might cost more than it gains you. The open/closed principle is the right thing to follow in most cases, but not all.
For example, choking on unrecognized additional fields by default in development is often a big time/bug-saver, since it gives you an early "hint" that things might not be speaking the same dialect to each other.
Similarly, if you have a message convention that is heavy on "modifiers" which control irreversible things, sometimes it's better to be fragile rather than count on orchestration systems to work perfectly 100% of the time. Suppose you have a system for buying bicycles with a required field of "bike model", and modifiers of "color", "wheel size", etc. Now let's say your servers all got patched to expect a new modifier of "frame size" . . . except for one server which didn't restart due to a deployment bug, and is still serving traffic. If a client sends that unexpected additional modifier to the "old" server, and it gets ignored, this could result in sending a production order for an expensive unit with the wrong specifications. If not caught, that cost could be compounded by sending the wrong unit to the customer and pissing them off. Now, this is clearly not a problem with the message format, but rather with the orchestration/release system. But bugs in orchestration layers are incredibly common, and baking a bit of fragility into the messaging layer at the right places can help mitigate those bugs' impact.
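The bicycle scenario can be sketched as a receiver with a switch between tolerant and strict handling of unknown fields. Everything here is hypothetical: the field names, `KNOWN_FIELDS`, and `parse_order` are invented to mirror the example, and JSON is used only because it keeps the sketch self-contained.

```python
import json

# Schema known to the "old" server that missed the deployment.
KNOWN_FIELDS = {"bike_model", "color", "wheel_size"}

def parse_order(payload, strict):
    order = json.loads(payload)
    unknown = set(order) - KNOWN_FIELDS
    if unknown and strict:
        # Fragile on purpose: refuse to act on a message we only partly understand.
        raise ValueError(f"unrecognized fields: {sorted(unknown)}")
    # Tolerant mode: silently drop what we do not recognize.
    return {k: v for k, v in order.items() if k in KNOWN_FIELDS}

payload = '{"bike_model": "X1", "color": "red", "frame_size": "L"}'

# Tolerant: the order goes through without the frame size.
print(parse_order(payload, strict=False))  # {'bike_model': 'X1', 'color': 'red'}

# Strict: the mismatch surfaces before anything gets built.
try:
    parse_order(payload, strict=True)
except ValueError as err:
    print(err)  # unrecognized fields: ['frame_size']
```

Which mode is right depends on the cost of acting on a partially understood message, which is the trade-off described above.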
When performance matters being the operative concept. Don’t optimise prematurely and throw out all the benefits of a textual format in favour of a binary one, unless you need to.
This spec mostly concentrates on simplifying the syntax of numbers, but it does not do anything about one of the big omissions in JSON: it does not specify the range of integers or the precision of fractions. There are a lot of interoperability problems in that omission.
(The other thing SON does is require unique keys, which is a good improvement that fixes an interop bug that can cause security vulnerabilities.)
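To make the duplicate-key problem concrete, here is a small sketch using Python's standard `json` module. The helper names `reject_duplicate_keys` and `loads_unique` are my own; the point is only that a stock decoder quietly keeps the last value, while a strict one can refuse the document:

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def loads_unique(raw: str):
    # object_pairs_hook receives every object as a list of (key, value) pairs.
    return json.loads(raw, object_pairs_hook=reject_duplicate_keys)


# The default decoder accepts the document and keeps the last value.
lenient = json.loads('{"admin": false, "admin": true}')
```

Two parsers that disagree on which duplicate wins will disagree on what the document says, which is where the security issues come from.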
There are no integers in JavaScript; numbers are always represented as doubles.
Most implementations of JSON respect that. The errors people get arise because numbers are converted from double to something else that's not equivalent. Precision is around 15 digits and the usual rounding issues apply.
JSON is not JavaScript, and you can write valid JSON that isn't valid JavaScript (using the unusual line separators U+2028 and U+2029, at least before ES2019 allowed them in string literals).
No JSON standard restricts the range of numbers (except for universally prohibiting NaN and infinity, making JSON numbers clearly not IEEE doubles). The standards warn of potential interoperability issues with numbers outside the range representable as a double, but they do allow all numbers representable in base 10.
You're wrong. There isn't a JSON specification anywhere that says numerics are IEEE 754 doubles. The RFC mentions "good interoperability" can be achieved by using doubles, but does not say this MUST or even SHOULD be supported.
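A quick illustration of why the missing range matters, sketched with Python's standard `json` module. The int64 bounds and the `bounded_int` / `loads_bounded` names are assumptions chosen for the example; no JSON or Son specification mandates them:

```python
import json

# Assumed limits for this sketch: the signed 64-bit range.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def bounded_int(token: str) -> int:
    """Parse an integer literal, refusing anything outside the int64 range."""
    value = int(token)
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"integer out of range: {token}")
    return value


def loads_bounded(raw: str):
    # json.loads calls parse_int with the literal text of every JSON integer.
    return json.loads(raw, parse_int=bounded_int)


# Python's own decoder keeps big integers exact...
exact = json.loads("18446744073709551615")
# ...but a decoder that stores every number as a double silently rounds.
as_double = float(exact)
```

Here `exact` is 2**64 - 1 while `as_double` is exactly 2**64, so the same document yields different values depending on the decoder. `loads_bounded` takes the third option and raises instead of guessing.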
I'm not working with binary protocols that often, so I cannot point to an existing protocol. However, if I had to implement a serialization format that doesn't need to be human-readable, I would never use number literals. I would write numbers as one byte whose value means e.g. "the next 4 bytes are a uint32 in little-endian", followed by just these 4 bytes.
Then you also don't have to worry about whitespace, because there is none. The next field can follow directly afterwards, because the type header byte defines how long exactly the current field is. (For variable-length strings, use something like netstrings.)
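Roughly what that could look like, as a sketch only. The tag values `0x01` and `0x02` and the function names are invented for illustration and do not correspond to any existing format:

```python
import struct

# Hypothetical type tags: one header byte announces the type of the payload.
TAG_UINT32 = 0x01  # next 4 bytes: unsigned 32-bit integer, little-endian
TAG_UINT64 = 0x02  # next 8 bytes: unsigned 64-bit integer, little-endian

_FORMATS = {TAG_UINT32: "<I", TAG_UINT64: "<Q"}


def encode_uint(tag: int, value: int) -> bytes:
    return bytes([tag]) + struct.pack(_FORMATS[tag], value)


def decode_all(buf: bytes) -> list:
    """Decode back-to-back fields; the tag alone says where each one ends."""
    values = []
    offset = 0
    while offset < len(buf):
        fmt = _FORMATS.get(buf[offset])
        if fmt is None:
            raise ValueError(f"unknown type tag 0x{buf[offset]:02x}")
        (value,) = struct.unpack_from(fmt, buf, offset + 1)
        values.append(value)
        offset += 1 + struct.calcsize(fmt)
    return values
```

For example, `decode_all(encode_uint(TAG_UINT32, 7) + encode_uint(TAG_UINT64, 2**40))` recovers both numbers with no separators or whitespace between the fields.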
The only problem that I can see with this is that Son wants to ensure deterministic encoding, but in my example, 0 could be encoded as an integer of any size or signedness. You could add a rule like "encode every integer as the smallest possible type", but that may have practical implications elsewhere.
I wouldn't be too worried about it, seeing how half-baked Son appears to be. For example:
> Object members must be sorted by ascending lexicographic order of their keys.
Yeah right, because there is only one lexicographic order in the world.
> Yeah right, because there is only one lexicographic order in the world.
The specification tells you all strings (keys) must be UTF8. There is only one lexicographic ordering of UTF8 strings (RFC3629, page 2). This should probably be explicitly mentioned in the linked specification though.
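A sketch of where the ambiguity can still creep in, assuming keys are compared as raw UTF-8 bytes (my reading of the argument above, not something I have checked against the Son text). Byte-wise UTF-8 order agrees with code point order, but an implementation that compares UTF-16 code units gets a different answer for characters outside the Basic Multilingual Plane:

```python
def utf8_key_order(keys):
    """Sort keys by their UTF-8 bytes (same result as code point order)."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))


def utf16_key_order(keys):
    """Sort keys by UTF-16 code units, as a UTF-16 based runtime might."""
    return sorted(keys, key=lambda k: k.encode("utf-16-be"))


# U+1F600 is encoded as a surrogate pair (D83D DE00) in UTF-16,
# which sorts below U+FF5E even though its code point is higher.
keys = ["\U0001F600", "\uFF5E", "a"]
by_utf8 = utf8_key_order(keys)
by_utf16 = utf16_key_order(keys)
```

The two results differ in the position of the emoji key, so a spec that only says "lexicographic" leaves room for implementations to disagree unless it names the unit being compared.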
> The only problem that I can see with this is that Son wants to ensure deterministic encoding, but in my example, 0 could be encoded as an integer of any size or signedness.
To be pedantic: Not really. Assuming your example uses two's complement (what else would it use) and you meant "int32" instead of "uint32", negative zero and positive zero would still be the same encoded value. Also, since you said "[u]int32", there is only one size the integer can be: 32 bits. So, in your example, the literal value zero has one and only one unique binary representation, which is "0x00000000".
I implied that a useful serialization format would include integers of multiple sizes. When your field only needs a uint16, you don't want to pay the extra 6 bytes to encode as, say, uint64.
You could stipulate (as you suggested) that the smallest width that still fits a number is always used, prefixed with a length byte, but that would be wasteful. Ideally something like LEB128 would be used. At any rate, it's not an issue in terms of guaranteeing a bijective mapping.
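For reference, a minimal sketch of unsigned LEB128. One caveat I am adding: plain LEB128 decoders usually tolerate zero-padded encodings such as `0x80 0x00` for zero, so getting a one-to-one mapping means the decoder has to reject them, which this sketch does. The function names are my own:

```python
def encode_uleb128(n: int) -> bytes:
    """Encode a non-negative integer using the fewest possible bytes."""
    if n < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(buf: bytes, offset: int = 0):
    """Return (value, next_offset), rejecting zero-padded encodings."""
    start = offset
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            break
        shift += 7
    if offset - start > 1 and byte == 0:
        raise ValueError("non-canonical (zero-padded) LEB128 encoding")
    return result, offset
```

With the padding check in place, every integer has exactly one accepted byte sequence, and small values stay small without a separate length prefix.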
Pretty much every binary format will encode integers using a fixed width or varlen scheme in "base 2".
This generally is done for two major reasons: First of all, such an encoding is significantly easier and cheaper to parse than a base10+ascii (human-readable) encoding.
I encourage you to write a parser that reads a fixed, 32-bit binary number (1) and another parser that reads a JSON-formatted number string into an internal variable in a classic language like Java or C++. You will immediately see the big difference in complexity. Make sure your parser can also deal with a message that contains more than just one number, i.e. the parser should be able to tell at which byte index an encoded number begins and ends. Even if you're using a language or library where this is hidden from you (e.g. by using parseInt or std::stod) the same work still happens behind the scenes.
The other reason is that for most numbers and fixed/varint encoding schemes, the "binary" representation will be much more compact. Storing the number "1000000" in base10+ascii (human readable) takes at least seven bytes, plus a delimiter. Storing the same number in a fixed 32-bit integer encoding takes four bytes. Using a varint encoding scheme might allow you to get down to three bytes.
(1) Ignoring stuff like byte order and representation of negative numbers; this is usually fixed in the protocol/format specification.
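A rough sketch of the two parsers side by side, in Python for brevity rather than Java or C++. The function names are made up, and the fixed-width reader assumes little-endian unsigned integers, per the footnote:

```python
import re
import struct

# JSON number grammar: optional minus, integer part, optional fraction/exponent.
_JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")


def read_uint32(buf: bytes, offset: int):
    """Fixed width: the number always ends exactly 4 bytes later."""
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4


def read_json_number(text: str, offset: int):
    """Textual: scan to find where the number ends, then convert it."""
    match = _JSON_NUMBER.match(text, offset)
    if match is None:
        raise ValueError(f"no JSON number at offset {offset}")
    token = match.group(0)
    is_integer = not any(c in token for c in ".eE")
    value = int(token) if is_integer else float(token)
    return value, match.end()


# Size of the same value in each representation.
ascii_size = len("1000000".encode("ascii"))    # the seven ASCII digits
fixed_size = len(struct.pack("<I", 1000000))   # one 32-bit integer
```

Even with the regex and `int()` hiding most of the work, the textual reader has to discover the token boundary and pick a target type, while the binary reader knows both up front.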
There's a github issue on this, titled "Son inherits the weakness of JSON by not specifying the numeric value ranges" [1], in case anyone has anything concrete to suggest.
I see many comments here on the efficiency of binary formats versus JSON. While a binary format can certainly be more efficient, especially with a well designed and implemented protocol, just remember that you don't automatically get massive gains by using one:
* JSON compresses well, and gzipped JSON is usually smaller than a corresponding binary format without compression.
* Being such a common format, practically every language has a highly optimized JSON parser. Just because a binary format can be faster to parse, doesn't mean the implementation is necessarily faster.
As a specific example, I've had experience in the past where the most common protobuf implementation for Ruby at the time was an order of magnitude slower than the default JSON parser (it has probably improved by now).
So by all means use a binary format for (internal) machine-to-machine communication if the performance gains are significant, but don't just assume that JSON will be too inefficient.
"Just because a binary format can be faster to parse, doesn't mean the implementation is necessarily faster." - except "JSON compresses well, and gzipped JSON is usually smaller than a corresponding binary format without compression", gzip takes some cycles...
I mean, I can't think of many situations where I'm debugging something by looking at raw bytes when I couldn't easily just dump the decoded representation instead.
People don't use BCD for ease of debugging do they?
I mean, maybe it's worth doing anyway, but it's an extra bit of friction. Plus you probably can avoid issues where the binary doesn't decode properly because of version changes, different environment, etc.
In Javaland BSON for production messages performs close to JDK serialization, and for debugging it's not hard to override the content types at runtime (or log the messages in readable form where they are getting serialized).
I remember reading this a while back. I think the SON specification could tighten up a few more things, like number ranges and further string escaping problems.
Limiting the size of numbers is intentionally left out of Son right now, because I'm not sure where that limit should be. My hope is that we eventually have a good schema language for JSON (something like JSON Schema, though I'm not sure about that one in particular) and then people can set number ranges in their schemas based on their particular use case.
There shouldn't be any remaining string escaping problems though! If you find one definitely open a GitHub issue.
Somewhat off-topic, but I had a need recently for a simple serialization format that satisfied the following:
* Support for undef/null
* Support for binary data
* Support for UTF8 strings
* Universally-recognized canonical form for hashing
* Trivial to construct on the fly from SQLite triggers
Of all of the formats I looked at nothing matched. So I created Bifcode [https://metacpan.org/pod/Bifcode]. It is a bit of a mix between Netstrings, Bencode and JSON. It is not really human readable, but it is extremely simple and robust, easy to generate and parse, and therefore relatively secure.
I'm not promoting it for any particular use case, and I only have a Perl implementation. Posting here as I think some of the audience in this thread may find it useful.
Daily reminder that there must be a sane timeline where telcos didn't proliferate ASN.1 to death, and machine-to-machine communication was solved decades ago.
No exponential notation?? I don’t see why this change makes it suitable for machine-to-machine communication. This looks like it's more for human readability than for machines. It surely removes some branches, but I don’t think it will improve decoding performance much. At the very least, it has to provide a convincing reason to use it over JSON, e.g. a performance comparison.
I'm curious about how often enough trailing zeros occur in real world documents for lack of exponential notation to have a noticeable impact on document size.
Man, I want the opposite. If it is machine to machine I can use a machine to decode it. For JSON I want comments, trailing commas, and a few other things I can't think of at 9am Saturday.
[1] https://housejeffries.com/page/7