>I'll be calling a "ternary digit" a "trit", like a "binary digit" is called a "bit".
Shouldn't it be a "tit" if you want to be consistent? In fact, please let's make that happen. The HR execs are gonna go crazy when they hear the devs talking about tits all day.
It's a joke comment but it made me realize people see the "bit" shortening coming about differently. It could be any one of (and maybe even more, I just had to stop thinking about this before I ended up spending all day on it):
- BInary digiT [BIT]
- BInary digIT [BIT] (i.e. "merge on the shared vowel")
- Binary digIT [BIT]
That's consistent with the common usage of trit if you take it as a "trinary digit":
- TRInary digiT [TRIT]
- TRInary digIT [TRIT]
- TRinary digIT [TRIT]
But if you're one to stick to ternary digit then you can end up with all sorts of unique combos:
- TErnary digiT [TET]
- TErnary digIT [TEIT] (no shared vowel to merge on)
- Ternary digIT [TIT] (the joke above)
Personally, the middle options for the first 2 sections (BIT and TRIT via shared vowel) are what I always mentally jumped to. If I was going to stick with "ternary" terminology I think, from a pure mapping of mental mechanics, I'd have ended up on TET (i.e. when there is no shared vowel assume the vowel was from the first word instead of the 2nd). The problem with that being TET conflicts with base 4, leaving you with either the awkward double vowel or the awkward end vowel pull. Or just calling it a trinary digit where you get the same consistency results as from BIT.
Thank you for coming to my TED talk (the things a hacker thinks about while burning their PTO I suppose?)
> Personally, the middle options for the first 2 sections (BIT and TRIT via shared vowel) are what I always mentally jumped to. If I was going to stick with "ternary" terminology I think, from a pure mapping of mental mechanics, I'd have ended up on TET (i.e. when there is no shared vowel assume the vowel was from the first word instead of the 2nd). The problem with that being TET conflicts with base 4, leaving you with either the awkward double vowel or the awkward end vowel pull. Or just calling it a trinary digit where you get the same consistency results as from BIT.
If I had been trying to come up with something, I probably would have started with an approach of picking something that was as similar-sounding to "bit" as possible so it would be recognizable and then looking for a justification I could use to argue why it was consistent rather than trying to figure out what the origin of "bit" is and then utilizing the same formula. Interestingly, this would have ended up at basically the same place as you, since I would have first thought of "tit", rejected it due to it not seeming likely to get adopted due to the connotations, and then picked "trit" as the next closest thing.
It's pronounced 'b-it' like 'dig-it', not 'bye-t' like 'bye-nary'. Going by linguistics the 'i' must come from 'digit', and a 'bit' is therefore most plausibly a 'b'inary dig'it'.
A 'ter'nary dig'it' would be a 'terit', which elides to 'trit'. That sounds like a natural construction, probably why it came about in the first place. You can't drop the 'r' willy nilly because the proto-Indo-European root etymon is 'tr-' not 't-', same in most descendant languages, and 'ter-' is one unexplained corruption of 'tr-' somewhere in the Latin line.
We have more to go on than binary though. Hexadecimal "digits" are universally known as "hexits", not "hexats" -- so "it" must be the suffix, not the "t" by itself.
Nice analysis, but you missed the most obvious one, which is also what I was deriving from in the above comment: Pronunciation. The "i" in "bit" is pronounced exactly like the second "i" in "digit" and not like the "i" in "binary", which sounds more like the "i" in bite. So the "it" in "bit" has to come from digIT. Consequently, the most logical shortening for "Trinary digIT" must be TIT.
That's the "Binary digIT [BIT]" line. Do you always go by the input sounds rather than the output reading, though, or just sometimes? E.g. for "special weapons and tactics", do you say it like you would in "the cat swat at the toy", or more akin to "sw@" (same first half, but then "at" for the second half because the 'a' comes from a short 'a')?
Of course, just for GIF alone I wouldn't mind if your method were the universally obvious way... but now I might as well talk about religion and politics too!
Linguistically, I can see the arguments on both sides. But to me, the much lower level of tongue dexterity required to speak the three-letter version vs. adding in the r is substantial. To be clear, I'm not actually advocating for it in the context of contemporary cultural norms and slangs. Objectively speaking though, one is physically a much simpler movement to speak than the other, much closer to the physical simplicity of "bit". I readily notice this as someone who stutters and underwent various speech therapy as a child, then later in life lost three upper incisors to a car accident and had to retrain himself how to pronounce certain sounds. Not saying that should be a dominant coefficient in the evaluation, just that physical reality is something to consider and acknowledge with some nonzero weight.
but there are no "trits" in his scheme, he's using base 343 represented in base 2, from which trits can be extracted through a reasonably laborious conversion.
also, bits have the dual nature of being boolean logic values, and integers (in fact, arithmetic is done by logic). Bits deserve the honor of their name, and pretenders to the throne are hoping for stolen valor.
A more mathematically rigorous implementation might consider the incompleteness theorems and actually need a third option between true and false (i.e. undecidable). There are non-Boolean algebras out there, such as Kleene's. So us seeing Boolean algebra as the king is mostly a historic outgrowth, not a fundamental truth.
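As a toy illustration (my own encoding, not anything from the thread): Kleene's strong three-valued logic falls out naturally if you map false/unknown/true to -1/0/1, because AND and OR then become min and max.

```python
# Sketch of Kleene's strong three-valued logic.
# Encoding (mine): -1 = false, 0 = unknown/undecidable, 1 = true.
def k_not(a):    return -a
def k_and(a, b): return min(a, b)
def k_or(a, b):  return max(a, b)

assert k_and(1, 0) == 0     # true AND unknown = unknown
assert k_and(-1, 0) == -1   # false AND unknown = false
assert k_or(1, 0) == 1      # true OR unknown = true
assert k_not(0) == 0        # NOT unknown = unknown
```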
I dug into this once and the "theoretical ideal" of 3 originated in a 1950s paper about vacuum tube computers, which itself immediately backed off and said the choice of base 2 is frequently justified.
In this case, the context is {-1, 0, 1} weights in an LLM, which I don't think is being used for any hardware-efficiency argument. I think it's just quantizing weights into 3 states.
I'm curious what the benefits of ternary numbers are here? The section on balanced ternary numbers in Knuth's work was super fun to read; but I didn't realize this was relevant nowadays?
At the outset, it seems like it is more about storing ternary weights in binary without having to convert a multidigit ternary number? That is, if you have 20 weights, if this was binary, you could store all 20 weights as a single 20 bit number. I'm assuming you could convert the 3^20 number to binary and do the same? Is that roughly what is happening, here, with most of the effort in how to get out each trit?
Edit: I should add that I was largely misunderstanding parts of the article. I was reading this as akin to BCD, but with ternary. Reading it without that misunderstanding, I can see that they are largely doing what I said here.
Ternary is a thing; however, there are a lot of times when you're hand-rolling a compression scheme. In those cases, it is common to consider all sorts of nonstandard bases. So, for instance, if you need to encode a pair of numbers, and that pair is in the (rectangular) range `[0,2]*[0,4]`, then it can be efficiently encoded in a 4-bit value, using the generalized version of the technique outlined in the article. I know that a lot of GPU block textures rely pretty heavily on products-of-weird-bases, usually 3, 5, and 7. Here are some 'good' combinations (high utilization rates, easy to decode):
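To make the pair example concrete (my own sketch; `pack_pair`/`unpack_pair` are made-up names): a pair in the inclusive range [0,2]×[0,4] has 3×5 = 15 possible values, so it fits a 4-bit value with 15/16 ≈ 94% utilization.

```python
def pack_pair(a: int, b: int) -> int:
    """Mixed-radix pack: a in [0,3), b in [0,5) -> one value in [0,15)."""
    assert 0 <= a < 3 and 0 <= b < 5
    return a * 5 + b  # fits in 4 bits, since max is 2*5 + 4 = 14

def unpack_pair(v: int) -> tuple[int, int]:
    """Inverse: one divide and one modulo recover the pair."""
    return v // 5, v % 5

# round-trips for every pair in the rectangle
assert all(unpack_pair(pack_pair(a, b)) == (a, b)
           for a in range(3) for b in range(5))
```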
Doom also famously redefined a circle as 256° and reworked all of their trigonometry and polar coordinate translations to match. And so the smallest increment of angle you could do in the game was around 85 arc minutes.
At the physical layer, this makes a ton of sense. Apologies for completely ignoring that part. I was specifically curious on encoding each "digit" of a ternary number independently in a computer. Which.... yeah, that isn't what this article was even saying.
I don't know what the benefits of ternary numbers are in this particular case, but a similar technique is used in ASTC[1].
In this case, a fixed-size block of pixels stores one or more pairs of colors using a lower resolution than your usual 8bpp. Then, these color "endpoints" are interpolated using weights that allow you to approximate the original colors. For storing the endpoints, sometimes you can get a better trade-off of resolution/bit if you can pack ternary numbers in the bitstream instead of being restricted to power-of-two boundaries.
It's not directly relevant "nowadays", but it is also fun to note that hardware ternary computers existed, for example: https://en.wikipedia.org/wiki/Setun
Also, I forget which RAM format it was, but I recall that one of the RAM formats still in use is essentially ternary at the physical hardware level and uses the "sign" of the (-1, 0, 1) trit as a "check half-digit" for very fast checksums in hardware (and then the absolute value as the bit), because the physical hardware was going to be three-state anyway and cheap checksums are something RAM will always want.
At a base level, this is just how to store numbers and convert between radixes?
That is, the confusion I had in my thread was I was thinking "pack numbers" meant something more than just storing a radix three number. Where the "packing" was just taking advantage of the standard positional values.
For example, consider that the number 123 can be seen as three numbers 0-9 as opposed to one number. How do you extract the three numbers? It can take a few instructions for similar reasons as the ternary case.
Stated differently, if you can see how the numbers 0x12, 18, 022 (octal), and 0b10010 (binary) are all the same number, then you can see how the storage of them is roughly all the same.
Now, for fun, you can also see how getting out the different "digits" of a number can be very different based on the base and limiting yourself to only working with binary shifts.
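To make that concrete, here is a small sketch (mine, not from the thread): extracting the i-th digit of a number in an arbitrary base is one divide and one modulo, and only for base 2 (or other powers of two) does the divide collapse into a cheap shift.

```python
def digit(n: int, i: int, base: int) -> int:
    """The i-th digit (least significant is i = 0) of n written in `base`."""
    return (n // base ** i) % base

assert digit(123, 1, 10) == 2                       # the '2' in 123
assert digit(18, 2, 3) == 2                         # 18 is 200 in ternary
assert digit(0b10010, 1, 2) == (0b10010 >> 1) & 1   # binary: just a shift
```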
Right, by "packing" problem this article maybe means more of a "word-alignment" problem: finding the smallest "word" that most efficiently stores base-3 words in base-2 words with the least wasted values in that word. It's an interesting coincidence that 5 trits per 8 bits is rather optimal, because the 8-bit "byte" is such a useful and important "word" shape in modern binary computers.
For the other side perspective, as amelius asked: the historic Setun used 18-trit words, and that is surprisingly efficient at storing 28-bit words (0.643 trits per bit; the most efficient up to 32 bits of possible word alignment is 30 bits in 19 trits at 0.6333… trits per bit) if for some reason you needed to do interop. Of course 28-bit words aren't a standard binary word size, and if you wanted to find a word alignment that fits both modern binary computers and Setun, it might be 16 bits in 11 trits (0.6875 trits per bit).
I suppose if you were building ternary hardware today you might shoot for a 41-trit word to align with modern 64-bit words (0.641 trits per bit).
This is exactly a Succinct Data Structure. The solution proposed by the author doesn't allow random access; that is, you can't access a specific A[i] unless you unpack all of them. There is some research (e.g., [1]) allowing random access within reasonable time with slightly more storage, although it is almost entirely of theoretical interest.
[1]: Mihai Patrascu, Succincter, FOCS 2008 best student paper.
For inference, trits/sec decoded into L1 cache is a lot more important than random access, and is going to be hardware-specific (e.g. what SIMD instructions are available). 8-bit seems an odd choice given most available instructions, but the Succincter paper's use of mod and individual decoding is (unless I am misreading it) much slower.
If I'm looking for the 3rd position in decimal, I ignore everything above and below the third position. You can do that by taking modulo 1000 and dividing by 100, or by dividing by 100 and taking modulo 10, whichever is easier to calculate in the moment. You don't need any fixed-point math, just normal integer (floor) division.
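In code (a trivial check, mine), both orderings pick out the same digit:

```python
n = 54321
# hundreds digit of n: mod-then-divide and divide-then-mod agree
assert (n % 1000) // 100 == (n // 100) % 10 == 3
```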
It does allow random access. It works like a compressed texture format. Each block of 5 trits compresses to one byte, so you can jump to exactly the relevant byte and unpack only it.
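A minimal sketch of that byte-block layout (assuming, as described, 5 trits packed per byte; the names are mine):

```python
POW3 = (1, 3, 9, 27, 81)

def pack5(trits) -> int:
    """Pack 5 trits (each 0..2) into one byte, a value in [0, 243)."""
    return sum(t * p for t, p in zip(trits, POW3))

def get_trit(packed: bytes, i: int) -> int:
    """Random access: only the single byte containing trit i is touched."""
    return (packed[i // 5] // POW3[i % 5]) % 3

trits = [0, 1, 2, 1, 0, 2, 2, 0, 1, 1]
packed = bytes(pack5(trits[j:j + 5]) for j in range(0, len(trits), 5))
assert [get_trit(packed, i) for i in range(10)] == trits
```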
If you like this sort of thing (and don't care about performance) check out Arithmetic Coding for the most generalized fractional symbol packing algorithm.
A lifetime ago, I delved into the ternary rabbit hole[1], developing a bunch of tools and math[2][3] for the adjacent problem of ad-hoc (balanced) trinary bit twiddling on numbers encoded as binary, while retaining the ability to do fast arithmetic on the binary representation.
It's a fun rabbit hole. I'm glad it's (seemingly) finding some practical use cases.
I read the article title, glossed over the article, and thought it would be a fun problem to solve.
The scheme I came up with assumes a representation of ternary numbers as bit pairs 00, 01, and 10 in a byte. Since there are 4 bit pairs in a byte, we can represent 4 ternary digits t0, t1, t2, and t3 no issue.
Since 11 doesn't map to a valid ternary digit, I had an idea that the presence and position of 11 bit pairs could change how the other bit pairs are interpreted - so that's where t4 comes from. If there are no 11 bit pairs, t4 is 0.
Working through the combinations, I just needed 3 cases: 0, 1, or 2 11-bit pairs in the byte. The byte values with 3 11-bit pairs didn't have to be used.
The value with 4 11-bit pairs, all ones, I decided could be used to signal end of stream.
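A quick combinatorial check of this scheme (my own sketch): counting bytes by how many of their four bit pairs equal 11 confirms that the 0-, 1-, and 2-pair cases together give exactly 3^5 = 243 usable values.

```python
from math import comb

# Bytes grouped by how many of their four 2-bit pairs equal 0b11;
# the remaining pairs each take one of the 3 valid trit encodings.
counts = [comb(4, k) * 3 ** (4 - k) for k in range(5)]

assert counts == [81, 108, 54, 12, 1]
assert sum(counts) == 256          # all byte values accounted for
assert sum(counts[:3]) == 243      # exactly 3^5: enough for five trits
```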
> There are 3 possible values in a digit of a ternary number. 3 possible values, which could actually be anything.
I was lost on the first sentence. The article should start with a definition of a ternary number. And without added context, "3 possible values, which could actually be anything" doesn't make any sense.
Do -1, 0 and 1 all occur with the same frequency in typical large language models or would there be a benefit in encoding 0 with 0 and -1 and 1 with respectively 10 and 11? (or something even more complex)
You're looking at the alphabet utilization rate. Whereas the standard way to measure efficiency (which is the way that the author did it) is to look at entropy.
Here's how I would phrase it. Obviously, if you have a message that has 256 possible states and each one has the same probability of occurring, then the entropy is 8 bits per message.
If you have a message having 243 uniformly distributed states, then its entropy is defined as log base-2 of 243 = log_2 (243) = ln(243) / ln(2) ≈ 7.925 bits. But if you used 8 bits to express this message, then you've wasted 0.075 bits (or 75 millibits, mb) per message, so your efficiency is 7.925/8 ≈ 99.06%.
How would you achieve 7.925 bits per message in practice? By packing more messages together and sharing the wasted space. Using some continued-fraction magic, let me propose storing 15867 of these base-243 messages together. There are 243^15867 possibilities, which is approximately equal to 2^125742.999994713, or just a hair under 125743 bits. So a block of 125743 bits can store 15867 base-243 messages losslessly, which means each base-243 message uses 125743/15867 ≈ 7.925 bits on average.
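In code (a sketch; the tolerances are mine, the numbers are from the comment above):

```python
from math import ceil, log2

entropy = log2(243)                     # = 5 * log2(3) ≈ 7.9248 bits/message
assert abs(entropy - 7.9248) < 1e-4
assert abs(entropy / 8 - 0.9906) < 1e-4  # efficiency when stored in one byte

# Packing many messages together shares the wasted fraction of a bit:
k = 15867
bits = ceil(k * entropy)                # 125743 bits for 15867 messages
assert bits == 125743
assert abs(bits / k - entropy) < 1e-4   # ≈ 7.925 bits per message
```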
The information theory comment is correct but I’m going to elaborate a little bit with some potentially useful info.
For a binary number with N bits, you can represent 2^N values. Easy example: There are 2^8 = 256 possible values that can be represented by an 8-bit value. You can go the other way, too, and ask “how many bits do I need to represent a given value?” by using some math.
2^N = 256
Log2(2^N) = Log2(256)
N = Log2(256) = Log(256)/Log(2) = 8
And you can use this to figure out how many bits you need to represent an arbitrary number: N = Log2(1000) = Log(1000)/Log(2) = 9.966. That makes intuitive sense because a 10-bit number has 1024 values.
To get to the theoretical limit you’re asking about we do the same thing. For an arbitrary number x, how many trits do we need to represent it in trinary and how many bits N do we need to represent it in binary? What is the ratio of trits to bits?
x = 3^M and x = 2^N
Log3(x) = M
Log(x) / Log(3) = M
Log2(x) = N
Log(x) / Log(2) = N
And then we can just take that ratio N/M to figure out how many bits N we need to represent M trits:
R = N/M = (Log(x)/Log(2)) × (Log(3)/Log(x)) = Log(3)/Log(2)
To be able to reversibly encode an N-trit ternary number into an M-bit binary number, the number of possible N-trit numbers must be less than or equal to the number of possible M-bit binary numbers. Otherwise there would be two input ternary numbers that would map to the same binary number (pigeonhole principle). Which means that 3^N <= 2^M.
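A small sketch of that bound (the function name is mine): the smallest M is the ceiling of N·log2(3), with an exact integer check to guard against floating-point edge cases.

```python
from math import ceil, log2

def bits_needed(n_trits: int) -> int:
    """Smallest M such that 2**M >= 3**n_trits (the pigeonhole bound)."""
    m = ceil(n_trits * log2(3))
    while 2 ** m < 3 ** n_trits:  # exact integer guard
        m += 1
    return m

assert bits_needed(5) == 8     # 3^5 = 243 <= 2^8 = 256 (the article's case)
assert bits_needed(18) == 29   # a Setun word needs 29 bits, not 28
```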
You should read it as log₂(3), because logₓ(y) = logₙ(y)/logₙ(x), and it represents the minimal number of bits (yes, it's not an integer) required to represent 3 states.
In general, in order to a store a thing with N values you need log_2(N) bits of information. This is actually really simple to see if N is a power of two. If you have four things, how many yes/no questions do you need to exactly determine one of the four things? The answer is two (divide the four into two groups of two, ask which group the object is in. Then, when you have that group, split into two groups of one, and ask the next question)[1].
So, for powers of two, it's obviously log_2(N).
Now we just extrapolate. For three things, on average, you need to ask log_2(3) questions. Sometimes you ask just one, sometimes you must ask two (depending on where you split the group). Either way, the formula is the same.
log(3)/log(2) is just a way of writing log_2(3) (since the value of log3/log2 is the same regardless of which base the log is in).
Note that two isn't special. It's only because we've limited ourselves to 'how many yes/no/questions'. If you wanted to know how many multiple choice questions of 3 responses would you need to exactly determine which object of a set of 4, the answer is log_3(4).
[1] You can think of any byte, word, double word, quad word as simply a path in a binary tree of all numbers in the range 0..2^N-1 (where N is bit width). The path uniquely identifies the number.
This is the equation for converting radix. You can look up optimal radix choice for more, I think?
For fun, consider how much space a base 10 number takes in binary. It takes about 4 bits with slack. Specifically, about 3.32 bits. Which is log(10)/log(2).
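As a quick sanity check (a sketch in Python, mine):

```python
from math import log2

# One decimal digit carries log2(10) ≈ 3.32 bits of information.
assert abs(log2(10) - 3.3219) < 1e-3

# Packing three decimal digits together needs only 10 bits, since
# 10^3 = 1000 <= 2^10 = 1024: about 3.33 bits per digit instead of
# the 4 bits each that plain BCD spends.
assert 10 ** 3 <= 2 ** 10
```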
Another way to write that number is as the log base 2 of 3. The log base 2 of the number of possible states gives you the number of bits of information in a value chosen from that many possible states; here, a trit has 3 possible states.