> Weiss figured out how to represent a complex image as a single hyperdimensional vector that contains information about all the objects in the image, including their properties, such as colors, positions and sizes.
How is this different from an N-dimensional embedding produced by, say, a convolutional neural network?
> The team then trained a neural network to examine an image and generate a bipolar hypervector — an element can be +1 or −1 — that’s as close as possible to some superposition of hypervectors in the dictionary...
It seems the former uses binary embeddings, which in ML are used more frequently for similarity search. They also seem to care about orthogonality. Beyond that, we'd have to read the papers to know how the embeddings are derived. If you're motivated, they are here: http://www.rctn.org/bruno/papers/
This is a nice bridge between embeddings common in AI and the symbolic reasoning that we have a lot of theoretical understanding of. However, I'm not sure what the discovery is. Multiplication, addition, and recovering the bases of vectors are high-school / college-freshman-level linear algebra concepts. Can anyone shed more light on the 'discovery'? It seems like an interesting idea, but I'm surprised that Quanta (who typically has very high quality articles) is making a big deal about vector operations that are well understood.
In particular, the use of addition to create a superposition of embeddings is many decades old (it's the basis of the bag of words approach). Multiplication as the 'description' operator is perhaps interesting. Using change of basis to tease out the coefficients for linear combinations of vectors is... basic linalg.
All in all, it seems the main 'discovery' here is a way to generate richer training vectors than the typical 'one-hot' approach that we use today. Although, given that one-hot vectors are also orthogonal, it's not obvious to me why a random orthogonal vector is better than this approach. Perhaps they're going for dimensionality reduction, but if you have ten classes, then you need ten vectors and you need a 10-D output vector. You can't get 10 orthogonal vectors in 9-dimensional space. Given that there is no perf difference in computing with orthogonal one-hot vectors and orthogonal random vectors, I'm again puzzled by the 'discovery'. In fact, one hot vectors are slightly cheaper to compute with since they have zeros.
This is about sparse representation. Jeff Hawkins et al. at Numenta have been talking about this, and Pentti Kanerva's algebra is actually quite neat. There are videos about it on Youtube if you search on his name.
The name Kanerva rings a bell: the 1st edition of the RL book mentions "Kanerva coding" as an option for function approximation. What you describe sounds interesting, maybe I should look a bit deeper into his work.
My understanding is limited, but I'm guessing it's to do with the algebraic properties of the embedding vectors. I'm familiar with embeddings that you can add and subtract, which may reveal concepts existing as linear directions within the embedding space.
Here, they're talking about multiplying, dividing and permuting vectors. Multiplying combines concepts, adding creates a superposition.
They also mention randomly selecting embeddings for the concepts in mind. So my guess is that instead of one-hot encoding the classifier, they instead use random encodings on the output, and are working to give those encodings desirable properties.
I would also hazard a guess that the random vectors they choose are close to zero in most components.
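To make the "permuting" part concrete, here is a minimal sketch of one common VSA trick: marking position in a sequence by cyclically shifting each item's vector before summing. This is my own illustration, not code from the article or the papers; the dimension `D`, the letter codebook, and the helper names are all assumptions.

```python
import random

D = 10_000                      # dimension (arbitrary choice for illustration)
rng = random.Random(1)

def random_hv():
    """Sample a random bipolar hypervector from {-1, +1}^D."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def permute(v, k):
    """Cyclically shift v to the right by k positions (negative k shifts left)."""
    k %= len(v)
    return v[-k:] + v[:-k]

def sim(a, b):
    """Dot product scaled by D, so identical bipolar vectors score 1."""
    return sum(x * y for x, y in zip(a, b)) / D

# Hypothetical codebook: one random vector per letter ("d" is a distractor).
letters = {ch: random_hv() for ch in "abcd"}

# Encode "cab": shift the i-th letter by i positions, then superpose by addition.
seq = "cab"
encoded = [0] * D
for i, ch in enumerate(seq):
    shifted = permute(letters[ch], i)
    encoded = [e + s for e, s in zip(encoded, shifted)]

def decode(position):
    """Undo the shift for this position and return the best-matching letter."""
    unshifted = permute(encoded, -position)
    return max(letters, key=lambda ch: sim(unshifted, letters[ch]))

decoded = "".join(decode(i) for i in range(len(seq)))
print(decoded)
```

The decode step works because a shifted copy of a random vector is itself nearly orthogonal to the original, so the letters at other positions only contribute small noise.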
This is my understanding as well. There's a somewhat accessible introductory video I found useful [0].
The algebra makes it possible to encode sets, key/value associations and sequences to build a knowledge base, and the dot product provides a similarity measure for querying the base.
IIUC the key is that for large space dimensions, any two random vectors (say with uniform distribution over {-1, +1}^d) are almost guaranteed to be near-orthogonal.
This makes it easy to add new items to the base (by sampling a new random vector and updating the base using algebraic operations with other items), yet the amount of noise introduced by near-orthogonality remains controlled and can be filtered out to keep the algebraic structure working as the base grows.
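A rough sketch of what that looks like, assuming bipolar vectors with elementwise multiplication as the binding operator and addition as the superposition (one of several VSA variants; the names `color`, `blue`, etc. and the dimension are made up for illustration):

```python
import random

D = 10_000                      # dimension (arbitrary choice for illustration)
rng = random.Random(0)

def random_hv():
    """Sample a random bipolar hypervector from {-1, +1}^D."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(a, b):
    """Bind two vectors by elementwise multiplication (its own inverse here)."""
    return [x * y for x, y in zip(a, b)]

def bundle(*vs):
    """Superpose vectors by elementwise addition."""
    return [sum(xs) for xs in zip(*vs)]

def sim(a, b):
    """Dot product scaled by D, so identical bipolar vectors score 1."""
    return sum(x * y for x, y in zip(a, b)) / D

names = ["color", "shape", "blue", "red", "square", "circle"]
hv = {n: random_hv() for n in names}

# Independently sampled vectors are nearly orthogonal: worst pairwise similarity.
cross = max(abs(sim(hv[a], hv[b])) for a in names for b in names if a < b)

# Key/value record for a "blue square": color*blue + shape*square.
record = bundle(bind(hv["color"], hv["blue"]), bind(hv["shape"], hv["square"]))

# Query "what is the color?": multiplying by the key cancels it (x * x == 1),
# leaving the value plus noise that is nearly orthogonal to every known item.
probe = bind(record, hv["color"])
scores = {n: sim(probe, hv[n]) for n in ("blue", "red", "square", "circle")}
answer = max(scores, key=scores.get)
print(answer, round(cross, 3))
```

The score for the stored value should come out close to 1 and the others close to 0; the residual noise shrinks roughly like 1/sqrt(D), which is the sense in which it "remains controlled".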
Honestly it seems a bit too good to be true; I'd be very interested to see what the tradeoffs are in practice.
> […] our proposed neuro-vector-symbolic architecture (NVSA) [implements] powerful operators on high-dimensional distributed representations that serve as a common language between neural networks and symbolic AI. The efficacy of NVSA is demonstrated by solving the Raven's progressive matrices datasets. Compared to state-of-the-art deep neural network and neuro-symbolic approaches, end-to-end training of NVSA achieves a new record of 87.7% average accuracy in RAVEN, and 88.1% in I-RAVEN datasets. Moreover, compared to the symbolic reasoning within the neuro-symbolic approaches, the probabilistic reasoning of NVSA with less expensive operations on the distributed representations is two orders of magnitude faster.
Candidly, what the hell am I reading? The fundamental thing a network operates on is a (often massive) multidimensional tensor. GPT3+ has per-token dimensionality in the thousands. Not only do we already operate on incredibly high dimensional objects, it’s the only thing modern neural networks do.
I had the same reaction. The whole time I was expecting a paragraph to explain why this is fundamentally different from binary classification techniques like SVM, which explicitly uses a hyperplane in high-dimensional space to divide the classes.
Hah, same here. How is [colour, shape] or [blue, red, square, circle] not a vector? How does a NN work if not by operating on vectors/matrices/tensors?
This Quanta Magazine article is absolute horseshit. Might as well just remove all that text and link to the author's paper. But hey, it makes for a nice article with a progress bar that animates as you scroll up and down.
They are not but there’s a lot of lifting being done by assuming a “hypervector” (an embedding, to literally everyone else) for a concept. Does such a separable embedding exist and do you have a good way to find it? Seems like one of those things where, oops, it’s just worse than gradient descent on massive functions.
“If you want your ANN to also discern the shape’s color — blue or red — you’ll need four output neurons: one each for blue circle, blue square, red circle and red square.“
Uhm, no.
“An algorithm analyzes the features of each image using some predetermined scheme. It then creates a hypervector for each image.”
Some “predetermined scheme”, right. Let me guess, manually predetermined?
> Instead, Olshausen and others argue that information in the brain is represented by the activity of numerous neurons. So the perception of a purple Volkswagen is not encoded as a single neuron’s actions, but as those of thousands of neurons. The same set of neurons, firing differently, could represent an entirely different concept (a pink Cadillac, perhaps).
Unless they’re talking about embeddings, this is how 99% of people think about representation.
And even if they are talking about embeddings, embeddings are, quite standardly, higher-dimensional floating-point vectors. Even the original Transformer paper used IIRC 512-dim vectors.
This article seems like yet another signal for the death of quantamagazine.
I don't get it. How are those hypervectors different from matrices in neural networks? OK, not having just 0 and 1 for each separate shape, and encoding each set of mutually exclusive attributes with just one parameter, would probably make it more compact and faster, but it also sounds very limiting, and makes it closer to Prolog than to neural networks. We humans can deal with "roundish squares".
For those confused about how the hyperdimensional computing (HDC) approach (also known as "vector symbolic architectures"/VSA or "holographic reduced representations"/HRR) described in the article differs from the use of vector embeddings in more mainstream artificial neural networks:
1. HDC is largely organized around the observation that certain symbolic-like operations become vastly simplified in high-dimensional spaces and can be performed with simple algebraic manipulations.
For instance, if you have a dictionary of words, each represented by a random (but fixed) high-dimensional vector, you can store subsets of these words just by summing their vector representations together. This works because random high-dim vectors are nearly orthogonal with very high probability. This implies that the sum of several of such vectors will have essentially a zero dot product with words (ie their embeddings) not included in the subset and a much larger dot product (~1 if the vectors are all normalized) with words included in the subset (as long as the number of words in the subset is sufficiently smaller than the total dictionary size). Hence the sum encodes the subset since subset membership can be checked with a dot product.
Notably, this also works when the dictionary size is exponentially large relative to the vector dimension, since it is possible to sample an exponentially large number of near-orthogonal vectors in a high-dimensional vector space (unlike in a low-dimensional vector space). This is also very mathematically similar to how a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) works, which is a probabilistic data structure commonly used for set membership queries when the set is extremely large (e.g. all possible URLs). (Note also, though, that this is at the expense of being able to decode/recover the set elements directly from the summed representation.)
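As a rough illustration of the set-membership idea, here is a small sketch in plain Python. The vocabulary, dimension, subset size, and threshold are all arbitrary assumptions chosen so the example runs quickly; bipolar vectors are used so that dividing the dot product by `D` plays the role of normalization.

```python
import random

D = 2_000                       # dimension (arbitrary choice for illustration)
rng = random.Random(2)

def random_hv():
    """Sample a random (but then fixed) bipolar vector from {-1, +1}^D."""
    return [rng.choice((-1, 1)) for _ in range(D)]

# Hypothetical dictionary: each word gets its own random vector.
vocab = [f"word{i}" for i in range(300)]
codebook = {w: random_hv() for w in vocab}

# Store a subset as a single vector: the elementwise sum of its members.
subset = set(rng.sample(vocab, 8))
memory = [sum(xs) for xs in zip(*(codebook[w] for w in subset))]

def score(word):
    """Normalized dot product: about 1 for members, about 0 for non-members."""
    return sum(m * x for m, x in zip(memory, codebook[word])) / D

THRESHOLD = 0.5                 # halfway between the two expected scores
recovered = {w for w in vocab if score(w) > THRESHOLD}
print(recovered == subset)
```

Note that listing the members still means testing every candidate word against the sum, which is the Bloom-filter-like limitation mentioned above: the sum answers "is X in the set?" cheaply, but does not hand back the elements by itself.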
HDC/VSAs capitalize on other nice properties of random high-dimensional vectors as well, e.g. via the ability to "bind" two words together using a circular convolution. See [Kanerva 2009](http://rctn.org/vs265/kanerva09-hyperdimensional.pdf) for a nice review of various examples like this. There are also various similar ways to represent sequences of words, tree structures, graphs, etc, all within a fixed- but high-dim vector space, so they provide a nice way to represent "syntactic" structure in data objects, if you will.
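For the curious, binding by circular convolution can be sketched in a few lines of plain Python. This follows the usual holographic-reduced-representation recipe as I understand it (Gaussian vectors with variance 1/D, unbinding with the "involution" as an approximate inverse); the words and the dimension are made up, and the naive O(D^2) convolution is for readability, where an FFT would normally be used.

```python
import math
import random

D = 1024                        # kept small: the naive convolution is O(D^2)
rng = random.Random(3)

def random_hv():
    """Sample i.i.d. N(0, 1/D) entries, so the vector's norm is about 1."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]

def cconv(a, b):
    """Circular convolution: c[k] = sum_i a[i] * b[(k - i) mod D]."""
    # Python's negative indexing wraps around, which gives the mod-D behaviour.
    return [sum(a[i] * b[k - i] for i in range(D)) for k in range(D)]

def involution(a):
    """Approximate inverse for unbinding: a_inv[i] = a[(-i) mod D]."""
    return [a[0]] + a[:0:-1]

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical item memory of known words.
words = {w: random_hv() for w in ("capital", "paris", "rome", "currency")}

# Bind "capital" with "paris"; the result resembles neither input.
bound = cconv(words["capital"], words["paris"])

# Unbind with the involution of the key, then clean up against the item memory.
noisy = cconv(involution(words["capital"]), bound)
best = max(words, key=lambda w: cos(noisy, words[w]))
print(best)
```

The unbound vector is only a noisy copy of the original, so the final nearest-neighbour lookup against the known words (the "clean-up" step) is what actually recovers the answer.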
2. None of the above emerges as a property of network training. In fact, a network isn't even necessarily required, which could be a feature or a bug, depending on your perspective.
As a feature, this is useful in the sense that it provides an immediate no-training-required way of representing objects with fairly complex relational structure (e.g. a sequence of words or a tree-structure) as a fixed- but high-dimensional vector. In a traditional neural network (although this may be less true for very modern LLMs) trained to do something with, say, sentences of a max of 20 words, the network would likely fail to "figure out how to represent" sentences composed of 100 words. With the HDC/VSA approach, however, you get a representation of such a sentence right off the bat, and don't have to worry about it being inside/outside your training dataset. The "utility" of such a representation is not necessarily obvious, and will depend on exactly how it is created, but IMO it is nice to know that there is a systematic way of constructing one that does not interfere with others (with very high probability).
On the other hand, one of the main drawbacks of this approach (at least so far) is that it has generally not been obvious how to make these systems learn robustly, so that e.g. vector representations could change through learning to capture more semantic relationships between words and objects. Nonetheless, given the above one can imagine how the HDC/VSA approach may provide something akin to a useful inductive bias or initialization for representation learning in more trainable systems.
It will certainly be interesting to see if and how these might get incorporated into modern AI systems in the coming years.
I may not have understood, but they make it sound as if, given the right embeddings, you can just use the "hyperdimensional permutation" operation to calculate a text completion. That would be cool.