I’m a huge fan of UMAP, but this [0] paper suggests that t-SNE can be tuned to produce UMAP-like results (the algorithms are extremely similar—you can recover t-SNE with certain UMAP parameter choices). One of the insights is to use PCA first to better preserve the global structure.
For example, see figure 9 in the paper: the plot on the left is the typical result of default t-SNE (distance between global structures not well-represented, since everything is jammed together), and the plot on the right is very UMAPish.
Basically, there are a lot of preprocessing and parameter choices involved in producing these embedding plots, so it’s advisable to try to understand the effects of these choices regardless of which algorithm you choose.
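For instance, one of the choices the paper highlights (PCA initialization) is a one-line option in scikit-learn; a minimal sketch, with illustrative parameter values and a digits subsample just to keep it quick:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]  # 500 samples, 64 features

# init="pca" anchors each point near its coarse global position, so
# between-cluster distances in the final plot are more meaningful than
# with random initialization.
emb = TSNE(n_components=2, init="pca", perplexity=30,
           random_state=42).fit_transform(X)
```

Perplexity is another of those choices worth sweeping, since it trades off local versus global structure.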
I thought UMAP's main advantage was being able to project new data without recomputing the embedding, whereas t-SNE still requires a full refit, which makes persistent plots difficult.
If you have a machine learning model and you want to see which things it considers similar, you can use t-SNE to visualize that by rendering similar points close together in two or three dimensions. UMAP is another method used for similar purposes.
It's an algorithm for projecting data to a lower dimension. Say you have an Excel sheet with 20,000 rows (representing customers, for example) and 200 columns (representing blood pressure, height, weight, etc.).

What you want to do is "visualise" those 20,000 points in 2D or 3D so you can get an idea of how the data is distributed. So you use t-SNE to "compress" those 200 columns down to 2 or 3, and you display that.

Traditionally you would use Principal Component Analysis, but that is a purely linear projection, and it won't capture non-linear relationships in the data.
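To make the contrast concrete, here's a sketch in scikit-learn using a synthetic curved manifold as a stand-in for that spreadsheet (the dataset and parameters are just for illustration): PCA flattens it linearly, while t-SNE can unfold it.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 1000 points lying on a curved ("S"-shaped) 3-D manifold.
X, color = make_s_curve(n_samples=1000, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)    # linear projection
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)  # nonlinear embedding
```

Scatter-plotting both embeddings colored by `color` shows the difference at a glance.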
Another algorithm, sometimes more powerful and scalable, is LargeVis.
UMAP has largely replaced t-SNE in our toolkit as one of our go-to viz pipelines. Unlike most examples out there, we post-process with k-NN to expose the graph of correlations over arbitrary data sets -- bank account fraud scores, cancer protein mutations, Twitter bots, malware files, etc. -- and then investigate. Algorithms like UMAP infer this connectivity anyway (see also: TDA), and it's useful for guiding subsequent explorations. If you're doing an interactive analysis, like looking at data in a Jupyter notebook, it's super powerful to expose that inferred connectivity and make it interactive (on-the-fly filtering, clustering, etc.).
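The k-NN post-processing step can be sketched with scikit-learn (random points stand in for a real UMAP embedding; `n_neighbors=5` is arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 2))  # stand-in for a 2-D UMAP embedding

# Sparse k-NN adjacency over the embedded points: each row has k nonzero
# entries marking that point's nearest neighbors, ready to hand to a
# graph visualization tool.
A = kneighbors_graph(emb, n_neighbors=5, mode="connectivity")
```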
Tool-wise, we do it in a few lines over tables with many rows/columns via end-to-end GPU acceleration using https://www.RAPIDS.ai (GPU dataframes + UMAP) + Graphistry (GPU viz, which we make).
Do you mean principal component analysis? My naive understanding after reading the original paper is that the algorithm trains a transformation that projects the high-dimensional data into low dimensions while best preserving both global and local proximity, so that samples that are similar in the high-dimensional space are also close in the low-dimensional one. It makes some assumptions about the distribution of the data in low dimensions, so it isn't a random guess: it uses the t-distribution in the low-dimensional space, hence the name t-SNE. Correct me if I've made any mistakes.
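Right, the low-dimensional similarities use a Student-t (Cauchy) kernel, which is where the "t" comes from. A minimal numpy sketch of those affinities (the function name is mine, and this covers only the low-dim side, not the full algorithm):

```python
import numpy as np

def lowdim_affinities(Y):
    # Student-t kernel with one degree of freedom: q_ij proportional to
    # (1 + ||y_i - y_j||^2)^-1, normalized over all pairs. The heavy
    # tails let dissimilar points sit far apart without a huge penalty.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d2)
    np.fill_diagonal(w, 0.0)  # q_ii is defined to be zero
    return w / w.sum()

Y = np.random.default_rng(0).normal(size=(5, 2))
Q = lowdim_affinities(Y)
```

The optimizer then moves the low-dimensional points to make this Q match the high-dimensional affinities under a KL divergence.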
I've built an implementation of t-SNE in Go (https://github.com/danaugrs/go-tsne) and really like the fact that your visualization has a short Z dimension. Very interesting effect.
https://github.com/lmcinnes/umap
It's much faster and usually results in better clustering / representation.