TDA and Machine Learning
Machine learning is very good at finding patterns in data, but it is largely agnostic to shape. Two datasets can have the same statistical summary but wildly different geometry. Topological data analysis (TDA) is a collection of tools that make the shape of data precise and computable, which makes it a natural complement to standard ML pipelines.
Shape is easy to describe intuitively but difficult to quantify precisely. We first learn about shape in geometric terms: circle, triangle, square, and so on. By the time we reach college, these seem like trivial concepts compared to the derivatives and triple integrals of calculus. Yet understanding shape can have a great impact on how we interpret data, and TDA is an attempt to quantify the shape of data precisely.
What TDA offers
TDA traditionally splits into two main applied tools, with a great deal of theory between them. The primary tool we’ll discuss here is persistent homology¹. Given a dataset, like a point cloud, an image, or a time series, persistent homology tracks how topological features appear and disappear as a scale parameter varies. The topological features we find are things like connected components (think clusters), loops, voids, and higher-dimensional analogues. The result is a succinct summary called a persistence diagram: a multiset of points in the plane, where each point represents a feature, with its birth scale on one axis and its death scale on the other.
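To make the birth/death idea concrete, here is a from-scratch sketch of degree-0 persistence (connected components) under the Vietoris-Rips filtration. Every point is born as its own component at scale 0, and a component dies when it merges into another, which is exactly Kruskal’s minimum-spanning-tree algorithm in disguise. The function name `h0_persistence` is my own; real libraries like ripser compute higher degrees and are vastly more efficient.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Degree-0 persistence of a point cloud under the Vietoris-Rips
    filtration. Every component is born at scale 0; a component dies
    at the scale where it merges into another. The death scales are
    exactly the edge lengths of the minimum spanning tree (Kruskal)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise distances in increasing order.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at scale d
    # (birth, death) pairs; the one surviving component never dies.
    return [(0.0, d) for d in deaths]
```

Run on two well-separated clusters, the diagram shows a few short-lived features (merges within each cluster) and one long-lived feature whose death scale is the gap between the clusters, which is exactly the "components as clusters" intuition above.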
There are a ton of really great introductions to TDA on the internet, and I am not attempting to do the subject justice here. For anyone interested, awesome-tda is a huge curated list of resources.
Bridging TDA and ML
The catch is that persistence diagrams don’t sit in a vector space, so you can’t feed them directly into most ML models. Practically, this means you have to be very careful trying to define something like an “average” of persistence diagrams: it doesn’t behave as you’d expect, and often you simply can’t define one. The same goes for many other statistical quantities we care about. Several representations, or vectorization schemes, have been developed to close this gap:
- Persistence images bin the diagram into a 2D histogram, producing a fixed-size vector suitable for any standard classifier or neural network.
- Persistence landscapes represent diagrams as a sequence of piecewise linear functions, which live in a Hilbert space and support averaging and kernels.
- Topological layers (e.g., in giotto-tda or pytorch-topological) compute persistent homology differentiably, letting you incorporate topological features end-to-end inside a neural network.
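To make the landscape idea concrete, here is a minimal sketch (my own illustrative code, not the optimized implementations found in TDA libraries). Each diagram point contributes a “tent” function peaking halfway between its birth and death; the k-th landscape at a scale t is the k-th largest tent value there, and since landscapes are just functions on a grid, they can be averaged pointwise.

```python
import numpy as np

def persistence_landscape(diagram, k_max=2, grid=None):
    """Minimal persistence landscape (illustrative sketch). Each
    diagram point (b, d) contributes a tent function
    max(0, min(t - b, d - t)); the k-th landscape at t is the k-th
    largest tent value. Returns an array of shape (k_max, len(grid))."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    tents = np.array(
        [np.maximum(0.0, np.minimum(grid - b, d - grid)) for b, d in diagram]
    )
    tents = -np.sort(-tents, axis=0)  # sort descending at each grid point
    layers = [
        tents[k] if k < len(diagram) else np.zeros_like(grid)
        for k in range(k_max)
    ]
    return np.stack(layers)
```

Because the output lives in a fixed-size function space, averaging a collection of landscapes is just `np.mean` over the stack, which is exactly the statistical operation that fails on raw diagrams.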
Once you have a vector representation of your diagram, you can continue down a standard ML pipeline. For example, you can compute persistence diagrams, convert them to persistence images, and then run a clustering algorithm on the collection of images. The clustering is then based on the topological information encoded in the diagrams and represented by the images.
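A sketch of that diagrams-to-vectors step, in the style of a persistence image (a simplified stand-in for the persim implementation; the function name and parameters are my own): map each point to (birth, persistence), weight it by its persistence, spread it with a Gaussian over a fixed grid, and flatten.

```python
import numpy as np

def persistence_image(diagram, resolution=8, sigma=0.1, extent=(0, 1, 0, 1)):
    """Persistence-image-style vectorization (illustrative sketch).
    Each (birth, death) point becomes a persistence-weighted Gaussian
    bump at (birth, death - birth) on a fixed grid; flattening the grid
    gives a fixed-size vector suitable for any standard ML model."""
    bmin, bmax, pmin, pmax = extent
    xs = np.linspace(bmin, bmax, resolution)   # birth axis
    ys = np.linspace(pmin, pmax, resolution)   # persistence axis
    img = np.zeros((resolution, resolution))
    for birth, death in diagram:
        pers = death - birth
        gx = np.exp(-((xs - birth) ** 2) / (2 * sigma ** 2))
        gy = np.exp(-((ys - pers) ** 2) / (2 * sigma ** 2))
        img += pers * np.outer(gy, gx)  # weight the bump by persistence
    return img.ravel()  # fixed-size vector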
Where it helps in practice
TDA tends to add the most value in situations where local geometry matters:
- Anomaly detection — topological features can flag unusual connectivity structure that density-based methods miss.
- Graph and mesh data — persistent homology naturally handles irregular domains without requiring a fixed grid.
- Time series — sliding-window embeddings turn a signal into a point cloud, and the resulting persistence diagram captures periodic and chaotic structure.
- High-dimensional feature spaces — persistent homology can detect low-dimensional manifold structure even when ambient dimension is large.
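The time-series case above relies on one small but clever construction: the sliding-window (delay) embedding, which turns a 1-D signal into a point cloud. A periodic signal traces out a loop in the embedded space, which degree-1 persistent homology can then detect. Below is a minimal version (giotto-tda ships a proper transformer for this; the function name here is my own).

```python
import numpy as np

def sliding_window(signal, dim=3, tau=1):
    """Sliding-window (delay) embedding: each point of the output cloud
    is (x[t], x[t + tau], ..., x[t + (dim - 1) * tau]). Returns an array
    of shape (len(signal) - (dim - 1) * tau, dim)."""
    signal = np.asarray(signal)
    n = len(signal) - (dim - 1) * tau
    return np.stack(
        [signal[i : i + n] for i in range(0, dim * tau, tau)], axis=1
    )
```

For a sine wave, choosing the delay to be a quarter period pairs each sample with its (co)sine counterpart, so the embedded cloud lies on a circle, precisely the loop that shows up as a prominent degree-1 feature in the persistence diagram.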
These are just a handful of examples, and I hope to dive deeper into some of them in future posts.
Getting started
The practical entry point is scikit-tda, which wraps several TDA libraries under a scikit-learn–compatible API. To be fair, I am currently a maintainer of scikit-tda, so I am definitely biased. giotto-tda is also a go-to TDA library, with lots of slick improvements and parallelization built in. Both libraries provide Pipeline-compatible transformers that slot directly into standard ML workflows. For differentiable TDA inside PyTorch, pytorch-topological and gudhi’s torch bindings are worth exploring.
The best way to get started is to pick a dataset of interest and start computing persistence with one of these libraries. There are various flavours of persistence, but newcomers should start with Vietoris-Rips persistence: it is generic enough to handle most data you throw at it, and it is interpretable in low homological degrees.
I don’t think that TDA is a replacement for standard ML, nor was it ever intended to be. I view TDA as describing another source of features and geometric intuition that standard methods lack. Used together, they cover more of the structure in your data than either approach alone.
¹ The other is the Mapper algorithm, which is also widely used in data science.