About the talk
Clustering is a fundamental operation in many data science workflows. This talk will approach clustering from the mathematical point of view.
First we will review and compare different clustering methods (e.g. k-means and spectral clustering), with an emphasis on deciding when methods succeed or fail based on understanding their mathematical properties. We will analyze several examples.
This leads naturally to a more theoretical discussion about clustering algorithms. To this end, we will discuss Kleinberg's impossibility theorem, namely his axioms and a sketch of the proof. After explaining some ideas from category theory, we will examine the functorial approach to clustering developed by Carlsson-Memoli. We will compare the functorial approach to Kleinberg's axioms, and the role of density in the functorial approach will emerge.
One of the insights of Carlsson-Memoli is that clustering is the statistical counterpart to taking the connected components of a topological space. We conclude by discussing generalizations of clustering (persistent homology) suggested by this motto.
This talk will be mostly expository. The main hope is that practitioners will be able to better apply and reason about clustering algorithms.
Joe Ross holds a PhD in mathematics from Columbia University and was a researcher and instructor in pure mathematics, most recently at the University of Southern California. He has given more than 30 talks about his research at conferences and universities throughout the world, and has 10 publications in peer-reviewed mathematics/statistics journals. Joe has worked as a data scientist at machine learning/analytics startups for six years; in his current role at Splunk/SignalFx, he focuses on a variety of time series (anomaly detection, forecasting, correlation) and sampling problems that arise in monitoring.
All right. So, I'm following the chat, so if there are any technical difficulties I'll be able to see them there. It's a pleasure to speak here; the subject of the talk is mathematical approaches to clustering. This is a picture of me, and this is a summary of my life so far. I used to do algebraic geometry in a previous life; after finishing a postdoc at USC about six years ago, I entered data science, where I've been for the last five years, more or less. I've been at SignalFx, which was acquired by Splunk about a year ago, and I work on a variety of sampling and time series problems that arise in the so-called observability space. This is the slide about forward-looking statements that I'm obligated to show.

All right. So the overall goal of this talk is to give a mathematical treatment of what might be a familiar topic, which is clustering. I'll start with a geometric view, then move into a more axiomatic approach culminating in Kleinberg's impossibility theorem, then present a rather different mathematical view, which might be called the functorial view, provided by Carlsson and Mémoli, and then close with some comments on generalizations of clustering that are suggested by some of these approaches, namely persistent homology.

Okay. So just to set some context: the basic mathematical object that clustering applies to is a metric space. This is just a set endowed with a distance function that describes how far apart two points in the set are. Generally the so-called triangle inequality is required, although we don't actually need it for most, or maybe all, of the subjects discussed today. You could also think of this as a similarity function that describes how similar two data points are. The Euclidean distance on point clouds, so on n-dimensional real vectors, is probably the most familiar metric space, but on different types of data, for example categorical data, you can also define metrics, and the Hamming distance is one way to do this on categorical data: it is just the count of the number of entries that are not equal. So that defines a metric. And one viewpoint is that if you're given a dataset, constructing a reasonable metric space out of it is one of the core, or maybe the core, data modeling tasks.

Okay. So once we have the concept of a metric space, we can define what a clustering function is: it assigns to any metric space a partition of the underlying set, obviously with some regard for the distance function or the distance matrix. This is one of the main types of unsupervised learning, and it has several applications, maybe not so necessary to review for this audience. It can be used to segment rows, if you like: customers, or patients, or transactions. If you instead apply it to the columns, you can view it as a way of performing feature selection: you might cluster the feature space and use this as a preprocessing step when applying a supervised machine learning algorithm, or just as a way to sample from the feature space. And then, maybe slightly more exotically, this can be used as an ingredient in an ensemble method. So in a classification problem, for example, you can first cluster all of the rows with a certain label, for example all of the fraudulent transactions, and then try to train separate classifiers that predict each cluster separately. You can imagine that you're finding a taxonomy of the label you're interested in, training individual classifiers on those, and then somehow aggregating them. Of course, depending on the type of classifier you're using, you may or may not need to do this kind of preprocessing as part of the pipeline. But these are just some sample applications of this abstract idea of segmentation or clustering.

Okay, so probably the most familiar form of clustering is called k-means clustering, and I'll briefly review it here. The idea is that we're given a collection of points in Euclidean space and a specified number of clusters k, and we're going to return a partition of the point cloud into k clusters. The procedure is as follows. First, we choose k centroids at random from the point cloud; these are the cluster centers.
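To make the procedure concrete, here is a minimal NumPy sketch of Lloyd's algorithm, covering both this initialization step and the assignment/update alternation described next. The `kmeans` helper is illustrative, not the speaker's code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: random centroids, then alternate assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its closest centroid.
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids

# Four well-separated blobs; k = 4 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])
labels, centers = kmeans(X, 4)
```

In practice one would use a library implementation (e.g. scikit-learn's `KMeans`), which also handles smarter initialization; the sketch only mirrors the two steps described in the talk.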
Then, until convergence, we alternate between assigning every point to its closest centroid, which gives us a partition, and recalculating the centroid of each cluster by averaging all of the points in each part. Because the data set is finite, this eventually converges.

So let's see how it looks. I made a very simple raw data set that has four clusters, as you can tell, so we can see what happens when we apply k-means to it. If we tell k-means to find two clusters, of course it's not going to get the right answer, but it does a reasonable job of breaking the data set into two pieces. We can ask for three, and then it finds these two on the right, which are relatively well separated, and still groups the two on the left together. And of course, when we tell it the correct number of clusters, it gives us the correct answer. So that's great. But let's make a somewhat more interesting example, just something a little more exotic. So let's have some strips. I will generate a certain number of strips, where the y-coordinate ranges between 0 and 1, and the strips are centered over x = 0, then 0.1, then 0.2, and so on, plus a little bit of noise. So we're just sampling from strips like this. Here are two strips, and this is five strips; nothing too hard. Most people would say there are obviously two clusters, or five clusters, in these examples. So let's see what k-means has to say about it.

Okay, so we do one strip; of course we get one cluster. We make two strips and tell k-means to find two clusters. It finds two clusters, but unfortunately it kind of sliced the data the wrong way. The cluster labels are the coloring, so it has sliced it into the purple and the yellow. Maybe this is an artifact of the small k. Maybe if we increase the number of strips and increase k, we'll get a better answer. So let's try three: doesn't look so good. Let's go to four: again, not so good. Five: still not so good, with this nice yellow stripe in the middle going the wrong way, unfortunately. And we keep going: ten, twenty. As you look at these examples, I think what sort of emerges is that k-means has an idea of where the data is, but it has its own idea about what the geometry should be. So in this case it says: well, all the data is more or less sitting inside the unit square, but I have my own idea about how to break up the unit square into ten pieces, and I'm going to chop it into tiles. Or, if I have to chop it into twenty regions, I'm going to break it into small convex regions, in which the ranges of the x and y values are all relatively similar, even if the data itself doesn't have this nice property. And so the problem with this example is that there are some distances within clusters that are relatively large: if you look at the tops and bottoms of the strips, they are far apart relative to the distances between the clusters. The first two strips are close to each other, but there are points within those clusters that are relatively far away, and this is a type of geometry that k-means is simply not able to handle. It kind of wants to cut this data set into a five-by-four tiling, roughly, but that's not how the data is actually organized. So this suggests that maybe we should take the geometry of the data set more seriously: k-means has its own idea of what the geometry is, but we don't want to bake that into the clustering algorithm.

Okay. So there's a motto proposed by Carlsson and Mémoli, which says that clustering is the statistical counterpart of taking the path-connected components of a topological space. This is trying to take the geometry and the topology of the space a little bit more seriously, and treat it as a first-class concept. So in the world of math, you have well-defined regions, maybe described by some explicit inequalities on some coordinates, and you decide to fill them in. One thing you can compute on a set like this is its set of path components. This means that you take the set X and you quotient it by an equivalence relation: you declare that two points belong to the same path component if you can connect them by a continuous map from the interval into the space. So if you can find a path connecting the two points, you declare them the same, and the equivalence classes are your path components. If you apply that to this set, of course, you'll get four path components.

In the world of data we don't have explicitly defined regions like that, but maybe we imagine that our data sets are samples from such ideal regions. If you imagine transporting the definition of path components to the world of data, you can imagine saying: I'm going to quotient by an equivalence relation which looks a little bit like connecting two points by a path, but instead I quotient by the equivalence relation generated by declaring that two points are equivalent if the distance between them is less than or equal to some threshold epsilon. Then, if I can connect two points by a path consisting of links which are all relatively small, I declare that those points belong to the same path component, or cluster. And this is exactly what is usually called single linkage clustering with some threshold epsilon.

Of course, the choice of epsilon seems a little bit arbitrary and dependent on the data set, and it is. So you can consider the behavior of this for all epsilon at once, by the single linkage procedure. Here you declare that every point starts in its own cluster. Then you define the distance between two clusters as the minimum of the distances between two points, where the two points range over all pairs of points in those clusters. (You can replace the minimum by something else and get a slight variant; in some way you need a mechanism for aggregating the distances between points to define the distance between clusters.) And then you keep merging clusters, always choosing the pair of clusters that has the smallest distance between them. Eventually you'll have exactly one cluster, when your threshold is very large, and then of course you can stop.

And this whole procedure can be summarized in a dendrogram. This one comes from the same data set that was sampled from the ideal shaded polygons. The horizontal axis is telling you the threshold on how close two points, or two clusters, have to be in order to be merged. The vertical axis has some labeling, but it's basically just an index of the points, so it doesn't really mean anything. What you can see is that at the beginning there's a lot of horse-trading and reorganizing, so much that you can't even make it out in this picture: points start to coalesce into clusters, and there's a lot of movement going on, a lot of churn. Eventually, when you get to a threshold of around one and a half, things start to stabilize a little bit, and you have a regime with four clusters. Then, as you increase the epsilon threshold to a little bit above two, all of a sudden two of those clusters merge, and you're left with three; this is a relatively stable regime. Of course, eventually two more of those clusters merge, and finally, when the threshold exceeds six in this example, you have one cluster. So this is summarizing the single linkage procedure over all possible choices of epsilon. And then one thing you might ask is: are there some good properties of the different ways of slicing this dendrogram? What kind of axioms can we place on cutting this dendrogram into an actual clustering choice, where I think of giving you a clustering as picking a threshold?

Okay. So some axioms for clustering, proposed by Kleinberg, are the following. The first seems fairly non-negotiable: scale invariance says that if I have some distance matrix and I multiply all the entries by some positive number, I get the same clustering. So it doesn't matter whether my measurements are taken in feet or centimeters or inches; I should get the same clustering. Seems non-negotiable.
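Scale invariance is easy to check empirically for, say, single linkage cut at a distance threshold, as long as the threshold is rescaled along with the data. An illustrative sketch (the `cut_single_linkage` helper is hypothetical, not from the talk):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two tight, well-separated blobs of 10 points each.
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])

def cut_single_linkage(points, t):
    """Single linkage, cut at distance threshold t."""
    return fcluster(linkage(pdist(points), method="single"), t=t, criterion="distance")

a = cut_single_linkage(X, 1.0)        # two clusters
b = cut_single_linkage(10 * X, 10.0)  # rescale data AND threshold: same clustering
assert (a == b).all()
```

Note that a *fixed* threshold would not be scale invariant: cutting `10 * X` at `t = 1.0` shatters the blobs, which is exactly why a fixed-threshold cut fails this axiom.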
Then there's the consistency axiom, which basically says: if I know how a metric space is clustered, and I then increase the distances between clusters and decrease the distances within clusters, I should get the same answer. This is saying that if I have two metric spaces that differ by these kinds of movements within and between clusters, I have to get the same result. And finally, there's a richness, or surjectivity, condition, which says that any partition of the underlying set can be realized by some distance matrix.
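For threshold-cut single linkage, richness can be seen concretely: to realize a desired partition, place each group of points near its own well-separated center. A sketch of the 7/7/6 construction discussed in the talk (illustrative, not the speaker's code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Place 7, 7, and 6 points near three centers that are very far apart.
rng = np.random.default_rng(1)
sizes, centers = [7, 7, 6], [0.0, 100.0, 200.0]
X = np.concatenate([c + rng.normal(0, 0.1, n)
                    for n, c in zip(sizes, centers)]).reshape(-1, 1)

labels = fcluster(linkage(pdist(X), method="single"), t=1.0, criterion="distance")
print(sorted(np.bincount(labels)[1:].tolist()))  # → [6, 7, 7]
```

The recovered cluster sizes match the requested partition, and the same trick works for any partition of any finite set.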
In other words, if I have 20 points and I would like a clustering with cluster sizes seven, seven, and six, surely I can write down some distance matrix such that, when I apply the clustering algorithm to it, I get seven, seven, and six. Just make seven points really close together, another seven really close together but far away from the first seven, and the remaining six close to each other but very far away from the other two centers. Surely I can write down some distance matrix that realizes any partition.

Okay, so the interesting but maybe unfortunate news is that there is no clustering algorithm that satisfies all of these conditions. In the interest of time I won't go over the details, but you can achieve any two of the three. I'll share the slides, and you can of course read the paper; any two of these three can be realized by choosing a different stopping condition for the single linkage procedure, so some different way of cutting the dendrogram at some point.

Okay, then there are some objections you might raise: these requirements are maybe too strong. One objection to richness is that there may be some uninteresting partitions that we don't need the clustering algorithm to realize for us. For example, we don't need the clustering to produce "everything is in one cluster" or "everything is in its own cluster." And we might also object to consistency, saying it is too strong because it excludes the possibility of introducing substructure.
When distances change within a cluster, maybe some interesting new structure emerges. Here's a picture of that. Let's say we have this metric space with three clusters; that's fine. We're going to zoom in to the cluster on the left. Then we're going to shrink it, so we move all the points closer to each other, and now we snap it apart into three sub-clusters of its own. So if you imagine replacing, in the original picture of three clusters, the cluster on the left with this substructure, the consistency axiom would say you still have to get three clusters. But maybe a better condition is: you have to keep the two clusters that you didn't change, but you should be allowed to refine this cluster into three smaller clusters. If you make this relaxation, which basically says you can't merge but you can refine clusters when you have two metric spaces that can be compared in that way, and you don't have to realize every partition but can exclude some trivial ones, then there *is* a clustering function satisfying scale invariance and the relaxed consistency and richness conditions. Okay, so that's maybe some good news.

One thing that the consistency condition and its refinement suggest is comparing how the clustering behaves on metric spaces that can be compared in some way. And there is a framework in mathematics which can be applied to this, which is categories and functors. I'll discuss this briefly, and the slides will be available. So in math, a category, usually denoted with a very fancy C, is a collection of mathematical objects that can be connected by some type of morphism, satisfying some conditions. So, given a field K, there's a category Vect_K consisting of vector spaces over that field, and the morphisms are K-linear maps between vector spaces. There's a similar algebraic category for abelian groups. And then, maybe more interestingly, there is a category of topological spaces: the objects are topological spaces, so sets with some additional structure, and the morphisms are continuous maps of topological spaces. So most concepts, or most mathematical objects, sort of naturally organize themselves into categories; it's a fairly general concept, maybe just vocabulary.

Okay, and then there's also the concept of a functor, which is a morphism of categories: a compatible system of maps from the objects of one category C to another category D, and also from morphisms in C to morphisms in D. So whenever I have a morphism f from c to c′, I have to get a morphism between F(c) and F(c′), in a manner that respects the composition of morphisms. And the general idea is that C might be the category of topological spaces, and D might be some reasonable algebraic category where you can actually make computations. And so you'll transform sort of nasty geometric or topological questions into algebraic questions, which you can more readily compute with: you can compute with vector spaces in a way that you can't compute with topological spaces directly. As for examples of functors, maybe the most interesting one is the degree-n singular homology of a topological space. This is, very roughly, counting the number of n-dimensional spheres that a topological space contains. So this transforms some topological question into an algebraic question, in a way that actually helps you prove things about topological spaces.

Okay, so then you can apply this whole machinery to clustering. You have to make some categories for yourself; again, this is due to Carlsson and Mémoli. First we design a category of metric spaces with a certain type of map: injective, so one-to-one on the underlying set, and distance non-increasing. And then the clustering takes place in the category of partitioned sets, where a morphism is a map of sets which is compatible with the partitions. This means the partition elements can be merged, but not broken further apart. All the clustering functions we've defined are maps from metric spaces to partitioned sets, and then the question is: can we make them functorial? In other words, if we have one of these maps of metric spaces, can we produce a map of the partitions? This is a way of enforcing some naturality on the clustering algorithm, and it's basically saying that we can merge clusters, but we can't split them apart. That's what it means for a clustering function to be described as a functor between these two categories.

Okay, so single linkage with a fixed threshold is functorial; that's something you can work out. And one payoff of this viewpoint is the following. You can view single linkage as saying that you declare two points to be in the same cluster if you can connect them by a sequence of morphisms from the two-point space of width epsilon, in other words the space where you have two points that are distance epsilon apart, into your space X. And then you can say: well, what's so special about that two-point space? I know some more interesting spaces. For example, I can make a triangle where all of the sides have length epsilon, and I'm going to declare two points to be in the same cluster if I can connect them by a sequence of injective maps from triangles of a certain size into my space. This is qualitatively similar to the neighbors parameter in DBSCAN; it's taking density into account, and the motivation is of course the chaining effect: single linkage gives unfortunate answers on spaces with different densities, like this one, but if you instead require that you can connect points with triangles, then you can see that there are really two clusters.

Okay, and now I'm basically out of time, but I'll just mention that there's a generalization of clustering to higher dimensions that's suggested by this viewpoint, which is sometimes called persistent homology. The motivation: suppose I ask you to cluster this space, and the response might be: well, that's kind of the wrong question.
The answer is that there's a loop, right, an annulus, and that's never going to be captured by just the number of connected components. And so there are higher-dimensional analogues of the number of components, which are possible to describe using the so-called singular homology groups. And just as you can construct a graph from a topological space, sorry, a graph from a data set, you can also construct a higher-dimensional object called a simplicial complex, in a way that's compatible with changing the threshold epsilon. Just as we understood the behavior of the clustering as we moved through the dendrogram, you can understand the behavior of these higher-dimensional analogues as you change the parameter epsilon. And so this gives a nice higher-dimensional version, an instrument for studying such structures as loops and so on, which are otherwise not possible to see. Okay, so I am out of time. The slides will be available, and here are the principal references. Thanks, everyone.
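A small coda on the zero-dimensional case mentioned in the talk: the number of connected components at each scale epsilon can be read directly off the single-linkage merge heights, since each merge at height at most epsilon reduces the component count by one. An illustrative sketch, not from the talk:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Two tight blobs of 10 points each: components should drop 20 -> 2 quickly, then to 1.
X = np.vstack([rng.normal((0, 0), 0.05, (10, 2)),
               rng.normal((4, 0), 0.05, (10, 2))])

# Column 2 of the linkage matrix holds the merge distances.
heights = linkage(pdist(X), method="single")[:, 2]

def n_components(eps):
    """Number of epsilon-connected components (0-dimensional persistence)."""
    return len(X) - int(np.sum(heights <= eps))

for eps in (0.01, 1.0, 5.0):
    print(eps, n_components(eps))
```

This is exactly the "clusters as a function of the threshold" picture from the dendrogram; persistent homology repeats the same bookkeeping for loops and higher-dimensional features.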