About the talk
When facing a problem on a few millions rows of data, Jared wrote code that took hours to run, if at all. To speed things up he first split the data into smaller pieces, then did so in a smarter way. Still needing faster results, he wrote a custom function with a smarter algorithm, then sped it up further using Rcpp. All this took the runtime from hours to seconds, making it a feasible solution.
Jared P. Lander is Chief Data Scientist of Lander Analytics, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference and an Adjunct Professor of Statistics at Columbia University. With a masters from Columbia University in statistics and a bachelors from Muhlenberg College in mathematics, he has experience in both academic research and industry. Jared oversees the long-term direction of the company and acts as Lead Data Scientist, researching the best strategy, models and algorithms for modern data needs. This is in addition to his client-facing consulting and training. He specializes in data management, multilevel models, machine learning, generalized linear models, data management, visualization and statistical computing. He is the author of R for Everyone, a book about R Programming geared toward Data Scientists and Non-Statisticians alike. The book is available from Amazon, Barnes & Noble and InformIT. The material is drawn from the classes he teaches at Columbia and is incorporated into his corporate training. Very active in the data community, Jared is a frequent speaker at conferences, universities and meetups around the world. He is a member of the Strata New York selection committee.View the profile
Hello everyone. I guess I have the pleasure of introducing Jared. He is the person that taught me everything that I know or at least the foundations of starting, everything. I know. And so I've been very lucky to have to take my first day of science class through with him. And without further Ado gerridae, one of the organizers of this event. So I'm going to be talking about accelerating, barcode dramatically, speeding it up from somebody that takes forever to run hours. If it all, and just
seconds, this is going to be somewhat similar to a talk. I gave it the first, our conference 6 years ago, that was more of a survey. This will be more of a problem. I had them walking all the way through and all the different path. I took. So what was the ask? What are we trying to solve? What are we trying to accomplish here? Out of millions of points which are similar to each other. Essentially, which points are in a certain distance of each other. It's arranged searching problem. What's the problem about this? What's the big
deal? It's low. And the question becomes how so is it? I want to quantify how bad is it? In terms of calculations? This problem takes Endtime and -1 / 2 calculations. That's a quadratic relationship for half a million Rose has 125 billion calculations as 125 billion spots in memory. For a million Rose, that's a 500 billion calculations. I doubled the rose, but I could dribble the calculations. The calculations are bigger than squared in a million Rose really isn't that much data. So, let's check out the data for this particular problem. This is just a sampling of about five hundred
thousand points. And it's similar to do. The actual dataset had millions and millions of rows each row. Only has two Dimensions, which is nice. It's not a higher dimensional problem to mention. I remember we want to calculate how similar each row is to every other row and then we want to find the ones that are close to each other. According to some definition visualizing the data. It looks like a Rorschach test and we can see that there's more density in certain parts of the plot than in other areas. For my first attempt. I started with the obvious function
test its built-in to Besar and Returns the distance between every pair of Rose. So I call this function and it did not go. Well. I don't know when it's going to finish crash many sessions. So that didn't work. Had to go back to the data and look at it again, and I noticed something. When I looked at the ranges of the X & Y axis. I said, something's up when I summarize them. The First Dimension goes from -100 to 100. The 2nd Dimension goes from roughly 5 to 68. I said, you know what?
That looks a lot like latitude and longitude data. And if it looks like that, maybe I can use the tools in the SS package to treat the stuff like their geometries. So I use stsf to convert the data into geometries and I treat var1 and guard to as latitude and longitude. It's now like points on a map. They weren't points on the mat. There's no reason we couldn't use that technology to work with it. So I decided let's plotted on a map. And it looks like this blob trying to take over the Earth again. The map
isn't really that useful because the data aren't meant to be mapped. But I want to see what it looks like. So now they were using SF. We can use a function from sfst distance. Now, this is from a stuff, but it requires that the lwg on package be loaded and this measure is the distance between every point in every other point and I figure it's in the SF package for geometry, is going to be great. It was a bit of a dumpster fire or rather computer fire. It didn't Do I turn my attention to another function in sfst is width in distance is
exactly what we want. You to give it to a point and it tells you which points are within a certain distance of each other. This function is meant to be used with M. I know there's a new way to do it with latitude longitude data, but it's very new. So I converted into M. I'm taking my data, which I pretended to be that long. I'm not converting. That's b. M, 1 2, Step removed from the original Tatum and I have to convert my real threshold into M. Which I have is roughly 11,000, which was not exactly a one-to-one crossover, but is worth a shot. So I run this and once again,
it didn't finish. I needed a new idea. Take a back to the data which is still being plotted on the globe and I can see physically that there's some points which are just dramatically farther away from other points. Like the far left is nowhere near the far, right? So why am I compared in a point on the far left to the far right or the far top? The far bottom? So I thought let's split the data into a grid. Let's break it up and analyze different sections of the greatest separately. So, continue to use SF. There is a function as teammate grid.
This takes the range of the data and evenly fills. It with equal size boxes. These boxes can be squares on architects of our goal is to get the grid sells small enough for that compared to many points at once. But not so small, that estimate. Great has a tough time doing the competitions to make the grid. Sweet. Visualizes. Now we can see that the points fall and individual hexagon. So now, instead of comparing all the point. I'm just going to compare the points within a single hexagon. Some of the points are evenly distributed. I now have G, the number of groups
times and divided by g x, n /, g - 1 / 2 calculations goes. Assuming that there is an energy calculations in each cell. I just need to look at you to individually and I do this g x. Half a million points and six hundred groups. This is about 208 million calculations. That's roughly 125 billion fewer calculations than before. This assumes that the point really distributed which they're not and will look into that. The difference is Stark between not grouping and grouping. We're talking millions of calculations versus Millions, hundreds of billions of
calculations. But what do we do about the points? Near the edges of hexagons? What are your two points on either side of a hexagon border? What do we do about those? Because they wouldn't be compared even though they might be right next to each other. So we inflate the hexagons we use St. Buffer to make each hexagon spill over a little bit into the other hexagons. This way. I'm counting a little bit of the point outside, the hexagon yestin points near the borders, will be counted twice, but we can remove those later. It's not that big a deal. The goals me to have overlapping
hexagon. So, no pairwise distances, get missed out on and make the buffer really small just slightly bigger than your threshold. So if you plan it again, you can't even tell the difference because the buffering is so small. I can barely tell the hexagon to overlapping. So now we need to find out which points fall within which hexagon. We do a special joint using the s t joint function and we say find out exactly which points intersect St, intersects with which as it's beautiful that we can do this, using that stuff, makes it really simple. We bought this again. Now color-coding a
point, according to the hexagon. Remember, we're comparing points within a single hexagon against all the other points on hexagon not having to overlap with the hexagons are black, but will need to compare outside hexagons. It's all done for us. ST. Math grid and St. Join can be slow and I'm trying to bring out as much speed as possible. So I looked into the X-Men package which has an eponymous function in this can figure out which points fall in which hexagons and I'll make the hexagon much faster than estimate grid. But it doesn't actually give me the borders
of the polygons. And I can't deal with these border issues that can create a buffer and doing that, after the fact might actually take more time and won't be much faster overall. So this is a bit of a false start. I just went back to the grid. We have So what exam do I want to count? How many, how many points are in each cell? So drop the geometry. Because if you count with the geometry at the aggregate, the geometry of takes a lot longer. I get account and remove any hexagon with only one point in it, because there's nothing to compare it to. Show me hexagons. Now, range anywhere from 2 to
100 thousand points with the median number of points on hexagon being 201. That meant the median number of calculations was 20000, which is manageable. However, the biggest group still has five billion calculations for the biggest groups. That's a lot. Add up all these different calculations. We see we now have to make slightly under 14 billion calculations as opposed to 125 billion. That's an order of magnitude faster. Look at the ideas labeled, we can split it up. So I first tried group Nest, which will take
the data frame, and rather than making a group table. It will make a list cam of individual tables because I sometimes find it was easier to reason about I didn't use map inside the mutate to iterate over that list, and apply my function to each of them. So I do this I first transform it into M. I could use that to you within distance. I do the group next to break up into a list column and I call s t is a vent distance on each of them. Our problem just went from a possible salt to taking hours to run. If each group was small enough for
St. Is within distance. So we have two issues. Group Nest can be slow, slow process, and St. Is within distance is low and might not even finish. Do tackle. One thing at a time will first deal with group Nest then we'll deal with st. Is open distance. Instead of group Nest, I used the combination of Groupon and group, map group bike race rather than a list top column of tables and create a group table. And that is a faster operation than GroupMe. It rains over that group table and returns a list of
results soar, like the El pie from fire. Soap, I swap out goodness. And I run this grouping by cell that I do a group map applying this as he is within distance over each of the groups. We now went from hours to minutes a dramatic speed up just by swapping out group nest for group met. If each group was small enough for St. Is it in distance? And we couldn't make more self, but that would have added computational time in the celebration in the joint. So I thought I just told you better. Can I turn to data table? Because it excels at grouping.
So I take my SF object, I converted into as. A, the table and then type that into square brackets because they disabled uses square bracket notation and you can pipe in a square brackets. If you put a. In front of a square back. I grabbed it by sell as a group by option in Excel at the data table square brackets, and I called the function B. When you normally would do a group. I enjoy disable, your calling individual columns. Would you want to use a non-standard function? That's not some Armenian or something like that. You have to call the. SD object, which is a special group project and
they respect the grouping of the of the data table. And then St. Is open distance. Is meant to be called either on an SS object or geometry calling. So I extract the geometry calling from the. S. Be part of the group data table. Our operation now runs in fewer minutes and this matters. If you're running this often, if each group was small enough for s t is a vent distance. So this point I realized I needed an alternative for s. T is a vent distance. It just wasn't cutting it. So, I turned to
our fast. This is a package which has Rewritten the bunch of bass are functions to be faster and it lives up to the name, including a function called dist with a capital D. This works pretty well, and it is fast. Until you get to 46342 Rose. And I have to dig into why. So what does a C plus plus code, and I solve this function, which computes the number of calculations that will be needed? It's returning an integer. And that's the problem. Never remember. Integer has an r r, 32 B. We stopped before that. We need to make n * N - 1 /
2 calculations. In order to figure out this number, we first need to compute end times and -1 in memory. So at 46342, this comes out to be 2147534622. And this. You can be thinking as it's bigger than a 32-bit integer. But to do the 32 - 1 is bigger than this number. So what's the issue? That's because we use one of the B for the sign to tell if a number is positive or negative. So we really only have two of the 31 - 1, which is smaller than our number. So we need a bigger integer in C plus, plus. This is where our excellent tea comes
into play. It's not a 64-bit integer, but it's the biggest integer. The machine can handle which, with most modern machines is 64 bits. So I need a change of the code from this first function returning, integers to the second function returning, our excellent e. I made a pull request, but it hasn't been accepted yet because they asked me a question and I didn't notice that until last week. So hopefully this pull request will go through soon. So then I talked to Brian Lewis and he has a package called T Court, that's going to be on Kranz soon. Hopefully, it's based on
irlp a and it's really designed for computing threshold correlation matrices, but you told me it can compute, it can identify points that are within a certain distance of each other. So I do this for one cell. I convert the SF object, into a matrix using SD coordinates and then, pipe it into T dust, and he does can work in parallel. If you register a parallel front-end first, and I run this and Returns the indices of the roads that are close together and the distance, I then it rated over our data table to do this. To each of the
cells individually is similar to Cody. Before I get the cornices X Y & Shop, the geometry. I make it a datatable. I then it right over and passed the Matrix version of the. Of the group Dynasty object into T dust. This is great. But it relies on the sparsity of the answer. If there's too many rows that are close to each other this slows down. So I couldn't count on this always being fast. Sweden for about torch. This is essentially a matrix algebra library on a GPU. It's pretty new in are in the, our package is just R & C. Plus plus, it doesn't Reliant
python size. Pretty excited about this. So I turn the gym object into a matrix using SD coordinates and I turn that into a tensor and I tell it to be on the Jeep, You by sanctify sequels, Cuda. I think call NFP dust. It is returned the pairwise distances. This was unbelievably fast for 37,000. Rose is finished and 0.003 seconds. That's amazing. But my CPU only has 11 GB of memory. So it's easy to blow through that and crash. Once you get past about 60,000 Rose.
And not everyone has a Jeep you and it takes time to load the Jeep. You load the data onto the GPU for each group, and then extract the results. And then clear, the GPU memory again for you. There's a lot of overhead. Play then said, I need some custom code. Huddle together a few friends. Michael Bagel. Maher cast a komodo. And we came up with this algorithm. You first sort the data according to one dimension, one dimension, only the estimation. Double for Loop. Hugo one row at a time? And then do the inner for Loop comparing. All the subsequent, bro. You don't
look backwards because you've done that already. You just look forward. Ufirst compute the distance along the x axis X1 - X2 is that is greater than the threshold by the triangle inequality. That means the generalized distance of all the dimensions has to be bigger than the threshold also. So, you know, you failed the threshold with a bonus is that since your data are sorted according to this x-axis, you can then break out in a loop and Skip all the stuff scan points cuz they're sorted. All the other points are going to be farther away and you get to get rid of all these computations for that
round of the. I look so save a lot of time. If that passes the test, you then compute, the general Isis Square distance, you don't compute distance because that involves a square root, which is costly. You compute the distance, the square distance, and that's on all the dimensions. You do it generalized. For all the dimensions. It is to be more than two Dimensions. It could be a generalized distance. Now if that distance calculation fails, you skip it, is it passes you, then take the square root and keep track of indices. Doing this in our, we first allocate an array for the
indices and another one. For the distances. I probably could have made them one already, but this was just quick and easy. I then do a double for Lupe guys aren't burning. You can see it's doing exactly what the algorithm says. You compare the estimate in first. And if that breaks is a screw in the threshold you break out the inner loop if it's not you got to square distance if that's good. You take the square root and make note of the indices in the distance. Running this on a single cell is 3,000 rows in about a second $18,000 and Twenty-One
seconds and 37,000 Road and 72nd. And part of the slowness of that is allocating the array. And there was probably a better way to allocate that a wreck, but this seems manageable. So I'm using group. I am at a group map to iterate over all the cells and do them individually, but all those individual time is could really add up and slows things down and I can't really do it in parallel because that memory inefficiency could really hurt us because when you doing a parallel, you can really dance like the memory. I needed to go faster. This still wasn't good enough for me.
So I turned to our CPP which lets you right are like C plus plus code. This is the same function written in C plus plus, which is more verbose. The first key thing I do is instead of preallocating an array for all the indices. I make a zero length of a ray just to keep the end of season to distances. And as I add values, I can grow The Avengers and the vector it grown up that yard Seaport plus, is much more memory fishing than doing it. In our store, not going to pay a penalty for that. I do the double for Loop which is much faster in C plus.
Plus I use armadillo to the armadillo to do the Matrix calculations. My first check the extra mention I break out of the inner loop of that doesn't work either in compute, the generalized Square distance, if that works, I take the square root and I push back my doctors. I make an adjustment because people's plus starts counting a zero and I return the object as a different. I call this an individual sell for 3,000 rows, it finishes and point of 5 Seconds for 18,000 Rose and finishes at half a second. If it's 37,000, Rose, it finishes 1.3 seconds. This was an insane
speed up. I didn't even do anything in parallel people's Plus. You can see that the C+ plus code is going to as you grow. Your data is going to be much much much faster. So let's do this for all the different cells. We get the X Y data. We got the geometry to make a data table. It didn't need to be sorted. So I use set key in data table, and that's towards the data and I sorted, according to the First Dimension. I then call our function for each of the groups by using by, I say. SD card, so, I convert
just the X Y into Matrix and call my function. I get this beautiful day. The table, back my answers. For the entire data set, it took about 30 seconds. And I've tested this on millions of rows and it still stays about 30 seconds. We went from cannot finish. To finishing an hours, maybe. 2 minutes. 2 seconds in this really matters when you're running this outgrow them, hourly is really matters. Getting it down to S timing. And we did this by putting the
Buy this talk
Buy this video
Our other topics
With ConferenceCast.tv, you get access to our library of the world's best conference talks.