About the talk
The Statoil Iceberg Classifier Challenge in 2018 was Kaggle’s most popular image classification challenge in terms of competing teams and was ranked one of the top challenges among all data types at the time. In this session, Kaggle Grandmaster and the 1st place solution developer, David Austin, revisits the challenge with modern day state-of-the-art technologies to evaluate how adjustment of techniques and tools during experimentation could deliver higher performance — even beyond what 1st place achieved.
In this talk, David explores the modeling problem in greater depth and reveals how SigOpt hyperparameter optimization, results monitoring, and artifact tracking contribute to deriving novel insights on the model and uncovering higher-performing configurations for this particular classification challenge. Gain insights on classification tasks, Kaggle competitions, and experimentation processes that can carry into any modeling task or domain, especially regarding how best practices for experimentation impact performance.
David Austin is a Senior Principal Engineer at Intel Corporation working on AI-based solutions for the industrial Internet-of-Things and edge segment. He is currently focusing on developing AI workflows for industrial anomaly detection and federated learning. In the past, he has worked on advanced semiconductor manufacturing process integration, among other use cases. David is a Kaggle Grandmaster and has been ranked as high as #11 in the world Kaggle competition rankings.
Hi everybody. My name is David Austin and I am a senior principal AI engineer at Intel. I'm also a Kaggle Grandmaster. If you're not familiar with Kaggle, what that means is I spend way too much of my personal time working on AI competitions, and that's one of the things I'm here to talk to you about today. I'm going to talk about how we can take a previous first-place Kaggle solution and see if we can supercharge it using SigOpt to take us to even higher results than we initially got. I'm excited to talk to you about this topic today.
First, let's talk a little bit about what the challenge was here. AI challenges are still very prevalent in the industry. They're really good at helping move the state of the art forward: they get lots of people to collaborate and work on these problems to see what we can learn about solving them, both at the individual and the global level. The particular challenge I'm going to talk to you about is the Statoil Iceberg Classifier Challenge. This challenge was launched about three years ago, and at the time it was the most participated-in image classification competition ever, so there was a lot of interest in this one. What we were trying to do is classify ships versus icebergs using satellite imagery. The reason for doing this is that aerial reconnaissance happens over different areas of the sea to determine whether ships are traveling through unsafe conditions, whether there are hazards in the way, or whether there are potentially other ships nearby, so it's important to be able to distinguish between the two. Now, the reason why I believe this challenge is so interesting, and why
it was so well participated in across the industry, is that it has some unique features that bring us closer to real-world scenarios, especially in areas like the industrial space, where I tend to do my work. We're dealing with a small training dataset. A lot of competitions have large amounts of data; if you're familiar with ImageNet, there can be thousands of classes and millions of images. Those bring their own challenges, but when you're talking about small datasets, you're talking about something that's within reach of the compute capability of most people, where each sample carries higher importance, and it's a lot more reflective of how we solve real problems in the real world. There were only about sixteen hundred images across the two classes, and each one was only 75 by 75 pixels with two channels. These aren't RGB images: each of the two channels was a different band taken from radar backscatter, with horizontal and vertical polarizations.
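As a rough sketch (not from the talk), here's how one of those two-channel records can be turned into a 75x75x3 image. The `band_1`/`band_2` field names follow the competition's JSON data format; using the per-pixel mean of the two bands as a third channel is one common convention, not something the data itself prescribes:

```python
# Sketch: turning one Statoil/C-CORE-style record into a 75x75x3 image.
# The third channel (mean of the two bands) is an illustrative convention.

SIZE = 75  # each band is a flat list of 75 * 75 = 5625 dB values

def reshape_band(flat):
    """Reshape a flat list of 5625 floats into a 75x75 grid (list of rows)."""
    assert len(flat) == SIZE * SIZE
    return [flat[r * SIZE:(r + 1) * SIZE] for r in range(SIZE)]

def to_three_channels(record):
    """Stack HH, HV, and their per-pixel mean into an H x W x 3 structure."""
    hh = reshape_band(record["band_1"])
    hv = reshape_band(record["band_2"])
    mean = [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(hh, hv)]
    # Last axis is (HH, HV, mean)
    return [[[hh[i][j], hv[i][j], mean[i][j]] for j in range(SIZE)]
            for i in range(SIZE)]

# Toy usage with a synthetic record:
record = {"band_1": [float(i % 7) for i in range(SIZE * SIZE)],
          "band_2": [float(i % 3) for i in range(SIZE * SIZE)]}
img = to_three_channels(record)
print(len(img), len(img[0]), len(img[0][0]))  # 75 75 3
```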
Now, what really caught my attention here, and why I thought I wanted to spend my own time working on this challenge: take a look at the images on the screen. Look at the ones that are labeled iceberg, and look at the ones that are labeled ship. When my eyes look at these images, what I see looks like a 1970s television with a bunch of snow in the background and a single blob in the center. There's not a lot you can see here, from a human perspective, that gives you much insight into what's a ship and what's an iceberg. As a data scientist, that's pretty intriguing. There are a lot of use cases where, if we can discriminate the differences between classes with our own eyes, then we should be able to train an AI system to do the same. But when our own eyes can't see the difference, that becomes pretty interesting. So this was one of the first challenges I worked on to really see what AI could do here, because if this kind of classification is reliant on a human, I think we have problems; if AI can help us move forward and solve this problem, that's real value. That's why I think
this is an interesting problem, and not your run-of-the-mill cats-versus-dogs task, where there's a relatively large difference between the classes we're trying to discriminate. Okay, so if you're not familiar with the competition community in general: most of these competitions are hosted by someone. The largest host is currently Kaggle, a platform for data scientists that hosts dozens, if not hundreds, of competitions in any given year. Rules are provided, data is provided, metrics are defined, and then we can either work together or work independently on the platform to share information and work on these problems. The actual sponsor of the competition, the party most interested in solving this problem, who brought the data and provided the prizes, was Statoil. Statoil is a Norwegian-based energy company operating in countries around the world, focused on energy exploration; they deal with a lot of these ships traveling around the world, which is why they bought into this type of challenge. So that's who hosted and brought this all together. Something that I think is important to
understand when you're dealing with these types of well-defined, constrained problems and competitions is that there are differences from the real world. We have to be aware of that, because it changes our approach to some of the problems. The nice thing about competitions is that you have a lot of things given to you, wrapped in a bow, that you can start working with as a data scientist, things that generally don't show up in the real world. On the real-world problems I work on, generally nobody's bringing me an entire dataset that's fully labeled and fully validated, or a clear understanding of how I'm going to trade off the complexity of the problem against simplicity, or straight-up accuracy against the latency the model needs to run at. That just doesn't happen in the real world; a lot of that you have to do yourself, and that's a lot of the hard work. This is one of the nice things about challenges and competitions: that is given to you. In this case, you're given the dataset; it's been validated, it's been labeled. There is a single metric to optimize for, and that is accuracy or some version of accuracy, in this case log loss. You have a single metric that you're trying to optimize for, and as such there are essentially none of the real-world constraints like model complexity. If you want an ugly solution with hundreds of models that you're ensembling and stacking together, go for it. It's all about showing the art of the possible, showing how you can maximize the accuracy. Okay, so it's a little bit different from the real world, but the point is that on these types of challenges, accuracy is king, and that's the type of solution you're going to get.
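Since the metric comes up a lot, here's a minimal sketch of binary log loss. This is just the standard formula, not code from the solution:

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss (cross-entropy), the competition's metric.
    Predictions are clipped away from 0 and 1 so the log stays finite."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions score near 0; confident, wrong ones blow up.
print(log_loss([1, 0], [0.9, 0.1]))   # ~0.105
print(log_loss([1, 0], [0.1, 0.9]))   # ~2.303
```

Note that log loss punishes overconfident mistakes much harder than accuracy does, which is part of why ensembling and careful calibration paid off in this competition.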
If you're going to stay with me, I'll walk you through what the best solution looked like for this challenge three years ago, the one we developed, versus what it looks like today, and specifically what it takes to bring that solution's accuracy up even higher using SigOpt. This is the winning solution architecture. I was fortunate to be on the winning team that developed the solution, and looking back three years, I'm astounded that this is the type of thing that was state of the art, given
where we are today. The solution here was what I would consider, by today's standards, a very ugly, very unmanageable one, but one that could deliver the highest accuracy among the roughly 3,400 competing teams. There were 181 fully customized CNNs, every one of them customized with different layers and different convolution sizes, trained and stacked together in an ensemble, with some final post-processing and prediction layers on top. Now, in the real world, believe me, you would not want to take 181 different customized models and try to maintain them, sustain them, look for drift, and do the types of things I need to do today to productionize a model. But remember, that wasn't the purpose. The purpose was to get the best accuracy possible, and this is what we came up with. A little more detail here: you can see this was what I would consider a really brute-force, random-search solution, versus the more elegant searches we can do today to optimize. We varied everything from the convolutional layers to the amount of dropout at each layer, the filter sizes, and so on. It was very brute force: every iteration would go through these parameters randomly, and if we could make an improvement, we kept the model. Ultimately, close to two hundred of these models came together into the winning solution, each incrementally improving accuracy. Very ugly, very brute force by today's standards.
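To make the brute-force approach concrete, here's a toy sketch of that keep-it-if-it-improves random search loop. The search space and the `evaluate` stand-in are entirely illustrative; in the real solution, each evaluation meant training a full CNN:

```python
import random

random.seed(0)

# Illustrative search space; the real solution varied conv layers,
# dropout, filter sizes, etc. across 181 custom CNNs.
SPACE = {
    "dropout": [0.1, 0.2, 0.3, 0.5],
    "filters": [16, 32, 64],
    "conv_layers": [2, 3, 4],
}

def evaluate(cfg):
    """Stand-in for training a CNN and returning validation accuracy.
    A real run would train on the 1,600-image dataset here."""
    score = 0.80
    score += 0.02 * cfg["conv_layers"]          # deeper helps (toy assumption)
    score -= 0.05 * abs(cfg["dropout"] - 0.3)   # sweet spot at 0.3 (toy)
    score += 0.0005 * cfg["filters"]
    return score

def random_search(n_iters=50):
    """Sample configurations at random; keep one only if it beats the best."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iters):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}
        s = evaluate(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score

cfg, score = random_search()
print(cfg, round(score, 4))
```

The inefficiency is the point: each iteration ignores everything learned from previous iterations, which is exactly what a model-based optimizer improves on.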
So, a natural question, since I've been using SigOpt for a lot of the real-world problems we work on: what if I were to take this first-place solution from a few years ago and do two things to it? First, update it with some of the more modern architectures and see, just by changing the architecture, how much of a free boost we could get today. And then, on top of that, run it through some of the optimization routines in SigOpt and see: can I beat what we had done a few years ago with that very elaborate, if not ugly, solution? So let's take a look and see what we could do here. A lot has happened in the world of AI in three years. We've obviously gone through a lot of different iterations of model architectures, from EfficientNets to Transformers and things like this, so we're very much ahead of where we were a few years ago. One thing to contend with is that state-of-the-art architectures expect three-channel inputs, while our data has only two channels; by adding a third channel, for example an average of the two bands, we can use architectures that are tuned for three channels. We used EfficientNet-B0,
with an updated, more modern learning rate schedule using the one-cycle learning rate, and only doing flip augmentation. So we're not adding a whole lot from an augmentation standpoint, and not throwing in much more than an updated model architecture, to re-baseline ourselves based on where we are today. Doing this gives us a little over 87% accuracy, which is really close to where we were with the original first-place solution. I think this is an important way to re-baseline ourselves.
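For reference, here's a minimal sketch of the one-cycle idea: warm up from a low learning rate to a peak, then anneal back down. The exact parameterization here (cosine phases, the `div` factor, 30% warmup) is my assumption for illustration; the fastai and PyTorch `OneCycleLR` implementations have more knobs:

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3, pct_warmup=0.3, div=25.0):
    """Minimal one-cycle schedule sketch: cosine ramp from max_lr/div up to
    max_lr over the first pct_warmup of training, then cosine anneal back."""
    warmup_steps = int(total_steps * pct_warmup)
    base_lr = max_lr / div
    if step < warmup_steps:
        t = step / max(1, warmup_steps)
        return base_lr + (max_lr - base_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr + (max_lr - base_lr) * (1 + math.cos(math.pi * t)) / 2

total = 100
lrs = [one_cycle_lr(s, total) for s in range(total + 1)]
print(max(lrs))  # peaks at max_lr at the end of warmup
```

On a small dataset like this one, the short, aggressive one-cycle schedule is a common way to converge quickly without a long hand-tuned decay.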
We don't want to say: hey, we can do all this optimization on this ugly old architecture. Let's really just take architecture out of the equation and re-baseline ourselves. Okay, now we're back up really close to where the first-place solution was, just by updating the architecture. Now, what can we do on top of this to take it even further? A little more background on the initial optimization of those 181 models in the original solution: it took somewhere on the order of 100 hours of total training and optimization time, plus data scientists looking at the results, deciding which models to choose, and stacking them. There was a lot of manual work that went into it, because we didn't have tools at our disposal at the time to do these types of embedded optimizations in our solutions. Keep that number in mind. Now, with SigOpt, there are a few things as a data scientist that I really like and appreciate, and I think they're also very viable when you're talking about using it for these types of competitions. One of the most important is that it's
framework agnostic. At the time we were using TensorFlow; more recently I'm using PyTorch. SigOpt really doesn't care, so it's very flexible as to the type of underlying solution I want to use under the hood. It's also very simple to install: I can just pip install it, export my API token, set up my project, and I'm ready to go. Now, another thing you do when you're working on these competitions, or on a real-world problem, is to make sure you understand: okay, what am I trying to optimize for? If I've got a real customer who cares about
latency, you know: how many camera streams am I going to run per PC? How fast does the model need to run? What kind of accuracy do I need to get? What are the tradeoffs? You've got to understand that up front. Here, on these competitions, we're optimizing for accuracy. Great, I know what I need to do, so I can set up a very simple configuration file, like the one on the right-hand side of the slide, with the parameters I'm going to optimize. In this case, I'm taking my EfficientNet-B0 model and optimizing what I would consider some of the standard hyperparameters: batch size, learning rate, number of epochs, et cetera. Then, to incorporate that into my standard workflow, and a lot of these workflows, as you may know, are relatively standard and repeatable once we find one we like for classification or detection or any such task, I can easily embed it by just defining what my parameters are and linking them to this configuration file.
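I don't have the exact file from the talk, but in the spirit of what's described, a SigOpt-style experiment configuration might look something like this. The parameter names, bounds, and budget are all illustrative; check SigOpt's documentation for the current schema:

```yaml
# Hypothetical SigOpt-style experiment definition (names and bounds illustrative)
name: iceberg-efficientnet-b0
metrics:
  - name: accuracy
    objective: maximize
parameters:
  - name: learning_rate
    type: double
    bounds:
      min: 0.00001
      max: 0.01
  - name: batch_size
    type: int
    bounds:
      min: 8
      max: 64
  - name: epochs
    type: int
    bounds:
      min: 10
      max: 60
budget: 100
```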
Then I hook it into my scheduler. In this case, the one-cycle learning rate schedule is hooked up to my SigOpt parameters to be optimized, and the next thing you know, I'm off and running with my experiments. I just have to kick off the SigOpt optimization at the beginning of my experiment and I'm off to the races; that's all I need to do. Now compare that to the way we initially did this optimization. It was all brute force; it was essentially random search. So it was very inefficient: it required a lot of computation time and a lot of manual interpretation of the results, exporting a bunch of log files and doing manual work to see where our optimization really stood, without a good understanding of which parameters actually drove the improvement, just cherry-picking the models that gave us the best result. So, you know, it's really not much more than black-box engineering at that point. But here's what it took, in total, to get to what could now be considered the first-place solution: six lines of
code. Six lines of code to do all of this optimization. Now, when I do that, I also get something that, as a data scientist, I greatly appreciate, which is visibility into what's going on inside that black box: which of these hyperparameters really matter, which ones are turning the knob in terms of what's driving my performance. This helps me as a data scientist know where I need to focus further changes in my code, and where I might want to further sub-optimize a model, because again, for
this challenge the whole goal is accuracy, accuracy, accuracy. Well, now that I know what's driving it and what's not, I can really refine my experiments and home in on where the most bang for the buck is going to be. In this particular case, these are some of the visualizations that come out of SigOpt showing what's driving my results. The parameter importance chart is really important to look at from a first-order perspective to understand what the drivers are; in this case, it's the starting learning rate as well as the batch size. Not surprisingly, those are generally first-order variables that matter quite a bit in a dataset like this. It's especially important when you're dealing with a relatively small sample size, only 1,600 images, which fits easily in the memory of a modern GPU, so you really want to make sure you understand it and optimize as well as you can without overfitting. Understanding which parameters matter most is important, so I can walk through and understand which parameter values gave me a certain accuracy. It's also really important to take out the variables that don't matter, so I can focus on the ones that do. As a data scientist, I also like to put my eyeballs on the charts that tell me what drove my results. I'm looking for things like outliers; I'm looking for anomalies, cases where I got the highest accuracy only because of one weird data point that the surrounding data points don't really line up with. That could be a hint that, on a small dataset, I might have a problem, and I can catch it by looking at these visualizations that are handed to me.
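SigOpt generates these importance views for you. As an illustration of the first-order idea, here's a tiny stand-in that ranks hyperparameters by absolute correlation with the metric across runs; the run log and the numbers in it are made up:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_importance(runs, metric="accuracy"):
    """Rank hyperparameters by |correlation| with the metric across runs.
    SigOpt's own importance analysis is more sophisticated; this is just a
    first-order stand-in you can run on any experiment log."""
    params = [k for k in runs[0] if k != metric]
    scores = [r[metric] for r in runs]
    imp = {p: abs(pearson([r[p] for r in runs], scores)) for p in params}
    return sorted(imp.items(), key=lambda kv: -kv[1])

# Toy experiment log: accuracy tracks learning rate, mostly ignores batch size.
runs = [
    {"learning_rate": 0.001, "batch_size": 16, "accuracy": 0.86},
    {"learning_rate": 0.003, "batch_size": 64, "accuracy": 0.89},
    {"learning_rate": 0.005, "batch_size": 32, "accuracy": 0.91},
    {"learning_rate": 0.007, "batch_size": 16, "accuracy": 0.90},
]
print(rank_importance(runs))
```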
From there I can decide where I want to go next. Doing this in an automated fashion versus doing all of it manually: it's impossible to compare the two. Today, I would never spend a hundred hours again on a challenge like this when I can go through a much larger sequence of optimizations using a lot less compute power and compute time, and a lot less of my own time as a data scientist, so I can be doing other tasks, or, you know, working on maybe a second competition at the same time. That's the kind of
thing I want to be spending my time on, where I think the best value, the most bang for the buck, is. So, you know, the punch line: taking this more modern architecture and running the optimization on it took about four and a half hours of total computation time, and that's because this is a relatively small dataset with small images, so I can get through all of these optimization runs fairly quickly. And I get a boost in my accuracy: it went from 89.17% to 91.25%. To put that in perspective, that is about 400 spots on the leaderboard. There were thirty-four hundred teams in this competition, so that's about a 10% jump in your relative ranking, and it took the first-place solution even higher. A jump like that is an absolutely enormous difference between being up at the top, at the level of the Grandmasters, and being down at the point where you're not getting any ranking points or moving
yourself up, all through an automated workflow. All it takes is a configuration file and six lines of code. Okay, as a Kaggle Grandmaster and as somebody who does this all day, every day, believe me, I'll take it; this is the kind of thing I want, the kind that makes my job easier and helps boost my results. A few things in conclusion. It is important to note here, to make sure we're not comparing apples to oranges, that updating the network architecture alone,
just from three years ago, which tells you how fast the whole AI ecosystem has been moving, brings us back up to par with the first-place solution. Okay, but just getting to parity with a previous first-place solution is not that interesting; I think we all know that architectures have improved. What's really the point to me here, and the value that I see, is optimizing the solution so much higher than we could have done just three years ago, with only four and a half hours of compute time and six lines of code. I think that speaks to the value
of what was done here. Now, I'll tell you, this is not just something that's useful for competitions. A competition is illustrative to show because you've got a lot of data and a lot of people who worked on the same problem, so you can do things in an understood, standard way and compare against other results. In the real world, what I like is that I can optimize not just for a single variable like accuracy; I can optimize for multiple variables. That's so important, because most customers, most people trying to solve a real-world AI problem, are usually dealing with the tradeoffs we talked about: accuracy, latency, performance, maintenance. Being able to optimize for the things that influence those variables is really important, and the fact that you can do it so easily, hook it into standard pipelines, and stay framework agnostic is why I really recommend that people use it, explore it, and try it on their own. So anyway, that is my take on what we can do with a first-place Kaggle solution, and I would encourage you to go back and play yourself: take a solution that you worked on, one you've established as a credible baseline, add a few lines of code, add a configuration file, and see what else you can get using SigOpt. My guess is you're going to see something similar to the results I showed you here today, and if so, I think you'll see the value in it. Thank you very much. It was a pleasure being with you today.