About the talk
This is the story of how adopting a containerized workflow changed the way our small software team works at NOAA’s Space Weather Prediction Center. Our old architecture, a big ball of mud shared-database integration, just wasn’t cutting it - it was killing our agility.
Over the past two years, our small team has adopted a microservice style architecture, using Docker with docker-compose and environment files as our deployment strategy for all new development. We’ve discovered the joys of using containers for identical dev, staging, and production environments.
We work closely with scientists: much of the code we’re running has complicated and conflicting library dependencies. Docker captures these beautifully - we’ve even had some success teaching our scientists to use it!
I’ll share what we’ve learned, some of the persistent challenges we face, and one place we really got it wrong. This talk builds off of a popular hallway track from DockerCon 2019.
Speaker: Chris Lauer, NOAA Space Weather Prediction Center
Hi, I'm looking forward to my talk, Predicting Space Weather with Docker. My name is Chris Lauer, I'm a senior application developer at the NOAA Space Weather Prediction Center in Boulder, Colorado, part of the National Weather Service, an interesting organization I'll talk about in a minute. My talk today is in three parts. The first part is our early adoption steps: how we started using Docker, which might be of interest to some of you. The second part is starting to use Docker for real: how we moved Docker into processing some of our critical data flows. And part 3 is how we started teaching our scientists to use Docker, and collaborating in containers with the scientists we have on staff, so that should be interesting.

First, a little bit about the Space Weather Prediction Center. We're part of the federal government, so we're embedded under several layers of agency bureaucracy, and we have a different subject area than our parent agencies: it's not traditional terrestrial, hit-you-in-the-face weather, it's space weather, which I'll talk about in a minute. That means we have our own software section. We have about 12 people in that section, and at any given time the team size ranges from 6 to 10. We also have one Linux admin, who's in a separate section, and one or two Windows admins. That's the makeup of our tech staff, in case you're wondering: a small team. We have a FISMA High security system, where "national critical" means we're not really allowed to go down and we're not allowed to get hacked. So our parent agency said: no cloud, we're not ready for that yet, let's not get crazy, stay on-prem and safe and secure. That meant we had to build all this stuff on premise. But we have a really cool mission, and that's one of the great things about working in the government: you often have a really cool and important mission that people feel passionate about, and that can give you a lot of energy to do things right, to do things well, and to care about your customers.

All right, so what's space weather? Space weather encompasses pretty much everything that's going on in the 93 million miles between the Earth and the Sun. That's a few different kinds of phenomena. There's the electromagnetic radiation, like X-rays from X-ray flares, that can impact communications and humans in space. There are the energetic charged particles that come from flares and coronal mass ejections, which can impact satellites and high-frequency communications, especially near the poles: protons and electrons flying out of the Sun really fast and then hitting us. And there are the magnetic field impacts: with coronal mass ejections you can get magnetized plasma that interacts with the Earth's magnetic field and causes some beautiful auroras, but can also impact the power grid in some fairly negative ways, and pipelines and other ground-based infrastructure too. So, pretty interesting stuff: we're looking at the Sun and trying to figure out what's going to happen to us here on Earth. That's space weather.

A little bit of background on our culture. We moved to agile, which is a little bit rare in the federal government, and we started that move back in 2012. We trained our developers in Scrum, and that really taught us how to build consensus on our team and inside the organization. It's been super important for about eight years, on and off, with different team sizes and different projects, changing and evolving and adapting as we go. Going back to the Agile Manifesto, one of the values we really care about is customer collaboration over contract negotiation. So instead of the typical federal government thing of "give me 100 requirements that I'll hold you to," we just build the best thing we can and iterate on it, and we really trust our customers.

After doing this for eight years, we saw some pretty big improvements in morale and some pretty big improvements in our capability to deliver, but it was time to improve our technical foundation for delivering. So here's our architectural challenge: how do you get started with continuous integration? By continuous integration I mean you check in some code, it gets built, and you're able to do a bunch of testing automatically, end to end, with every single commit.

What we had was a giant, single Microsoft SQL Server database (of all things) for all of our data. This is space weather, so we have ground-based data, we have several satellites, we have forecasts, generated data, and observations, all mixed together. And then we have all these applications that hit the database directly. There's no service layer; there's deep coupling; everything has its own MS SQL driver to connect back to MS SQL, and you've probably got some business logic sitting in stored procedures. So you're combining all these disparate data sets into one thing, with your applications' business logic living in the database. Whenever you tried to do a build and test against production data, that was very challenging, because you've got this giant thing with all these interdependencies that are very difficult to understand. And if you wanted to try to improve the database, well, databases are notoriously hard to version control. So this was really a sticking point.
So how did we get started? This is part one, which I alluded to earlier: seeking out a new way. Back in August of 2017 we took some Jenkins and continuous integration training. We'd heard about this continuous integration thing, we just couldn't figure out how to do it. So we had Ryan Blunden, who I think is now at Sourcegraph (shout out to Ryan if he's listening), come out and give us some continuous integration training, and he delivered it with a Docker tool set. A couple of our developers got really excited about that: this Docker stuff is cool, this is what we need, we could see how it was going to solve a lot of our problems.

Then we labeled the old way, because I think it's important to understand where you're at: it's called shared database integration. We did some research, and Martin Fowler does not have a lot of good things to say about shared database integration. Then I read up on a new way: Building Microservices, a book by Sam Newman. Even if you're not going micro and you're just doing service-oriented architecture, it's a really good book about the values of loose coupling and high cohesion that can help you build better software. Domain-driven design is another concept that might be helpful as you're trying to figure out how to break down your legacy monolith.

Then we found a small, high-value team project: a verification service. We were going to take some of the observations and some of the forecasts that were poorly stored in our database and compare them to each other, to say how good our forecasts are. It wasn't national-critical or high-security, it checked something important off the list of things we'd committed to that year, and it would have been hard to do the old way, so we grabbed it. Then we split our one large team. At this time we had a team of, I think, nine; we split it into six people working the new way, who were going to build the verification service completely separate from everything we'd done before, and three to keep working on some big legacy database projects. So we had one team ready to go, and that team adopted some new rules to develop the verification service.

But actually, before the new rules: why Docker? I mentioned we were really excited about it from that continuous integration training. Why were we so excited? One of the main things is the issues of scientific software. Scientific software often has delicate, conflicting, and just downright weird dependencies: libraries you haven't really heard of. IDL is something that's used a lot at our shop, and it actually requires a license at run time. And you may have a lot of build-time dependencies that are different from the run-time dependencies. When you go to install these on a traditional system, you've got to get it exactly right, and if you upgrade this one it'll break that one. You can have one piece that requires a certain version of a library and another piece, maybe even in the same project, that requires a slightly higher version, and you have to figure that out somehow. So that's something we got really excited about: we capture these dependencies as we move things into Docker, and in some cases this is the first time the dependencies have ever been well documented. Docker basically forces you to do that figuring-out, because if you miss one, you just add it to your Dockerfile and build again. Does it work now? Okay, excellent. That was really huge for us.

Some of the software isn't really very configurable, either. You can have hard-coded paths in lots of Fortran or IDL or C or Perl; it's not reading from a configuration file. But that's okay, because inside your container it can have whatever path it wants. You just push the data-persistence nonsense into docker-compose, or whatever orchestrator you're using. The inside of the container can be really weird, but the outside can still be really nice and tidy.

And the other thing was easing deployments. I mentioned CI before, but just being able to make sure that what you have in your own local development environment matches what's on staging, and matches production, and that those are all going to work: Docker images and the tags on them really make that easy. So, highly recommended for that. As long as we can solve how the data is persisted and how each piece of software is configured, then we've pretty much got it captured programmatically, and anyone on our team can look at it in version control and see exactly what's going on, which was a big, fresh improvement in our software quality and in our ability to cross-train.
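As a concrete sketch of what capturing those dependencies looks like, a Dockerfile for one of these scientific codes might resemble the following. The package names, paths, and model name here are invented for illustration, not a real project of ours:

```dockerfile
# Hypothetical scientific-model image: the Dockerfile doubles as the first
# complete, working documentation of the dependency list.
FROM ubuntu:18.04

# Pin the delicate, conflicting libraries this one model needs.
RUN apt-get update && apt-get install -y --no-install-recommends \
        gfortran \
        libnetcdf-dev \
        libhdf5-serial-dev \
    && rm -rf /var/lib/apt/lists/*

# The code has hard-coded paths; inside the container, that's fine.
RUN mkdir -p /data/model/input /data/model/output
COPY model/ /opt/model/
RUN make -C /opt/model

# Persistence happens outside: mount volumes over the hard-coded paths
# in docker-compose, and the outside stays tidy.
CMD ["/opt/model/run_model"]
```

If the build fails because a library is missing, you add one line and build again; the dependency list grows until it is complete and documented.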
All right, the new rules we used to drive this culture change. There are six of them.

First: we'd been doing shared database integration for so long that we made a new hard rule, only one thing talks to a database, and in our case that's the service. If you want to post some data or read some data, you're only allowed to talk to the service. The service is the thing that has a database behind the scenes, and you don't get to know what that database is. This was really hard for us; we kept drawing that line from services back to the database, even though we weren't using MS SQL anywhere. We had to get really serious about this rule.

Second: we're going microservices. That's a collection of loosely coupled, highly cohesive services, and figuring out how to break down a problem into these different services was a new thing for us.

Third: we're going to design with automation and testing in mind. The goal is that Jenkins, on each commit, can stand up all the components (a back-end database, RabbitMQ, whatever) and all the things we've written, drop in test data, have it ingested, processed, posted, and then retrieved, and make sure that looks right. This was going to be really cool.

Fourth: we're going to be event-driven. Space weather moves at the speed of light, so it needs to be low latency. We still want that loose coupling, so we use asynchronous messaging, publish/subscribe, as an important new architectural feature.

Fifth: we started to adopt twelve-factor app practices. This became the new default for us, especially "store configuration in the environment," from the code's perspective anyway.

Sixth: log to standard out. Stop having each piece of code manage its own log files; just blurt out what's going on, and the orchestrator will figure it out.

And we adopted a whole bunch of new stuff. This time we went with NoSQL for a more developer-friendly database. The nice thing about NoSQL is you don't have to version control your schema, and you don't have to declare the schema up front; as you discover the data, you don't have to worry about all that, as long as your code can handle it.
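A docker-compose sketch pulling several of these rules together might look like this; the service, image, and registry names are invented for illustration. Configuration comes from an env file, logs go to stdout for docker-compose to collect, and RabbitMQ carries the publish/subscribe messaging:

```yaml
# Hypothetical stack for a service following the six rules.
version: "3"
services:
  ingest:
    image: harbor.example.gov/verification/ingest:1.0.0
    env_file: ./ingest.env    # rule 5: configuration from the environment
    depends_on:
      - rabbitmq
      - mongo
    # rule 6: no log files; the process writes to stdout and the
    # orchestrator collects it (docker-compose logs)
  rabbitmq:
    image: rabbitmq:3         # rule 4: asynchronous publish/subscribe
  mongo:
    image: mongo:4            # the one database, hidden behind the service
```

Because the whole stack is declared in one file, Jenkins can stand it up on every commit for end-to-end testing (rule 3).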
And this is what we did. It wasn't any slower, either. It was almost impossible to deliver the old way, so this was just as fast, and maybe faster, even while learning all the new stuff; we got all that learning basically for free. The pace of learning was very rapid, and it made changes much easier: everything is in a container, we can all run each other's work, and it's just so much faster to learn. And our customer was happy, because we were able to deliver some modern capabilities: a quality verification service with updates at low latency. He was able to brag about it compared to the ones our international partners had, the ones that had been making us jealous for years.

But, and there's always a but, the microservices team got way out ahead of the organization very quickly. The people on the database team: how do we even begin to tell them how much fun we're having, how awesome this is, and what this lets us do? And also the system administrators; they're pretty slammed, pretty busy with their old jobs. How do we teach them the new way? We found we'd made a few mistakes, and we still had some learning to do, which I'll get to in a minute. And we started hitting new obstacles; it's not always easy. For instance, we couldn't build containers in our secure environment. There's a pretty obvious solution to that, and I'll get to it later.

All right, some small lessons from our early security mistakes. First off: don't give anything Docker daemon access. We made the mistake of monitoring our services with Nagios, which we've always used, by having it do a docker ps against the Docker API to see what's running. Unfortunately, that meant it had Docker daemon access. Penetration testers came in, and Nagios itself wasn't very secure, so they were able to compromise the machine through privilege escalation: if you have Docker daemon access, you basically have root. So don't give that to anything. Or, the caveat: if you do give it to something like Jenkins (we had Jenkins automating this pipeline, and the CI pipeline needs Docker daemon access), lock it down. We made that mistake with Jenkins. We'd gone to the CI training, played with Docker a little over the holidays, and then started this verification service almost a year later, and we didn't think to double-check that our Jenkins was appropriately secured. The pen testers were able to take advantage of a lax profile in Jenkins.
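One way monitoring can work without handing out daemon access (just an illustrative sketch, with invented image and endpoint names) is to let Docker probe each service itself, and point the monitoring system at an ordinary HTTP status endpoint rather than the Docker API:

```yaml
# Hypothetical healthcheck: the container is probed from the inside,
# so the monitoring system never needs to talk to the Docker daemon.
services:
  ingest:
    image: harbor.example.gov/goes/ingest:2.1.0
    healthcheck:
      # assumes curl is installed in the image
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```

A compromised monitoring host then only ever sees an HTTP endpoint, never root-equivalent access to the Docker socket.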
All right, that was part one. Now let's do it for real, with a critical data source. Here's our data source: GOES-16. This is the next series of operational weather satellites that take beautiful pictures of hurricanes and feed a lot of data into terrestrial models for Earth weather. They also carry a space weather instrument package. We had to replace all of our legacy GOES processing from the previous series of GOES satellites with this new stuff from the new series, which is a little bit different. And GOES is really important to us. The satellites sit out at 22,000 miles, monitoring X-rays, which drives our R scale, one of our NOAA space weather scales (kind of like a hurricane scale or a tornado scale, but for space weather): an X-ray flare is going to degrade communications, among other things. There's particle data as well; it monitors what's going on with particles for satellite health, and there's a component of high-altitude radiation exposure monitoring too, which feeds the other scale, the one for high-energy protons. So this has to be low latency, it has to be reliable, and we can't have much downtime while switching over from the old satellites to the new ones, because it's so critical to our mission. On top of that, we had legacy applications written in a variety of languages: Perl, MS SQL, PHP, Visual Basic, even Microsoft Access hitting the database directly. So this was going to be a bit of a challenge to do without MS SQL; we had to touch everything.

Here's how we did it. We used Docker and our new services to build a bit of a strangler pattern. We built a service that can handle the data, in Docker, the same way we'd built the verification service. That GOES data service needs to understand really two things. It needs to know enough about the data, from both the old satellites and the new satellite, so it can store them in a comparable way. And it needs to know how the applications use that data, treating them as customers; if you can solve all the problems those customers need solved, you're set. Then we fed the old data to the service first: we still had GOES-14 and 15 while we were getting 16 ready, so we added 14 and 15 first, because that's what was currently operational. Then we could transition the legacy applications to use the service one at a time, so you don't have to convert them all and deploy them all on the same day. We just deployed one at a time, switching each from the direct MS SQL call to calling our service and getting some JSON, and we did that for each one, in a variety of languages. Meanwhile we added the new satellite's data as it became available. Once everything was ready, we just changed which satellite is primary, and all of our applications downstream automatically started getting data from the new satellite; they don't even necessarily have to know that.
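Each application's change was deliberately small: replace a driver-plus-stored-procedure call with one HTTP request to the service. A hypothetical sketch (the hostname, path, and parameter are invented for illustration):

```shell
# Before: every app had its own MS SQL driver and stored-procedure calls.
# After: one HTTP call; the service hides the database entirely.
curl -s "http://goes-service.internal/xrays/latest?satellite=primary"
# The app no longer knows, or cares, whether "primary" is GOES-15 or GOES-16.
```

Because every language on our list can make an HTTP request and parse JSON, the same switch works for Perl, PHP, and Visual Basic alike.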
Straight cutover, go live, no downtime, everything seamless. We'd done transitions like this a few times before without Docker, but this time it was even better.

Just to give you a picture, if you're thinking about going microservices or Docker: here's a breakdown of how many containers this took (it's actually a few more than we need; I don't think I've cleaned one of them up). I'm not going to go through all of them, but maybe the most interesting is the LDM container. LDM is a piece of software that's used a lot in terrestrial weather in our parent organization, to transfer data between partner organizations, and it's years old. How on Earth were we going to containerize this? But on a fluke we searched Docker Hub, and it turned out they had a containerized version of it. It didn't quite work when we first tried it, but they were excited about this Docker stuff too, so we reached out and collaborated with them a bit, and we got it to a point where it works. We were able to containerize this part too, which was super exciting, because it made it much easier to deploy and we understood the configuration a lot better. So that was really awesome. Always ask.

All right, so now the new way of doing business is starting to solidify. We've settled on docker-compose for deployment. We have two hosts in production and two hosts on staging right now, and the developers develop on our own virtual machines, Linux VMs on our Windows boxes; we put things on staging, make sure they work, and then move on to production. As we solidified this new way, there are only four deployment artifacts, and that's what makes deployment feel really easy: you know exactly where everything goes.
I'll go through those four briefly. First, your docker-compose.yml. You're probably familiar with this: this file defines the relationship between each service, and between each service and the host. That's the bread-and-butter one. Then there's the .env file. This one is a little tricky because it hides in the file system (it's a dotfile), but docker-compose can read environment variables from it. The .env file is different on each host, and since docker-compose does environment variable substitution, we can extract the host differences into it. That way the docker-compose file can be identical on every host, development included, and the only thing that differs is the .env, which is really nice. Then there are your service configuration environment files: you have all your configuration values in files, but from your application's perspective it's all coming from environment variables. So if we eventually get a more mature, feature-rich orchestrator, we can supply those environment variables there instead of in files, which would be super cool.
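To make that concrete, here's a tiny sketch of how a .env file carries the host differences; the variable names and values are invented for illustration. docker-compose reads the same file automatically for substitution, but plain shell shows the mechanics:

```shell
# Create a throwaway directory with a sample per-host .env file.
mkdir -p /tmp/envdemo && cd /tmp/envdemo
cat > .env <<'EOF'
DATA_DIR=/srv/swpc/data
TAG=1.4.2
EOF

# docker-compose reads .env automatically for variable substitution;
# a shell can load it the same way:
set -a; . ./.env; set +a
echo "would deploy image tag $TAG with data under $DATA_DIR"
```

The docker-compose.yml then refers to `${TAG}` and `${DATA_DIR}` and never changes from host to host.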
And then the fourth artifact is images, and here's that problem from before: we couldn't build things in the secure environment. We solved that by adopting Harbor. It took about an hour or two to install, and then another six hours or so to figure out what was going on with the certificates, but once we had it, this internally hosted, on-premise registry became the really fun part of syncing our images. This is what our Harbor looks like right now: these are all the projects, each project has several repositories, and each repository has multiple tags checked in. For a deployment, you just update the tag: build the image with the tag, push it to Harbor, and it can be pulled from anywhere inside, and rolled back very easily to the previous tag. I believe Harbor is a Cloud Native Computing Foundation incubating project, and it's got a lot of good endorsements. Highly recommended if you're looking for an on-premise registry; if you're having a hard time making a decision, just try this one. You need one.
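The deploy-by-tag flow looks roughly like this sketch; the registry hostname and project names are placeholders, not our real ones:

```shell
# Build and tag a new release, then push it to the internal Harbor registry.
docker build -t harbor.example.gov/goes/ingest:2.1.1 .
docker push harbor.example.gov/goes/ingest:2.1.1

# On any target host, deploying means pulling the new tag; rolling back
# means pointing the deployment back at the previous tag, still in Harbor.
docker pull harbor.example.gov/goes/ingest:2.1.1
docker pull harbor.example.gov/goes/ingest:2.1.0
```

Because every old tag stays in the registry, rollback is just a tag change in the compose configuration and another pull.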
Also part of the new pattern that was emerging: the deployment steps. It's very easy to deploy now. One thing I didn't mention is that for each project, we make sure we have a UID for that project. Everything in the same docker-compose might use one or two different UIDs, and those should match an account on the target host, so the user inside the container maps to an account on the target host (that's probably a government thing). So: make that account on the target host, make sure it has a matching UID, then go to that account and set up the docker-compose file, set up the .env, and set up the environment files for each of your services. Then you have a robot account in Harbor, so you can docker login with that. You can pull images: you do a docker-compose pull, and it goes and gets all the images named in the docker-compose file; docker-compose up -d to bring it up in the background; and docker-compose logs to make sure it's working. If it's working, you're done. That's how we deploy everything now. Pretty simple. We wrote this down, and it made onboarding really easy. Pretty cool.
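Written out as commands, that recipe is roughly the following; the registry hostname and robot account name are placeholders:

```shell
# Run from the project's account (with the matching UID) on the target host.
docker login harbor.example.gov -u 'robot$deployer'
docker-compose pull       # fetch every image named in docker-compose.yml
docker-compose up -d      # start the whole stack in the background
docker-compose logs       # check stdout from all services; looks good = done
```

Four commands, four artifacts, and the same procedure on staging and production.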
All right, so now we're using it for GOES. How did it go? (Excuse my terrible pun again.) It was amazing. It just works. The new technologies we'd chosen, RabbitMQ and the rest, delivered the low latency and the reliability, and the code is easy to read and reuse, in containers, in a different way from what we did before. Everyone was having a great time.

I did promise a big mistake, though, and this is where we really got it wrong: we did not build continuous integration into our deployments, and we didn't monitor it. We kind of had this pipeline doing end-to-end testing, but nobody was checking to make sure it actually worked, and we weren't very good at writing tests; they took a long time. Here you can see that one run failed after two hours and ten minutes. The tests broke more often than the code, and because no one was checking, we didn't get an alert and it never came up at work when the tests failed. Developers stopped looking at it, then got frustrated with it, and we still don't have a working pipeline for this one, which is a shame, because it's super important. And because it so often had false failures, we never hooked it up to deployments. We missed a big opportunity to deliver on that continuous integration promise that was a big driver of why we adopted this architecture in the first place.
All right: this Docker stuff is great, let's do it for everything. It's really easy for dev, easy for staging and production, you only need to work on the part that needs fixing, there's great collaboration between teams, and being deliberate about data persistence is going well. We should really do it for everything. So let's talk about that.

Suddenly, we had a short-fuse expansion of our mission: the International Civil Aviation Organization, ICAO, tasked us with forecasts of different impacts of space weather on aviation. That includes GPS, high-altitude radiation exposure, and communication outages. In a few months we had to deliver this capability, and we needed models. We had models, but they were in varying states of definitely-not-ready. So, on to part 3: collaborating with scientists to get these things ready, and teaching Docker. The first step was a presentation I gave to the scientists on Docker, containers, and continuous integration, where I talked about how awesome this new architectural pattern is. Then we tried three different ways of collaborating with the scientists who owned these models to get them into production, something that for some of them had never happened before, while they learned Docker and docker-compose very quickly. You're probably thinking it went kind of like that famous xkcd comic, and I'm not going to lie, there was a little bit of that.

First up was the maximum usable frequency model, for communication outages. This model had been running for a long time on staging, but several of the developers who wrote it are gone, and it had Fortran code and hard-coded paths. We were able to break it into several different Dockerfiles: a multi-stage build for the model itself, and separate containers for the pieces around it. We were able to just kind of take it and get it working. Its dependencies are isolated now, so it's portable for the first time, and new developers are able to stand it up very quickly. So that was really cool.
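A multi-stage build for a model like that might look something like the sketch below (the make target and paths are invented): the Fortran toolchain stays in the build stage, and the runtime image ships only the compiled model.

```dockerfile
# Stage 1: build the Fortran model with the full toolchain.
FROM ubuntu:18.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        gfortran make \
    && rm -rf /var/lib/apt/lists/*
COPY src/ /src/
RUN make -C /src muf_model      # "muf_model" is an invented target name

# Stage 2: a slim runtime image with just the binary and its hard-coded paths.
FROM ubuntu:18.04
COPY --from=build /src/muf_model /opt/muf/muf_model
RUN mkdir -p /data/muf           # the path the old code expects
CMD ["/opt/muf/muf_model"]
```

The build-time dependencies (compiler, make) never reach production, which keeps the runtime image small and its attack surface low.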
The global total electron content model was a little harder. In this case, the scientist working on it didn't really have time to collaborate with us, and we didn't have time to collaborate with him, so he kind of did it himself. This was a little bit like the xkcd: we had to come in later and clean up a lot to make it more reliable. But he was super excited to work on it, and where he left it, even though it was pretty messy to be honest, was a great starting point: the runtime dependencies were all captured, so that was really cool.

And then CARI-7, the radiation exposure model. This is where we really did it right. Here the scientist really took an interest in Docker; I think the reproducibility aspects were really exciting for her. We worked closely with her: we did the architecture up front, and then when she hit a problem, she'd come to us and we'd help her work through it. We got something we're all proud of, that works very reliably, and it's even event-driven, using RabbitMQ. Pretty cool.

So, let's move on to the benefits. We're having a great time using Docker. We just got a new developer and had him doing real work, contributing, in about a month. A deploy takes about five minutes to get to production, where it used to be an hour if we were lucky; it takes a minute to roll back a deployment, and another five to try again, so it's really simple. And it takes about an hour to get your development environment working, where it used to be about a week. Here's our team; big thanks and a shout out to everybody. Thanks everyone for listening to the end. Have a great time at DockerCon.