I am a technical leader who wants to work on his company's biggest problems. That used to mean working on large databases and backend processing systems because they seemed like hard problems. I still like those things, but now I gravitate more towards the highest-value problem for the business, wherever it is in the stack.
About the talk
In 2019, Fitbit moved all of its production operations from managed hosting to Google Cloud Platform without any downtime. The Fitbit experience is provided by a monolithic application backed by 200+ data stores, making the task of moving service by service impossible. Fitbit decided to run services in both hosting environments and move user by user. This is the story of Fitbit's migration to GCP, which was tested and executed mostly in production, without any effect on users.
To successfully migrate, Fitbit reviewed their goals and requirements for the migration. What should the user experience be during this period? How to know if benchmarks are being met and the process can push forward? How to slow or reverse migration if things weren't going well? Answering these questions led Fitbit to a migration plan that started with the movement of internal users, followed by the careful transplant of a small number of real customers, and concluded with a mass migration of the majority of users.
This migration path required significant new additions to Fitbit’s architecture, including new testing, routing, and caching techniques. As the journey approached its conclusion, Fitbit recognized that these methods were not merely allowing for migration; they were allowing Fitbit to operate in multiple hosting environments simultaneously. The lessons from this migration have provided the foundation for a multi-region architecture that will unlock the full potential of life in Google Cloud Platform.
Speaker: Sean-Michael Lewis
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
Subscribe to the GCP Channel → https://goo.gle/GCP
Hello, and welcome to my talk, "Do It Live: Fitbit's Zero-Downtime Migration to GCP." I'm Sean-Michael Lewis, principal software engineer at Fitbit, and I was in charge of the application side of this migration. I'm going to talk about what was unique, or what we felt was unique, about this migration challenge; about how we thought about users when making decisions throughout the migration; and about the technology and processes we added and adapted to make this happen, which we think will be useful for you should you be moving from a provider or your own managed hosting into Google Cloud Platform.

Like many companies, Fitbit started with a monolithic application: a single binary that did almost everything. It carried the bulk of our traffic for a long time, and then we started to break it up into services, each one taking its own data store and going its own way. But the monolith still takes up about 70% of our traffic, so moving to GCP service by service wasn't going to happen. Let's look at what the monolith involves.

Like I said, it's a single Java binary running on about a thousand instances, backed by around 200 MySQL databases that hold user data. The data is sharded by user, so a request for my data can be served entirely from one shard. Users are assigned to a shard when they sign up, which lets us expand the number of MySQL instances over time. We also do a lot of caching, for which we have around 400 memcached nodes, and asynchronous processing over messaging, largely handled by other instances in the fleet. Now that we know what the application looks like, let's talk about our users.
We thought about what it would look like for us to move, and who we should be thinking about as we moved. The most important stakeholder in the move is the user, and sometimes the right decision is simply the best thing for the user. So how do we get to GCP with our users happy? We considered two ways of doing it: progressively, moving user by user or batch of users by batch of users, or all at once. There were different pros and cons to each.

For the progressive approach, anything bad that happened in the new environment would be limited to only a small number of users while we figured out the problem, and we could see how the new environment handled load in a more measured way. On the downside, we knew some network calls would have to cross between environments, and we would have to operate two full application stacks simultaneously, managing both at the same time in two different hosting environments.

For the all-at-once approach, any challenge in the new environment affects everyone; you can test beforehand, but you don't really know what's going to happen until everyone is there. The positive is that once you cut over, you only run the new environment, although you still need the old one on standby in case something goes wrong, so you're not really getting rid of it when you cut over, at least at first. Overall, thinking of our users, we found the progressive migration would be much better for them: most users would slowly roll into GCP with the best experience we could give them, maybe without even knowing.

As we thought about how to do this progressive migration, we laid out what we wanted from it. The first requirement was to route users, on a best-effort basis, to the environment where their data is resident: if my data has moved, my requests should start going to GCP. Second, the migration had to be reversible: we wanted to move users to GCP, but if things weren't going well, we needed to be able to move them back. Third, we needed variable speed: if things were going great, we should be able to move users more quickly; if they weren't, or we weren't sure, we should slow down and check. And lastly, making those decisions about going fast, slow, or backwards shouldn't add a lot of extra work to the timeline of the migration.

So, as I said, we wanted to move users slowly and quickly at different points. You can always run tests, but until you're really serving production traffic, it's not the same. The first users we moved were our employees: their data was still in our data center, but they were serviceable by going through GCP. Once we felt confident, we decided it was time to start moving paying customers, and we began with a slow trickle, watching our systems to make sure we could operate in both environments and that things were really working in GCP. Then, once we had a lot of confidence, we started moving users in big batches, and we moved quickly, as you'll see.

So how did we change our architecture and technology to make this kind of move possible? On the left side of the diagram is the monolithic architecture in our legacy data center, with the application, the MySQL instances, and the memcached clusters; the GCP side started out empty.
Then we had the connections from the applications to their local data stores, the blue lines in the diagram. We had to add connections back and forth between the data center and GCP. We also added edge routing: the logic that allowed us to route users based on where their data lives. We used Cloudflare as our CDN, and they had a product that let us push that routing logic up to the edge, which also made it easier to observe.

We also needed connectivity between the caches. An application only talks to its local cache, but when data changes, that change has to be replicated to the other data center, so we introduced a proxy that allowed us to echo invalidations across data centers. I'll talk more about that shortly.

First, let's talk about moving users. As I said, we initially had empty databases in GCP and full databases in the data center. We have a notion of buckets: buckets have users in them, mostly as ID ranges, and using buckets as the unit of movement allowed us to move users to GCP incrementally. We could move one bucket at a time off a database server in the data center. This is the slow method of movement: it takes about 20 minutes per bucket and involves a lot of locking; users can still read their data, but it's slow and gradual. We could add parallelism for more speed, moving many buckets at a time. For the mass migration, we used a typical database replication strategy: replicate from the leader database, then at some point flip who the leader is, making the leader the one in GCP.
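The bucket idea described above can be sketched in a few lines. This is a hypothetical illustration, not Fitbit's actual code: user IDs map to buckets by ID range, each bucket maps to the environment and shard where its data currently lives, and "moving" a bucket means copying its rows and then flipping the mapping. All names and sizes here are made up for the example.

```python
# Illustrative sketch of bucket-based user movement (all names invented).
BUCKET_SIZE = 10_000  # each bucket covers a range of user IDs

# bucket id -> (environment, shard); assume everything starts in the colo
bucket_map = {b: ("colo", f"mysql-{b % 4}") for b in range(8)}

def bucket_for(user_id: int) -> int:
    """Map a user ID to its bucket by ID range."""
    return user_id // BUCKET_SIZE

def locate(user_id: int):
    """Return the (environment, shard) currently holding this user's data."""
    return bucket_map[bucket_for(user_id)]

def move_bucket(bucket: int, gcp_shard: str) -> None:
    """After the bucket's rows have been copied (the slow, lock-heavy
    step), flip the mapping so new requests route to GCP."""
    bucket_map[bucket] = ("gcp", gcp_shard)

move_bucket(3, "gcp-mysql-0")
print(locate(35_000))  # user 35000 is in bucket 3 -> ('gcp', 'gcp-mysql-0')
print(locate(5_000))   # bucket 0 is still in the colo -> ('colo', 'mysql-0')
```

The point of the bucket as a unit is exactly what the talk describes: the mapping flip is cheap and reversible, while the expensive part (copying rows) can be parallelized across many buckets.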
Once a user's data is resident in GCP, we could say that this user should always go to GCP, no matter where the request comes from; the edge routing decides based on where the data lives. The logic is simple: if the user is in the whitelist, route to GCP; if not, route to the data center; and if there's some kind of error or no user information, use the default. Eventually, once we had moved more than 50% of users, we made GCP the default.
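The whitelist decision just described is small enough to sketch. This is a hedged toy version, not the real edge logic: users whose data has moved are whitelisted to GCP, everyone else stays in the legacy data center, and requests with no user information fall back to a default that was flipped to GCP once a majority had moved.

```python
# Toy version of the routing decision (names and IDs are illustrative).
gcp_whitelist = {101, 202, 303}   # users whose data is resident in GCP
default_env = "legacy"            # flipped to "gcp" past ~50% migrated

def route(user_id):
    if user_id is None:               # no user information on the request
        return default_env
    if user_id in gcp_whitelist:      # data lives in GCP
        return "gcp"
    return "legacy"                   # data still in the data center

print(route(101))   # whitelisted user goes to GCP
print(route(999))   # unmoved user stays in the legacy data center
print(route(None))  # unknown request gets the current default
```

Keeping the rule this simple is what made it safe to push up to the CDN edge, where it could be observed and changed without touching the application.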
The last technical aspect was keeping the caches coherent across data centers. As I said, applications only talk to their local caches, so changes on one side had to reach the other. The way we did it was to introduce a new piece of technology: mcrouter, an open-source memcached proxy. It has the ability to echo deletes and other operations to other data centers or other clusters of memcached. To our surprise, we got very few headaches and very few incidents from it, whereas introducing a new piece of technology just before a big move could have been much more
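The cache-echo pattern is worth seeing in miniature. This is a minimal Python sketch of the idea, assuming plain dicts stand in for the two memcached clusters; the real system used mcrouter, and this toy only shows the pattern: reads and writes stay local, while deletes (invalidations) are echoed to the remote data center so neither side serves stale data.

```python
# Toy cache proxy illustrating cross-data-center delete echoing.
# Plain dicts stand in for local and remote memcached clusters.
class EchoingCacheProxy:
    def __init__(self, local, remote):
        self.local = local      # the cluster this application talks to
        self.remote = remote    # the other data center's cluster

    def get(self, key):
        return self.local.get(key)      # reads stay local

    def set(self, key, value):
        self.local[key] = value         # writes stay local

    def delete(self, key):
        self.local.pop(key, None)       # invalidate locally...
        self.remote.pop(key, None)      # ...and echo to the remote side

local, remote = {}, {}
proxy = EchoingCacheProxy(local, remote)
proxy.set("user:1:steps", 8000)         # only the local cluster sees the write
remote["user:1:steps"] = 7000           # pretend the other side cached a value
proxy.delete("user:1:steps")            # both copies are now invalidated
```

Echoing only deletes, not sets, keeps the two sides from fighting over values: each environment repopulates its own cache from its local database on the next read.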
challenging. So how did it go? It went quite well, better than I think any of us expected. We did the routing of employees before everyone else and uncovered many bugs, around 60 of them. We closed 58 before we moved a paying customer, and we found the other two as if we were in production, without affecting our users. We also finished early. We had a pretty hard deadline to be out by a certain time, and if you look at the left of this graph, you'll see the migration progressing, the green line, well under control; then we were able to push the migration faster and get done sooner than we expected.

The takeaways from this process: if you have a sharded model like we do, you probably have some sort of per-user boundary or something like it, and if you can migrate little by little, that's going to be the best thing for your users. You should have options to go in either direction and to slow down. And lastly, we found that this was the first opportunity for us, having been in a colo, to use managed services, and it was a real benefit; now that we're in Google Cloud Platform, we think there are a lot more opportunities for us to do that. That's all. Thank you very much for your time, and I hope your migration goes well, too.