About the talk
Brian Wong is a Technology Fellow at Capital One and his talk was very much about the company’s transition to the cloud. Brian talked about how capacity planning is far from obsolete, but practicing it in a rapidly changing environment is entirely different from how it was practiced in the past. In his talk, Brian addressed containers, microservices, function-as-a-service, and fully managed services. He outlined the next-generation computing environment from an observability, capacity, and analytical perspective, and speculated on the form and value of capacity planning in these environments.
So, I've been here at CMG in the past. I'm not really going to talk to you about some of those things that we just spent a lot of time thinking about, the really cool things. I've been working on a different kind of digital transformation, which perhaps customers don't care about very much but most of us do: I've been involved in taking everything out of data centers and moving it into public clouds. And when I made up the title, I forgot that we were going to be in Seattle, where even if you're a mainframe guy, you know, you'll be all right.
But I've been doing capacity planning or related things for quite some time, and in the past couple of years my job at Capital One has become to be responsible for the efficiency, and especially the efficacy, of our cloud effort, which is like everything at the bank. I've noticed a few things have kind of changed, and it's a digital transformation kind of thing. So people would say, why don't you just scale everything up immediately?
It's a reasonable question, right? Stop to think about it: what's different about capacity planning now than 10 years ago? Some things are the same and some things are not. One thing is that you can still botch it up, but the impact of botching it up is probably not the same. When you do capacity planning is different. Who does your capacity planning is different. And by the way, that's possibly a very scary thing for people in this room, because historically, anyway, it was us that did that. And if it's a different set of people, it's like, wait, who is that, and what about me, right?
There's a whole bunch of things that are different: where the limits might be, what is acceptable and what is not acceptable, what success is. The data that goes into building a capacity plan is sometimes quite different. The applications have not stood still. I remember the days of the giant two-and-a-half-million-line application, and as scared of those as I was, I'm a lot more scared about some of the things I see today.
But not everything is different. What's kind of the same? Not a lot, right? We were talking about digital transformation; this is the digitally transformed view of capacity planning. So there are two things. If you do mess up your capacity plan, you can still create a CNN moment. Ask the folks at Amazon about S3 about two months ago. You can still get a pretty big black eye, but at least the difference is this: it used to be that when you made a mistake, or someone made a mistake, to recover you had to go buy something, and that meant you had to go and spend money, which meant you had to go beg forgiveness from somebody. That's what it usually meant. And I was involved in more than one where somebody had to go to the board of directors again, going, that 5 million bucks you gave me last year? Well, really it was 7. Not really the sort of thing you want to do. Today, in a cloud, you can often do this completely differently.
It's not a capital problem. And so probably one of these mistakes is on the order of a few thousand dollars a month. But at least, you know, somebody actually cares about this. On the other hand, one thing that is probably not so well understood about a cloud is that it is a whole lot easier to leak money to your provider than it ever was when everybody was paying attention to how much we spent to buy computers. Did I say leak?
In particular, because this is a pretty distributed thing, right? It used to be that there was a certain set of purchasing people that could sign for things, and maybe that was a few hundred people in a really big company; in a small company it was maybe a dozen or something like that. And now I have a problem: I have roughly in the vicinity of 9,500 people that have signature authority to go buy something from one of my cloud providers.
Pretty different too, but at least you can recover from it, right?
So, when you do capacity planning. We used to do it before you launched anything, because of course you had to go buy something, and that's what it took to do that. And if you were really rigorous about it, you would check your capacity plan before important events or things like that. But that was kind of on an occasional basis. Now you probably need to do it before you launch your first stuff, but then you're probably going to launch that stuff pretty frequently, so you get a chance to check it again, and you can adjust pretty quickly. As a matter of fact, depending on what you're doing, you may be changing your capacity often, maybe a couple of times a week. How often did you buy a mainframe a couple of times a week? Probably not, right?
Who does capacity planning was a real shock to me. Historically, at least in the big organizations that I worked in in the past and usually work with, there was some central group, and many of us were those people, and they had to be consulted or were part of that decision-making process. With this whole DevOps thing that you take on when you go to cloud, this is completely different, at least in my company. I have 9,500 people that can sign for, you know, the couple thousand bucks that it costs to launch their stuff. And the author of that, whether it's infrastructure as code or maybe somebody just clicking around a little bit, and let's hope they're actually doing it in an automated way, they're the ones who decide what kind of stuff to use. Now, most of these people have never done capacity planning. People in this room know that capacity planning is actually a discipline, or sub-discipline, of computer science, and when you do it like this, it's a terrible job. Capacity planning isn't their job, right? I've had many, many teams where the boss goes, build this and make sure it never crashes. The capacity, or the cost, or the efficiency and effectiveness of this isn't anybody's assignment. Well, maybe implicitly. But when you're asking somebody to do this who has never done it before, how well do you think they're going to do it? It's not that all these people are dumb people, but it's not really what they've been used to doing. Meanwhile, those of us who have been doing it are sitting off in some corner someplace trying to solve some other problem, probably, which is not a good situation for the organization as a whole.
There are limits. I mean, capacity planning is fundamentally about taking a business problem and solving it with a bunch of computer stuff, and at some point a lot of that is just figuring out what the limit is so you can work around it. Maybe not the formal definition, but that's pretty much what it is. If you haven't been in the cloud much, there are a lot of limits in a cloud that you probably haven't run into before. There are limits to how often you can make a call to the cloud provider's APIs, or maybe not just your provider's but some of the data sources that you use to do your capacity planning. And sometimes these can be quite surprising. Did I mention that they're often not published? Even their existence is not published, forget what the rate is. Surprise, surprise. You can usually tell the system to go launch anything you want, but maybe you can't terminate it. One of them is a revenue opportunity for the provider; one of them is not, right?
And the limits that I just mentioned, you thought they were the same everywhere, didn't you? I mean, in your primary region and your failover. Why would they be the same, when they're all different? Not every region has the same hardware pool. In a certain sense, there's a saying that goes around that says there is no such thing as the cloud, it's just somebody else's data center, and there's a certain amount of truth to that. It's not entirely true, but it is true to a certain degree. That's why you can have a capacity plan, probably a perfect plan, and still get a surprise when you ask for it to go and be executed. Now, for many companies, at many organizations, maybe that's not an issue, but if we move tens of thousands of instances from operating in our primary into our backup, well, Amazon and Google don't have tens of thousands of those things sitting free in every zone, so you can get a surprise even doing that, in certain situations. Most of these are not problems that we ever had before in a data center. At least, I don't remember them.
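
One common way to live with an unpublished API rate limit is to retry throttled calls with exponential backoff and jitter. The following is a minimal sketch under stated assumptions, not anything shown in the talk: `ThrottledError` and `flaky_describe` are hypothetical stand-ins for whatever throttling error and API call a real provider SDK exposes.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error: the provider rejected a call for exceeding a rate limit."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Run `call`, retrying on ThrottledError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the limit to the caller
            # Full jitter: wait a random fraction of the capped exponential delay
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(ceiling * rng())


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_describe():
        # Hypothetical API call that is throttled twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ThrottledError("rate exceeded")
        return ["instance-a", "instance-b"]

    print(call_with_backoff(flaky_describe, sleep=lambda seconds: None))
```

Logging each throttled call is one way to discover the limits that are not documented, so they can be written down alongside the capacity plan.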
The data that goes into a capacity plan. I once had one of my staff tell me, we can't do that, that would be like gigabytes of data. Well, now we're getting terabytes today. I'm going to bet that some of my sources are going to be providing me tens of terabytes of data per month, probably within the next six to nine months. That's a lot of data. Some of the code that we used to use probably isn't prepared for that sort of thing. There's a lot more of it.
It means a lot of different things to different people. One of the things that we talked about in the panel was the data that you use to make decisions. Here's an example of that. At least where I've been, we spent most of our time focused on infrastructure, like CPU busy and I/O rates on networks and things like that. To be really honest, most of my customers, the customers of the bank, really don't give a crap about that. All they care about is, you didn't respond in the right time, right? We should be correlating those things, and we can now get a lot of data. But correlating the data from the application level with all the infrastructure-level things and all the cloud-provider-level things, and did I mention that you might be in more than one provider and they all have slightly different semantics? That's a microcosm of the data problems that all organizations have: the same data means different things, and there are different representations of the same data. Some of those are going to be as simple as, all these are in feet per second and these are in meters per second, but some are not so simple. But that's what it takes to do a capacity plan for some of the applications that we have today, which is kind of a frightening thing, really, to me at least.
I've spent most of my career dealing with a certain set of stuff, and there's always been a computer to measure. Except that, you know, I can't put an agent on the server that's running a serverless function. What part of "serverless" do you not understand, is what my Amazon person tells me, right? Which, I mean, I get it, but if I can't get an agent in there, how do I collect data from it? OK, you can ask your provider for it, but is your provider going to provide you the data that you need to actually correlate with the other things that you're going to get from your applications? Probably not. It's probably a little bit different. You probably have to do some sort of impedance matching between the kinds of data you're going to get from your provider, from the other provider, from your infrastructure, and from your applications, because they're all going to send you these things just a little bit differently. I think this problem has been that way for a while for many of us, but it's gotten a lot worse, it seems to me.
The applications: this one just scares me. I remember thinking in absolute amazement when MVS crossed the million-lines-of-code mark. How many of you remember a million lines of code in MVS? Yeah, I remember seeing some of you, so I know some people remember this. Today it's really, really common for an application to be a million lines of code. And by the way, I can't even count the number of libraries and packages that it sucks in, you know, all the Java libraries, and that's probably a lot more lines of code, some of it just for monitoring. But microservices just scare the crap out of me. Now, I get microservices. There's a perfectly great engineering reason for microservices: you can actually tell whether it's right or not. You can actually test it in some rational, possibly complete way, so there's a really good reason for that. But what happens when you take an application that has a million lines of code, plus all the code it imports, and it consists of microservices that are a hundred lines apiece? You get, what, 300,000 microservices? Wait a minute. Where do things get lost? I mean, if I have to monitor the interaction, if I want to find out where things are getting lost in this whole microservices thing, I'm probably going to spend an interesting amount of CPU and resources running the monitoring code just to find out which microservice was the culprit. And people tell me, well, the computers are a lot bigger. But it's not just a computer problem, it's a complexity problem. How do I do these things? It's not really that obvious to me. But your applications are changing in some very, very relevant ways like this.
So is there no hope? I mean, there's a bunch of different things that we talked about here. How do you succeed? For one, and this became apparent to me only after I was immersed in a cloud for, I don't know, a year or two: live in the cloud, live a cloud lifestyle. In a data center, once you've bought the computer, it doesn't really matter very much whether you run it at 100% or 1%. But that does matter in the cloud, because you could go buy something different and save yourself a lot of money.
The second thing, and actually many of these things fall under the heading of living the lifestyle: if you're in this world, and how many of you are not in a DevOps world? Oh my gosh, everybody. OK, I figured maybe a third would say so. You really need to figure out how you're going to put capacity planning and related skills back on the path. And I think the only way you're going to do that is to build tooling to put it into the infrastructure as code that you use, because there's no way to go to everybody in your organization and train them on capacity planning.
There are lots of things about all the limits that you may or may not hit: you want to find them. And even if your provider won't or can't document them for you, they should be part of your runbook. We've had a few very surprising things, like, why doesn't so-and-so work? Well, because we hit somebody's limit. You can usually raise these limits, but you have to know what they are, and that's an interesting way to work.
There are a lot of things that the providers have built to allow you to dynamically scale things. We found that many of those things are crucial, like an auto scaling group, where you can tell it to launch more computers if you need them and turn them off when you don't. Sounds like a great idea. Most of the implementations are a little less than perfect, but I'm sure they're going to get better. I know they'll get better. But when you have variable load, we've often found you can spend 90% less in a cloud than you did when you capitalized everything, just by using auto scaling groups. And 90% can be kind of a big deal, right? But you have to get people to grok this. You have to get them to care about what the bill looks like.
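
The arithmetic behind the savings from scaling with load can be sketched as follows. This is an illustration under stated assumptions, not Capital One's numbers: the hourly load profile, the per-instance capacity, and the hourly price are all made up, and the size of the saving depends entirely on how spiky the real load is.

```python
import math


def instances_needed(load, per_instance_capacity, minimum=1):
    """Instances required to serve `load`, never fewer than `minimum`."""
    return max(minimum, math.ceil(load / per_instance_capacity))


def compare_costs(hourly_load, per_instance_capacity, hourly_price):
    """Compare a fleet sized for the peak hour with one resized every hour.

    Returns (fixed_cost, scaled_cost, fractional_saving).
    """
    needed = [instances_needed(load, per_instance_capacity) for load in hourly_load]
    fixed_cost = max(needed) * len(needed) * hourly_price  # peak capacity, all day
    scaled_cost = sum(needed) * hourly_price               # only what each hour needs
    return fixed_cost, scaled_cost, 1 - scaled_cost / fixed_cost


if __name__ == "__main__":
    # Hypothetical day: 23 quiet hours at 100 requests/s, one peak hour at 2,000.
    load = [100] * 23 + [2000]
    fixed, scaled, saving = compare_costs(load, per_instance_capacity=100, hourly_price=0.10)
    print(f"fixed: ${fixed:.2f}  scaled: ${scaled:.2f}  saving: {saving:.0%}")
```

With a flat load profile the same calculation shows no saving at all, which is one reason the bill has to be looked at per application.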
When you have to produce a plan, or even an analysis of the current state of one of these modern applications, you kind of have to look at all the data sources you have, and if it's a big one, you'll probably have to create your own impedance-matching things to convert those data from one form to another and to convert the semantics into something that's reasonably compatible. You will need to work with application teams to do this, because I think it's probably going to be more invasive than in the past, particularly when you're doing things like microservices and serverless functions, where you can't just know what happens; you have to go and stick something into their code to find out what it is. And you're probably going to find that you get some resistance: that's my source code, don't put your stuff in it. The question is, am I suggesting that you can get some resistance to instrumentation? And yes, I'm saying exactly that, unfortunately, and I probably made it sound a little more, well, a lot more antagonistic than necessary, because sometimes there are really good reasons they don't want to do it. Like, I'm down to the last millisecond on my budget and I can't take another one. But yeah, there are a lot of reasons for those things, particularly when they haven't really gotten to microservices and they're kind of like mini-services. There's a lot of code there, and they don't really want you in it. It may be somewhat delicate; adding more people who don't really know that code can be kind of problematic. We've had an interesting number of those.
Do your planning at the application or business level. I don't know how common this is. Maybe I have a very skewed view, because I worked with infrastructure-centered companies for a long time. Maybe I have a particularly skewed experience. But doing a plan for an application or service, in my team, has been quite different than planning for infrastructure services or infrastructure consumption has been.
All these things, I guess, like I said, really come down to this: maybe I should have titled this "Living the Cloud Lifestyle." That's a different way to put it. But I think you'd agree that most of these things are pretty different, even if capacity planning is still the same fundamental task of taking a bunch of computers and assembling them in some interesting, optimal way to run some business service at some sensible cost. It's a pretty different world, isn't it? I think I'll leave you there and open it for questions.
I'm struck by something on this slide. It seems to me as though the cloud and DevOps have made the need for performance awareness throughout the software lifecycle, throughout the development community and among stakeholders, even more pressing than it was ever recognized as being before in the physical environment, which is to say not enough. So what can one do to advocate performance awareness and create a greater sense of it at the corporate level, but also at the university level? Performance is not taught at many American universities, and not at many universities abroad either, although judging from some noises I've been hearing, it seems to be growing in India and Germany quite a lot. What can we do to create this kind of awareness at the corporate level, and how do we as practitioners promote this among the other stakeholders? There's a misconception out there that you don't need to worry about capacity anymore because the cloud provides it all for you. They seem to have forgotten that they're paying for every little piece of it, and that you can still botch performance by having a bad software design, even in the cloud.
Like I said, you can still make a CNN moment, right? So I think that's a good point about the universities, and now that you say that, I think, gee, I was worried about performance back when I was in university, and I was considered weird. As far as doing it in the corporate environment, my advice is not to express it in terms of performance but in terms of cost. Most of the decision-makers don't really understand very much about performance and really don't care what performance is. They care if it's good enough, right? But when you get to somebody at the level that has a profit and loss, then they care about the cost. And in my experience, with my own stuff at the bank as well as for quite a few years before that, I'd say it's pretty common to find half of the budget or more and reclaim it.
When I worked for Sun, I was absolutely amazed one day when I was down at the factory. We had this machine that was an eight-CPU box, and I was amazed at how many of them went out with four, like 85% or something like that, with four CPUs. And I know better than that: a whole bunch of these need two, a whole bunch of these need six or eight. Why are they all going out with four? And the answer was, you buy four; if we don't have enough, we come back and get some more. And if we really needed two, oh well. We still have that in the cloud. The difference is that you can change it in the cloud; it's pretty hard to change it when you're capitalizing. So I think you really express it in terms of cost analysis.
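
The "everybody buys four" pattern in this anecdote can be put into cost terms with a small calculation. The sketch below is illustrative only: the workload list, the default size, and the field names are hypothetical, and a real analysis would start from measured utilization rather than a known `needed_cpus` figure.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    needed_cpus: int  # assumed known here; in practice derived from measurements


def default_size_report(workloads, default_cpus=4):
    """Summarize what a one-size-fits-all purchase leaves idle or short."""
    bought = default_cpus * len(workloads)
    idle = sum(max(0, default_cpus - w.needed_cpus) for w in workloads)
    short = sum(max(0, w.needed_cpus - default_cpus) for w in workloads)
    return {
        "bought_cpus": bought,
        "idle_cpus": idle,    # paid for but not needed
        "short_cpus": short,  # needed but not provisioned
        "idle_fraction": idle / bought,
    }


if __name__ == "__main__":
    # Hypothetical fleet: some workloads need two CPUs, some six or eight.
    fleet = [Workload(f"app-{i}", n) for i, n in enumerate([2, 2, 2, 6, 8, 4])]
    print(default_size_report(fleet))
```

Multiplying the idle CPUs by a unit price turns the same report into the cost figure that someone with a profit and loss is likely to care about.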
Well, yeah, I need to think about that part. But yeah, you're right, because it's almost a lost art.
I wanted to ask you, from a conceptual point of view, what's changed between outsourcing and cloud services, conceptually?
Well, other than that you can still screw it up, like my second slide said, pretty much everything except the basic methodology is probably different: who cares, who does it, when they do it, how you recover from it, the data you use. They're all pretty different than what I did as recently as maybe six or eight years ago. But you still need the same basic set of skills. To Andre's point, if you teach people the basic analytical skills of, well, figure this out by getting this set of data and combining it in the relatively few ways there are to do the analysis, you're probably OK, as long as somebody's going to consult you. That's the first problem, right, because we've been disaggregated. And as long as somebody cares, and as long as you have the data on which to do that stuff, and as long as it's at a scale that you can accomplish, well, you know, nothing's changed.
Hey, thanks a lot. I've got a question for you. You guys were here last year and presented about capacity planning at Capital One. If you were back three years from now, do you think you'll be presenting mostly the same challenges, or will you guys have figured out all your problems? That's kind of part one.
I would like to think that we'll have solved a bunch of these things. In the next business calendar year, I think we'll be able to make a pretty big difference in our operations, obviously not anyone else's, but I think we'll make a pretty good dent in it. And I would say the difference between what we will be able to do and the ideal is not mostly technical; it's mostly organizational.
So my follow-up question is, what can CMG do to help Capital One and other companies that are basically moving in this direction? How can we help?
Well, I would like to think that presentations like this will at least help make people aware that it's the same, except it's completely different, and identify some of those things. Perhaps, working as an industry, we could work with some of the providers of data to put them in better alignment as to how data will be used and consumed for this purpose, because it's quite a problem for things to be semantically mismatched.
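
The semantic mismatch described here, like the feet-per-second versus meters-per-second example earlier in the talk, is usually handled by converting every source into one common record before correlating anything. The following is a minimal sketch under stated assumptions: both provider payload shapes and all field names are invented for illustration and do not correspond to any real provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    """Common form: UTC timestamp, CPU as a 0..1 fraction, latency in seconds."""
    source: str
    time: datetime
    cpu_fraction: float
    latency_seconds: float


def from_provider_a(raw):
    # Hypothetical provider A: epoch seconds, CPU in percent, latency in ms.
    return Sample(
        source="provider_a",
        time=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        cpu_fraction=raw["cpu_pct"] / 100.0,
        latency_seconds=raw["latency_ms"] / 1000.0,
    )


def from_provider_b(raw):
    # Hypothetical provider B: ISO 8601 time with offset, CPU already a
    # fraction, latency already in seconds.
    return Sample(
        source="provider_b",
        time=datetime.fromisoformat(raw["timestamp"]),
        cpu_fraction=raw["cpuUtilization"],
        latency_seconds=raw["latencySeconds"],
    )


if __name__ == "__main__":
    a = from_provider_a({"ts": 1493640000, "cpu_pct": 42.0, "latency_ms": 120})
    b = from_provider_b({"timestamp": "2017-05-01T12:00:00+00:00",
                         "cpuUtilization": 0.42, "latencySeconds": 0.12})
    print(a)
    print(b)
```

Unit conversion is the easy part; differences in what a metric actually measures (for example, averaged versus sampled CPU) need to be recorded per source and cannot be fixed by a converter like this one.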
I guess those are the two ways I think would have the most impact.
I was thinking to myself, maybe sharing more experiences of how people are doing their own planning and what issues we're dealing with, so everybody gets to find out what everybody else is doing, so not everybody's reinventing the wheel as we keep doing. That would be a good way for CMG to help with best practices.
And I'm a big fan of sharing worst practices too. I mean, you're laughing, but, like, don't fall in that hole over here: that's a pretty important thing to disseminate. I think we have time for one more question.
I am Rashmi. I work for Southwest Airlines. One of our applications, which we regularly performance test, is moving to cloud, but the perception among the development team is that you don't need to performance test it once it moves to cloud. I have only recently started, so I'm going to establish a baseline test for them. But how do I change that perception?
How would you change it? Why do they think you don't need to do this in a cloud? Well, let's go back to my favorite: there is no such thing as the cloud, it's just someone else's data center. Suppose you sold your data center to somebody else, and the title changed but they let you use it the same way. Would that change the way you need to do a performance test in someone else's data center?
If it's the same computers? Probably not, right? Because your programs don't really care who owns the title on the computers. I don't want to be dismissive of this, because, without putting too fine a point on it, some of this comes from hard-earned experience. But really, a computer is a computer, and frankly the chips are still the same chips. I mean, you can buy a Skylake just like Amazon can buy a Skylake, right? And those are still the same network pipes, except maybe they're shared more. I don't know why you would think that you don't have to do a performance test in a cloud because you can get more. You could still get more in a data center; it was just harder to do, because you had to ask more people. I guess the other thing to say, and believe me, I've said this to many people, is to put it in cost terms: so you don't care about the performance, right, because you can just turn some more on? You've just got to pay for that.
Actually, let me go back to this one here, the last bullet. We have some teams that are spending three times as much as they need to. Three times. That's tens of thousands of dollars a month that's missing. And by the way, that's three times as much even if I assume that their code is perfect, and I probably don't need to explain myself there, right? Three times as much money as your budget goes, yeah, the word "hemorrhage" is on there for a reason. So, I mean, I really can't think of anything more incorrect than, I'm going to cloud, I don't need to test anymore. Well, you know what, if you're spending fifty bucks a month and you're making a million dollars a month, great, you probably don't care about the cost.
Years ago we said the same thing: that for the first time you could probably put a credit card on crappy performance. But the downside is nobody's really tracking that data and reporting it.
Not nobody, right? I mean, certainly we do. We send people bills, and people know what their profit and loss looks like. Now, that's at a relatively senior level of the organization; we don't push the profit and loss down to the individual DevOps engineer.
I don't mean reporting numbers. I'm saying the connection between the bills they're getting and the crappy performance, to Andre's question, that linkage is missing.
Well, I'm over time.
Yes, we are at the end of time, but the good news is we are moving into a break. Brian, thank you.