Test Leadership Congress
June 27, 2019, New York, USA
Mykola Gurov - Testing on Production, Deep Backend Edition

About speaker

Mykola Gurov
Full-Stack Software Engineer at bol.com

Mykola is a Java backend developer (he calls himself full-stack). He has a keen interest in CI/CD, testing, and everything that helps to move faster without breaking too many things. Since 2015 he has worked at bol.com, one of the biggest online retailers in the Netherlands.


About the talk

Why do we test on production? Why not avoid the risks?

Thorough testing before merging to master is great, but it doesn't cover the unknowns. Staging on shared environments tends to be slow, unreliable, and costly to support. Why not learn from the only true environment by conducting safe and efficient experiments there?

This talk is based on my experience of shifting testing "to the right" within the context of the back-end systems of bol.com (one of the biggest online retailers in the Netherlands; logistics and purchasing domains), where correctness is often a bigger concern than performance, and recovery might require a bit more than users hitting the refresh button of their browser.

Testing on production is often associated with A/B testing or canary releases, but those aren't always the best - or even applicable - techniques. We will look instead at shadow and dry runs, controlled experiments, survival of the fittest; how to apply these techniques and what to be aware of.

00:04 A confession to make: I do tests. On production

00:40 Starting to enjoy it: seeing it as an opportunity

01:04 Today’s agenda

01:42 A little bit about bol.com

02:29 An error, because of the scale, can very quickly become quite costly

02:34 The back end and front ends

02:59 What is testing?

03:53 3 types of environments: isolation, staging, and production

05:11 First example: business context

08:03 Sometimes we would miss some messages…

11:25 Second example: database change

13:47 Concern: ID or reference generation

13:49 Outside or within your shadow area?

16:50 Last example of shadow: shadow across the system

20:02 When something really happens in production, the learning arrives much faster than in theory

20:22 The important point: one toggle to rule them all

22:06 The other techniques: golden order a.k.a. “lucky” customer

23:29 White-listing: a little bit more formalized technique

25:38 Another technique: test a.k.a. Sandbox accounts

26:08 Dry Run/Preview: perform a calculation without side-effects

26:47 Buddy service connectivity

27:56 The last example: survival of the fittest

30:29 How do we do performance testing?

31:17 Another very useful technique

31:51 Chaos engineering, but not quite with monkeys

32:39 Testing in production as a first-class testing environment

33:07 The small but very important side note

34:02 Examples: best used together

36:02 The last drop. Conclusion


00:04 Hi, good afternoon. My name is Mykola, and I have a confession to make: I do test. On production. When I first realized this, I had a feeling of sorrow, because I thought this was a sign of a deficiency in our testing strategy. We should test everything before we go to production, and the excuse of lacking resources and time felt a little bit weak.

00:35 But we kept doing it, and over time I got used to this testing in production. I started to enjoy it, and to see it more as an opportunity rather than a deficiency. That is what I would like to talk about today. First a short intro, the slightly boring part; many people skip it, but I think it's important to give the context I'm speaking from, and why this works for us.

01:04 Then I will go into examples. My favorite technique is the shadow run; there will be three examples of it. After that come other techniques, a few operational concerns, a reflection, and a conclusion. I will be using this droplet as a reminder to keep my mouth hydrated. If you have questions, I would like to leave some time at the end, but if something is unclear or doesn't make sense, this is also a good moment to ask.

01:42 So, I work at bol.com. It's a Dutch webshop and also a platform for commerce. It looks small from the outside, but in our countries it is relatively big. It's e-commerce, so we don't have planes crashing if we make a mistake, and we don't have people dying, but there is still quite some value at stake. I googled it yesterday: last year we had...

02:11 ...almost two billion in sales in Europe, and we are growing every year by roughly 20-30%, depending on how you count. That scale does influence my work: if we make an error, because of the volumes it can very quickly become quite costly. I'm working on the back end; I call it the "back-back end", because it's not the back end that the front end talks to directly.

02:43 It sits behind the queues, behind the asynchronous processes. Most of my experience is in the logistics area. We have some front ends, but they are mostly used for internal purposes, so we don't have that much UI testing. So, a disclaimer: because of the nature of our product we have only one production environment. We don't ship our code to anyone else; we don't deploy to mobile devices or Mars rovers.

03:17 But this environment is highly volatile: we have more than 60 autonomous teams doing deployments at their own pace and making their own changes. We have a microservices environment with more than a hundred microservices, dozens of bigger not-so-micro services, and a couple of big legacy monoliths under the hood. So that is the context for what I will call testing on production. I won't spend much time defining testing at a testing conference.

03:51 For the purposes of this presentation I will make a distinction between three types of environments, going from left to right as we deliver our changes. The most important one is on the right: production. That's where the value is generated. Technically speaking, that's the only environment we need; the rest is overhead. But we might need that overhead, because any change to production...

04:17 ...can disrupt the value flow, and one of the ways to minimize this risk is to test things before we deploy them to production. Isolation means running a small piece of your application somewhere on your local development workstation or on continuous integration servers, in a highly controlled environment where things are very predictable.

04:42 Or on a staging environment; there can be multiple staging environments, which try with varying precision to emulate production. But in our experience that comes with a cost, which I will get to a little bit later. So that was the theory for today, and now I will jump into the examples. The first example, to give a little bit of business context. You don't have to read the slide; the pictures are just there to make it a little more exciting.

05:17 bol.com is a platform, so retailers can sell via our site. In the normal flow they sell and then do the shipments, returns, and all the logistics themselves. But we also provide a service called Logistics via bol.com, where they can send their goods to our warehouse. The stock sits there, and when an order arrives, we take care of the shipment and all the dirty logistics details.

05:53 ...for a small fee. I was involved in the service that calculates those fees, those costs. What happens is: we listen to events in our environment, like the shipping event in this picture, and when we detect that this is a shipment of something that belongs to our partner, we invoke some business logic and produce...

06:19 ...multiple financial micro-transactions: for shipping costs, for pick-and-pack costs, and the commission, which is what we earn for providing those services. All nice and simple, nothing fancy; it works, everybody's happy. And then a business requirement comes in to make things better, and a little more complex. The idea is that the shipping cost depends very much on how you ship: are you shipping everything in one big box...

06:48 ...or as several separate shipments to the same or different customers? The idea is to be fairer to our partners and look at how the shipment was spread over the customer order. So we would listen to the customer order signals, and also to the cancellations, because those happen very often, and only when the whole order is fulfilled would we produce the slightly adjusted micro-payments.

07:21 All good and simple: we implement the feature and test it in isolation, even doing some BDD. We walked through it with the product owner, which usually was not very easy, but we were confident enough on the functional side. We still had some hesitations on the technical side: we had introduced distributed locking, because we run on multiple nodes and we had to lock the order at the moment we do the calculation.

07:56 It was a new technique for us, for our team, so we did not know for sure how it would work with the production load. We also knew that sometimes we would miss some messages. We didn't know exactly why, but we knew it from experience. That was not a huge problem when we were missing them in small enough numbers in the original implementation, but in the new one the whole order would be...

08:21 ...off because of that, so we were a little concerned whether it would be a show-stopper for us or not. We tried to replicate those situations on staging and in isolation, but it was always difficult to exactly replicate the load and the distribution patterns we get on production. So we decided: why not just go to production and test it there? Production is there, the traffic is already happening there, and it's what we need.

08:48 So this is a very simple example of the shadow run within the same application, the same deployment. Instead of replacing the old functionality with the new one, we build the new functionality next to the old one, and at the very beginning of the processing we fork the incoming message. In this case it was very easy, because we were message-driven: we just subscribe to the same topic with a new shadow queue, a new subscription.

09:23 All the rest is completely isolated from the main flow: we do the new calculation and then produce the new shadow transactions into a separate table. What is very nice here is that it's very difficult for something to go wrong, because the flows are quite isolated. Consumers use the live data, and they would need to say explicitly "I want to use the shadow data". We let it run for two months, because two months is two billing periods.
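The in-app shadow run just described can be sketched roughly as follows. This is a minimal illustration, not bol.com's actual code: the event shape, function names, and the kill-switch toggle are all assumptions.

```python
# Sketch of a shadow run inside one service: the same incoming event is
# handled by both the live and the shadow calculation, but the shadow
# results go to a separate store and can never reach consumers by accident.

SHADOW_ENABLED = True  # kill switch: flip off to just acknowledge and drop

live_transactions = []    # stands in for the live table
shadow_transactions = []  # separate table, only read on explicit request

def calculate_fees_v1(event):
    # existing logic: a flat fee per shipment
    return {"order": event["order"], "fee": 1.00}

def calculate_fees_v2(event):
    # new logic, e.g. fees spread over the whole customer order
    return {"order": event["order"], "fee": 0.90, "shadow": True}

def on_message(event):
    live_transactions.append(calculate_fees_v1(event))  # live path, unchanged
    try:
        if SHADOW_ENABLED:  # shadow path: isolated from the main flow
            shadow_transactions.append(calculate_fees_v2(event))
    except Exception:
        pass  # log and move on; the shadow must never break the live flow

on_message({"order": "A-1"})
```

The key property is that a failure in the shadow branch is swallowed (and would be logged), so the live flow keeps its existing guarantees while the new code runs against real traffic.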

09:57 We collected the data and found that our concerns about the missing messages were correct, though it was a bit better than we expected, and it was not a show-stopper: we could build a simple compensation mechanism. We also found that our distributed locking was working fine, no problem. And a very nice bonus: during these two months our product owner built a full specification for every one of our partners based on the shadow data.

10:28 ...so we could produce examples of how the new calculation would look, which simplified the communication with the business side very much. One thing I forgot to mention: when I say that it is difficult for something to go wrong, that it is relatively safe, theoretically something can still go wrong here, and you could be running out of memory or something. It's not very likely, but to make sure that you can quickly stop the negative effects...

10:59 ...we had a simple feature toggle here which says: just skip the messages, acknowledge them so they don't pile up, and do nothing. We never had to use it, but it's always good to have. That was the first example. The second example is slightly different, in the same application. For some reasons we had to switch from an Oracle database to Postgres, which is a great thing if you are a developer, because Postgres is much more developer-friendly.

11:41 But it's not so great if your organization doesn't have much operational experience outside Oracle, if you are the business owner or stakeholder of financial data exposed to third-party partners, and you know that your service will be the guinea pig, the proof of concept, for this transition. So we decided to do a shadow run again, this time via a different channel: we would have separate deployments for live and shadow.

12:09 These could be different code bases; in our case we used the same code base. We just made our application work with both Oracle and Postgres, and configured it differently. The live deployment would be saving data to the Oracle database, the shadow one to Postgres. If everything goes well, the data in both should be equal.

12:36 Both variants were fully tested in isolation, also with dockerized databases, real Oracle and Postgres. We were still actively developing; the service was not blocking anything, and we were waiting for a good moment to finalize the migration. When we were ready, we would re-point the live application to the Postgres database. The consumers would keep talking to it; in our case a small downtime was...

13:06 ...no problem in this case. Then we would decommission the shadow instances and kill the Oracle database. A very nice side effect of this was that, because the data was supposed to be absolutely the same, we could pinpoint some early problems. I mentioned that we would sometimes miss messages; incredibly, there were also some very strange bugs in the tooling we were using. Small things, but nice to find.
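The "data should be equal" assumption is only useful if it is checked. A reconciliation step along these lines makes it concrete; this is purely illustrative, with plain Python dicts standing in for the Oracle and Postgres tables:

```python
# Sketch of the reconciliation step in a database shadow run: the same
# application writes to the live store and a shadow store, so any
# difference between the two points at a bug somewhere.

def compare_stores(live_rows, shadow_rows, key="id"):
    """Return ids missing from the shadow, extra in the shadow, and mismatched."""
    live = {r[key]: r for r in live_rows}
    shadow = {r[key]: r for r in shadow_rows}
    missing = sorted(set(live) - set(shadow))
    extra = sorted(set(shadow) - set(live))
    mismatched = sorted(k for k in set(live) & set(shadow) if live[k] != shadow[k])
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

diff = compare_stores(
    [{"id": 1, "fee": 5}, {"id": 2, "fee": 7}],                    # "Oracle"
    [{"id": 1, "fee": 5}, {"id": 2, "fee": 8}, {"id": 3, "fee": 9}],  # "Postgres"
)
```

Run periodically during the shadow period, a report like this is exactly what surfaces the dropped messages and the strange small bugs mentioned above.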

13:40 There are some concerns you should be aware of when planning this. First: ID or reference generation. It's very nice if your IDs or references are generated outside of your shadow area; then you can just use them and it will be fine. But sometimes you use IDs generated within the shadow. In our case it was a business screen where compensation corrections would be made.

14:10 They would find the transactions they were looking for and do a compensation based on the ID generated there. Of course, in a distributed environment it's not really feasible to have the same ID generated in both live and shadow, for many reasons. So that's one concern. In our case the workaround was very easy: we just ignored...

14:33 ...in all our comparisons the data generated in this flow, because it was easily distinguishable for functional and audit purposes. But it can play some role. Another concern is the initial state. It's very nice if your application is simple and doesn't depend on state: you just receive some message or event and produce some output. In our case we had initial state, like existing shipments and existing orders...
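One simple way to keep shadow-generated identifiers "easily distinguishable", as described above, is to tag them at generation time. The prefix scheme below is an assumption for illustration, not the actual bol.com convention:

```python
# Sketch of handling ID generation inside a shadow area: tag identifiers
# created in the shadow so comparison and audit tooling can skip them.
import itertools

_counter = itertools.count(1)

def new_transaction_id(shadow=False):
    n = next(_counter)
    return f"SHDW-{n}" if shadow else f"TXN-{n}"

def is_shadow_id(txn_id):
    # lets reconciliation jobs ignore shadow-originated records
    return txn_id.startswith("SHDW-")

live_id = new_transaction_id()
shadow_id = new_transaction_id(shadow=True)
```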

15:04 ...that were in progress when we started processing an incoming message. So before starting the shadow run we had to do an initial load, copying the data, in this case from Oracle to Postgres. In this example we did it with some downtime, playing with the consumers of the incoming messages. A very good side effect of planning a shadow run is that you can do this initial synchronization a couple of times...

15:37 ...and check that it's working properly. In projects like this, what I found very useful are persistent message brokers like Kafka, because then you can decouple the synchronization and listen to historical data, not only from the moment you started to listen. That means you can replay from the moment where you want to start collecting history.
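The replay property of a persistent broker can be shown with a toy model. A plain list stands in for the topic here; with real Kafka you would seek the shadow consumer back to the offset you want to start from.

```python
# Sketch of why a persistent broker helps a shadow run: a shadow consumer
# can start from an earlier offset and replay history, instead of only
# seeing events from the moment it subscribed.

topic = ["evt-0", "evt-1", "evt-2", "evt-3"]  # persisted event log

def consume_from(offset):
    """Replay everything from `offset` onward, then keep consuming."""
    return topic[offset:]

live_view = consume_from(len(topic))  # a fresh consumer sees nothing old
shadow_view = consume_from(0)         # the shadow replays the full history
```

Combined with the initial-state load above, this lets the shadow catch up on everything that happened between the snapshot and the moment it went live.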

16:50 The last example of a shadow run. I love these, but I don't want to spend too much time on them. This one is a little bit more involved. In this case there is an existing functionality; to give a little bit of business feeling, it is a cross-dock functionality. The idea is that a warehouse can get an order for an item that's not yet in the warehouse, and then the warehouse...

17:22 ...just sends the ordered items directly from inbound to outbound, bypassing all the internal stages. On our side we looked at this a little more broadly: basically we were comparing demand and supply, what we have in the warehouse and what is planned to be coming, and if that was not enough, we would automatically order more for new orders. This functionality existed, but in an old monolith.

17:53 What made the transition a little more complicated is that it sits between two domains, logistics and purchasing, so there were multiple teams working on it. We had enough experience with such migrations, so from the very beginning we agreed to do it in shadow. It would look something like this.

18:17 We were building microservices, and when you start with microservices, one is not enough: we created four. They would process all the messages in shadow mode until the very end. Only at the end, where the message that actually sends the order to the supplier would go out, we stop the processing and do the analysis. What was very nice in this approach is not only that we could build the whole functionality on production...

18:53 ...gradually, just by expanding those services and learning how they interact with each other and with the other data suppliers, because you need to collect lots of data, and seeing when we had exceptions, why they were happening, and how we were going to fix them. It was also very good that we could gradually come out of the shadow: we could segment, take part of the incoming stream...

19:22 ...and start sending it as live to the new implementation, while keeping the rest of the stream going the old way. This way we could learn how things work in very small batches. That was also very convenient for us, because the requirements differed per warehouse: for smaller warehouses, like what is essentially a big fridge, you don't have the kind of releases that everybody is waiting for...

19:49 ...so we could learn much more there. Another very nice side effect was the business involvement. Even though the stream that was going live initially was very small, it was much more lively: when something really happens in production, rather than theoretically in some exception report, the learning arrives much faster. When something was not working, we got a much faster response, even though the flow was much smaller.

20:19 When everything was done, we just switched. And here is one important point from my experience: think about how you toggle between shadow and live, especially if you have multiple steps. That's where we usually got problems down the road, because most of the time you will need to make decisions depending on whether you are working with a shadow or a live entity, especially in a microservices environment.

20:50 These decision points can be in multiple services, disconnected from each other. Even if you have one toggle, some piece of configuration, you cannot fully synchronize the switch, whether you do it synchronously or decide by time: say you switch some messages from live to shadow, there will still be messages in transit between the points. A possible solution is to put labels on the messages explicitly.

21:21 So all the messages in the previous example would carry a flag: is this a shadow message or not? Which is nice, but it is of course added complexity and a little bit of room for error; we had a couple of edge cases which we didn't handle very well. But the level of confidence it gave was very much worth it. So that was shadows. There are also other techniques.
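The explicit labelling described above can be sketched like this. The message shape and handler are illustrative assumptions; the point is only that the flag travels with the message, so every service downstream decides consistently, even for messages already in transit when a toggle flips.

```python
# Sketch of labelling messages explicitly instead of relying on one global
# toggle: each message carries a shadow flag across all services.

def publish(payload, shadow):
    return {"payload": payload, "shadow": shadow}  # flag travels with the message

def handle(message, side_effects):
    """Downstream handler: shadow messages never trigger real side effects."""
    result = message["payload"].upper()  # the shared business logic
    if message["shadow"]:
        return {"result": result, "sent_to_supplier": False}
    side_effects.append(message["payload"])  # e.g. actually order from the supplier
    return {"result": result, "sent_to_supplier": True}

sent = []
out = handle(publish("order-1", shadow=True), sent)
```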

22:10 One of them we call the golden order, or the "lucky" customer. You may know the term dogfooding, eating your own dog food. Because we use our website ourselves, we can often be the first canary testers of new functionality. For example, a couple of months ago I was involved in some refactoring of existing functionality, and I was part of the pilot group...

22:40 ...for which the data about our accounts was shown on the website in the new way. We knew we were part of the pilot and whom to contact if we saw something wrong, and it's much less dangerous than exposing the feature to real users. It was even more spectacular when we were starting to use a new warehouse: the leading project manager of that project would place a test order...

23:10 ...assembled specifically for him. Everybody would be standing there with cameras, making promo videos, while this first box goes through the whole huge warehouse and is delivered to this one guy. And when you have enough confidence, you can go to a broader population. A little more formalized technique is whitelisting. It's already on the border of a release strategy: configuration, an explicit roll-out somewhere. In this example I'm...

23:39 ...showing a small part of the supplier configuration screen, which is used by our internal business operations. For some features we are building, you can specify there whether this particular supplier works the old way or the new way. Then you can start with a very small part of the population, with low risk and low volume...

24:09 ...friendly suppliers with whom you can easily communicate, or even test suppliers you create yourself. And when you have enough confidence, you roll out further. Maybe you find something that's not working very well; an edge case is not necessarily a software edge case, it can be some contractual things that were not sorted out properly. Then you pause that supplier or customer...
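The whitelisting technique boils down to a per-supplier flag with a safe default. A minimal sketch, with made-up supplier names and a plain dict standing in for the configuration screen:

```python
# Sketch of whitelisting: a per-supplier flag decides whether the old or
# the new flow is used, so business operations can migrate suppliers one
# by one and roll a problematic one back.

supplier_config = {
    "friendly-supplier": {"new_flow": True},   # low-risk, migrated first
    "big-supplier": {"new_flow": False},       # stays on the old flow for now
}

def process_order(supplier, order):
    if supplier_config.get(supplier, {}).get("new_flow", False):
        return f"new:{order}"   # new implementation
    return f"old:{order}"       # existing implementation, the safe default

a = process_order("friendly-supplier", "o-1")
b = process_order("big-supplier", "o-2")
c = process_order("unknown-supplier", "o-3")  # unknown suppliers stay old
```

Because the flag lives in a business-facing configuration screen rather than in code, the roll-out pace is owned by the people who actually know the suppliers.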

24:35 ...go back, and fix the remaining things. What was very nice about this way of working, from my point of view, is that we can push the responsibility for the migration to the business operations, the people who know all the details about our suppliers much better than we possibly can. The trade-off is that it can take longer...

24:59 ...and while both the new and the old functionality are live, which can be very different from each other, we incur the cost of maintaining both branches. So in this case we would try to help monitor how fast the migration is going. I mentioned test accounts; we don't use test accounts much ourselves. They are mostly used by business operations to check that something is working.

25:28 But I know people who use test accounts even to do automated testing on production; I will share some resources a little later. Another technique which we use in production and find very useful is to add test APIs, extending the normal APIs for visibility and troubleshooting. In this case it is a screenshot of a Swagger interface which exposes an API...

25:57 ...accessible to the business people, the analysts and product owners. In addition to the normal parameters which the normal service would use, it also asks, for example, for the calculation datetime: the moment for which the calculation will be performed, because the result depends very much on the time at which we make it. This way we can do acceptance testing...

26:19 ...directly on production, with realistic data. It doesn't have side effects; it just returns what our expectation is. And later, when the question arises why something happened, it can be one of the tools that helps to narrow down very quickly where things went not as we expected, or where our expectations themselves were wrong. My previous examples, especially the shadows, were about asynchronous communication.
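A dry-run endpoint like the one described can be sketched as a pure function over production data plus an explicit calculation datetime. The tariff table and all names below are invented for illustration:

```python
# Sketch of a dry-run ("preview") API: take the normal inputs plus an
# explicit calculation datetime, compute against real data, persist nothing.
from datetime import datetime

TARIFFS = [  # (valid_from, fee); later entries supersede earlier ones
    (datetime(2019, 1, 1), 1.00),
    (datetime(2019, 6, 1), 1.25),
]

def preview_fee(item, calculation_datetime):
    fee = None
    for valid_from, tariff in TARIFFS:
        if calculation_datetime >= valid_from:
            fee = tariff  # the latest tariff valid at the requested moment
    return {"item": item, "fee": fee, "persisted": False}  # no side effects

before = preview_fee("sku-1", datetime(2019, 3, 1))
after = preview_fee("sku-1", datetime(2019, 7, 1))
```

Because the datetime is a parameter rather than "now", business analysts can both preview future behaviour and reconstruct why a past calculation came out the way it did.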

26:53 Sometimes we have synchronous communication, like asking another service for some data synchronously. So we have our service, our API, which gets called, and we then do a REST call to some other service. What I found useful is to make a shortcut and expose a test API that skips our own processing, with no...

27:19 ...side effects and no entities created: just, how do we interpret the response from the other service? This helps first of all to find the edge cases: when you have this endpoint and deploy to production, you can see whether we can talk to this other service, go through some examples, and check that we interpret them as we expect. And later, when things start to go wrong, or you think they are going wrong...

27:47 ...you can very quickly zoom in directly on this communication, bypassing the logic and the side effects. And the last example for today, my favorite. All the previous examples I was involved in myself; this one is from another team, but it's really cool. There is a service at bol.com just called Pacman, and its responsibility is the picking algorithm.

28:17 What it does: from time to time, every 5 or 10 minutes, the service gets a batch of warehouse orders, and it needs to produce the pick orders for the people who will go and collect the stuff, and in which sequence. It's highly critical, because if it doesn't work, thousands of people simply have nothing to do, and a lot of our customers end up not very satisfied.

28:47 But there is also lots of room for improvement. There is a data science team working on how to optimize those algorithms, and to learn and train the models. These data scientists work in Python, and it takes a while for Python to be converted to a language we like to have in production, which is often Java. So how is the team addressing these challenges?

29:13 They have one central node which receives the batch initially, and then for every algorithm they have, including some experimental ones, there is a separate node in the cluster. They fan the same batch out to all of them and wait some time for a response. Some of these algorithms are very proven and very predictable, but not the most optimal.

29:41 Others are much more optimized, but sometimes they hit edge cases and will not return a response quickly enough, and some are not trusted yet at all. So the coordinator waits until the timeout and returns the best answer it could evaluate within that wait. This way they can learn on production.
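The "survival of the fittest" setup can be modelled deterministically: fan the batch out, drop anything that misses the deadline, and pick the best surviving answer. Latencies, scores, and algorithm names below are simulated assumptions, not the real Pacman internals:

```python
# Sketch of the fan-out-with-deadline pattern: several picking algorithms
# (proven and experimental) get the same batch; the coordinator returns the
# best answer among those that responded in time.

def fan_out(batch, algorithms, deadline):
    """algorithms: name -> (simulated_latency, scoring_fn). Lower cost wins."""
    answers = {}
    for name, (latency, fn) in algorithms.items():
        if latency <= deadline:          # too-slow algorithms are dropped
            answers[name] = fn(batch)
    best = min(answers, key=lambda n: answers[n]["cost"])
    return best, answers

algorithms = {
    "proven": (1.0, lambda b: {"cost": 10.0}),       # always answers, suboptimal
    "experimental": (2.0, lambda b: {"cost": 7.0}),  # better, usually in time
    "research": (9.0, lambda b: {"cost": 5.0}),      # misses the deadline
}
winner, answers = fan_out(["order-1"], algorithms, deadline=5.0)
```

The proven algorithm acts as a floor: even if every experimental node times out, the warehouse still gets a usable (if suboptimal) pick plan.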

30:01 Technically you could learn, especially for these algorithms, just by collecting data from production, but it's much more convenient when the thing is actually running in the environment where you can see how it works. A common question, and the last in this section: how do we do performance testing? Many teams used to have performance tests for all of their services. On the back end we are not really CPU-bound; it's not our main bottleneck.

30:44 So we don't do explicit performance tests, not on staging environments. Instead we look mostly at the metrics, at how things are working, and do some experiments. The most common is to shut down a couple of nodes, if you have multiple nodes, and see how the remaining ones cope. It's not really scientific proof of how the application scales, because rarely does an application scale linearly, but it can give you some indication of how things are working.

31:14 Is it as you expect or not? Another very useful technique for us, because we're on the back end with lots of asynchronous communication, is to pause some queue consumers for a while, wait until the backlog grows, and then release the consumers, maybe even with increased parallelization, to see how they cope with the stress. We also sometimes do a little bit of chaos engineering.
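The backlog experiment can be reasoned about with a toy queue model: build up a backlog while paused, then see how many ticks the resumed consumers need to burn it down. All the rates below are invented for illustration:

```python
# Sketch of the backlog experiment: pause the queue consumers, let the
# backlog grow, then resume (possibly with more parallelism) and measure
# how fast the backlog is drained while traffic keeps arriving.

def run_experiment(incoming_per_tick, drain_per_consumer, consumers, pause_ticks):
    backlog = incoming_per_tick * pause_ticks  # built up while paused
    ticks = 0
    while backlog > 0:                         # consumers resumed
        backlog += incoming_per_tick           # traffic keeps arriving
        backlog -= drain_per_consumer * consumers
        ticks += 1
    return ticks

slow = run_experiment(incoming_per_tick=100, drain_per_consumer=60,
                      consumers=2, pause_ticks=10)
fast = run_experiment(incoming_per_tick=100, drain_per_consumer=60,
                      consumers=4, pause_ticks=10)
```

On production the "measurement" is of course the real consumer-lag metric, not a formula, but the model shows why increased parallelization on resume shortens the recovery so dramatically.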

31:45 You have probably all heard what chaos engineering is, but we don't quite do it with monkeys. What we do is: when we have enough confidence in some flow, we introduce a disruption, like deploying a database change which is backwards-incompatible. Yes, we know how to do it backwards-compatible, but we expect this flow to be resilient, so we deploy. As we expect, a couple of messages will fail...

32:15 A little later the alarms go off, and we see they still work, because things keep changing and maybe someone forgot to update an alert. We verify the alerts fire, and then we do the recovery. And this is controlled: we know and expect that this is going to happen. 32:36 Now, when we embrace testing in production as a first-class testing environment, of course we cannot just go and start testing on production.
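In code, the essence of such a controlled experiment is that the expected failures must land somewhere recoverable instead of being lost. A minimal sketch, where the handler, the message shapes, and the field names are all hypothetical:

```python
def process_batch(messages, handler, retry_queue):
    """Process messages with a handler we have deliberately made incompatible
    with a few of them. Failures are parked on retry_queue for recovery;
    in a real system an alert would fire when that queue grows."""
    done = []
    for msg in messages:
        try:
            done.append(handler(msg))
        except ValueError:
            retry_queue.append(msg)
    return done

def strict_handler(msg):
    # After the backwards-incompatible change, old-schema messages fail.
    if "new_field" not in msg:
        raise ValueError("incompatible message")
    return msg["new_field"]
```

Running a mixed batch leaves the old-schema messages on the retry queue; migrating them and running a second pass is the recovery step the experiment is meant to exercise.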

32:46 We need to invest in observability, so that we actually see when something goes wrong in production. We need to take care of monitoring and resilience, because things will be going wrong, now explicitly; they would be going wrong anyway, but now making things go wrong is also part of our development. 33:07 And a small side note, one that many of my colleagues asked me to repeat whenever I give this presentation.

33:15 Testing on production doesn't mean that you can deliver untested or bad-quality code to production. Your code is supposed to be up to production standards in everything you can think of: security, so you are not delivering insecure code; operations; and I don't want to see code that behaves unpredictably, at least from the functional point of view.

33:41 You can take some shortcuts, like swallowing exceptions and errors in some shadow flows, but usually it's better to build things properly from the beginning, so you can see what's happening. 33:57 The examples I was showing came from teams with different levels of maturity and different types of tools, and not all of them moving very quickly; yet it was still possible to do things like shadow runs.

34:16 What we found useful is to have such toggles, especially if you have a slower deployment cycle, because then you want to react faster to things that are happening. 34:28 In our recent experience, for the last two years we have been doing trunk-based development with continuous delivery, which means our development branches stay very closely synchronized with the main development branch, and that one is continuously synchronized with production.
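A feature toggle in its simplest form is just a runtime flag checked around the shadow path, so an experiment can be switched off without a redeploy. An illustrative sketch, not bol.com's actual tooling; in practice the flags would be fed from a config service or database, and all names below are hypothetical:

```python
class Toggles:
    """Minimal in-memory feature-toggle registry."""
    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def enabled(self, name):
        return self._flags.get(name, False)

    def set(self, name, value):
        self._flags[name] = value

toggles = Toggles({"shadow-run": True})
shadow_calls = []

def handle(order):
    result = order * 2                # the trusted, live computation
    if toggles.enabled("shadow-run"):
        shadow_calls.append(order)    # new implementation runs alongside;
                                      # its result is compared, never returned
    return result
```

Flipping `toggles.set("shadow-run", False)` stops the shadow path immediately, without waiting for the next deployment.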

34:45 We actually have fewer feature toggles now, because we can often just deploy a new change without any toggle at all; most of the time that is fine. 34:55 Microservices: I'm not saying they are strictly required, but where they help is in determining the blast radius. It is much easier to know the effects if you have a smaller service. Yes, theoretically a well-designed monolith with good isolation of internal components could work as well.

35:15 I just haven't seen one, so maybe. Microservices do come in handy in this respect, although, theoretically again, you can have cascading effects if some of the microservices start to misbehave. But as we are not performance or CPU bound, for us that is not a huge problem. 35:34 Dynamic infrastructure helps too, especially the ability to spin up new nodes and new services within minutes.

35:42 All of that without involvement of operations, which helps a lot for building shadow runs, especially when you want to keep them separate from your live components. Not necessary, but helpful. 36:02 And this is the last bit. I think that testing in production is an opportunity; well, at least that's how we experience it. But we were in fact forced into it, because we couldn't get enough value from our shared acceptance and staging environments.

36:25 But now we see that often we can just go directly to production, and that makes our processes lighter, faster and cheaper. That's not to say we should always do it; sometimes a little bit of acceptance testing is warranted, but most of the time, yes. 36:42 In our experience, maintaining staging environments is not worth it, for many reasons. 36:50 And this is the mental picture that we now have in my team and in many teams around us.

36:56 Not all of bol.com, mind you; not everybody decides this way. We spend most of the time in two places: in isolation, building features with test-driven development and exploring the functionality we expect, and in production. We usually skip quickly through the staging environment, go to production and learn there what is happening. 37:21 While preparing this presentation I found two YouTube videos.

37:28 Quite interesting, because one was about an organization that doesn't have any acceptance environments at all; they just go to production. That's a little bit too extreme, but it's good to learn from the extremes, as a comparison, even if I wouldn't go as far as they do. 37:46 The other one is quite short, about seven minutes, but even the title says enough: sufficiently advanced monitoring is indistinguishable from testing.

37:56 And it gets even more interesting if you actually watch it.
