Risky business: taking risks in production - Matthew Hawthorne & Leemay Nassery | #LeadDevLive
My name is Leemay Nassery and I'm an engineering manager at Comcast.

Hello, I'm Matthew Hawthorne, and I am an engineer at Comcast.

This is an axis of risk appetite. I used to work at Netflix, and at one of our quarterly meetings Reed Hastings put this up on the screen. It was a quarter where Netflix was not doing particularly well, and he asked us to think about where we saw Netflix on this axis. The larger point he was making was that more companies fail from being too timid rather than too bold, and most companies fail anyway, so if you're going to fail, you might as well fail being bolder rather than more timid. Anyone that's watching this right now, you're on this axis somewhere, and so I think it's a good thought exercise for all of us to think about where we are on this axis and why we are in that spot.

To me, risk is probability of failure. It's also about severity of failure. And failure is contextual. That's what we're going to talk about today in detail: product failure versus operational failure versus personal failure are very different things with very different repercussions, and so that's the way that we're going to break it down. So Leemay, can you talk to us about the specific example we're going to use for our story of risk?

Yes. So along those lines, in the different contexts in which we'll talk about risk, we'll also talk about our own risk story. What that was at Comcast was that we introduced a highly personalized For You experience into our Comcast X1 on-demand platform. To give a little bit of context, in case you don't have X1 or you don't know what Comcast does: we have a video product in which customers can browse for TV shows and movies that they'd like to watch. As you can see from this lovely screenshot that we took, there's a For You experience at the beginning of the experience, and
that was the result of all of our hard work and the risks that we took to introduce it into our platform. So we're going to talk a little bit about how risk was a big aspect of this change.

All right, so jumping right into product risk: what is product failure? It's creating a product that people do not want. We can't really help you with that. Or it's creating a product that people want but delivering it in an ineffective way. Or it's creating a product that people want and delivering it in an effective way, but less effectively than your competitors.

So how do you know if you're not taking enough product risk? I think this is a really good question, and I think it doesn't apply just to product owners. I'm not even a product owner, I'm an engineering manager, but this is something that engineers who are building products that customers use should think about. You have no unanswered questions, for example: what percentage of users use feature A versus feature B? No new features ever fail. Some A/B tests should fail; we'll talk a little bit about that later. And some features should be killed. It's okay to deprecate features as a product evolves; that means that you're getting better at what you're building for your customers.

So one of the solutions is collection and analysis of usage data. What I mean by usage data is how your customers are engaging with the platform or the product that you've built. A/B testing is a specific approach to usage data analysis. A/B testing is also a strategy for risk mitigation. So not only is it a way to collect data on how your product or your features are being used, it's also a way to mitigate risk.

Yes. So with A/B testing, it's
likely going to slow you down, A/B testing versus not A/B testing, but I think the goal would be that you learn more from each of the features that you're rolling out. I think there are three different ways that you can look at balancing the features that you're launching with what you're learning from those features. You can launch a lot of stuff but not learn anything. I suspect this is what a lot of companies are doing: you
launch things, but you don't have a lot of information about how the launch of a specific feature might change the way that customers interact with your stuff. You can launch nothing but learn a lot. Imagine taking a quarter and saying, we're going to take a break from launching features and we're just going to learn stuff; we're going to focus on getting better metrics and things like that. That might not be feasible. A lot of companies might not like the idea of not launching anything. Or you can take a more balanced approach, which is what I called "launch a little, learn a little", where maybe you say we're going to launch a few fewer features but we're going to learn a lot more from each of them. So it's sort of a best-of-both-worlds approach.

To me, not A/B testing is the riskiest thing you can do. If you're not A/B testing, I don't see how you're deciding what to launch, I don't see how you're evaluating how the things that you're launching are impacting your customers, and whether what you launched was actually a good decision or not.

It's kind of like the responsible thing to do, right?

Yeah, A/B testing, I think so.

Okay. So Leemay, can you talk to us about the specific product risks of this For You launch?

Yeah, definitely. So circling back to this feature that we built, this For You experience, there was a lot of risk in building this experience. Coming from our perspective, we're an engineering team that had built a bunch of Spark jobs, a platform, a web service, a bunch of components to facilitate personalization or recommendations, to improve our content discovery experience. And the next goal was that we wanted it to be used. That's where we asked: how can we get this product, or these features, in front of our customers, to get customers to the content we know they love quicker? So
what that involved was A/B tests, usage data and, again, a lot of risk. From a product perspective, changes were often made but not necessarily measured. Also, making product decisions on a platform that hadn't historically utilized A/B testing to make changes was a risk. And it was a risk because it's a mind shift: you go from pushing out features based on decisions made by individuals to pushing out features based on data.

Right.

Along those lines, we leveraged data from the A/B test to improve the ordering of the personalized algorithmic rows presented. We alluded to this at the beginning when we were saying A/B testing is not only a way to mitigate risk but also a way to collect data. We did exactly that. So not only did we release this wonderful For You experience that's personalized, getting customers to the content that we know they really like, it's also a way that we improved that specific feature, by looking at the engagement within those rows. So Matt, do you want to tell us a little bit about risk from an operational perspective?

Absolutely, I would love to. Operational failure, to me, is system malfunction that results in an unsatisfactory experience for an unacceptable percentage of your users. I think it's either some errors for too many users or it's too many errors for some users, and both of those are things you would like to avoid. In terms of how to know that you're taking enough operational risk, I think one signal that you're not is that nothing ever fails, like zero percent errors for all of your customers. I think that's a bit counterintuitive.
I think it's natural to think, well, no errors is good, right? That's our goal: zero percent errors, ten nines of uptime. But to me, if nothing ever fails and your error rate is that low, that's probably a sign that you're moving too slowly; you're not pushing hard enough. That said, different products have different acceptable error thresholds. We work in the video streaming world, versus something like a bank or an airplane navigation system, where an error means very different things. For us, you click play on a video and it doesn't play: that's unpleasant, but it's not the end of the world. Versus if you're an airplane pilot clicking the descend button, or whatever, and the plane is not descending: that could be the end of your world if you're a person on that plane. So I think it's important to keep in context that the industry we work in has definitely influenced our take on risk.

You probably don't have time to be a hundred percent sure that what you're about to launch is not going to fail. Think about the amount of time it would take you to be a hundred percent sure that what you're going to launch is safe.
That could be the exact amount of time it takes for your company to fail, because all of your competitors are launching more stuff than you. You don't have that amount of time; you have to move faster. And so to me the solution, or one of the solutions, is: move fast and pursue failures, but limit the blast radius of those failures. I'm going to go through a bunch of techniques to achieve this operationally.

I think maybe the most important one is quick rollback: whatever you roll out, you need to be able to roll it back quickly. If you think about having a bunch of QA to verify that what you're about to roll out is going to work, that QA is going to fail sometimes anyway. You're not going to catch everything, in which case you need to roll back. So why not instead invest that time into being able to roll back quickly? You can even do less QA: if you can roll back quickly, maybe you don't need as much QA.

Gradual deployments: if you're in multiple data centers, you do one at a time, just to make sure that your deployments are not an all-or-nothing affair, so that you can roll out gradually, look at your metrics and gauge how things are going.

Canaries are maybe a more specific type of gradual rollout, where you can imagine rolling your new stuff out to a single node. So if you have a 100-node cluster, you put your new stuff on one node that's going to take 1% of traffic. You can also have something I'm calling a sessionized canary; I know at Netflix we called it a sticky canary. The concept here is that instead of 1% of traffic going to the new stuff, it would be 1% of your users, and the idea is that it'll allow you to catch a different class of bug, one that might be more session-based or more nuanced. You can think of an A/B test as kind of like a persistent
sessionized canary, where some subset of your users are going to this new code, new feature, whatever, all of the time.

Anomaly and outlier detection: whether it's servers or regions or users or whatever attribute you're looking at, if you have one server that goes bad, you want that information to be pushed to you; you don't want to have to go and hunt for it. If you have a certain user, or a class of users, maybe users that are in a certain A/B test, and latency
is higher, you want that information to be pushed to you so you can see that maybe something is wrong here. Or maybe things are just slower in one Amazon region, for example latency is higher talking to S3 or something like that; you'd like that information to be pushed to you as opposed to having to go and hunt for it.

Chaos engineering, which I see as proactive failure simulation: the idea here is that you want to proactively identify the error situations that you expect to see in prod, or that you could see, and then simulate those things. Some companies might call those war games or something like that.

Circuit breakers and fallbacks are something that I think is very much in line with chaos engineering. Think about your system boundaries, like one system making a network call to another system, where your network latency can increase or maybe your connection pools can be overloaded. You want to wrap that in a circuit so it can automatically trip if your latency gets too high or if your error rate gets too high, and therefore you can fall back to a degraded experience, which is better than an error. In the personalization context, you said it first: popularity. Yeah, you fall back to popularity, or you just fall back to some default list of content. Something is better than nothing, right?

And I think dynamic configuration is in line with this, in that it means being able to modify the behavior of your system at runtime as opposed to needing a whole new deployment, for example, tripping a circuit at runtime.
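
The circuit breaker with a fallback described above can be sketched roughly as follows. This is a minimal illustration and not the speakers' actual implementation: the class name `CircuitBreaker`, the functions `personalized_rows` and `popular_rows`, and the thresholds are all hypothetical. It also simplifies the trip condition to a count of consecutive errors, where the talk mentions latency and error rate.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, then serves a fallback."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def trip(self) -> None:
        """Open the circuit by hand, e.g. from a runtime config switch during an incident."""
        self.opened_at = self.clock()

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Cool-down elapsed: close the circuit and let calls through again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary: Callable[[], List[str]],
             fallback: Callable[[], List[str]]) -> List[str]:
        if self.is_open():
            return fallback()  # degraded experience instead of an error
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.trip()
            return fallback()
        self.failures = 0
        return result


def personalized_rows() -> List[str]:
    # Stand-in for a network call to a recommendation service that is failing.
    raise TimeoutError("recommendation service too slow")


def popular_rows() -> List[str]:
    # Stand-in for the popularity fallback mentioned in the talk.
    return ["Popular Movies", "Popular TV"]


breaker = CircuitBreaker(max_failures=2)
rows = [breaker.call(personalized_rows, popular_rows) for _ in range(3)]
```

In this sketch the first two calls fail and fall back, the second failure opens the circuit, and the third call skips the failing service entirely. The `trip` method stands in for the manual, runtime trip that comes up next.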
In the best case a circuit trips automatically, but I definitely have seen situations during an incident where you have to go in and trip that thing manually. You want to be able to do that without having to do a deployment to accomplish it.

And with the For You experience that we shipped into our content discovery platform, feature flags were a big aspect of our deployments, right? We would just push changes and enable them on one box, just because we didn't have time to QA it, right?

Absolutely. And I think scoped configuration fits in with this too, in that maybe you want to change that config in just a single AWS region or availability zone, or just on a single server, to test it out first. I've definitely seen this during production incidents, where it's: let's try tripping the circuit just in one availability zone and see if that improves things, and if it does, then we're more confident rolling that out globally. Sometimes you try something during an incident and it actually makes things worse, so this is a way of mitigating the risk of resolving incidents.

So let's get a little more specific about the way we dealt with operational risks within this For You launch example. In this situation we had to launch quickly. We were concerned about some stakeholders maybe changing their minds and saying, wait, don't launch this. We needed to get it out, and we had limited metrics. We had what we had; you go to war with the army that you have. We did not have time to add a bunch of new metrics to see the operational aspects of launching this. So we time-boxed our capacity planning. We said: think about if you only had one week and you had to launch something. What would you look at? You don't have infinite time to plan for this; you have to do it quickly.
What's an aggressive but sane rollout strategy? We want to roll it out quickly, but we'd prefer not to cause errors. OK, if we see an error we can roll it back, it's not the end of the world, but we'd prefer not to see that. And so for this example we doubled our allocation every week for six weeks or something like that.

So the first question we asked was: do we have enough servers? How do we know? I think we looked at our CPU graphs and it looked reasonable enough, so we felt like we were okay; we'd be able to handle the increase in requests. But let's say we miscalculated and we actually do need more servers. What's the process for getting more servers? Is it just scaling up in AWS? Do we need to provision new hardware in our data center? So we were at least ready to do that if we needed to.

How close are we to capacity in our data storage systems, which for us was Couchbase at the time? Our HTTP connection pools: how close are they to our configured limit? We don't want to hit that limit and serve errors. Also, the overall capacity of the downstream systems that we were going to increase requests to as a part of this launch: do we think they can handle this load? Should we talk to them about it? If we talk to them about it, are they going to say, wait, don't launch this thing? You have to gauge these things. Again, it's risk.

In the end, everything worked. We thought about it, we time-boxed our capacity planning, we did all these things I mentioned, and we did not have to roll back for technical reasons. That aspect of it worked very smoothly.

Did we get lucky? Not sure. Yeah, I don't know; I don't look at it that way. I think there's a little bit of luck involved. It was a lot of hard work, and if we were a little lucky, who cares? It doesn't matter.

So, Leemay, can you talk to us about personal risk?

Yes. All right, so what is personal failure?
Not learning anything. That really sucks; no one wants to not learn anything. It's no novel experiences. That also sucks; I don't want to be in that state where I don't have any novel experiences to talk about. No new knowledge or skills. This is something that I have nightmares about. I don't want to ever get in the state where I am not learning. It's an uncomfortable feeling; it's just the same thing over and over again.

So what's the solution? Do stuff, ask questions, pursue failure. Be like this guy, or something like that. Yeah, I don't know.

So don't work on the same thing for five-plus years unless you really want to. The story I have in my head is that I kind of created a rotational program while I was at Comcast, so that I could experience different projects, still under the content discovery umbrella. I didn't really leave the org till later in my career, but it was a way for me to gain different experiences and keep learning. But if you're happy in your team and you've been there for more than five years and you feel like you're learning, then that's great too.

So being a domain expert is comfortable, but are you growing? I think this is something that lead engineers, or managers that were originally lead engineers, often feel. I felt this when I was in the content discovery world and on the video product. I knew our API, I knew our domain, I knew everything about the platform, all the different web services that were involved to make it all work. I knew it so well. But at some point I was like: am I growing? Am I learning new stuff? Am I going into meetings not having to prepare? Are we talking about things we talked about two years ago that we're still trying to solve? How does that feel, and what do you do?

Yeah, there's that quote: you
should always want to be the worst player in your band, right?

Right.

And I think that's relevant here, in that if you always feel like you're the smartest person in the room, or maybe you have the most domain knowledge, that's a comfortable feeling. Is that a feeling that you want?

And then there's the whole thing where you're a domain expert, but do you have skills that you could apply outside of your company? You might know your API or your platform really well, but how does that translate to actual
experiences outside of that, right? All right, Matt, so can you tell us a little bit about the personal consequences of pursuing risk?

Yes, and I've been this guy getting walked out. It's unpleasant, but maybe it's worth the risk. So you just talked about the high-level approach to personal risk in your career, but I think this is more addressing: if you follow some of these approaches we've talked about for product risk and operational risk, what are the potential repercussions of those within your given organization? You can imagine that if you start asking a bunch of product questions, or advocating for A/B testing, or advocating for data collection, you might start catching heat from product or business colleagues who don't like justifying their decisions with data. They've just been launching stuff for years, not having to answer questions about it, not having to provide data on how things are better now, after launching it, than they were before. They're not necessarily going to like you asking questions. You might start getting uninvited from meetings. You might not be a part of those product conversations anymore. Maybe you're okay with that, maybe you're not; it's something that you certainly want to consider.

Then there's the case where following some of the operational stuff we've talked about causes outages, in a culture where the goal is zero outages and they think that's actually a realistic goal, or they don't understand the nuances of incidents: that all incidents are not created equal, and that you can't have zero incidents and also launch stuff in a reasonable amount of time. The goal, I think, should be to learn things from incidents, and you should be getting better at handling incidents. The way I've seen it put is that your incidents should be getting weirder over time, requiring a mixture of events
that cause very bizarre things to happen. That's actually what you want: you want to get to a point where you aren't having simple outages anymore. But again, if the goal from executives or something like that is no outages, they're not going to like you if you're advocating things that are causing more outages. So it's something that you're going to need to be conscious
of. So Leemay, can you talk about some of the personal risks you saw as a part of that For You launch?

Yeah, definitely. One of the reasons why I love this story of launching For You is that not only was it a risk from an engineering perspective, since this was the first time we were really trying to push our services and our platforms to the limit, but there was also the product perspective, which we've already touched upon. Adding all that together, there is a huge personal risk in making these changes, and it's exactly what Matt was alluding to earlier.

So with the A/B tests, what we introduced was the age-old battle of editorial curation versus algorithms. The screenshots that we've been showing: historically, if you take For You out of the picture, it was a highly curated editorial experience, and then introducing algorithms into the mix is changing it up. How do you make that change on a platform that historically didn't measure changes to the full extent, or didn't really use A/B testing? You're changing the way we're doing things, and changing the way we do things is hard, especially when there are a lot of stakeholders involved, each with different agendas. So you may improve the customer experience, but at what personal cost?

All right, so to conclude what I was just trying to say: you could become a catalyst for organizational transformation, or you could become a person with many enemies, or you could become both. And again, why I really love this For You launch story is that I grew in so many different ways, the team grew, and so did the platform, even just in the number of personalized requests we were receiving. There's a lot of growth that happened in this experience. And especially from the career, the personal, perspective,
you learn how data isn't enough sometimes. You have to tell a different story. You have to figure out what drives this person to do what they do, and what KPIs they are looking at that we're not looking at. There was a lot of risk in all that we did. And the choice is yours for what type of engineer, manager, director or whatever you want to be, and how much risk you want to take.

So we'll do some closing thoughts here to wrap things up. Let's go back to this risk axis that we started with. I think the overall point here is that you should choose your location on this axis deliberately. Being cautious is fine in lots of contexts. Like we talked about, if you're an engineer at a bank, I think you have to be cautious. And being bold means different things in different contexts. I think the idea here is: think about where you are, think about where you want to be, and what you need to do to get there within your organization. We talked about a lot of different techniques here: on the product side, data collection, A/B testing, asking questions; on the operational side, canaries and gradual rollouts, circuit breakers. But you have to scale this stuff to your organization, because all organizations are different.

So in terms of what you can do: maybe A/B testing is too much. In that case you can focus on collecting better usage data. If collecting better data is too challenging, or people just aren't into it, you can try asking some questions. Try asking: why do you think it's important that we build this feature versus another one? Again, there might be repercussions of asking these questions; people might not like you asking them, but it's something you can do.
Moving more into the operational space: if random chaos is too much, you can do scheduled chaos. In my time at Netflix, the larger chaos exercises were still scheduled; we would do them once a month, I think. At a company like Comcast we do war games monthly or quarterly, I forget. They're certainly scheduled; you know, at least within a rough date range, when it's going to happen.

If canaries or dynamic config aren't getting traction in your org for whatever reason, you can simply ask the question: well, how do we get stuff into prod faster? What else can we do if not canaries? What else can we do if not dynamic configs? The fundamental thing we're trying to get to is moving faster. Maybe in your org there are different paths to achieving these things.

And if you want to take more personal risks, it can be as simple as asking yourself: what do I want to build? What's interesting to me? What specific area am I most drawn to? What type of engineer do I want to be, or manager? How do I want to see myself, or want other people to see me? I think if you start asking yourself these questions the answers will become more clear. And we believe in you. You can do it; you can figure it out.

Thank you.
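
As a companion to the operational section of the talk, here is a minimal sketch of two ideas it mentions: a sessionized ("sticky") canary that routes a fixed percentage of users, rather than traffic, to the new code, and an allocation that doubles every week. The function names, the salt, and the 1% starting allocation are assumptions made for illustration; the talk does not say how the rollout was implemented or what percentage it started from.

```python
import hashlib
from typing import List


def user_bucket(user_id: str, salt: str = "for-you-rollout") -> int:
    """Map a user to a stable bucket in [0, 10000) by hashing the id (hypothetical scheme)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_rollout(user_id: str, percent: float) -> bool:
    """True if this user falls inside the current allocation percentage."""
    return user_bucket(user_id) < percent * 100


def doubling_schedule(start_percent: float, weeks: int) -> List[float]:
    """Allocation per week, doubling each week and capped at 100%."""
    return [min(100.0, start_percent * 2 ** week) for week in range(weeks)]


# Assumed starting point of 1% of users, doubled weekly for six weeks.
schedule = doubling_schedule(1.0, 6)
```

Because the bucket depends only on the user id and the salt, a given user gets the same answer on every request, which is what makes the canary sticky. A user who is inside the rollout at one percentage also stays inside it at every larger percentage, so doubling the allocation only ever adds users.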