Vizier: Black-box optimization and AutoML - Pittsburgh ML Summit ‘19

Vizier: Black-box optimization and AutoML - Pittsburgh ML Summit ‘19

Show Video

Yeah. I'm based here in Pittsburgh I lead the brain, team presence, and in Pittsburgh I'll be talking a little bit about this. Year which is a project on blackbox optimization, you know and Otto mal and. Be. Talking about the work of the the whole team here so it's not just me all. Right. So what is blackbox, optimization, so. It's. Called blackbox optimization, because you presume that you have some kind of function, right. Of some, known, parameters or, knobs and this and this and you can't really appear inside this system so, you. Essentially you want to find the best settings of these knobs but, the only way you can actually figure out how good they are is to, actually set, the knobs appropriately, and run some kind of experiment, or process, which is considered quite expensive then, you run this process you get back some kind of objective value. Like some performance, measure right. And then, you. Just want to you want try to find the best settings, of these knobs but. The system is relatively opaque right, and so you have a relatively small budget of evaluations, you can think of this as maybe only a few dozen or maybe a few thousand, but not like millions or billions typically. And, you want to try to do the best job that you can and this becomes. Quite prevalent in the real world because as systems growing more and more complex, often. Times even if you actually can, model them like their systems. Created an encode, for example and you can actually inspect how they work they can still be quite difficult to understand, and then there are other kinds of systems where it's, just they're, just too complex and this is essentially the only way you can interact with them so. In, particular we don't assume much about this function it's not necessarily coming into concave, or, smooth. Or anything like that and we want to it, might have many many local optimal and I want to kind of avoid those and go for the global optimal, so. That is the kind. Of framework for black box optimization. Because. We make so few assumptions about the objective function we wish to optimize there actually a wide, range of applications, that fit into this framework. So here, are four, application, domains that, I've highlighted here and much. Of the work we do is around, some over on these so. For example you, can imagine you parameterizing. Some, UI or UX and actually, do automated a be testing, on websites write various people will, when, they're designing websites say, they want to change, some aspect of it do some a be experiment, measure, the user reaction, in some way and try to figure out which design is better right, and try to find and, you might want to try to find the best design within some space and you can use a blackbox optimization, to do that another, kind, of application, is in the machine learning space. Many. Machine learning models they're, virtually, all of them have these hyper parameters, which are parameters that can, strongly, affect the performance of your model but you cannot learn them directly during training, you have to sort of set them in advance and this, is another example where the, black box here is you set these parameters you run training, you evaluate, your data you, evaluate your model on some test set for example and then you want to try to get the best performance that, can be formulated as a black box optimization, problem. Auto. Ml as well like architecture, selection, for a deep learning model, that is also an example of a black box optimization, here. There. Are other problems where you have just physical design problems, where, you want to optimize some process, or physical, design and. Perhaps. You trying to optimize it in simulation, or in reality I'll. Be talking a little bit about one where we. Are interested in optimizing chocolate. Chip cookies for taste, right and that's one where. You. Know you you can't even know you're yourself, what makes the chocolate chip cookie good for you right like you don't even have a great model of, if. I gave you you know two chocolate chip cookie recipes, if I were to bake these and feed, them to you which one would you prefer it's, actually very difficult for you to evaluate, that without just baking them and eating them with any sort of precision so, this is an example where you, know, you just cannot open up the box right. And. There are other applications in, the robotics for optimizing or about a control policies, for example to. Say like how the robot moves in order to do.

Things Like make the robot move efficiently and minimize jerkiness, so you don't damage the robot and so on so. Those are some example domains. So. A while back we set out to build a surface for blackbox optimization, give. It this name does. It's, just a royal advisor and. It sort of recommends. What to do for you and the design goals here we wanted a system that would be really easy to use for, clients, because. The easier it is to use more people more people will use it and. It's not really use useful, to build something and nobody uses we, want it to be reliable because again if, it's not reliable it's not highly available people won't depend. On it and they will want to use it we, want it to be scalable for, the same reason we, wanted it to be able to support state-of-the-art algorithms and, allow us to investigate, lots. Of new algorithms, to be sort of flexible, and support research a wide. Range of applications, so, these are the five main design goals we set out some of these are kind of in tension with each, other so if. You build systems, you, might realize that having something that's really easy to use and that's really flexible, these, things are often in tension, right. Similarly, something. That supports state-of-the-art algorithms, but is also really reliable, and scalable. These. Can be in tension right if you're if you're constantly, modifying, the system or constantly tweaking it. But. I'll. Tell you a little bit about kind of how we went to going, about building this and. What. It actually means in terms of each of these design goals so, ease of use we, built it with minimal configuration, in mind you just specify a, study name what parameters, you want to optimize, various. You. Know access control lists if if need be if you want to keep. Your data private. And. The client workflow is very simple you just register, a register, a study with some parameters, and then you can just repeatedly go through this workflow or you ask for a trial with parameters, to evaluate it run. It which is, application. Dependent, notion, of of evaluating, this set of parameters and reporting, back in the new report back results we, have clients in Python, C++ golang.

And Java, and a web front-end, and an integration, with colaborate, Ori which if you haven't used it is very similar but you put our notebooks so. It's. Really easy to use. For. Scalability we built it as a, micro, service architecture here evaluations. Evaluation. Workers these, are the ones that are evaluating these trials they asked for from the zero API, which. Talks to some database, and actually looks up any previously generated trials, where if new ones need to be created, it talks to the suggestion, service which is backed by replicated, suggestion, workers the, point is that the design is scalable, to thousands. Of parallel, evaluations. In a given study and as you can be running thousands of trials, simultaneously. You. Can have up to millions, of trials. Per study and essentially. Unlimited studies. Right. We wanted to be able to support kind. Of all of the hyper parameter tuning at Google, and. Really, ultimately like, the global hyper, parameter, tuning workload and. This. Current design actually can support that. We. Want it to be very reliable so. We use kind of the standard practices around, testing. And monitoring alerting. Release, processes, and so on but. Kind, of one interesting. Design. Aspect, to the system that I would want to highlight is I think it's it's it's quite interesting is, a lot, of these so, for, Visio we did not require that generating. New trial suggestions, be really fast because, the idea is you want to be able to spend the the trials, are expensive to run so, you're willing to spend a little bit of time to compute. Very carefully, what you should do next and. Because of this we, actually represent, the work as a. Token, in the database itself right, like a request for a trial and when that's fulfilled, and what, that allows you to do is you. Can actually allow, the system to introspect, and say, it's. Tried to generate a trial for, this, particular, study you know five, times and it's failed five times in a row because. Maybe there's a bug, in one of the algorithms, that's servicing, this study or you. Know, some study, has used up a whole bunch of quota we should throttle, it and representing.

The Work in the database in this way actually allows you to do this in a straightforward manner so, you get this nice introspection, for free and you can automatically. Kind of shut down or throttle, or shut or disable, problematic. Studies. And, this. Has helped us quite a bit in. The internal instance so we use a lot for research, purposes. As. You can imagine the internal, research scientists, really. Though. They'll heavily, leverage a tool like this and really press the limits. So. We wanted to have state-of-the-art algorithms all right then the suggestion, workers are modular we can support many different kinds of algorithms so, things like Gaussian, process bandits and Bayesian optimization. Sequential. Model based algorithm. Configuration, that's called smack. Evolutionary. Strategies, like CMAs. Older. Classics, like simulated annealing genetic, algorithms, and. Simple things like random search and grid search and and other things you, might ask well okay if you tried out all these algorithms which are the best ones this. Is kind of a natural question to ask. It's. Kind of interesting there's an interesting challenge in evaluating, algorithms, for blackbox optimization, right. There they're a few issues here, the. Cheapest. Thing you can do is come up with some analytic, benchmarks, and evaluate. Your algorithms against those there, are academic competitions, where people actually do this and various. People come. Up with challenging, functions, these are functions that are cheap to evaluate, but actually have. Like various, kinds of properties, but. The thing I and. These are like eight ones. That I fit on this slide for example that. Have different kinds of characteristics, in terms of somewhere, well-conditioned some are ill conditions some of lots of local optima, some how few and so on, but. I do want to highlight here that there really is no perfect set of benchmarks you. Know. If, any of you are familiar with these no free lunch theorems, there's, like these classic results, that say you can't have like the optimal, search algorithm, for certain kinds of search problems and. Kind. Of a corollary, of that is there is no optimal. Algorithm. For a blackbox optimization, it really depends on on what kinds, of problems you're throwing the system at right, so there might be an optimal algorithm for a particular, distribution, of problem instances, but. In the real world you don't get that right, like the, kinds of problems that people want to solve can change over time so.

Nevertheless. We. Still can. Use these, algorithms to, kind of cheaply evaluate, things and then we can do, more expensive evaluations, on real data. Sets and real kind of use cases for users, can, tell you a bit more about a couple of these so, just to give a closer look, this. Function on the left is called frozen but Rox Rosen, Brock's banana, function, it's called this because if. You look at the level sets towards the optimum, it looks kind of like a banana that's like this blue region it's. Got an interesting property that, it's. Kind of got this valley at the bottom that curves around so let's say you're trying to minimize this function, it's. Also has a nice property for. The our purposes here is that if you just run regular. Or gradient descent on this function it actually takes exponential, time, to, reach the optimum so, the the gradient descent doesn't, do well kind of going around this nice like, curved valley. But. Nevertheless, it's. It's. Got this like one, nicely, connected, these connected sort of level sets right it doesn't have like lots and lots of local optima so there's the rest region function here on the right it. Does. Have lots and lots of local optima right, so these are like two example. Kinds of functions that you might try to evaluate your blackbox optimizer on. Okay. So. This. Slide the, details don't matter so much the point to notice is that we run we run some benchmarks and like some four dimensional problems and some 32 dimensional problems with a few different algorithms, and like the ones that are best in four dimensions are different than the ones that are best in 32 dimensions, right, that's, kind of the key point here and. This. Is not, just. Some property of these particular, algorithms, but it's a fairly general phenomenon. So. Another thing of isère does is actually try to automatically, detect, characteristics. Of your problem and then select the best algorithm, for those problem, characteristics. Okay. So. I wanted to just give you a bit of background to, get an idea about what kinds of algorithms are used here so I'll just tell you a bit more about two different approaches. Approach. Number one is Bayesian optimization, this is a very popular approach it's. In many, ways the the best approach that we know of in. A fairly, broad range of problem, domains not all of them but many of them in. This, cartoon we imagine that that we have this one-dimensional, set.

Of One dimensional parameter space a one parameter optimize and the. Objective function is plotted in the y axis here we have some data points in green. The. Bayesian optimization, framework. Says. That you should try to model the objective function ok so. Maybe. This, is the objective function, right. But you don't actually know that it's. Hidden from you and not, only do you model the objective function but you model your uncertainty, in the objective function so you try to come up with some sort of a region that, you believe that the objective. Function lives, in in this case shaded in blue here and, then, once you do that you. Have an explore, exploit trade-off where. You can try to select points that, you believe are good with high probability but. Are maybe not great and other points that could be fantastic or, could be terrible but. You have to sort of just try them to see right, and there's sort of a trade-off here and. These. Are sort of fundamental you, can see this when you like choose, a restaurant to go to do. You go to like your favorite restaurant or, do you go to the new one that just opened up right like. So. Exploiting. Would be going to the rescue the old restaurant, that your is your your. Favorite one that's been open for a long time and exploring, would be trying the new restaurant. You. Know because maybe you'll get a new favorite restaurant if, the new one is really good anyway. Here one. Standard, strategy would be to find a maximum, point in this upper envelope right. The star here and say well try the parameter, corresponding to that next right. So. Maybe you try that and you actually get this orange point because it's it's not so great right it's on this curve and then, you would just kind of update this picture all right and then proceed. We, use these Gaussian process, models in. Order to to. Model the functions I mean, they're there various other choices some people use, Bayesian. Deep learning and some people use. Kind. Of gradient boosted decision trees and other things but, the Gaussian, processes are kind of state-of-the-art for small, datasets. If, you don't know what these are they're, actually quite, interesting. Mathematically. Essentially. It's a generalization, of a multi-dimensional Gaussian. Where. If you imagine a multi-dimensional Gaussian so, let's say you want to model, a function from a finite domain like the integers 1 to N 2 the real numbers well, then you can imagine. Identifying. A value, of this function with a vector, of n, dimensions, where the coefficients, of that vector are just F evaluated. One value at a two and so on right. So the ice coordinate, would be F of I, now. You can imagine a modeling a distribution over such functions, with a multivariate, Gaussian right. And. Then, Gaussian. Processes, just generalize, this to continuous, domains like that's what they are. So. There. Are pros and cons the pros are, that these things are extremely flexible. They. Have analytic. Solutions, for posterior inference, which is actually, pretty impressive. Because. It allows you to reason about these sort of infinite dimensional, objects, using finite amounts of computation. And. They're really elegant, and there are men a bolt some. Nice theoretical guarantees, the. Cons are that is that in the in naive, implementation they're. Extremely, expensive computationally. Something like n cubes or e of n data points. The. Nice thing about blackbox optimization, is you assume trials are expensive so you don't actually have that many of them so, this actually becomes feasible. Okay. So that's one approach another, kind of approach would be you just have some kind of simple probabilistic algorithm, right, some authors. Have argued that just like random search works pretty well. I'm. Not sure I agree with that but like here's a strategy, that's slightly smarter than just random search where. You essentially, you pick new points using one of three strategies one is you do some uniform random sampling another, one is you take the best point you've seen so far sample.

Some Ball around it and and pick a point in that ball uniformly, at random this. Is like a sort. Of a gradient free hill-climbing, another, one is this linear combination sampling. Where you create a new point via a linear combination of, some older points. So, why would you do this well. You, know let's say you're optimizing this function here the contour plots, the. Randomness allows you to sort of explore new areas and, make sure you don't get stuck in some local optima. The. The ball sampling strategy, allows, you to. Kind, of do this probabilistic. Hill climbing so it works really well on on well if well-conditioned. Objective, functions, right. And. It'll. Allow you to kind of walk fairly efficiently to the nearest local optima and. The. Linear combination, one or you might have like these two red points and you sample this blue point as the new one the new child point, that's. Nice, for cases, where you have really ill conditioned, objectives. So. The conditioning, here is is, kind, of how. Much the function varies as you go in different directions so. A well-conditioned, function if you will, change, about the same speed no matter which direction you walk in whereas. An ill-conditioned, one if you walk in one direction it might look totally flat and if you walk in another direction it looks really steep. All. Right a linear. Combination of sampling strategy, here is completely, immune to old condition functions actually. So. You combine those three and you get something that behaves, quite well actually. And. There, a bunch of other algorithms as well. So. In. Things like machine learning. Like. When you're just when you're optimizing machine, learning models you also have access to additional information it the black box assumption, isn't completely, true typically. As you're training on machine learning model you get information about how it's doing as you, as you train it for more more epochs so, you can look at these these, curves of white performance versus, time or accuracy versus time and that. Allows you to do things like automated, early stopping where. If. These are these training curves and you see you. Have a bunch of training curves that have completed these. Solid lines here and you, see like this green, dashed. Line where, you see this model you can basically infer early that this model kind of stinks and you shouldn't bother finishing it right. Whereas if you see this orange one that's sort of trending, up you, say well it's doing worse than the other models were at that time but it's it's got this weird trajectory, that's very different than the other ones so it might actually have some promise let's let it finish alright, so we have algorithms that will try to detect this and automatically, decide which models to stop early to, kind of save resources. Another. Type, of algorithm, here is you can do transfer learning where. If you run a study once and then you're gonna run a similar study in the future you, can kind of use the data from your your previous, study to accelerate, your future study this, actually happens a lot in case it's like machine learning where people, have, a model, and then. They. Want, to kind, of add a feature or maybe. Add a few features and then rerun the model well, it turns out that the performance, for the respective hyper parameters will be similar and so, we use transfer learning that kind of zoom in on the good regions, that. Were from prior studies. And. One. Approach this is to just kind of find, good, prior data and use that as. And. Just look at model the residuals, on top of that prior data.

All. Right so but there's still room for improvement so. We, did add what we call this algorithm playground, where, we let. It's. One of other engineers, essentially. In research, scientists, develop their own algorithms and we can plug this into this playground binary and kind, of substitute, Beziers production. Suggestion. Workers with this kind of binary so that instead of our code suggesting, what trials to do next this, kind of research research, code is doing so and, that allows us to. Experiment very rapidly whereas, from the rest of the workflow looks identical. From. The evaluation, workers don't have to worry about what algorithm, is generating the trials, okay. So we built this great system and we're thinking how do we get people to know about it and use it we're. Doing pretty good job internally, with the hyper parameter, tuning but we wanted to highlight some of these other use cases so. We decided. You. Know they go back with our roots we're in Google Pittsburgh this building, used to be a Nabisco cookie factory so. We figured why don't we optimize. Cookies. So. We parameterize the, space of cookies right, like what kind of chocolate to use how much chocolate how much sugar, white. Sugar brown sugar how, much butter how much flour eggs, vanilla, baking, soda. And then if you set these parameters you get essentially a recipe, you can bake that recipe and then feed it to people and have them evaluate, their your cookies and feed those scores back to vizier and iterate right, so, in, the, third floor of this building there's, a cafe that actually has a teaching kitchen with something like five baking, stations, so we took over that and started baking. Whatever. Vizier told the students, so. The, cool thing about this is that you start with very simple ingredients you can get extremely, diverse cookies right. So. And anyway. So we sort, of went to town here there. Were some surprises so, like I, don't, know if you can see that so well but this is a cookie we're sort, of isère went like, maximize chocolate, basically like build, a chocolate, bar with a little bit of cookie dough on it right, that.

Was Actually pretty well rated but, you know not exactly what we were we were looking for so but but you can constrain the parameter ranges. Here's. Another one, science. Does require. Sacrifices. At times so, these cookies this is what these they look like when we scoop out the dough and put it on the baking pan right we used a kind. Of like an ice cream scoop but a little bit smaller all. Right and then I took these out of the oven like. I personally took these out of the oven and they looked like this and I thought I had made a mistake like maybe like, they I thought, they had been cooking for 13 minutes but like they look like completely, raw. But. It turned out they had been cooked and we did feed them to people they, were not very good but, we did learn from this that like, this is what happens if you wait too little butter and you. Know. The. The Vizier, got better so, and in the end the results were definitely worth it. We. We used at. First we used Chips Ahoy as a baseline when, taste, testing because it's, a pretty easy baseline to beat but pretty soon we had to switch to using the classic Betty Crocker recipe which, is not such an easy baseline, to beat but, we were pretty reliably beating that after, about 40, 40. Trials for two different cookie recipes. All. Right so, in. Terms of the recipes you, know we have the optimal recipe here there's a paper workshop, paper a Bayesian optimization, for a better dessert you can look it up. Some, you, know high-level, lessons learned at, least when, you're feeding people. Who work in this office your, cookies should have less sugar and less butter than, the recommend like typical recipes and more, chocolate more salt and a little bit of cayenne pepper. That. Was a lot of fun it was a good test of the system because, it tested a noisy evaluation, we. Had a. Trial. Infeasible, trials, so like sometimes the dough came out was just too crumbly you couldn't make cookies out of it and vizier does support infeasible trials. Configuration. Changes, to unlock ingredients. Manual. Overrides. Because. I might be human error in baking that we catch later, transfer. Learning when we adjusted the baking site so we did the first. Very. First sort. Of phase of this study here in Pittsburgh and then when we had enough data we actually scaled it up in Mountain View so we we took over the dessert commissaries, there and started, feeding experimental. Cookies like 20,000. Co-workers so. That was great. Yeah. Come. Mm-hmm. And. Yes. We didn't kill any of them right yes. Nice. Thing about Google you can experiment on your co-workers and, you don't need to even file an IRB. Yeah. So there. Was transfer. Learning involved, and, it, was a good test of our baking skills. Yeah. You, can actually use Vizier right now it backs. The hyper parameter, tuning capabilities, and the googly, eye platform, it also backs many, of the auto ml, capabilities. I guess, I'm running. Out of I have one minute left so I didn't want to give people opportunity, for questions. Thank. You.

2019-11-29 21:46

Show Video

Other news