Business Analytics Class-Spring 2018 at CU Boulder
OK, everybody, let's get started. There's a quick little change going on. I went and looked at how the camcorder they set up in the back was working, and I wasn't super happy about its quality. It was impossible to see some slides, and it wasn't super easy to hear, so I'm hoping some of my own equipment will make things a little bit better.

As we so often do at the start of class, let's get into Canvas to review a few administrative things. The first link that we've got up here is to Slack. After that is a YouTube playlist. I moved all of the Alteryx tool videos there, and I'm working on getting other things there. Hopefully classes will be streamed to that and then added automatically at the end. Who knows.

The next couple of things are something that, based on the questions I've been getting and some of the comments, I thought was pretty important. I have added a form where you can say, hey, I found a difficult term and I don't know what it means. You say where you found it, then anything else you have to add, and click Submit. Once you've done that, it will wind up on this Google Sheet for responses, the second tab of which is where classmates can fill out definitions, and you guys can review: oh hey, a bunch of us have the same term, or anything like that. It's a great way to earn some of those out-of-class participation points, and it also shows me, huh, a dozen of you are asking about the same thing, I probably need to explain that term. I'm pretty bad at metacognition on what I don't know when it comes to computers, it would seem, or at knowing what you don't know. Whatever, not important. The important thing is that now you guys have a good way to come up with words, terms, and other basic things that you're having trouble with and share them with the class.

Next, the Titanic project has the full details that we talked about last time. And there
is another link, to Titanic novel idea sharing. This is a nice place where you'll put in: hey, I'm on team number whatever, who the three team members are, and then you've got your one, two, three slides for your novel idea.
This is where you will both put your slides in for your team and where you will look to see what other teams have come up with, so we don't have two teams saying, oh, but we both wanted to say this was an important feature. Or, on the flip side, you can say, oh, that's a good idea, I'm going to try that. So this is here for you guys to use and share.

Finally, from this I found out that, the way permissions work on the student ID photos that I'm given for the course roster, I am not allowed to reuse them, even for academic purposes within the class. The important thing is I can't manually upload pictures of you guys. I had intended at the end of the course to say, okay, slideshow: this is what Lisa looks like. How often did she help you, or how helpful was she to your class, on a scale of one to five? That's how I was planning on collecting student feedback on each other, so I know how much other students are helping you. Because I can't do that quite how I planned, I'm telling all of you to upload a picture that clearly identifies you, and that is not obscene, to be your profile image in Slack. I can then pull that down, and since you uploaded it, it doesn't break any rules. So make sure that you get that uploaded. Just a profile picture of you.

Oh, you guys can't see this. You should now see an administrative tab, which, sorry, was turned off. Thank you for pointing out the error I was making. Do we have any other administrative questions before we get started on content? Excellent.

So I am now going to begin. Now it actually shows up. That's good to know.

So we are going to talk about model insights to start today. This is going to be going over some ways that we can find out interesting things about features and parts of the algorithms beyond
merely looking at what the prediction is. Obviously, predicting things matters most. We have a bunch of predictors, we're trying to figure out a target, and all of our business goals at the end of the day are going to boil down to: can we get this target right? However, it is possible that while we are doing that, we can learn something else on the side, or we can learn something that's useful before our algorithm is perfected and ready to deploy in a useful way.

Let's actually get to this. I'm not going to break you guys up into groups for this, but I am going to ask you to think back to the diabetes homework assignment. Does anyone remember anything about that? Okay. So basically, the data set is a collection of a whole bunch of records on patients. You are trying to predict if they will be readmitted to the hospital, in the context of patients with diabetes. Some people will go in for an emergency. The doctor says, oh hey, I checked your chart, it says diabetes. Cool, I'll make a note of that. Then they will leave. Some of them will come back on another day; some of them will not. That's what the data set is all about predicting.

So we know that our target is readmission. It's a Boolean: are they going to be readmitted, true or false? For training data, we have tons of features. We have their age, their height, their insurance type, who their provider is, whether they've been in the hospital before, their sex, race, and a whole bunch of other fields like that.

Now let's pretend for a minute that I am in charge of managing a hospital. I want to know this readmission stuff, and I sent you on this project. What other things might you be able to learn while you're still playing around with this data? To give you a clue, let's focus in on a categorical field: smokes or doesn't smoke. It's a nice binary category, a Boolean. Do you guys think that there might be some correlation between readmission and smoking? Yeah.
People who smoke might wind up in the hospital more. You might have discovered that while working with your data and training your models. Maybe your models are still kind of bad, or they have problems with target leak, or they aren't ready for prime time. But telling your hospital administrator, by the way, keep an eye on the smokers? That
could be useful. Now, another thing that you might have is: well, I don't know exactly how zip code correlates with readmission, but there seems to be a strong connection there. I don't know exactly which groups are going to be, oh yeah, this one's totally going to get readmitted, and I don't know which ones are going to be, no, this one will never come back. But I know that zip code could be really important. It's worth looking into. I'll tell my administrator: maybe you want to look at it, or have another team work on trying to parse out what's inside zip code that might tell me more information. And partial differences aren't important, because you guys haven't taken enough stats yet.

So where are these features coming from? Your source data is totally where they're coming from, right? Those are your predictors, and that's not changing. However, the insight that you're gaining is coming from algorithms that you have run at least partway, or that we've run a few iterations of, even if they aren't perfect. When these algorithms are done, some of them will output useful information that you can use like this.

Take decision trees, which we talked about. When I was explaining how decision trees worked, who remembers how a tree decides what feature to use? So it's at a node and it wants to go down; how does it decide what to pick there? The voting one, that's random forests. Okay, so a true or false is going to be the output. How does it decide what it's looking at true or false on? How does the algorithm decide, I am going to look at whether smoking is true or false here, instead of whether "ate an apple a day" is true or false here? It looks for whatever feature will improve purity the most. So it goes from, hey, this node up here is like a 60/40 split, and I can turn it into a node that's like 80/20 and a node that's like 30/70. That's good: I've got two things that are much closer to perfect.
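A minimal sketch of that purity comparison, assuming Gini impurity as the purity measure (the lecture does not name one, and real tools may use something else). The function names and the row counts are hypothetical, chosen only so that a 60/40 node of 100 rows splits into an 80/20 node and a 30/70 node.

```python
def gini(positives, negatives):
    """Gini impurity of a node: 0.0 is perfectly pure, 0.5 is a 50/50 mix."""
    total = positives + negatives
    p = positives / total
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def purity_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes.

    parent and each child are (positives, negatives) row counts.
    """
    parent_total = sum(parent)
    weighted_child_impurity = sum(
        (sum(child) / parent_total) * gini(*child) for child in children
    )
    return gini(*parent) - weighted_child_impurity


# Hypothetical counts: a 60/40 parent node with 100 rows.
parent = (60, 40)
candidate_splits = {
    # Splits the parent into an 80/20 node (60 rows) and a 30/70 node (40 rows).
    "smokes": [(48, 12), (12, 28)],
    # Splits the parent into two nodes that are both still 60/40.
    "ate_an_apple_a_day": [(30, 20), (30, 20)],
}

# Score every candidate feature and keep the one that improves purity the most.
gains = {
    feature: purity_gain(parent, children)
    for feature, children in candidate_splits.items()
}
best_feature = max(gains, key=gains.get)
print(best_feature, round(gains[best_feature], 3))
```

Under these made-up counts the smoking split wins, because both of its child nodes are purer than the parent, while the apple split leaves the mix unchanged and gains nothing.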
It'll go and compare each of those and find out: what's the thing that will increase my purity the most? That's one way a decision tree can give you feature importance. It knows which features contribute a lot to purity, because it has looked at a lot of them. In fact, it has looked at all of the features, thousands, possibly millions of times. You can just keep a list: these are the good ones; these I've never heard of, I didn't even know they were in the data.

Random forests, the one with voting: how might we use voting? The random forest, in case you guys don't remember, since I wasn't super big on it, is the one where you'll have a lot of different decision trees, and then they will vote on what the prediction should be. Based on how well they do in their own tree, they get more votes if they're good and fewer votes if they're bad. How might I be able to use something like that to find features that are important indicators? Exactly right. The trees that get the most votes are the trees that are awesome. We can look at what features are in the trees that are awesome, and then we can look at what features are in trees that have very few votes. The important ones are in all, or most, of the trees that have lots of votes, because when it randomly selected features and fumbled around trying to figure out trees, the ones that were good wound up getting a lot of votes at the end. And if you can't get a good score without including that as a factor, then that's a pretty darn important feature.

I finally broke down: I'm going to talk about regressors. I didn't want to talk about them until we moved past classifiers, but I've just said "oh, we'll talk about it later" too many times, and that's not fair to you guys. Later today you will learn that, at least a little bit.

For the
next little bit we're going to be running through DataRobot, so go ahead and log in to DataRobot, or open it up and log in. Make sure that your current project is the one selected. Before, we have only looked at the Data tab, which is where you put in your initial information and got some characteristics: what's categorical, what's not, and all that sort of thing. The Models tab is where it ran a whole bunch of different models. Today it's all about the Insights tab. This Insights tab is probably the biggest feature that separates DataRobot from some of its free alternatives. These pictures are pretty, they're easy to understand, and you can drag things around to tweak them. It's a lot like dashboards, if you've taken courses on making
C-level executives smile by showing them pretty pictures, because they're easily entertained, but don't tell them that. So these visualizations matter. Anyway, Insights tab. Is anyone not yet at the Insights tab? I am going to move on slowly. Okay, you do have to have run some models before the Insights tab will show up; until then it'll probably be grayed out. I would say hop on with somebody else and follow along there.

So within our Insights tab we're going to talk about tree-based variable importance first, simply because it's the easiest one to explain. It's over here. What have we got on this screen? It looks like, in the top left, I went into the Models tab to do this, but it'll be the same information, just presented a slightly different way. You have your algorithm selected, and it will show you a list of your features off to the side. It goes up to a hundred percent. That means that your best feature will always have a hundred percent, and everything else is scaled to be how good it is relative to that feature.

So based on the toy data set I threw up for the Titanic project, name is highly predictive. That looks terrifying and scary to me, because it shouldn't be super predictive. I'm sorry: if you had someone's name and whether they survived or died, you absolutely can figure out if somebody survived or died based on that name. The problem is it's not super predictive, because there aren't a lot of people that have the same name, and when they do, they don't necessarily have the same outcome. It turns out the reason name is at the top is because name is broken down into a few features in these algorithms, but let's talk about that in the text area. So this is name; it goes up to 100%. There is no non-relative metric here, so it's not like, oh, it's a hundred percent, that means this explains everything.
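A small sketch of that relative scaling, under the assumption that the chart simply divides every raw importance score by the largest one. The function name, feature names, and raw scores are all invented for illustration; none of them come from DataRobot.

```python
def relative_importance(raw_scores):
    """Rescale raw importance scores so the top feature reads 100.0.

    Returns (feature, percent) pairs sorted from most to least important.
    """
    top = max(raw_scores.values())
    ranked = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, score / top * 100.0) for name, score in ranked]


# Hypothetical raw scores; only their ratios matter after rescaling.
raw_scores = {"feature_a": 0.40, "feature_b": 0.29, "feature_c": 0.10}

scaled = relative_importance(raw_scores)
for name, percent in scaled:
    print(f"{name}: {percent:.1f}%")
```

The top feature always lands on 100 no matter how weak it is in absolute terms, which is why the percentage is only a point of comparison between features and not a measure of how much the model explains.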
It's just so it has a point of reference compared to other features. So when I go down to fare and look at that, it's going to say, oh, that's like 70 to 73 percent of what name has. That means it still contributes a fair amount, but it's not quite as good as name. It's better than a lot of other things. And it is on a ratio scale, I suppose, so we can get some information out.

Another very useful tool I have found in this is Export, up here. Yes, you can get the pretty picture out of it, great for reports, great for write-ups. But the CSV in here can give you a little more detailed information, so instead of guessing what the numbers are, you know: here's what they are in order, and here's what the actual number is. So, was my "seventy-ish percent" pretty close to right? Yes. We have 1, then values in the low 70s, then 57, and so on. The ZIP is just the PNG and the CSV together. PNG is for the picture; CSV is for the data. This data you can open up in Excel or Alteryx or anything else; you've played with CSVs a lot before. As for PNG, I'm sure any computer that you guys will interact with can use it.

The other nifty thing up here is creating a new feature list with the top N features. The way that this works is I sit here and look at my chart. I see: yeah, hundreds are good; yeah, seventies are good; I'll tolerate this fifty over here. Okay, then I will just not talk about that. Totally fine. I had a little bit longer diatribe on tree-based importance from the Models tab versus tree-based importance in the Insights tab. They show the same information but are slightly off on detail, so I didn't want to go over it. Sorry that I went down the wrong direction with you guys, but conceptually they are very similar things.

Now I'm going to talk about regressors a little bit, simply because I can't avoid it any longer. A regressor is something
that, instead of predicting a category, will predict a continuous value. It will do this by adding multiple different features together and weighting the features based on how much they contribute. So you can think of the probability that a couple is going to get divorced as equal to minus 0.02 times their age when they got married, plus 0.4 times whether their parents were divorced. Now, I suppose I apply this to an individual rather than a couple: the individual's age when they were married, and whether their parents were divorced or not. And then I just have a plus 0.25. That's a base level, assuming zeros on the other stuff. So the prediction for a zero-year-old whose parents were not divorced is 0.25. A twenty-year-old would be negative 0.4, plus, let's say their parents stayed together, so nothing, plus 0.25. Negative 0.4 plus 0.25 gives us negative 0.15. This is the most basic form.
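The formula above can be sketched as a few lines of code. The weights (minus 0.02, plus 0.4, and the 0.25 base level) are the ones stated in the lecture; the function and parameter names are hypothetical. Note that a plain weighted sum like this can land below 0 or above 1, so it is safer to read the output as a score than as a literal probability.

```python
def divorce_score(age_at_marriage, parents_divorced):
    """Weighted sum of the features plus a base level, using the lecture's weights."""
    weight_age = -0.02       # contribution per year of age at marriage
    weight_parents = 0.4     # contribution if the parents were divorced
    base_level = 0.25        # prediction when every feature is zero
    parents_flag = 1 if parents_divorced else 0
    return weight_age * age_at_marriage + weight_parents * parents_flag + base_level


# A zero-year-old with married parents gets just the base level.
print(round(divorce_score(0, False), 2))
# A twenty-year-old with married parents: -0.4 + 0 + 0.25.
print(round(divorce_score(20, False), 2))
# Same age, but the parents were divorced: -0.4 + 0.4 + 0.25.
print(round(divorce_score(20, True), 2))
```

Each feature just slides the score up or down by its weight times its value, which is all that scoring a table row against a regression formula amounts to.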
That is how regressors work. They get a hell of a lot more complicated than this, but this is what it is at its heart: you have your features, each contributes some amount, and you go from there. Are we cool with this? Do you guys understand this?

That is a base rate, so if everything else was zero, that's what the prediction would be. It is also known as the intercept in a stats class. It's the leftover bit of the target that you can still predict with no features. So this plus 0.25 would indicate the baseline chance of getting a divorce, just overall in the population. That's what it would indicate.

Yes, that's exactly right. We've got a target up here. We have the feature, or factor, or variable, the predictor, over here. We have a weight; that's what the feature is multiplied by. And then we have this other little bit. You can think of it as a weight, but the feature is just one. So it's a predictor where absolutely everyone has a value of one. That's why you're just adding the 0.25: it goes to everybody, and they all get the same thing.

A reasonable exam question may be: I give you a regression formula, I give you a table, and I ask you to give me a score, the probability, the number, the target, for a row from that table. Another reasonable exam question that I may ask: I have a whole bunch of these, and I ask you what features are most important. I will not say "what has the highest weight," but that's what it is. You will also need to realize, if I ask that question, that "most significant" or "most impactful" cares about distance from zero, that is, the absolute value. If I found out that membership in the Church of We're Going to Execute Anyone Who Gets Divorced has negative 0.98 as its weight, that's a very impactful feature. It
is highly predictive of the target, even if it's in a specific direction. And that leads into why we're talking about regressors at all right now. For our tree-based importance, we only talked about one direction: how often does it make the model better when we include it as a feature? When we get into variable effects, we also gain a direction. So we can find out, for the Titanic project, that sex equals female is a very strong positive predictor. If you're a female, you are much more likely to survive. If your title is Master, you are very likely to survive. It's the proper form of address for a child, a boy child. So you're a type of child. Women and children first, huh? What do you know: women and girls, plus boys. That's exactly what "women and children first" would be, and I've got it down to two features, strongly positive.

Over on the negative side, if your title is Mister, you're a little bit SOL. That's a pretty highly negative predictor right there. If you're on C deck, you're not doing so hot. If I'm remembering correctly, that's about midway down the ship, a little closer to the bottom. So above F and steerage, but that's it.

You'll see in here that it has constant
spline, high, and then a feature name for a few of these. That's just an artifact of how it binned the data. Some of these algorithms can automatically bin data. It's just part of how they work. So that's telling you what method it used for doing that within this algorithm. But the important part is that age is really important, and passenger class is very important, specifically high passenger class. I'm sorry: medium-high passenger class, or high age. So being older hurts you. Having a high class number, where one is first class, two is second class, and three is third class: being two or three isn't good for you. So it's learning "children first," but, like, yeah, the poor kind of don't count so much.

So when you're at this screen, you do not have variable effects. Well, I suppose, yes, the trivial answer is it's the data, because it's always the data. But the more specific answer is that it depends on which models have run. Some of the models can contribute different kinds of information. You can only get tree-based variable importance from things that are a kind of tree, for example. You can only get variable effects from something that has a regressor under the hood, or something very close to one. If you don't have anything of that type, DataRobot just won't show you that option, because it has nothing it could show you.

Yes. So the way that you can do that, let me exit out of this real quick and show you. You know, I was debating whether I should have cut this slide or not. Apparently the answer was no, I should not have. So when you go to your models list, some of your models will have an icon. Fine, I'll go here and then click to the leaderboard. It isn't under Models? No. Oh, so it's definitely Leaderboard then. Filter. One page down. Okay, fine, don't page down. Ah, here we go. So there
is a little symbol right here. It's a beta with an i next to it. Beta is the Greek letter commonly used for weights on these coefficients. These are all of the models that I have that contribute. So if you guys don't have any of those, you can go and manually click Add New Model and force it to try something that you think is going to be useful. So, gradient boosted trees. Ah, Informative Features is fine. I'm in a hurry, so I will drop this. I'll go ahead and click. No: it's a duplicate of a job that has already completed, so apparently that was a good one. The algorithm already found it. That's what I get.
When I ran it, it grabbed it automatically, based on the training data I put in. If you manipulate your data more, DataRobot should pick up some of these other things. Alternatively, you should be able to force it under Add New Model and then selecting, let's see, regularized logistic regression, which is something that will give you weights.

What isn't showing up? Oh, why is that? So I actually have a great example that I used in the book to answer that question. It turns out that one of the highly predictive features for whether someone is readmitted to a hospital, from the diabetes data set, is whether they are deceased. Very few people who are deceased get readmitted to the hospital. That's something that DataRobot struggles a lot with, because the reason that they're not readmitted is different from the reason everyone else is not readmitted. So it's trying to train two models at the same time, in a sense, because it says: I'm trying to make a binary classifier, but there are three groups and I don't know how to handle that, so I've just got this one much more important feature, like one of the best predictors in my data, and I don't know what to do with it. That's the kind of place where DataRobot can screw up and fall into a hole. When you're in a situation like that, it absolutely is best to manually run models to see if you can find a way to force it out of that hole, or to tweak your data for the next run so that you'll not be in that hole.

And how do you know if you're in that hole, or if it's just that you threw in garbage and it's saying, I don't know what to do with this? You don't. You can't possibly know. You just have to try. And that's why it's good that we don't actually pay for these workers. You just have a fee that lasts for two years or whatever it is. So it's easy to keep trying things until you figure it out. Similarly,
I encourage you guys to try some of the algorithms under Add New Model. Right now, based on discussions we've had with them, there are a number of algorithms that are good, or state-of-the-art from research, or very good situationally but bad in other situations, that they just don't try right now. DataRobot is capable of doing it; it's just, well, I'm going to throw the best twenty ideas I have at it. Some of these fall by the wayside because they're too new or they're too specific. So going around and trying some of those new models might be a way to find something that DataRobot just isn't trying, even though it could.

Yes? If I could answer that question, then I could just give everyone in the class an A and go home, because clicking the DataRobot button would solve it for us. So the answer is there isn't an answer, but that's a good thing.

Some, I guess, rough intuitions on things you'll want to try. You know what, I'll just admit something that, frankly, I should be more embarrassed about than I am. I look in here for things that I recognize the name of from other times I ran it, where they did well. I try things that have a logo I do not recognize, because that means it's probably new; otherwise I would have remembered it from the last time I ran it. I tend not to run too many variants of the same thing. So while, like, the naive Bayes combiner classifier may have multiple different options in here, I assume that the differences in there are going to be more marginal than substantive. That's not always a valid assumption, but it's one I make. And another great resource in doing this would be Google. It's what the pros use. I suppose I use Google Scholar, but it's the same thing. When I see one that has enough fancy-sounding words in it, I'll look it up, or if they
have the name of an author I recognize, or a famous mathematician. It's like, if the word "Bayes" is in it, I'm probably going to run it. I'm sorry, if there's Bayes in it and I'm lost, I'm going to run it. But it is a great thing to do if you're asking, hey, what's my novel thing for the Titanic project? Oh well, I'm going to try three
or four of these, I'm going to read about them on Google while it's running in the background, and then I'm going to put some slides together explaining: this is what a generalized additive model is. That's something that will be of value to other classmates to know, and it's something that I have not covered in this class.

So the last thing I'm going to cover in things you can learn from a model is partial impacts. When looking at the homework, I saw a lot of people who were like, "ask James what this chapter is all about," or "what on earth does this section mean," once we get to around 19.4, I think. The reason for that is DataRobot had an old version that was confusing. It was bad enough that they needed a fix. Now they've made a fix, but there isn't a great explanation of what it is yet. So your textbook just is out of date on this. It's one of the drawbacks of the cloud. So I'm going to explain what partial impact is conceptually, and I will not require the use of the tool during the class. You will need to know the concept.

So the question of partial impact is: how much does the model change if everything but one feature stays the same? Let's start with a smiley face here. If I change the number of eyes that are closed, I'll reproduce these. I have gone from zero eyes closed to one eye closed. How similar do these things look? Different. I think there's a different emotion about it: one's happy, and one's a little more, maybe, "I know something you don't." Then when I swap it so both eyes are closed, well, that's clearly different. Across all of those changes, I guess how many eyes are open has some predictive power when everything else is the same.

Let's make another comparison. Got a smiley face right here. Just going to copy everyone. Let's add a hat. Has this emotion changed? Does
anyone think that the left one is less happy than the right one, or that there's a secret desire to kill in one of them? No. They're both exactly as happy, guys. When we changed that one feature, "has hat," from false to true, the model didn't change at all. None of our predictions altered in any meaningful way. That means the partial impact of "has hat" is very low.

Now let's look at eyebrow rotation right here. For eyebrow rotation we have this kind of little earnest guy in the middle. He's just like, "Oh, I'm sorry I spilled the juice. But you still love me, right?" And we've got another guy up here that's a little bit like, "You know what? I spilled the juice, and I knew that was your laptop. Don't know what a laptop is, but I know you don't like juice on it. Yeah, what are you going to do about it?" And then we've got this last guy over here, and I don't really know what he's feeling. Oh wait, now I know what he's into: somebody likes the guy who spilled the juice on accident.

Point being, among the three things that we have, eyes open and closed, hat, and eyebrow rotation, eyebrow rotation seems to be pretty important. The change in what emotion was being conveyed, when everything else was exactly the same, was significant. Top hat: meaningless. How many eyes are open: sometimes it mattered, sometimes it didn't. That's another very useful, interesting, powerful part within partial impacts. It lets us break things down within chunks or groups of a single factor. Strictly speaking, partial impact is only at the factor level, is it impactful or not, but when looking at that, we can see parts of it are and parts of it aren't.

Now I'm going to go through and explain how some of the things in here work. I know it's confusing. I know none of you can run it right now. But you will start by going to your models list. Once you are in your models list, select
something. I recommend something that's a blended algorithm. Blended algorithms are algorithms that combine other ones. We will talk about that at some point, but blended algorithms are usually going to be the ones near the top of your performance. They're good ones to look at for stuff like this, because
they get to look at feature impact from across different models and how they all interplay. If you go to the Model X-Ray tab from there, you can click Compute Model X-Ray, and that gives you this information-overload disaster. I mean, a wonderful product.

What we've got here in the plus sign is what the predicted value is, based on the model that we have. So for everyone who is on C deck: let's run a prediction for everyone who is on C deck, then let's average together all of those predictions. Next, we have an orange circle that is the actual. So instead of taking the predictions from the model and averaging those together, we'll average the truth together. When there is a big gap there, that's a sign that something interesting is going on. Your algorithm is screwing up right there. It is making a predictable error across a large volume of data that fits within this narrow group. So if the test set that your algorithm was being evaluated and graded on happened to only have people on C deck in it, you'd get a bad score, because this is where it failed.

If you hover over any of the columns, it gives you a little bit of information about it, summarized in a table. It will tell you the partial dependence, what the predicted and actual averages are, and the number of rows. The number of rows is also available down here as a little bar chart. And from this I have found out, oh man, I'm looking at a garbage feature. Where the error is so high, it happens that I just had very few samples for most of these. Knowing how big each group is can tell you how important it is that there is a failure going on at that spot, and it can also tell you that for something like A deck, which I only had a couple of rows on, I'm training on so little data that there's no way it isn't overfit. It's probably just something safe to brush off to the side.
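The predicted-versus-actual comparison described above can be sketched in plain Python. This is an illustration of the idea only, not DataRobot's implementation; the function name, the deck labels, and every number in the sample rows are made up.

```python
from collections import defaultdict


def summarize_by_group(rows):
    """Average the predictions and the actual outcomes within each group.

    rows is an iterable of (group, predicted, actual) tuples.
    Returns {group: {"rows", "mean_predicted", "mean_actual", "gap"}}.
    """
    totals = defaultdict(lambda: {"rows": 0, "predicted": 0.0, "actual": 0.0})
    for group, predicted, actual in rows:
        bucket = totals[group]
        bucket["rows"] += 1
        bucket["predicted"] += predicted
        bucket["actual"] += actual

    summary = {}
    for group, bucket in totals.items():
        count = bucket["rows"]
        mean_predicted = bucket["predicted"] / count
        mean_actual = bucket["actual"] / count
        summary[group] = {
            "rows": count,
            "mean_predicted": mean_predicted,
            "mean_actual": mean_actual,
            "gap": abs(mean_predicted - mean_actual),
        }
    return summary


# Hypothetical rows: (deck, predicted survival probability, actually survived).
rows = [
    ("C", 0.70, 0), ("C", 0.65, 0), ("C", 0.60, 1), ("C", 0.75, 0),
    ("E", 0.40, 0), ("E", 0.55, 1), ("E", 0.45, 0), ("E", 0.60, 1),
    ("A", 0.90, 0),
]

summary = summarize_by_group(rows)
for deck, stats in sorted(summary.items()):
    print(deck, stats["rows"], round(stats["mean_predicted"], 3),
          round(stats["mean_actual"], 3), round(stats["gap"], 3))
```

Reading the gap next to the row count is the point: in this invented data the large gap on deck C rests on four rows, while the even larger gap on deck A rests on a single row and is probably safe to set aside.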
It doesn't matter what it says for feature importance or for partial dependence here, because it's applying to so few rows.

Finally, we've got over here your features sorted by impact. This goes and shows you the importance, very similar to what a tree-based importance does.

And one last thing: the partial dependence. The most use that I can give you about this number is that you want the difference between the minimum and maximum to be big. Anything else just requires too much math to explain. So over here, the biggest number we've got is 4.25. The smallest number we've got is about 3.6-ish. That's not a very wide range. That means this model, I'm sorry, this feature is probably not giving
us much new information in the model. And that corresponds with our 27% importance. So that's what's going on there.

Yes, you can click any of these and it will show you the charts for different ones, and within them you'll have a number of useful options. The most useful ones, aside from Export to get your data out and to make the pictures, are under More. There is an option to hide missing. Frequently you'll want to hide missing. You can also change what is on a log scale and what it automatically scales. Since you don't know that your data is necessarily going to be linear, it's good to try it in different spaces and look at it and say, oh, the gap looks really big, or, now it doesn't look that big. There is also a bins column, or a binning thing up here, if you have continuous data.

And I'm gonna end with a little bit about text. This is not hardcore text analysis or anything fancy like that, but merely counting how many times different words appear is really useful. Being able to figure out if text contains, hey, is "Mr." in this name, can actually be a pretty good predictor. So what the text mining option does, from your Insights tab, is it gives you something just like your variable effects, positive and negative, except it tells you what feature it is and what the value is on that feature. So the top one up here is name, "Mrs." If "Mrs." is in the name: super positive, yes, totes gonna live. If "Miss" is in the name: not as good as "Mrs.", but you're still doing pretty well. Way down at the bottom, if your name is Mr. John, you're not having a good day.

Why did it pick on Mr. John? Because the algorithm doesn't know how names work. It's just breaking things up: well, there's a bunch of words here, things with spaces between them. It doesn't know that "Mister" is a title and "John" is a proper noun.
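As a rough illustration of that naive approach, here is a minimal sketch that splits names on spaces and looks at the survival rate for each word. It is not how DataRobot's text mining works internally; the passenger list and the helper names `tokenize` and `token_survival` are invented for the example.

```python
from collections import defaultdict

# Hypothetical (name, survived) pairs, invented for illustration only.
passengers = [
    ("Smith, Mrs. Anna", 1),
    ("Jones, Miss. Mary", 1),
    ("Brown, Mr. John", 0),
    ("Clark, Mr. John", 0),
    ("Evans, Mrs. Ruth", 1),
    ("Hall, Mr. Peter", 1),
]

def tokenize(name):
    """Naive split on whitespace with punctuation stripped.
    It has no idea which words are titles and which are given names."""
    words = (w.strip(".,").lower() for w in name.split())
    return [w for w in words if w]

def token_survival(passengers):
    """For each word: (share of passengers with that word who survived, count)."""
    hits = defaultdict(list)
    for name, survived in passengers:
        for token in set(tokenize(name)):
            hits[token].append(survived)
    return {t: (sum(v) / len(v), len(v)) for t, v in hits.items()}

rates = token_survival(passengers)
print("mrs ", rates["mrs"])
print("mr  ", rates["mr"])
print("john", rates["john"])
```

Because "john" is treated like any other word, it gets its own score right next to "mr" and "mrs", which is exactly how something like "Mr. John" can end up at the bottom of the list.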
It just said, let's take a naive approach, I'll do that, and look, I found something. When you're manually constructing data and pre-processing it in Alteryx, you're probably not going to want to put "is name Mr. John" as one of your formula tools.

Okay, and that is what we have got for today. From here, I am going to do something a little different than last class. I am going to open the quiz during the last five minutes of class, so that everyone doesn't go running out when I say I'm done, because, you know, I want to spend time with you guys. Also, I hope that you learn something, that you get something out of this class. And until then, play around in Alteryx, look at the Titanic project, try out new models.

Yes, the... yes, I'm aware that that means like four of you actually have to stay. I'm hoping you have enough team members that there will be peer pressure involved. Yeah. Yeah, I suppose that would be the fair answer. No, the fair answer is so that everyone else can be free of the class, so yeah, sure. I think you have unearthed the flaw in my plan. So, uh, go. Eat, drink, be merry, do your data. Wave to the people on the YouTubes.

Let me just get... so that the Internet is watching the class. Okay. So what I would do at this point is abandon ship. Products... check out... sheets.

So, positive predictive value and... microphone. Positive predictive value and negative predictive value: the positive class is survived, the negative class is died. Or at least in a binary classifier, that's how it works. Secretly, inside these, it has one class that's positive, that's a one, and one class that's negative, that's a zero. Survived is one, died is zero. So positive is trying to go to that one, and negative is died. That changes when you get to some multi-class classification, so things where you're trying to predict
either which of three groups someone belongs in, or, of these three labels, which of them applies best to this user. Then positive and negative become a little wonkier, but the basic idea is still: remember that under the hood it's dummy coded as ones and zeros.
One end of it is going to be positive. So, though, if I am sex male, that is going to give me minus 0.4 on surviving. If my age is under fifteen, then you get a positive 0.8, and I wind up positive, but I still had negative components go into it. It means that some people will be predicted, on a scale of zero to one for "did you survive," as, you're a negative twelve. That's just, let's round to the nearest.

Quickly, the solution to that is using logits. It's a probability thing that makes it more expensive the further away you go. So if I am twice as likely to die up at the very top, the change is actually not much. If I used to be ninety-eight percent likely to die, twice as bad as that is, what, ninety-nine percent. That's a tiny change, but it's still doubling. So that's how it's actually fixed, so you never do wind up outside the range. But don't worry about that, it'll just get rounded.

Just a sec. Okay, uh... what's up?

Okay. The basic workflow that you're going to have is going to be: play around with the training data within Alteryx. Do any fancy nonsense you want to do to it there. Upload that into DataRobot. DataRobot builds a model on the stuff you fiddled around with. Then there'll be something that's awesomesauce within DataRobot. You then take your test data, put it into the same Alteryx workflow that you put your training stuff through, and upload that changed set into DataRobot to have it predict with your awesomesauce algorithm.

So in general, the basic idea is: you're doing your pre-processing, you're doing your feature engineering, you're handling outliers, anything that you're doing to the data that you can understand conceptually, where you believe you are adding value before the answer is just "throw it to the algorithms." You're going to have to do that to both train and test.
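The "same workflow for train and test" idea can be sketched outside of Alteryx as one function that both sets pass through. This is a minimal sketch under assumptions: the field names, the sample rows, and the `preprocess` function are hypothetical stand-ins for whatever the Alteryx workflow actually does.

```python
import re

# Hypothetical raw rows; field names and values are invented for illustration.
train = [
    {"name": "Brown, Mr. John", "age": "34", "survived": 0},
    {"name": "Smith, Mrs. Anna", "age": "", "survived": 1},
    {"name": "Jones, Miss. Mary", "age": "12", "survived": 1},
]
test = [
    {"name": "Hall, Mr. Peter", "age": ""},
    {"name": "Evans, Mrs. Ruth", "age": "40"},
]

# Assumed name layout: "Surname, Title. Given names"
TITLE = re.compile(r",\s*([A-Za-z]+)\.")

def preprocess(rows):
    """One set of steps, applied identically to whichever set is passed in."""
    out = []
    for row in rows:
        new = dict(row)
        match = TITLE.search(row["name"])
        new["title"] = match.group(1) if match else "unknown"
        new["age"] = float(row["age"]) if row["age"] else None  # keep missing as missing
        out.append(new)
    return out

train_ready = preprocess(train)  # this is what the model gets built on
test_ready = preprocess(test)    # same steps, so the model sees the same columns

print(train_ready[0])
print(test_ready[0])
```

Keeping every step inside one function is the point of the design: there is no way for the test set to receive a different treatment from the training set, apart from the target column it does not have.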
You have to do the same thing to all of your data. But DataRobot should have access to the training data to play with and learn on. While you are doing your other steps, don't waste your time on changing the test data if, when you tried it on the training data, it only made bad models. So if you know that you screwed something up, like maybe you wrote your regular expression wrong so it replaced everyone's age with a null, well, that's not useful to anybody, so all of the models will be bad. You don't have to go and run the test data through Alteryx, because what's the best you can get? A bad prediction from a bad model.

So within this specific example, there is probably very little that you're going to add to DataRobot. With age at least, the only thing I could really imagine you adding is finding out what the age of majority was at the time. It doesn't know that there's a real difference between over 18 and under 18, or over 12 and under 12, from a rights perspective or from the responsibilities society gives you. It doesn't know any of that. It just knows, huh, there's a cutoff around here.
Manually creating an over-18/under-18 feature says, hey, this is something that also matters above and beyond what age alone would matter. So, sure, you can figure out some stuff looking at age, you'll make some great bins, do other stuff like that, but I know something you don't, because you can just do math and I have Wikipedia.

Some of their models are trained on Wikipedia, but no, they have not implemented that outwardly facing yet. They've done some cool tech presentations on, like, so we're using your text data, what happens when, oh, you named this field "gender," what do we know about that already? But generally speaking, for most models, no.

Oh, one administrative thing. In the swap between the two Canvas courses, some information about the regex homework got a little lost on my end. I have all the files, but I don't have which of you is some 42-character-long string of letters and numbers. So a few of you should have received a message or an email. Make sure that you or someone in your group has responded to that. If you did not receive a message or an email, do not worry about it. I think there's like four of you that I should talk to right now. I would need the files again.

We're in the last five minutes of class, so people who have a quiz can take that now. It should be open for you.

I can't wrap my head around that being a C, 922. To be perfectly honest, I don't actually know what the Hotspots tool does. I know what the word cloud is; I just think they're overhyped. Don't get me wrong, they always get a positive reaction, people love them, but I don't think that... I personally have a grudge against them from my linguistics background. They're totally fine. As for Hotspots, I honestly just don't know how to interpret it, so I'm not gonna present it to you guys. Roughly, it's supposed to... very roughly, there is something inside of it that says, here's the intersection
of two features, like how they play with each other. And then we're going to look at how they play with each other, and we're gonna use an algorithm that puts things that behave similarly closer to each other, and things that are unlike each other far apart. So it's supposed to create areas of, oh, the things in this ball are red, the things in this ball are blue, and there's a lot of space between the red and the blue ones, or there's a whole bunch of blue out here but there's this one exception for red in here. That's the kind of stuff Hotspots are for. Or so my understanding of it goes.

Okay, everybody, that is the end of class. I hope today was useful, or at least gave you things that you could salvage and rip up into your write-ups. Have a good...
2018-03-10 12:29