MIDAS: Responsible Data Science and AI mini-symposium - "Equity in Data Science" with H. V. Jagadish

All right, let's get started. Welcome, everyone, to our mini-symposium on responsible data science and AI. My name is Jing Liu, and I'm the executive director of the Michigan Institute for Data Science. I want to spend a couple of minutes giving you a little background about who we are and why we do this. For those of you who don't know, the Michigan Institute for Data Science, or MIDAS, is a campus-wide organization supporting data science and AI research across many research fields, with the goal of maximizing the scientific and societal impact of such research. We have about 500 affiliate faculty members coming from every single school and college across the three University of Michigan campuses, and with that we're the largest data science and AI institute at any U.S. university, and also one of the most scientifically diverse. Closely related to today's program, I want to mention that we have two postdoc training programs: the Michigan Data Science Fellows program and the Eric and Wendy Schmidt AI in Science postdoctoral program. These are unique in that they are among the very few postdoc training programs in the US, and around the world, that focus on data science and AI. Today's mini-symposium is part of our annual Future Leaders Summit. Every year we invite PhD students and postdocs from around 20 universities, including major research universities, Midwest universities, and minority-serving institutions, to come to Ann Arbor, this year under the theme of responsible data science and AI. The goals are to explore cutting-edge research directions under this theme and to nurture the next generation of data scientists and AI researchers, especially because they will play a major role in ensuring the responsible use of data and AI in the future. We also want to use this event to build momentum on campus with U-M researchers and trainees. So why responsible data
science and AI? I think a lot of you already have a pretty good idea: this is not just for human-subjects research, but for everything you do with data and AI. You'll see very soon that two of the speakers this afternoon are actually in conservation biology and astronomy. If you study conservation biology, for example, and you want to study sustainability, you need to make sure that rare species are included in your data set. If you do physical science such as astronomy, you still need to make sure your data and your analysis are robust and trustworthy. Importantly, data and AI are playing an increasing role in every walk of our lives. Everybody knows about ChatGPT; these tools are coming at us at breakneck speed, and many of them are not very transparent. So responsible data science and AI really is important for everyone. I want to mention three upcoming related events that we're organizing. Next Monday there will be a colloquium in Palmer Commons on implementing AI in health. Next Wednesday there will be a forum, I believe on North Campus at the music school, on the theme of generative AI, music composition, and creativity. And in mid-May, on May 16, there will be a one-day symposium on ethical AI; this is a joint event with Rocket Companies, and there will be keynote speakers from academia, government, and industry on AI development, application, and regulation. As for today, we have four speakers: Dr. H. V. Jagadish; Ellie Sake from Microsoft; Tanya Berger-Wolf, the director of the Translational Data Analytics Institute at the Ohio State University; and Andy Connolly from the University of Washington, the director of the eScience Institute. These speakers are not only prominent scientists; the external speakers also represent organizations that have been collaborating with us for a long time. The eScience Institute is
their data science and AI research hub. At the end, the panel discussion will be led by Josh Pasek, a faculty member here at the University of Michigan and also an associate director of MIDAS. Dr. Jagadish will give more detailed introductions to these speakers later on, but now I want to welcome him for his talk. As you can see, he's a well-decorated faculty member: the director of MIDAS, the Edgar F. Codd Distinguished University Professor, and the Bernard A. Galler Collegiate Professor of Computer Science and Engineering. I want to mention two things about Dr. Jagadish. One is that he's a computer scientist, but he was also the first person in the country to offer a data science ethics MOOC, and he has been working with various organizations, in academia and beyond, to promote the ethical use of data science and AI. The other is that his own research is about the development and usability of very large databases, and in recent years there's a new focus on data equity: systems to ensure the inclusion, fairness, equity, and validity of data. So with that, let's welcome Dr. Jagadish. [Applause]

Thanks. Okay, so I want to talk about equity today, and as a starting point I want to talk about fairness. There's a lot of talk about fairness in the context of AI, so just to establish a baseline, I want to spend a couple of minutes on it. Historically, data scientists like us think of ourselves as being very objective and neutral, following where the data go, and so on. And there is value to analyzing data and learning from data. But I think it is by now widely accepted that data don't speak for themselves: there are a number of choices being made, beginning with choosing what data you look at and what kinds of models you choose to build, that can result
in biased results. I'll be saying more about some of these things in a bit. There are other points, too, that you will probably not have difficulty agreeing with me on. As a starting point, it's important to make sure that your training data are representative of the population on which you're going to run your model. Yet it's almost always the case that you train on whatever data happen to be available to you, which is never the data you'll actually be running your model on; you just hope that the two come from the same distribution, because at least the math underlying your system assumes that. A corollary is that you're assuming the future is going to be exactly like the past: by definition, if you're learning a model, it's learned on past data, and you're going to deploy it on future data. If your learning happens over a relatively short time constant, on things that are relatively stable, this is a reasonable assumption. For example, if you're doing optical character recognition, it's very unlikely that over the period of time you train the model, people suddenly start writing in an amazingly different way. However, if you're going to use learned models for things like employment decisions or admissions decisions, where the goodness of what you have done takes years to become evident, the assumption has to be that you're in a static society for your models to work. And that's almost never true. So being data-driven can actually become an impediment to societal change, in exactly the dimensions along which you might be looking to change society. I'm not going to say more about correlation versus causation; I think enough people have said things about that, so I'm just going to
keep going. Okay, so before I get to equity, I want to spend just one slide on diversity. Often the result of some system is to create a selected group, and often there is a desire to achieve diversity in that group. This can be mathematically tricky, because diversity is a group construct: if you're applying scores or labels to individual items, the math gets interesting, because you're trying to achieve a group-level result based on individual actions. I'm not going to say more about this; it's related to fairness, but it's definitely not the same thing.

So I want to get to equity. Equity may sound very much like fairness, but it's actually saying that you treat people differently, and you intentionally treat them differently, to achieve comparable outcomes. Where fairness says "I want to treat everybody the same," equity explicitly says "I want to treat people differently," whether those are people, subjects, items, or whatever it is you're applying this to. There's a well-known cartoon that nicely describes this, with people of different heights standing on boxes to see over a fence. I must point out that the cartoon focuses on only one dimension of difference, namely height. When I talk about diversity and equity, you might object that there's no gender equity here, or that it's ableist because it assumes people can climb boxes and see over fences, and so on. But be that as it may, it illustrates the concept. Now, when one looks at equity and says we're going to do something different, like the unequal allocation of resources in that cartoon, it can sound very socialist: how are you going to give everybody according to their need? Because that's basically what that equity statement was. I want to convince you that there are many places in society where notions of
equity are deeply ingrained, and that we don't question them. It's fair to give every student in a class the same amount of time on an exam. Nevertheless, at pretty much every educational institution I know of in the US, there's a process for allowing extra time for students with particular needs. This is very much an unequal allocation of time; it clearly doesn't meet a fairness criterion of everybody getting the same, and yet it's pretty universal. The same applies elsewhere. For example, standardized tests are a fair way to measure performance or knowledge of a subject, yet our university has now stopped asking for GRE scores for grad students, and I think across the US there's a trend toward moving away from standardized tests. The reason for that is equity: there's been enough scholarship showing a strong correlation between test performance and socioeconomic status in particular, but also gender and race, in ways that suggest that relying on standardized tests would be inequitable in terms of the outcomes achieved. This is an explicit, societally important case where you've reconsidered what you want to measure. Really, the reason for me to talk about these everyday examples is to say: when you practice data science, you want to think about issues like this when you start framing your problem. Whatever problem it is you're studying, you've got to ask what data you are going to work with, what characteristics these data have, and in what ways they are likely to bias the results you get, because of the things you have chosen to focus on.

So now I want to talk about data equity. Fairness, or even validity, of a learned model requires, or assumes, that your training data are representative of the population on which you're going to run
the model after you've learned. Now, in practice, the way a standard learning process is set up, there's an optimization function that says: minimize total aggregate error. And if you're going to minimize total aggregate error, it is often best to meet that criterion by completely ignoring very small minorities. A 10% minority is probably large enough that it won't get ignored, though that depends on exactly what you're doing. But there are many cases where you have small minorities, low enough in population occurrence that the way for the model to optimize the stated objective function is to ignore that minority entirely and optimize itself for the rest. You fix this either by changing the objective function, with a more complex learning process, or by creating additional data points in the data, overweighting or oversampling from the small minorities. So basically, my point is that it isn't enough to worry about whether you have adequate or fair representation; equity might require over-representation.

I want to talk about four facets of data equity, and I'll say what each of them means as we go through. I'm going to begin with representation equity, which is pretty much what we were just talking about on the previous slide. Often, particularly in social data, groups that have been historically suppressed are suppressed in the data record, and they tend to be underrepresented. Just to give a running example, I'm going to talk about data collection and data analysis in the context of COVID. There were significant racial disparities, especially early in the pandemic, in the availability of testing as well as in the willingness of individuals to be tested, both of which tended to suppress data collection in African-American communities in the US.
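To make the objective-function point concrete, here is a small numerical sketch. The population sizes and the constant-predictor setup are my own illustration, not from the talk's slides: a predictor that minimizes total aggregate error can look excellent overall while being 100% wrong on a small minority, and oversampling that minority removes the incentive to ignore it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: 95% majority group (label 0), 5% minority group (label 1).
# All numbers here are illustrative, not from the talk.
y = np.array([0] * 950 + [1] * 50)

def aggregate_error(pred, y):
    """Fraction of examples a constant prediction gets wrong."""
    return np.mean(pred != y)

# A model that minimizes total aggregate error can simply predict the
# majority label everywhere and ignore the minority entirely.
best_constant = 0
print(aggregate_error(best_constant, y))      # 0.05 -- looks "good" overall
print(np.mean(best_constant != y[y == 1]))    # 1.0  -- 100% error on the minority

# One remedy from the talk: oversample the minority before fitting,
# so that ignoring it is no longer optimal for the objective.
minority = np.where(y == 1)[0]
oversampled = np.concatenate([y, y[rng.choice(minority, size=900)]])
print(aggregate_error(best_constant, oversampled))  # 0.5 -- ignoring minority is now costly
```

The other remedy mentioned in the talk, changing the objective itself, would correspond to weighting minority errors more heavily rather than duplicating records; most learning libraries expose some form of per-class or per-sample weighting for this.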
That suppression led to systematic biases in the collected data. If you then did an analysis of such data and made policy decisions based on it, it could lead to mistaken policy conclusions. So, if you think about a table, representation equity is about the rows: whether you have the right representation in the rows. Feature equity is about the columns in your relational table: you've got to think about what features you are collecting in your data set. If, in the process of delivering medical care, for instance, you decide that things like race and income are immaterial, because after all you're treating somebody's illness, so what should it matter, and later you want to perform some kind of analysis of racial or income disparities with respect to how the pandemic, or any other health issue, was being addressed, you're going to have a hard time: you'll have to impute the missing columns. So the question is, if you know that this kind of analysis is likely to be performed, or likely to be of value, then the right way to do this is to collect the data at the time, rather than attempt to impute it afterwards.

Access equity is about who has access to the data, or to the results of the analysis afterwards. With data of all types, even public organizations, government organizations, and NGOs tend to want to hoard their data; it's a source of power. Even in the presence of a huge emergency like the COVID pandemic, there were significant barriers to data sharing, partly because people still have their instincts of hoarding data, and partly because the systems just weren't set up to share. So when there is a view toward not sharing, there are barriers that then have to be overcome, and those may be technical. This kind of thing, by the way, is a big issue right now. Everybody's talking about large language models and the impact they're likely to have on society, and OpenAI, when it
started, called itself "open" AI and said that they were going to be open. But as soon as they had their recent successes, all the standard things kicked in: they have a for-profit company, they are not open, and they have a proprietary advantage in the language models they hold. If this is something that we as a society are going to use, we've got to think about who has access and what we know about what's inside the black box.

The next thing I want to talk about is outcome equity. There are frequently unintended consequences, and we have to think about how we monitor and mitigate them. In the context of COVID, the best example I could come up with had to do with contact-tracing apps, especially early in the pandemic. But to me, the place where this is a real societal issue today has to do with all of the web scraping and web-based profile creation that lots of companies do. One place where there is well-established data collection about what we do, in a way that affects our lives, is our credit reports. These have been around for decades. There are well-established processes for each of us to see what's in our credit reports, and well-established processes to contest what's in there. And in spite of all of that, it's been noted that approximately 20 percent of the US population has a significantly credit-affecting error in their credit history, where "significant" is defined as something like 75 points on your credit score. That is the error rate in a well-established process. For profiles that companies construct from best-effort scraping of the web, the error rates are going to be much, much higher. And if these are being used to make employment decisions, financial decisions, even
dating decisions, it's really problematic, because there's no recourse. Let me point out one more little vignette. If you're trying to build a model to predict how sick some patient is and how much care they need, then how much care they have already consumed is a very good predictor. It turned out that in triaging patients in a hospital, figuring out who needs additional care, this was one of the heavily weighted criteria in a standard model that was being used. A very influential paper a few years ago noted that algorithmically driven care allocation in hospitals was discriminating against African-Americans. The reason, it turned out, was that the model was constructed to predict greater need for care for people who had already consumed more care. So if somebody had consumed less medical care in the past, whether because of availability, resources, fear, or dislike, they were assumed by the model to be less sick. This is the kind of thing where you really have to think; this is the GRE exam story again.

One thing that is standard, for those of you who work with fairness in AI, is to say: I've got my machine learning problem stated as an optimization problem, and now I realize that, as originally stated, this problem doesn't give me the fairness or equity or whatever goodness properties I'm looking for. So what I'm going to do is throw some additional constraints into my optimization problem. With these additional fairness or equity constraints, I'm now going to solve the
optimization problem. What this explicitly says is: I'm going to give up on the goodness of my algorithm, because I've added more constraints; by definition, a more constrained optimization won't achieve as much as one without those constraints. And this, I think, is a very poor way of modeling things, because you've set the problem up so that an artificial score stands in opposition to the fairness or equity property you're imposing as a constraint. The question to address is: did I model things correctly in the first place? Am I scoring the right thing? Usually, if you correct those things, you will have a much better holistic solution to what you're doing, rather than this mathematical dance, which I think is not the right way to address it.

Let me say a couple more things and then stop. I've said a lot about how to do data science right with respect to equity. I want to point out that none of that should be taken to mean that data science is bad, or that algorithmically driven decisions are bad. We've also got to think about the alternative, and our baseline is human decision making. In terms of that baseline, we have to recognize that we are all biased. As hard as we might try, we have deeply ingrained implicit biases; we all have triggers of various types that we react to emotionally without even knowing we're doing it. And if you're clever enough, and most of us are more than clever enough, we'll come up with fantastic post-hoc explanations for decisions that were actually made intuitively, in a completely biased way. So with human decision making there are lots of biases, and they are hard to measure and very hard to prove. With algorithms, at least to the extent that you can apply mathematical definitions for whatever it is
that you're trying to accomplish, you can measure something: you can run the algorithm against a million instances and check things out. So there is some scope for doing things that could, in some ways, be better. This isn't going to happen by itself, and it's certainly not going to happen without care; if you do bad data science, you can create algorithms that behave very poorly. But I actually am optimistic that algorithmic decision making can make us, as humans, less biased. Not just that algorithms are equal, or good enough, but that they can actually be better.

So this is the hype, and companies propagate it too. In practice, I think this is more like what things look like: there are many, many steps in the data ecosystem that you have to deal with before you get to model building, and after you build your model you've got to interpret the results you get. If you're going to worry about things like fairness and equity, it isn't enough just to look at the model: if you're dealing with bad data, you're going to get bad answers. So what I want to encourage you to do is think about all of the steps in this pipeline, and think about how things could go wrong. I have a lot that I could say about each of these steps, and in the interest of time I'm going to flash through them. Data cleaning, for instance, is the place where most people with experience spend the most time. An important thing about cleaning data is that there's an assumption that you know what clean data looks like: you're filling things in, you're correcting things, and if you don't know what good, clean data is supposed to look like, you can't clean data. And people make very strong assumptions. For example, missing values are
almost never missing at random. There's a reason why things are missing. That reason might be, for instance, that some kind of data was difficult to collect, or that some people didn't want to reveal certain things about themselves. Assuming that the missing values are distributed like everything else, which is a very standard data cleaning assumption, gives you a data set that looks clean but has a lot of garbage in it. Let me go through an example. You have a customer database, one record per customer, with a gender field, and when you look at what you got, you find that 48% of your customers filled in M, 32% filled in F, and 20% left it blank. What are you going to assume about the 20%? One simple solution people use is to ignore these records because they're incomplete. But if that 20% represents a fraction of the population whose characteristics or needs differ from the 80% who filled the field in, then that is an inappropriate thing to do; it's actually a pretty bad thing to do. Another thing you can do is impute the missing values, and then the question is: how do you impute them?

Representation choices matter, too. There's often a lot of pre-processing to represent things: you might bucketize values, or define geographic boundaries, or ascribe sentiments to text, and often you don't question how you did that. But you can get very different answers depending on how you do it. You've all heard of gerrymandering in the political context; the same thing applies to any data analysis, and even if you're not doing it intentionally, you might just be doing it thoughtlessly. Okay, I think I'm going to skip to the end.
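A tiny sketch of how much the handling choice matters in the gender-field example above. The 48/32/20 split is the one from the talk; the helper function and strategy labels are mine:

```python
import numpy as np

# Hypothetical customer records mirroring the talk's example:
# 48% filled in "M", 32% filled in "F", 20% left the field blank.
gender = np.array(["M"] * 48 + ["F"] * 32 + [""] * 20)

def shares(values):
    """Fraction of records in each category."""
    cats, counts = np.unique(values, return_counts=True)
    return dict(zip(cats, counts / len(values)))

# Strategy 1: drop incomplete records. The 20% simply vanish, which is
# bad if they differ systematically from the rest.
dropped = gender[gender != ""]
print(shares(dropped))     # F/M shares become 0.4 / 0.6

# Strategy 2: impute the most common value ("mode imputation").
# Every blank silently becomes "M", inflating that group to 68%.
imputed = np.where(gender == "", "M", gender)
print(shares(imputed))

# Strategy 3: keep the blanks as an explicit "unknown" category,
# which at least preserves the fact that we don't know.
explicit = np.where(gender == "", "unknown", gender)
print(shares(explicit))    # 0.48 / 0.32 / 0.2
```

Three defensible-sounding cleaning choices yield three different pictures of the same customer base, which is exactly the "looks clean but has garbage in it" risk described above.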
Jing mentioned this MOOC earlier, so I just wanted to give you the link; you can also Google "data science ethics." And with that, I think I'm going to stop and allow a couple of minutes for questions. [Applause]

[Audience question] When you talk about fairness in terms of access to data, it feels like that's only one step. One of the challenges we're facing with a lot of data-driven projects is that it's great to make data accessible to everybody, but that doesn't solve the fact that certain institutes, organizations, and groups have many more resources with which to be successful. Do you have thoughts on how to go beyond access? Particularly in a world of big data, what can we actually do to make things more equitable?

I'm going to answer in the context of this slide. There's a subfield of data management called provenance, and the idea is that you associate provenance with every data item. The theory, and even the algorithms, for how to do this are all pretty well known, though there's a certain overhead to doing it. At the end of the day, if I give you the results of some analysis, you can ask me for their provenance, and if I have an appropriately configured system, I can give you the full provenance: exactly what data I used, how I used it, what I did with it, the whole caboodle. So the methodology is known, but there are two major problems with actually using it in practice. First, there are proprietary concerns: if there's any profit-making involved anywhere, it's unlikely that there'd be a willingness to share at that level, and it's unreasonable to expect it. There are also potentially privacy concerns, if there are individuals whose data were involved. So even if there weren't proprietary concerns, that
would be a barrier. But then there is the other issue: if, as a user, you ask me for an explanation of a result, and I hand you 50 gigabytes of stuff, you can't figure out that gobbledygook; it's completely useless. So what one needs is an appropriate level of explanation, to be able to develop that kind of transparency, and that is still in its infancy as an academic question. There isn't enough of an understanding of what constitutes a good explanation, though there is a robust and growing field trying to do this better. And I think that's where we're going to have to land: there will be proprietary material, there will be things that are privacy-limited, and we still have to achieve adequate understanding and adequate trust.

[Audience question] We often assume that the algorithmic approach is the way to go, yet occasionally the algorithmic approach, the technocratic approach, is in and of itself the one that introduces bias, particularly when it is designed in the absence of community participation, stakeholder participation, the engagement of those for whom the solution is intended as part of the group of solution designers. Can you comment on "do no harm" in the decision to use an algorithmic, data-driven approach in the first place, and on gauging the value added versus the harm done in applying these kinds of solutions?

Yes. I think the reasons the adoption of algorithmic methods can do harm are either a representational harm, because the right kind of data is not included, or an access harm, because, for instance, the system assumes that people have access to a cell phone, and people who don't have cell phones can't use it. So my belief is that if one goes through systematically and addresses the
equity issues of the type that I talked about, and I'm not going to claim that my list is complete, but it is as complete as I know today, then addressing them adequately will usually require understanding the subjects involved. Community participation, for instance, is a time-honored way to develop that understanding. Then, hopefully, we will avoid the kind of misfires you're talking about. In other words, I don't believe that it is the use of the algorithm in itself that's the problem; it is the use of the algorithm in the absence of having addressed one or more of these issues that causes it to be twisted in some way. And if you don't agree, we can argue about it over a beer.
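As a footnote to the provenance discussion in the Q&A above, here is a minimal sketch of the idea of associating provenance with every data item. The `Tracked` class and its method names are hypothetical illustrations, not a real provenance system:

```python
from dataclasses import dataclass, field

# Minimal sketch of item-level provenance: every derived value carries
# the history of what produced it, so an analyst can later ask "where
# did this number come from?" (names and steps are illustrative).

@dataclass
class Tracked:
    value: float
    provenance: list = field(default_factory=list)

    def apply(self, step_name, fn):
        """Apply a transformation and record it in the provenance log."""
        return Tracked(fn(self.value), self.provenance + [step_name])

raw = Tracked(10.0, ["loaded from survey_2023.csv"])
cleaned = raw.apply("outlier clip", lambda v: min(v, 5.0))
scaled = cleaned.apply("rescale to [0,1]", lambda v: v / 5.0)

print(scaled.value)        # the final number handed to the analyst
print(scaled.provenance)   # the full history of how it was produced
```

Real provenance systems record far richer metadata, and the talk's caveats apply: the full log can be proprietary, privacy-sensitive, or simply too voluminous to serve as an explanation on its own.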

2023-06-12 11:07
