Connext 2021: "How Data Science and Artificial Intelligence Are Changing Business"
[MUSIC PLAYING] Good morning, good afternoon, or good evening, wherever you are in the world. Welcome to our session on how data science and artificial intelligence are changing business. My name is Rick Bullock.
I'm a product manager at Harvard Business School Online, and I'm delighted to be joined by Professor Yael Grushka-Cockayne. Professor Grushka-Cockayne is a professor of Business Administration and the Senior Associate Dean for professional degree programs at the University of Virginia's Darden School of Business. She was previously a visiting professor at Harvard Business School, and she recently developed the Harvard online course, Data Science for Business.
Thank you so much for joining us, Yael. Thank you so much, Rick, for having me. I'm excited to talk to you about the topic and I'm excited about the course being on the platform. It's really great to see it out there and to engage with the learners. Yes, it's wonderful. So I thought we'd just get started just by sort of level-setting everybody just to sort of get everyone on the same page as far as the terminology of data science-- artificial intelligence, machine learning.
If you could just sort of give a sense of how these fields and concepts are related, and what are the key differences that we need to understand? Yeah, the differences are subtle and at some point, also, we can debate and be pedantic about semantics, but I think the bigger picture here is to know that the field of data science is definitely related to all the action and activity in machine learning and artificial intelligence. And those terms get used interchangeably. So first word of advice from me, at least from my perspective, is don't get overly obsessed about the definitions, but have a conversation with whoever you're discussing it with to understand a little bit better what exactly they are implying.
Typically, artificial intelligence is the broadest definition, in the sense that it's very general. It's not a new term; it goes back to sci-fi in the '50s and the '60s, where there was a vision that machines will start to imitate human reasoning, and that's very, very broad. And I deliberately define it as broad because at its heart, that's what artificial intelligence is. These days, most of the artificial intelligence that we see around us gets done and executed through machine learning, meaning through using machines and robots to learn, and often learn from data how to imitate that human reasoning.
But it doesn't have to be that way. And so a good example, and again, some might even argue with this, but a good example is if you train a machine to think or to play chess against a human expert, then you're training a robot or a machine to think like a human and to kind of have the same kind of logic and reasoning so they can beat them at their own strategic game. Machine learning, typically today, relies on the data and there is a lot more use of looking at data in order to make certain conclusions. And then finally data science-- data science is probably the most hands-on applied term of those three. It's not so much technical, but data science really brings together some machine learning, so it uses algorithms that use machines. A lot of computing power, a lot of sophisticated algorithms that run really fast.
It uses a lot of data and statistics, and it always has a domain. It's applying. It's using those algorithms in a context.
In the context of my course, that's data science and business. So there has to be a domain there, and in my case, it's business. But you can see data science in other areas-- in agriculture and in medicine.
There's other areas that you apply the concepts, but you always have a domain expertise. What domain are you relying on to look at the conclusions from running a certain machine learning algorithm with data, and that's data science. So I don't know if that helped. I hope that helped. Yeah. Yeah, for sure.
Yeah, and so focusing on the business domain, what are the most exciting kind of new applications that you're seeing? And I guess just to put it in the world we live in now, have there been any changes or acceleration, maybe, since the pandemic? And how has the pandemic affected the adoption of data science and AI in business? OK, let me take those in parts, because there's a lot there. I feel like we can talk for hours just on that question alone. Let me start by answering your first part, which was some exciting applications of data science and business. So I'm really excited at what we're seeing today, which involves, I would say, less obvious domains and maybe more traditional, kind of lower-tech industries that are getting invigorated and excited about the application of data science. So for instance, a domain that I do a lot of my academic work in is project management and operations.
And we're seeing a lot of use of data science in that context. Running projects more efficiently, predicting projects in a much more rigorous way using historical data. In operations, airports, transportation infrastructure-- all of those types of operations are pausing and saying, OK, a lot has been done. How can we then use that to apply that in our context? And that's really accelerating industries that haven't seen much change or development for many, many years. So it's exciting to see. This happened a few years back to agriculture.
We know that agritech has taken the world by storm, which has been very promising and exciting, and we're seeing that now in these other kind of adjacent fields like construction project management, transportation infrastructure. Of course, retail has been one of the first players and fast movers in the use of data science, with everything to do with personalization. A lot of apparel companies use data science in creative ways, not only in terms of personalization and customer access, but even in their own new product development, which is exciting to see.
How do we think of the new products? How do we physically develop them in new and creative ways? How do we detect the quality of our products? All of that is being infused with data science, and it's exciting to see where that's going. Another field that has embraced and absolutely broken barriers is banking. I know it doesn't necessarily jump out as like, oh, this is where the tech is, but if you look at some banks around the world-- I do some work with Commonwealth Bank in Australia-- what they are doing in terms of their use of data is groundbreaking.
They are rethinking their whole engagement with their customers and how to use information that flows in the various channels. How to detect patterns of interest from their customers in order to serve their customers in a better way. So it's everywhere and it's exciting.
And I can list other domains and other companies specifically. I'll pause and go to your second question, if that's OK. Yep.
So if I think about the pandemic-- so one obvious area that I think has accelerated tremendously and has benefited from some of the data science that we have has to do with drug development. So vaccine development, drug development, new product development. That typically is very regimented and slow and very much relies on trials and phases. There's a lot of use of data science along the way. Use of data science to think about conducting more efficient trials using historical information to learn about patterns, understand and manage expectations around probabilities of success, develop more efficient supply chains, predict outages and needs at different locations, predict transportation and supply chain peaks and troughs. All of that is being automated, and not only automated, but developed in a more sophisticated manner with better predictive power, which is at the heart of what a lot of data science is.
And we're all the beneficiaries of that, because we are seeing what the world has been able to accomplish. A year has been a long time, but we also see so much development in this past year, and we're on the verge of a much brighter future with that in mind. So it's all around us. That's probably the biggest area. Maybe if you were to press me for another one, I would say that we're seeing great and exciting developments in terms of online learning. So I'll talk about HBS Online and the amazing courses that are being developed.
There's a lot of growth and innovation related to how to use data science to improve the learning experience. Everything from targeting the right students to be in the class to developing more engagement tools that better detect how a student engages, how a student collaborates online, how we can improve the experience moving forward. Lots of tools to help professors improve their teaching and learners learn better.
So yeah. Awesome. So you mentioned the application to industries or contexts that aren't necessarily kind of digital first, but are becoming more digitalized. So I guess, what are the ways in which this digitalization is changing the ways companies are collecting and kind of using data? I think-- well, first it's interesting to see that if you are, for instance, engaged in a startup-- if you're starting afresh or if you're in initial phases, data is much more top of mind.
You think about the data that you're going to need today, but you also think about your data needs several months out in order to prepare and start off on the right trajectory, in the right way. Companies realize that every step of the process, every engagement of an employee, or every engagement with a customer is an opportunity to collect information. And so the starting point-- our savviness in terms of our awareness is way higher than it used to be. It makes sense.
Storage is no longer a concern. So it used to be the case that when you started off, especially a young company, you didn't have that much space to store a lot of data, so you had to make choices. These days with cloud computing, companies barely have to make trade-offs and they can collect a lot of data. They also employ people who understand how the data is going to be utilized later on. And so if they think about the databases that they need to set up, the access, the uniformity of their data collection tools, if that is all part of the conversation early on, that saves a lot of time later in their maturity as they try and use the data in a more sophisticated way. So data awareness is much higher.
Tools and capabilities are more readily available, so you hire more individuals who have some SQL background, or they've experimented with Tableau or Power BI, or one of these visualization tools, or they've tried to run some code and use some models. Many more people have some of that experience. They feel like they can relate to it, and they can ask the right questions in order to progress their thinking. So it is fundamentally changing the way that organizations are working. And again, one of the examples I sometimes like to give is I've been involved in a project with Heathrow Airport in England.
And if you think about where Heathrow was five years ago compared to where they are today, their entire mindset has shifted in terms of how the data needs to be stored and collected and shared between all the different stakeholders that are involved. That's from the airline, the airports, the baggage handling system, the security. It used to be very fragmented, very much in isolation, and today you can no longer progress in that way, and therefore everybody's thinking about having a much more robust infrastructure at the organization level. But that presents just a big sort of change management challenge for big organizations. You can hire data science kind of expertise, but you also have a larger organization and those who don't have that capability, don't have that language.
So I guess, are there some other examples of organizations that have organized themselves or kind of developed a culture to be data-driven, even for those that don't have the background of [INAUDIBLE]? So I've seen a couple of models work, and you're right. It is definitely a challenge for a company that perhaps has been around for a while, has more legacy systems, and has to go through the transformation.
Sometimes these organizations-- there's several levels of maturity, OK? And so one approach to get to a higher level of maturity in terms of the databases or data science in general is to really invest in a central set of resources that grow in the center of the company and have touch points to the various units. That works. And I've seen that model. And for instance, Anheuser-Busch started with hiring a group of data experts.
I think they were located primarily in India. And that unit grew and grew and grew, and continued to support AB InBev and Anheuser-Busch internationally, to the point that then they could move out of the origin of a single unit and actually infuse themselves in the various parts of the organization-- data scientists that are supporting the agriculture and the crop yield, and data scientists that support the more marketing and customer-facing activities. But they started from a unit in and of themselves. And so that's one approach. I've seen other organizations that were more successful with a very concrete project. Let's bring a team in to answer a very specific task.
Let's start with a very defined product. Let's see it to the end, and then we can, after we've developed that proof of concept, we can expand and see how this can then be utilized across the organization. And that's sometimes useful in cases where not everybody may be on the same page or there's more skepticism across the organization, or it's not quite clear how these buzz words relate to your organization.
And so starting with a very concrete deliverable, a product that has a very specific use case and business case, then helps organizations start to open their mind and to then grow organically from that one application. But rarely have I seen a case where organizations basically bring various data scientists to the various units and have them more decentralized. You kind of have to start with a centralized effort before you decentralize it across the organization. I think ultimately, you probably want to get to a point where it's decentralized, but it might take a while to get there. Fantastic. So that's a great perspective on it at the organizational level.
Looking at it from an individual professional's perspective, and someone who doesn't have a background in data science, what are the kind of key skills that they should focus on developing to be data-driven themselves, and you know, to work effectively with the data teams in their organization? One of the most important skills in my mind is this willingness to understand, again, the business domain. I cannot overstate that enough, because your role as a data scientist is a lot about translation. Taking something very technical, very much an algorithm that you write in code, and linking that to the business. Making sense of it in the context of your business, using it to drive decisions. And actually, that's a non-trivial part of the process, and one that a lot of the success or failure of a data science initiative will depend on.
How well were you able to find a business problem, seek the data that you need to get close to answering that question, do some analysis, and then feed it back to the business to have an action-oriented mindset? That is a very non-trivial process, and folks that are successful in that understand that their role is to really take the technical part and translate it into the business to make sure that people in marketing, sales, finance-- people in the organization understand why and how it's bringing benefit to the organization. So take some time to understand how your business operates. What is it that you're trying to do with this analysis? What is it that you're trying to predict, and how does that prediction affect your organizational success? That could be an analysis of your team dynamics if you're talking about like a people analytics project, but that can be some analysis of the manufacturing plant if you're talking about a production company. So it's really on the data scientists to understand the domain. And sometimes, or rather very frequently, data scientists, initially when they are trained, over-focus on the technical components.
They want to know all the latest machine-learning algorithms. They want to understand how to hack away at a leaderboard on Kaggle and compete their way to extra predictive power, all of which is great and a good data scientist needs, but in the wild, in the workplace, you will rise and fall and you will succeed on how well you understand what the goal of your business is and how to then utilize data science there. So really, you need as much passion for the business and the field that you're in as you need for the data science side of things.
So if you like sports, find a job in sports analytics. If you are into retail, lots of luxury goods have great and exciting data science applications. If you like manufacturing or robotics, find your position there.
If you're drawn to HR, then people analytics is a very flourishing and exciting domain to be in. So there's data science everywhere. Match it with your interest, because it's going to be very important that you can execute all the way. Fantastic.
Thanks. Just wanted to turn to data visualization and talk about the importance of visualization, and kind of what's the role of visualization in the analytics process? At what stage or stages does it come into play, and how do you do that successfully? So there's a reason why data visualization is a really hot topic these days, and it's risen to everybody's awareness, along with data science and machine learning and the like, because as we grow in our data, meaning as we have more and more data and as we use more sophisticated programming languages to analyze that data, we're moving away from Excel, where we can see all the data on one screen and you have your 50 rows and you have your 10 columns and you can kind of eyeball it. That's no longer the case.
If you have heaps and heaps of data, if you have a lot of updating data that changes daily, you cannot grasp it in the same way that you used to. And so in order to support that, in order to kind of keep finding new trends and new insights and new ideas, visualization is your friend. That is the way for you to do it, and you can do that pretty dynamically. You can link a visualization tool to a database at the back end and look at charts daily to see trends changing or to bring in multiple dimensions.
We have so many more x variables or columns of information about each and every feature that we need multiple dimensions, and some of these sophisticated visualization tools can do that. They can show us five, six dimensions at once in terms of geo and color and size and time, and it's just such a rich set of tools that allow us to, in a very flexible, dynamic, and exciting way, get a better understanding and a feel for our data. It also allows us to brainstorm hypotheses, and then try to find some initial evidence-- some evidence in the data. If you have some hypothesis, then you should be able to dig in with the visualizations to start getting a sense for, are you heading in the right direction? Or are you totally off-base in terms of your understanding of your own data? And maybe I should have actually started, and my first comment should have been, if I think about the sequence in which things happen, one of the first parts of visualization, and we talk a lot about this in the course.
One of the first steps along the way that visualization is really very useful for is to even understand the quality of the data. Are you missing a lot of data? Or is the data clean? Are there values that don't make sense? Are there outliers? I'm very hesitant to call them outliers, but are there extreme values that catch your attention, that you're thinking to yourself, hmm, I should dig in and understand what's going on here. It really gives you a way to dig in and understand your data better before you even do any analysis of any kind. And then you can build on those visualizations to check your intuition, check your predictions, improve your models. It really accompanies you along the entire data science process.
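The data-quality questions raised here (how much is missing, which values look extreme) can also be asked numerically alongside a chart. The sketch below is only an illustration and is not taken from the course: the `summarize_column` helper, the `delays` values, and the cutoff `k` are all hypothetical, and the median-based rule is just one reasonable way to surface values worth a second look.

```python
from statistics import median

def summarize_column(values, k=3.0):
    """Count missing entries and flag values far from the median.

    A value is flagged when its distance from the median exceeds k times
    the median absolute deviation (MAD). The cutoff k is an assumption.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    med = median(present)
    mad = median(abs(v - med) for v in present)
    # Flagged values are candidates for inspection, not automatic deletions.
    extremes = [v for v in present if mad > 0 and abs(v - med) > k * mad]
    return {"missing": missing, "median": med, "extremes": extremes}

# Hypothetical column of delays in minutes, with two missing readings.
delays = [12, 15, None, 14, 13, 240, None, 16, 11]
report = summarize_column(delays)
print(report)
```

In keeping with the hesitation about the word "outlier", the flagged values are a prompt to dig in and understand what happened, not a list of rows to drop.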
And maybe final pitch for visualization-- these days, a lot of the algorithms or prediction is done through code, as I mentioned. And code is not-- while you can get people comfortable looking at code and you can kind of create this dojo atmosphere, like I like to talk about it, where people are comfortable watching each other code, that's not typically where you would go if you had a broader forum of individuals that you're trying to show some analysis to. And so if you're trying to present some results to either a group of leaders in the organization, or a group of your peers, or even to your team, visualization is typically a very welcoming tool that allows a lot of individuals to get a quick instinctive kind of reaction to the data, and everybody feels a lot more buy-in once they see some visuals. So it's a nice way to welcome people into your analysis, as well. So the tools that make data science so powerful can also be intimidating for those who don't have a background in programming.
So what are the key things that non-data-scientist professionals need to know about the programming languages that are most prevalent in data science, and just, I guess, to know enough about the tools to be able to work effectively with data scientist colleagues on problems? Great question. First, like any other language, it is a language, meaning once you immerse yourself in it, you do learn pretty quickly some basics. And so the barriers are similar to learning just a new language. So if you think about it, you know, how can I learn R or how can I learn Python? Well, how would you go about learning French or Spanish or some foreign language-- foreign to you-- that you've always wanted to learn? Well, you immerse yourself. Often you immerse yourself in that experience. You recognize that at first, it's going to be a little bit slower than you maybe want, and you're kind of going to keep on going back to your natural language often because you don't have the vocabulary.
When you teach people R, they often open Excel, and that's them going back to their native language. But over time and with exposure, you remember and recall more and more, and then it's easier for you to accelerate up. The whole purpose of moving from a more Windows-Excel kind of point and click environment to a programming language, it's helpful to know why this shift has occurred. And one of the reasons the shift has occurred, there's probably two or three main reasons, but one of the reasons that the shift has occurred is that it's easier to keep like a documentation of the steps you took and it's easier to replicate. So if you're running a certain process and you're generating certain predictions or certain insights, if you want to redo the analysis, in Excel it's very hard to do. Like if you did some steps and then you close it down and you open it again, I'm sure that you've been in that situation where you're like, how did I get there? What did I do? And it's very hard to remember.
It used to be, in the '90s, we would record macros, but those are not very efficient. And so what's so beautiful about writing code, whether in R or Python or C or C++ or anything that you fancy, it's really this idea that you can open up the code and it's all written down. There, it's documented, meaning the code itself-- you can rerun it fairly easily, and you can get to the same place, and so that replication is very helpful. And in a world where we're doing these processes often, and we want to do it with the most recent data, and we want to do it with a lot of data, that's where you start moving to programming, because other tools cannot cope with that. They can't process things as fast.
They can't process as much data, and they're hard to replicate over and over. And then a final kind of plug for these domains is that yes, it seems intimidating at first, but one of the beauties of the platforms and the software tools is that very often they have some component of, like, crowdsourcing-- R is completely open source, so it's completely free, and everybody can contribute to it. And Python, it's free, but there is some more regulation around contribution, but still, many individuals can add to it. And so if I develop a process or a function that creates a certain calculation and you want to use it, you can use my development. It's a great place for people to share, which is why it's accelerated so fast, because the community keeps on chipping in and building on each other, and we create all of these libraries that we can all enjoy. And so the mindset is a welcoming mindset.
It's about inviting people in to generate more and more knowledge, and that's why there's a very vibrant online community. So one of the biggest gifts I can give my learners when they're learning data science is the ability to google correctly and know what to look for in Google, because it's an ever-evolving field. The answers are out there, and people typically answer each other and help each other out because it evolves so fast. And then they go back to the community when they need their own answers. So it's part of the process to know that there are no barriers like that: just reach out and somebody will answer your question.
Wonderful. What are the most common mistakes that you see business professionals make as they're kind of moving beyond the spreadsheet to really large data sets to unstructured data to trying to kind of visualize new sources of data? What are the kind of big pitfalls to watch out for? Yeah, so again, this kind of is continuation of the promise and the excitement and why we're moving in these new directions that require more sophisticated programming and more tools is because we can now move beyond just tabulated data or quantitative data, but we can work with a lot of unstructured data like you mentioned. Voice, images, free text, all of that is like totally legit sources of information to use in certain predictive tasks. And part of the mistakes I guess that I see is that people try and add too many bells and whistles immediately.
Like, they try and they get overwhelmed by all of the possibilities that are presented without actually making it through to the end. And so there's a lot of interesting analysis and research now that considers where the missed promise is, because there are some areas where folks are feeling like there isn't as much delivery on the promise that data science has to offer. And most often companies struggle when they're trying to accomplish too much from the get-go.
So develop a proof of concept, start from the data, and show that you can get to a final set of predictions. Maybe those predictions aren't very good quite yet, but show that you can take the process through. Understand what you would do with those numbers, with those predictions at the end. Where would you use that in your organization? And then go back and try to improve your algorithms and improve your predictive power. But there is some discipline to saying I'm going to make some basic assumptions.
I'm going to run with it, see how far I can get, and then go back and try and refine my assumptions in the background. We sometimes get paralyzed because there is so much data out there. There are so many different directions we can take it that we're trying to accomplish too much at once and we're not seeing anything through to the end. So I mentioned at the beginning I teach data science. I teach project management, as well, and I do a lot of research on project management.
And it's interesting to see how with the emergence and the abundance of data science applications, the field of project management is also growing, because folks are recognizing that we need to deliver. And there are dangers that you don't deliver if you aren't focused on the final deliverable and a project mindset. So that can be helpful sometimes. And then what are the key kind of sources of bias that professionals undertaking analyses need to be mindful of, and how can individuals and organizations proactively try to mitigate the risk of bias in data collection and data analysis? So first, it's a great question.
There is a lot more awareness these days that using machines and using artificial intelligence does not remove any biases that human intelligence may have. In fact, in many instances, using machines could actually exacerbate the problem. So the awareness is there.
By and large, the awareness is there, and that's a great place for the industry to be. Now overcoming these biases is not always easy. And we're still, sadly, blindsided by instances where we thought we had it under control and we don't. So biases emerge for many different reasons. One reason could be that you just don't have data in your data set. So if you don't have a source of data that represents, for instance, a set of decisions or a set of choices made by a sub-population, if it's not in the data, then you won't know how to predict.
An individual from that group that you're trying to work with, you won't know how to come up with predictions for that individual because you don't have examples from the past. So if your data is segmented, you will struggle to account for the entire rich set of potential customers or employees that you want to cater for. Even if your data is comprehensive, your algorithms are based on patterns in the past. So based on observations in the past.
You could have biases in there that any algorithm will only reproduce, because the algorithms learn from data and they pick up on patterns from historical data. So if the data exhibits some biases, you will only exacerbate them by using your algorithms without pausing for a moment to reflect on that bias that exists. And then finally, it could be sometimes that you introduce a bias because you have a variable in your data set that, while it is not the variable that you're concerned about, is highly correlated with it. And so if you're using a different variable and you're saying, oh, I'm ignoring gender, let's say. I'm just giving an example.
I'm ignoring gender. But you have some kind of other substitute in there that actually is highly correlated with gender then you're just going to reinforce the bias through that other correlated variable. And so you have to think about, not only what don't you have or avoid what you're trying to avoid, but actually what is the nature of the relationship of that sensitive variable with what you do end up having in your data set. So there's many check points that you have to consider. You have to always look at some-- it's not only enough to take the process through and get predictions, but you have to look at those predictions, go back to the visualizations, examine those to see how they look and to check them against any biases that you're trying to avoid and maybe account for.
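The point about a substitute variable can be made concrete with a small check: before dropping a sensitive column, measure how strongly each remaining column correlates with it. This is a minimal sketch under stated assumptions, not a method described in the conversation: the helper names, the toy columns, and the 0.7 threshold are all hypothetical, and Pearson correlation only captures linear relationships, so a low value does not prove a column is safe.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def find_proxies(sensitive, features, threshold=0.7):
    """Return features whose absolute correlation with the sensitive
    column meets the threshold (the threshold is an illustrative choice)."""
    flagged = {}
    for name, column in features.items():
        r = pearson(sensitive, column)
        if abs(r) >= threshold:
            flagged[name] = round(r, 2)
    return flagged

# Hypothetical toy data: the sensitive column is excluded from the model,
# but one remaining feature tracks it closely.
gender = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "part_time_flag": [0, 0, 0, 1, 1, 1, 1, 1],
    "years_experience": [3, 7, 5, 9, 4, 8, 6, 10],
}
proxies = find_proxies(gender, features)
print(proxies)
```

A flagged column is not necessarily one to delete; it is a signal to look at the predictions and visualizations for that group, as described above, and decide how to account for the relationship.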
And then it's an iterative process, then, if there are any identified biases, go back and try and refine it with either new data or tweaks to the algorithm to lead in a different direction. And then what do you see as the most promising opportunities kind of on the horizon for organizations adopting, you know, new data science and AI capabilities? You know, what are you excited about in the next three to five years? Oh, so many things. So many things. So first I'm excited-- I hope-- or my sense is-- that some of what we've seen in the past, let's say decade or so, it's been-- challenges related to data have been met with applying new techniques and getting into a new mindset of data science.
So organizations have been having to deal with both of those things at once. I believe that we're in a different place in terms of data maturity, and now organizations can take that information and take it to the next level. So for instance, I do some work in drug development, and you can see that there's so much more data now-- some of which is published by the FDA, some of which is published by the companies themselves-- in which now we are at a point where we can take a much broader perspective and we can develop more sophisticated tools because the data has reached a maturity. We can rely on it. It's out there, and we can now take it in different directions. So I think, generally speaking, I'm excited because the need to reimagine our data thinking, I think we're at a high level of maturity now that we weren't maybe five, 10 years ago.
As I mentioned, I'm excited about the field, specifically about the field of project management. I'll come back to that because I believe that there's a lot going on there that is going to just absolutely transform the industry. And personally, I find that exciting, because for instance, in the US, we're seeing a huge investment in infrastructure right now and a huge need for better and improved tools. And I think that the benefits that we're all going to see are tremendous.
Personalized medicine-- we know that it's coming. We know that we need it. We know that we need personal treatment because potentially, some of our thinking is that maybe we can avoid these types of pandemics moving forward if we had more personalized information and personalized care. And so I believe that we're heading in a healthier direction, in a good direction, that will make us stronger long-term.
And then how can business professionals who are just sort of entering the data science ecosystem, learning about this world, how can they sort of stay up to date on trends and kind of what's the latest in data science thinking and AI thinking? So textbooks are not as helpful. It's tempting. There's a lot of books being published and it's fun to buy books, but online is where a lot of the action is, so definitely social platforms. And again, we're getting better at that, too, and thanks to the pandemic, that has accelerated. But Clubhouse and Conversations and Twitter and LinkedIn and different forums and conferences-- they're all much more accessible than they used to be. And belong to associations and belong to groups-- professional groups that kind of help you stay in touch. Some folks compete on platforms like Kaggle and participate in all kinds of competitions and belong to communities, because that's a place to constantly reinvent yourself.
And it's an area that the field has always relied on. And I think as someone progresses in their lifecycle as a data scientist, they can rely on those communities for different things. At the beginning, that might be to demonstrate capabilities, to find their first job. But then as they mature themselves, they can stay in touch by entering another competition, joining a team, or supervising a group of individuals or maybe students that are working on a new competition to learn the latest and the greatest. And I think that that's a way to kind of stay in touch, and the community embraces that.
That's part of the community-- that constantly is giving back. Yeah. Awesome. Well thank you so much, Yael.
We really appreciate it and we appreciate the conversation and all of your insights on this important topic. My pleasure. Thank you for having me, and thank you for all the great questions.