What's Now and What's Next for AI in Assessment?
Hi, everyone. My name is Alan Mead. I know some of you. Hi to everyone I know, and nice to meet all of you who I haven't met yet. I'm the chief psychometrician for Certiverse. And as a psychometrician, you know, I love analyses of exam data. But when we do analyses of exam data, which is something we're really good at, the exam is already built and has been administered to a bunch of people. So I became interested in how we can apply quantitative
analysis to other parts of the exam development process. And that's really been my entree to this area. So I'm super excited to talk about machine learning and AI and natural language processing. Hi, everyone. I'm Nathan Thompson, and I'm CEO and co-founder of Assessment Systems. And
like Alan, I'm a psychometrician as well. But like he said, psychometrics ends up becoming a relatively boxed topic in many cases. And I'm interested, like Alan is, in innovating the profession by automating some of the more mundane aspects and making it easier to implement best practices, whether it's automated essay scoring, automated item generation, computerized adaptive testing, those types of things. And in many cases, like we'll be talking about in this presentation, they've been around for 30 or 40 years, but they're still not used very often just because they're relatively hard to use. Thanks, Nathan. So this talk is one that we did at the Association of Test Publishers,
or it's similar to that. And our objective is to demystify AI and talk about how it can be useful for you. So we're going to start at the beginning, at the very, very beginning, that is AI. And AI is using technology to do things that have required humans in the past. And AI has a long history, but clearly it's making progress and influencing things
and it's key to doing things like understanding facial recognition, or we're going to talk about generating exam items, or it's used in monitoring remote test takers. So you can see that it's a general technology or a general methodology that could be applied in as many different places as human effort is applied. So that is extremely broad. And when most people say AI, they probably mean machine learning, ML, because all of the notable recent successes have been machine learning applied to big data sets. That's an important thing to realize that most of the recent amazing breakthroughs have been when we've applied machine learning techniques specifically to really big data sets. And
machine learning is statistics. Maybe that's a little dismissive or simplistic. But at its basis, it is statistics. But it is usually addressing messy data sets. So data sets that we wouldn't be able to address using our traditional methods like item response theory or classical test theory, things like tweets or written responses, essay responses. And within machine learning, it's important to know two different types of models. Supervised learning models are models that predict a criterion or predict a category that's known. And if any of you have used multiple regression, you've used a supervised learning model. It has predictor variables, and it has a variable that it is
predicting the y variable. That's a supervised learning model. Unsupervised learning refers to models that look for patterns. And so if any of you have used cluster analysis or factor analysis, those are unsupervised models. And within machine learning, because it's
messy data, there's an important concept called feature engineering, which is manually creating measures or variables from the messy data. So for example, some of you are familiar with Project Essay Grade, which was a psychometrician's approach to addressing essay grading. And it would typically have 30 to 40 features that were created from the essay, such as sentence length and vocabulary level and word length. So feature engineering is creating variables that can be used as predictors in either supervised or unsupervised learning models. And if you've heard about deep learning, the exciting thing about deep learning is that you use huge data sets to have the model automatically create the features using unsupervised learning, and then to use those features to predict or classify. So for example, in deep
learning facial recognition, the computer learns about things like noses and eyes and eyebrows and the symmetry of a human face. You don't have to teach it those things. It learns those things from the data, and then it uses those features to predict whether a picture is a person or not a person. And what's natural language processing? Natural language processing is using a computer to process language. An awful lot of the messy data that we're interested in is either written or spoken language. It doesn't mean reading it like a human, but it's key to a lot of applications because it's the preferred modality or the type of data that we have most often. So smart assistants use it. And if you're familiar with
sentiment analysis, sentiment analysis is the methodology from machine learning for determining whether a given written piece of information is positive or negative. Like is this a positive restaurant review or a negative restaurant review? And it can also be used for generating content. Of course, we'll be interested in talking about that for generating exam content like items or scenarios. And it can also be used for translation. And another term that
you might have heard is data science. And within mathematics and statistics, the traditional research would be, you know, knowing math and statistics and having substantive expertise. You could argue that data science is taking that and adding the ability to work with large data sets and work with software. Most of the analysis that happens for data science happens in R or Python or some other specialized package. So data science is very similar to what we've done, but it includes an element of being able to be facile with technology.
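Since Python came up, here is a minimal sketch of the supervised versus unsupervised distinction described earlier. It is a toy under stated assumptions, not anything shown in the talk: the data are made up, and the function names fit_line and two_means are hypothetical. The supervised part fits a regression line to a known criterion; the unsupervised part just looks for two groups in unlabeled numbers, and assumes the data contain at least two distinct values.

```python
# Toy illustration only: all numbers below are made up for the example.

def fit_line(xs, ys):
    """Supervised: least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters (k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low_group, high_group


# Supervised: hours studied (predictor) and exam score (known criterion).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
intercept, slope = fit_line(hours, scores)

# Unsupervised: no criterion at all, just look for structure.
mixed = [10, 11, 12, 30, 31, 32]
group_a, group_b = two_means(mixed)

print(intercept, slope)
print(group_a, group_b)
```

The only difference that matters here is that fit_line is handed the answers (the scores) and two_means is not.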
And then there's automation. And automation is something that's of interest to people. It's using computers to process language or other things. And it may or may not use AI. And I think automation is what makes a lot of systems more efficient. So maybe sometimes
when people are talking about AI, they really think about automation because, you know, automated payroll systems, automated banking systems, automated exam systems are more efficient. And the one thing I would say about that is that they might also be rigid. And one possible application of AI is to possibly make an automatic system a little less rigid by incorporating some decision-making into that. So that's the connection between automation and AI. And I also want to talk about how I as a Psychometrician see economists thinking about technology. So in economics, they've been talking about technology for a long time. Technology is anything that improves worker productivity. And that's important because most of the success
of our economy over the past couple centuries has been through the adoption of technology that has driven most of the growth of the U.S. economy. And so if we enjoy our current economy, or even if we don't, if we want to compare it to what it would have been without technology, an agrarian economy where we didn't have much of a service industry, we have technology to thank for that. And AI is, of course, a technology in this perspective. That increased productivity inevitably destroys jobs. But in the past, it's always created more jobs than it has destroyed. The only problem with that is that that doesn't seem to be due to an economic law.
It seems to be more of an outcome that we've observed as opposed to a law. And there's some concern that with enough technology, this might not be true. But there's certainly a lot of optimism that this would be true of AI as well. And so AI is a technology that makes workers more productive. And we can expect that technology will allow more exams or bigger item pools to be created with fewer subject matter experts. I don't think,
and I'll talk about this at some length, and Nate will talk about it, that we're going to see subject matter experts replaced by technology wholesale, where there are no subject matter experts, anytime soon. Thanks, Alan. So going back to the agenda that was briefly shown at the beginning, we're going to be talking first about the definitions, which Alan just did, then I'm going to be talking about the past and the present of AI within the field of psychometrics and assessment, and then we'll be going into the future and what the current opportunities are. So when I'm starting that discussion about the past, I like to start with the distant past, and this is a great example that I've used in many presentations over the years. So I took a class on machine learning from Johns Hopkins University here in the US on Coursera. Johns Hopkins is one of the top universities, for those of you not from here. But you can see on the bottom how far the playback bar is. I was only about three minutes into the course, and they put up
psychometrics as an example of very, very early machine learning. So the work that psychometrics was doing with the structure of intelligence and the structure of personality 100 years ago was one of the first applications of machine learning. It was using unsupervised learning, like Alan was talking about, doing cluster analysis, factor analysis, trying to make sense of data around what is intelligence, what is personality, and that's where we got to the ideas of general intelligence or fluid versus crystallized and big five personality, those types of things that were done by early machine learning research. So when we're talking about the past of AI and
machine learning in psychometrics, psychometrics is really one of the leaders of the space. Another example of machine learning is item response theory. So as Alan mentioned, regression problems are really a supervised machine learning problem, where you've got something that you're trying to predict, in this case, the probability of getting an item correct, and using things to predict it. So item response theory also qualifies as machine learning that's
been around for about 50 years. Another great example, which Alan mentioned, is multiple regression, where you've got a criterion that you're trying to predict, and multiple variables that are predicting it. That's also been around for a very long time in psychometrics and assessment, in the case of predicting job performance for the purpose of personnel selection or pre-employment testing. So all those cases where you take a pre-employment test or you get an interview, maybe they grade an essay or something like that, those are all predictor variables that then go into a predictive model, which then predicts the probability of having high job performance, or the probability of you leaving the company within a year, or some sort of positive or negative thing that the company says is a goal that they want to achieve. So that also is a very good example of machine learning that's been around for a long time in psychometrics and assessment. Computerized adaptive testing is another good example. So it's based on item response theory, which is machine learning. And if you go by the definition
of AI as using algorithms to do things that humans normally do, adaptive testing qualifies as AI, because for thousands of years, people have been delivering oral exams, right? You can think of some old professor in a monastery a thousand years ago in the Renaissance or something, quizzing one of their students. That's essentially an adaptive test. One of the first people to formalize it was Alfred Binet, who's in that picture there, who did some initial work on intelligence testing. And he put together a pool of questions, because, you know, he's trying to develop IQ assessments, so your chronological age versus your intelligence age, let's say. So he had items at nine-year-old level, 10-year-old level, 11-year-old level, 12-year-old level. And if a nine-year-old came in, he gave him a nine-year-old question. If he got it right, he gave him a 10-year-old question. If he got it right, he gave him an 11-year-old question. If he got it wrong, he gave him a 10-year-old question. You can see
how this is an adaptive test, really. So by taking that approach, but building it around complex machine learning models like IRT, it's essentially making an AI version of Alfred Binet's original algorithm. Another good example is automated essay scoring, like Alan mentioned with Project Essay Grade. And the way that this works, this is also a supervised learning problem. So imagine that you have a thousand essays, and they've all been graded by teachers.
And maybe they've been graded on rubrics like argument, zero to three points, and grammar, zero to three points. So you've got a nice data set of the essays, and you've got two variables to predict. What we need to do is take natural language processing, which Alan mentioned as well, to take these essays, which are pretty amorphous, and try to turn them into things that are easily measurable or quantifiable. Some of the examples he gave were vocabulary level, word length, number of words within the essay, those sorts of things. And then there's also
what's called the document term matrix, which breaks it down into phrases or words that are used more often, because maybe top students use the word egregious more often, and the bottom students use less advanced words more often. I was an essay grader when I was in grad school, and there was a great example of that. It was asking kids, should we keep sports in schools, because your school board is considering the dropping of them. And I can't tell you how many times, first of all, the students said, we have to keep sports in schools, because they're the only thing that turns profit for the schools, which I think in that case, the word profit would tip off of being a lower ability student. And number two, I saw dozens at least
of cases where they used the word ludicrous, but they spelled it like the rapper Ludacris, and not the actual word ludicrous. So that also would be a very big tip-off for having a lower ability student. Now, those things aren't just words that we pick manually as a psychometrician by going through and looking at them. The machine learning models will try to pick up on which words are used more often, both on the top end and the bottom end of students, and thereby make a predictor model. And then for anybody writing a future essay, we can run it through the same model and predict what they would have got for their scores on argument and grammar.
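The pipeline just described (engineer features, build a document-term matrix, fit a predictive model, score a new essay) can be sketched in a few lines of Python. This is a minimal sketch with assumptions throughout: the three "essays" and their teacher scores are made up, the function names are hypothetical, and a one-predictor regression on word count stands in for whatever model a real scoring engine would use.

```python
from collections import Counter

# Made-up miniature essays with made-up teacher scores (0 to 3).
essays = [
    ("Sports are fun", 1),
    ("Sports teach teamwork and discipline to students", 2),
    ("Dropping sports would be an egregious mistake because "
     "athletics build character", 3),
]


def engineer_features(text):
    """Hand-built features of the kind mentioned above."""
    words = text.lower().split()
    return {
        "word_count": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }


def document_term_matrix(texts):
    """Rows are essays, columns are vocabulary words, cells are counts."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


vocabulary, matrix = document_term_matrix([text for text, _ in essays])

# Train on the graded essays: word count predicts the teacher score.
word_counts = [engineer_features(text)["word_count"] for text, _ in essays]
teacher_scores = [score for _, score in essays]
intercept, slope = fit_line(word_counts, teacher_scores)

# Score a new, ungraded essay with the same model.
new_essay = "We should keep sports in our schools"
predicted = intercept + slope * engineer_features(new_essay)["word_count"]
print(predicted)
```

In a fuller version, the columns of the document-term matrix would be fed to the model as predictors too, which is how a word like egregious could end up carrying weight without anyone picking it by hand.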
A good example of automation within the field of assessment is template-based automated item generation. So you can see here, this is an example from my platform, where you're writing a question, which is on the left there: a blank-year-old blank was found in a room unresponsive. The words in bold with a dollar sign before them are dynamic variables. So you can change the age and the gender: a three-, four-, or five-year-old boy or girl was found in the family room or the living room, playing with marbles or Lego. You can see how this can then create a bunch of permutations of the item automatically, without really changing the spirit of the item. It's still going to be a CPR question of a child that's found unresponsive, what do you do with them? But then it increases the security of the exam a little bit, because if people are talking about it, one says, I had a kid that was choking on a marble in the family room, and somebody else had a five-year-old girl choking on a Lego in the dining room or something.
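A minimal sketch of that template idea, using Python's standard string.Template because it happens to use the same dollar-sign convention for variables. The stem wording, the slot names, and the function generate_items are illustrative assumptions, not the platform's actual template or API.

```python
from itertools import product
from string import Template

# Illustrative stem; $-prefixed words are the dynamic variables.
STEM = Template(
    "A $age-year-old $gender was found unresponsive in the $room "
    "after playing with $toy. What should you do first?"
)

# Allowed values for each dynamic variable.
SLOTS = {
    "age": [3, 4, 5],
    "gender": ["boy", "girl"],
    "room": ["family room", "living room"],
    "toy": ["marbles", "Lego"],
}


def generate_items(stem, slots):
    """Return one item per combination of slot values."""
    names = list(slots)
    combos = product(*(slots[name] for name in names))
    return [stem.substitute(dict(zip(names, combo))) for combo in combos]


items = generate_items(STEM, SLOTS)
print(len(items))
print(items[0])
```

Four small lists already give 3 x 2 x 2 x 2 = 24 surface variants of the same underlying question.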
So this is a good example of automation. This has been around for maybe 10, 15 years now, but we're now moving past this with the generative AI, which Alan will be talking about later. So those are good examples of past research in machine learning and AI and automation in assessment. And I'll talk a little bit about ones that are a little more current or in the past couple of years. One good example of that is process data. So process data is essentially treating the assessment situation as a big data situation. So we're not just tracking which answer somebody
gave to multiple choice questions. We're tracking where the mouse went, how long they spent on every question. Did they click on A and later change the answer to B? Maybe it's a drag and drop question. You're tracking what things they drag and dropped and which order, that sort of
thing, because maybe they chose the obvious one first or not the least obvious one first on the drag and drop list. All of these are possible pieces of information. It gets more complex than a human can make sense of. Again, it's something where we put into a machine learning model and they're looking at the log files of all this drag and drop time spent, that sort of thing, and trying to make sense of that. And then using that to predict student scores based upon past student scores. And this can be used both for higher level scoring, but it can also be used for feedback too, because it's possible that the machine learning models be trained well enough to give feedback on somebody. Maybe it's a, you know, I saw an example years ago of a chemistry
item where they were dragging and dropping beakers and water and salt. It was about measuring the salinity of water within a beaker and they were tracking which things were being dragged first. And you could use that to give feedback, to help understand whether the student actually knew about salinity or not and help them learn. Another good topic, which leads into the future, is full generative AIG, automated item generation. And one of the first
articles on this was by Matthias von Davier in 2019 about Optimus Prime generating all of our medical exam questions for us. But it's been amazing to see just how far we've come in four years now as GPT has gone from 2 to 3 to 3.5, which is the one that got all the attention in December, to now version 4.0, which was just released two weeks ago and we're talking about today. And ASC and Certiverse both have generative AIG built off ChatGPT in our software.
And there are other parts of AI that we have integrated in our software as well, which we're not going to go deeply into as part of this presentation, but that's a good example of trying to solve assessment business problems and psychometric business problems through AI and not just doing substantive methodological research. Another good example is computational psychometrics, which I try to describe as the merging of adaptive learning and adaptive testing. So adaptive testing was the type of AI we talked about before, but there's also adaptive learning. In which case you're trying to take learning
modules, let's say about learning fifth grade math, and you're putting those on a scale of progression in terms of what's more advanced or not. And you can recommend learning on that progression based upon scores upon an adaptive test. And that could even be at a more macro level where you're talking about students that are off-grade, maybe they're two grades behind or two grades ahead. This is also very useful for them too, because then a quick formative adaptive
assessment is able to realize where they are on the scale and provide them with a learning module that's appropriate. And both of these are based on machine learning models as well. Alan mentioned AI remote proctoring earlier, and that's a great example of AI on the operations side of assessment. And basically what we're doing is trying to flag very specific things. So do you think about AI remote proctoring as a really big thing, like a Big Brother's watching kind of thing? No, we train machine learning models to look for very, very specific things that we think are flags for somebody cheating. One example is a face looking away or having two
people within a screen. So that's a good example you see right here on the screen with that webcam. Two people in the screen and a face looking away, this would be flagged for potential cheating. On the other hand, you might have like the second picture there, no face within the screen. That's
also worth flagging because you know somebody should be there taking the test. Having noises is another good example. Having two faces very close together would be a really big example. So what these graphs on the right here are probability of each one of those things happening. So you can see there's text probability, voice probability, phone probability, facial recognition probability. It's looking at the probability at each one of those things that are happening,
combining them in a machine learning model into an overall probability of cheating, which is what the bottom graph is. So that's how AI remote proctoring works: it's breaking this down into machine learning models that look for very specific things and then using that to determine flags. One of the other current things that's being worked on is taking some of those other things that we talked about earlier, like automated essay scoring and adaptive testing, and making them more advanced. A good example of that is automated essay scoring. I saw a presentation by Paul Edelblut from Vantage Labs last July at the IPAC conference and he was talking about how he's using his AES algorithm to look for more complex notions like voice. So no longer are we just looking for grammatical errors or something that's relatively
easy to measure like that, but looking for strength of voice in an argument, and that's harder to measure. All right, thanks Nate. I think this is where I pick up. So I'm going to talk about large language models. I think you've probably heard about models from OpenAI and other places, Google and Meta and Facebook, like GPT-2, GPT-3, ChatGPT, which is GPT-3.5, and as Nate said, GPT-4 is now out. These are all neural networks. Neural networks operate like a Markov process, and I got some feedback that not everybody knows about Markov processes, but it's also called a random walk. It's choosing the next word at random, though not completely at random. It has something called attention and it's learned what words go together and it uses those to create the language that we see. So you submit a prompt like write a multiple
choice item about psychometrics, specifically about the assumptions of the 3PL, and you send that prompt to one of these engines. Hopefully it's not GPT-2, but if it's GPT-3 or 3.5 or 4 or Bard, well, regardless of the engine, it'll return a textual response and it'll actually be an item if it's one of the more recent models. And one thing that's really, really, really important to emphasize is that because of the way that the models work, where they're randomly choosing the next token to put together into a response that's related to the prompt, they're really good at being storytellers, but there's not necessarily an equal strength in being truth tellers. And some people are optimistic that as we build bigger models, this will just solve itself. Those people have evidence in GPT-4, where supposedly GPT-4 is more truthful than previous models, but it seems to me like it's a hard problem to solve. The breakthrough that allowed these models was the realization, and this breakthrough happened six years ago, 2017, that attention was sufficient to build really good responses, attention and really, really large training datasets. And so I think there's probably going to have to be another breakthrough to solve
the problem of these large language models hallucinating information. And OpenAI has probably gotten the most market share, if you will, but there are lots of other rivals. The researchers who published the attention paper were actually from Google Brain, and Google will have things. I think one of the reasons why OpenAI is so popular is that they have designed themselves to be open. They have designed themselves to supply this technology to the world, whereas Google developed this technology to make its own products better. So I do think that there's something to be said for OpenAI. And one of the things that makes
these models better, and one of the ways that GPT-3 is different from GPT-2, not the only way, but one of the ways, is that there's been this reinforcement learning from human feedback, and supposedly there's been 12,000 hours of this reinforcement learning that produced the latest models from OpenAI. And that's a huge amount of effort, that's, what is that, about 10 person-years, a person-decade. But that's probably within the grasp of large organizations like Google and Facebook. And so I think we'll probably see models. I think, in fact, Bard is quite capable, if you haven't tried that. So although I'm going to talk about the OpenAI models, I think this is probably something that will be true of most large language models, at least the ones that use the same technology. So this is OpenAI's website. I don't know how fuzzy this is. I'll zoom in on the item. But I've taken the prompt that I just described,
write a multiple choice item about psychometrics, specifically write about the assumptions of the 3PL model, which is the prompt. And I've clicked submit, and it has written an item. And we'll take a look at this item in a minute. So that's the response. And you can see that it looks like an item. You choose the model over here. So this is using the most capable of the GPT-3 models.
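To make the random-walk picture of generation concrete, here is a toy sketch of choosing one next token, with a temperature knob like the one on this interface. It is a sketch under assumptions, not OpenAI's implementation: the three candidate tokens and their scores are invented, the function name sample_next_token is hypothetical, and it uses the commonly described approach of dividing scores by the temperature before a softmax, with a temperature of zero treated as always taking the most likely token.

```python
import math
import random

# Hypothetical scores (logits) for the next token at one step. A real model
# computes these from the prompt and everything generated so far.
NEXT_TOKEN_LOGITS = {"model": 2.0, "item": 1.0, "banana": -1.0}


def sample_next_token(logits, temperature, rng=random):
    """Pick one token; temperature 0 always takes the most likely one."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax over the scores divided by the temperature.
    scaled = [score / temperature for score in logits.values()]
    biggest = max(scaled)  # subtracted for numerical stability
    weights = [math.exp(s - biggest) for s in scaled]
    return rng.choices(list(logits), weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the demo is repeatable
greedy = [sample_next_token(NEXT_TOKEN_LOGITS, 0, rng) for _ in range(5)]
varied = [sample_next_token(NEXT_TOKEN_LOGITS, 0.7, rng) for _ in range(5)]
print(greedy)
print(varied)
```

At temperature zero every draw is the same token, so repeated calls give the same text. Raising the temperature flattens the weights, so less likely tokens get picked more often.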
You set the temperature. The temperature in all of my research is about 0.7. If you use a value of zero, you will get a deterministic response. You'll get the same response every time, because a temperature of zero will force the random choice of the model to choose the most likely next token. So by that mechanism, it will always produce the same result. And you could set the temperature higher. Temperature is anthropomorphized to be that the model is being more creative, the higher the temperature you set. So 0.7 is a reasonably
creative level. And the other thing to attend to is that this item required 89 tokens. And I'll explain what a token is. And here's the item that it wrote. So which of the following is an assumption of the 3PL model? Ability is the only factor that affects performance. Performance on a test is affected by multiple factors. Item difficulty is the same for all test takers, or item difficulty can differ from one test taker to another. So I want to emphasize that all that was required to generate this item was me going to this interface and typing, write a multiple choice item about psychometrics, specifically write about the assumptions of the 3PL model. And I guess that
I probably had to put in a credit card. So a credit card and being able to write that prompt was all that it took to generate this item. And so this technology is incredibly easy to use in comparison to the kinds of technology where you had to train a model from scratch in the past. And I chose this item on psychometrics in part because I'm a psychometrician and I can evaluate it as something I'm an expert in. But also psychometrics is a little bit of an esoteric topic. And so, you know, GPT-3 is trained on a general, you know, scrape of the internet or something like that. And so, you know, this reflects how well it does on a very, you know, niche part of that information. It seems to me like this is a very good model, or at least as a person who's
used AI for many years, this is amazing that it can form this response with this bare of an input. So what did that item cost? Well, the pricing varies and the model that I used is the biggest model. So it's the most expensive. And the price is two cents per thousand tokens. And a token is about three quarters of a word. It's a short word or a piece of a word or a punctuation mark or something like that. So it works out that a thousand tokens is probably about 750 words. So that means that it didn't cost very much for me to generate that item.
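The arithmetic behind that can be written out directly. This sketch uses only the figures quoted in the talk (two cents per thousand tokens, 89 tokens for the example item, about three quarters of a word per token) and assumes the 89 tokens are what gets billed for one item; the variable names are just for illustration.

```python
PRICE_PER_1000_TOKENS = 0.02  # dollars, the price quoted for the biggest model
TOKENS_PER_ITEM = 89          # token count reported for the example item
WORDS_PER_TOKEN = 0.75        # rough rule of thumb quoted in the talk

cost_per_item = TOKENS_PER_ITEM / 1000 * PRICE_PER_1000_TOKENS
items_per_dollar = 1 / cost_per_item
words_per_1000_tokens = 1000 * WORDS_PER_TOKEN

print(f"cost per item: ${cost_per_item:.5f}")
print(f"items per dollar: {items_per_dollar:.0f}")
print(f"words per 1000 tokens: {words_per_1000_tokens:.0f}")
```

Under those assumptions one item costs a fraction of a cent.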
If I did the math right, I can generate over 500 items for a dollar like that. And this is the most expensive model. There are less expensive models. And since I wrote this slide and presented it, things have changed even more. So the engine behind ChatGPT is GPT-3.5 and there's a turbo version of it. And OpenAI claims that it outperforms GPT-3 Davinci. And the amazing part is that it also costs a lot less, a lot less. So certainly you could get into high costs if you were generating
lots and lots and lots of calls. There are circumstances where you might be trying to, you know, make multiple calls per item. But if you do what I just did, it's not very expensive to use this technology. So we did a research study to determine how good the items are. And in order to do that, we had to be the subject matter experts. So we said, let's make a test about psychometricians. So it's a very hypothetical test. And we generated a blueprint and we generated some items. We were actually interested at the time in whether version two or version three of the model was better. You know, version three was supposed to be better, but there were some things about the way they made it better that made us wonder if it would be better for writing items. So it also provided
the opportunity to just quantify how good GPT-3 is at writing items. And I guess it's limited to, you know, items about psychometrics. So we'll see how generalizable that is. So myself and a colleague rated 90 items on a scale of useless as one, needs a lot of editing as two, needs a little editing as three, and more or less ready as is, as a four. And of the 90 items, the modal rating was a two. So most of the items, well over half of the items would be usable, but most would require subject matter experts to revise them. And the most common rating was that we felt that it
would need a substantial amount of editing. And we experienced problems like having no correct answer, having two correct answers, having a stem that didn't contain enough information to exclude one of the answers, one of the distractors as a correct answer. So those are the sorts of things. These data suggest that if we wanted to use GPT-3 to create items without subject matter experts, we're not likely to be successful. But if we use this technology to assist item
writers, that probably would be successful. And this seven times increase has to be taken with a huge grain of salt. But we do think for sure that subject matter experts who use this technology will be more productive because they'll see the item. Maybe they'll get lucky and they'll
get a four. But if they get a three or a two, which is more likely, they'll be able to edit the item and they'll be able to produce that item. It'll be in some ways of a higher quality, more creative, more likely to use the full spectrum of the domain. And they'll be more efficient in doing that. Oh, and this is hot off the presses. This is the results I just talked about. We've also done the
same thing for GPT-3.5 and GPT-4. And so the modal response for GPT-3 and GPT-2 was in the range between 2 and 2.5. The result for GPT-3.5, which is the cheap model that costs 90% less than GPT-3, is better. And GPT-4 is even better again. I guess on the other hand, we want it to be a four, ideally, for the average item to be acceptable without any revision. And it looks like it's going to be a couple more iterations of the model before we get there. And that's assuming that this is a linear trend, which may or may not be true. It looks fairly linear here.
So it's true that there are simpler models, and the simpler models don't work as well. So the last thing I want to show you is how we integrated GPT-3 into our item writing. I'll be very brief. When you go to write an item about a hypothetical or a real blueprint, you'll choose write, and it'll ask you if you want to use AI assistance. And if you choose that you do want to use AI assistance, it will ask you what topic you'd like to write about.
I've actually had pushback about, oh, I have to tell it what topic I want to write about. But I think every item writer has to start there to begin with, so I don't think that's an additional requirement. And there are some reasons that it's important to do that. The way that we form our prompts, we say, oh, write an item, and we tell it what exam and what topic you're writing about. So, like, JTA, all aspects of JTAs. But if two item writers are writing about JTAs, then they would both
be sending the same prompt to the system. So it seems to us important to allow the subject matter expert item writer to specify the specific topic within JTA they want to talk about. So they specify that, they click generate items, and then it generates the items. But the first time that I did this, I had forgotten that for this particular blueprint in QA, I had gone in and added to the end of the prompt the instruction to write the items in Chinese. So I wish I spoke
Chinese and knew whether these are good items or not. But I guess it demonstrates that the system can do what you want, which includes writing in other languages. But I went back and changed the prompt. And here are some items about sample size requirements for a JTA survey. And so these are three examples. And if you like one of these, then you would click accept, and you would go into an item wizard, and you'd have the opportunity to adjust that item or revise it or change the key response. And if you didn't like it, then you can reject the item, and it'll go off and try again,
and you get up to 10 tries. So that's the way that we've implemented the GPT-3 item writing in our system. And so what's next after GPT-4? GPT-4 was just released, so nothing immediately. But all of these models have become increasingly larger with fresher datasets. And the other thing is that GPT hasn't been great at math. Like, one of the things I noticed is I would sometimes give it prompts like, give me two items or give me five items. It's not very good at following instructions that are mathematical. But apparently,
ChatGPT and GPT-4 are better at math. They're better at classification tasks, which is what sentiment analysis is, or zero-shot classification. So we can expect GPT-4 to be better than ChatGPT, which is better than GPT-3. It's actually true that although the turbo model is priced cheaper, GPT-4 is actually more expensive. It's more similar to GPT-3. Maybe someday, I'm speculating
that there might be a turbo version of that that might be cheaper. But one thing to know is that a chat version will always be more expensive because the way that the chat works is you tell the model something. You ask it a question or ask it to do something and it responds to you. And then the chat part is that you then respond back to it. And the way that that's implemented
is that that second submission is: I said this and you said this, now I'm saying that, and the model responds. And when you respond to that, what's being sent to the model is: I said this, and then you said that, and then I said this, and then you said that, now I'm saying this. And so it's called the chat window. There's always a larger amount of text that is used to do the chat. The way that the model remembers what it said is that it's actually responding to the whole history, or to a window of history, of the chat. We also know that GPT-4 will be (or is today, but we don't have access to it) multimodal. So it will be able to read images as well as text. There
were rumors of video. I haven't heard that those rumors came true. And this is an example that was given in the GPT-4 tech report about how well GPT-4 does at solving exam questions. I don't know if these are actual exam questions or simulated exam questions. But the remarkable thing is that
whereas GPT-3.5 was mixed in its response and some of them were not so great, GPT-4 does better on average. And so it made headlines that it passed the bar exam, for example, or something like that. And we know that GPT-4 is multimodal because GPT-4 with vision does a little bit better than GPT-4 without vision. So apparently it depends on the test: the Uniform Bar
exam doesn't need much vision but the GRE quantitative does for example. So it does better when it can read. So will AI take our jobs? And OpenAI's charter mentions highly autonomous systems that outperform humans at most economically valuable work. So that sounds a little threatening to me. It sounds like their charter is to create systems that do most economically valuable work which presumably is my job as well as yours. But that sentence, that phrase is actually sandwiched
inside a mission statement that talks about AGI as highly autonomous systems that do valuable work benefiting all humanity. So we'll see whether this part comes true. It's absolutely clear that AI will continue to increase worker productivity, which will reduce the need for headcount. It will make each individual worker more productive, which will mean that we'll need fewer workers, maybe in some places more than others. And this increased productivity will create more jobs, but some jobs will be lost. That's at least been the trend: technology has essentially moved workers from one job to another. And it's sometimes hard to predict how technology will affect this. A paper from 10 years ago predicted that people like telemarketers
and then people who process data, insurance underwriters, data entry workers and insurance processing clerks, would be most likely to be automated away by AI. And Webb has a more recent paper where he does something interesting. He compares the wording of patents to job descriptions to determine whose job is most similar to things that are being patented. And on the basis of that, he argues that people like clinical lab technicians, chemical engineers, optometrists and power plant operators might be in danger. So I think it's clear that some jobs will be more impacted but I think
it's hard to tell which jobs. And I think in terms of us writing items, I think it's absolutely clear that AI models will allow us to write more items with the same number of subject matter experts or allow us to reduce our headcount in terms of subject matter experts. And I think that's exciting. I think that, for example, that could enable large item pools that might help solve some of the problems we have with people seeing items. And along with that, I wanted to talk about
what AGI is and the singularity. AGI, which is referenced in OpenAI's mission statement, is artificial general intelligence. And it refers to an AI that has human-like performance. And the concern is that if humans can create an AGI, that AGI can then improve itself and we would have this intelligence explosion. We'd have these superintelligent AIs that would introduce uncertainty. It's unclear how that would happen, or what would happen at that point, but there's a concern about that. Well, so a good question would be whether this
will happen or when this will happen. And in 2016, some AI researchers polled other researchers and they felt that there was a 50% chance by 2050 and a 90% chance by 2075. So based on that, I would say no AGI in the next five to eight years. Since this research, we've had an explosion from OpenAI of these new models over the past several years. And OpenAI is a strong proponent of the idea that if we make bigger models, they will just naturally evolve artificial general intelligence. And I read the result of a similar, more recent survey of AI researchers, and they were asked how many of their colleagues believe this proposition, that as you build bigger and bigger models, that just marches towards artificial general intelligence. And they felt that 50% of their
colleagues believe that, but only 17% of AI researchers actually held that belief themselves. So I'm a little skeptical. I think we're probably going to need to have a breakthrough in ways that were similar to the breakthroughs that allowed these large models, that allowed their good performance before we see something that approaches human level intelligence.
So I think, Nate, you're going to talk about the next step. Yeah. So we wanted to briefly touch on some of the tools to help you or your organization implement AI. One of the first ones, that Alan mentioned before, is the programming languages R and Python. R isn't so much a programming language as a programming environment. But both of these are very, very useful. The vast majority of all machine learning and AI work out
there is using one or both of these. They're now starting to work a little more closely with each other, but they've been competitors for quite a while. And it's kind of like a religious thing, which one you're part of, just like if you're a 3PL or a Rasch psychometrician. There are off-the-shelf tools like Amazon's. So some of the remote proctoring companies out there that utilize AI, they didn't build all the algorithms themselves. They're able to tie
in to off-the-shelf tools available through Amazon Web Services for facial recognition, facial detection, noise detection, things like that. And they just run all of their video through that. They didn't have to build, quote, unquote, the solutions themselves from scratch. You can, of course, build your own things from scratch. So if you wanted to build your own automated essay scoring system, you could write code to do natural language processing, word counts, vocabulary level with Flesch reading level, all that kind of stuff, and then develop your own code for regression scoring or neural networks or anything like that.
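As a rough illustration of the build-it-yourself route just described, here is a minimal Python sketch, not any vendor's scoring engine. It extracts a word count and a Flesch Reading Ease value from an essay and fits a one-predictor least-squares line to human scores. The syllable counter is a crude vowel-group heuristic, the function names are made up for this example, and the tiny feature set is an assumption; a real system would use many more features and far more training essays.

```python
import re

def count_syllables(word):
    # Crude heuristic (an assumption): one syllable per run of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def essay_features(text):
    """Return the word count and Flesch Reading Ease for one essay."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))          # guard against empty essays
    n_sentences = max(1, len(sentences))
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    flesch = 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (syllables / n_words)
    return {"word_count": len(words), "flesch": flesch}

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_score(text, slope, intercept, feature="word_count"):
    """Score a new essay from a single feature using the fitted line."""
    return intercept + slope * essay_features(text)[feature]
```

To use it, compute `essay_features` for each essay in a human-scored training set, call `fit_line` on one feature against the human scores, and then call `predict_score` on new essays.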
But again, that's kind of a waste when things like R, Python, and Amazon have off-the-shelf tools available for you to do that. It's just so much easier. In many cases, those tools are free. And then, of course, there is buying these. So utilizing ChatGPT is a great example of that. You're buying or renting access to OpenAI's ChatGPT product. And Alan gave some of the pricing there, which shows you just how incredibly affordable it is. And if you're using it to make questions, even if you're not producing questions that are high quality, and you're only at level three on Alan's guidelines there, if you're writing 1,000 questions and you're able to easily toss out half of them and keep the other half with minimal editing, that still saved you a lot of time if you're only paying half a cent per question. It's just the economics that has completely changed content development. And I can see that also affecting other areas of content development within our field, too. So
you're generating reading passages for English. That's another good example. Or you're generating voice conversations for English, and you're trying to score them as well. Those are all things that are more doable with the API tools that are available.
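The cost argument above can be worked through in a few lines of Python. This is only back-of-the-envelope arithmetic using the figures mentioned in the talk (1,000 generated questions, half a cent each, half of them kept); it assumes the half-cent price applies to every generated question, and the function name is made up for this example.

```python
def generation_cost(n_generated, cost_per_generated, keep_rate):
    """Return (total spend, number of kept items, cost per kept item)."""
    total = n_generated * cost_per_generated
    kept = n_generated * keep_rate
    return total, kept, total / kept

# Figures from the talk: 1,000 questions at $0.005 each, keeping half.
total, kept, per_kept = generation_cost(1000, 0.005, 0.5)
print(f"total ${total:.2f}, kept {kept:.0f} items, ${per_kept:.3f} per kept item")
```

Under those assumptions, discarding half of the output only doubles the cost per usable item, to about one cent.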
So there's our contact information, if you have any questions about us or some of the research that we have done, or the tools that we provide to make the AI and ML methodologies more available to more organizations within assessment. So generally, what I like to think of Alan and myself doing is this: some of this functionality has been around for 30, 40 years or even more than that. But for most of the time, it's only been done by those
big international exams that we all know that have 30 psychometricians on staff and have unlimited budgets. What we like to do is take some of this functionality and make it available so that, let's say, an individual university in Peru is able to set up their own adaptive test or their own automated essay scoring or their own automated item generation for English testing or whatever it is that they want to do. It's not something that's just in the domain of the big multinational corporations. So with that, Alan, do you have any closing points before we go to the Q&A? Let's do any questions. We'd be happy to address them. And Andre, are you? I will get you. Yeah. So I will say, first of all, I would say screenshot or write down
these email addresses because we will not be able to get to all of the questions that we had. Sorry about that. No, that's like, you know, people are curious and people have a lot of interest. So we will get through as much as we can. But if you don't hear an answer to what you're wondering about, please feel free to reach out to Nate and Alan. One question that we had
was: can you share any best practices for authoring large language model prompts to create items, for example, indicating the correct response, number of response options, you know, kind of like, it sounds like, can you ask it to follow Haladyna and Downing when creating a prompt? So I would answer that briefly by saying that you can Google, you know, prompt engineering, or I can send you a link to tutorials on prompt engineering. But I would start off simple with a prompt like the prompt that I showed, and then I would use it and evaluate the items, you know, a few responses, and then add additional instructions to address problems that arise. That's how we created our prompt. And I think that process is pretty natural, and calling it prompt engineering makes it seem a lot scarier than it really is. Right. I'm not sure if people wanted to be anonymous or not. So I am anonymizing, but if you do want a direct answer to follow up again, you can just email folks afterwards.
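The start-simple-then-add-instructions approach to prompts described above can be sketched in Python. This is a hypothetical illustration, not the prompt actually used in the system discussed here: the wording, the function name, and the example exam and topic strings are all assumptions. The added instructions target problems mentioned earlier in the talk, such as items with no correct answer or two correct answers.

```python
def build_item_prompt(exam, topic, extra_instructions=()):
    """Assemble a prompt from the exam, the writer's topic, and any fixes added later."""
    lines = [f"Write a multiple-choice item for the {exam} exam about {topic}."]
    lines.extend(extra_instructions)
    return "\n".join(lines)

# Step 1: start with the simplest prompt and review a few generated items.
simple = build_item_prompt(
    "psychometrics", "sample size requirements for a JTA survey"
)

# Step 2: add one instruction per problem observed during review.
refined = build_item_prompt(
    "psychometrics",
    "sample size requirements for a JTA survey",
    extra_instructions=[
        "Provide exactly four response options.",
        "Exactly one option must be correct; mark it as the key.",
    ],
)
print(refined)
```

Keeping the instructions in a list makes it easy to see which rule was added for which problem, and to drop a rule that is not helping.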
Next question is for the GPT-3 AIG experiment: were there any observations on the types of items that had greater success than others, for example, qualitative versus quantitative? There weren't any really quantitative items. I mean, there were quantitative items like, you know, what is the interpretation of Cohen's d if the Cohen's d is 0.4? But there weren't mathematical questions. Although I did generate one item about disattenuating for reliability, and it reported a formula. It asked you to pick the correct disattenuation formula, and it actually didn't have the correct answer, but it would be easy to fix the answers that it had. So I'm not sure that it's great at math. I am currently using it for that, just as a for-fun research project, trying to make a practice SAT exam using ChatGPT, and it is able to produce the math items. As to what quality they are, we've not gotten to that level yet, to
have a former math teacher reviewing them. Excellent. And then Alan, did you do any investigation of how redundant the items are if they're requested multiple times using the same prompts? If anybody knows a good way that you could take a huge bank of items and determine how redundant they are, I would love to talk to you. The items are at least as different as the items two different people would write about the same topic.
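One simple way to get a first look at redundancy in an item bank is pairwise word overlap. The Python sketch below is only one possible starting point, not a method the speakers used or endorsed: it computes Jaccard similarity on word sets and flags pairs above a threshold. The 0.6 threshold and the function names are assumptions, and surface overlap will miss items that test the same thing in different words.

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercased set of word and number tokens in an item."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of distinct tokens that two items have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def flag_similar(items, threshold=0.6):
    """Return (i, j, similarity) for every pair of items at or above the threshold."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        score = jaccard(a, b)
        if score >= threshold:
            flagged.append((i, j, score))
    return flagged
```

Because every pair is compared, this scales quadratically with bank size, so a very large bank would need blocking by content area or a faster approximate method.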
And this is one we heard multiple times, both after the live presentation and today, basically asking: can you draw on approved reference material when generating items through AI? I think that if you want the AI to reference your materials, you're going to have to train it on your materials, is the short answer. And therefore, I don't think you can make a GPT-3 API call and have it reference your materials. Now, that said, things change fast. ChatGPT now has a plugin that can search the web. I don't know if that could allow it to look within your materials to find the best fitting reference or something like that. So, this is a fast moving area. But it might be the case that you might have to train it on your
materials in order to generate. That's been true in the past. And related, and I know we may have some updates here as well: are created items able to also contain metadata, such as the link to the Blueprint content area, the reference they were created from, or other facts that are often associated with items in a bank? We've experimented with asking GPT-3 to generate a rationale or reference or something like that. I think that that doesn't play to GPT's strength as a storyteller. It asks it to be a truth teller. And so, you get mixed results, but it can do that. And so, you could add anything to the prompt. As far as tying it back to the Blueprint, I know what part of the
Blueprint I'm generating it from. So, my software adds that metadata automatically. And it was asked: you can see how we'll need fewer SMEs to write items, but this will not necessarily reduce the number of SMEs involved in reviewing and approving items. That's absolutely true. And maybe in some ways it makes the review process even more important. So, I think that the SMEs will be more efficient, but I don't think that we're ever going to get away from having subject matter experts. I think we will just need a somewhat smaller pool. And that might be a little bit smaller or a lot smaller, but it's not going to be nobody.
Not for the foreseeable future. I'm sorry, go ahead, Nate. Absolutely. Do we have any for Nate about adaptive testing? I don't think so. There was one about reducing bias early in the list, for when you establish machine learning models for something like AI proctoring or automated essay scoring. And the answer is that the machine learning model is only as smart as what we tell it to do, of course, and the data we feed into it. So, we've all seen the news stories about facial recognition being biased against people with darker skin color, right? And that would play into facial recognition for checking against ID cards if you're using that with AI proctoring.
If it's using the same algorithm trained by Amazon Web Services that used a skewed sample of humans when it made that model, that's going to go into checking the ID cards as part of your AI proctoring. And that's absolutely not going to happen then. So, you have to be careful. Same with automated essay scoring. You're basing that upon having good scores from the humans as part of your
training set. So, if you're not sufficiently training the humans as part of that, you're going to get potentially biased or inaccurate scores there as well. You know, when I was working as an essay scorer, when I was in grad school, there was a lot of work that went into training: you had to qualify and pass a qualifying set to be allowed to score any live students. And, you know, it's a good example of the type of quality assurance that would have to go into training any of these AI models. And then there's evaluation too. You can audit a system for bias. You can do research to audit a system for bias. That's very important as well. Yeah, I think so. Like, in that case, you're comparing the AI scores versus human scores
and trying to find outliers or things that might tip off outliers in terms of test length being related to it or something. You can develop your own flagging systems like that and then bring in third managers. Fantastic. Well, it looks like we are right at time. There were multiple other questions. I would again encourage people to reach out to Nate and Alan with some of those
specifics. I do see a lot of things as well that are kind of focusing on security and cheating and plagiarism and ownership and all those things. And we are actually working on kind of a follow-up presentation as well about addressing some of those concerns and, you know, some of the things as the world is changing. So keep an eye out for that as well. But if there is anything directly
that you would like an answer to, you have these fantastic psychometricians at your disposal, who are always happy to talk about these things. So thank you all so much for attending. Yeah, thank you all. And again, if you need anything, follow up. You can also reach us at AnswersAsServerse.com if you have any broader questions. Thank you so much.