Evaluating Deep Facial Recognition Technologies — Justin Norman — Ph.D. Research Reception
a second year student here who's going to talk to us about an interdisciplinary framework for evaluating deep facial recognition technologies for forensic applications. Welcome, Justin. Justin Norman: I'm Justin. I want to talk about a few of the things I've got going on in flight. I'm going to focus, as he said, on the deep facial recognition evaluation framework, but I've actually got a few things going on, two of them derivative of the facial recognition evaluation project.
One is on synthetic data: we actually use quite a bit of synthetic data, and we discovered quite a bit about some of its limitations for use in machine learning applications. I've also been working with Olivia Tang, who's also here today, and, as a completely left-field sort of thing, I'm working with the Center for Technology on some haptic interfaces for provisionals. So, just jumping into it: some of my motivation for going after this idea of how we can increase the robustness, and actually have serious conversations about whether or not facial recognition can be effective at all long-term, comes from many of the things you've probably all seen in recent history, in the news, or even in applications within the criminal justice system. Facial recognition technology is old.
It's actually been around since the mid '60s in some form or another, but in the last decade or so, with the increased accuracy, power, and computational capability of deep-learning models especially, it's really exploded. Unfortunately, it's being used primarily in surveillance and government use cases. You've probably all heard of Clearview AI and how they scraped billions of images from publicly available sources and integrated them into a lookup sort of software that has been made available to many governments and a lot of organizations. Clearview would probably say that they're bringing public safety and contributing to transparency. Of course, the cases you see on the screen demonstrate a very different view, which is that it can be used not only improperly but also ineffectively, not just to surveil the public but also to wrongfully accuse, and that has some pretty terrifying downstream implications. So there is a really rich body of work evaluating the qualitative side, the human side, and also really focusing in on the inference side of things, how the technology is working on you, and trying to make sure that the public is aware of what actually occurs in the software and what the limitations are.
So it's pretty critical, especially in forensic settings, so going in front of the courts or being used in the criminal justice process, that we really understand not just how the models evaluate against the images themselves, so some technical outcome, but how well they perform the tasks that they're actually being used for, which is the forensic setting. So where are we today? This time period of 2018, 2019 where much of the initial alarm from the public marginally brought on many of the other really people in the area. Unfortunately, legislation is still really lagging. There are really three or four major areas in the United States, particularly, with lots of limitations, but still pretty broad view stickability.
Almost every law enforcement organization in the entire country, and the Department of Defense at the federal level, is using some form of facial recognition technology. Additionally, China, Russia, and other authoritarian governments have given a sort of false sense of efficacy as they've deployed these tools directly onto the populace with little or no control. So the primary mode to fight back against this has been through activism and legislation, and both of those are lagging. So the expectation really is that facial recognition technology, in many forms, is going to continue to be utilized, at least for the short term, over the next couple of decades or so.
So, what can the public do? That's a question that I started with. Well, it turns out that almost all the research tools that have been developed in recent memory have really focused on a couple of areas. The first is in this category of poisoning the datasets. So you see that in the second box there.
So, can we make sure that our images, if we upload them to the Internet, are not used in a training pipeline and thus are not actually re-identifiable when it comes time to look you up at inference? Or, if you are being subjected to a facial recognition system, can you somehow wear something or do something to confuse the system as it's going about its process? So those are the two areas. Now, unfortunately, the one thing that facial recognition technologies are getting a lot better at is defeating these mechanisms of adversarial attack. So for the time being, poisoning the dataset would work for the model architecture that you used it against, but against any other, more robust architecture, it instantaneously fails. And by the way, almost all of us have already uploaded video or photos of ourselves without any of that protection.
So it's really not a viable way forward, and of course, none of us can control what we wear, what we have, what we see at all times. We're being surveilled, probably more often than you think. Some of the preliminary research questions I started with assumed that it was going to take a long time to enact any real legislative or framework change, and that models are going to take a while to actually get better to a point where we could trust them. What part of the facial recognition pipeline, and I'm talking from the actual technical pipeline to its use in the criminal justice system and beyond, would be most immediately impactful? I came up with the criminal justice system and the courts process, because that's really where it's doing a lot of immediate harm.
Since courts, police, government, and corporations are going to use the technology, and we know that's going to happen, it becomes important to ask: are we evaluating the technology in the mechanism in which it is being used, as opposed to in a lab setting? And the answer is almost always no. The primary performance metrics typically used are just like in any other lab setting. You take a dataset that is a benchmark, and you beat the benchmark again and again as you improve your models. However, that's not at all what happens in a forensic setting. In a forensic setting you usually have one exemplar image and a set of images of people who are really close to you or look very similar demographically to you, and then the model's task is to find you out of that smaller list of really similar people.
It's a much harder task for the model, and so when you start going down that route of evaluating this, honestly, they have a much lower general accuracy, and that's often not reported or even done as an evaluation mechanism for facial recognition technology. So a couple of questions that came from here were: how might we actually evaluate the technology for robustness and accuracy under these real-world conditions, and do it in a way that's agnostic to the model, so not just saying FaceNet or ArcFace or one of the other more popular frameworks, but being able to generalize to the future, where we know Clearview and other types of companies and organizations can continue to try to push the state of the art and make claims about efficacy? And how do we do this without creating a dataset that just exacerbates the harms that we are hoping to avoid? One of the things that was brought up in one of the papers that I mentioned earlier was that oftentimes, constructing a benchmark dataset that includes diverse individuals requires these diverse individuals to be actually segmented out of the population, and that causes harm in the process.
So this is something we want to avoid. So how can we do that? I mentioned that much of what's being done right now is focused almost exclusively on adversarial attacks. There are a couple of different toolkits; you might have heard of Fawkes or LowKey. Those two tools are about data poisoning, but what we've learned, unfortunately, which supports the hypothesis we had earlier, is that these simply do not persist into future iterations of the model architectures in question.
From a background standpoint, we decided to more carefully control the composition of the comparison group to discern whether facial recognition systems can be robust for law enforcement or forensic tasks. So we evaluated two very well-known face recognition architectures, FaceNet and ArcFace, and they're pretty close to the state of the art: ArcFace is maybe about six to ten months off the state of the art. We utilized two datasets. The first is CASIA-WebFace, which is a well-known, older public dataset that has a lot of actual noisy, real-world-like images in it, which was important to us. Then, the new contribution was that we parametrically generated a completely new synthetic dataset whose inputs we were able to control, and which we could use as a mechanism for safely evaluating these tools. In order to be able to do this, of course, you have to establish that these two datasets are a reasonable facsimile of each other. We can't just use WebFace, which has a certain accuracy under these systems, and then use another synthetic dataset and just compare them.
That doesn't make sense. So, the good news with having parametric control of the synthetic dataset is that we can simply move down the list of accuracies and find the mix of variables that gets us where we need to be in terms of accuracy, and what we ended up doing was reducing the incidence of unrealistic poses. So, for example, these models really struggle with poses like this, which are more natural, not in the WebFace data as you might imagine; it's essentially IMDb scrapes of celebrities. But once we got that under control, we were able to match it almost exactly, and it turns out that using this technique that I'm about to describe, you get to about 74% as your baseline accuracy, which is way off from the 99.4% that most of them are going to tell you in the papers. So how do we get there? Here's an example of two generated lineups from this process I'm about to describe. What you have is the source image, essentially, and then from there, there are five, or actually many more, similar images that are calculated for that face, and we use that as a mechanism for evaluation throughout the rest of the project.
Okay, so going back here, the process for creating these datasets comes from taking a latent representation of every single image in the dataset. For those who may not be familiar with deep learning, essentially, there are things that are unique to you, or unique to a face in any image, that can be captured in a model. In deep learning architectures, this idea of taking a latent representation really is just the features of that face compressed into a space that can be reconstructed later. What's really powerful about this is that because it exists in a space, we can do things like distance calculations, and distance calculations make things like similarity calculations really easy to do. So that gives us the ability to find out how similar every single face is to each other face within the dataset. That, of course, allows us to take the next five or six or 100 most similar and construct, for example, the lineup image that I just showed.
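As an illustration of the similarity step described above, here is a minimal sketch, not the project's actual code. It assumes each image already has an embedding vector from some face model, and it assumes cosine similarity as the measure (the talk only says "distance calculations"). The function name `build_lineup`, the random stand-in embeddings, and the identity labels are all hypothetical.

```python
import numpy as np

def build_lineup(embeddings, identities, source_idx, k=5):
    """Return indices of the k faces most similar to the source face,
    drawn only from identities other than the source identity."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[source_idx]
    # Exclude every image of the source identity, including the source itself.
    sims[identities == identities[source_idx]] = -np.inf
    # Sort by descending similarity and keep the top k.
    return np.argsort(-sims)[:k]

# Stand-in data: random vectors in place of real model embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 128))
identities = np.repeat(np.arange(50), 4)   # 50 identities, 4 images each
fillers = build_lineup(embeddings, identities, source_idx=0, k=5)
```

With `k=5`, the source identity plus the five returned fillers would form a six-image lineup.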
What we do from there is evaluate each model architecture, so ArcFace and FaceNet, and we do this by inserting another random image from the same identity, the source identity, and then progressively degrading that image as we evaluate the accuracy. Accuracy in this case is: do we match the source image with the identity of the probe that we put inside the new lineup? So what you see is four different types of degradations, and the one that you see in the animation here is noise. For noise, blur, JPEG compression, gamma correction, and resolution scaling, we can understand on a curve how the model performs in terms of accuracy, and also where it actually falls below chance. One sixth, of course, is chance in this case. This allows us to understand, visually and also together, what the contribution of these types of degradations is to the effectiveness of the model.
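The degradation sweep can be sketched roughly as follows. This is a toy version under stated assumptions, not the framework itself: only Gaussian noise is shown, `toy_embed` (flattened pixels) stands in for a real model such as FaceNet or ArcFace, the images are random arrays, and all function names are hypothetical. Each trial holds a source image and a six-image lineup in which one entry is a second image of the source identity; only that entry is degraded.

```python
import numpy as np

def add_noise(image, sigma, rng):
    """Additive Gaussian noise, clipped back to the [0, 1] pixel range."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def identify(embed, source, lineup):
    """Index of the lineup image whose embedding is closest to the source's."""
    s = embed(source)
    return int(np.argmin([np.linalg.norm(s - embed(img)) for img in lineup]))

def accuracy_curve(embed, trials, levels, rng):
    """Lineup accuracy at each degradation level.

    Each trial is (source, lineup, target), where lineup[target] is a second
    image of the source identity; only that image is degraded.
    """
    curve = []
    for sigma in levels:
        hits = 0
        for source, lineup, target in trials:
            degraded = list(lineup)
            degraded[target] = add_noise(lineup[target], sigma, rng)
            hits += identify(embed, source, degraded) == target
        curve.append(hits / len(trials))
    return np.array(curve)

def first_level_at_or_below_chance(levels, curve, lineup_size):
    """First degradation level where accuracy is no better than guessing."""
    for level, acc in zip(levels, curve):
        if acc <= 1.0 / lineup_size:
            return level
    return None

rng = np.random.default_rng(0)

def make_trial():
    faces = rng.uniform(size=(6, 16, 16))   # six unrelated toy "identities"
    target = int(rng.integers(6))
    # The source is a slightly different image of the target identity.
    source = np.clip(faces[target] + rng.normal(0.0, 0.05, (16, 16)), 0.0, 1.0)
    return source, list(faces), target

trials = [make_trial() for _ in range(50)]
toy_embed = lambda img: img.ravel()          # stand-in for a face model
levels = [0.0, 0.1, 0.3, 1.0, 3.0]
curve = accuracy_curve(toy_embed, trials, levels, rng)
crossing = first_level_at_or_below_chance(levels, curve, lineup_size=6)
```

Swapping `add_noise` for a blur, JPEG compression, gamma, or rescaling function, and `toy_embed` for a real model, would give one curve per degradation and per architecture, with the one-in-six chance line as the reference.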
This is important because real-world images are subject to all kinds of degradations. Sometimes they're blurry, sometimes they're low-resolution, sometimes there's noise, all kinds of things that affect the process. Again, the datasets that these models are usually evaluated on have been used for a long time, and so the training methodologies overfit to the types of noise or types of errors that they have. This framework can now be applied across any kind of model, it really doesn't matter anymore, and we can find out just how effective it is. I mentioned it before, but it bears repeating: using this framework on tasks more similar to the ones it will likely be used for, instead of getting 94%, 80% or above, which is what a lot of people use as their justification for forensic applications in court systems or with judges, really what happens is a lot lower. So what we probably should do is take approaches like this, since it works across all the different tools, whether they're a software architecture that's been deployed in public or a research application, and really have a conversation about whether or not they meet the standard for what we would accept in a criminal justice use case. So, three outcomes so far. The research is still ongoing. Actually right now,
I have like 14 different compute clusters running, so quite a bit to show just a bit later. But one is the dataset that we created, which is going to be really useful going forward. There are no real humans in it, of course, and we were very careful to make sure that we were matching the population according to social science and also according to the research work about how to construct datasets in a humane way. Secondly, we discovered a ton of points of tension, which are around scope for this, in the creation of synthetic datasets, which was the foundation of the paper we submitted. And then finally, the framework overall for being able to systematically evaluate facial recognition systems are adaptees. Audience member: What's the success here? Is it the production of false positives, or what's a success in your eyes in this research agenda? Justin Norman: I think a success would be to be able to deliver a framework for evaluation to defense attorneys and to judges, to officials who are required to make decisions about whether or not a system's effective enough for use against the task, and, in an empirical way, to be able to say: hey, listen, I'm not the one who's actually making a decision about whether or not it should be used, but here's a mechanism for how you can tell where it actually is according to the task that you are asking it to do. Today, what we're doing is using expert witnesses or using the stated results from the lab, and that's not really something that a person who's not a professional in the field has a lot of experience with, but hopefully, that's going to be the best outcome. The follow-up that we want to do is about figuring out whether fine-tuning using these synthetic datasets is a mechanism for improving performance, especially for the underrepresented groups that are going to be considered by these systems, without involving humans in that area, so that is an active area of research. A lot of people are doing work in creating the datasets. They're not
doing it in a way that I would consider to be socially or social-sciences or SDS conscious. So we're adding that contribution to the dataset world. Audience member: That's really interesting. One of the questions that I have is about the outcomes. Lots of these systems are not... I don't necessarily know if they're being... What I'm thinking about is the case of Robert Williams; that was an assessment made in terms of beliefs. So I'd imagine it could also be a tool that's used at the point of sale, evaluating the systems as a sort of public oversight tool. Can you speak about that a little? Justin Norman: Yeah, I think that's a hopeful goal as well. You're absolutely right that a lot of the sort of in-the-wild or in-the-field use of it is how it ends up.
It doesn't have evidence though, which is considered in a judicial process, which is something we still want to consider, but in terms of public evaluation, we're going to release the evaluation framework publicly. That's one of the main goals here. What I would hope would happen in that space is that there would be a repository of different papers, systems, and their performance, especially according to different cases. So for example, a woman with dark skin who happens to be in a dark lighting area with more noise in the picture.
Those sorts of things are ultimately what it comes down to in these expert witness cases. It's not so much: is the facial recognition tool accurate? You have some idea of that. It's: is the condition that it was in, the image that you have, sufficient to identify the person? That's a much more difficult question, and oftentimes it just comes down to human intuition. So what I would hope is that in the real-world setting, we'd be able to overlay it on top of some of the systems, and if the quality was just too poor and we knew that the model that was behind it wasn't useful or wasn't accurate... but that's a hope that I think will take quite a bit more work to achieve. Audience member: So if I understand what you're doing here, you're generating synthetic images of faces and doing a critical comparison among those synthetic images, but you could also do the same thing with IMDb and actors, right? You did? OK, is the challenge there that the poses that you get from IMDb are like, I don't have that variability [inaudible] Justin Norman: It's actually pretty much the reverse.
The issue is that you can't control them. So in WebFace, or even Labeled Faces in the Wild, which has labels, there are no parametric controls, so you can't sweep a face for five degrees, ten degrees, 15 degrees; you only have the ones you have. For example, in WebFace they're wildly different: some are a resolution of 4K, some a resolution of 160 by 160, some are noisy, some are not. The synthetic dataset gives you the ability to control that and essentially to walk the model through each one of the inferences to see exactly how it performs in that case. Many people have been trying to do things like this with real-world images, but the challenge is you run into a literal dataset size issue, and at this point, many of them, rightfully so, have been removed.
So there's not actually a way for the public to evaluate them all. So that was the reason for this. Audience member: Have you thought about other ways of using Stable Diffusion type models to get more realistic controllability over those? Justin Norman: So I want to do another paper on this whole thing, but the answer is yes, and there were a ton of artifacts. For those who don't work in this area as much, essentially, with these extracted features that I'm talking about, the model doesn't really care where they come from. It's just making a completely agnostic decision about what it thinks the image looks like, and so if you generate an image from Stable Diffusion, even imperceptibly, there are areas in the pixels, or blurs, or things that are just hallucinated. Those become a part of what the model thinks you look like, or the image looks like, and of course, that breaks the system.
So oftentimes, it will look at two identical individuals, or people that were generated from the same prompts, and will think they're not the same, so it just fails. So I would love to expand this to a much more robust review of the different diffusion and different generative techniques.