AI @ IA : Research in the Age of Artificial Intelligence — Internet Archive's Annual Celebration

Hello everyone, would you please take your seats? The program is about to begin. Welcome, everyone, and thank you for being here. Let's get started and hear first from David McRaney, a bestselling author, a top-100 podcaster, and friend to the Internet Archive. David, please take it away.

For those of you who don't know me, my name is David McRaney, and I really wish I could be there with you in that gorgeous room amongst those somewhat creepy statues you have there. And that's sort of the wondrous thing about technology, isn't it? It extends what human beings can do. In part, that's what we're going to talk about here tonight. I'm a science journalist, a lecturer, a podcaster. I write books about things, and what I mostly cover in research is what human beings can and can't do. Writers like myself sometimes like to think we're outside of technology because we can do what we do with a pen and a piece of paper. But a pen and a piece of paper are technology. Books are technology, and many of these books were written with a computer, emailed back and forth to an editor, researched with a search engine. And the truth is this technology has been extending what we can do all the way back to the printing press, and way before that. And I have to tell you that as soon as you adopt some new piece of technology, a newer piece comes along. Technology changes the world, then the world uses technology to change technology, and then that new technology changes the world some more, and it just keeps... It's a good idea for a podcast, so I'm going to take a digital note on this. So yeah, technology: it's going to change things, some of it for the worse, a lot of it for the better. And where it changes things for the better, it's going to expand the limited capabilities of human beings. It's going to extend the reach of those capabilities, both in speed and scope. It's about a newfound freedom of mind and time, and democratizing that freedom so everyone has access to it. Tonight we're going to look at how technology and the
Internet Archive are extending the capacities of research so that we can see more, so that we can understand more, so that we can access more. And tonight's program is about the past, present, and future of the Internet Archive, artificial intelligence, and that dream of universal access to all knowledge. So thank you very much for inviting me to do this, and I do really wish that I could be there to enjoy it with you. So have fun.

Thank you so much, David, and again, welcome everyone. For those of you who haven't noticed yet, I am your AI-generated host. Also, if you haven't picked up on why I'm a talking bust, we're doing a sort of antiquity-to-the-age-of-AI design theme. I also apologize if I appear quite uncanny, but I'm guessing that many of you are going to be seeing a lot more AI-generated everything in the coming year, so by the end of this event hopefully you're a bit more prepared for that. I was made by some wonderful artists and volunteers, and I'm here because I founded what's considered to be the first successful public subscription library in America, which still exists as a research library today. And tonight is precisely about research libraries like the Internet Archive in the age of AI. So let's start at the beginning of it all by hearing first from the Internet Archive founder. Everyone, please welcome our very own digital librarian, Mr. Brewster Kahle. Welcome.

Thank you, and thank you, AI Ben. So welcome to the Internet Archive's 2023 annual event. I'm really glad that you're here. OK, so we promised a little old to new. I can't quite go back as far as Ben, but I can go back to 1980. I was a student at MIT at the Artificial Intelligence Lab, and we were dreaming of building machines. OK, I had hair. It wasn't great hair. There we were, building machines. We wanted to build machines that could think. Danny Hillis said we wanted to build a machine that would be proud of us. These were heady days. We were trying to build these things, though, and we were
data starved. Even the supercomputer that I helped build in that time held all of 32 megabytes of memory. The disk pack, which was the size of a bar, was 5 gigabytes. We were a long way from making the global brain that we were dreaming of. We all knew what was coming, right? We all sort of understood the graphs and the like. But I thought a role that I could play was to try to build the content, if you will: the data sets, collect the materials. So back then I collected all the phone books of the Boston area just before they were going to go to the printer. I collected the Boston census, which had, you know, how many people were where, but underneath that you could go and deduce how all of the streets linked together. You could make a street map. And reimagining what these databases could be used for, that would be something audacious. We wanted to make a direction assistant. We wanted to make it so that you could use your phone to call up a machine and it would answer: "Hello, this is Thinking Machines Direction Assistance. Can I help you? Tell me where you are and where you want to go, and I will tell you how to drive there." OK, this is state of the art. OK, so we didn't have cell phones then. We barely had modems. And this is what we were able to go and do then: describe how people should go and drive from one place to another in Boston. OK, truth be told, it didn't work all that well, but the future, where machines augmented with data and interacting with people were going to help us in new and different ways... What I didn't realize then, what I think is so important, is that you actually have to have the data in your hands to be able to use it and reuse it in new and different ways, ways that were not imagined by the original publishers of, say, the phone book or the census. That was going to be key: once the data was in our library, we could do new things with it. So we started building and building and building along. So I went
off to build the digital library, to try to create these new, unimagined, and fantastical services that we knew would start to come out.

Roll time forward 40 years and the Internet Archive is maturing. Starting with web pages, which we started collecting in 1996, we now have hundreds of billions of pages from hundreds of millions of websites. Then we started collecting television in the year 2000: Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, NBC, 24 hours a day, to be able to have a record of television news, because we knew it was going to be important. We started collecting in the year 2000, and we made a small service in 2001, but it was not until 2009 that we started to actually build the service on top of that. Hundreds of thousands of pieces of software are now archived. Seven million digitized books; we digitize about a million books a year, which is excellent. Music, software, all these sorts of things. It's a service that's used by about two million people a day that go to the website. There are about five million people that use the Internet Archive every day and they don't even know they're using the Internet Archive, but we're there for them. And the idea that this has turned into the 200th to 300th most popular website, I think, is kind of a testament to not only this group but also the world that actually wants to see older materials. The Internet Archive works by weaving together data and machines and empowering people to pull off some amazing feats.

I found that a way of describing the Internet Archive is through Wikipedia. So Wikipedia, everybody knows Wikipedia. It's kind of awesome, one of the marvels of modern times. But it's written by real people that need libraries, and there are also people reading these that want to go deeper. And so we've been working with Wikipedia to go and run over Wikipedia for the last 15 years, collecting any new URL that's referenced by any Wikipedian and archiving it, and we knew it would be important at
some point. So then about, I don't know, six or seven years ago we made another robot that went and fixed broken links in Wikipedia. And it turns out that links rot after about 100 days, so a lot of them were gone. And as of a couple of years ago, we had fixed 11 million broken links, working now. For the last... just keep going, keep going. I just heard today we crossed over the 300th Wikipedia language edition that is now getting its broken links fixed. So we're now at 300 as of today, which is kind of great. And another milestone from today is we just fixed our 19 millionth broken link in Wikipedia. [Applause] Yes.

So what we wanted to do is go further than that as well, to go and make it so people could get to the books that are referenced in Wikipedia. So we basically crawled Wikipedia again, tried to find all of the books that were referenced, acquired those books, digitized those books, and tried to weave those links back into Wikipedia. And as of now we've got 1 million links in these Wikipedias to books, and they often open right to the right page. So the idea of going deeper and not being just sort of this AI Skynet thing that's going to, you know, detach from people, but being woven in with people, has been, I think, the great lesson of the last 40 years for me. And it doesn't seem like it's going to stop.

Just as people are now depending on online resources for accurate information, libraries are more necessary than ever. Libraries have been coming under attack, however. So, OK, a lot of good news, but right now it's actually a little bit scary to be a library, and it's not just us. Fortunately, people are starting to pay attention to the plight of libraries, and for the first time in several years they published our words in a real mainstream paper, The Guardian, about what is going on and the forces that are now aligned and attacking libraries. We've probably heard a lot about the book bannings that are
going on, promoted by many politicians, and actually happening and threatening libraries. But you probably don't also know that there have been large numbers of defundings of libraries going on, in such a way that they're just getting stripped down. So legislatures are now aligning against libraries in many places in the United States. But the thing that I think is underappreciated is what corporations are doing through draconian licensing terms that make it so that libraries don't actually own anything digital at all, that they can only get streaming access to these things. So if you go and borrow an ebook, say, from the San Francisco Public Library, you're not actually borrowing their ebook. You're getting passed through to a third-party database that's driven by the publishers, so they can surveil every page turn and they can change that book or delete it at any time. This is just not that great, right? So we're having these kinds of approaches, and the Internet Archive, when it went and digitized its holdings so that it could lend things out, was sued by the publishers because it wasn't part of their streaming vision to go and have all the 20th century available. They just wanted to have their greatest hits available through their service. And the big surprise to me is that the judiciary has now sided with the publishers, not just in our case but in other cases that have been brought against libraries. So we've got a problem out there. People need libraries more than ever, but we have a set of forces that are making libraries harder and harder to happen. So we have to do something more about it. And it was really great that people came to our aid, that when we needed support, people came and protested and helped the Internet Archive to be able to find support for us and to go and propagate the message, just in time for another lawsuit against the Internet Archive. This
one was brought by the RIAA with their major record labels against us for going and having the audacity to make 78 RPM records available, which we had been doing for 15 years. And they didn't go and say, "Oh, you should take these down," and then we refused; they just hit us with a lawsuit. So we basically got this on us, but it's coming at libraries from many different directions. So we need to stand behind libraries more than ever, and I'd like to highlight somebody that's doing an extra amount of that. So we had our protest on the steps of this building to go and show that we were in support of libraries, and our San Francisco city supervisor said she was sorry that she couldn't come, but she had another thing that she might be able to do to help. She said that she could write a resolution, and maybe, just maybe, the other supervisors would support the idea of supporting libraries, unlike what we're seeing in a lot of the country where they're taking away their funding and not supporting them. And it happened: the resolution was written and was passed unanimously. [Music] And for taking this stand, I would like to give our annual Internet Archive Hero Award of 2023 to Connie Chan, the city supervisor for our district in San Francisco. Please welcome Connie Chan to the stage. [Applause] Please say a few words, if you would.

Thank you and good evening. It's really a privilege to be here celebrating with you, celebrating Internet Archive. I am a first generation immigrant. I was born in Hong Kong and grew up in Taiwan. I came here to San Francisco's Chinatown when I was 13 years old and
when I arrived I didn't really speak any English. So I went to this place called Chinatown library and Chinatown library had Chinese books. It was amazing because they were free and there were a lot of them and in a foreign country being able to read books for free in my mother tongue it was amazing and comforting but at the same time it took me another step. I also was able to read
English books and other books and continue on and on my education throughout time. It's the reason why I know libraries to people like me - first generation immigrants, low income immigrants - at that time with my single mother and my brother, I knew that living in Chinatown and having access - free access - to information was part of the critical part of my education and to higher education and I know that I was not alone. And it's true I think that many of you know that we are not alone because we needed library. Library was probably by far the best system that United States had come up with. So when I learned about Internet Archive I was like - what is this wonderful thing? I don't understand it because it's beyond me - it's not a physical library but an online library. And I learn more and more about it and I have a wonderful tour. Thank you to
Brewster. But here I'm not an engineer, clearly. I do not understand internet. In fact, I do not understand the technology, very much of it, but here what I understand when I learn about Internet Archive it is a gem - it's a hidden gem that we can see for few things that I thought it's very critical to humanity. First is freedom, diversity and truth. It's freedom of information, it's diversity of information, but most importantly it's the access to truth and that is what we need and here we are. I really believe we're not just fighting for libraries and access to freedom and information, we're really fighting for our humanity. It is for our humanity. Humanity to
exist. So thank you for this award. I'm grateful, but really this award belongs to Brewster and all the people that have been supporting Internet Archive. I really hope that you will continue to fight with us and stand with Internet Archive. This fight is worth it! Thank you.

OK, we need more politicians like Chan. Tonight you're going to hear how we are strong and we're growing to use the tools of AI to create better libraries and better services that benefit us all. This can be the research library's day, the day we have been building for, the day that we've been collecting for, the day where the collective works of humankind can be more relevant to more people's lives. I thank you very much for coming. For the rest of the show, you'll see some videos, but you'll also see some of the real work that's going on using the Internet Archive's research collections to go and do new and different things and also help build the collections. Thank you very much. [Applause]

We use web archiving every day to save evidence for investigation. We believe this is crucial for journalism. The Internet Archive is indispensable in creating our podcast. A lot of old websites, where you can't find them anywhere, you can find them on the Wayback Machine. Internet Archive is my favorite way to teach history: showing real-life footage of actual events that happened, having them listen to sound bites. Thank you, Internet Archive, for contributing to and enabling my creative practice as an artist, as a video game developer, as a musician, and all the other ways you've enriched the world. By giving me free access to a huge collection of books from the past, the Internet Archive allows me to discover new fields to explore and study. It's so fun and helpful to go back into the Internet Archive and find all of the graphics and websites and images of my earlier projects that I actually personally did not save. Thank the [inaudible] for helping me with my research, specifically in the topics of literature on cultural and
social preservation. Because I'm from Pakistan, most of the catalogues that we have from the British Indian era are locked away. I'm very grateful to the [inaudible]. Thank you. The Wayback Machine is a tool we frequently use here at Newtral to look for deleted sites or online posts that were shared on social media. This allows us to have [inaudible] to show who and what contributed to the spread of misleading or false claims. Internet Archive is the reason this book exists. We use the Wayback Machine to do a fact-check in our daily work. The Internet Archive enabled our fact-checking team at VERA Files to find a record at the Senate of the Philippines website showing that our now president lied about graduating from the University of Oxford. Thanks to Internet Archive, as a writer and researcher I can keep my texts, audios, and videos together. I use the Internet Archive to investigate human rights violations. For the past 23 years, the Internet Archive has made it possible for me to upload audio field recordings I have made at Rainbow Gatherings throughout the world. Thank you, Internet Archive, for enabling our fact-seeking work. Whenever we want to trace the digital footprints for any investigation, the Wayback Machine is the most [inaudible]. And I really couldn't write The Washington Post Fact Checker without the Wayback Machine. [inaudible] believe using the archive has always been and will continue to be [inaudible] in the facts. Thank you, Internet Archive, for enabling research [inaudible]. Thank you, Internet Archive. Thank you, Internet Archive. Thank you, Internet Archive. I'm very grateful for the Internet Archive for existing. Thank you. [Music]

Wow, it's so rewarding to see how people are making use of the archive to create and do great research. But how do we make sure that all of those patrons can easily discover and use the materials in the archive that they need? And how does AI help?

Hello everyone, I'm Dre Camy, and I'm excited to share with you some projects we've been working on that use AI to enable our librarians to make
our materials more discoverable and easy to use at Internet Archive scale. In fact, AI and machine learning have been a core part of our digitization pipeline for a few years now. When a book is digitized, all we have is a bunch of photographs of pages. These photographs are great for humans, who can easily make sense of them, but to make them searchable and discoverable we need to make them make sense to a computer. First, we even have to tell it where the pages are, because these photos include things we don't want, like the scanning bed. Originally, scanner operators had to tediously and manually crop each of these images to correctly identify the page boundaries. In 2021, we trained a custom machine learning model on all of those manual page croppings from the years before to automatically suggest page boundaries to the scanner operators. This allowed them to double their rate of processing and made it possible for us to digitize even more books.

Great, so now that we've found where the pages are, we want the computer to understand the words on these pages. For this we use Tesseract, an open-source, machine-learning-based tool, to convert the images into text a machine can easily understand. It's this process that makes it possible for our books to be searchable, accessible to those with print disabilities through features like read-aloud, and available for bulk research, cross-referencing, and text analysis. Since beginning to use Tesseract in 2021, we've made over 14 million books, documents, microfiche records, you name it, discoverable and accessible, and in over 100 languages.

But since last year, the definition of the term AI has shifted to mean something a little different, and as more capabilities like ChatGPT and large language models have been made available, we've been finding many new opportunities to allow our librarians to process more materials than ever, allowing us to tackle projects we previously couldn't in order to help improve discoverability and ease of use. A key part of material
discoverability is good metadata. Remember, a digitized book is just a bunch of photographs. We need good metadata to know things like the title, author, history, and subjects of the book so that we can correctly connect patron searches to those books. And for some materials, even despite having the book text, metadata can be difficult to source, resulting in books that are a mystery to the computer and thus can be very difficult for our patrons to find by searching. To help tackle this problem, this year we've been piloting the Internet Archive metadata extractor, a tool that reads the book text that we talked about earlier from the front of the book and automatically extracts some key metadata elements. With this extra information, our librarians and metadata staff can match the digitized book to other full catalog records and solve these mystery books. And there are a lot of mystery materials in our catalog. We currently have over 300,000 of these mystery books, and that number continues to grow. We also used this tool in a project this year in partnership with the University of Toronto to digitize over 23,000 Canadian government documents. These documents were unlinked to catalog records and so also had no metadata. Labeling collections of this scale manually was unfeasible, but the new AI tooling allows librarians to make these previously unfeasible projects feasible.

We've also been using AI to help make our materials easier to use for our patrons. For example, our serials metadata team, which works with digitized magazines and newspapers from the 20th century, has always worked to research and add descriptions to each of our periodicals. This is time-consuming, taking an average of around 40 minutes to do that research and then write a description, and there are over 18,000 periodicals that need a description, so this is no small task. This year the team began experimenting with using AI to help in the description-writing process. Given metadata about a periodical, ChatGPT
is asked to generate a description. Here we use ChatGPT's prior knowledge about these periodicals, using it almost like a research assistant. This description is then vetted and edited by metadata staff and finally uploaded back to the archive, where it can help our patrons find the things they need. [Applause] With AI assistance, writing descriptions has gone down from 40 minutes to just under 10 minutes.

Another way we're using AI to improve patron and researcher experience is by extracting table of contents data from books. The diverse structure of tables of contents across different books has made automated extraction difficult in the past. However, with AI a new process has been developed which initially identifies the table of contents using traditional programming and then employs OCR and ChatGPT to extract the table in a structured format. This data can then be used in the book reader UI to help people navigate the book and inside Open Library to help people discover the book.

So that's a lot of projects. [Applause] That's a lot of projects we've been able to make use of AI with. Because of AI, we've been able to create new tools to streamline the workflows of our librarians and metadata staff and make our materials easier to discover and work with for patrons and researchers. And those are just some of the many projects and experiments at the archive using AI right now. Other projects include everything from new summarization to the ability to talk to and ask questions of our materials, to AI-enabled search, to citation parsing, to you name it. And with new AI capabilities being announced and made available at a breakneck rate, new ideas and projects are constantly being added. I'd like to give a huge thank you to everyone who worked on these projects, gave me information, and kindly let me present all of their wonderful work on their behalf. Please now join me in welcoming another colleague of mine from the Internet Archive, Alexis Rossi, [Applause] to talk
about what kinds of research can be made possible when you aggregate these artifacts. [Applause]

Hello everybody. So the work Dre just described is helping us make sure every artifact in this library is well described and easy to find. That work makes it easier for researchers to find what they're looking for, and we've seen so many great projects using the resources in the Internet Archive. Helen Nde and Laura Gibbs use books from our library to study African folktales and share them with the public. Laura even wrote an entire book showing people all of the tales that she found here in the Internet Archive. We see news stories come out pretty much every day that use the Wayback Machine as a resource. These are the outlets that used the Wayback Machine just this year for their reporting and fact-checking. Journalists like Philip Bump from The Washington Post use aggregated data from our TV archive to report about the media bubbles that we all live in now.

Libraries build collections to facilitate research. Sometimes we can anticipate the types of research that people will want to do; people have been using books one by one for thousands of years to learn. Other times, new uses emerge that we didn't anticipate, and AI is showing us what some of those uses might be. Let me tell you a story about why it's so important to have these large collections of digital materials. Probably everybody here has been surfing the web and you find a page in German, say, and Chrome pops up and says, "Hey, do you want that in English?" You click yes, suddenly it's in English, and you can read it. That's the magic of machine translation. So how do you teach a computer to translate between languages? Essentially, you provide the computer with millions of sentence pairs and the computer teaches itself. That's the artificial intelligence at work. A sentence pair is the same sentence represented in two different languages. It works like the Rosetta Stone. So the more sentence pairs you provide, the better the translation
will be. For languages with lots of data, like German, French, and Spanish, the translations are pretty good. But when you have a language where there's less data available, less data equals worse translations. That means that a language with fewer speakers is less accessible online because our technology hasn't learned yet how to deal with it. So a few years ago a group of European researchers from the University of Edinburgh and other universities, funded by the EU, came to us asking for web content in European languages, including these underrepresented languages, so that they could try to make better translation models. Web pages like the ones stored in the Wayback Machine are a great data source for this. Sites in different languages give you that Rosetta Stone situation, right? Same content, different languages. So we put together a set of data for them, and then the researchers went and did the hard part. They figured out which pages were translations of each other, and then they matched up the sentences, got rid of all of the noise in the data and filtered it, and then they came out with open-source data sets of these sentence pairs. Now, for some languages this wasn't a big deal. German already has lots of stuff; it wasn't that big of an addition. But for other languages the difference was huge. For instance, they more than doubled the number of sentences for Latvian and quadrupled the number of sentences for Romanian. Exactly. That allowed them to drastically improve the quality of these translations.

Now this might seem kind of academic, and like, "Cool, why do I care?" But this has come back around full circle to benefit the public. It turns out that the sentence pairs that came out in these open-source data sets are now part of the underpinnings that allow Firefox to translate web pages for you, including in some of those underserved languages. Yeah. So this open-source data is helping to level the playing field so that a nonprofit
open-source browser like Firefox can compete with a corporate behemoth like Chrome. That's amazing. And we are so happy that researchers can use these large data sets from libraries in ways that we never dreamed of.

So what else might we be able to do with languages? According to Wikitongues, there are about 7,000 languages spoken today, 3,000 of which are endangered, and only 5% of languages are well represented online. Now with stats like that, it is understandable that there are concerns that technology is leading to the demise of some of these smaller languages. But stories like the one I just told you show us that technology can also help us make these smaller languages more accessible online. Several years ago we worked with the nonprofit PanLex project and the culture office of Bali to digitize all of the palm-leaf manuscripts written in Balinese. Yeah. Balinese locals helped with the translation and they also transcribed some of these. And if you want to know how underrepresented Balinese is online, we had to modify UTF-8 so that you could see all of the characters on your screen. But now we have these digital seeds for Balinese. Can we use them to increase online access for Balinese speakers? Exactly. To do work like this, researchers and libraries must be able to collect large amounts of digital information, and the researchers have to be able to access it. They need this data so that they and their machines can learn and help create tools that help us talk to each other. We live in a world with so much conflict. It's vital that we preserve languages, yes, but also the cultural artifacts. We need to keep them safe and accessible in our libraries. [Applause] Yes. I will leave it to Quinn and Alyssa from Stanford University to explain why. Everyone, please welcome them to the stage.

Yes, there we go. Hello and thank you. My name is Quinn Dombrowski, and I am digital humanities staff at Stanford, also teaching DLCL 103, Future Text: AI and Literatures, Cultures, and Languages, this quarter
with Laura Wittman. I'm a former medieval Slavist and also co-president of the US professional association for digital humanities. Immediately following Russia's full-scale invasion of Ukraine in February 2022, a group of volunteers from across North America and Western Europe came together to found Saving Ukrainian Cultural Heritage Online, or SUCHO, which went on to archive over 50 terabytes of Ukrainian cultural heritage websites. The Internet Archive has been an essential partner in this work from the very beginning, scaling up their web archiving capacity in response to demand and developing new tools to allow volunteers to work faster and more efficiently. In summer 2022, Anna Rakityanskaya, a SUCHO volunteer and a Slavic librarian at Harvard, proposed that SUCHO capture memes from the war. Anna collaborated with my colleague Simon Wiles, a digital humanities developer at Stanford, to develop the SUCHO Meme Wall, which shows off all the memes that our volunteers have collected, translated, and created rich annotations for by hand. We've had people approach us asking about whether AI could play a meaningful role in SUCHO, and we've always said no, because we wanted this to be handled with extreme care and accuracy, especially when it's a task that we know will be a meaningful way for people to come together and help when they would otherwise sit paralyzed, alone and doomscrolling the news. We shared our diverse expertise, technological, linguistic, cultural, and learned from one another, and then taught the next round of volunteers. We've still got a long way to go for machine interpretability of a lot of memes. So let's take this one as an example. It's my nine-year-old's favorite, and I imagined it'd be pretty easy for DALL-E 3 to interpret. We start with Operation Z, which is the Russian name for their war, and in the second panel, Control-Z: the Ukrainians have deleted the warship by sinking it. What we got instead was the guess that the ship had reversed course, undone some previous action,
when the ship itself had been undone especially in a world where so much AI training data is created under exploitive conditions we made the choice to support and empower people who wanted to help preserve protect and showcase Ukrainian culture and in doing so we have created to our knowledge the largest set of real-world memes with this kind of extensive annotation about templates people and events as well as the transcription and translation of Ukrainian if there's a future for AI-powered meme collection and annotation it might start with this data set but these memes mean a lot more than just data and for that I will turn it over to my friend and colleague Alyssa Burker hello my name is Alyssa Burker I am a student at Stanford and I'm currently teaching a Ukrainian language course there is no relief during the war but there is always a way to feel closer and more connected to others while so many Ukrainians have been separated from their families and loved ones one of the most essential ways to connect throughout the war has been through memes there has never been a war where we have had as much access to real-time coverage as we have during Russia's full-scale invasion of Ukraine today thanks to SUCHO and the Internet Archive we have the ability to not only document Russia's war crimes which is absolutely necessary but to save vital elements of Ukrainian culture perhaps none of which have been as vital for the survival and optimism of the Ukrainian spirit as memes it is truly impossible to overemphasize the importance of Ukrainian memes for Ukrainian people in this war whether Ukrainians are hiding in bomb shelters right now in occupied territories or have fled as refugees they are united with their community through the collective experience of memes in the famous case of Chornobaivka Ukrainians used memes to transform Russia's repeated attempts to take control of the Chornobaivka airport in Ukraine into a legendary and hilarious example of
Russia's utter failure and incompetence there is not a single Ukrainian that did not experience relief and laughter from Chornobaivka [inaudible] yes this even includes my 91-year-old grandfather Russians regularly bomb homes and hospitals but they also make a concerted effort to bomb museums schools universities and libraries in an attempt to fulfill their stated purpose of obliterating Ukrainian culture by preserving memes that carry so much history and emotional connection SUCHO and the Internet Archive actively resist Russia's goal of the complete erasure of Ukrainian culture when I look through the SUCHO Meme Wall today I remember each meme as a cultural moment that helped me and millions of other Ukrainians gather strength and emerge from truly the darkest of places as a Ukrainian American in the United States these digital assets have given me tremendous insight understanding and empathy towards what those in Ukraine are going through today despite the most horrific of circumstances in hiding in occupied territories and on the front lines Ukrainians never give up their fight for their freedom their culture and human values as is exemplified in the saying be brave like Ukraine I am humbled by the opportunity to tell you the story of so many Ukrainians today and I am truly overwhelmed with gratitude for the Internet Archive's invaluable role in ensuring our people will never be forgotten thank you in the face of the struggles and injustices the Internet Archive is facing today I want to remind us all to never give up and be brave like Ukraine [Music] [Applause] Slava Ukraini Quinn Alyssa thank you so much for sharing this beautiful project these digital meme collections are not just a source of connection for those alive today but are precious historical artifacts like World War II posters or letters from the Civil War those in the future will be able to comprehend more about this moment in time thanks to them but there is still so much we have to understand about the
world today here to talk about a decade's worth of work turning the Internet Archive into a research platform please welcome the founder of the GDELT Project everyone please welcome him to the stage [Applause] thank you so much it's truly an honor to be here tonight so what is the Internet Archive to most of you you probably think about the web and the book archive to me what I'm so fascinated by is the television news archive 100 channels from 50 countries on five continents in 35 languages over portions of the last 20 years one of the most incredible archives of visual storytelling of global events now about a decade ago the founder of the TV News Archive Roger Macdonald reached out and said how can journalists and scholars use this incredible archive to tell the stories of the world to understand when we turn on the news what are we hearing about and how is it framed so one of our very first collaborations was to map the geography of television in other words when I turn on the television where am I hearing about we actually made this incredible map it was like raindrops on a map every time a location was mentioned now this in turn led to something called the TV Explorer so this idea of taking closed captioning and allowing you to keyword search it so journalists for example could ask how much attention is COVID getting right now how much attention is inflation getting all the major events Ukraine all these different events across the world how much attention are they getting how are they being framed and in turn you think about television it's not just the spoken word it's the on-screen text that goes with it so in the same way that with books you take a photograph of a book page and you use technology to turn that into text we did the same with television to extract all that on-screen text one of the earliest examples that we did was we took Donald Trump's tweets and we scanned for
them across television news and showed how he was able to drive the cable news agenda from his tweets now in turn this begs the question of the connection between television and the online world so one of the things we've done is so here's a clip this is CNN during COVID and it just said from Russia somewhere but where what's the story behind this clip so we showed how you can take a clip from television and scan the open web for that and this text you see beneath was the description of the video when it first appeared on the web so being able to connect across modalities and also fact-check them so this is a fascinating example we took a known fact check an existing fact check and we used new AI tools to scan television news for any reference related to that and this is really powerful for fact checkers to be able to say where's this narrative gaining traction so last year when Russia invaded Ukraine Mark Graham at the Internet Archive championed this idea of how do we preserve this is a huge moment how do we preserve Russian Belarusian and Ukrainian television and that led to this incredible archive of what were the narratives how was each country telling the story at this moment and so then the archive came to me and said well how do we make this accessible to journalists and scholars we have this incredible archive how do we make this accessible so the first thing that we did was create something called the Visual Explorer so we took each broadcast and every four seconds we extract one image and we make a thumbnail grid and so this is a broadcast so you think about television it's linear it just plays plays plays well by making it something like this you can skim television now so if I want to know did Vladimir Putin appear anywhere in this broadcast how much military imagery was shown how often was the Z shown at this point this is early on in Russian television
I can do that I can scan all of this very rapidly as a human being so remember also sometimes the most powerful AI is AI that amplifies us as human beings to be able to use our ability to understand and kind of gets rid of a lot of that grunt work now of course a lot of television across the world is not closed captioned so starting early on we applied Google speech-to-text technology to transcribe these in Russian and then used Google Translate to translate these into English now again that's far from perfect but it allows journalists now to go through this and say what are they saying how are they spinning these narratives what are they paying attention to and that's incredibly powerful and fast forward to today we're using a new tool so Google has something called Chirp it's a large speech model it's essentially sort of the new era of speech recognition it recognizes over 100 languages but what's most interesting about this and this whole generation of new tools is they're multilingual so this is an actual broadcast this is Chinese state television three languages in 60 seconds we've got English Mandarin and Arabic in 60 seconds in this particular clip all transcribed right there this is incredibly powerful for the first time we can start studying how do multilingual societies tell their stories it's really incredible what questions we can actually ask now but of course what makes television news so powerful is the visual dimension if you took all of that television and you start looking at it like this you start looking at all the stories across the world but look at this all the visual stories so how can a machine help us make sense of the visual dimension of all this so we've been exploring how a variety of different AI tools can help us understand visual storytelling not just the spoken word not just the on-screen text but the imagery the
visual metaphors so one early question was to say what if we took Russian television for a year and folded it on itself compared every second to every other second so we can actually trace clips and see how those clips are being reused but more interestingly visual metaphors so things that are not the same but have similar color schemes similar visual styles it turns out there are some really fascinating things you can do with visual metaphors now facial recognition is a very scary area so the way we've been approaching it is for major public figures handing it a picture and saying find others that look like this so in this case Tucker Carlson we knew that he appeared a lot on Russian television but how much so we literally took his picture and we were able to track his appearances across Russian television this is really powerful and documents just how important he was to telling their narratives but then we took an episode of 60 Minutes that's a famous Russian show and what we did is we extracted every face that appeared on there and who co-occurs with whom now this is really powerful who is telling the story so we can see complexity here but we can see Olga here at the center she's kind of the star of the show there we can see her at the center there now we scaled this up to an entire year of that show and we can see all these complex dynamics with her at the center now this is really powerful that we can take these tools so when you think about it it's not just well who's in this image what we're more interested in is questions like this how can we use this to understand visual storytelling now computer vision historically was predefined categories about 30,000 objects and activities that machines could understand so in the early days of COVID we said what's different about COVID television coverage compared to pre-COVID the answer books everywhere but not on every channel and this was really interesting to us now this is
really powerful and we used this on Russian television to show military imagery in the early days of the war and then Russia realized it was losing less and less and less coverage and then they felt that they were gaining some ground so then they start ramping up again so this is a really powerful way of kind of understanding how governments are portraying in this particular case how does Russia feel about how it's doing on the battlefield and this is really powerful but still it's a predefined category and limited to what someone else came up with so there are tools today we have a demo of this you type in an English language description like soldier in front of a flag and it will find imagery that matches that but here comes a really cool part what about the inverse so we have a golden retriever detector so this is part of the AI Explorer it actually scans for golden retrievers on television news so here's an example of one and we asked the machine describe this image everything you see there was written by a machine this is where we're at today this is really powerful the ability to have a machine watch television and tell us about it but there are a lot of limitations what you're hearing mostly about generative AI today is the hype the promise there's a lot of limits so hallucination you may have heard that term before so we handed it a broadcast about the Chinese spy balloon and asked it to summarize it and it became a nuclear-capable hypersonic missile aimed at the American homeland not quite what you want for a television summary false transcripts it said NATO fully praises Putin and says he did a great job plagiarized summaries so sometimes you say summarize this and it goes out and it finds clips from across the web of people saying similar things and glues those together now bias is a really scary thing so for about 60 or 70 years now we've had keyword search so if you have a collection of biographies and you type in CEO
you're going to get the biographies that mention the word CEO the most but nowadays with semantic search if you run the same query with a semantic search engine white men first minority men second women last this is a huge issue that people are not really everyone's kind of rushing into this space without if you ask it to make a summary to summarize that it's even worse so these are really big issues to think about distraction summarize this Russian broadcast in English midway through it sees a reference to Rome gets distracted and starts summarizing in Italian and then this is a really scary thing machines are not really good at understanding what country has supported Ukraine the most Russia because it has delivered the most weapons these are really scary things that these machines can do but there's still a lot of powerful things that we can do we can take a day of COVID coverage and say make a narrative map of everything that's being said and how it's interconnected we can have a machine watch an entire day of Russian television and summarize it moment by moment everything here was made by a machine but of course why summarize well because you want to do something with it so we had it watch a day of Iranian television and said find every reference to the nuclear accord and any criticism and write a state point-by-point rebuttal digital diplomacy automated diplomacy now again is this a wonderful future of diplomacy and an amazing thing or is this a really frightening future that has a lot of danger for society these are really fascinating questions either way this future is here but it's our shared future it's up to us to decide because again hallucination all the limitations that go with this and all the impact on society do we really want machines to be writing all this stuff for us these are huge questions and finally I want to give a huge shout out to Tracey Jaquith she's the architect of the TV News Archive
I can't see where she is but give her a huge shout out too she created a really neat tool recently it takes the on-screen text and then summarizes it you can go to archive.org the television news archive section of it go there today and you'll actually see this and it's using this on-screen text and doing a live summary of what's being said on television each day again the incredible power of making this content more accessible and thank you so much hopefully this has been an inspiration to you of what's possible today thank you so much [Applause] all right well speaking of narrative maps and big questions hi everyone my name is Jamie from the Internet Archive and by show of hands who here is just crazy excited about AI okay okay is anyone kind of uneasy about AI not sure where it's going to take us okay a lot of people yeah okay is anyone upset or angry or really afraid by show of hands of AI democratizing catastrophic weapons superintelligence that sort of thing okay there's a few of you yeah okay well now all of you have your own reasons for why you believe things differently you have your own experiences and your own insights that inform your point of view so at the Internet Archive we've been pretty sincere about trying to understand what people's different views are and why they disagree so that we can inform how we as an organization can serve both the public and our mission in this time of intense technological change so we started a series of hackathons and invited people from different groups who hold different views to come and debate do research and have conversations we invited people from alignment researchers to those who want to accelerate AI and we asked them to do research and answer questions like well what do you actually mean when you're using the term AI and what kind of risk do you perceive with AI we also borrowed some deeply wicked ethical questions that were posed by OpenAI this summer like should AI ever be used to instill
beliefs in people we came up with over 800 topics of debate about artificial intelligence just as a start but things are changing so fast and these debates are ongoing and so I really don't think there's any way we can just organize enough hackathons to sort out 800 AI debates if you are relying on human beings to do the research and the debating so to understand the debates that are happening in AI we turned to AI itself to help us research topics and map debates so instead of collecting arguments from people showing up at a physical place at a specific time and this is a story of AI working out really well one of our hackathons created an autonomous research agent to crawl through the web and identify claims related to topics on our list when it identifies a claim that's relevant to us it summarizes and extracts it we also created a prompt-based model that extracts arguments claims and evidence from entire artifacts like open access scholarly journal articles and websites and then it filters out all of the irrelevant claims a secondary model interprets the correctness of those extractions because of course you've got to look out for the hallucinations but in the past day alone we extracted over 23,000 claims from 500 references for about $15 and this rate is approximately 12,000 claims per hour with just one machine running I actually have a background in this kind of analysis work doing it by hand and the fastest I've ever seen a human being do this is under 300 claims per hour we also built a prompt injector which creates a sequence of prompts with few-shot examples to identify positions that people take on questions about AI to give us a sort of top-level scaffolding of the debate then using this tool we generate arguments across ethical environmental economic and nine other categories which support or refute those positions so let's dive into just one example question should we
regulate AI to find the high-level general positions people take we used our prompt injector that was tuned with those few-shot examples to pull data from ChatGPT to give us the high-level positions so this was machine generated one position we should allow technology companies the freedom to develop AI technologies as they see fit with minimal government interference another one we should impose strict laws on the development and deployment of AI technologies to ensure safety here's another position heavy regulation is unnecessary as the AI industry is mostly self-regulating capable of learning from its mistakes and improving I chose those ones there's a whole bunch of positions that it generated okay the prompt injector then prompts GPT-4 to identify likely arguments within the 12 different categories and any additional ones which we may want to add for example an economic argument in favor of regulating AI may include regulating AI has economic benefits as it prevents unchecked development that could lead to financial harm high-risk low-probability threats such as unchecked AI magnify financial risk by increasing the likelihood of rare but costly disasters these disasters can potentially cascade globally massively impacting the concentrated tech sectors and therefore disrupting fragile global economies that was created by AI an argument against the regulation of AI there is a concern that government regulation could lead to a convergence of AI technologies towards a one-size-fits-all standard stifling diversity and reducing the potential benefits of competition and variety in the market so now we have this framework of the debate these high-level positions corresponding to these questions which we can actually start modeling in a graph from there we can take the claims and evidence that we've extracted from the bottom up and start connecting them with the top-down positions that we had generated it's a sort of connection of the scaffolding we're still
working on integrating the logical coherence middle part which is actually a lot of work but already these maps as is are incredibly comprehensive the map that I showed you earlier actually has over 540 of those arguments but who wants to look at a map besides me so we decided that we were going to create a tool to make it easier to see we created a tool that summarizes these claims and visualizes them as an interactive unpack piece of paper so what does this mean this means instead of you spending potentially hundreds of hours researching the different points of view about AI you can instead read a paper which shows you arguments from different points of view and we can actually automate this paper over time as we continuously process more information so AI can help us research and understand what we think about AI the pros and the cons of the technology not by limiting the conversation to just those people who can show up and be in a room but instead by combining the collective points of view from people across the web and across the world by creating new infrastructure that can accommodate it our goal is to automate the creation of these maps with these various tools that we've built and then link evidence to claims which we will do through techniques like retrieval-augmented generation then of course give this information away for free as a library like I said before having built these deliberation graphs by hand for years I can tell you very definitively that putting together these materials with automation is going to save thousands of research hours for any one reader and for this project we still have again hundreds of debates and much more work to do since much of the
2023-10-19 03:42