Revolutionizing Image Search with Vectors
>> Hi everyone, you're not going to want to miss this episode of The AI Show, where we're going to learn how you can revolutionize Image Search with Cognitive Vector Search. Let's jump in. [MUSIC]. Today, we are joined by Varsha. Welcome, why don't you tell us a little bit about who you are and what you do? >> Thanks, Cassie. Hi everyone, I'm Varsha Parthasarathy, product manager on Azure AI, and I focus on Computer Vision technologies.
I'm happy to be here, excited to share more about image retrieval using Vector Search. >> Cool, so we did a show recently showing how to use Cognitive Search for vectors, and it focused on text, but this time we're going to be looking at images. Can you tell us a little bit more about how Image Search works within Cognitive Search? >> Yeah, image search or image retrieval systems have traditionally used features like tags, labels, or even descriptions that are extracted from images to compare images and rank them by similarity. Keyword search is the most basic form of this information retrieval approach, where the search engine looks for the exact match of the keyword entered by the user and returns images that contain the exact keywords stored as content labels or image tags. Keyword search, as you can imagine, relies heavily on the user's ability to input specific search terms. Let's take an example. I love hiking, I'm in the Pacific Northwest, and I'm trying to find images of hiking with friends in the rain.
Again, hiking as a word or term is not used globally. It could be bush walking in places like Australia or trekking in places like Asia, so if the images aren't tagged with hiking specifically, you wouldn't be able to find those images. Now, contrast that with vector search, where the semantic meaning of the user's search, or the context within the search query, is compared with the contents present in the image. Even if you search for something like trekking or hiking or bush walking, all of those search queries would lead you to the same images, the ones that are relevant to the search query. In this case, for Vector Search, both image and text are converted into vectors and stored in the same high-dimensional vector space. I'll show you a demo in a minute of how Florence enables vector search within images. It enables both text-to-image as well as image-to-image search.
>> Cool, so that really tells me a lot about how it can enhance maybe some of the current features I might have to solve, you know, searching for keywords within images. Now I'm going to be able to get a much better response, because I'm going to be able to have a more semantically correct search using, you said, the Florence model? Can you tell me more about what this Florence model is? >> Yes, great question. As we know by now with ChatGPT, DALL·E, Bard, the industry is definitely moving towards large foundation models, and Microsoft has also invested in and trained a large foundation model for vision- and language-specific tasks, the code name being Project Florence. It's a transformer-based model trained on massive amounts of data, billions of text-image pairs, using self-supervision. Now, why is Florence important? Traditionally, to train a computer vision model you'd collect a relevant training dataset and label it for a specific task, say Object Detection or Classification.
You train a model, label the data for that specific task, and for each task you train an independent model. For a foundation model, which is what Florence is, you train one base model using massive amounts of data, and in this case we're using both image and text, so it's a multimodal base model. Then you have adaptation models that help you fine-tune this base model for individual tasks like classification, tagging, image captioning and so on. It also enables newer tasks like image retrieval or visual question answering, like visual chat, so it's a multimodal base model that enables all of these new user experiences and scenarios. >> I understand now what the Florence model is and the power and capabilities that it's bringing to this task. Where can people learn more about the Florence model? >> We released Florence earlier this year, so there are research papers published, and you can find out more through those: how the model was trained and how the model performs on benchmarks.
>> Then let's see how this works. Can you show me how this works within the Cognitive Search functionality? >> Let me start with the Vision Studio experience. This is a no-code interactive experience where people can try out all the vision features that are available within Vision AI. I'm specifically going into searching photos with image retrieval, so here you can put in a natural language sentence as a text query and find the images that you're looking for within your dataset. Before I jump into the demo, I do want to make it clear that these sample images don't have any metadata associated with them, so no content labels or tags, and we've provided a couple of these image datasets that people can use to try and test the Florence vector search experience. Let me quickly go through maybe one or two of these examples. I'm picking manufacturing in this case.
The images that you see below are the images present in the dataset; there are about 250 of them. Now, let's go search for a query. I'm picking one of the pre-curated ones, so let's take a look at boxes on a conveyor belt, for example. You see images that pop up. Again, none of these images have any metadata associated with them, so it's just pure vector search, and you can see more images as you move the slider. These are all the images that are relevant to the query that I've typed in here.
Now, let's try another one: people looking at tablets together. Again, you see these images popping up, and I can increase the count and see more of these relevant images pop up. You could also try your own custom query, and I can change this to people looking at information on a tablet together wearing maybe safety hats. Let's try that. Now you're seeing images with people wearing safety hats looking at the information together on their tablets. As you can imagine, these are complex queries that we're looking at, and Florence does a great job with vectorization and finding the relevant images in this case. >> What's happening here is we have these datasets, and the Florence model is running and giving the results based on the text prompt that you're giving it, and that's how it's able to get these. There's no metadata, there's no text associated with it, it's all with the power of Florence.
>> Exactly, so let me go to a concept page. This is a publicly available page within Azure Cognitive Services, so let me just go through what's happening in Vision Studio. We saw a bunch of photos within the photo dataset or the photo library. You vectorize those images using the vectorize image API. Same thing with vectorize text: you take the user query, which is text, and vectorize that. Now that you have vectors in the same multidimensional space, you store them in a vector database.
Then you apply something called a similarity function, and here I'm using cosine similarity, but you could essentially use any distance function to measure the similarity between the two vectors. You find the top n, the top 10 or 5 or however many image vectors you're looking for, map these vectors back to the images, and provide those as results. That's essentially the concept of how the image retrieval system with vector search works. I do want to go back real quick to the studio aspect. We just saw the curated sample datasets with the text query. You could also try it with your own images by signing in to your Azure account. You can bring in 500 images, quickly create a POC, and see how Florence works with your dataset.
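The retrieval step described here (compare the query vector with every image vector using cosine similarity, keep the top n, and map back to image IDs) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors and file names are made up, real embeddings have far more dimensions, and a vector database would replace the in-memory dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_n(query_vector, image_vectors, n=5):
    """Return the n (image_id, score) pairs most similar to the query."""
    scored = [
        (image_id, cosine_similarity(query_vector, vector))
        for image_id, vector in image_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy vectors standing in for real image embeddings.
image_vectors = {
    "boxes_on_belt.jpg": [0.9, 0.1, 0.0],
    "people_with_tablet.jpg": [0.1, 0.9, 0.2],
    "forklift.jpg": [0.7, 0.2, 0.6],
}
query_vector = [1.0, 0.0, 0.1]  # stand-in for a vectorized text query

results = top_n(query_vector, image_vectors, n=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

Because text and image vectors live in the same space, the same `top_n` call works whether `query_vector` came from a text query or from another image.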
With that, the demo is limited to 500 images. If you want to create your own API dataset, I would highly recommend getting started with Cognitive Search, which makes it easy to set up a vector database, create a search index, and have that end-to-end search system set up. >> It's so cool. I think showing how it worked is really helpful, because it seems very magical when you just see it work. But then when you break down step by step how we're taking images and words and putting them into vector space, and if they're closer together they're more similar using the cosine similarity, and that's how you find it, I feel like every time I see that it still is just so brilliant and simple at the same time, and it really puts together how they're able to work so well.
>> Exactly, so now let's jump into how vector search works with Cognitive Search specifically for images. You might have seen in the previous session that Liam shared, we're using the same dataset and showing the same index. In this case it's recipes, with the ingredients present in those foods, so text is associated with it, but in this particular demo I'm only using the images within the document set, so it's only the images of food with recipes associated with them. With that, I'll get started with what I'm trying to do in this Python notebook. I'm importing the libraries that I need to set the demo up.
I'm also importing the index from Azure Cognitive Search, the search index that was already set up. I have a JSON configuration with all the passwords and service endpoints. In this case you will need two accounts, one for Azure Cognitive Search and one for Azure Computer Vision. Both of these accounts have been set up, so I'm just creating those connections and endpoints, so let me run that. >> What is the Computer Vision service used for? Do you need to use that as part of Cognitive Search, or how are those two working together? >> Great question. Cognitive Search today allows you to bring in vectors, whether it's OpenAI vectors for text embeddings or Florence vectors for vision embeddings. For all of these services you would have to first figure out where to generate these vectors before bringing them to Cognitive Search, so for images we're using the Azure Computer Vision APIs to generate those vectors. >> It makes sense.
>> Awesome. Here, I'm defining the two functions that we just spoke about. The first generates embeddings for the image set, and here I have around 2,300 images of food from the recipe dataset.
I'm using the image retrieval API, in this case the vectorize image API, to generate vectors for the image dataset. I'm doing the same for text, for the text query, using the same version, which is another key point. Different versions of the model do have slight variations in the way vectors are represented, and we do not recommend using different versions together.
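The two functions described here can be sketched as REST requests that share one pinned model version. This is a minimal sketch, not the notebook's actual code: the URL path, the `api-version` value, the `modelVersion` parameter, the request bodies, and the `vector` response field are assumptions based on the preview Image Retrieval REST API and should be checked against the current Computer Vision documentation. The endpoint and key below are placeholders.

```python
import json
import urllib.request

# Assumed values; verify against the current Computer Vision docs.
API_VERSION = "2023-02-01-preview"
MODEL_VERSION = "latest"  # pin one version for both images and text


def build_vectorize_request(endpoint, key, operation, payload):
    """Build (but do not send) a POST request for a vectorize operation."""
    url = (
        f"{endpoint.rstrip('/')}/computervision/retrieval:{operation}"
        f"?api-version={API_VERSION}&modelVersion={MODEL_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


def image_request(endpoint, key, image_url):
    """Request that asks the service to vectorize an image by URL."""
    return build_vectorize_request(endpoint, key, "vectorizeImage", {"url": image_url})


def text_request(endpoint, key, text):
    """Request that asks the service to vectorize a text query."""
    return build_vectorize_request(endpoint, key, "vectorizeText", {"text": text})


def send(request):
    """Send a request and return the embedding (assumed 'vector' field)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["vector"]


# Placeholder endpoint and key; nothing is sent over the network here.
endpoint = "https://example-resource.cognitiveservices.azure.com"
image_req = image_request(endpoint, "<key>", "https://example.com/fish-tacos.jpg")
text_req = text_request(endpoint, "<key>", "fish tacos")
print(image_req.full_url)
print(text_req.full_url)
```

Routing both helpers through `build_vectorize_request` means the image index and the text queries cannot drift onto different model versions by accident.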
>> You want to make sure that you're using that same model also when you create your index and vectorize your images? >> Exactly. >> The model that you're using for processing text and processing images needs to be the same model, in the same vector space, which is where you can do a direct comparison and find the semantically similar images for that text query. >> Makes sense. >> Awesome, so moving on, I'm just running this code to vectorize images and text, and that runs successfully. Now for the fun part. Here I'm going to show both text-to-image search as well as image-to-image search, so let's try something like fish tacos. I'm taking the top five images and you'll see the results at the bottom.
I'm displaying both the score and the image ID here, and again the image ID is important for us to get the image back and map it to the relevant document that this particular image is part of. >> Just to go over the call that was made, this is the Python API or the SDK, and we're calling the search client. We have the fish taco text and you're sending that in. We're generating the vector for that text, then we're sending it into the index that we've already created, and then it's going to return the images that are most similar. >> Exactly, so fish tacos, again an easy one, and you can also see the score of the distance function, so these are [inaudible] similarity scores that it produces. Let me just scroll down, so these are the images that we're seeing with fish tacos. They all look great, so let's try something a little more complicated: garlic shrimp with green onion.
Let's see what that would generate. Let's try something else, maybe ramen with boiled eggs on the side, which is my standard order. Let's see if that works. >> Every time we do this dataset it makes me hungry. I'm still hungry now. Oh, that looks amazing.
>> Clearly it's doing a great job with image search, and you could also type in something like, I don't know, it's probably bad. [inaudible] does a good job of just figuring out what the user is asking for. Again, the tone doesn't need to be specified, and it doesn't need to be correct with respect to spelling errors, and as you can see it's giving me results for shrimp pasta or spaghetti pasta with shrimp. >> What other use cases are you seeing customers or people interested in using these features for? >> Definitely a lot in digital asset management, so there's a huge need to be able to reuse the assets that were created by their marketing agencies or by people licensing data. Again, there's a huge cost nowadays to have data licensing done, so you want to be able to utilize whatever data you're buying or licensing to the utmost. Discovery and search, being able to find what you're looking for as instantly as possible, are the features that we're seeing our customers use it for. The other neat thing that we're also seeing people use it for is more like product recommendation or product cataloging, where they're not only using image search as the primary vector search but also augmenting it with metadata around it. You can think of PCs having different versions of different OSes, or laptops with specific product IDs, or it could even be clothes with specific product IDs, so you're augmenting the vector search with metadata search. We call that hybrid search within the Cognitive Vector Search pipeline, so you could do something like that as well, where you have the flexibility of doing both.
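The idea of augmenting vector search with metadata can be illustrated with a small in-memory sketch: first keep only the products whose metadata matches, then rank the survivors by cosine similarity. This is a simplified illustration under stated assumptions, not how Cognitive Search implements hybrid search; the catalog entries, field names, and two-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_vector_search(query_vector, documents, required_metadata, n=3):
    """Filter documents on exact metadata matches, then rank by similarity."""
    candidates = [
        doc for doc in documents
        if all(doc["metadata"].get(k) == v for k, v in required_metadata.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return [doc["id"] for doc in ranked[:n]]


# Hypothetical product catalog with toy image vectors and metadata.
catalog = [
    {"id": "laptop-001", "vector": [0.9, 0.1],
     "metadata": {"category": "laptop", "os": "Windows 11"}},
    {"id": "laptop-002", "vector": [0.8, 0.3],
     "metadata": {"category": "laptop", "os": "Linux"}},
    {"id": "jacket-001", "vector": [0.95, 0.05],
     "metadata": {"category": "clothing"}},
]

matches = filtered_vector_search([1.0, 0.0], catalog, {"category": "laptop"}, n=2)
print(matches)
```

In this toy data the jacket image is visually close to the query, but the metadata filter removes it before ranking, which is the point of combining the two signals.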
>> That's so cool. I can see it with the similarity search, like I want something that looks like this, or maybe I'm on a food ordering app and taking pictures of my food finally makes sense, because I find a meal that I had and I want to order something like that. It could be interesting to see the different ways that you can use images that you have in order to find something similar. >> Exactly. I look at a ton of Pinterest boards and I'm like, okay, where do I find this piece of clothing? That doesn't exist today, but you could soon have that as an image input to one of these websites and get the product that you're looking for. >> That's so cool. >> Awesome, so we saw text vector search, now let's check out image-to-image vector search.
Here I'm pulling in a picture of lasagna, so let's see what it returns. >> Now we're taking an image instead of text, embedding that, and sending that as the query. >> Exactly, so this is the picture that I'm sending as the query for our vector search, and here are the results that I'm getting back, again with pretty high confidence. Again, the possibilities with image-to-image search are endless. We've seen, even within chat experiences or bot experiences, people send products looking for similar products if something is out of stock, or they have a very specific snapshot that they've taken and don't remember where it is from or what brand it is. They could just easily search with the image and find relevant links for those products.
>> What different languages does this support, I see we've been using English what other languages can we address? >> Yeah it's a great question again so right now we support English with the embeddings. But we do have a multilingual model that supports more than 100 different languages and that's coming out soon so stay tuned for that. >> That's not out yet but it's coming. >> Yes.
>> Awesome, and then the other question I have is, I see that we're using Python and we're using the Cognitive Search SDK. What other programming languages are supported for this? >> Yeah, so currently the Computer Vision APIs are REST endpoints that you could use. SDKs in Python and C# are coming, so they should be available in a few months, and on the Cognitive Search side you already have SDKs available in popular languages like Python and C#. >> Then you're using Azure data sources. Does my data need to be in Azure, or what different data sources are supported? >> Great question again, so it doesn't need to be in Azure. It definitely helps within Cognitive Search to have Azure data or Azure Blob Storage, but you can also bring in your own storage, whether it's local or storage that is in other services, and plug it into Azure Cognitive Search.
>> Great, so you have the flexibility to plug in your data source and have it indexed. >> Exactly. >> Another question I have is around indexing your own data. I know there is tooling for it, we talked about that on the last episode, everyone go check that out for getting those indexes set up. How frequently should someone be re-indexing their data? Is that based on what they're trying to do, or how often their data changes, or how do people decide when they need to create a new index that will overwrite the existing one? >> A great question again. If the dataset hasn't changed, if you're dealing with the same 5,000 images, you don't need to re-index ever, unless you want to switch over to a different model or you're trying something that is not available in the currently indexed model. But as you bring in more images or more text, you would have to update the index, so you wouldn't be re-indexing all of the images, just the new ones. There's a function called recrawl: you re-crawl all the new data that has been added to the dataset and index it to add to the existing index. >> That's something I could have on a schedule as well, so I don't necessarily have to trigger it manually? I can say I know my data is going to be changing every day, I want to re-index every night or something like that. >> Sure. Yeah, you could use the API endpoints on a specific timeline, every week, every day, or whatever the schedule is, or you could also set up a function that would do a crawl and, if there is any new data present, crawl that and update the index.
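The incremental update described here (vectorize and index only the images that are not yet in the index) can be sketched as follows. This is an illustration under stated assumptions: the index is a plain dictionary, the helper names are hypothetical, and `fake_vectorize` is a stand-in for a real embedding call. It is not the Cognitive Search recrawl feature itself.

```python
def find_new_images(dataset_ids, indexed_ids):
    """Return dataset IDs that are not in the index yet, in dataset order."""
    indexed = set(indexed_ids)
    return [image_id for image_id in dataset_ids if image_id not in indexed]


def update_index(index, dataset, vectorize):
    """Vectorize and add only the new images; return the IDs that were added."""
    new_ids = find_new_images(dataset.keys(), index.keys())
    for image_id in new_ids:
        index[image_id] = vectorize(dataset[image_id])
    return new_ids


def fake_vectorize(path):
    """Stand-in for a real embedding call; returns a dummy vector."""
    return [float(len(path)), 0.0]


# The index already holds one image; the dataset has since gained another.
index = {"img1": [1.0, 0.0]}
dataset = {"img1": "photos/tacos.jpg", "img2": "photos/ramen.jpg"}

added = update_index(index, dataset, fake_vectorize)
print(added)          # only the new image is vectorized
added_again = update_index(index, dataset, fake_vectorize)
print(added_again)    # nothing left to add on the second run
```

Running `update_index` from a scheduled job gives the nightly behavior discussed here: each run costs only as many embedding calls as there are new images.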
>> Fantastic, so it sounds like there is lots of functionality already for being able to get the data in and being able to leverage these amazing powers. We saw some really cool demos, we learned a lot about how we can start leveraging images in Cognitive Search. Where can people go to learn more? >> The easiest ways to get started are the blog announcement post that we just saw, as well as Vision Studio, where you can try out these experiences and create an easy [inaudible], or you could go into the GitHub repo that we have and use that code to set up the image retrieval system for your search index. >> Great, so I'm showing those links below. They are also in the description for this video, so you can go and check those out and get started with the image search capabilities in Azure Cognitive Search. Thank you so much for hanging out with us today.