Collaborative accessibility applications of sound-sensing technology on mobile phones and smart IoT devices

Good afternoon, everyone. It's a pleasure to share some of our accessibility work. Mr. Xia just spoke about visual accessibility,

and I will focus mainly on the auditory side. Let me introduce our team. We're from Xiaomi's AI Lab, which comprises many sub-departments, including Image, Acoustics, NLP, and others.

We're part of the Acoustics department, which covers a wide range of subfields such as sound understanding and generation. This photo commemorates our recent recognition as a "Workers Pioneer" team in Beijing. I'd like to begin by sharing our experience at a Hackathon in 2020.

We were fortunate to win first place and a prize of 100,000 RMB, which we donated primarily to people with hearing impairments. How do I play this video? The top left corner... OK, I can click on the video. I see, it works.

Thank you. At Xiaomi, we believe engineers are our most valuable asset, and we respect all innovative ideas. In 2018, the Xiaoai team at Xiaomi opened some annotator positions to people with disabilities. Some of them, who have cerebral palsy, face challenges in everyday communication. Hello, my name is Zhang Dakui.

I approached the voice team at Xiaomi. Based on my condition, they did some exploratory accessibility work, and the results were quite impressive. I've always worked on speech recognition technology and was confident in its potential to understand their speech, so I thought we could give it a shot. We had previously developed a technology intended for dialect recognition, which turned out to be a solution for this situation, an application we had not considered before.

This project uses AI voice technology to bring convenience to individuals with special needs. It resonates with our mission to let everyone enjoy the benefits of technological advancements. Initially, when we received the data...

Could we resume the video? Right. Let's resume from where we left off. ... we realized we could not understand the speech ourselves, let alone develop an algorithm for its recognition. Surprisingly, the algorithm worked exceptionally well.

When we created this application package, Dakui tested it himself and praised its effectiveness. This brought us great satisfaction. After some debate over the project's name, we all agreed on "Listen", reflecting our commitment to understanding their needs and addressing their issues. The engineers at Xiaomi are young, socially responsible, and idealistic. In most people's lives, this community is often overlooked. We urge more engineers to leverage the power of technology to address the issues this community faces.

Hosting Hackathons is one way to encourage engineers to innovate boldly and bring out more talent. Xiaomi highly values talent, and many of our innovative products over the past decade are the result of ideas from our front-line engineers. Yes. That was our experience at the Hackathon. As I mentioned earlier, we're not an accessibility team, but rather a generic AI team working on sound.

We hadn't realized that our technology could be used for accessibility until Ms. Zhu Xi suggested that we expand it in that direction. Even simple advancements can make a significant difference to people with special needs. Indeed, at least for us, the technology itself was fairly straightforward. Although we published a paper in 2018, in reality we only adapted the model to Dakui's voice, that is, fine-tuned it. Of course, some strategy-level techniques were involved, but those are just details.

This technology isn't difficult, but even the small amount of work we did proved highly beneficial to people with special needs, which inspired us greatly. Dakui is particularly grateful for this application and now uses it regularly. The initial recognition rate was poor, because his cerebral palsy makes his speech hard to understand, but a simple adaptation helped. Dakui, a PhD holder, once gave a 5-minute speech. We used it for speaker adaptation and improved the recognition rate to about 95%.
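To make "adaptation" concrete, here is a minimal sketch of the general idea: take a pretrained speech recognition model and fine-tune it on a few minutes of one speaker's recordings with a small learning rate. This is only an illustration of the technique described above, not our actual training code; the model interface, dataset, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch of speaker adaptation: fine-tune a pretrained ASR model on a
# few minutes of one speaker's audio. `model` and `adaptation_set` are
# hypothetical placeholders, not the actual production code.
import torch
from torch.utils.data import DataLoader

def adapt_asr(model, adaptation_set, epochs=5, lr=1e-5, device="cpu"):
    """Fine-tune `model` on a small speaker-specific dataset.

    `adaptation_set` yields (features, target_token_ids) pairs and
    `model(features)` is assumed to return per-frame log-probabilities
    suitable for a CTC loss.
    """
    model.to(device).train()
    # A small learning rate keeps the adapted model close to the general one,
    # so a few minutes of data is enough without catastrophic forgetting.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    loader = DataLoader(adaptation_set, batch_size=4, shuffle=True, collate_fn=list)

    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = 0.0
            for feats, targets in batch:
                log_probs = model(feats.unsqueeze(0).to(device))   # (1, T, vocab)
                input_len = torch.tensor([log_probs.shape[1]])
                target_len = torch.tensor([len(targets)])
                loss = loss + ctc_loss(log_probs.transpose(0, 1),  # (T, 1, vocab)
                                       targets.unsqueeze(0).to(device),
                                       input_len, target_len)
            (loss / len(batch)).backward()
            optimizer.step()
    return model
```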

That was the work we did in 2020, focusing on voice. But sound isn't just about speech; speech is only a small part of our sound world.

There are various sounds in the world. Many of these sounds are ordinary to us but critical to individuals with disabilities. For example, some tragic incidents have been reported due to the inability to hear certain sounds. Right. Here's a tricky concept. People often associate disabilities with physical handicaps, but it's not necessarily so.

In a certain sense, everyone could be considered disabled at times. For example, we can't hear other sounds while frying food, wearing headphones, or driving with our hands and attention fully occupied. In that sense, we are all occasionally people with disabilities, so accessibility work can also address many needs of ordinary people. We weren't the first to develop this technology.

It was Apple. However, we were probably the first in the Android community. As far as I know, Vivo also developed a similar technology, but it can only recognize one type of sound. According to our internal testing, Apple's accuracy and sound coverage are not as good as ours. So what is this feature? Quite simple.

It recognizes the sounds around you and, when it detects one of a set of preset sounds, sends a notification to your phone. This converts auditory information into visual information, offering assistance in scenarios involving visual or auditory impairments. This feature was introduced as part of the accessibility functions in Xiaomi's MIUI two years ago. If you're a Xiaomi smartphone user, you can find it in the accessibility settings. Apple has it too.

And it's not just on smartphones. This functionality is present on devices like speakers and even cameras that can detect sounds like a baby crying at home. We have some interesting smaller features as well, such as snore detection. You can set it before sleeping, and it will automatically record any snoring. The next day, you can listen to what you might have said in your sleep or how you snored.

These are health-related features. Our sound recognition capability is broad; the sounds we currently detect, and the applications we ship today, are only a small fraction of what the technology can identify. Another application of this technology, pioneered by YouTube and Google, is AI captions. Traditional AI captioning predominantly transcribes spoken words, but context and ambient sound are also important. For instance, an individual with a hearing impairment might miss applause or other sounds, affecting their experience.

Providing this information in the subtitle can improve users' experience. This feature, currently provided by TED, YouTube, and Google, is something we've also developed. At present, our Xiaoai team is integrating this capability into our AI captioning system, which is nearing readiness for a public launch.

In comparison to Google's offering, our recognition accuracy is pretty impressive. At least, that's the case in this audio. However, our testing was not rigorous, as the number of examples was not extensive.

A distinct advantage of Xiaomi is our expansive hardware portfolio. With a multitude of interconnected IoT devices, we have centralized the various devices around the smartphone. The information received by these devices is processed and then sent to the appropriate endpoint using the right strategy. Without this coordinated approach, users could be overwhelmed by alerts from different devices such as speakers and cameras. By implementing strategic processing in the middle, which can, for example, identify the most likely room for an incident to have occurred, we can make full use of Xiaomi's IoT capabilities.
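As an illustration of what such a coordination strategy might look like, here is a minimal sketch of a central router that deduplicates near-simultaneous detections of the same sound from several devices and forwards a single alert, picking the most confident device as the likely room. All class names, fields, and thresholds are hypothetical, not Xiaomi's actual implementation.

```python
# Minimal sketch of a central alert-routing strategy for multi-device sound
# detection. All names and values are hypothetical, for illustration only.
from __future__ import annotations
from dataclasses import dataclass
import time

@dataclass
class Detection:
    device_id: str     # e.g. "camera_livingroom"
    room: str          # room the device is placed in
    sound: str         # e.g. "baby_cry", "door_knock"
    confidence: float  # model score in [0, 1]
    timestamp: float   # seconds since epoch

class AlertRouter:
    """Deduplicate near-simultaneous detections of the same sound from
    multiple devices and forward a single alert to the user's phone."""

    def __init__(self, window_s: float = 5.0):
        self.window_s = window_s
        self._recent: dict[str, float] = {}  # sound -> last alert time

    def handle(self, detections: list[Detection]) -> Detection | None:
        if not detections:
            return None
        # Pick the most confident report as the "most likely room".
        best = max(detections, key=lambda d: d.confidence)
        last = self._recent.get(best.sound, 0.0)
        # Suppress duplicates: several devices hearing the same event within
        # a short window should produce only one notification.
        if best.timestamp - last < self.window_s:
            return None
        self._recent[best.sound] = best.timestamp
        return best  # caller pushes this single alert to the phone

# Example: two devices hear the same knock; only one alert is sent.
router = AlertRouter()
now = time.time()
alert = router.handle([
    Detection("speaker_kitchen", "kitchen", "door_knock", 0.62, now),
    Detection("camera_hallway", "hallway", "door_knock", 0.91, now),
])
print(alert.room if alert else "suppressed")   # -> "hallway"
```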

This is a feature that not even Apple has achieved. Our work on this technology recently won a company-wide technology award, a highly selective one. Could the staff play the video? As I spend every day with my child, I often think of my child's voice. Perhaps the voice messages my boyfriend sends me every day. Ding, ding, ding, that sound. When my child spoke for the first time, saying "Mom," I burst into tears.

Holding my child, I said, "You can say Mom!" Because of work, we often live in different places. Whenever I am upset, I send him voice messages, and he guides and comforts me. When my granddaughter comes home from school on weekends, she knocks on the door, "Grandpa, Grandpa! Open the door!" If I couldn't hear it anymore, I would be very sad. I hope such a thing never happens.

If I couldn't hear my child knocking on the door, it would be very sad. I hope that day comes later rather than sooner. It's a pity to lose the ability to hear, but technology can help as much as possible. Sound is first turned into physical signals by the human auditory system, and the brain extracts the information carried in those signals to understand what other people mean.

This process is essentially the brain encoding and compressing information. But if the human auditory system is damaged, this chain is broken. Fortunately, current AI technology can do something similar. A computer captures the sound and extracts features, then a deep learning model performs a similar encoding process. This completes the information extraction and can help hearing-impaired people understand their environment.

Mi Ditto is a typical accessibility application we developed. It mainly transcribes spoken words into text so that users can see what the other person is saying. When watching a show without subtitles, it can convert the dialogues into captions. We also have an application called Xiao AI Call. It mainly addresses accessibility issues during a call, and can be used in meetings.

For instance, if I can't answer the phone during a meeting, it can assist me. The sound information in our lives isn't limited to speech. It includes various environmental sounds like knocking on the door, home alarms, etc. Our ambient sound recognition technology can currently recognize up to hundreds of sounds. We have already implemented 14 sounds that users care about.

When our devices, like smartphones, speakers, or Mi Home Security Cameras, detect these sounds, they alert the user via text. The technology underlying this feature essentially mimics the process of human learning to train an AI model, a process we split into three stages. In the first stage, we exploit the vast amount of unlabeled sound data available in real life to train a base model with a large number of parameters in a self-supervised manner. Although this model lacks recognition and classification capabilities, it can discern the different features of various sounds.

In the second stage, we utilize some weakly-labeled sound data to train a teacher model, endowing this model with extensive recognition and classification capabilities. The final stage involves using strongly-labeled data to train a lightweight and deployable student model. This student model then undergoes continuous training under the guidance of the teacher model, enhancing its recognition and classification capabilities. With this methodology, the student model can categorize the type of any sound input it receives. The ultimate goal of developing these technologies is to empower all our users, including those with hearing impairments, visual impairments, or speech difficulties, to enjoy the convenience these technologies provide.

This inclusivity aligns with the core mission of our Acoustics department: striving to ensure that all users have equal access to human-computer and human-to-human interaction. That was a promotional video made by Xiaomi a while ago. In the second half of the video, Wang Yongqing, a member of our team, briefly explained the general principles and process behind this technology. I will expand on it a little here. This is the model training process.

We recently published an article, the graphics from which are presented here, outlining our model training process. The process comprises three steps, as Yongqing detailed in the video. The first involves self-supervised training with MAE.

This requires no labels, which I will expand upon later. Subsequently, we train the teacher model, using longer-duration data as inputs, such as 10-second sound bites. The teacher model, while powerful, is heavy in terms of parameters.

We then train the lighter student model using short-duration inputs, say two seconds. In training the student model, we have recently proposed a streaming solution, leveraging Transformer-XL to manage the state. I will delve into this shortly. This is the general process.
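As a rough illustration of the teacher-student stage, here is a minimal sketch in which a lightweight student is trained on labeled clips while also matching the frozen teacher's soft predictions, i.e. standard knowledge distillation. The model definitions, loss weights, and temperature are placeholders, not our actual recipe.

```python
# Minimal sketch of the teacher-student stage: the lightweight student learns
# from ground-truth labels while also matching the frozen teacher's soft
# predictions (knowledge distillation). Models and hyperparameters are
# placeholders for illustration.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, spec, labels, optimizer,
                      temperature=2.0, alpha=0.5):
    """One training step on a batch of spectrograms `spec` (B, 1, F, T)
    with multi-label targets `labels` (B, num_classes) in {0, 1}."""
    teacher.eval()
    with torch.no_grad():
        # The teacher produces soft targets for the same clips.
        t_logits = teacher(spec)

    s_logits = student(spec)

    # Hard-label loss: sound tagging is multi-label, so BCE per class.
    hard = F.binary_cross_entropy_with_logits(s_logits, labels.float())

    # Soft-label loss: match the teacher's probabilities, softened by a temperature.
    soft = F.binary_cross_entropy_with_logits(
        s_logits / temperature, torch.sigmoid(t_logits / temperature))

    loss = alpha * hard + (1.0 - alpha) * soft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```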

The model architecture is centered around ViT. There is a small animation in the upper right corner. It involves breaking an image down into smaller patches, treating these patches as a sequence, and adding positional embeddings before feeding them into the transformer model.

Finally, linear layers are applied to yield the classification output. What's the difference between sound and image? In this context, there is none. A spectrogram can be treated as an image: it is likewise segmented into smaller patches and fed into the model, an approach that is considered mainstream today. That's the model architecture.
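For concreteness, here is a minimal sketch of this ViT-style setup: a log-mel spectrogram is cut into patches, projected into tokens, given positional embeddings, run through a transformer encoder, and pooled into class logits. The layer sizes are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of a ViT-style audio tagger: a log-mel spectrogram is cut
# into patches, projected to tokens, given positional embeddings, passed
# through a transformer encoder, and pooled into class scores.
import torch
import torch.nn as nn

class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, n_frames=256, patch=16,
                 dim=192, depth=4, heads=4, num_classes=14):
        super().__init__()
        self.num_patches = (n_mels // patch) * (n_frames // patch)
        # Patchify + linear projection in one conv (stride = patch size).
        self.to_tokens = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spec):                  # spec: (B, 1, n_mels, n_frames)
        x = self.to_tokens(spec)              # (B, dim, H', W')
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        x = x + self.pos_emb                  # add positional embeddings
        x = self.encoder(x)                   # transformer over the patch sequence
        return self.head(x.mean(dim=1))       # mean-pool tokens -> class logits

model = SpectrogramViT()
logits = model(torch.randn(2, 1, 128, 256))   # two 128-mel, 256-frame clips
print(logits.shape)                           # torch.Size([2, 14])
```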

The self-supervised training with MAE should be familiar to everyone. Following the masked autoencoder approach proposed earlier by Kaiming He's group, we mask part of the input and then try to reconstruct it, which allows training without labels.
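A minimal sketch of that idea applied to spectrogram patches: hide a large fraction of the patches, encode only the visible ones, and train a decoder to reconstruct the hidden patches with an MSE loss. This is a greatly simplified illustration, not the architecture from our paper.

```python
# Minimal sketch of MAE-style pretraining on spectrogram patches: randomly
# hide most patches, encode the visible ones, and reconstruct the hidden
# ones with an MSE loss. Greatly simplified for illustration.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D). Keep a random subset of patches per example."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                       # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]    # indices of visible patches
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx

def mae_loss(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened spectrogram patches (the targets).
    `encoder` maps (B, n, D) -> (B, n, D) over visible patches only;
    `decoder` maps (B, N, D) -> (B, N, D), seeing mask tokens at hidden
    positions, and must predict the original patch values there."""
    B, N, D = patches.shape
    visible, keep_idx = random_masking(patches, mask_ratio)
    encoded = encoder(visible)                     # encode only visible patches

    # Scatter encoded tokens back; hidden positions get a shared mask token.
    mask_token = torch.zeros(1, 1, D)
    full = mask_token.expand(B, N, D).clone()
    full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), encoded)

    recon = decoder(full)                          # reconstruct all patches
    # Compute the loss only on the patches hidden from the encoder.
    hidden_mask = torch.ones(B, N, dtype=torch.bool)
    hidden_mask[torch.arange(B).unsqueeze(1), keep_idx] = False
    return ((recon - patches) ** 2)[hidden_mask].mean()
```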

We have amassed a colossal amount of data, around 31 years' worth of audio. The teacher model then guides the student model in a process akin to automatic annotation; I won't delve into the details. It's important to note that, at comparable performance, our models offer a significant advantage in parameter count and, most notably, memory usage compared with the state of the art.

Occasionally, our inference speed is about a hundred times faster. As for streaming training, we apply the Transformer-XL methodology: instead of processing each small block completely independently, as in the traditional approach, we retain some hidden states from previous blocks as memory for the blocks that follow. For more details, please refer to the papers in the bottom right corner.
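Here is a minimal sketch of that streaming idea: audio features arrive in short chunks, and a cache of hidden states from earlier chunks is reused as extra attention context for the next chunk, so the lightweight model can run continuously without reprocessing old audio. It is reduced to a single attention layer purely for illustration.

```python
# Minimal sketch of Transformer-XL-style streaming: audio is processed in
# short chunks, and a cache of hidden states from previous chunks is reused
# as extra (non-updated) context for the next chunk. Single layer, simplified.
import torch
import torch.nn as nn

class StreamingBlock(nn.Module):
    def __init__(self, dim=192, heads=4, mem_len=50):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mem_len = mem_len

    def forward(self, chunk, memory=None):
        """chunk: (B, T, dim) frames of the current chunk.
        memory: (B, M, dim) cached states from earlier chunks (no gradient)."""
        context = chunk if memory is None else torch.cat([memory, chunk], dim=1)
        # Queries come only from the new chunk; keys/values also cover memory,
        # so the model sees past context without reprocessing old audio.
        out, _ = self.attn(chunk, context, context)
        out = self.norm(chunk + out)
        # Keep only the most recent states as memory for the next chunk.
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory

# Streaming loop over short chunks of features.
block = StreamingBlock()
memory = None
for chunk in torch.randn(5, 1, 20, 192):        # five chunks of 20 frames each
    out, memory = block(chunk, memory)
print(out.shape, memory.shape)                  # (1, 20, 192), (1, 50, 192)
```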

Next is custom sound recognition, which Apple pioneered relatively early. Here is their official document, which says the feature allows users to record custom sounds. We built this as well. I have a demo video; could the staff help me play it? I've registered a door-knocking sound. Any sound can be registered, even an alarm bell. I then spoke a random sentence as a contrasting sample.

The door-knocking sound is recognized. Please note this is just a demo; the functionality will soon be integrated into the latest version of MIUI. The system isn't complicated. It essentially employs a few-shot learning method, where each training iteration samples a small group of closely related classes for comparison. I won't go into the details.
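At inference time, the idea can be as simple as the following sketch: embed the few registered recordings with a pretrained sound encoder, average them into a prototype, and trigger when a new clip's embedding is close enough to that prototype. The encoder interface and threshold here are assumptions for illustration, not our deployed system.

```python
# Minimal sketch of few-shot custom sound recognition: embed the registered
# examples, average them into a prototype, and trigger when a new clip's
# embedding is close enough. `embed` stands in for a pretrained sound encoder.
import torch
import torch.nn.functional as F

class CustomSoundMatcher:
    def __init__(self, embed, threshold=0.75):
        self.embed = embed              # module: (1, 1, F, T) -> (1, D)
        self.threshold = threshold      # illustrative similarity threshold
        self.prototypes = {}            # name -> normalized prototype (D,)

    @torch.no_grad()
    def register(self, name, example_specs):
        """Register a custom sound from a handful of recorded examples."""
        embs = torch.cat([F.normalize(self.embed(s), dim=-1) for s in example_specs])
        self.prototypes[name] = F.normalize(embs.mean(dim=0), dim=-1)

    @torch.no_grad()
    def match(self, spec):
        """Return the best-matching registered sound, or None if nothing is close."""
        if not self.prototypes:
            return None
        query = F.normalize(self.embed(spec), dim=-1).squeeze(0)
        name, score = max(((n, float(p @ query)) for n, p in self.prototypes.items()),
                          key=lambda kv: kv[1])
        return name if score >= self.threshold else None
```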

Allow me to share some of our most recent projects. Although they may not be strictly accessibility-related, you can still check out our demos. Both the upper-left one and this one can be played.

The one above. Not this one. I'll start talking. Now I will simulate a scenario. Suppose I can't communicate with Hanlin due to language barriers, so we might need a translation device.

However, I also hope that the recorded voice can be preserved. We can use the translation device to communicate, but if the surroundings are noisy, the recorded voice might not be clear. If we can record just the two of us and block out the ambient noise, that would be quite cool. That's the end of it.

Now, let's play the one below. Just play... Suppose I can't communicate with Hanlin due to language barriers, so we might need a translation device. However, I also hope that the recorded voice can be preserved.

We can use the translation device to communicate, but if the surroundings are noisy, the recorded voice might not be clear. Here, he's talking to him, but we can't hear. ... block out the ambient noise, that would be quite cool.

The challenge of this technology lies in the close proximity of the two people: one person's voice must be filtered out while the other's is retained. The system first captures some voiceprint information, which is then fed as an additional input to the speech enhancement model, as if the speaker's own voiceprint served as an anchor for the enhancement.
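A minimal sketch of this kind of voiceprint-conditioned (target-speaker) enhancement: a speaker embedding from a short enrollment clip is attached to every frame of the noisy spectrogram, and the network predicts a mask that keeps only that speaker. The architecture and sizes below are illustrative assumptions, not our actual model.

```python
# Minimal sketch of voiceprint-conditioned enhancement: a speaker embedding
# from a short enrollment clip is concatenated to every frame of the noisy
# spectrogram, and the network predicts a mask that keeps only that speaker.
# Architecture and sizes are illustrative, not the production model.
import torch
import torch.nn as nn

class TargetSpeakerEnhancer(nn.Module):
    def __init__(self, n_freq=257, spk_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + spk_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, spk_emb):
        """noisy_mag: (B, T, n_freq) magnitude spectrogram of the mixture.
        spk_emb: (B, spk_dim) voiceprint from the target speaker's enrollment."""
        T = noisy_mag.shape[1]
        cond = spk_emb.unsqueeze(1).expand(-1, T, -1)   # repeat per frame
        x = torch.cat([noisy_mag, cond], dim=-1)        # condition every frame
        h, _ = self.rnn(x)
        m = self.mask(h)                                # mask values in [0, 1]
        return m * noisy_mag                            # keep the target speaker only

# Usage: the voiceprint would come from a separate speaker encoder run on an
# enrollment clip; random tensors here just demonstrate the shapes.
model = TargetSpeakerEnhancer()
noisy = torch.rand(1, 100, 257)       # 100 frames of a noisy mixture
voiceprint = torch.rand(1, 128)       # embedding of the target speaker
clean_est = model(noisy, voiceprint)
print(clean_est.shape)                # torch.Size([1, 100, 257])
```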

This is our recent work, which we published at Interspeech earlier this year. Here are some of the papers we have published. We also participated in some competitions, such as those held at ICASSP. Let me close with a few simple thoughts.

Firstly, accessibility might seem like using technology to aid people with impairments, but it is really they who drive us to advance our technology. Because accessibility caters to more demanding scenarios, it not only addresses the needs of those facing specific challenges but also offers a platform for deploying pioneering technologies. That's all. Thank you all for your attention.
