Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!
Today I have something different for you: let's create a speech synthesizer with a Finnish accent. Earlier I made three videos providing background information for this video. If you haven't watched them yet, I suggest you do so now. I have compiled them into a nice playlist, so you can watch them all in one sitting. Please do so now and then come back here. Click the card that opens the playlist. I will be waiting here.
Unlike in the PCM video, replaying pre-recorded speech samples is not what I had in mind for this video. To recap, here is the list of phonemes that will be the building blocks of speech for the synthesizer. The bottom row lists phonemes that most Finns do not pronounce correctly, but are instead aliased into other phonemes. I think, for authenticity, we can do the same in our speech synthesizer. This will help keep the design simple. This is the resulting roster of phonemes. There are 22 phonemes in total.
Now, there are many ways to go forward from here. One of the popular approaches that leads to high-quality speech synthesizers is to create a list of all pairs of phonemes that can occur in normal speech: for example, all consonants followed by all vowels, but also all vowels followed by all consonants, and of course all pairs of vowels that can be reasonably pronounced, and all pairs of consonants too. They would get a professional voice artist to record all these hundreds of samples at constant pitch and constant stress. For some speech synthesizers, even triplets might be recorded. It's likely they would construct an artificial long piece of text that contains all these phoneme pairs, and the voice artist would be instructed to read it as monotonously as they possibly can. Then someone would use an audio editing program and meticulously cut pieces from the recording to populate this table. The speech synthesizer would mix and select these samples at runtime.
For example, this word [Finnish word] would be constructed from 15 voice samples, some of which are identical, and the synthesizer would seamlessly blend the end of one sample into the beginning of the next. For my demo speech synthesizer, I will not do anything that complex. I'm going to operate on single phonemes only. Now, I could just use recordings of myself speaking all these different phonemes, and it would not take very much time to do that at all. Instead, I decided to approach the problem in an old-fashioned way.
So I made this chart. It shows how each of these phonemes might be constructed. The first 14 phonemes have something in common: the vocal cords are vibrating throughout the phoneme. For example, I can say [demonstration] in a single unbroken voice, using all of these 14 phonemes.
Let's speak about vowels first. Humans can speak because we are able to change how our voice resonates within our mouth. Bear with me for a moment. I'm going to create an incredibly stupid-sounding recording. [sustained vowel recording] Now let's clean up that audio and save it on disk. Next, let's open the audio in Praat. Praat is an open source program for studying phonetics. While I was doing that recording, you heard several different vowels. My voice stayed at a constant pitch, but different harmonic overtones were created by varying the shape of the airways within my mouth. In this analysis window you can actually see what happened. At the bottom there is a blue line indicating my voice pitch. It is relatively horizontal, which means there was not much variation in it. However, these red lines represent the harmonic overtones of my voice, and they are all over the place, changing smoothly between low and high values. In speech, these harmonic overtones are called formants. In Wikipedia,
there is a relatively brief article about formants, including this table of typical values for the formants of different vowels. Each of these 14 sounds has formants that make up the sound. Formants are produced by different parts of the vocal tract, including the larynx and the pharynx. For a speech synthesizer, the exact mechanism is not as important as the result. Additionally, with a couple of these sounds, there is some level of frication present. It is a little whooshing component. The whooshing sounds a bit different in each consonant: it may be higher pitched or lower pitched, and it may be short or long. In other words, there is a sound source and a tube that adds resonance and noises to the sound. This is called a source-filter model, and an audio compression method called linear predictive coding is centered around this same model. LPC
starts with the assumption that the speech signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation of the reality of speech production. Do you have a cell phone? LPC happens to be the basis of GSM voice compression. If you have a cell phone, it contains an implementation of LPC. So I am going to use LPC for this synthesizer as well.
In this table, I have identified the component sounds that I need to synthesize. For the first fourteen phonemes, we have a voice that is modulated in different ways, plus some optional frication at the same time. The rest of the consonants are similar, except there is no voice simultaneously. I have split each phoneme into three parts: a beginning, a middle, and an end. Each phoneme may have a short sound of some kind in the beginning and in the end. For example, at the end of M there is a subtle sound from the lips. The middle is the part of the phoneme that is stretched as long as it needs to be to produce a short or long sound. So the total budget of sounds that I need is 17 sustained tones, 7 release sounds, one glottal noise, and silence: in total, 26 sound samples.
To generate these samples, I recorded myself saying this sequence as monotonously as I could. [recording plays] This recording was imported into Praat. Then I edited the sound to make it completely monotonous. In hindsight, this step was completely redundant, but it was nice to learn that this research tool could also serve as an auto-tune program for bad singers. This is how the result sounds. [recording plays] This was then downsampled to 44 kilohertz, removing some mostly irrelevant detail. Then I used Praat to convert this recording into 48th-order LPC. The resulting file looks like this. It's a text file that contains some numbers.
The audio was divided into frames, and for each frame a set of coefficients and a gain is listed. Next, I wrote a C++ program to play this file. The program reads all lines in the file and identifies their content. It saves important parameters, like the sampling period, which is the inverse of the sampling rate, into variables. The coefficients are saved in an array. When it encounters a gain line, it synthesizes the frame. It starts by generating an arbitrary buzz. Anything goes, as long as it has a clear frequency and as long as it's not a pure sine wave. Next, the LPC filter is applied. The filter shapes the frequency characteristics of the buzz that is fed to it, much like an FIR filter. Basically, it's a vocoder. The resulting sample is saved into a buffer. Once the file is done with, the buffer is saved into a WAV file, and this is how it sounds. [recording plays]
48 was my choice for the order of the LPC data. I made a comparison of different LPC orders. Here is a short voice sample I took from one of Dr. David Wood's videos: "How to stop prison radicalization." And here's how it sounds at different orders. [the sample plays at several LPC orders] I think that 48 was the sweet spot where artifacts were minimal, and increasing the number of coefficients beyond 48 did not improve the audio enough to justify the increase in data.
Now, it is important to note that the LPC file is not a recording. It is a synthesis instruction. For example, I can modify the buzz formula and replace it with white noise. [sample plays] This changes the voice into a whisper. Or I can change the tempo: make it four times slower, [sample plays] or make it twice as fast. [sample plays] Or I can change the pitch: make it higher. [sample plays]
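The playback loop described above (a buzz source, an LPC filter, a rolling history of past outputs) might look roughly like the following sketch. It is not the code from the video: the names `LpcFrame`, `Voice`, `LpcSynth`, `buzz` and `breath` are made up for illustration, a plain sawtooth stands in for the buzz formula, and scaling by the square root of the gain assumes the gain is stored as an energy. Because each output is fed back into later outputs, the filter in this sketch is recursive (all-pole), which is the usual form of LPC synthesis.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One LPC frame: predictor coefficients plus a gain (hypothetical layout).
struct LpcFrame {
    std::vector<double> coeff;
    double gain;
};

// Excitation settings. All names and defaults are assumptions of this sketch.
struct Voice {
    double pitchHz    = 100.0;    // fundamental frequency of the buzz
    double buzz       = 1.0;      // amount of pitched sawtooth
    double breath     = 0.1;      // amount of white noise (aspiration)
    double sampleRate = 44100.0;
};

class LpcSynth {
public:
    explicit LpcSynth(std::size_t order) : history_(order, 0.0) {
        assert(order > 0);
    }

    // Appends `count` samples synthesized from `frame` to `out`.
    void render(const LpcFrame& frame, const Voice& v,
                std::size_t count, std::vector<double>& out) {
        const std::size_t P = history_.size();
        assert(frame.coeff.size() == P);
        std::uniform_real_distribution<double> noise(-0.5, 0.5);
        for (std::size_t n = 0; n < count; ++n) {
            // Sawtooth buzz: has a clear pitch and many harmonics,
            // unlike a pure sine wave.
            phase_ += v.pitchHz / v.sampleRate;
            phase_ -= std::floor(phase_);
            double excitation = v.buzz * (phase_ - 0.5)
                              + v.breath * noise(rng_);

            // Subtract the weighted past outputs. history_[pos_] is the
            // most recent output; the modulo wraps the rolling buffer.
            double y = excitation;
            for (std::size_t j = 0; j < P; ++j)
                y -= frame.coeff[j] * history_[(pos_ + P - j) % P];

            pos_ = (pos_ + 1) % P;
            history_[pos_] = y;
            out.push_back(y * std::sqrt(frame.gain));
        }
    }

private:
    std::vector<double> history_;  // last P filter outputs
    std::size_t pos_ = 0;          // index of the most recent output
    double phase_ = 0.0;           // sawtooth phase in [0, 1)
    std::mt19937 rng_{12345};      // fixed seed keeps runs repeatable
};
```

With a structure like this, setting `buzz` to 0 and `breath` to 1 corresponds to the whisper effect, changing `pitchHz` changes the pitch, and rendering more or fewer samples per frame changes the tempo, all without touching the frame data.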
Or make it lower. [sample plays] My buzz formula deliberately contains a small amount of aspiration. If I remove the aspiration and leave just the buzz, [sample plays] the sound becomes a bit cleaner and also more synthetic sounding.
These samples were recorded at 44 kilohertz. If I used a much smaller sample rate, such as 8 kilohertz, a much smaller number of coefficients would be enough. Here is a 16-coefficient LPC made from a 44 kilohertz recording. [sample plays] And here is a 16-coefficient LPC made from an 8 kilohertz recording. [sample plays] The latter was a bit more muffled, like a telephone line, but had far fewer chirping artifacts in it. Lowering the sample rate allows you to get more bang for your buck in terms of data transmission, and that's why telephone lines and cell phones use a low sample rate. But there are plenty of low sample rate speech synthesizers out there, and I want to use a good sample rate, so I'm going with 44
kilohertz and 48th-order LPC. So the LPC file is divided into frames, each frame representing the characteristics of the audio for a small slice of time. Next, I spent a day writing this tool, which is a modification of the WAV-writing program from earlier. This program allows you to adjust parameters such as breathiness and buzziness in real time, and to choose any frame from the recording to play. I used this to pick the frames that in my opinion best represented the phonemes that I wanted to include in my speech synthesizer. Next, I wrote a tool that copy-pastes the frames that I picked, and it produced this file. It is C++ source code, which brings me to the next part: C++ source code.
We begin with the data structure that was just generated. This saves each of the recordings as a structure. I decided to make it so that each recording can have multiple frames, rather than just one, for better quality. My process of text-to-speech begins by reading the text input and converting it to a list of phonemes, or rather, prosody elements.
First, we start by normalizing the text, removing as much unnecessary detail as possible, such as converting all of it into lowercase. I also went ahead and converted it into 32-bit Unicode, because dealing with text character by character is quite difficult in UTF-8, where a single character can span multiple bytes. I mean, it's still not perfect even in 32-bit Unicode, because of combining diacritics and stuff, but you get what I mean. It helps with this application. Punctuation must also be taken care of. I decided to add special symbols, such as angle brackets, that will later be used to control the pitch of the voice. I will just leave the pitch handling blank for now and get back to it later. Now that the text has been canonized and the work-in-progress string should only contain pronounceable
letters and pause markers, it's converted into indexes into the sound records list. This code is a bit complicated for what it actually does. It basically just assigns a timing value to each phoneme, depending on whether it was repeated or not. If you are interested in exploring it in detail, you can download the source code, which can be found through links in the video description, and explore it offline. Now that we have the list of records that we should use to play the speech, let's go through them.
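To make the normalization and timing steps a little more concrete, here is a minimal sketch, not the code from the video. It assumes the text has already been decoded into 32-bit Unicode, treats every non-letter as a pause instead of producing pitch markers, and uses a made-up `Segment` type where a doubled letter simply gets a length of 2. The names `Segment`, `toLower` and `toSegments` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One step of the utterance: a symbol and how long to hold it.
struct Segment {
    char32_t symbol;  // lowercase letter, or U' ' for a pause
    unsigned length;  // assumed units: 1 = short, 2 = long
};

// Lowercases ASCII letters and the Finnish letters Ä, Å and Ö.
// Everything else passes through unchanged.
char32_t toLower(char32_t c) {
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    if (c == U'\u00C4' || c == U'\u00C5' || c == U'\u00D6')
        return static_cast<char32_t>(c + 0x20);
    return c;
}

// Converts text into segments: repeated letters become one long segment,
// and runs of spaces or punctuation collapse into a single pause.
std::vector<Segment> toSegments(const std::u32string& text) {
    std::vector<Segment> out;
    for (char32_t raw : text) {
        char32_t c = toLower(raw);
        bool letter = (c >= U'a' && c <= U'z')
                   || c == U'\u00E4' || c == U'\u00E5' || c == U'\u00F6';
        if (!letter) c = U' ';  // simplification: any non-letter is a pause

        if (!out.empty() && out.back().symbol == c) {
            // Same symbol again: lengthen a letter, ignore a repeated pause.
            if (c != U' ') out.back().length = 2;
        } else {
            out.push_back({c, 1});
        }
    }
    assert(out.size() <= text.size());  // never more segments than characters
    return out;
}
```

For example, `toSegments(U"Tuuli, ja")` yields seven segments, with the doubled `u` stored once with length 2 and the comma plus space collapsed into one pause. A fuller version would map each symbol to an index in the sound records list and keep the punctuation as pitch markers.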
Earlier I mentioned that in my design each record may actually contain more than one frame. I decided upon three different styles for playback of these frames. The synthesizer may choose one of the frames to play at random, for some variation in the voice. Or it might play all of them in sequence, for use whenever a single frame is not enough to capture the phoneme clearly enough. Or, in the case of the trilled R, it might rapidly cycle through the frames. Whatever the method, we do need the actual synthesizer, so let's tackle that part now.
This is basically the same code as in the LPC-to-WAV converter I briefly showed earlier, but let's go through it in more detail now. I am using SFML for this project. This audio driver class is basically the exact same thing as in the PCM audio video I made earlier. Its job is just to read samples from an array and push them to the sound library. There is nothing too exciting about it. The interesting part is where the LPC frames get converted into wave audio. In the context of speech synthesis, LPC works so that first there is a source of noise, a buzzer: something that generates a voice that has a pitch. Anything will do, including music, as long as it is not a pure sine wave. It cannot be a pure sine wave, because the next step is applying a finite impulse response filter over it. This filter either attenuates or amplifies certain frequencies of the buzz, but it cannot make them up from nothing. The
difference between the buzz and the filter output is saved into the buffer. The filter operates on the differences between the buzz and past outputs generated by the filter, so we use a rolling buffer. That's what the modulo operator does: it makes sure the indexes loop back to the same indexes over and over again. The latest sample is sent to the speaker. In my design, the audio chunk is first saved into a temporary buffer and then moved into the buffer that is shared with the audio engine. This is so that we can minimize the time that the audio buffer has to be locked. And this is what it sounds like. Mind you, this is going to be Finnish-language text right now. [synthesized Finnish speech]
It was already fairly understandable to an average Finnish listener, even if some phonemes were not as clear as they could be. There were three little problems with that short sample. First, the speech was quite monotonous. We
could make it sound more interesting by smoothly altering the pitch and voice quality over time. However, that's not enough. I decided to actually model the typical flow of pitch in Finnish text reading. To do that, first the text is divided into syllables using a rough algorithm that simply checks where the vowels and consonants are, and decides that a new syllable begins where there is a single consonant followed by a vowel. Then the pitch curve is given to the sentences by keeping track of where each sentence begins and where it ends, giving a certain pitch to the first and last syllable, and interpolating the rest. And this is what it sounds like. [synthesized Finnish speech]
The second problem is quite obvious and quite annoying. To be honest, I have no idea what causes the constant clicks and clacks heard in the audio, but I figured it's best to do something about them. My workaround for the clicks and pops is not very pretty. It is pretty much equivalent to fixing a broken television by beating it until it works. It gets the work done. I also decided to smooth out the frame boundaries a bit by making all the synthesis parameters change smoothly, gradually. This is what it sounds like. [synthesized Finnish speech]
But the title of this video was not "Let's make a Finnish speech synthesizer". This video was about making a speech synthesizer with a Finnish accent, so there is still work to do: I have to make it read English. To
make it read English, I borrowed code from a very old speech synthesis program called rsynth, which in turn borrowed from a research paper from the United States Naval Research Laboratory in the year 1976. I simplified the code a bit so that the two source code files, about 900 lines of code, fit nicely in one screenful, and I got myself a function that converts English text into a sort of ASCII representation of the International Phonetic Alphabet. Then I wrote a conversion table which reduces those phonemes into the set of phonemes used in Finnish. This function is then called in the part of my program that deals with text-to-phoneme conversion.
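A conversion table of that kind could be sketched as follows. This is not the table from the video: the phoneme spellings and every mapping entry below are illustrative guesses at a stereotypical Finnish accent, and `reduceToFinnish` is a hypothetical name. Phonemes without an entry are assumed to exist in the Finnish set already and pass through unchanged.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Replaces English phonemes with Finnish approximations.
// The table entries are made-up examples, not the author's data.
std::vector<std::string> reduceToFinnish(
        const std::vector<std::string>& english) {
    static const std::unordered_map<std::string,
                                    std::vector<std::string>> table = {
        {"b",  {"p"}},        // voiced stops lose their voicing
        {"g",  {"k"}},
        {"z",  {"s"}},        // all sibilants collapse into s
        {"sh", {"s"}},
        {"th", {"t"}},
        {"w",  {"v"}},
        {"ch", {"t", "s"}},   // one phoneme may expand into two
    };

    std::vector<std::string> out;
    for (const auto& p : english) {
        assert(!p.empty());
        auto it = table.find(p);
        if (it == table.end())
            out.push_back(p);  // already pronounceable as-is
        else
            out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}
```

For instance, `reduceToFinnish({"th", "i", "s"})` gives `t`, `i`, `s`. Keeping the reduction in a separate table means the English text-to-phoneme function can stay untouched and the accent lives entirely in the data.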
And it sounds like this. [synthesized English speech] To be fair, it is hard to understand. This is maybe not exactly like a typical Finnish accent. [synthesized speech, unintelligible]
Many people have been joking about my accent, suggesting that maybe I wrote a speech synthesizer to do the voiceovers for my videos. Well, in case you ever wondered what would happen if I were to do that, now you know.
If you liked what you saw, thumbs up the video and hit the subscribe button if you haven't already. Hit the bell icon too, to make sure you get all notifications of my new uploads. Thanks go to my supporters at Patreon, PayPal, Liberapay and other sites. I have not addressed you in a video for a long time, but you are very much appreciated indeed. As always, have a nice day, and shalom in your life.