Let’s Create a Speech Synthesizer (C++17) with Finnish Accent!

Today I have something I filmed: let's create a speech synthesizer, with a Finnish accent. Earlier I made three videos providing background information for this video. If you haven't watched them yet, I suggest you do so now; I have compiled them into a nice playlist so you can watch them all in one sitting. Please do so now and then come back here. Click the card that opens the playlist; I will be waiting here. Unlike in the PCM video, replaying pre-recorded speech samples is not what I had in mind for this video. To recap, here is the list of phonemes that will be the building blocks of speech for the synthesizer. At the bottom are listed phonemes that most Finns do not pronounce correctly, but instead elide into other phonemes. I think, for authenticity, we can do the same in our speech synthesizer. This will help keep the design simple. This is the resulting roster of phonemes: there are 22 phonemes in total. Now, there are many ways to go forward from here. One of the popular approaches, which leads to high-quality speech synthesizers, is to create a list of all pairs of phonemes that can occur in normal speech: for example, all consonants followed by all vowels, but also all vowels followed by all consonants, and of course all pairs of vowels that can be reasonably pronounced, and all pairs of consonants too. They would get a professional voice artist to record all these hundreds of samples at constant pitch and constant stress. For some speech synthesizers, even triplets might be recorded. It's likely they would construct an artificially long piece of text that contains all these phoneme pairs, and the voice artist would be instructed to read it as monotonously as they possibly can. Then someone would use an audio editing program and meticulously cut pieces from the recording to populate this table. The speech synthesizer would mix and select these samples at runtime.
For example, this example word would be constructed from 15 voice samples, some of which are identical, and the synthesizer would seamlessly blend the end of one sample into the beginning of the next. For my demo speech synthesizer, I will not do anything that complex; I'm going to operate on single phonemes only. Now, I could just use recordings of myself speaking all these different phonemes, and it would not take very much time to do that at all. Instead, I decided to approach the problem in an old-fashioned way. So I made this chart. It shows how each of these phonemes might be constructed. The first 14 phonemes have something in common: the vocal cords are vibrating throughout the phoneme. For example, I can say a sentence like this in a single unbroken voice, using all of these 14 phonemes. Let's speak about vowels first. Humans can speak because we are able to change how our voice resonates within our mouths. Bear with me for a moment; I'm going to create an incredibly stupid-sounding recording. [recording] Now let's clean up that audio and save it on disk. Next, let's open the audio in Praat. Praat is an open-source program for studying phonetics. While I was doing that recording, you heard several different vowels. My voice stayed at a constant pitch, but different harmonic overtones were created by varying the shape of the airways within my mouth. In this analysis window you can actually see what happened. At the bottom there is a blue line indicating my voice pitch. It is relatively horizontal, which means there was not much variation in it. However, these red lines represent the harmonic overtones of my voice, and they are all over the place, changing smoothly between low and high values. In speech, these harmonic overtones are called formants. In Wikipedia
there is a relatively brief article about formants, including this table of typical values for the formants of different vowels. Each of these 14 sounds has formants that make up the sound. Formants are produced by different parts of the vocal tract, including the larynx and the pharynx. For a speech synthesizer, the exact mechanism is not as important as the result. Additionally, with some of the sounds there is some level of frication present; it is a little whooshing component. The whooshing sounds a bit different in each consonant: it may be higher pitched or lower pitched, and it may be short or long. In other words, there is a sound source, and a tube that adds resonance and noises to the sound. This is called the source-filter model, and an audio compression method called linear predictive coding (LPC) is centered around this same model. LPC

starts with the assumption that the speech signal is produced by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation of the reality of speech production. Do you have a cell phone? LPC happens to be the basis of GSM voice compression. If you have a cell phone, it contains an implementation of LPC. So I am going to use LPC for this synthesizer too. In this table I have identified the component sounds that I need to synthesize. For the first fourteen phonemes, we have a voice that is modulated in different ways, plus some optional frication at the same time. The rest of the consonants are similar, except there is no voice simultaneously. I have split each phoneme into three parts: a beginning, a middle, and an end. Each phoneme may have a short sound of some kind in the beginning and in the end. For example, at the end of m there is a subtle sound from the lips. The middle is the part of the phoneme that is stretched as long as it needs to be, to produce a short or long sound. So the total budget of sounds that I need is 17 sustain sounds, 7 release sounds, one glottal noise, and silence: 26 sound samples in total. To generate these samples, I recorded myself saying this sequence as monotonically as I could. [he recites the Finnish phoneme sequence] This recording was imported into Praat. Then I edited the sound to make it completely monotonous. In hindsight, this step was completely redundant, but it was nice to learn that this research tool could serve as an auto-tune program for bad singers. This is how the result sounds. [the monotonized recording plays] This was then downsampled to 44 kilohertz, removing some mostly irrelevant detail. Then I used Praat to convert this recording into 48th-order LPC. The resulting file looks like this: it's a text file that contains some numbers.
The audio was divided into frames, and for each frame, a set of coefficients and a gain is listed. Next I wrote a C++ program to play this file. The program reads all lines in the file and identifies their content. It saves important parameters, like the sampling period, which is the inverse of the sampling rate, into variables. The coefficients are saved in an array. When it encounters the gain line, it synthesizes the frame. The frame is synthesized like this. It starts by generating an arbitrary buzz; anything goes, as long as it has a clear frequency and as long as it's not a pure sine wave. Next, the LPC filter is applied. The filter shapes the frequency characteristics of the buzz that is fed to it, much like an equalizer. Basically, it's a vocoder. The resulting sample is saved into a buffer. Once the file is done with, the buffer is saved into a WAV file, and this is how it sounds. [the synthesized recording plays] Yes. 48 was my choice for the order of the LPC data. I made a comparison of different LPC orders. Here is a short voice sample I took from one of Dr. David Wood's videos, "How to stop prison radicalization", and here's how it sounds at different orders. [the sample plays at several LPC orders] I think that 48 was the sweet spot, where artifacts were minimal, and increasing the number of coefficients beyond 48 did not improve the audio significantly enough to justify the increase of data. Now, it is important to note that the LPC file is not a recording. It is a synthesis instruction. For example, I can modify the buzz formula and replace it with white noise. [whispered sample] This changes the voice into a whisper. Or I can change the tempo: make it four times slower, or make it twice as fast. [samples] Or change the pitch: make it higher,
or make it lower. [samples] My buzz formula deliberately contains a small amount of aspiration in it. If I remove the aspiration and leave just the buzz, [sample] the sound becomes a bit cleaner, and also more synthetic sounding. These samples were recorded at 44 kilohertz. If I used a much smaller sample rate, such as 8 kilohertz, a much smaller number of coefficients would be enough. Here is a 16-coefficient LPC made from the 44 kilohertz recording, [sample] and here is a 16-coefficient LPC made from my 8 kilohertz recording. [sample] The latter was a bit more muffled, like a telephone line, but had way fewer chirping artifacts in it. Lowering the sample rate allows you to get more bang for your buck in terms of data transmission, and that's why telephone lines and cell phones use a low sample rate. But there are plenty of low-sample-rate speech synthesizers out there, and I want to use a good sample rate, so I'm going with 44
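The buzz-versus-whisper swap described above can be sketched like this. This is a hedged illustration, not the video's actual code; the function and parameter names are my own.

```cpp
#include <cmath>
#include <random>

// One excitation ("buzz") sample. `phase` is the position within the
// waveform period, in [0, 1). A voiced frame uses a sawtooth, which has
// a clear pitch but a rich harmonic spectrum for the filter to shape.
// Setting `whisper` swaps the buzz for white noise, which turns the
// synthesized voice into a whisper, as demonstrated in the video.
double excitationSample(double phase, bool whisper, std::mt19937& rng)
{
    if(whisper)
    {
        std::uniform_real_distribution<double> noise(-1.0, 1.0);
        return noise(rng);
    }
    // Sawtooth rising from -1 to +1 over one period.
    return 2.0 * phase - 1.0;
}
```

The sawtooth is only one valid choice; any harmonically rich waveform works, as discussed later in the video.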

kilohertz and 48th-order LPC. So the LPC file is divided into frames, each frame representing the characteristics of the audio for a small slice of time. Next, I spent a day writing this tool, which is a modification of the WAV-writing program from earlier. This program allows you to adjust parameters such as breathiness and buzziness in real time, and to choose any frame from the recording to play. I used this to pick the frames that in my opinion best represented the phonemes that I wanted to include in my speech synthesizer. Next I wrote a tool that copy-pastes the frames that I picked, and it produced this file. It is C++ source code, which brings me to the next part: C++ source code. We begin with the data structure that was just generated. This saves each of the recordings as a structure. I decided to make it so that each recording can have multiple frames, rather than just one, for better quality. My process of text-to-speech begins by reading the text input and converting it into a list of phonemes, or rather, pronunciation elements. First we start by normalizing the text, removing as much unnecessary detail as possible, such as converting all of it into lowercase. I also went ahead and converted it into 32-bit Unicode, because dealing with text character by character is quite difficult in UTF-8, where a single character can span multiple bytes. I mean, it's still not perfect even in 32-bit Unicode, because of combining diacritics and stuff, but you get what I mean; it helps with this application. Punctuation must also be taken care of. I decided to add special symbols in angled brackets that will be later used to control the pitch of the voice. I will just leave the pitch handling blank for now and get back to it later. Now that the text has been canonicalized, the work-in-progress string should only contain pronounceable
letters and pause markers. It's converted into indexes into the sound recordings list. This code is a bit complicated for what it actually does; it basically just assigns a timing value for each phoneme, depending on whether it was repeated or not. If you are interested in exploring it in detail, you can download the source code, which can be found through links in the video description, and explore it offline. Now that we have the list of records that we should use to play the speech, let's go through them.
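The normalization step described above can be sketched as follows. This assumes plain ASCII input; the video's actual code handles UTF-8 decoding and the angle-bracket pitch-control symbols, which are omitted here.

```cpp
#include <cctype>
#include <string>

// Canonicalize text: lowercase it and widen it to a 32-bit string, so
// that one array element equals one character. Real UTF-8 decoding is
// skipped; each byte is assumed to be one ASCII character in this sketch.
std::u32string normalizeText(const std::string& text)
{
    std::u32string out;
    out.reserve(text.size());
    for(unsigned char c : text)
        out += static_cast<char32_t>(std::tolower(c));
    return out;
}
```

Working in `std::u32string` sidesteps the multi-byte-character problem mentioned in the video, though combining diacritics can still make one visible glyph span multiple code points.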

Earlier I mentioned that in my design, each record may actually contain more than one frame. I decided upon three different styles for playback of these frames. The synthesizer may choose one of the frames to play at random, for some variation in the voice; or it might play all of them in sequence, for use whenever a single frame is not enough to capture the phoneme clearly enough; or, in the case of the trill r, it might rapidly cycle through the frames. Whatever the method, we do need the actual synthesizer, so let's tackle that part now. This is basically the same code as the LPC-to-WAV converter I briefly showed earlier, but let's go through it in more detail now. I am using SFML for this project. This AudioDriver class is basically the exact same thing as in the PCM audio video I made earlier. Its job is just to read samples from an array and push them to the sound library; there is nothing too exciting about it. The interesting part is where the LPC frames get converted into WAV audio. In the context of speech synthesis, LPC works so that first there is a source of noise, a buzzer, something that generates a voice that has a pitch. Anything will do, including music, as long as it is not a pure sine wave. It cannot be a pure sine wave, because the next step is applying a finite impulse response filter over it. This filter either attenuates or amplifies certain frequencies of the buzz, but it cannot make them up from nothing. The
difference between the buzz and the filter output is saved into the buffer. The filter operates on the differences between the buzz and past outputs generated by the filter, so we use a rolling buffer. That's what the modulo operator does: it makes sure the indexes loop back to the same indexes over and over again. The latest sample is sent to the speaker. In my design, the audio chunk is first saved into a temporary buffer and then moved into the buffer that is shared by the audio engine. This is so that we can minimize the time that the audio buffer has to be locked. And this is what it sounds like; mind you, this is going to be Finnish-language text right now. [synthesized Finnish speech] It was already fairly understandable to an average Finnish listener, even if some phonemes were not as clear as they could be. There were three little problems with that short sample. First, the speech was quite monotonous. We
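The rolling buffer with the modulo operator can be sketched like this. The names and the small order are illustrative only (the video uses order 48), not taken from the actual source.

```cpp
#include <array>

constexpr unsigned Order = 4; // illustrative; the video uses 48

// Circular history of past filter outputs. Instead of shifting every
// sample back by one slot each step, an index wraps around with modulo,
// so the same storage is reused over and over.
struct RollingHistory
{
    std::array<double, Order> past{};
    unsigned head = 0;

    void push(double sample)
    {
        past[head] = sample;
        head = (head + 1) % Order; // loop back to the start
    }

    // Sample from k steps ago, for 1 <= k <= Order.
    double ago(unsigned k) const
    {
        return past[(head + Order - k) % Order];
    }
};
```

This avoids an O(Order) copy per sample, at the cost of a modulo per access; a comment thread below notes the same tradeoff versus SIMD-friendly shifting.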

could make it sound more interesting by smoothly altering the pitch and voice quality over time. However, that's not enough. I decided to actually model the typical flow of pitch in Finnish text reading. To do that, first the text is divided into syllables, using a rough algorithm that simply checks where the vowels and consonants are, and decides that a new syllable begins where there is a single consonant followed by a vowel. Then a pitch curve is given to the sentences, by keeping track of where each sentence begins and where it ends, giving a certain pitch to the first and last syllables, and interpolating the rest. And this is what it sounds like. [synthesized Finnish speech] The second problem is quite obvious, and quite annoying. To be honest, I have no idea what is the cause of the constant clicks and clacks heard in the audio, but I figured it's best to do something about them. My workaround for the clicks and pops is not very pretty. It is pretty much equivalent to fixing a broken television by beating it until it works. But it gets the work done. I also decided to smooth out the frame boundaries a bit, by making all the synthesis parameters change smoothly, gradually. This is what it sounds like. [synthesized Finnish speech] But the title of this video was not "let's make a Finnish speech synthesizer". This video was about making a speech synthesizer with a Finnish accent. So there is still work to do: I have to make it read English. To
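The rough syllable rule just described (a new syllable begins where a single consonant is followed by a vowel) can be sketched like this. This is ASCII-only for brevity; real Finnish also needs ä and ö as vowels, and the actual algorithm in the source may differ in details.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Vowels this sketch recognizes (ASCII only; Finnish ä and ö are
// multi-byte in UTF-8 and deliberately skipped here).
static bool isVowel(char c)
{
    return std::string("aeiouy").find(c) != std::string::npos;
}

// Split a lowercase word into syllables: start a new syllable before a
// consonant that is directly followed by a vowel.
std::vector<std::string> splitSyllables(const std::string& word)
{
    std::vector<std::string> result;
    std::string current;
    for(std::size_t i = 0; i < word.size(); ++i)
    {
        bool boundary = i > 0 && !isVowel(word[i])
                     && i + 1 < word.size() && isVowel(word[i + 1]);
        if(boundary && !current.empty())
        {
            result.push_back(current);
            current.clear();
        }
        current += word[i];
    }
    if(!current.empty()) result.push_back(current);
    return result;
}
```

For example, "talo" splits into "ta" and "lo", and "matala" into "ma", "ta", "la", which matches standard Finnish syllabification for these simple cases.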
make it read English, I borrowed code from a very old speech synthesis program called rsynth, which in turn arose from a research paper from the United States Naval Research Laboratory, in the year 1976. I simplified the code a bit, so that the two source code files, about 900 lines of code, fit nicely in one screenful, and I got myself a function that converts English text into a sort of ASCII representation of the International Phonetic Alphabet. Then I wrote the conversion table, which reduces those phonemes into the set of phonemes used in Finnish. This function is then called in the part of my program that deals with text-to-phonemes conversion.

And it sounds something like this. [the synthesizer reads English text with a Finnish accent] To be fair, it is hard to understand, and it is not exactly like a typical Finnish accent. Many people have been joking about my accent, suggesting that maybe I wrote the speech synthesizer to do the voiceovers for my videos. Well, in case you ever wondered what would happen if I were to do that, now you know. If you liked what you saw, thumbs up the video, and hit the subscribe button if you haven't already. It's the best way to make sure you get all notifications of my new uploads. Big thanks go to my supporters at Patreon, PayPal, Liberapay, and other sites. I have not addressed you in a video for a long time, but you are very much appreciated indeed. As always, have a nice day, and a shalom in your life.

2019-02-01 13:21



This is LPC (Linear Predictive Coding):

  yₙ = eₙ − ∑(ₖ₌₁..ₚ) (bₖ yₙ₋ₖ)

where
‣ y[] = output signal; e[] = excitation signal (the buzz, also called the predictor error signal); b[] = the coefficients for the given frame
‣ p = number of coefficients per frame; k = coefficient index; n = output index

Compare with FIR (Finite Impulse Response):

  yₙ = ∑(ₖ₌₁..ₚ) (bₖ xₙ₋ₖ)

where
‣ x[] = input signal

The similarities between the two are striking. FIR is used in applications like low-pass, high-pass, band-pass, and band-stop filtering. An almost magical type of mathematics is used to generate these filters. For LPC, there are several different algorithms, many of which are implemented in Praat, the software that I used in this video to create my LPC files.


+私の顔にペニスを撃ってください your name is inappropriate

+私の顔にペニスを撃ってください Get over it.

16-17 minutes in: Please next time lower the audio or give a warning. The popping killed my hearing

I met that formula when I learned the Signals & Systems module (my school calls it Digital Signal Processing)

These are two of my favorite things: synthesizers and a new Bisqwit video. Great job.

Great Project. Thanks for sharing.

"joes own editor" :P

+Bisqwit please make more.videos cracking games

Yes, that is its name. https://github.com/jhallen/joe-editor/

Where is he from?

+1 for using 'goto' in your code :)

And i am proud coding a minimax 4 in a row bot...

It would be fun adding this to your MIDI player for playing back karaoke MIDI files.

Well then, Bisqwit using Dear ImGui!

Yup, that’s a first!

Ohhh in the past I've made a pseudo "TTS" using the winmm from windows.h and PlaySound function...

His C++ level ... is over 9000

Your failing lowercase conversion for umlauts is a pretty nasty trap I fell into in the past as well. It looks like you're doing everything correct, it should work, yet it somehow doesn't. Unfortunately, it's not as easy as imbuing/passing the correct locale. It might be, but that's not guaranteed. Even when using the UTF-8 locale, you might just walk char by char and for whatever reason ignore UTF-8 sequences… So far for me it always worked when using the wide character version instead (i.e. `wchar_t` over `char` or possibly `uint32_t`), although I've heard even that fails for some. Guess it's not totally unexpected I've heard stuff about dropping the `codecvt` header from the standard… So in your case I'd just do the `std::u32string` conversion first, then unify character casing after that.

Can it welcome you to the hydraulic press tsännel?

Sure it could.

8:00 Buzzer cannot be pure sine, because then the filtering of the frequencies would make no sense - there would be only one frequency in buzzer to start with. Buzzer needs to have rich frequency spectrum, but at the same time it needs to be harmonic (i.e. all frequencies are natural multiples of some base frequency = there is a defined pitch). You could use any function in form A*sin(x) + B*sin(2x) + C*sin(3x) +..., but of course the easiest way to produce signal like that is to use 1) square wave, 2) sawtooth wave (as you did), 3) triangle wave, 4) exp(sin(x)), etc.
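The commenter's point can be illustrated numerically: a buzz built as A·sin(x) + B·sin(2x) + C·sin(3x) + … carries many harmonics for the filter to shape, while a pure sine has just one. A small sketch (the function name is mine):

```cpp
#include <cmath>

// Harmonic-rich buzz: a sum of sine partials with 1/k amplitudes, which
// approximates a sawtooth wave as the number of harmonics grows. With
// harmonics == 1 it degenerates into a pure sine, which would leave an
// LPC filter with only a single frequency to attenuate or amplify.
double harmonicBuzz(double x, int harmonics)
{
    double sum = 0.0;
    for(int k = 1; k <= harmonics; ++k)
        sum += std::sin(k * x) / k;
    return sum;
}
```

Any of the shapes the commenter lists (square, sawtooth, triangle, exp(sin(x))) produces a spectrum like this, with energy at integer multiples of the base frequency.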

+Bisqwit Vocal cords are the buzzers. Air goes through a buzzer, then through a tube (the vocal tract) which amplifies some frequencies (formants) and dampens others. If the buzzer's sound were just one sine wave, then the tube would just make it louder or quieter, nothing more. A tube cannot create new frequencies; it acts as a filter only. So the aim for the buzzer is to generate many frequencies, so the tube (vocal tract) has something to choose from. White noise (during whispering) has all frequencies, so it is OK. A pitched sound is also OK, since it has many sine waves in it, as long as its base frequency is not too high (it's easier to understand bass singing than soprano singing!). A high pitch has fewer sine waves in the formant frequency range (~300-3000 Hz). Try changing VoicePitch to ~1046 Hz (a soprano's high C), and you won't be able to distinguish the vowel o from u or a, or e from i.

Good explanation, but not an ELI5. I had trouble explaining it in layman terms without invoking mathematics and frequency spectrums... That's why I wrote the annotation.

16:53 this sounds like isis

Perfect for spurdo memes

People who write code not in a proper compiler are scary.

Actually you don't write code in a compiler :/

I was going to make a joke about how you already sound like speech synthesis when speaking English, but your English gets better every single video. Keep it up man!

I'm convinced now that I'm absolutely retarded.

You should do coding in Roblox Studio. https://www.roblox.com/create

I legit thought the speech synthesizer was speaking for the first 3 minutes

Note that std::wstring_convert is deprecated in C++17, so if you want to be standard conforming, you should replace it with something else.

+Bisqwit Unfortunately there isn't a standard way anymore. The standards commity has said that they're working on a replacement, but will only readd it if it's fully compliant with the unicode standards (apparently this one didn't work in all cases). The only way seems to be fully implement it yourself (utf8 decoding isn't very hard luckily), or use a library like iconv or libicu.

Noted. I used it for 1) its brevity and 2) because I couldn’t figure out a concise replacement that is not deprecated.

My Finnish is a bit rusty; the sample Finnish texts were bible quotes, right?

Correct. The captions had the equivalent text in English.

I want to hear it sing Space Oddity.

You never explained how to actually stop prison radicalisation.. dislike! Jk very interesting and cool video :)

I linked to a video that explains it though! Thanks for writing.

I can feel my teacher aura here

Can you make it say some common Rallienglanti phrases for us?


Like this? https://bisqwit.iki.fi/kala/finsyn-rally.mp3

Rolling index, why didn't I think of that... Thanks for the excellent video. :)

Yeah, a rolling index is a bit neater solution than doing a copy-backwards-by-1 loop after each iteration. On the other hand, the rolling index makes SIMD optimizations impossible, so it’s a tradeoff.

You are amazing.

Great video :)

These videos make me respect you even more. You're very knowledgeable!

I'm speechless. No pun intended.

3:16 lol Futurama martians


I, for one, welcome our new, Finnish robot overlords. _Hail Roboisqwit!_ Seriously, though, this is neat-as-hell. It's also kind of… heartbreaking, in a way. I never considered how speech synthesis works, and now that I know? The magic… is gone. :(

Joel, regarding your question at 8:05, we cannot use a sine wave because it only has audio energy in 1 frequency, whereas to synthesise human speech, we need energies in "all" frequencies, so we can have base pitches and formants happening at the same time. Buzzers have a better spread of frequencies, compared to the more "pure" sine wave. Hope I made sense ^_^

That's actually a very good explanation. Your first explanation made me understand it. But then again, I'm not 5 years old.

+Bisqwit Aha, I'd never heard the term ELI5 (Explain Like I'm 5) before! Here is my second attempt :-) Voice sounds are slightly complicated. Sine wave sounds are simple. Buzzers are super-complicated. We cannot use 1 simple sine wave, filter it, and get a complex voice sound. We have to start with a super-complex buzzer, then filter out some things, to be left with a less-complex voice sound.

Good explanation, but not really an ELI5 :-) I understand the situation as indicated elsewhere in the video, but I was having trouble explaining in layman terms without referring to things like frequency spectrum; I wrote that request for the benefit of audience.

is being a cuckold christian tradition?


hey bisqwit, what operating system do you use?

+ivan pineda It's debian buster

+Bisqwit thanks, I was curious because it doesn't quite look like other distros I've seen

Linux in general.

Joel, always a pleasure to watch a C++ (related in some way) video. Keep the good work!

Question: At 9:17, why do you have: constexpr unsigned maxOrder? What is the purpose of the constexpr here? Won't the compiler evaluate what maxOrder is without the constexpr? Why haven't you used const?

For integers, there is not much difference between const and constexpr. I just like to document the intention. The primary target audience of source code is people, after all. When I write “constexpr”, I mean “this should be a compile-time constant, and something probably depends on the fact”. Here, MaxOrder _needs_ to be a compile-time constant, because it is used as an array dimension. When I write “const”, I mean “this is immutable; it should be only read, not written to”. For example, the constant “rate” is not intended to be changed, but I don’t necessarily need it to be a compile-time constant, even though it happens to be.

Awesome stuff.

Earthbound music in the background FeelsgoodMan

Yus, more translation material!

maybe a weird question, but are you self taught?

+Bisqwit inspiring

Pretty much.

sUoMi PeRkElE



This doesn't fool me, I know you have a much more advanced synthesizer that you use for your videos. A nice coverup attempt though


Wow, what a cool video to wake up to. Excellent work!

11:01 This program looks really nice. What GUI library did you use?

Yep, correct. Imgui it is.

Looks like imgui to me.

I wanna make it say nigga.

Interesting as always.

LOL this guy hasn't understood the matrix.

The solution for the clicking in the sound is to simply fade out some of the frequencies at the very end of the sample, because LPC just converts the audio samples into a simple, low-resolution representation: just a bunch of float values and a gain.

Sample voice frame to C code... the Lisp lover in me says you should have used Lisp: code as data and data as code. Either way this was beyond interesting. I like your accent, but to be honest everyone who speaks English has an accent. A voice speaking with an accent was definitely something I wasn't expecting this Monday.

Wow, I don't know what you are even talking about, but it's cool.

Speech synthesis

Hello, Bisqwit! I have sent you an email, if you are free, please do read it.

Awesome, as always bisqwit!

Can this be done in JAVA?

+Bisqwit hmm , okay I will try this on Android. Thank you.

There is nothing low-level in this code.

+Bisqwit even those low level stuff ? android probably uses C binaries for its speech synthesis

It can be done in any turing-complete language.

I've never seen `++i %= max` before. That's pretty cool. Edit: it seems this only works in C++ but not in C, Java or Javascript

+shaurz I'd just write a tiny inline function with a speaking name instead. ;) Like `incmod(v, m)`

I wouldn't recommend writing such esoteric code. i = (i + 1) % max will do the same thing and will compile to the same code.

+Bisqwit Thanks for the explanation! I used an online compiler to quickly try all versions of C++ and indeed it worked in all of them.

In C++, operator++() returns a reference to the object being modified. This is not the case in C. This has nothing to do with C++17 or about sequence points. If the expression was `i++ %= max`, it would be a different story. `++i %= max` is completely unambiguous in its meaning. The reason it does not work in C is because `++i` returns a non-lvalue copy of the variable in C, not a reference to it. (C does not have references.)
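A compilable check of the point above, for the curious (this idiom is C++-only; the same line is invalid in C, where `++i` is not an lvalue):

```cpp
// In C++, ++i yields an lvalue (a reference to i), so it can appear on
// the left of a compound assignment: `++i %= max` increments and wraps
// in one well-defined expression.
unsigned wrapIncrement(unsigned i, unsigned max)
{
    ++i %= max;
    return i;
}
```

The conventional spelling `i = (i + 1) % max`, suggested elsewhere in this thread, compiles to the same thing.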

How did you know this stuff before starting the project or learn as you went? I've wanted to start stuff like this but get overwhelmed by all the stuff I have to learn to finish the project.

I studied it while making this project. Reading example code, reading articles that describe how LPC works, exploring outputs, trial and error until I got the first LPC-to-WAV converter working. Some basic principles I had already learned years ago don’t-remember-where. And of course, the principles of phoneme-based speech synthesis were already familiar to me since the 1990s when I studied how Dr. Sbaitso works.

We have reached peak AI revolution: machines making machines. A voice synth making a voice synth.

Actually the truth was like -22. I just happened to do the recording a month earlier...



whoah nice!

It can't be a sine wave, because as far as I can tell, you need a spectrum of frequencies spanning the human voice. A sine wave is only one frequency.

+Bisqwit that's my guess. You need the multiple frequencies to "equalize" the voice out of

Yes but it is only one frequency at a time. Applying an equalizer to it can not create any new frequencies or harmonics.

A sine wave is just the shape. It can be of any imaginable frequency.

Yes, but now we want to listen to the synthesizer's voice

you *are*

Can you explain just a bit what you did to generate the LPC sample from David Woods? I guess manually editing the pitch curve with Praat?

I dumped the soundtrack of the video into a wav file using MPlayer. Then I opened the soundtrack in Audacity, and cropped it into just those three seconds or so, saved it into a new wav file. (Or maybe I dumped only three seconds from the soundtrack in the first place, using -ss and -endpos options. I don’t remember.) Then I opened the wav file in Praat, and did nothing else but synthesized the LPC from it (Analyze spectrum → To LPC (burg) → Save).

I always liked your accent. So I liked it

I like this

This might be a silly question, but is the speech from the beginning generated using the speech synthesizer you coded here? It sounds much better there, so I guess not...

About your edit (pronouncing words differently so they sound right) I was reading just yesterday about TTS engines (which is a huge coincidence) and Mycroft's Mimic engine has a page where people can suggest fixes to words doing exactly that ("eye q" for "IQ" and other stuff)

Wow, it sounds so much better! I guess it's possible to do the same changes in code, although figuring out the pitch curve is probably a problem on its own

It is generated by the same synthesizer indeed, but I edited the pitch curve in Praat in postprocess. Also I think I spelled some words a bit differently in order to get the right voice out. I did the same at 19:20 (except for postprocessing), intentionally misspelling “typical” as “typeecal” because the text-to-phonemes table would have generated /tɪpaɪkəl/ (typie-cal) otherwise.

Hey Bisqwit! May I ask what editing software do you use ? Thanks!

For which type of content?

You inspire me to start coding in C++ I think you know C lang better than anyone else in world :)

Really outstanding video! Great work Bisqwit!

Music: https://www.youtube.com/watch?v=Url3QHHNKSA

Maybe it would sound better if you put some fading in between the phonemes?

+Bisqwit Ah yes, I meant crossfading. Didn't notice that's how you fixed the pops. Awesome video by the way!

That would likely produce jumpy staccato sound. If by fading you mean cross-fade, as in smooth altering the coefficients, that’s what I already do as indicated at 17:30.

Ha! This was so much fun!

Dang, this was awesome!

So you're basically teaching us how to make Vocaloid-like software? Nice.

+jj zun Yesssssssssssssssssssssssss

I'm down for some Hatsune-Bisqwit! ;)

Cheers Bisqwit! Getting close to a guitar effects tutorial!

"Yes, I use PHP. Because a programming language, that you know is much more efficient than one that you don't know." This is the truest statement I have ever heard.

in order to debug PHP you have to var_dump every single variable because the stack trace in PHP is a real mess.

+Michael Smith You can debug it line by line with xdebug, but c# or java are usually better for that kind of work.

+Li Feng In what world is PHP "easy to debug"

Please don't talk about PHP

I use PHP too because it's more productive and easier to debug.

this is one heck of a flex on the people theorizing you use a synthesiser for VO

Yeah. Those comments are so boring now IMO...

+Minh It's pretty similar; in fact, the first vocoders as used by artists like Kraftwerk worked in a similar manner, but implemented in an analog way - instead of using digital FIR filters and complex math for building the filters, the voice was fed into an analog filterbank, and the amplitudes of the outputs of the filters were fed into the gain controls of another filterbank tuned to the same frequencies, which filtered the synthesizer's audio. The result was a synthesizer speaking. This was also used as a crude form of speech compression for encrypting communication by the US Army in the past.

This. Someone do a voice synth using only synth sounds. Harmor recommended for even more hardcore.

I think the clicks are because the program is cutting/pasting at arbitrary waveform values. This produces discontinuities in the waveform that generate those clicks. I think the simple way to solve it is to just wait until the value of the sample crosses the zero line to perform the cut of the audio, and wait again for a zero crossing to introduce the next one.

You also have to determine a rhythmic pattern dependent on the language and overall delivery, as words are not spoken at a constant pace.

+Music sucks That's the method used in video editing.

What about a simple fade-out/fade-in between the samples?

Interesting theory. That actually matches advice mentioned in the SNES manual in relation to audio samples. What it is trying to say exactly is ambiguous, but it warns against discontinuities in the waveform, which would result in clicking sounds. Of course, given the ADPCM coding, discontinuities on block boundaries would easily result if you're not careful. (since the samples within a block are all expanded using the same parameters, but across block boundaries the parameters change.)
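The zero-crossing splice suggested at the top of this thread can be sketched like this. A minimal illustration only: `nextZeroCrossing` and the buffer layout are my own names, not anything from the video's code.

```cpp
// Hypothetical sketch of the zero-crossing splice idea: instead of cutting
// a sample buffer at an arbitrary index, search forward for the nearest
// point where the waveform crosses (or touches) zero, so that the spliced
// pieces join without a discontinuity.
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the first index >= start where consecutive samples have opposite
// signs (or the sample is exactly zero). Falls back to the last valid index
// if no crossing is found.
std::size_t nextZeroCrossing(const std::vector<float>& s, std::size_t start)
{
    for (std::size_t i = start; i + 1 < s.size(); ++i)
        if (s[i] == 0.0f || (s[i] > 0.0f) != (s[i + 1] > 0.0f))
            return i;
    return s.empty() ? 0 : s.size() - 1;
}
```

A splicer would cut at `nextZeroCrossing(bufferA, cutPoint)` and resume at the next zero crossing of the following buffer, avoiding the step discontinuity that produces the click.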

Shalom. Very nice video. What software do you use for video editing on Linux btw?

I use kdenlive.

Now I can have Robot Bisqwit wake me up every morning.

Imagine having “SHALOM! SHALOM!” as a wake-up alarm

18:01 SETI finally caught a clear message from another galaxy...Nice project

I'm early this time. Awesome video!

Thank you!

I think I remember asking about this at one point, glad to see a video done on it

He naturally sounds like a speech synthesizer.

+Moriarty Vivaldi It's a meme and that happens when you have a community, i actually like the way he speaks. Unclench your ass and relax. You take things too seriously

+Li Feng Idiot.

+throw away Imagine having every single video spammed with this message. It's unnecessary, he's seen them a thousand times; cringe that people think they're being funny or original.

+Li Feng I don't think comments like this are meant to be rude, more like a fun observation. :)

Don't forget he's not a native English speaker and doesn't live in a native English-speaking country. I get annoyed by these "your English isn't as good as mine" comments. So how good is your Finnish, Hindi, Japanese, or Chinese?


In fact it's identical to an IIR filter, which has coefficients for both x and y, and your x coefficient is 1 and all your y coefficients are negated.
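As a sketch of the point above (LPC synthesis as an all-pole IIR filter where the x coefficient is 1 and the y coefficients are negated), here is a minimal illustration. The names `lpcSynthesize`, `excitation`, and `a` are my own, not identifiers from the video's source.

```cpp
// All-pole (IIR) synthesis view of LPC:
//   y[n] = x[n] - sum_{k=1..p} a[k] * y[n-k]
// i.e. the output is the excitation minus a weighted sum of past outputs.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> lpcSynthesize(const std::vector<double>& excitation,
                                  const std::vector<double>& a) // a[k+1] at index k
{
    std::vector<double> y(excitation.size(), 0.0);
    for (std::size_t n = 0; n < y.size(); ++n)
    {
        double acc = excitation[n];
        for (std::size_t k = 0; k < a.size(); ++k)
            if (n > k)                       // only past outputs exist
                acc -= a[k] * y[n - (k + 1)];
        y[n] = acc;
    }
    return y;
}
```

With a single coefficient a = {-0.5}, an impulse excitation decays geometrically (1, 0.5, 0.25, ...), which is exactly the recursive ("infinite impulse response") behavior the comment describes.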

This is interesting as a programmer, as someone who's trying to learn another language (old english, dead language sure, but fun), and as someone who asked you how to trill about a month ago haha. Still can't trill, but I'm on my way.


Bisqwit, I wonder why you are not posting videos on Bilibili more? I am looking forward to your next video on Bilibili. Come back, ok?

Sorry. It is a bit difficult to maintain communities in more than one place. I will try to upload a few of the recent videos some day.

I don't think that will work, because of all the excitation signal history in the bp[] array. Instantaneously changing the filter coefficients can lead to instability. One thing that might help, or might make it worse (I'm not sure) is to try implementing the transposed version where the bp[] array isn't just past output samples, but partially computed future samples. See the notes here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.lfilter.html

Can you speak clearly for fucks sake, which country are you from this voice is auto tuned as fuck and irritating it's in no way nice or good to hear, my ears are fucking bleeding ? Do you even hear yourself man.

Wow, aren’t you a beacon of positivity, open-mindedness and educatedness in cultures, languages and accents.

Thanks for posting!

+throw away Bruh Chinese have no sense of humor, don't even try to educate him that this was a joke, he can''t open his eyes to read anyways lmao.

+Li Feng Indian is not a language you dumbass it's Hindi and a bunch of other languages in India, it's not his English that's irritating, it's his desperate attempt to sound like a fucking robot that's so irritating and his fast pace.


+Master Bob I wasn't. No where did I say anything about all Muslims, the religion itself is horrible but you only need to look at it's history and attitudes towards anyone it doesn't agree with. I just said about the ones I'd met they weren't very nice and I don't really care about trying to meet a nice one.

fredhair I don’t think it’s fair to brand all members of a group based off a few, especially a group that large.

+Master Bob well it's not the nicest religion in the world.. the only Muslims I've met have been drug dealers getting rich off addicts and generally being rude by acting entitled and self righteous. Maybe I've just met a bad selection of them but I'm happy to not meet any more of them tbh.

Bisqwit seems a bit... obsessed with Islam

I always liked the implication of GSM using LPC, that technically you're not hearing someone's actual voice, but a reconstruction made of a buzzer with a filter and hisses and pops from filtered noise. So, you're actually listening to a speech synthesizer's reconstruction of the other person's voice! :-) :-)

In my opinion, you speak English almost as well as some people who speak it as their native language.

Thanks. Making these videos is good practice. It improves year by year.

How to stop prison radicalization

You have some seriously niche interests. I like that. :)

wtf does he use to code

An editor

Once again; another great video!

I'm wondering if the technology that is used to transfer data from Vinyl Record Albums into mp3 files would be of any assistance... Then just filter out the background music until you have pure voice. Then you can have a singing speech synthesizer.

As for the first sentence, I fail to see the relevance. As for the second sentence, what kind of solutions do you have for “filtering out background music”? Even on YouTube* it depends on correctly identifying the original recording (with or without lyrics) leaving only the added commentary and sound effects, and even then the resulting audio sounds quite hollow. *) YouTube has a tool that allows video creators remove a song that infringes copyright, when YouTube has first identified the infringement using ContentID. Often it results in simply muting that region of the video, but sometimes it successfully removes the song leaving only commentary.

Just had an idea: I will make my own speech synth. I wondered if there is some nice example that is low-level enough. And guess what, this guy had the same idea just in time to have it done now. Great job!


Bisqwit I'm surprised to receive your reply! You are my idol. Your videos are so cool!

Ah, yes. That is not coincidental. Interesting, nonetheless!

+Bisqwit I meant that your first recording (https://youtu.be/Jcymn3RGkF4?t=197) sounds like Tuvan throat singing.

Google translation: I seem to hear Humai.

+shaurz It seems weird to you because of a different background. Finnish folk did it like this for centuries.

music in background adds nice atmosphere to video as always x)

Make video about decompiler.

using c++ for something like this is pretty hardcore, i like it

Hello would you please do a tutorial series for beginners in developing?

+Bisqwit I hope soon! It's really hard finding good content on the internet, so I thought "hey, maybe I could ask Bisqwit", why not ^^

Probably, some of these years! When a good enough way to make it comes to my mind. I make my videos mostly through a creative process…

Hi bisqwit I don't know if you realize this but you are an inspiration for many of the viewers here, like a hero. So could you make a video about how you reached this insane level of skill, what your journey was like, and maybe some tips on how one can be as good as you ? Thanks for all the amazing content ^_^

+Bisqwit I'm not. I don't want you raping people's ears with your monotonic ass rancid voice. Hearing your voice makes me wanna punch you in the face tbh. Your basic ass voice + the cancerous background music = perfect recipe to scare sane human away.

bisqwit was the inspiration to write my own tools whenever i need one, Great video.

Are you still driving a bus?

No, I lost that job three years ago because of competitive tendering. The company had to lay off like 90% of their employees.

Are you using imgui?

Yes, I used it for the tool that is shown at 11:27.

I can't understand how you have so much knowledge, it's crazy.

can you decode "voice to skull technology?"

What's that?

Wow, sounds like you're very educated in sound technology. I wonder if you could figure this out. https://youtu.be/hNLszYLSThQ

+Bisqwit Thank you for your reply. I know what you mean about the mind trying to make sense of sounds, like a fan running. This is not EVP that I am talking about. It's too much and too complicated to comment about here. I am going to ask if you could do me a favor and google search voice to skull or microwave hearing; these are real devices with patents and are in use and have been used since they were developed in the 1970s. US patent # US6470214B1 and US patent 5159703. I was wondering if you could demodulate the electronic voices? There is a diagram of the device in patent 5159703. Thousands of people are being attacked with these devices, trying to find someone smart enough, who cares, who can reverse engineer one of these devices or demodulate the synthetic voices. You may think me mad, but google "targeted individuals V2k electronic harassment". Please investigate and make your own mind up. Much love and blessings.

Just the same phenomenon as seeing faces in near-dark scenes, or hearing words in backwards speech. The brain tries to make sense of the noise and it picks the closest interpretation. To most people, it may seem completely like noise until someone tells them what specifically to look for. Then, they begin hearing it; but if they are told something similar but different, they begin hearing that instead. In other words, it is completely, or almost completely, a psychological phenomenon.

13:15 accidentally used the whisper effect

Will you eventually upload Final Fantasy V :: Airship to bisqwit extra?

Good question! I probably should.


Super interesting article. Thanks!

11:01 Hey, it's imgui! Very nice to use

Nyan! =^_^=

I did something like this in the early nineties. I recorded my voice on my Mac SE, and wrote a hypercard stack to play the correct sounds together. It didn't translate English into phonemes, you had to write out your own phonemes, but that wasn't quite so unusual at that time. I also only made one recording per phoneme, because ain't nobody got time to record every possible phoneme pair


You used the speech synthesizer that you made to give the Tutorial!

Yes, I used it in the first few seconds of this video.

When you find a problem after you've played the audio, do you just think of a solution in real time and code it right there at that speed?

c++17 on dos

I'm guessing the "buzz" can't be a pure sine wave because a pure sine wave has no harmonics; it's a pure tone. In other words, there's nothing to filter out except for one single frequency.

why you don't use namespace std?

The standard namespace is not a demon to be vanquished with -a magic spell- boilerplate code. It has a purpose.

This is great. How come there aren't many walkthroughs of programming speech synthesis programs out there?

Probably not that popular a topic.

The constant clicks and pops are due to discontinuities at the frame boundaries. With an algorithm like this, they usually fix it using overlap-add. The gist of OLA is that your frames overlap and are weighted by a windowing function, then you sum them together where they overlap.
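The overlap-add (OLA) approach described above can be sketched like this. A minimal illustration under my own naming, not the video's code: overlapping frames are weighted by a Hann window and summed; with 50% overlap the periodic Hann windows sum to exactly 1, so a constant signal is reconstructed unchanged away from the edges.

```cpp
// Minimal overlap-add: each frame is Hann-windowed and summed into the
// output at offset f * hop. With hop = N/2 the periodic Hann window
// satisfies the constant-overlap-add (COLA) property.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> overlapAdd(const std::vector<std::vector<double>>& frames,
                               std::size_t hop)
{
    if (frames.empty()) return {};
    const std::size_t N  = frames[0].size();
    const double      pi = std::acos(-1.0);
    std::vector<double> out((frames.size() - 1) * hop + N, 0.0);
    for (std::size_t f = 0; f < frames.size(); ++f)
        for (std::size_t i = 0; i < N; ++i)
        {
            // Periodic Hann window weight for sample i of the frame.
            double w = 0.5 - 0.5 * std::cos(2.0 * pi * double(i) / double(N));
            out[f * hop + i] += w * frames[f][i];
        }
    return out;
}
```

Because every output sample is a smooth blend of two windowed frames, the frame-boundary discontinuities that cause the clicks never appear in the summed signal.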

The input signal can't be a pure sine wave because: 1.) The vocal cords don't emit pure sine waves; they emit something more like a buzz. 2.) A pure sine wave would be almost unaffected by the LPC filters because it's a single frequency. A buzz is extremely rich in harmonics, and the human ear keys off the presence or absence of those harmonics in determining what was said. That's why if you look at voice data in a spectrogram, you tend to see lots of streaks that move together or widen/shrink based on what's being said. In a sort of philosophical explanation, the input signal is "sampling" your LPC filters. A single sine wave would result in sampling just a single data point. You need a lot of sine waves to get enough of a picture of the LPC filter to see what it looks like, which is what your brain is keying on to make sense of your words. Think of it kind of like an image. The sine waves are the pixels that you're building a picture of the LPC filter with. A single sine wave is like a single pixel; it doesn't tell you much. A buzz is loaded with lots of sine waves, so analogously it's loaded with a lot of pixels, so it can give you a better picture of the LPC filter, and thus a better picture of the formant it represents.

Great explanation! Not an ELI5 though :-) But I would have settled for that.
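The harmonic-richness point in the thread above can be checked numerically with a naive DFT: a pure sine concentrates its energy in one bin, while an impulse train (a crude "buzz") has energy at many harmonics for the LPC filter to shape. The function name and threshold below are illustrative choices of mine, not from the video.

```cpp
// Counts DFT bins 1..N/2-1 whose magnitude exceeds a threshold, as a rough
// measure of how many frequency components a signal contains.
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

int countStrongBins(const std::vector<double>& x, double threshold)
{
    const std::size_t N  = x.size();
    const double      pi = std::acos(-1.0);
    int count = 0;
    for (std::size_t k = 1; k < N / 2; ++k)
    {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n)
            acc += x[n] * std::polar(1.0, -2.0 * pi * double(k * n) / double(N));
        if (std::abs(acc) > threshold) ++count;
    }
    return count;
}
```

For a 64-sample buffer, a sine at bin 4 lights up exactly one bin, while an impulse train with period 8 lights up every 8th bin: those extra harmonics are what the filter can carve formants out of.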

Finally! A Bisqwit Vocaloid :3

I can't speak for anyone else, but I was glad when this was Finnished.

Did I hear a turret say "Weeee" when he said "thumbs up the video"?


Being a Finn I am not very good with consonant clusters. Luckily there’s captions on the video so you can understand!

Now we need to record the speech synth speaking and use that to make another synth

You are a Coding God. Was it hard to teach Bjarne Stroustrup C++? ;-)

Are you going to make this speech synthesizer a TTS voice for Windows?
