From Google Assistant to Cloud, speech is integral to every aspect of Google's product line, which makes it a vital development area for planning future devices. This episode of the Made by Google Podcast looks at how that technology works.
It's something many of us now see as the norm: when you talk to your smartphone, speaker, or display, it talks back to you. But what's going on behind the scenes when you ask your device for the weather? Google trains its speech models much like you'd teach a toddler new words, by showing them many examples.
However, toddlers typically only learn one or two languages at a time. Google's technology is designed to understand and respond to any user in the world, whether they're speaking a different language or are just caught in a noisy environment.
The Recorder app has long been a favorite of journalists, who use it to record and transcribe interviews.
This technology is also used to fuel features like Call Screen, Live Caption, and voice message transcription.
Tune in to the full conversation with Nino Tasca, director of product management for Google Speech, below.
Transcript
[00:00:01] Rachid Finge Hey and welcome back to the Made by Google Podcast. I'm your host Rachid Finge, and I can't believe this is episode 5 already. We've covered so much ground in the world of Google hardware and we have much more to come, so don't forget to subscribe to the podcast. Today we're talking about talking. Many Google products don't need you to type stuff, actually. Some don't even support typing; you use your voice instead. And some products talk back to you, too, most notably, of course, the Google Assistant. If you've ever wondered what's needed to make computers and phones understand human speech, or to make them sound like a human voice, this is the episode for you, because our guest is the director of product management for Google Speech, Nino Tasca. Nino, welcome to the Made by Google Podcast. It's great to have you.
[00:00:59] Nino Tasca Thank you for having me. It's great to be here.
[00:01:01] Rachid Finge So you're a director of product management for Google Speech. What do you tell friends and family about what it is you do at work?
[00:01:09] Nino Tasca Yes, I work on the Google Speech team, which is really an amazing opportunity because at Google, speech is literally in every product that we produce. And you all know how many products are out there with Google, everything from Search to the Google Assistant, YouTube, Cloud, and many more. And what we're seeing is that as speech gets better at recognizing human audio and also at synthesizing speech from text, what we call TTS, there are just many more product opportunities. So it's been a really fun ride to watch the speech models get better, the technology improve over time, and more product possibilities emerge.
[Meet the Googler]
[00:01:53] Rachid Finge Today's guest works on things that are easy to take for granted, like talking to your smart speaker and having it talk back to you. As the director of product management for Google Speech, Nino Tasca helps many of our products come to life. Without Google Speech, the Google Assistant simply wouldn't be, and the same goes for the Recorder app or Android features like Live Caption. Nino will tell you that working on speech technology at Google is exhilarating because there are so many possibilities to unlock. Each step forward that Nino's team achieves makes computers easier to use. Speech recognition at Google is getting so good that Nino has replaced a lot of his typing with speaking. Find out much more about Google Speech and what it means for Google's devices and services in this episode of the Made by Google Podcast. So you're on a team that makes sure the computer understands what I say on one side, but on the other side, if I'm, for example, using the Google Assistant, your team also makes sure that the Google Assistant actually has a way of saying something that I can hear. Is that right?
[00:03:03] Nino Tasca Exactly. So we think of it as two sides. On the one hand, there's what we call ASR, or automatic speech recognition, and sometimes in the outside world it's called STT, or speech to text. That literally takes speech or audio files, runs them through a speech model, and then turns that audio into words. So you can think of a Google Assistant query, "What's the weather today?" An audio file goes in, and out come the words "What's the weather today?" Then on the flip side of that, after we determine what the weather actually is through the Google servers, we actually synthesize the answer and say it's sunny and 72 today. That response comes through the TTS system, for text to speech. So speech to text on the input and text to speech on the output.
[00:03:52] Rachid Finge As you said, there are so many products that use speech in one way or another at Google. I'm just wondering, what's been the most fun to work on in your team?
[00:04:00] Nino Tasca Yeah, it's a great question. Definitely the Google Assistant. We are closely embedded with the Google Assistant because it brings speech to the forefront. In many of the other products, speech is still a critical, critical aspect, uncovering many use cases that were simply not possible before, and we can go into those later. But with the Google Assistant, speech really is at the forefront, and it's been really fun to see how we can, you know, build this product that's voice first and really enable use cases that make things easier for users throughout the day, whether they're in their house, in the car, or on the go.
[00:04:38] Rachid Finge Speech recognition has been something that scientists, I guess, or even movie writers have been after for decades and decades, and it's become so good over maybe the past decade. So without asking for a whole lecture, could you explain what happens, well, let's say when I ask for the weather? What is happening behind the scenes when I ask that question?
[00:04:59] Nino Tasca So you're right, speech has been a great scientific problem for many decades. As we talked about earlier, we take an audio file in, and that comes in as audio, right? And so what's happening is we actually have deep neural networks that build these machine learning, or speech to text, models, and they analyze the audio and determine the words you said. So we actually understand "what's the weather." Going beyond the speech team, we then have a natural language processing team that tries to take the content out of that. So the actual words you said are "What's the weather?" but we now have to translate that into computer code, which basically says: weather, and maybe the area you're asking about the weather for. If you're on your phone in, let's say, Cleveland, Ohio, it knows that you want the weather where you are located. So it takes that, sends it to our internal servers, gets the weather from our system, and then comes back. We actually have a generative text response engine that formulates the way to say "72 and sunny" in a way that's pleasing to users, and we switch it up so you're not always hearing the exact same words and phrases. On the output we get that sentence, let's say "It's 72 and sunny," and then we actually have to synthesize the text. That's our TTS engine, which takes those words, understands the right prosody, the right accent, the right pace of speaking, and outputs it in a voice that you can select.
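To make that round trip concrete, here is a minimal sketch in Python of the flow Nino describes: ASR, natural language processing, fulfillment, response generation, and TTS. Every function name here is a hypothetical stand-in, not a Google API; it only illustrates the shape of the audio-in, audio-out pipeline.

```python
# Toy sketch of the voice-query round trip. All functions are hypothetical stand-ins.

def speech_to_text(audio: bytes) -> str:
    # Stand-in ASR: a real system runs the audio through a neural speech model.
    return "what's the weather"

def parse_intent(text: str) -> dict:
    # Stand-in natural language processing: words -> structured intent.
    return {"intent": "weather"} if "weather" in text else {"intent": "unknown"}

def fetch_weather(location: str) -> dict:
    # Stand-in fulfillment: a real system would call a weather backend.
    return {"condition": "sunny", "temp_f": 72, "location": location}

def generate_response(result: dict) -> str:
    # Stand-in response generation: the real product varies its phrasing.
    return f"It's {result['condition']} and {result['temp_f']} in {result['location']} today."

def text_to_speech(reply: str) -> bytes:
    # Stand-in TTS: a real system synthesizes audio with prosody and pacing.
    return reply.encode("utf-8")

def handle_voice_query(audio: bytes, location: str) -> bytes:
    text = speech_to_text(audio)          # speech to text on the input
    intent = parse_intent(text)
    if intent["intent"] != "weather":
        return text_to_speech("Sorry, I didn't catch that.")
    reply = generate_response(fetch_weather(location))
    return text_to_speech(reply)          # text to speech on the output

print(handle_voice_query(b"<raw audio bytes>", "Cleveland, Ohio"))
```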
[00:06:35] Rachid Finge So what if we back up and go to that speech to text situation? I think everyone has seen waveforms, right? That's a way to represent what audio looks like, with these waves going up and down. So how does your team, or how does the system, understand that a certain waveform is maybe the word "weather"? How do you teach your computer that?
[00:06:56] Nino Tasca It's funny, because when I joined Google and I was learning about machine learning, I was also raising my daughter at the same time, whose age happened to match my tenure at Google. Right. And it's a really similar pattern, because what happened many years ago is we'd actually try to break speech up into, you know, a couple of milliseconds of each sound of a word and say this sounds like a "K," and staple them together. And that system is just very, very limited, and the gains have a really low ceiling. But with deep neural networks, we actually train these models like we train a kid. So if you think about how you teach your kid what certain words are, you show them a picture of a dog and you say "dog" or "ball." You don't say a ball is a round spherical object that can be orange for a basketball or colored for a beach ball. You just show them many, many examples, and over time the brain just gets it. That's actually why we call the computer models neural networks, because they're modeled after the human brain. And so speech is very similar. We take all these audio samples, and they're just waveforms, like you said, many people have seen them on a computer before. Then we'll take them and we'll annotate a few of them and say, okay, this audio says "What's the weather," this audio says "What's your name." This audio could even be longer; it could be, like, the captioning for a YouTube video. So we take different audio, different use cases, and feed it into the machine learning model, and the machine learning model recognizes patterns. So when certain audio patterns look like other ones that we know to be "weather in Cleveland, Ohio," then we can actually build that out, extract it, and understand how any input, even input we haven't seen before, can be translated into text.
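As a rough illustration of the "learn from labeled examples" idea, here is a toy transcriber that just matches a new waveform to the most similar annotated one. Real ASR trains deep neural networks on enormous datasets rather than doing nearest-neighbor lookup, and the arrays below are random stand-ins, not real recordings.

```python
# Toy "learn from labeled examples" transcriber: pick the annotated training
# waveform most similar to the new audio. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
labeled_audio = {                      # annotated examples: transcript -> waveform
    "what's the weather": rng.normal(size=160),
    "what's your name": rng.normal(size=160),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def transcribe(waveform: np.ndarray) -> str:
    # Return the label whose training waveform looks most like the input.
    return max(labeled_audio, key=lambda label: cosine(labeled_audio[label], waveform))

# A "new" utterance: the weather example plus a little noise.
noisy = labeled_audio["what's the weather"] + 0.1 * rng.normal(size=160)
print(transcribe(noisy))  # -> "what's the weather"
```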
[00:08:46] Rachid Finge And I guess it gets a little bit more complicated in practice, because people want to use the Google Assistant maybe when they're in a busy train station. And, I don't think I'm going to surprise our listeners here, I'm not a native speaker of English. I suppose for a speech team that throws in an additional challenge, to make sure that people like me can also use the Assistant in English. So how do you deal with that?
[00:09:09] Nino Tasca That's a great question. So our mission on the speech team is actually to solve speech for all users everywhere. And you bring up two great use cases. One is making sure that everyone anywhere in the world can use speech as well as I can, right? And that's true whether you're speaking your native language, which could be English or a language with fewer speakers in the world, or you're someone like yourself who wasn't born in a certain country but speaks that language, maybe with an accent, some stronger than others. And it can also be in difficult environments, from a train station where you want to find out when the next train is, to the car with the air conditioning on and the radio blasting. And so there's all sorts of technology that sort of shoulders the core speech to text system. We have noise cancellation systems in place to make sure that, for example, when you're in the car and we know that you're actually trying to issue a query or a request to the Google Assistant, we can cancel some of the background noise. And going back to accented speakers, we do make sure we build our speech models with a wide range of speakers. One of the things we actually find is that getting more realistic use cases makes for better products and better output. So if we just asked 100 people to say "What's the weather," we know that's not going to build a great, great speech model, because it's forced. When someone tells you to read something, even if you try to do it as naturally as possible, you're not going to say it the way people actually use it when, you know, their two kids are in the background eating breakfast and you're running around trying to figure out if you need your rain jacket today. Those are real life examples, and that's what the Google Assistant needs to be there for, those real life examples. And so the speech team and all of the teams at Google work hard to make sure our products work when you need them most, not just in perfect conditions.
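One generic way to get that "real life" robustness is to train on audio with background noise mixed in at controlled levels. The sketch below shows that data-augmentation idea with synthetic signals; it is an assumption-level illustration, not Google's noise-cancellation system.

```python
# Data augmentation sketch: mix background noise into clean training audio
# at a chosen signal-to-noise ratio so the model sees realistic conditions.
# Both signals here are synthetic stand-ins.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `clean`."""
    clean_power = float(np.mean(clean ** 2))
    noise_power = float(np.mean(noise ** 2))
    target_noise_power = clean_power / (10 ** (snr_db / 10.0))
    return clean + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(1)
clean_utterance = np.sin(np.linspace(0, 40 * np.pi, 16000))  # stand-in for one second of speech
station_noise = rng.normal(size=16000)                        # stand-in for train-station noise
noisy_utterance = mix_at_snr(clean_utterance, station_noise, snr_db=5.0)
print(noisy_utterance.shape)  # (16000,)
```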
[00:11:05] Rachid Finge All right. So now we know how speech to text works. We now know that there's a server, probably not your team, that actually figures out what the weather is. And then we get back to your team, which needs to give the Assistant a voice. So how do you make a computer speak? Again, that could be a whole lecture, I'm sure, but I'm still wondering, what are the basics there?
[00:11:23] Nino Tasca Yeah, it's funny because it is the same process of deep neural networks being trained. For the Google Assistant in English, we actually have ten different voices. Each of those voices was modeled after a voice actor, which is a real profession, right? You obviously know this from cartoons or commercial voiceovers, where people go into very high quality studios and say or utter certain words and phrases. For the Google Assistant and, you know, many of the other properties out there, it's a similar process, where we look for actors that have a certain range of characteristics and we ask them to go into the studio and, you know, say the same kinds of words that the Google Assistant would say, or Maps, for example: "Turn left at the next stoplight." Right. And in a very similar way to what we were talking about before, we don't ask them to say every possible response the Google Assistant could give; that, of course, is impossible. But we give them a wide range of text, and then we can feed that into our deep neural networks and output these TTS voices. So we get the core model, which can take any voice and make it into a synthetic voice. But then we can actually bring some life to the Google Assistant by allowing users to choose a voice that best resonates with them. And it's great, too, because we see that, you know, different users and different groups like to have voices that sound like them or are pleasing to their ear. So we find that giving a good choice of voices actually creates a better user experience, especially for products like the Google Assistant, where you might use it multiple times per day or multiple times per week, and so you're building up a relationship with this voice.
[00:13:04] Rachid Finge Cool. So something I'm wondering after speaking to Monika Gupta in a previous episode: she works on the Tensor chip, which makes much of our machine learning available on a Pixel device, and I think that helps create something called Assistant voice typing, which you're of course familiar with; it might be one of your favorite products. So could you explain to us, what is Assistant voice typing? And how is it different from maybe all the other speech products and features that we've had in the years before?
[00:13:33] Nino Tasca Yeah, it's great. So yes, you are correct, Assistant voice typing is one of my favorite products and one that, you know, our team works on. So let me give you a little bit of history. Dictation has been a feature within Gboard and other email apps on the phone for many years now, you know, probably over a decade, and it's always worked, you know, okay, and solved certain use cases. One of the biggest innovations for me, just as a user, really turned it into what I call a 0 to 1 product. When it was first working, there was a little bit of lag: you would say a word and it would take a few seconds, or a few milliseconds, excuse me, until the words appeared on the screen. Right. What we did, though, was invest pretty heavily to make sure that our speech models, the ones we were just talking about, can fit on device. And that's a combination of making the model smaller and also better hardware. So now we actually have high quality speech models that can live on your phone. And this is where it turns into a 0 to 1 product, because all the latency, all the lag goes away. As you're speaking, you can actually see the words appear, and it's really a very interactive product. One of the funny things is that in certain use cases the words can appear even faster than you say them, because we predict what you're going to say based on the model. So it's one of these little funny twists the models can do: they can predict what you're going to say if you're halfway through a word. So that's been around for a couple of years as well in speech. But with Pixel, starting with Pixel 6 and then, you know, doubling down on the effort with Pixel 7, we have these TPU chips, which I believe Monika talked to you about. Yeah. They really create these super powerful models, actually models that are better and higher quality than we can run on the server. And as a product team and an engineering team, we came together and figured out how we could actually make this a differentiator. So we invested heavily in making sure that our speech models were optimized for the chip: everything from high quality, to low latency, to low power drain, all of those things. Then we said, okay, what are the real product possibilities? And we knew that with voice dictation there were still many rough edges from the old model. You still had to use your hands a lot, for everything from sending, to typing certain words or emojis, to the To field and the subject field. Voice was still very much only part of the experience. And so with Assistant voice typing, we wanted to really make it a full voice forward experience where you don't need your hands at all. And so that's the model we went with. And it's funny that people think speech is just speech to text and getting the words right. There's a lot of other things that go into it. Punctuation matters, right? Making sure you're spelling and pronouncing the names of loved ones correctly. So for example, I have a dog named Biscoff, like the cookie. Biscoff is not a common word. So the first time I spoke that in Assistant voice typing it was recognized as "disc golf," you know, the sport you can play in the park. And one of the features we added with Pixel 7 was this personalization.
So now, as I tell the Google Assistant and correct "disc golf" into "Biscoff," for every other interaction, every time I'm, you know, texting my wife via voice, "Hey, I'm going to the park to walk Biscoff," it gets it right. And those things matter, right? That makes it a usable product, because it's as good as, and sometimes better than, what you can do if you are using your fingers and your thumbs, which I know kids these days are super fast with. Voice always beats text, right, 100% of the time.
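That Biscoff correction can be pictured as a simple rescoring step: candidate transcripts that contain words from the user's personal vocabulary get a boost before the final result is chosen. The scores, boost value, and vocabulary below are invented for illustration; the actual on-device personalization is certainly more sophisticated.

```python
# Toy personalization by biasing: boost candidate transcripts that contain
# words from the user's personal vocabulary (pet names, contacts).
# Scores and the boost value are made up for illustration.

personal_vocab = {"biscoff"}  # e.g. learned after the user corrected "disc golf"

def pick_hypothesis(hypotheses: dict[str, float], boost: float = 2.0) -> str:
    """`hypotheses` maps candidate transcripts to model scores (higher is better)."""
    def biased_score(text: str) -> float:
        bonus = boost * sum(word in personal_vocab for word in text.lower().split())
        return hypotheses[text] + bonus
    return max(hypotheses, key=biased_score)

candidates = {
    "going to the park to walk disc golf": 4.1,  # the generic model's slight favorite
    "going to the park to walk biscoff": 3.6,
}
print(pick_hypothesis(candidates))  # the Biscoff transcript wins after biasing
```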
[00:17:18] Rachid Finge And that might also be the challenge: when it comes to voice, people somehow have super high expectations that you need to live up to. So I guess that sort of personalization, where it gets the names of loved ones right, is very important to your team.
[00:17:33] Nino Tasca Oh, absolutely. Yeah.
[Made by Numbers]
[00:17:35] Rachid Finge So, Nino, we have this section called Made by Numbers in the Made by Google Podcast, where we ask our guest for a number that is either important to them or to their work. We've had very large numbers and we've had smaller numbers. I'm just wondering, what is the number for Made by Numbers that you brought to this episode?
[00:17:51] Nino Tasca Sure, I'll go with a small number this time, and hopefully smaller over time, which is 4%. So, 4%. The way we measure speech quality primarily is called word error rate, or WER for short. The way word error rate works is, if, let's say, a user says 100 words, what percentage do we get right and what percentage do we get wrong? And for certain use cases on the Google Assistant, especially Assistant voice typing, we can get word error rate down to 4%, which is basically as good as humans can do. And it's really important, because as we've seen throughout Google, as we build higher quality speech models and get this word error rate down, more product possibilities exist. So we've seen all types of products emerge over the last couple of years, once speech quality has risen to the point where we can basically understand the vast majority of words that a user is saying.
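Word error rate has a standard definition: the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. A small self-contained sketch of that calculation, using a plain edit distance over word sequences:

```python
# Word error rate (WER) = (substitutions + deletions + insertions) / reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of 25 reference words -> 4% WER.
ref = " ".join(["word"] * 24 + ["biscoff"])
hyp = " ".join(["word"] * 24 + ["discgolf"])
print(f"{word_error_rate(ref, hyp):.0%}")  # 4%
```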
[00:18:51] Rachid Finge So if it's at 4%, I'm just wondering, is there sort of a class of things that the system gets wrong, or what's the main reason why it's 4% or not 2%, for example?
[00:19:01] Nino Tasca So it's a great question, right? And obviously, each additional percent gets harder and harder to achieve. So, yeah, there's a couple of things. One is there are, you know, many words in all these languages, and so it gets difficult to understand them all. But more importantly, I think it's different environments. Sometimes, in a very clean environment, if a user's talking slowly, we can get pretty, pretty close to zero. If there's a noisy environment, sometimes that obfuscates certain words. Sometimes users don't even speak that clearly; I speak rather fast and I don't articulate as well. Of course, it's not the user's fault, it's our fault, but it makes it a harder challenge. Sometimes users have accents or have other difficulty speaking, and it can get harder and harder to understand each individual word. But even for users of, let's say, our U.S. English model, which is probably our highest quality one, users that were born and raised here, sometimes they have unique words that just don't match up. It could be a contact name, it could be a street name, it could be, you know, the name of a loved one. And those just become extremely hard because they're not used every day, or our models might not have seen them before. So within that 4% there could be a lot of use cases, and we're working on all of those. Some of them, like we mentioned, like getting contact names right, are really, really important.
[00:20:31] Rachid Finge So in order to get the 4% down, is it the same as what you mentioned at the beginning? We just give the system more and more examples to understand everything better.
[00:20:39] Nino Tasca That's not the direction we're going in. We're actually going in the opposite direction, toward what's called semi-supervised learning, so we don't have to annotate as many examples and the deep neural networks can actually just get smarter over time and learn from audio files that have not actually been annotated. And so there are many different research efforts underway to get that number down. Some of it is pure research: how can we just make the models faster, better, more efficient? And some are integration points. For example, with Assistant voice typing, we talked about what your most common contacts are. Are there ways to bias toward those words? Are there ways to personalize your model to understand certain words better? Going back to the noise in the background, are there more effective ways to hone in on your voice and your voice only and cut out the background noise? So there are many different efforts involved, all working in parallel, trying to hone in on and improve these models.
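One common flavor of semi-supervised learning is pseudo-labeling: an existing model transcribes unannotated audio, only the confident transcripts are kept, and those become extra training data. The sketch below is a generic illustration of that loop with made-up names and thresholds, not a description of Google's research pipeline.

```python
# Minimal pseudo-labeling sketch: let an existing model label unannotated
# audio, and keep only the labels it is confident about as new training data.

def pseudo_label(unlabeled_audio, model, confidence_threshold=0.9):
    """Return (audio, transcript) pairs the current model is confident about."""
    new_examples = []
    for audio in unlabeled_audio:
        transcript, confidence = model(audio)
        if confidence >= confidence_threshold:
            new_examples.append((audio, transcript))
    return new_examples

# Toy "model": pretend it recognizes short clips confidently and long ones poorly.
def toy_model(audio):
    return ("what's the weather", 0.95 if len(audio) < 5 else 0.4)

unlabeled = [[0.1, 0.2, 0.3], [0.0] * 10]
print(pseudo_label(unlabeled, toy_model))  # only the short, confident clip survives
```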
[00:21:42] Rachid Finge I think it's so interesting that so many fields come together to solve this problem, and that it's not just teaching the model like we teach kids how to speak. So, you know, I wanted to get to my favorite product, which is Recorder, and I can totally see how the speech model is used to transcribe what's being said. But I think later in the year we're adding a new feature where the Recorder app will be able to distinguish between people, right? So it can say person one said this, and then person two said that. What was required to distinguish between voices? Because that seems like a next level of speech recognition to me.
[00:22:17] Nino Tasca Yeah, this is another great example of us listening to our users, because the Recorder app has been around for a few iterations of Pixel, and one of the use cases we saw was that a lot of journalists were actually using it; it's one of their favorite products. We were getting the feedback that they were using it to record and transcribe interviews, but one of the challenges was they had to go back in and, you know, separate the voices. So we knew this was actually a very important problem to solve for some of the most important users, who really love this product and are using it, you know, for critical parts of their lives. So in order to detect multiple voices, what the Recorder app does is analyze the audio, and we can detect the different voices that are speaking. Throughout the transcript we give them different labels, you know, Speaker 1, Speaker 2. And as the audio continues, we actually compare it against previous segments of Speaker 1 or Speaker 2 and determine whether it's likely that one of those speakers is talking. Now, what's important: privacy is of the utmost importance here. So the models are temporarily stored on device and totally deleted after the session is over. But it provides a really powerful use case. For example, for a journalist doing transcription of an interview, they know what they said, they know what their interviewee said, and they can go back and easily finalize the transcript, you know, in post-processing.
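A toy version of that labeling logic: represent each audio segment by a voice embedding, compare it to the speakers seen so far, and either reuse an existing "Speaker N" label or start a new one. Real diarization uses learned speaker embeddings and runs on device; the vectors and threshold here are illustrative assumptions only.

```python
# Toy speaker labeling: reuse a "Speaker N" label when a segment's embedding
# is similar enough to a previously seen speaker, otherwise add a new speaker.
import numpy as np

def label_segments(embeddings: list[np.ndarray], threshold: float = 0.8) -> list[str]:
    speakers: list[np.ndarray] = []   # one representative embedding per speaker
    labels = []
    for emb in embeddings:
        sims = [float(np.dot(emb, s) / (np.linalg.norm(emb) * np.linalg.norm(s)))
                for s in speakers]
        if sims and max(sims) >= threshold:
            idx = int(np.argmax(sims))        # matches a known speaker
        else:
            speakers.append(emb)              # new voice: start Speaker N+1
            idx = len(speakers) - 1
        labels.append(f"Speaker {idx + 1}")
    return labels

segments = [np.array([1.0, 0.1]), np.array([0.1, 1.0]), np.array([0.9, 0.2])]
print(label_segments(segments))  # ['Speaker 1', 'Speaker 2', 'Speaker 1']
```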
[00:23:41] Rachid Finge I think that's what people always love, right? You turn on airplane mode and it still records, as proof that everything works offline. I guess that also helps some of the Assistant features come to life. Like with something like call screening, I guess it's basically the Assistant on the phone answering the phone, right? It doesn't need a server to do anything.
[00:24:03] Nino Tasca I know, it's great you brought that up, and that's one of my favorite outcomes of working on the speech team. We've spent a lot of time from a speech perspective, primarily leading with the Assistant, to get speech models on device. And it's great working at a company like Google, because we've seen all this demand internally, and many different teams have now improved their products to take advantage of these on-device speech models. And we've worked closely with them, once again, to understand what their exact user needs are, what the product's needs are, and how we can actually improve our speech models to make sure their product's better. So I'm just going to list off a few. You have Call Screen with the Assistant, which is great: somebody calls you, and once again you're busy, and it might be a number you don't recognize. That audio from the caller is actually handled 100% on device; we're not sending the audio coming in from a phone call to the servers. And that allows you, the user, to make a decision about whether you want to take that call or not. Let's take Live Caption as another example. With the on-device model, whether you're in airplane mode or not, for any video on your phone you can actually see the captions coming through. Great use case, especially if you're in a place where you can't listen to the audio or you're hearing impaired. Another use case is transcribing messages; that's the one I like. Yeah, it's a great feature. So audio voice messages are still very popular, right? We thought those went away with texting, but they're actually coming back. And once again, there are many use cases where users are in an environment where they simply can't listen to the audio file. So the on-device speech to text model can actually transcribe the audio of the message and display it to you. It never needs to go to the server.
[00:25:47] Rachid Finge Nino, I always ask all of our guests what's coming in the future, and of course we cannot talk about future roadmaps. I'm just wondering, you know, if you're a speech scientist or a product manager who has worked in this field for so long, I guess you want to work on dropping the error rate. But what else is there to conquer as a speech team?
[00:26:06] Nino Tasca Yeah. You know, the opportunities are endless just because there are so many product possibilities. I'll name a couple. One of the things that I'm really passionate about is having a personalized model for every user on their device. We talked about the big investments to get speech on device. We've talked a bit about personalization, especially in the use cases for Assistant voice typing. But we actually want to make personalized models that understand how you speak: your pace of speech, the words you say most commonly, accents, et cetera, and actually make a model that can be tuned on the fly, so that each time you talk, and the more you talk, it gets better and better for you. We have a great example of personalization in use today with Project Relate, which allows users with speech difficulties to train their own personalized models by saying several different utterances and phrases into their phone. The model gets built and personalized 100% for them, and it enables and opens up all kinds of use cases that these users might not have access to today: everything from using the Google Assistant, to taking notes, to even having a repeat function where, let's say, you're in line at Starbucks and the cashier can't understand you, you can say your order to the phone and it can repeat it back in our text to speech output. So I can imagine the Project Relate technology being expanded to all users in the future. And then, with making a personalized model, it's not just about your speech. Another thing we're looking at is multilingual models. A very large percentage of the world actually speaks more than one language, and we want speech to be there for you in any use case you need. So going back to Assistant voice typing: people that, you know, speak multiple languages might speak to one group of friends in, let's say, English and another group of friends in French, and they shouldn't have to think about or, you know, keep track of which speech model to use. They should just be able to talk freely, and we want to build multilingual models that can understand them naturally as they go. And the final point would be natural conversations in general. This is a big project we're working on for the Google Assistant: making sure that we actually have a better way to have natural back and forth conversations. Today, the Google Assistant is still very much in a turn-taking approach, where you have to say, "Hey Google, what's the weather?" and wait for it to come back. But we realize that's not how humans have conversations. There are pauses. You understand each other. You just go, "Mm hmm." Right? There's back and forth with back-channeling. There's interrupting sometimes, and, you know, it can be polite; it's just the way conversations go. And we want to make sure the Assistant is usable just like you're talking to your friend. You don't need instructions, you don't need to do turn-taking in order to make it work. You can just talk, you can have a natural pause, you can get some feedback from the Google Assistant, you can interrupt it, and no instructions are needed to use the Google Assistant. So those three things are things that I'm really, really excited about, and hopefully you'll see them in future product updates.
[Top Tips for the Road]
[00:29:11] Rachid Finge So, you know, finally, we always have Top Tips for the Road, where we ask our guests for their top tips for our listeners to the Made by Google Podcast. These could be tips, in your case, about speech, or maybe it's life advice, I don't know. What are the top tips for the road? What should we take away?
[00:29:33] Nino Tasca Sure. So I'll give you one piece of work advice I find really powerful, and then one piece of speech advice. One of the things I think is really important is schedule send, which you can see in email nowadays. As you know, since the pandemic our work and home lives are more intertwined. A lot of people have setups, you know, with their desks right there in their house, and it can feel like it never stops. Right. And I think, you know, being a responsible leader or a responsible teammate means making sure that if you decide to work at night or on the weekend, you're not putting pressure on others. So what I do is, if I decide to work on the weekend, I make sure I schedule send to Monday morning, so I'm not putting pressure on others. It's a great feature in Gmail, and I just encourage everyone to use it to be more thoughtful about how we work together. So that's a general tip.
[00:30:25] Rachid Finge Great advice. Absolutely.
[00:30:26] Nino Tasca But then, sticking with email and messaging, try voice dictation. Right. I think many users have tried it in the past, but maybe you had an error, maybe you had one use case which didn't work. There's just a ton of innovation going on there, a ton of effort. We actually have data that shows it's 2.5 times faster than using your fingers to type. And so if you had a bad experience, give it another shot. If you have a bad experience again, give it another shot again. The teams are constantly working on it, and now, as a user myself who uses it almost exclusively when I'm able to, it's such a powerful, powerful tool that I hope all users try it, keep testing it out, and give us feedback. Right? The best way we can build products is to get feedback from users who are actually using the product.
[00:31:14] Rachid Finge Noted. I use it myself as well and love using it. And for everyone listening, to go back to Project Relate: please just Google "Project Relate." It's a beautiful thing, and it just shows how your team, Nino, helps a lot of people have a better opportunity to communicate with others in this world. So thank you a lot for that, and also thanks so much for joining the Made by Google Podcast. It was great talking to you.
[00:31:36] Nino Tasca My pleasure. Thank you.
[00:31:39] Rachid Finge Well, I can't tell you often enough: do check out Project Relate when you have a chance. It really shows a powerful side of technology and how it helps make it easier for anyone to be a full part of society. And thank you to Nino for your time today. It's been great to find out how we make our devices able to understand what you say and even talk back to you. Join us next week for another episode of the Made by Google Podcast, where we'll talk about fitness, so I'd better get ready and in shape. And meanwhile, subscribe to the podcast so you'll have the latest episode on your device when it appears each Thursday. Take care, stay healthy, and talk to you next week.