What is speech recognition?

The first thing to know about speech recognition is that it is an input method. It is a way for people to interact with computers, similar to other common input methods such as the mouse, the keyboard, and the phone touchscreen. The difference is that instead of using your hands, you use your voice to interact with the computer system.

When you press a button on a computer, that click triggers a response based on the computer's programming. In the same way, in speech recognition the computer recognizes the words you say and responds in the way it has been programmed.
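To make the comparison with a button press concrete, here is a minimal sketch of how recognized words could be mapped to programmed responses. The phrases and handler functions are hypothetical, made up only for this illustration, and the sketch assumes the speech has already been turned into text.

```python
# Hypothetical handlers: these stand in for whatever the system is programmed to do.
def open_mail():
    return "opening mail"

def play_music():
    return "playing music"

# Hypothetical command table that maps a recognized phrase to a handler.
COMMANDS = {
    "open mail": open_mail,
    "play music": play_music,
}

def respond(recognized_text):
    """Look up the recognized phrase and run the handler bound to it."""
    handler = COMMANDS.get(recognized_text.strip().lower())
    if handler is None:
        return "sorry, I did not understand that"
    return handler()

print(respond("Play music"))
```

The lookup plays the same role as a click handler: the input differs, but the response is still whatever was programmed for it.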

Speech recognition vs. voice recognition

These terms are often used as synonyms. But within the speech recognition field and in academia, among scientists, linguists, and computer scientists, there is a big difference between them. Speech recognition means that computers can understand the words you speak. The computer translates the sounds of your voice into words it already knows.

In that case, what is voice recognition? Voice recognition is the process of identifying speakers based on their voices and speech styles.

Each of us speaks differently. Just as there are dialects in language, there are certain features in speech that are specific to you. That's why your mom sounds different on the phone than your favorite TV talk show host. A voice is like a fingerprint that is specific to an individual. Voice recognition technology allows computers to recognize the unique characteristics of voices and match them to people. A good example of voice recognition technology in use is biometric authentication for security purposes.

So, in a nutshell, speech recognition is a computer recognizing what is said, and voice recognition is recognizing who said it.

Why recognize speech?

Speech recognition is a natural interaction. Spoken language is a great way to interact with a computer system. You just say what you want and commands are executed.

Plus, it's convenient - especially for phone applications. For example, when your hands are busy (while driving) and all you can use is a headset or headphones, you don't need to "press 1 for customer service, 2 for sales inquiries, etc." - you just use your voice.

How speech recognition works

Speech recognition allows you to have a real conversation with an inanimate object. A few decades ago, computers had limited processing power and memory capacity. As computing power, memory capacity, and natural language processing have improved, things have changed.

So, what gives a computer the ability to detect human sounds and understand them? The principle behind speech recognition is that the technology allows speech to be detected and translated into text. The best known examples of speech recognition software are voice assistants such as Apple Siri, Amazon Alexa, Google Assistant and Microsoft Cortana, which are also built into smart speakers. Another example of speech recognition software is Google Translate.

But how exactly do you convert speech into text? How can a computer recognize your speech? It is a three-step process.

Audio is broken up into individual sounds

When people talk, they create vibrations in the air. A microphone turns these vibrations into an electrical signal, and a device known as an analog-to-digital converter (ADC) converts that signal into binary data that a machine can process.
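As a rough sketch of what the conversion to digital data involves, the code below measures a sound wave at regular intervals (sampling) and rounds each measurement to an integer (quantization). The 440 Hz test tone, the 8000 samples-per-second rate, and the 16-bit depth are assumptions chosen for the example, and a pure sine wave stands in for real speech.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=8000, bits=16):
    """Sample a sine tone and round each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1              # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate)  # how many measurements to take
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                               # time of this sample in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # value between -1.0 and 1.0
        samples.append(round(amplitude * max_level))      # quantize to an integer level
    return samples

pcm = sample_and_quantize(freq_hz=440, duration_s=0.01)
print(len(pcm), pcm[:5])
```

A higher sample rate captures higher frequencies, and more bits per sample capture finer differences in loudness.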

These sounds are then converted to digital format

The system then filters the digitized audio to remove unwanted noise. It also normalizes the volume and the speech rate to match the pre-recorded samples in the device. It then splits the data into different frequency ranges, which are laid out on a spectrogram for further analysis.

On the spectrogram, time runs along the horizontal axis (the abscissa) and sound frequency along the vertical axis (the ordinate). All words consist of individual sounds, both vowels and consonants, and each has different frequencies that are recorded on the device. Bright areas on the spectrogram indicate frequencies that carry a lot of energy at that moment, and darker areas indicate frequencies that carry little.
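As a rough illustration of how a spectrogram is computed, the sketch below cuts the samples into short overlapping frames and measures how strong each frequency is within each frame. The frame size, the hop between frames, and the 1000 Hz test tone are assumptions chosen for the example, and the naive transform used here stands in for the faster FFT (usually combined with a window function) that real systems rely on.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Strength of each frequency bin in one frame (naive DFT, lower half only)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        total = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags

def spectrogram(samples, frame_size=64, hop=32):
    """One row of bin strengths per frame: rows are time, columns are frequency."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        rows.append(dft_magnitudes(samples[start:start + frame_size]))
    return rows

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone)

# Find the strongest bin in the first frame and convert it back to hertz.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
loudest_hz = loudest_bin * sample_rate / 64
print(loudest_hz)
```

Each row is one vertical slice of the picture, and the largest values in a row are the spots that would be drawn brightest.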

Each vowel sound has its own frequency pattern that can be stored in the computer in advance, allowing it to recognize when the sound being pronounced corresponds to a particular vowel.
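One simple way to picture this matching is to store, for each vowel, the two strongest resonant frequencies (often called formants) and pick the stored vowel closest to what was measured. The sketch below does that. The frequency values are approximate textbook averages for adult male speakers and are used here only as illustrative assumptions, and real recognizers use far richer features than two numbers.

```python
import math

# Approximate first and second formant frequencies in hertz (illustrative values).
VOWEL_FORMANTS = {
    "i": (270, 2290),   # as in "beet"
    "a": (730, 1090),   # as in "father"
    "u": (300, 870),    # as in "boot"
}

def closest_vowel(f1, f2):
    """Return the stored vowel whose formant pair is nearest to the measured pair."""
    def distance(vowel):
        ref_f1, ref_f2 = VOWEL_FORMANTS[vowel]
        return math.hypot(f1 - ref_f1, f2 - ref_f2)
    return min(VOWEL_FORMANTS, key=distance)

print(closest_vowel(300, 2200))
print(closest_vowel(700, 1100))
```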

Algorithms and models are used to find the most appropriate word in the language

The computer processes the received phonemes using complex algorithms that compare them with words in its pre-built dictionary. But there is a catch - human language is not that simple. People speak with different accents and dialects and make pronunciation errors, and these variations are not necessarily present in the computer's dictionary. This is where models such as the hidden Markov model, which helps computers cope with the subtleties and nuances of human language, come in handy. To give useful results in response to words, the computer uses natural language processing.
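To give a feel for how a Markov-style model deals with ambiguous sounds, here is a minimal sketch of the Viterbi algorithm, a standard way to find the most probable sequence of hidden states (here, phonemes) behind a sequence of observations (here, crude acoustic labels). The three phonemes, the two labels, and every probability are made-up assumptions for the example and are not taken from any real recognizer.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Toy model for the word "cat": "k" and "t" both sound like a short burst,
# so only the transition probabilities can tell them apart.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {
    "k": {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.1, "t": 0.8},
    "t": {"k": 0.1, "ae": 0.2, "t": 0.7},
}
emit_p = {
    "k": {"burst": 0.9, "vowel": 0.1},
    "ae": {"burst": 0.1, "vowel": 0.9},
    "t": {"burst": 0.9, "vowel": 0.1},
}

decoded = viterbi(["burst", "vowel", "burst"], states, start_p, trans_p, emit_p)
print(decoded)
```

The first and last observations are identical, yet the model labels them as different phonemes because it also weighs which sound is likely to follow which.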

When we talk, we understand which parts of a sentence and which words come together to create a sentence that makes sense. A sentence consists of a noun phrase and a verb phrase.

The methods used generally fall into the categories of part-of-speech tagging and chunking, which groups tagged words into phrases. It is these parts of speech that give natural language processing the ability to understand context. By analyzing large numbers of sentences and various word combinations, the computer can work out which context is meant and is thus able to recognize words correctly.
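Labeling each word with its part of speech and then grouping the labeled words into phrases can be sketched in a few lines. The tiny word list and the grouping rule below are hypothetical simplifications made up for this example, and a real tagger learns its labels from large amounts of annotated text rather than from a fixed table.

```python
# Hypothetical mini-lexicon that maps words to part-of-speech labels.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "ball": "NOUN",
    "red": "ADJ",
    "chased": "VERB",
}

def tag(sentence):
    """Label every word with its part of speech, or UNK if it is not in the lexicon."""
    return [(word, LEXICON.get(word, "UNK")) for word in sentence.lower().split()]

def noun_phrases(tagged):
    """Group runs of determiners, adjectives, and nouns that end in a noun."""
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("DET", "ADJ", "NOUN"):
            current.append((word, pos))
            continue
        if current and current[-1][1] == "NOUN":
            chunks.append(" ".join(w for w, _ in current))
        current = []
    if current and current[-1][1] == "NOUN":
        chunks.append(" ".join(w for w, _ in current))
    return chunks

tagged = tag("The dog chased a red ball")
print(tagged)
print(noun_phrases(tagged))
```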

And it is not only part-of-speech tagging and chunking that allow a complex system to understand the meaning of human words. Other methods can be roughly divided into two categories - syntax and semantics.

The individual words, parts of speech and their arrangement in a sentence give the computer the knowledge and context of what the sentence is trying to say. But how does the computer know what the parts of speech are? And how does it understand it all, even if it knows the parts of speech? The answer lies in the data.

There is a lot of data available in the modern world. A programmer can collect all kinds of words, phrases, sentences, grammatical rules, and word structures and input all this data into an algorithm. Using this information, the algorithm can figure out which words usually end up next to each other, how a sentence should be formed, why some words fit better in a sentence than others, etc. This is how the computer will eventually determine the context.
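A very small version of "figuring out which words usually end up next to each other" is a table of word-pair counts, often called a bigram model. The sketch below builds one from a made-up four-sentence corpus, which is an assumption for the example, since real systems train on vastly more text.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            follows[first][second] += 1
    return follows

def most_likely_next(follows, word):
    """Return the word seen most often after `word`, or None if it was never seen."""
    counts = follows.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

corpus = [
    "recognize speech",
    "recognize speech quickly",
    "recognize voices",
    "wreck a nice beach",
]
model = train_bigrams(corpus)
print(most_likely_next(model, "recognize"))
```

With counts like these, a recognizer that hears something between "recognize speech" and "wreck a nice beach" can prefer the word sequence that the data makes more likely.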

Language processing allows the machine to take what you say, understand the gist of it, and formulate its own statement to answer you.

There are dozens of individual applications and speech-to-text services that help millions of users day in and day out. Some of these services can handle a transcription task in a short time for a low fee, and can be trusted with even low-quality files in various formats in more than 50 languages.