Speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Convolutional neural network
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filter (or kernel) optimization. Vanishing and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, a fully connected layer processing an image sized 100 × 100 pixels would require 10,000 weights for each neuron.
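The weight count above can be checked directly. The 5 × 5 kernel size below is an illustrative assumption, chosen only to contrast per-neuron weights in a fully connected layer with a single shared convolutional filter:

```python
# Weight counts: fully connected neuron vs. shared convolutional kernel.
height, width = 100, 100

# A single fully connected neuron needs one weight per input pixel.
fc_weights_per_neuron = height * width
print(fc_weights_per_neuron)  # 10000

# A convolutional layer slides one small kernel over the whole image,
# so the same few weights are reused at every spatial position.
kernel_h, kernel_w = 5, 5  # assumed kernel size, for illustration
conv_weights_per_filter = kernel_h * kernel_w
print(conv_weights_per_filter)  # 25
```

The contrast (10,000 weights per neuron versus 25 per filter) is what "fewer connections" refers to above.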
Phonemic orthography
A phonemic orthography is an orthography (system for writing a language) in which the graphemes (written symbols) correspond to the phonemes (significant spoken sounds) of the language. Natural languages rarely have perfectly phonemic orthographies; a high degree of grapheme–phoneme correspondence can be expected in orthographies based on alphabetic writing systems, but they differ in how complete this correspondence is.
Speech
Speech is human vocal communication using language. Each language uses phonetic combinations of vowel and consonant sounds that form the sound of its words (that is, all English words sound different from all French words, even when they share a spelling, e.g., "role" or "hotel"), and uses those words in their semantic character as words in the lexicon of a language, according to the syntactic constraints that govern lexical words' function in a sentence. In speaking, speakers perform many different intentional speech acts.
Speech and language impairment
Speech and language impairment are basic categories that might be drawn in issues of communication involving hearing, speech, language, and fluency. A speech impairment is characterized by difficulty in the articulation of words; examples include stuttering or problems producing particular sounds. Articulation refers to the sounds, syllables, and phonology produced by the individual. Voice, by contrast, refers to the characteristics of the sounds produced, specifically the pitch, quality, and intensity of the sound.
Speech synthesis
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database.
Phoneme
In phonology and linguistics, a phoneme (/ˈfoʊniːm/) is a unit of sound that can distinguish one word from another in a particular language. For example, in most dialects of English, with the notable exception of the West Midlands and the north-west of England, the sound sequences /sɪn/ (sin) and /sɪŋ/ (sing) form two separate words distinguished by the substitution of one phoneme, /n/, for another phoneme, /ŋ/. Two words like this that differ in meaning through the contrast of a single phoneme form a minimal pair.
Artificial neural network
Artificial neural networks (ANNs, also shortened to neural networks (NNs) or neural nets) are a branch of machine learning models that are built using principles of neuronal organization discovered by connectionism in the biological neural networks constituting animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons.
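A single artificial neuron can be sketched in a few lines. The sigmoid activation and the toy weights below are illustrative assumptions, not a fixed part of the definition:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs plus a bias,
    passed through a sigmoid activation (one common choice)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and weights, for illustration only.
# Here z = 0.5 * 2.0 + (-1.0) * 1.0 + 0.0 = 0.0, so the output is 0.5.
out = neuron([0.5, -1.0], [2.0, 1.0], 0.0)
print(out)  # 0.5
```

Each "connection" in the definition above corresponds to one weight in this sum; learning adjusts the weights and bias.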
Mutual intelligibility
In linguistics, mutual intelligibility is a relationship between languages or dialects in which speakers of different but related varieties can readily understand each other without prior familiarity or special effort. It is sometimes used as an important criterion for distinguishing languages from dialects, although sociolinguistic factors are often also considered. Intelligibility between languages can be asymmetric, with speakers of one variety understanding more of the other than vice versa.
Phonetic transcription
Phonetic transcription (also known as phonetic script or phonetic notation) is the visual representation of speech sounds (or phones) by means of symbols. The most common type of phonetic transcription uses a phonetic alphabet, such as the International Phonetic Alphabet. The pronunciation of words in all languages changes over time. However, their written forms (orthography) are often not modified to take account of such changes, and do not accurately represent the pronunciation.
English phonology
English phonology is the system of speech sounds used in spoken English. Like many other languages, English has wide variation in pronunciation, both historically and from dialect to dialect. In general, however, the regional dialects of English share a largely similar (but not identical) phonological system. Among other things, most dialects have vowel reduction in unstressed syllables and a complex set of phonological features that distinguish fortis and lenis consonants (stops, affricates, and fricatives).
Hate speech
Hate speech is defined by the Cambridge Dictionary as "public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation". The Encyclopedia of the American Constitution states that hate speech is "usually thought to include communications of animosity or disparagement of an individual or a group on account of a group characteristic such as race, color, national origin, sex, disability, religion, or sexual orientation".
International Phonetic Alphabet
The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation based primarily on the Latin script. It was devised by the International Phonetic Association in the late 19th century as a standardized representation of speech sounds in written form. The IPA is used by lexicographers, foreign language students and teachers, linguists, speech–language pathologists, singers, actors, constructed language creators, and translators.
Speech processing
Speech processing is the study of speech signals and of methods for processing them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer, and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, and others.
Residual neural network
A residual neural network (also residual network or ResNet) is a deep learning model in which the weight layers learn residual functions with reference to the layer inputs. It is a network with skip connections that perform identity mappings, merged with the layer outputs by addition. It behaves like a highway network whose gates are opened through strongly positive bias weights. This enables deep learning models with tens or hundreds of layers to train easily and to reach better accuracy as depth increases.
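The skip connection amounts to output = layer(x) + x. The toy zero layer below is an assumption used only to show why identity mappings are easy for such a block to represent:

```python
def residual_block(x, layer):
    """Skip connection: add the layer's output back onto its input,
    element-wise, so the layer only has to learn a residual."""
    return [f + xi for f, xi in zip(layer(x), x)]

# If the weight layer learns the zero function, the block passes its
# input through unchanged - the identity mapping costs nothing to learn.
zero_layer = lambda x: [0.0 for _ in x]
print(residual_block([1.0, 2.0], zero_layer))  # [1.0, 2.0]
```

This is why very deep stacks of such blocks remain trainable: an unneeded layer can fall back to the identity instead of having to approximate it with weights.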
Allophone
In phonology, an allophone (/ˈæləfoʊn/; from the Greek ἄλλος 'other' and φωνή 'voice, sound') is one of multiple possible spoken sounds (phones) or signs used to pronounce a single phoneme in a particular language. For example, in English, the voiceless plosive [t] (as in stop [ˈstɒp]) and the aspirated form [tʰ] (as in top [ˈtʰɒp]) are allophones of the phoneme /t/, while these two are considered different phonemes in some languages, such as Thai.
Freedom of speech
Freedom of speech is a principle that supports the freedom of an individual or a community to articulate their opinions and ideas without fear of retaliation, censorship, or legal sanction. The right to freedom of expression has been recognised as a human right in the Universal Declaration of Human Rights and international human rights law by the United Nations. Many countries have constitutional law that protects free speech. Terms like free speech, freedom of speech, and freedom of expression are used interchangeably in political discourse.
Phone (phonetics)
In phonetics and linguistics, a phone is any distinct speech sound or gesture, regardless of whether the exact sound is critical to the meanings of words. In contrast, a phoneme is a speech sound in a given language that, if swapped with another phoneme, could change one word to another. Phones are absolute and are not specific to any language, but phonemes can be discussed only in reference to specific languages. For example, the English words kid and kit end with two distinct phonemes, /d/ and /t/, and swapping one for the other would change one word into a different word.
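The minimal-pair test used above (kid/kit, and sin/sing in the phoneme entry) reduces to a simple check: two phoneme sequences of equal length that differ in exactly one segment. A sketch, where encoding each word as a list of phoneme strings is an assumption made for illustration:

```python
def is_minimal_pair(a, b):
    """True if two phoneme sequences have the same length and differ
    in exactly one segment - the phonological notion of a minimal pair."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1

# kid /kɪd/ vs. kit /kɪt/ differ only in the final phoneme.
print(is_minimal_pair(["k", "ɪ", "d"], ["k", "ɪ", "t"]))  # True
# sin /sɪn/ vs. sing /sɪŋ/ differ only in /n/ vs. /ŋ/.
print(is_minimal_pair(["s", "ɪ", "n"], ["s", "ɪ", "ŋ"]))  # True
```

Real phonological analysis also requires that the two forms differ in meaning, which a string comparison cannot verify; the check above covers only the formal part of the definition.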
Speech coding
Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream. Common applications of speech coding are mobile telephony and voice over IP (VoIP).
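One classic speech-specific transform of this kind is μ-law companding, used in G.711 telephony: it gives quiet speech samples finer resolution than loud ones before quantization. A minimal sketch with μ = 255 (the North American standard value), shown without the subsequent 8-bit quantization step:

```python
import math

MU = 255  # standard mu value in North American G.711 telephony

def mu_law_compress(x):
    """Map a sample in [-1, 1] through the mu-law curve, expanding
    the quiet region of the amplitude range."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.1
y = mu_law_compress(x)            # a quiet sample maps to ~0.59,
print(round(mu_law_expand(y), 6)) # and the round trip recovers 0.1
```

In a real codec the compressed value would then be quantized to 8 bits; the companding step ensures those 8 bits are spent where speech amplitudes actually concentrate.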
Phonetics
Phonetics is a branch of linguistics that studies how humans produce and perceive sounds or, in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved: how humans plan and execute movements to produce speech (articulatory phonetics), how various movements affect the properties of the resulting sound (acoustic phonetics), and how humans convert sound waves to linguistic information (auditory phonetics).