Publication

Overcoming Asynchrony in Audio-Visual Speech Recognition

Related concepts (32)

Speech recognition

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Streaming media

Streaming media is multimedia that is delivered and consumed in a continuous manner from a source, with little or no intermediate storage in network elements. Streaming refers to the delivery method of content, rather than the content itself. Distinguishing delivery method from the media applies specifically to telecommunications networks, as most of the traditional media delivery systems are either inherently streaming (e.g. radio, television) or inherently non-streaming (e.g. books, videotapes, audio CDs).

Phoneme

In phonology and linguistics, a phoneme (/ˈfoʊniːm/) is a unit of sound that can distinguish one word from another in a particular language. For example, in most dialects of English, with the notable exception of the West Midlands and the north-west of England, the sound patterns /sɪn/ (sin) and /sɪŋ/ (sing) are two separate words that are distinguished by the substitution of one phoneme, /n/, for another phoneme, /ŋ/. Two words like this that differ in meaning through the contrast of a single phoneme form a minimal pair.
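The minimal-pair idea can be illustrated with a small sketch. It assumes words are already given as lists of phoneme symbols; the function name `is_minimal_pair` and the transcriptions are illustrative, not taken from any particular toolkit.

```python
def is_minimal_pair(a, b):
    """Return True if two phoneme sequences differ in exactly one position.

    Assumption: both words are pre-segmented into phoneme symbols, and only
    substitutions (not insertions or deletions) count as a contrast.
    """
    if len(a) != len(b):
        return False
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1


# Illustrative transcriptions of "sin" and "sing".
sin = ["s", "ɪ", "n"]
sing = ["s", "ɪ", "ŋ"]

print(is_minimal_pair(sin, sing))
```

A check like this only compares symbols. Whether two sounds are actually contrastive phonemes in a given dialect is a linguistic question the code does not answer.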

Adaptive bitrate streaming

Adaptive bitrate streaming is a technique used in streaming multimedia over computer networks. While in the past most video or audio streaming technologies utilized streaming protocols such as RTP with RTSP, today's adaptive streaming technologies are based almost exclusively on HTTP, and are designed to work efficiently over large distributed HTTP networks. Adaptive bitrate streaming works by detecting a user's bandwidth and CPU capacity in real time, adjusting the quality of the media stream accordingly.
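A minimal sketch of the selection step, assuming the client has already measured its throughput: it picks the highest rendition that fits within a safety margin of that measurement. The bitrate ladder, the `safety` factor, and the function name are hypothetical, and the sketch ignores the CPU-capacity signal mentioned above.

```python
# Hypothetical bitrate ladder in kbit/s; real services define their own.
RENDITIONS_KBPS = [400, 1200, 2500, 5000]


def choose_rendition(measured_kbps, ladder=RENDITIONS_KBPS, safety=0.8):
    """Pick the highest bitrate that fits within a fraction of measured bandwidth.

    Falls back to the lowest rendition when nothing fits the budget.
    """
    budget = measured_kbps * safety
    candidates = [rate for rate in ladder if rate <= budget]
    return max(candidates) if candidates else min(ladder)


print(choose_rendition(4000))
print(choose_rendition(300))
```

In practice a player would re-run this decision for each segment as new throughput measurements arrive, so the stream quality tracks changing network conditions.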

English phonology

English phonology is the system of speech sounds used in spoken English. Like many other languages, English has wide variation in pronunciation, both historically and from dialect to dialect. In general, however, the regional dialects of English share a largely similar (but not identical) phonological system. Among other things, most dialects have vowel reduction in unstressed syllables and a complex set of phonological features that distinguish fortis and lenis consonants (stops, affricates, and fricatives).

List of streaming media services

An over-the-top (OTT) media service is a streaming media service offered directly to viewers via the Internet. OTT bypasses cable, broadcast, and satellite television platforms, the companies that traditionally act as controllers or distributors of such content. Most of these services are owned by a major film studio. Some streaming services started as an add-on to Blu-ray offerings, which are supplements to the programs watched. Streaming is an alternative to file downloading, a process in which the end-user obtains the entire file(s) for the content before watching or listening to it.

Voice (phonetics)

Voice or voicing is a term used in phonetics and phonology to characterize speech sounds (usually consonants). Speech sounds can be described as either voiceless (otherwise known as unvoiced) or voiced. The term, however, is used to refer to two separate concepts. Voicing can refer to the articulatory process in which the vocal folds vibrate; this is its primary use in phonetics, where it describes phones, which are particular speech sounds. It can also refer to a classification of speech sounds that tend to be associated with vocal cord vibration but may not actually be voiced at the articulatory level.

Phonemic orthography

A phonemic orthography is an orthography (system for writing a language) in which the graphemes (written symbols) correspond to the phonemes (significant spoken sounds) of the language. Natural languages rarely have perfectly phonemic orthographies; a high degree of grapheme–phoneme correspondence can be expected in orthographies based on alphabetic writing systems, but they differ in how complete this correspondence is.

Internet video

Internet video (also known as online video) is digital video that is distributed over the internet. Internet video exists in several formats, the most notable being MPEG-4 AVC, AVCHD, and FLV. There are several online video hosting services, including YouTube, Vimeo, Twitch, and Youku. In recent years, internet video has been used to stream live events. As a result of the popularity of online video, notable events like the 2012 U.S. presidential debates have been streamed live on the internet.

Music streaming service

A music streaming service is a type of streaming media service that focuses primarily on music, and sometimes other forms of digital audio content such as podcasts. These services are usually subscription-based services allowing users to stream digital copyright restricted songs on-demand from a centralized library provided by the service. Some services may offer free tiers with limitations, such as advertising and limits on use.

Speech processing

Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.

Phonology

Phonology is the branch of linguistics that studies how languages or dialects systematically organize their phones or, for sign languages, their constituent parts of signs. The term can also refer specifically to the sound or sign system of a particular language variety. At one time, the study of phonology related only to the study of the systems of phonemes in spoken languages, but it may now relate to linguistic analysis more broadly. Sign languages have a phonological system equivalent to the system of sounds in spoken languages.

Streaming television

Streaming television is the digital distribution of television content, such as television shows and films, as streaming media delivered over the Internet. Streaming television stands in contrast to dedicated terrestrial television delivered by over-the-air aerial systems, cable television, and/or satellite television systems.

Speech synthesis

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database.

Livestreaming

Livestreaming is streaming media simultaneously recorded and broadcast over the internet in real-time or near real-time. It is often referred to simply as streaming. Non-live media such as video-on-demand, vlogs, and YouTube videos are technically streamed, but not live-streamed. Livestreaming services encompass a wide variety of topics, from social media to video games to professional sports to lifecasting.

Video on demand

Video on demand (VOD) is a media distribution system that allows users to access videos, television shows and films without a traditional video playback device and a typical static broadcasting schedule. In the 20th century, broadcasting in the form of over-the-air programming was the most common form of media distribution. As Internet and IPTV technologies continued to develop in the 1990s, consumers began to gravitate towards non-traditional modes of content consumption, which culminated in the arrival of VOD on televisions and personal computers.

Audio coding format

An audio coding format (or sometimes audio compression format) is a content representation format for storage or transmission of digital audio (such as in digital television, digital radio and in audio and video files). Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example of an audio codec is LAME, which is one of several different codecs that implement encoding and decoding of audio in the MP3 audio coding format in software.

Digital signal processing

Digital signal processing (DSP) is the use of digital processing, such as by computers or more specialized digital signal processors, to perform a wide variety of signal processing operations. The digital signals processed in this manner are a sequence of numbers that represent samples of a continuous variable in a domain such as time, space, or frequency. In digital electronics, a digital signal is represented as a pulse train, which is typically generated by the switching of a transistor.
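As a rough illustration of "a sequence of numbers that represent samples of a continuous variable", the sketch below samples a sine wave at evenly spaced instants. The function name and the chosen frequency and sample rate are assumptions made for the example.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = n / sample_rate for n = 0 .. n_samples - 1."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


# One full period of a 1 Hz tone sampled 8 times per second (illustrative values).
samples = sample_sine(freq_hz=1.0, sample_rate_hz=8.0, n_samples=8)
print([round(s, 3) for s in samples])
```

The resulting list is the discrete-time signal on which DSP operations such as filtering or spectral analysis would then be performed.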

Facial recognition system

A facial recognition system is a technology potentially capable of matching a human face from a digital image or a video frame against a database of faces. Such a system is typically employed to authenticate users through ID verification services, and works by pinpointing and measuring facial features from a given image. Development began on similar systems in the 1960s, beginning as a form of computer application. In recent times, facial recognition systems have seen wider use on smartphones and in other forms of technology, such as robotics.

Digital audio

Digital audio is a representation of sound recorded in, or converted into, digital form. In digital audio, the sound wave of the audio signal is typically encoded as numerical samples in a continuous sequence. For example, in CD audio, samples are taken 44,100 times per second, each with 16-bit sample depth. Digital audio is also the name for the entire technology of sound recording and reproduction using audio signals that have been encoded in digital form.
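The CD figures above translate directly into a data rate for uncompressed audio. The sketch below does that arithmetic; the two-channel (stereo) count is an assumption added here, since the paragraph states only the sample rate and bit depth, and the function name is illustrative.

```python
SAMPLE_RATE_HZ = 44_100  # samples per second, from the text
BIT_DEPTH = 16           # bits per sample, from the text
CHANNELS = 2             # assumption: stereo; not stated in the paragraph


def pcm_bitrate(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bit_depth * channels


bits_per_second = pcm_bitrate(SAMPLE_RATE_HZ, BIT_DEPTH, CHANNELS)
bytes_per_minute = bits_per_second // 8 * 60

print(bits_per_second)   # 1411200
print(bytes_per_minute)  # 10584000
```

Under these assumptions, one minute of audio occupies roughly 10.6 MB before any compression is applied.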