Publication

SVGC-AVA: 360-Degree Video Saliency Prediction With Spherical Vector-Based Graph Convolution and Audio-Visual Attention

Related concepts (32)

Salience (also called saliency) is that property by which some thing stands out. Salient events are an attentional mechanism by which organisms learn and survive; those organisms can focus their limited perceptual and cognitive resources on the pertinent (that is, salient) subset of the sensory data available to them. Saliency typically arises from contrasts between items and their neighborhood. They might be represented, for example, by a red dot surrounded by white dots, or by a flickering message indicator of an answering machine, or a loud noise in an otherwise quiet environment.

Multi-scale approaches

The scale space representation of a signal obtained by Gaussian smoothing satisfies a number of special properties, scale-space axioms, which make it into a special form of multi-scale representation. There are, however, also other types of "multi-scale approaches" in the areas of computer vision, and signal processing, in particular the notion of wavelets. The purpose of this article is to describe a few of these approaches: For one-dimensional signals, there exists quite a well-developed theory for continuous and discrete kernels that guarantee that new local extrema or zero-crossings cannot be created by a convolution operation.

Fast Fourier transform

A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). Fourier analysis converts a signal from its original domain (often time or space) to a representation in the frequency domain and vice versa. The DFT is obtained by decomposing a sequence of values into components of different frequencies. This operation is useful in many fields, but computing it directly from the definition is often too slow to be practical.

Fourier series

A Fourier series (ˈfʊrieɪ,_-iər) is an expansion of a periodic function into a sum of trigonometric functions. The Fourier series is an example of a trigonometric series, but not all trigonometric series are Fourier series. By expressing a function as a sum of sines and cosines, many problems involving the function become easier to analyze because trigonometric functions are well understood. For example, Fourier series were first used by Joseph Fourier to find solutions to the heat equation.

Scale space

Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, and signal processing communities with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures.

Visual perception

Visual perception is the ability to interpret the surrounding environment through photopic vision (daytime vision), color vision, scotopic vision (night vision), and mesopic vision (twilight vision), using light in the visible spectrum reflected by objects in the environment. This is different from visual acuity, which refers to how clearly a person sees (for example "20/20 vision"). A person can have problems with visual perceptual processing even if they have 20/20 vision.

Visual processing

Visual processing is a term that is used to refer to the brain's ability to use and interpret visual information from the world around us. The process of converting light energy into a meaningful image is a complex process that is facilitated by numerous brain structures and higher level cognitive processes. On an anatomical level, light energy first enters the eye through the cornea, where the light is bent. After passing through the cornea, light passes through the pupil and then lens of the eye, where it is bent to a greater degree and focused upon the retina.

Structure tensor

In mathematics, the structure tensor, also referred to as the second-moment matrix, is a matrix derived from the gradient of a function. It describes the distribution of the gradient in a specified neighborhood around a point and makes the information invariant respect the observing coordinates. The structure tensor is often used in and computer vision. For a function of two variables p = (x, y), the structure tensor is the 2×2 matrix where and are the partial derivatives of with respect to x and y; the integrals range over the plane ; and w is some fixed "window function" (such as a Gaussian blur), a distribution on two variables.

Audio editing software

Audio editing software is any software or computer program which allows editing and generating of audio data. Audio editing software can be implemented completely or partly as a library, as a computer application, as a web application, or as a loadable kernel module. Wave editors are digital audio editors. There are many sources of software available to perform this function. Most can edit music, apply effects and filters, and adjust stereo channels.

Stimulus modality

Stimulus modality, also called sensory modality, is one aspect of a stimulus or what is perceived after a stimulus. For example, the temperature modality is registered after heat or cold stimulate a receptor. Some sensory modalities include: light, sound, temperature, taste, pressure, and smell. The type and location of the sensory receptor activated by the stimulus plays the primary role in coding the sensation. All sensory modalities work together to heighten stimuli sensation when necessary.

Visual system

The visual system comprises the sensory organ (the eye) and parts of the central nervous system (the retina containing photoreceptor cells, the optic nerve, the optic tract and the visual cortex) which gives organisms the sense of sight (the ability to detect and process visible light) as well as enabling the formation of several non-image photo response functions. It detects and interprets information from the optical spectrum perceptible to that species to "build a representation" of the surrounding environment.

Visual cortex

The visual cortex of the brain is the area of the cerebral cortex that processes visual information. It is located in the occipital lobe. Sensory input originating from the eyes travels through the lateral geniculate nucleus in the thalamus and then reaches the visual cortex. The area of the visual cortex that receives the sensory input from the lateral geniculate nucleus is the primary visual cortex, also known as visual area 1 (V1), Brodmann area 17, or the striate cortex.

Multimodal interaction

Multimodal interaction provides the user with multiple modes of interacting with a system. A multimodal interface provides several distinct tools for input and output of data. Multimodal human-computer interaction refers to the "interaction with the virtual and physical environment through natural modes of communication", This implies that multimodal interaction enables a more free and natural communication, interfacing users with automated systems in both input and output.

Fourier transform

In physics and mathematics, the Fourier transform (FT) is a transform that converts a function into a form that describes the frequencies present in the original function. The output of the transform is a complex-valued function of frequency. The term Fourier transform refers to both this complex-valued function and the mathematical operation. When a distinction needs to be made the Fourier transform is sometimes called the frequency domain representation of the original function.

Digital audio workstation

A digital audio workstation (DAW) is an electronic device or application software used for recording, editing and producing . DAWs come in a wide variety of configurations from a single software program on a laptop, to an integrated stand-alone unit, all the way to a highly complex configuration of numerous components controlled by a central computer. Regardless of configuration, modern DAWs have a central interface that allows the user to alter and mix multiple recordings and tracks into a final produced piece.

Deep learning

Deep learning is part of a broader family of machine learning methods, which is based on artificial neural networks with representation learning. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

Scale space implementation

In the areas of computer vision, and signal processing, the notion of scale-space representation is used for processing measurement data at multiple scales, and specifically enhance or suppress image features over different ranges of scale (see the article on scale space). A special type of scale-space representation is provided by the Gaussian scale space, where the image data in N dimensions is subjected to smoothing by Gaussian convolution.

Male gaze

In feminist theory, the male gaze is the act of depicting women and the world in the visual arts and in literature from a masculine, heterosexual perspective that presents and represents women as sexual objects for the pleasure of the heterosexual male viewer. In the visual and aesthetic presentations of narrative cinema, the male gaze has three perspectives: (i) that of the man behind the camera, (ii) that of the male characters within the film's cinematic representations; and (iii) that of the spectator gazing at the image.

Visual agnosia

Visual agnosia is an impairment in recognition of visually presented objects. It is not due to a deficit in vision (acuity, visual field, and scanning), language, memory, or intellect. While cortical blindness results from lesions to primary visual cortex, visual agnosia is often due to damage to more anterior cortex such as the posterior occipital and/or temporal lobe(s) in the brain.[2] There are two types of visual agnosia: apperceptive agnosia and associative agnosia. Recognition of visual objects occurs at two primary levels.

Gaze

In critical theory, philosophy, sociology, and psychoanalysis, the gaze (French: le regard), in the figurative sense, is an individual's (or a group's) awareness and perception of other individuals, other groups, or oneself. The concept and the social applications of the gaze have been defined and explained by existentialist and phenomenologist philosophers. Jean-Paul Sartre described the gaze (or "the look") in Being and Nothingness (1943).