Publication

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Related concepts (24)

Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
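As a rough illustration of replacing the full gradient with an estimate from a random subset, the following minimal sketch fits a line by minibatch SGD on a mean squared error. The function name `sgd_linear_fit`, the toy data, and all hyperparameter values are assumptions made for this example, not part of the source.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, steps=2000, seed=0):
    """Fit y ~ w * x + b by minibatch SGD on the mean squared error.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Randomly selected subset of the data (the "stochastic" part).
        batch = rng.sample(range(n), batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of the batch-averaged squared error.
            grad_w += 2.0 * err * xs[i] / batch_size
            grad_b += 2.0 * err / batch_size
        # Step against the estimated gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 3x - 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each step only looks at four of the twenty points, yet the iterates still approach the slope and intercept that generated the data.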

Gradient descent

In mathematics, gradient descent (also often called steepest descent) is an iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.
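The repeated step against the gradient can be sketched in a few lines. The function `gradient_descent`, the quadratic test function f(x) = (x - 3)^2, and the step size are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, step_size=0.1, num_steps=100):
    """Take repeated steps in the direction opposite to the gradient."""
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x


# Illustrative objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Flipping the sign of the update (adding `step_size * grad(x)` instead of subtracting it) would give gradient ascent.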

Recurrent neural network

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by the direction of the flow of information between its layers. In contrast to the uni-directional feedforward neural network, it is a bi-directional artificial neural network, meaning that it allows the output from some nodes to affect subsequent input to the same nodes. The ability of RNNs to use internal state (memory) to process arbitrary sequences of inputs makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

Backpropagation

As a machine-learning algorithm, backpropagation performs a backward pass to adjust the model's parameters, aiming to minimize the mean squared error (MSE). In a single-layered network, backpropagation uses the following steps: (1) Traverse the network from the input to the output, computing the output of the hidden layers and of the output layer (the feedforward step). (2) In the output layer, calculate the derivative of the cost function with respect to the input and the hidden layers.

Angular momentum

In physics, angular momentum (sometimes called moment of momentum or rotational momentum) is the rotational analog of linear momentum. It is an important physical quantity because it is a conserved quantity – the total angular momentum of a closed system remains constant. Angular momentum has both a direction and a magnitude, and both are conserved. Bicycles and motorcycles, flying discs, rifled bullets, and gyroscopes owe their useful properties to conservation of angular momentum.

Softmax function

The softmax function, also known as softargmax or normalized exponential function, converts a vector of K real numbers into a probability distribution of K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom.
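A minimal sketch of the softmax function follows. Subtracting the maximum before exponentiating is a common numerical-stability choice and an assumption of this example; it does not change the result.

```python
import math


def softmax(z):
    """Map a list of K real numbers to a probability distribution over K outcomes."""
    m = max(z)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])

# With K = 2 and one input fixed at 0, softmax reduces to the logistic function.
x = 0.7
two_class = softmax([x, 0.0])[0]
logistic = 1.0 / (1.0 + math.exp(-x))
```

The outputs are positive, sum to one, and preserve the ordering of the inputs.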

Learning rate

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain. In setting a learning rate, there is a trade-off between the rate of convergence and overshooting.
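The trade-off between rate of convergence and overshooting can be seen on a toy problem. The sketch below runs gradient descent on f(x) = x^2 with three learning rates; the function `run` and the specific values are assumptions chosen for illustration.

```python
def run(lr, x0=1.0, steps=20):
    """Gradient descent on f(x) = x^2, whose gradient is 2 * x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # equivalent to x <- (1 - 2 * lr) * x
    return x


slow = run(0.01)      # small step size: stable but slow progress
fast = run(0.4)       # moderate step size: rapid convergence toward 0
diverged = run(1.1)   # too large: each step overshoots and |x| grows
```

For this quadratic, every update multiplies x by (1 - 2 * lr), so the iterates shrink only when that factor has magnitude below one.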

Attention (machine learning)

Machine learning-based attention is a mechanism mimicking cognitive attention. It calculates "soft" weights for each word, more precisely for its embedding, in the context window. It can do so either in parallel (such as in transformers) or sequentially (such as in recurrent neural networks). "Soft" weights can change during each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards. Multiple attention heads are used in transformer-based large language models.
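One common way to compute such "soft" weights is scaled dot-product attention, sketched below for a single query. The source does not specify a scoring rule, so the dot-product scoring, the scaling by the square root of the dimension, and the names `attention`, `query`, `keys`, and `values` are assumptions of this example.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention (illustrative sketch)."""
    d = len(query)
    # One score per position in the context window.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns the scores into "soft" weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, output = attention(query, keys, values)
```

The weights are recomputed for every new query and context, which is what distinguishes them from the trained parameters that stay fixed after training.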

Acceleration

In mechanics, acceleration is the rate of change of the velocity of an object with respect to time. Accelerations are vector quantities (in that they have magnitude and direction). The orientation of an object's acceleration is given by the orientation of the net force acting on that object. The magnitude of an object's acceleration, as described by Newton's second law, is the combined effect of two causes: the net balance of all external forces acting on that object (the magnitude is directly proportional to this net resulting force), and that object's mass, which depends on the materials out of which it is made (the magnitude is inversely proportional to the object's mass).

Momentum

In Newtonian mechanics, momentum (pl.: momenta or momentums; more specifically linear momentum or translational momentum) is the product of the mass and velocity of an object. It is a vector quantity, possessing a magnitude and a direction. If m is an object's mass and v is its velocity (also a vector quantity), then the object's momentum p (from Latin pellere "push, drive") is p = mv. In the International System of Units (SI), the unit of measurement of momentum is the kilogram metre per second (kg⋅m/s), which is equivalent to the newton-second.

Boltzmann machine

A Boltzmann machine (also called Sherrington–Kirkpatrick model with external field or stochastic Ising–Lenz–Little model) is a stochastic spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model, that is a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as a Markov random field. Boltzmann machines are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes.

Four-momentum

In special relativity, four-momentum (also called momentum–energy or momenergy) is the generalization of the classical three-dimensional momentum to four-dimensional spacetime. Momentum is a vector in three dimensions; similarly four-momentum is a four-vector in spacetime. The contravariant four-momentum of a particle with relativistic energy E and three-momentum p = (px, py, pz) = γmv, where v is the particle's three-velocity and γ the Lorentz factor, is (E/c, px, py, pz), where c is the speed of light. The quantity mv above is the ordinary non-relativistic momentum of the particle and m its rest mass.

Hyperparameter (machine learning)

In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are derived via training. Hyperparameters can be classified as model hyperparameters, that cannot be inferred while fitting the machine to the training set because they refer to the model selection task, or algorithm hyperparameters, that in principle have no influence on the performance of the model but affect the speed and quality of the learning process.

Equations of motion

In physics, equations of motion are equations that describe the behavior of a physical system in terms of its motion as a function of time. More specifically, the equations of motion describe the behavior of a physical system as a set of mathematical functions in terms of dynamic variables. These variables are usually spatial coordinates and time, but may include momentum components. The most general choice is generalized coordinates, which can be any convenient variables characteristic of the physical system.

Mathematical optimization

Mathematical optimization (alternatively spelled optimisation) or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfields: discrete optimization and continuous optimization. Optimization problems arise in all quantitative disciplines from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries.

Artificial neural network

Artificial neural networks (ANNs, also shortened to neural networks (NNs) or neural nets) are a branch of machine learning models that are built using principles of neuronal organization discovered by connectionism in the biological neural networks constituting animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons.

Centrifugal force

In Newtonian mechanics, the centrifugal force is an inertial force (also called a "fictitious" or "pseudo" force) that appears to act on all objects when viewed in a rotating frame of reference. It is directed away from an axis which is parallel to the axis of rotation and passing through the coordinate system's origin. If the axis of rotation passes through the coordinate system's origin, the centrifugal force is directed radially outwards from that axis.

Energy–momentum relation

In physics, the energy–momentum relation, or relativistic dispersion relation, is the relativistic equation relating total energy (which is also called relativistic energy) to invariant mass (which is also called rest mass) and momentum. It is the extension of mass–energy equivalence for bodies or systems with non-zero momentum. It can be written as the following equation: E^2 = (pc)^2 + (m0 c^2)^2. This equation holds for a body or system, such as one or more particles, with total energy E, invariant mass m0, and momentum of magnitude p; the constant c is the speed of light.

Center-of-momentum frame

In physics, the center-of-momentum frame (COM frame), also known as zero-momentum frame, is the inertial frame in which the total momentum of the system vanishes. It is unique up to velocity, but not origin. The center of momentum of a system is not a location, but a collection of relative momenta/velocities: a reference frame. Thus "center of momentum" is short for "center-of-momentum frame". A special case of the center-of-momentum frame is the center-of-mass frame: an inertial frame in which the center of mass (which is a single point) remains at the origin.

Hamiltonian mechanics

Hamiltonian mechanics emerged in 1833 as a reformulation of Lagrangian mechanics. Introduced by Sir William Rowan Hamilton, Hamiltonian mechanics replaces (generalized) velocities used in Lagrangian mechanics with (generalized) momenta. Both theories provide interpretations of classical mechanics and describe the same physical phenomena. Hamiltonian mechanics has a close relationship with geometry (notably, symplectic geometry and Poisson structures) and serves as a link between classical and quantum mechanics.