Publication

Training DNNs with Hybrid Block Floating Point

Related concepts (22)

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision: However, unlike 12.345, 12.3456 is not a floating-point number in base ten with five digits of precision—it needs six digits of precision; the nearest floating-point number with only five digits is 12.

Fixed-point theorem

In mathematics, a fixed-point theorem is a result saying that a function F will have at least one fixed point (a point x for which F(x) = x), under some conditions on F that can be stated in general terms. The Banach fixed-point theorem (1922) gives a general criterion guaranteeing that, if it is satisfied, the procedure of iterating a function yields a fixed point.

Fixed point (mathematics)

hatnote|1=Fixed points in mathematics are not to be confused with other uses of "fixed point", or stationary points where math|1=f(x) = 0. In mathematics, a fixed point (sometimes shortened to fixpoint), also known as an invariant point, is a value that does not change under a given transformation. Specifically for functions, a fixed point is an element that is mapped to itself by the function. Formally, c is a fixed point of a function f if c belongs to both the domain and the codomain of f, and f(c) = c.

Fixed-point arithmetic

In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, representing the cents (1/100 of dollar). More generally, the term may refer to representing fractional values as integer multiples of some fixed small unit, e.g. a fractional amount of hours as an integer multiple of ten-minute intervals.

Fixed-point iteration

In numerical analysis, fixed-point iteration is a method of computing fixed points of a function. More specifically, given a function defined on the real numbers with real values and given a point in the domain of , the fixed-point iteration is which gives rise to the sequence of iterated function applications which is hoped to converge to a point . If is continuous, then one can prove that the obtained is a fixed point of , i.e., More generally, the function can be defined on any metric space with values in that same space.

Decimal floating point

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions. The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values.

Double-precision floating-point format

Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.

Kakutani fixed-point theorem

In mathematical analysis, the Kakutani fixed-point theorem is a fixed-point theorem for set-valued functions. It provides sufficient conditions for a set-valued function defined on a convex, compact subset of a Euclidean space to have a fixed point, i.e. a point which is mapped to a set containing it. The Kakutani fixed point theorem is a generalization of the Brouwer fixed point theorem. The Brouwer fixed point theorem is a fundamental result in topology which proves the existence of fixed points for continuous functions defined on compact, convex subsets of Euclidean spaces.

Half-precision floating-point format

In computing, half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular and neural networks. Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits.

Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision. This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables.

Single-precision floating-point format

Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 231 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2−23) × 2127 ≈ 3.

Lefschetz fixed-point theorem

In mathematics, the Lefschetz fixed-point theorem is a formula that counts the fixed points of a continuous mapping from a compact topological space to itself by means of traces of the induced mappings on the homology groups of . It is named after Solomon Lefschetz, who first stated it in 1926. The counting is subject to an imputed multiplicity at a fixed point called the fixed-point index. A weak version of the theorem is enough to show that a mapping without any fixed point must have rather special topological properties (like a rotation of a circle).

Floating-point error mitigation

Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed. Huberto M. Sierra noted in his 1956 patent "Floating Decimal Point Arithmetic Control Means for Calculator": Thus under some conditions, the major portion of the significant data digits may lie beyond the capacity of the registers.

Brouwer fixed-point theorem

Brouwer's fixed-point theorem is a fixed-point theorem in topology, named after L. E. J. (Bertus) Brouwer. It states that for any continuous function mapping a nonempty compact convex set to itself there is a point such that . The simplest forms of Brouwer's theorem are for continuous functions from a closed interval in the real numbers to itself or from a closed disk to itself. A more general form than the latter is for continuous functions from a nonempty convex compact subset of Euclidean space to itself.

Deep learning

Deep learning is part of a broader family of machine learning methods, which is based on artificial neural networks with representation learning. The adjective "deep" in deep learning refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

Floating-point unit

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers. Typical operations are addition, subtraction, multiplication, division, and square root. Some FPUs can also perform various transcendental functions such as exponential or trigonometric calculations, but the accuracy can be very low, so that some systems prefer to compute these functions in software.

Fixed-point theorems in infinite-dimensional spaces

In mathematics, a number of fixed-point theorems in infinite-dimensional spaces generalise the Brouwer fixed-point theorem. They have applications, for example, to the proof of existence theorems for partial differential equations. The first result in the field was the Schauder fixed-point theorem, proved in 1930 by Juliusz Schauder (a previous result in a different vein, the Banach fixed-point theorem for contraction mappings in complete metric spaces was proved in 1922). Quite a number of further results followed.

IEEE 754

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. Many hardware floating-point units use the IEEE 754 standard.

Dynamic range

Dynamic range (abbreviated DR, DNR, or DYR) is the ratio between the largest and smallest values that a certain quantity can assume. It is often used in the context of signals, like sound and light. It is measured either as a ratio or as a base-10 (decibel) or base-2 (doublings, bits or stops) logarithmic value of the difference between the smallest and largest signal values. Electronically reproduced audio and video is often processed to fit the original material with a wide dynamic range into a narrower recorded dynamic range that can more easily be stored and reproduced; this processing is called dynamic range compression.

Multi-exposure HDR capture

In photography and videography, multi-exposure HDR capture is a technique that creates extended or high dynamic range (HDR) images by taking and combining multiple exposures of the same subject matter at different exposure levels. Combining multiple images in this way results in an image with a greater dynamic range than what would be possible by taking one single image. The technique can also be used to capture video by taking and combining multiple exposures for each frame of the video.