Publication

A Low-power Carry Cut-Back Approximate Adder with Fixed-point Implementation and Floating-point Precision

Related concepts (18)

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers. For example, 12.345 is a floating-point number in base ten with five digits of precision: However, unlike 12.345, 12.3456 is not a floating-point number in base ten with five digits of precision—it needs six digits of precision; the nearest floating-point number with only five digits is 12.

Decimal floating point

Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions. The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values.

Double-precision floating-point format

Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.

Single-precision floating-point format

Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 231 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2−23) × 2127 ≈ 3.

Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision. This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables.

Half-precision floating-point format

In computing, half precision (sometimes called FP16 or float16) is a binary floating-point computer number format that occupies 16 bits (two bytes in modern computers) in computer memory. It is intended for storage of floating-point values in applications where higher precision is not essential, in particular and neural networks. Almost all modern uses follow the IEEE 754-2008 standard, where the 16-bit base-2 format is referred to as binary16, and the exponent uses 5 bits.

Extended precision

Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats. Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).

Fixed-point arithmetic

In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, representing the cents (1/100 of dollar). More generally, the term may refer to representing fractional values as integer multiples of some fixed small unit, e.g. a fractional amount of hours as an integer multiple of ten-minute intervals.

Floating-point unit

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers. Typical operations are addition, subtraction, multiplication, division, and square root. Some FPUs can also perform various transcendental functions such as exponential or trigonometric calculations, but the accuracy can be very low, so that some systems prefer to compute these functions in software.

Kahan summation algorithm

In numerical analysis, the Kahan summation algorithm, also known as compensated summation, significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision floating-point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors), in effect extending the precision of the sum by the precision of the compensation variable.

Floating-point error mitigation

Floating-point error mitigation is the minimization of errors caused by the fact that real numbers cannot, in general, be accurately represented in a fixed space. By definition, floating-point error cannot be eliminated, and, at best, can only be managed. Huberto M. Sierra noted in his 1956 patent "Floating Decimal Point Arithmetic Control Means for Calculator": Thus under some conditions, the major portion of the significant data digits may lie beyond the capacity of the registers.

Efficient energy use

Efficient energy use, sometimes simply called energy efficiency, is the process of reducing the amount of energy required to provide products and services. For example, insulating a building allows it to use less heating and cooling energy to achieve and maintain a thermal comfort. Installing light-emitting diode bulbs, fluorescent lighting, or natural skylight windows reduces the amount of energy required to attain the same level of illumination compared to using traditional incandescent light bulbs.

Bfloat16 floating-point format

The bfloat16 (brain floating point) floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a truncated (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing. It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format.

Dynamic range

Dynamic range (abbreviated DR, DNR, or DYR) is the ratio between the largest and smallest values that a certain quantity can assume. It is often used in the context of signals, like sound and light. It is measured either as a ratio or as a base-10 (decibel) or base-2 (doublings, bits or stops) logarithmic value of the difference between the smallest and largest signal values. Electronically reproduced audio and video is often processed to fit the original material with a wide dynamic range into a narrower recorded dynamic range that can more easily be stored and reproduced; this processing is called dynamic range compression.

Energy conservation

Energy conservation is the effort to reduce wasteful energy consumption by using fewer energy services. This can be done by using energy more effectively (using less energy for continuous service) or changing one's behavior to use less service (for example, by driving less). Energy conservation can be achieved through efficient energy use, which has some advantages, including a reduction in greenhouse gas emissions and a smaller carbon footprint, as well as cost, water, and energy savings.

High dynamic range

High dynamic range (HDR) is a dynamic range higher than usual, synonyms are wide dynamic range, extended dynamic range, expanded dynamic range. The term is often used in discussing the dynamic range of various signals such as s, videos, audio or radio. It may apply to the means of recording, processing, and reproducing such signals including analog and digitized signals. The term is also the name of some of the technologies or techniques allowing to achieve high dynamic range images, videos, or audio.

Dynamic range compression

Dynamic range compression (DRC) or simply compression is an audio signal processing operation that reduces the volume of loud sounds or amplifies quiet sounds, thus reducing or compressing an audio signal's dynamic range. Compression is commonly used in sound recording and reproduction, broadcasting, live sound reinforcement and in some instrument amplifiers. A dedicated electronic hardware unit or audio software that applies compression is called a compressor.

Energy audit

An energy audit is an inspection survey and an analysis of energy flows for energy conservation in a building. It may include a process or system to reduce the amount of energy input into the system without negatively affecting the output. In commercial and industrial real estate, an energy audit is the first step in identifying opportunities to reduce energy expense and carbon footprint. When the object of study is an occupied building then reducing energy consumption while maintaining or improving human comfort, health and safety are of primary concern.