Cluster analysisCluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, , information retrieval, bioinformatics, data compression, computer graphics and machine learning.
Hierarchical clusteringIn data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories: Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
GrassmannianIn mathematics, the Grassmannian Gr(k, V) is a space that parameterizes all k-dimensional linear subspaces of the n-dimensional vector space V. For example, the Grassmannian Gr(1, V) is the space of lines through the origin in V, so it is the same as the projective space of one dimension lower than V. When V is a real or complex vector space, Grassmannians are compact smooth manifolds.
InformationInformation is an abstract concept that refers to that which has the power to inform. At the most fundamental level, information pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natural process that is not completely random and any observable pattern in any medium can be said to convey some amount of information. Whereas digital signals and other data use discrete signs to convey information, other phenomena and artefacts such as analogue signals, poems, pictures, music or other sounds, and currents convey information in a more continuous form.
K-means clusteringk-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances.
Single-linkage clusteringIn statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other. This method tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters.
Linear combinationIn mathematics, a linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results (e.g. a linear combination of x and y would be any expression of the form ax + by, where a and b are constants). The concept of linear combinations is central to linear algebra and related fields of mathematics. Most of this article deals with linear combinations in the context of a vector space over a field, with some generalizations given at the end of the article.
Correlation clusteringClustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. Cluster analysis In machine learning, correlation clustering or cluster editing operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects.
Information theoryInformation theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field, in applied mathematics, is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering. A key measure in information theory is entropy.
Primitive data typeIn computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language (built-in types).
Determining the number of clusters in a data setDetermining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect.
Type systemIn computer programming, a type system is a logical system comprising a set of rules that assigns a property called a type (for example, integer, floating point, string) to every "term" (a word, phrase, or other set of symbols). Usually the terms are various constructs of a computer program, such as variables, expressions, functions, or modules. A type system dictates the operations that can be performed on a term. For variables, the type system determines the allowed values of that term.
Data typeIn computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these values as machine types. A data type specification in a program constrains the possible values that an expression, such as a variable or a function call, might take. On literal data, it tells the compiler or interpreter how the programmer intends to use the data.
Affine combinationIn mathematics, an affine combination of x1, ..., xn is a linear combination such that Here, x1, ..., xn can be elements (vectors) of a vector space over a field K, and the coefficients are elements of K. The elements x1, ..., xn can also be points of a Euclidean space, and, more generally, of an affine space over a field K. In this case the are elements of K (or for a Euclidean space), and the affine combination is also a point. See for the definition in this case.
Graph minorIn graph theory, an undirected graph H is called a minor of the graph G if H can be formed from G by deleting edges, vertices and by contracting edges. The theory of graph minors began with Wagner's theorem that a graph is planar if and only if its minors include neither the complete graph K5 nor the complete bipartite graph K3,3. The Robertson–Seymour theorem implies that an analogous forbidden minor characterization exists for every property of graphs that is preserved by deletions and edge contractions.
Linear subspaceIn mathematics, and more specifically in linear algebra, a linear subspace or vector subspace is a vector space that is a subset of some larger vector space. A linear subspace is usually simply called a subspace when the context serves to distinguish it from other types of subspaces. If V is a vector space over a field K and if W is a subset of V, then W is a linear subspace of V if under the operations of V, W is a vector space over K.
Algebraic data typeIn computer programming, especially functional programming and type theory, an algebraic data type (ADT) is a kind of composite type, i.e., a type formed by combining other types. Two common classes of algebraic types are product types (i.e., tuples and records) and sum types (i.e., tagged or disjoint unions, coproduct types or variant types). The values of a product type typically contain several values, called fields. All values of that type have the same combination of field types.
Type theoryIn mathematics, logic, and computer science, a type theory is the formal presentation of a specific type system, and in general, type theory is the academic study of type systems. Some type theories serve as alternatives to set theory as a foundation of mathematics. Two influential type theories that were proposed as foundations are Alonzo Church's typed λ-calculus and Per Martin-Löf's intuitionistic type theory. Most computerized proof-writing systems use a type theory for their foundation, a common one is Thierry Coquand's Calculus of Inductive Constructions.
Information managementInformation management (IM) is the appropriate and optimized capture, storage, retrieval, and use of information. It may be personal information management or organizational. IM for organizations concerns a cycle of organizational activity: the acquisition of information from one or more sources, the custodianship and the distribution of that information to those who need it, and its ultimate disposal through archiving or deletion.
Data transformation (computing)In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration. Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.