Modern data center fabrics open the possibility of microsecond distributed applications, such as data stores and message queues. A challenging aspect of their development is to ensure that, besides being fast in the common case, these applications react fa ...
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) ...
We propose a novel, minimally intrusive approach to adding fault tolerance to existing complex scientific simulation codes, used for addressing a broad range of time-dependent problems on the next generation of supercomputers. Exascale systems have the pot ...
Remote Direct Memory Access (RDMA) is a technology that allows direct access from the network to a machine's main memory without involving its CPU. RDMA offers low-latency, high-bandwidth performance and low CPU utilization. While RDMA provides massive per ...
Power consumption has become a critical issue in large scale clusters. Existing solutions for addressing the servers' energy consumption suggest ``shrinking'' the set of active machines, at least until the more power-proport-ional hardware devices become a ...