The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with adv ...
Machine learning (ML) applications are ubiquitous. They run in different environments such as datacenters, the cloud, and even on edge devices. Regardless of where they run, distributing ML training seems to be the only way to attain scalable, high-quality learning. B ...
Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to the main memory and processor caches, which causes variability in the communication costs. Unfortunately, database sys ...
We present a work-stealing algorithm for runtime scheduling of data-parallel operations in the context of shared-memory architectures on data sets with highly irregular workloads that are not known a priori to the scheduler. This scheduler can parallelize ...
Our vectorized Helmholtz solver runs at 85% efficiency on an NEC SX-5. The most time-consuming parts have been ported to SMP, NUMA, and cluster architectures. It is shown that an OpenMP version can deliver similar performance when run on a 16 process ...
Traditional reliable servers require costly design changes to the processor, use custom system or application software, or cannot scale beyond a few processing elements. We present TRUSS, a family of server architectures providing reliable, scalable comput ...