Almost trivial distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid
A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
OCI-compatible engine to deploy Linux containers in HPC environments.
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Distributed Communication-Optimal Shuffle and Transpose Algorithm
Material for tutorials and hands-on sessions about containers