# scalometer - parallel kernel benchmarking

This project provides a tool for benchmarking parallelization strategies with kernels found in HPC applications. It is designed to make adding kernels and parallelization strategies easy.

## Features

- **Kernel Registry**: A registry that allows the user to register and execute different computational kernels easily.
- **Parallelization Strategies**: Two strategies for parallelizing the execution of kernel loops:
  - **OpenMP**: Uses OpenMP directives to parallelize the outermost loop.
  - **Eventify**: Uses the Eventify tasking system for parallelism.
- **Kernel Execution**: Kernels such as **STREAM TRIAD** and **DAXPY** are implemented, and their execution can be timed and compared across different parallelization strategies.

## Contact

If you run into trouble or have a feature request, feel free to open an issue or a pull request. You can also contact the author, Patrick Lipka (patrick.lipka@sipearl.com).

## Project Structure

```
.
├── bin/                    # Compiled executable
├── include/                # Header files
│   ├── kernels.hpp         # Kernel and KernelRegistry declarations
│   ├── strategy.hpp        # Parallelization strategies (OpenMP, Eventify)
│   └── utils.hpp           # Utility functions for initialization
├── src/                    # Source files
│   ├── kernels.cpp         # Kernel and KernelRegistry implementations
│   ├── strategy.cpp        # Parallelization strategies (OpenMP, Eventify)
│   └── main.cpp            # Main entry point for benchmarking
├── Makefile                # Makefile to build the project
└── README.md               # Project documentation
```

## Requirements

- C++20 or higher
- OpenMP support (for the OpenMP parallelization strategy)

### Dependencies:

- **Eventify**: If you want to compile with Eventify (`ENABLE_EVENTIFY=YES`), ensure that the Eventify library is properly installed and that the environment variable `EVENTIFY_ROOT` points to the root directory of the Eventify installation.

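The requirements listed above can be checked with a few lines of C++ before building. The following is a minimal sketch and is not part of scalometer: the helper names (`has_cpp20`, `has_openmp`, `eventify_root`, `missing_requirements`) are hypothetical. It relies only on the standard `__cplusplus` macro, the `_OPENMP` macro that OpenMP-enabled compilers define, `std::getenv`, and the `EVENTIFY_ROOT` variable named in this README.

```cpp
// Hypothetical pre-build check; these helpers are NOT part of scalometer.
#include <cassert>   // assert, handy when exercising the helpers below
#include <cstdlib>   // std::getenv
#include <string>
#include <vector>

// True when the compiler reports C++20 or newer.
constexpr bool has_cpp20() { return __cplusplus >= 202002L; }

// True when OpenMP is enabled for this translation unit (e.g. via -fopenmp).
// Only the OpenMP strategy needs this.
constexpr bool has_openmp() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Value of EVENTIFY_ROOT, or an empty string when the variable is unset.
inline std::string eventify_root() {
    const char* root = std::getenv("EVENTIFY_ROOT");
    return root ? std::string(root) : std::string();
}

// Collects human-readable problems for a given configuration.
// `want_eventify` mirrors building with ENABLE_EVENTIFY=YES.
inline std::vector<std::string> missing_requirements(bool cpp20, bool openmp,
                                                     bool want_eventify,
                                                     const std::string& root) {
    std::vector<std::string> problems;
    if (!cpp20) problems.push_back("compiler does not report C++20 or newer");
    if (!openmp) problems.push_back("OpenMP is not enabled");
    if (want_eventify && root.empty())
        problems.push_back("EVENTIFY_ROOT is not set");
    return problems;
}
```

Calling `missing_requirements(has_cpp20(), has_openmp(), true, eventify_root())` from a small `main` and printing the returned strings would give a quick report. An empty result means none of these three checks failed. It does not verify that the Eventify installation itself is complete.
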
## Building the Project

To build the project, run:

```
make
```

By default, the project is compiled with Eventify enabled (`ENABLE_EVENTIFY=YES`). To build without Eventify, run:

```
ENABLE_EVENTIFY=NO make
```

The make command compiles the source files and generates an executable called `benchmark` in the `bin/` directory. As in the STREAM benchmark's Makefile, the vector sizes are defined by the preprocessor variable `VECTOR_SIZE`, which can be set in the Makefile.

### Clean Up

To remove all compiled files and the executable, run:

```
make clean
```

## Usage

### Running the Benchmark

To run a kernel benchmark, use the following command:

```
./bin/benchmark <kernel_name> <strategy> <num_threads>
```

- `<kernel_name>`: The name of the kernel to run. Example: `stream_triad`
- `<strategy>`: The parallelization strategy to use. Available options: `omp` (for OpenMP) and `eventify` (for Eventify).
- `<num_threads>`: The number of threads or tasks to use for parallel execution. Its meaning depends on the parallelization strategy (number of threads for OpenMP, number of tasks for Eventify).

### Example:

To run the `stream_triad` kernel with the OpenMP strategy using 4 threads:

```
./bin/benchmark stream_triad omp 4
```

To run the `daxpy` kernel with the Eventify strategy using 8 tasks:

```
./bin/benchmark daxpy eventify 8
```

### Error Handling

- If an invalid kernel name is provided, the program prints an error message and lists the available kernels.

Example of an invalid kernel name:

```
$ ./bin/benchmark invalid_kernel omp 4
Kernel not found: invalid_kernel
Available kernels are:
- stream_triad
- daxpy
```

## Adding New Kernels

To add a new kernel to the project, follow these steps:

1. **Define the Kernel**:
   - Open the `src/kernels.cpp` file and scroll to the section where new kernels are registered (around the `initialize_registry` function).
   - Use the existing kernels (`stream_triad` and `daxpy`) as templates. Create a new kernel by passing a lambda to the `register_kernel` method.
   - The number, types, and initialization of arguments can be chosen freely.
   - Note that you only need to provide the loop body (or the inner loops of a loop nest). The outer loop with induction variable `int i` is already defined as part of the parallelization strategy.

   For example, to add a new element-wise **vector product** kernel, you can do the following:

   ```
   registry->register_kernel("vector_product", [&]() {
       // Shared vectors so that prepare and execute refer to the same storage
       // (element type assumed to be double here)
       auto a = std::make_shared<std::vector<double>>();
       auto b = std::make_shared<std::vector<double>>();
       auto c = std::make_shared<std::vector<double>>();

       // Allocation and data initialization
       auto prepare = [=]() {
           a->resize(VECTOR_SIZE);
           b->resize(VECTOR_SIZE);
           c->resize(VECTOR_SIZE);
           initialize_vector(*b);
           initialize_vector(*c);
       };

       // Kernel logic: only the loop body is given, the strategy supplies the outer loop over i
       auto execute = [=](int kernel_start_idx, int kernel_end_idx, int n_tasks_or_threads) {
           strategy::execute_strategy(strategy_name, kernel_start_idx, kernel_end_idx, n_tasks_or_threads, [&](int i) {
               (*a)[i] = (*b)[i] * (*c)[i];  // Vector product operation
           });
       };

       return Kernel("vector_product", execute, prepare);
   });
   ```

   In this example:
   - `a`, `b`, and `c` are the vectors used for the operation.
   - `prepare` resizes all three vectors to `VECTOR_SIZE` and fills the inputs `b` and `c` with random values using the `initialize_vector` function.
   - `execute` contains the vector product logic, where each element of `a` is computed as the product of the corresponding elements of `b` and `c`.

2. **Register the Kernel**:
   - The new kernel is registered automatically when the `initialize_registry` function is called. This is done dynamically through the registry.

3. **Use the Kernel**:
   - Once you have added the kernel to the registry, you can run it just like the existing kernels using the `./bin/benchmark` command. For example:

   ```
   ./bin/benchmark vector_product omp 4
   ```

### Notes on Adding Kernels:

- Kernels must be registered with a **name** (e.g., `"vector_product"`) and should include the corresponding **allocations and data initialization** (`prepare`) and **kernel logic** (`execute`).
- For now, kernels must consist of at least an outer loop.
- The kernel's execution should be parallelizable with all available strategies (currently `omp` for OpenMP and `eventify` for the Eventify tasking library). You can add more strategies by extending the `strategy` namespace.
- The `VECTOR_SIZE` preprocessor variable defines the size of the input data and should be appropriate for the kernel you are implementing.

## Known Issues and Limitations

- The instantiation of Eventify's `task_system` is included in the kernel timing, which adds a constant overhead compared to OpenMP. On NVIDIA Grace, this overhead is 2.8 ms. Whether it should be included in the timing is still under discussion.

## Contributing

Feel free to submit issues or pull requests to improve the project.

## License

This project is licensed under the MIT License.