| # scalometer - parallel kernel benchmarking
 | ||
| 
 | ||
| This project provides a tool for benchmarking parallelization strategies on kernels commonly found in HPC applications.
 | ||
| It is designed to make adding kernels and parallelization strategies easy.
 | ||
| 
 | ||
| ## Features
 | ||
| 
 | ||
| - **Kernel Registry**: A registry that allows the user to register and execute different computational kernels easily.
 | ||
| - **Parallelization Strategies**: Two strategies for parallelizing the execution of kernel loops:
 | ||
|   - **OpenMP**: Uses OpenMP directives to parallelize the outermost loop.
 | ||
|   - **Eventify**: Uses the Eventify tasking system for parallelism.
 | ||
| - **Kernel Execution**: Kernels such as **STREAM TRIAD** and **DAXPY** are implemented, and their execution can be timed and compared across different parallelization strategies.
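To make the registry idea concrete, here is a minimal sketch of a name-to-callable registry. It is an illustration under assumptions, not the project's `KernelRegistry`: the class `SimpleRegistry` and its methods `run` and `names` are hypothetical, and the real registry stores `Kernel` objects rather than plain callables.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal registry; the project's KernelRegistry may differ.
class SimpleRegistry {
 public:
  using KernelFn = std::function<void()>;

  // Store a callable under a name. Registering the same name again
  // replaces the earlier entry.
  void register_kernel(const std::string& name, KernelFn fn) {
    kernels_[name] = std::move(fn);
  }

  // Run the named kernel. Returns false if no kernel has that name.
  bool run(const std::string& name) const {
    auto it = kernels_.find(name);
    if (it == kernels_.end()) return false;
    it->second();
    return true;
  }

  // Registered names in sorted order (std::map keeps its keys ordered).
  std::vector<std::string> names() const {
    std::vector<std::string> out;
    for (const auto& entry : kernels_) out.push_back(entry.first);
    return out;
  }

 private:
  std::map<std::string, KernelFn> kernels_;
};
```

Looking kernels up by name is what lets the command line select a kernel with a plain string such as `stream_triad`.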
 | ||
| 
 | ||
| ## Contact
 | ||
| 
 | ||
| If you run into problems or have feature requests, feel free to open an issue or a pull request.
 | ||
| You can also contact the author, Patrick Lipka (patrick.lipka@sipearl.com).
 | ||
| 
 | ||
| ## Project Structure
 | ||
| ```
 | ||
| .
 | ||
| ├── bin/              # Compiled executable
 | ||
| ├── include/          # Header files
 | ||
| │   ├── kernels.hpp   # Kernel and KernelRegistry declarations
 | ||
| │   ├── strategy.hpp  # Parallelization strategies (OpenMP, Eventify)
 | ||
| │   └── utils.hpp     # Utility functions for initialization
 | ||
| ├── src/              # Source files
 | ||
| │   ├── kernels.cpp   # Kernel and KernelRegistry implementations
 | ||
| │   ├── strategy.cpp  # Parallelization strategies (OpenMP, Eventify)
 | ||
| │   └── main.cpp      # Main entry point for benchmarking
 | ||
| ├── Makefile          # Makefile to build the project
 | ||
| └── README.md         # Project documentation
 | ||
| ```
 | ||
| ## Requirements
 | ||
| 
 | ||
| - A C++ compiler supporting C++20 or later
 | ||
| - OpenMP support (for OpenMP parallelization strategy)
 | ||
| - Eventify library (for Eventify parallelization strategy)
 | ||
| - Limitation: All implemented parallelization strategies must currently be installed. Selective compilation of strategies might be added later if needed.
 | ||
| 
 | ||
| ### Dependencies:
 | ||
| 
 | ||
| - **Eventify**: Ensure that the Eventify library is properly installed and the environment variable `EVENTIFY_ROOT` points to the root directory of the Eventify installation.
 | ||
| 
 | ||
| ## Building the Project
 | ||
| 
 | ||
| To build the project, run:
 | ||
| 
 | ||
| ```
 | ||
| make
 | ||
| ```
 | ||
| 
 | ||
| This will compile the source files and generate an executable called `benchmark` in the `bin/` directory.
 | ||
| As in the STREAM benchmark's Makefile, the vector size is defined by the preprocessor variable `VECTOR_SIZE`, which can be set in the Makefile.
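A compile-time size like this is commonly paired with a guarded default so the code still builds when the variable is not set. The sketch below assumes the Makefile passes the value as `-DVECTOR_SIZE=<n>`. The default of `10000000` and the helper `working_set_bytes` are placeholders for illustration and are not taken from the project.

```cpp
#include <cstddef>

// Fallback used only when the build does not define VECTOR_SIZE.
// The value is an arbitrary placeholder, not the project's default.
#ifndef VECTOR_SIZE
#define VECTOR_SIZE 10000000
#endif

// Memory footprint in bytes of `n_vectors` vectors of VECTOR_SIZE elements,
// each element occupying `element_size` bytes.
constexpr std::size_t working_set_bytes(std::size_t n_vectors,
                                        std::size_t element_size) {
  return n_vectors * static_cast<std::size_t>(VECTOR_SIZE) * element_size;
}
```

For example, `working_set_bytes(3, sizeof(double))` gives the footprint of a three-vector kernel such as STREAM TRIAD, which helps when choosing a size that exceeds the last-level cache.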
 | ||
| 
 | ||
| ### Clean Up
 | ||
| 
 | ||
| To remove all compiled files and the executable, run:
 | ||
| 
 | ||
| ```
 | ||
| make clean
 | ||
| ```
 | ||
| 
 | ||
| ## Usage
 | ||
| 
 | ||
| ### Running the Benchmark
 | ||
| 
 | ||
| To run a kernel benchmark, use the following command:
 | ||
| 
 | ||
| ```
 | ||
| ./bin/benchmark <kernel_name> <strategy> <num_threads_or_tasks>
 | ||
| ```
 | ||
| 
 | ||
| - `<kernel_name>`: The name of the kernel to run. Example: `stream_triad`
 | ||
| - `<strategy>`: The parallelization strategy to use. Available options: `omp` (for OpenMP) and `eventify` (for Eventify).
 | ||
| - `<num_threads_or_tasks>`: The number of threads or tasks to use for parallel execution. This depends on the parallelization strategy (e.g., number of threads for OpenMP, number of tasks for Eventify).
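The three positional arguments could be validated roughly as follows. This is a sketch under assumptions: `BenchmarkArgs` and `parse_args` are hypothetical names, and the project's `main.cpp` may parse and validate its arguments differently.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical container for the three positional arguments.
struct BenchmarkArgs {
  std::string kernel_name;
  std::string strategy;
  int n_tasks_or_threads;
};

// Returns std::nullopt on a wrong argument count, an unknown strategy,
// or a count that is not a positive integer.
std::optional<BenchmarkArgs> parse_args(int argc, const char* const argv[]) {
  if (argc != 4) return std::nullopt;

  const std::string strategy = argv[2];
  if (strategy != "omp" && strategy != "eventify") return std::nullopt;

  int n = 0;
  try {
    const std::string count = argv[3];
    std::size_t pos = 0;
    n = std::stoi(count, &pos);
    if (pos != count.size()) return std::nullopt;  // trailing garbage, e.g. "4x"
  } catch (const std::exception&) {
    return std::nullopt;  // not a number, or out of range for int
  }
  if (n <= 0) return std::nullopt;

  return BenchmarkArgs{argv[1], strategy, n};
}
```

The kernel name is not checked here because it depends on what the registry contains at run time.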
 | ||
| 
 | ||
| ### Example:
 | ||
| 
 | ||
| To run the `stream_triad` kernel with the OpenMP strategy using 4 threads:
 | ||
| 
 | ||
| ```
 | ||
| ./bin/benchmark stream_triad omp 4
 | ||
| ```
 | ||
| 
 | ||
| To run the `daxpy` kernel with the Eventify strategy using 8 tasks:
 | ||
| 
 | ||
| ```
 | ||
| ./bin/benchmark daxpy eventify 8
 | ||
| ```
 | ||
| 
 | ||
| ### Error Handling
 | ||
| 
 | ||
| - If an invalid kernel name is provided, the program will print an error message and list available kernels.
 | ||
| 
 | ||
| Example of an invalid kernel name:
 | ||
| 
 | ||
| ```
 | ||
| $ ./bin/benchmark invalid_kernel omp 4
 | ||
| Kernel not found: invalid_kernel
 | ||
| Available kernels are:
 | ||
|   - stream_triad
 | ||
|   - daxpy
 | ||
| ```
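A message in this shape can be assembled from the requested name and the list of registered names. The helper below is a hypothetical sketch that reproduces the output shown above. It is not the project's actual error path.

```cpp
#include <string>
#include <vector>

// Builds the "Kernel not found" message for an unknown kernel name,
// listing each available kernel on its own line.
std::string unknown_kernel_message(const std::string& requested,
                                   const std::vector<std::string>& available) {
  std::string msg = "Kernel not found: " + requested + "\n";
  msg += "Available kernels are:\n";
  for (const auto& name : available) {
    msg += "  - " + name + "\n";
  }
  return msg;
}
```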
 | ||
| 
 | ||
| ## Adding New Kernels
 | ||
| 
 | ||
| To add a new kernel to the project, follow these steps:
 | ||
| 
 | ||
| 1. **Define the Kernel**:
 | ||
|     - Open the `src/kernels.cpp` file and scroll to the section where new kernels are registered (around the `initialize_registry` function).
 | ||
|     - Use the existing kernels (`stream_triad` and `daxpy`) as templates. Create a new kernel by adding a lambda to the `register_kernel` method.
 | ||
|     - The number, types, and initialization of arguments can be chosen freely.
 | ||
|     - Note that you only need to provide the loop body / inner loops of a loop nest. The outer loop with induction variable `int i` is defined as part of the parallelization strategy already.
 | ||
| 
 | ||
|     For example, to add a new **vector product** kernel, you can do the following:
 | ||
| 
 | ||
|     ```
 | ||
|     registry->register_kernel("vector_product", [&]() {
 | ||
|       auto a = std::make_shared<std::vector<float>>();
 | ||
|       auto b = std::make_shared<std::vector<float>>();
 | ||
|       auto c = std::make_shared<std::vector<float>>();
 | ||
| 
 | ||
|       auto prepare = [=]() {
 | ||
|         a->resize(VECTOR_SIZE);
 | ||
|         b->resize(VECTOR_SIZE);
 | ||
|         c->resize(VECTOR_SIZE);
 | ||
|         initialize_vector(*b);
 | ||
|         initialize_vector(*c);
 | ||
|       };
 | ||
| 
 | ||
|       auto execute = [=](int kernel_start_idx, int kernel_end_idx, int n_tasks_or_threads) {
 | ||
|         strategy::execute_strategy(strategy_name, kernel_start_idx, kernel_end_idx, n_tasks_or_threads, [&](int i) {
 | ||
|           (*a)[i] = (*b)[i] * (*c)[i];  // Vector product operation
 | ||
|         });
 | ||
|       };
 | ||
| 
 | ||
|       return Kernel("vector_product", execute, prepare);
 | ||
|     });
 | ||
|     ```
 | ||
| 
 | ||
|     In this example:
 | ||
|     - `a`, `b`, and `c` are the vectors used for the operation.
 | ||
|     - `prepare` resizes all three vectors and fills the inputs `b` and `c` with random values using the `initialize_vector` function.
 | ||
|     - `execute` contains the vector product logic, where each element in vector `a` is computed as the product of corresponding elements in vectors `b` and `c`.
 | ||
| 
 | ||
| 2. **Register the Kernel**:
 | ||
|     - No separate registration step is needed. The `register_kernel` call added in step 1 runs when `initialize_registry` is called, which makes the kernel available by name.
 | ||
| 
 | ||
| 3. **Use the Kernel**:
 | ||
|     - Once you have added the kernel to the registry, you can run it just like the existing kernels using the `./bin/benchmark` command. For example:
 | ||
| 
 | ||
|     ```
 | ||
|     ./bin/benchmark vector_product omp 4
 | ||
|     ```
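Step 1 notes that a kernel only supplies the loop body while the strategy owns the outer loop over `i`. The sketch below illustrates that split with an OpenMP-style loop. `run_outer_loop_omp` and `elementwise_product` are hypothetical stand-ins and not the project's `strategy::execute_strategy`. When compiled without OpenMP support the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// Hypothetical stand-in for an OpenMP-style strategy. The strategy owns
// the outer loop; the kernel only supplies body(i).
template <typename Body>
void run_outer_loop_omp(int start_idx, int end_idx, int n_threads, Body&& body) {
  // Iterations are distributed across n_threads threads when OpenMP is
  // enabled; otherwise this is an ordinary serial loop.
  #pragma omp parallel for num_threads(n_threads)
  for (int i = start_idx; i < end_idx; ++i) {
    body(i);
  }
}

// Example kernel body: element-wise product, as in the vector_product example.
// Assumes b and c have the same length.
std::vector<float> elementwise_product(const std::vector<float>& b,
                                       const std::vector<float>& c,
                                       int n_threads) {
  std::vector<float> a(b.size());
  run_outer_loop_omp(0, static_cast<int>(b.size()), n_threads,
                     [&](int i) { a[i] = b[i] * c[i]; });
  return a;
}
```

Because each iteration writes only to its own element `a[i]`, the body is safe to run concurrently. A kernel whose iterations share state would need additional synchronization or a reduction.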
 | ||
| 
 | ||
| ### Notes on Adding Kernels:
 | ||
| 
 | ||
| - Kernels must be registered with a **name** (e.g., `"vector_product"`) and should include the corresponding **allocations and data initialization** (`prepare`) and **kernel logic** (`execute`).
 | ||
| - For now, every kernel must consist of at least an outer loop.
 | ||
| - The kernel’s execution should be parallelizable with all available strategies (currently `omp` for OpenMP and `eventify` for Eventify). You can add more strategies by extending the `strategy` namespace.
 | ||
| - The `VECTOR_SIZE` preprocessor variable defines the size of the input data and should be appropriate for the kernel you are implementing.
 | ||
| 
 | ||
| ## Known Issues and Limitations
 | ||
| - The instantiation of Eventify's `task_system` is included in the kernel timing, which adds a constant overhead compared to OpenMP. On NVIDIA Grace, this overhead is 2.8 ms. Whether it should be included in the timing is still under discussion.
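One way to make such an overhead visible is to time the setup phase and the kernel phase separately and report both. The sketch below uses `std::chrono::steady_clock`. `time_ms`, `PhaseTimes`, and `time_phases` are hypothetical helpers, and the project's own timing code may be structured differently.

```cpp
#include <chrono>
#include <utility>

// Wall-clock duration of fn() in milliseconds, using a monotonic clock.
template <typename Fn>
double time_ms(Fn&& fn) {
  const auto t0 = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical split of one benchmark run into setup and kernel phases.
struct PhaseTimes {
  double setup_ms;
  double kernel_ms;
  double total_ms() const { return setup_ms + kernel_ms; }
};

// Runs setup() and then kernel(), timing each phase on its own.
template <typename Setup, typename Kernel>
PhaseTimes time_phases(Setup&& setup, Kernel&& kernel) {
  PhaseTimes t{};
  t.setup_ms = time_ms(std::forward<Setup>(setup));
  t.kernel_ms = time_ms(std::forward<Kernel>(kernel));
  return t;
}
```

Reporting `setup_ms` alongside `kernel_ms` keeps the constant cost visible without forcing a decision on whether it belongs in the headline number.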
 | ||
| 
 | ||
| 
 | ||
| ## Contributing
 | ||
| 
 | ||
| Feel free to submit issues or pull requests to improve the project.
 | ||
| 
 | ||
| ## License
 | ||
| 
 | ||
| This project is licensed under the MIT License.
 |