Patrick Lipka 6cfc1a9fa9 Typo removed		2024-12-16 17:51:25 +01:00
include	Conditional compilation of eventify	2024-12-16 14:51:38 +01:00
src	Applied Google formatting	2024-12-13 12:01:35 +01:00
.gitignore	Initial commit	2024-12-13 00:33:08 +01:00
LICENSE	Initial commit	2024-12-13 00:30:40 +01:00
Makefile	Conditional compilation of Eventify corrected	2024-12-16 14:28:21 +01:00
README.md	Typo removed	2024-12-16 17:51:25 +01:00

README.md

scalometer - parallel kernel benchmarking

This project provides a tool for benchmarking parallelization strategies on kernels found in HPC applications. It is designed to make adding kernels and parallelization strategies easy.

Features

Kernel Registry: A registry that allows the user to register and execute different computational kernels easily.
Parallelization Strategies: Two strategies for parallelizing the execution of kernel loops:
- OpenMP: Uses OpenMP directives to parallelize the outermost loop.
- Eventify: Uses the Eventify tasking system for parallelism.
Kernel Execution: Kernels such as STREAM TRIAD and DAXPY are implemented, and their execution can be timed and compared across different parallelization strategies.

Contact

If you run into problems or have feature requests, you are welcome to open issues and pull requests. You may also contact the author, Patrick Lipka (patrick.lipka@sipearl.com).

Project Structure

.
├── bin/              # Compiled executable
├── include/          # Header files
│   ├── kernels.hpp   # Kernel and KernelRegistry declarations
│   ├── strategy.hpp  # Parallelization strategies (OpenMP, Eventify)
│   └── utils.hpp     # Utility functions for initialization
├── src/              # Source files
│   ├── kernels.cpp   # Kernel and KernelRegistry implementations
│   ├── strategy.cpp  # Parallelization strategies (OpenMP, Eventify)
│   └── main.cpp      # Main entry point for benchmarking
├── Makefile          # Makefile to build the project
└── README.md         # Project documentation

Requirements

C++20 or higher
OpenMP support (for OpenMP parallelization strategy)

Dependencies:

Eventify: If you want to compile with Eventify (ENABLE_EVENTIFY=YES), ensure that the Eventify library is properly installed and that the environment variable EVENTIFY_ROOT points to the root directory of the Eventify installation.

Building the Project

To build the project, run:

make

By default, the project is compiled with Eventify enabled (ENABLE_EVENTIFY=YES). To build without Eventify, run:

ENABLE_EVENTIFY=NO make

The make command compiles the source files and generates an executable called benchmark in the bin/ directory. Similar to the STREAM benchmark's Makefile, the vector sizes are defined by the preprocessor variable VECTOR_SIZE, which can be set in the Makefile.
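
A compile-time size such as VECTOR_SIZE is commonly passed to the compiler with -D and given a fallback in the source. The following is a minimal sketch of that pattern, not the project's actual code: the fallback value 1000000 and the helper make_vector are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Fallback used only when the build does not pass -DVECTOR_SIZE=<n>.
// The value 1000000 is an assumed placeholder, not the project's default.
#ifndef VECTOR_SIZE
#define VECTOR_SIZE 1000000
#endif

// Hypothetical helper: allocate one zero-filled vector of VECTOR_SIZE elements.
inline std::vector<double> make_vector() {
  return std::vector<double>(static_cast<std::size_t>(VECTOR_SIZE), 0.0);
}
```

With a guard like this, a value supplied on the compiler command line (for example -DVECTOR_SIZE=20000000) takes precedence over the fallback.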

Clean Up

To remove all compiled files and the executable, run:

make clean

Usage

Running the Benchmark

To run a kernel benchmark, use the following command:

./bin/benchmark <kernel_name> <strategy> <num_threads_or_tasks>

<kernel_name>: The name of the kernel to run. Example: stream_triad
<strategy>: The parallelization strategy to use. Available options: omp (for OpenMP) and eventify (for Eventify).
<num_threads_or_tasks>: The number of threads or tasks to use for parallel execution. This depends on the parallelization strategy (e.g., number of threads for OpenMP, number of tasks for Eventify).

Example:

To run the stream_triad kernel with the OpenMP strategy using 4 threads:

./bin/benchmark stream_triad omp 4

To run the daxpy kernel with the Eventify strategy using 8 tasks:

./bin/benchmark daxpy eventify 8
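
The three positional arguments could be validated along the following lines. This is only a sketch under stated assumptions: the Options struct and parse_args function are hypothetical names, and rejecting unknown strategies or non-positive counts is an assumed policy, not documented behavior of the benchmark.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Hypothetical container for the three positional arguments.
struct Options {
  std::string kernel_name;
  std::string strategy;  // "omp" or "eventify"
  int num_workers;       // threads (OpenMP) or tasks (Eventify)
};

// Returns the parsed options, or std::nullopt if the argument count is wrong,
// the strategy is unknown, or the worker count is not a positive integer.
inline std::optional<Options> parse_args(int argc, const char* const argv[]) {
  if (argc != 4) return std::nullopt;
  Options opts{argv[1], argv[2], 0};
  if (opts.strategy != "omp" && opts.strategy != "eventify") {
    return std::nullopt;
  }
  try {
    std::size_t pos = 0;
    const std::string count = argv[3];
    opts.num_workers = std::stoi(count, &pos);
    // Reject trailing characters such as "8x" and non-positive counts.
    if (pos != count.size() || opts.num_workers <= 0) return std::nullopt;
  } catch (const std::exception&) {
    return std::nullopt;  // not a number, or out of range for int
  }
  return opts;
}
```

Called from main as parse_args(argc, argv), an empty result would be the point at which to print a usage message.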

Error Handling

If an invalid kernel name is provided, the program will print an error message and list available kernels.

Example of an invalid kernel name:

$ ./bin/benchmark invalid_kernel omp 4
Kernel not found: invalid_kernel
Available kernels are:
  - stream_triad
  - daxpy

Adding New Kernels

To add a new kernel to the project, follow these steps:

Define the Kernel:
- Open the src/kernels.cpp file and scroll to the section where new kernels are registered (around the initialize_registry function).
- Use the existing kernels (stream_triad and daxpy) as templates. Create a new kernel by adding a lambda to the register_kernel method.
- The number, types, and initialization of arguments can be chosen freely.
- Note that you only need to provide the loop body / inner loops of a loop nest. The outer loop with induction variable int i is defined as part of the parallelization strategy already.
For example, to add a new vector product kernel, you can do the following:
```
registry->register_kernel("vector_product", [&]() {
  auto a = std::make_shared<std::vector<float>>();
  auto b = std::make_shared<std::vector<float>>();
  auto c = std::make_shared<std::vector<float>>();

  auto prepare = [=]() {
    a->resize(VECTOR_SIZE);
    b->resize(VECTOR_SIZE);
    c->resize(VECTOR_SIZE);
    initialize_vector(*b);
    initialize_vector(*c);
  };

  auto execute = [=](int kernel_start_idx, int kernel_end_idx, int n_tasks_or_threads) {
    strategy::execute_strategy(strategy_name, kernel_start_idx, kernel_end_idx, n_tasks_or_threads, [&](int i) {
      (*a)[i] = (*b)[i] * (*c)[i];  // Vector product operation
    });
  };

  return Kernel("vector_product", execute, prepare);
});
```
In this example:
- a, b, and c are the vectors used for the operation.
- prepare resizes all three vectors and fills the inputs b and c with random values using the initialize_vector function.
- execute contains the vector product logic, where each element in vector a is computed as the product of corresponding elements in vectors b and c.
Register the Kernel:
- The new kernel should be automatically registered when the initialize_registry function is called. This is done dynamically through the registry.
Use the Kernel:
- Once you have added the kernel to the registry, you can run it just like the existing kernels using the ./bin/benchmark command. For example:
```
./bin/benchmark vector_product omp 4
```

Notes on Adding Kernels:

Kernels must be registered with a name (e.g., "vector_product") and should include the corresponding allocations and data initialization (prepare) and kernel logic (execute).
For now, kernels must consist of at least an outer loop.
The kernel’s execution should be parallelizable with all available strategies, currently omp (OpenMP) and eventify (the Eventify tasking library). You can add more strategies by extending the strategy namespace.
The VECTOR_SIZE preprocessor variable defines the size of the input data and should be appropriate for the kernel you are implementing.

Known Issues and Limitations

The instantiation of Eventify's task_system is included in the kernel timing, which adds a constant overhead compared to OpenMP. On NVIDIA Grace, this overhead is 2.8 ms. Whether it should be included is still under discussion.
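
One way to make this overhead visible is to time the runtime setup and the kernel separately. The sketch below assumes the setup step can be isolated in its own callable, which may not match how the project or Eventify is structured; Timing and time_separately are hypothetical names.

```cpp
#include <chrono>

// Result of one timed run, in milliseconds.
struct Timing {
  double setup_ms;
  double kernel_ms;
};

// Times `setup` (e.g. constructing a tasking runtime) and `kernel` separately
// so the one-time setup cost can be reported on its own or added back in.
template <typename Setup, typename Kernel>
Timing time_separately(Setup setup, Kernel kernel) {
  using clock = std::chrono::steady_clock;
  using ms = std::chrono::duration<double, std::milli>;

  const auto t0 = clock::now();
  setup();
  const auto t1 = clock::now();
  kernel();
  const auto t2 = clock::now();

  return Timing{ms(t1 - t0).count(), ms(t2 - t1).count()};
}
```

Reporting both numbers leaves the decision of whether to include the setup cost to the reader of the results.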

Contributing

Feel free to submit issues or pull requests to improve the project.

License

This project is licensed under the MIT License.
