
# scalometer - parallel kernel benchmarking

This project provides a tool for benchmarking parallelization strategies on kernels found in HPC applications.
It is designed to make adding kernels and parallelization strategies easy.

## Features

- **Kernel Registry**: A registry that allows the user to register and execute different computational kernels easily.
- **Parallelization Strategies**: Two strategies for parallelizing the execution of kernel loops:
  - **OpenMP**: Uses OpenMP directives to parallelize the outermost loop.
  - **Eventify**: Uses the Eventify tasking system for parallelism.
- **Kernel Execution**: Kernels such as **STREAM TRIAD** and **DAXPY** are implemented, and their execution can be timed and compared across different parallelization strategies.
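
For readers unfamiliar with the two kernels, the sketch below shows their textbook serial form. The function names and signatures are illustrative assumptions and are not taken from `src/kernels.cpp`; in scalometer the outer loop is supplied by the parallelization strategy rather than written inside the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// STREAM TRIAD (textbook form): a[i] = b[i] + scalar * c[i]
void stream_triad(std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, double scalar) {
  assert(a.size() == b.size() && b.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] = b[i] + scalar * c[i];
  }
}

// DAXPY (textbook form): y[i] = alpha * x[i] + y[i]
void daxpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = alpha * x[i] + y[i];
  }
}
```

Both kernels stream through their operands once and do very little arithmetic per element, so they are typically limited by memory bandwidth rather than by compute.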

## Contact

If you run into problems or have feature requests, feel free to open an issue or a pull request.
You can also contact the author, Patrick Lipka (patrick.lipka@sipearl.com).

## Project Structure

```
.
├── bin/              # Compiled executable
├── include/          # Header files
│   ├── kernels.hpp   # Kernel and KernelRegistry declarations
│   ├── strategy.hpp  # Parallelization strategies (OpenMP, Eventify)
│   └── utils.hpp     # Utility functions for initialization
├── src/              # Source files
│   ├── kernels.cpp   # Kernel and KernelRegistry implementations
│   ├── strategy.cpp  # Parallelization strategies (OpenMP, Eventify)
│   └── main.cpp      # Main entry point for benchmarking
├── Makefile          # Makefile to build the project
└── README.md         # Project documentation
```

## Requirements

- A C++ compiler supporting C++20 or later
- OpenMP support (for OpenMP parallelization strategy)

### Dependencies:

- **Eventify**: If you want to compile with Eventify (`ENABLE_EVENTIFY=YES`), ensure that the Eventify library is properly installed and that the environment variable `EVENTIFY_ROOT` points to the root directory of the Eventify installation.

## Building the Project

To build the project, run:

```
make
```

By default, the project is compiled with Eventify enabled (`ENABLE_EVENTIFY=YES`). To build without Eventify, run:

```
ENABLE_EVENTIFY=NO make
```

The make command will compile the source files and generate an executable called `benchmark` in the `bin/` directory.
Similar to the STREAM benchmark's Makefile, the vector sizes are defined by the preprocessor variable `VECTOR_SIZE`, which can be set in the Makefile.
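
As a rough illustration of how a compile-time size such as `VECTOR_SIZE` is typically consumed in C++, the sketch below guards the macro with a fallback and uses it to size a vector. The fallback value and the helper `make_vector` are assumptions made for this example and may not match the scalometer sources or Makefile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: fallback for builds that do not pass -DVECTOR_SIZE=<n>.
// The value actually set in the Makefile may differ.
#ifndef VECTOR_SIZE
#define VECTOR_SIZE 1000000
#endif

// Hypothetical helper: allocate `n` elements, VECTOR_SIZE by default.
std::vector<double> make_vector(double value,
                                std::size_t n = static_cast<std::size_t>(VECTOR_SIZE)) {
  assert(n > 0 && "vector size must be positive");
  return std::vector<double>(n, value);
}
```

Because the size is fixed at compile time, changing it requires a rebuild; with GCC or Clang a definition passed as `-DVECTOR_SIZE=<n>` takes precedence over the guarded fallback.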

### Clean Up

To remove all compiled files and the executable, run:

```
make clean
```

## Usage

### Running the Benchmark

To run a kernel benchmark, use the following command:

```
./bin/benchmark <kernel_name> <strategy> <num_threads_or_tasks>
```

- `<kernel_name>`: The name of the kernel to run. Example: `stream_triad`
- `<strategy>`: The parallelization strategy to use. Available options: `omp` (for OpenMP) and `eventify` (for Eventify).
- `<num_threads_or_tasks>`: The number of threads or tasks to use for parallel execution. This depends on the parallelization strategy (e.g., number of threads for OpenMP, number of tasks for Eventify).
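
The three positional arguments could be validated along the following lines. This is a minimal sketch, not the parsing code from `src/main.cpp`: the `BenchmarkArgs` struct and the `parse_args` function are hypothetical, and the kernel name is deliberately not checked here because that is the registry's job.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container for the three positional arguments.
struct BenchmarkArgs {
  std::string kernel_name;
  std::string strategy;
  int num_workers;  // threads for omp, tasks for eventify
};

// Returns std::nullopt if the argument list is malformed.
// `args` holds the arguments without the program name.
std::optional<BenchmarkArgs> parse_args(const std::vector<std::string>& args) {
  if (args.size() != 3) return std::nullopt;
  if (args[1] != "omp" && args[1] != "eventify") return std::nullopt;

  int n = 0;
  try {
    std::size_t pos = 0;
    n = std::stoi(args[2], &pos);
    if (pos != args[2].size()) return std::nullopt;  // trailing characters, e.g. "4x"
  } catch (const std::exception&) {
    return std::nullopt;  // not a number, or out of range for int
  }
  if (n <= 0) return std::nullopt;

  return BenchmarkArgs{args[0], args[1], n};
}
```

A `main` function would typically build the vector from `argv + 1` and print a usage line when the result is empty.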

### Example:

To run the `stream_triad` kernel with the OpenMP strategy using 4 threads:

```
./bin/benchmark stream_triad omp 4
```

To run the `daxpy` kernel with the Eventify strategy using 8 tasks:

```
./bin/benchmark daxpy eventify 8
```

### Error Handling

- If an invalid kernel name is provided, the program will print an error message and list available kernels.

Example of an invalid kernel name:

```
$ ./bin/benchmark invalid_kernel omp 4
Kernel not found: invalid_kernel
Available kernels are:
  - stream_triad
  - daxpy
```
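
One way to produce a message of this shape is sketched below. The function `lookup_error` and the use of a plain list of names are assumptions for illustration; the actual `KernelRegistry` may store and report its kernels differently.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns an empty string if `requested` is a known
// kernel, otherwise the error text including the list of available kernels.
std::string lookup_error(const std::vector<std::string>& kernel_names,
                         const std::string& requested) {
  const bool found = std::find(kernel_names.begin(), kernel_names.end(),
                               requested) != kernel_names.end();
  if (found) return "";

  std::ostringstream msg;
  msg << "Kernel not found: " << requested << "\n"
      << "Available kernels are:\n";
  for (const auto& name : kernel_names) {
    msg << "  - " << name << "\n";  // listed in registration order
  }
  return msg.str();
}
```

Keeping the names in a vector preserves registration order in the listing; a sorted container would list them alphabetically instead.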

## Adding New Kernels

To add a new kernel to the project, follow these steps:

1. **Define the Kernel**:
    - Open the `src/kernels.cpp` file and scroll to the section where new kernels are registered (around the `initialize_registry` function).
    - Use the existing kernels (`stream_triad` and `daxpy`) as templates. Create a new kernel by adding a lambda to the `register_kernel` method.
    - The number, types and initialization of arguments can be chosen freely.
    - Note that you only need to provide the loop body / inner loops of a loop nest. The outer loop with induction variable `int i` is defined as part of the parallelization strategy already.

    For example, to add a new **vector product** kernel, you can do the following:

    ```
    registry->register_kernel("vector_product", [&]() {
      auto a = std::make_shared<std::vector<float>>();
      auto b = std::make_shared<std::vector<float>>();
      auto c = std::make_shared<std::vector<float>>();

      auto prepare = [=]() {
        a->resize(VECTOR_SIZE);
        b->resize(VECTOR_SIZE);
        c->resize(VECTOR_SIZE);
        initialize_vector(*b);
        initialize_vector(*c);
      };

      auto execute = [=](int kernel_start_idx, int kernel_end_idx, int n_tasks_or_threads) {
        strategy::execute_strategy(strategy_name, kernel_start_idx, kernel_end_idx, n_tasks_or_threads, [&](int i) {
          (*a)[i] = (*b)[i] * (*c)[i];  // Vector product operation
        });
      };

      return Kernel("vector_product", execute, prepare);
    });
    ```

    In this example:
    - `a`, `b`, and `c` are the vectors used for the operation.
    - `prepare` resizes all three vectors to `VECTOR_SIZE` and fills the inputs `b` and `c` with random values using the `initialize_vector` function.
    - `execute` contains the vector product logic, where each element in vector `a` is computed as the product of corresponding elements in vectors `b` and `c`.

2. **Register the Kernel**:
    - No further step is needed: the new kernel is registered automatically when the `initialize_registry` function is called.

3. **Use the Kernel**:
    - Once you have added the kernel to the registry, you can run it just like the existing kernels using the `./bin/benchmark` command. For example:

    ```
    ./bin/benchmark vector_product omp 4
    ```
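
To make the division of labour between kernel and strategy concrete, the sketch below shows how a strategy might own the outer loop and call the user-supplied body once per index. The name `run_outer_loop`, its signature, and the half-open index range `[start, end)` are assumptions for this example; the real `strategy::execute_strategy` may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical driver: owns the outer loop and calls body(i) for each index
// in [start, end). When compiled with OpenMP (e.g. -fopenmp) the pragma
// distributes iterations across threads; otherwise it is ignored and the
// loop runs serially.
template <typename Body>
void run_outer_loop(int start, int end, int num_threads, Body&& body) {
  assert(start <= end && num_threads > 0);
  #pragma omp parallel for num_threads(num_threads)
  for (int i = start; i < end; ++i) {
    body(i);
  }
}

// The kernel only supplies the loop body, as in the vector_product example.
std::vector<float> vector_product(const std::vector<float>& b,
                                  const std::vector<float>& c) {
  assert(b.size() == c.size());
  std::vector<float> a(b.size());
  run_outer_loop(0, static_cast<int>(a.size()), 2,
                 [&](int i) { a[i] = b[i] * c[i]; });
  return a;
}
```

Each iteration writes a distinct element of `a`, so the body is safe to run in parallel without synchronization. Kernels whose iterations share state would need extra care.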

### Notes on Adding Kernels:

- Kernels must be registered with a **name** (e.g., `"vector_product"`) and should include the corresponding **allocations and data initialization** (`prepare`) and **kernel logic** (`execute`).
- For now, every kernel must consist of at least an outer loop.
- The kernel should be parallelizable with all available strategies, currently `omp` (OpenMP) and `eventify` (Eventify tasking library). You can add more strategies by extending the `strategy` namespace.
- The `VECTOR_SIZE` preprocessor variable defines the size of the input data and should be appropriate for the kernel you are implementing.
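
Adding a strategy could look roughly like the following. The sketch uses a separate namespace `strategy_sketch`, a hypothetical `serial` strategy, and a name-based dispatcher; none of these names come from `include/strategy.hpp`, and the real dispatch mechanism may be organized differently.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

namespace strategy_sketch {

using LoopBody = std::function<void(int)>;

// Hypothetical extra strategy: run the outer loop on the calling thread.
// Useful as a baseline when comparing against parallel strategies.
void serial(int start, int end, const LoopBody& body) {
  assert(start <= end);
  for (int i = start; i < end; ++i) {
    body(i);
  }
}

// Hypothetical dispatcher: selects a strategy by name and throws on unknown
// names. Further strategies would be added as additional branches.
void execute(const std::string& name, int start, int end, int n_workers,
             const LoopBody& body) {
  if (name == "serial") {
    (void)n_workers;  // the serial strategy ignores the worker count
    serial(start, end, body);
  } else {
    throw std::invalid_argument("Unknown strategy: " + name);
  }
}

}  // namespace strategy_sketch
```

Because every strategy receives the same loop body, a kernel written once can be timed under each strategy without modification.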

## Known Issues and Limitations

- The instantiation of Eventify's `task_system` is included in the kernel timing, which adds a constant overhead compared to OpenMP. On NVIDIA Grace, this overhead is 2.8 ms. Whether to include it in the timing is still under discussion.
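
One way to make such an overhead visible is to time the setup phase and the kernel phase separately. The sketch below uses `std::chrono::steady_clock`; the `Timing` struct and `time_phases` function are hypothetical and do not reflect how scalometer currently measures.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type: both durations in milliseconds.
struct Timing {
  double setup_ms;
  double kernel_ms;
};

// Runs `setup` and then `kernel`, timing each phase on its own.
Timing time_phases(const std::function<void()>& setup,
                   const std::function<void()>& kernel) {
  assert(setup && kernel);  // both callables must be set
  using clock = std::chrono::steady_clock;
  using ms = std::chrono::duration<double, std::milli>;

  const auto t0 = clock::now();
  setup();   // e.g. constructing a tasking runtime
  const auto t1 = clock::now();
  kernel();  // the loop being benchmarked
  const auto t2 = clock::now();

  return Timing{ms(t1 - t0).count(), ms(t2 - t1).count()};
}
```

Reporting both values lets readers compare strategies with or without the constant setup cost, whichever convention the project settles on.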


## Contributing

Feel free to submit issues or pull requests to improve the project.

## License

This project is licensed under the MIT License.