[FEATURE] Add support for columnar data buffers to reduce memory usage #548

# Background

Power Grid Model currently uses row-based buffers to share data across the C-API. For example, the memory layout of a node input buffer looks like:

```
| XXXX OOOO  XXXXXXXX  | XXXX OOOO  XXXXXXXX  | ... |
| id_0       u_rated_0 | id_1       u_rated_1 | ... |
```

Here `X` is a meaningful byte and `O` is a padding byte added to align the memory.

This way, the input/update/output buffers match the data structs used in the calculation core exactly. This benefits performance: because the memory layouts are identical, the core can use the buffers without any copies.
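The padding in the layout above can be reproduced with a small sketch using Python's `ctypes`. The `NodeInput` class is only an illustrative stand-in with the two attributes shown in the diagram (a 4-byte `id` and an 8-byte `u_rated`); it is not the actual struct definition used by the core, and the exact padding depends on the platform ABI.

```python
import ctypes


class NodeInput(ctypes.Structure):
    # Illustrative stand-in: 4-byte id followed by 8-byte u_rated,
    # laid out with the platform's native alignment rules.
    _fields_ = [
        ("id", ctypes.c_int32),
        ("u_rated", ctypes.c_double),
    ]


# Bytes that carry data (the X bytes in the diagram).
meaningful_bytes = ctypes.sizeof(ctypes.c_int32) + ctypes.sizeof(ctypes.c_double)
# Bytes actually occupied by one row, including alignment padding (the O bytes).
row_bytes = ctypes.sizeof(NodeInput)
padding_bytes = row_bytes - meaningful_bytes

print(f"row: {row_bytes} bytes, data: {meaningful_bytes} bytes, padding: {padding_bytes} bytes")
print(f"u_rated starts at byte offset {NodeInput.u_rated.offset}")
```

On common 64-bit platforms, where a `double` is aligned to 8 bytes, one would expect 4 padding bytes between `id` and `u_rated`, which matches the `OOOO` block in the diagram.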

When some attributes need to be left unspecified in input/update data, we set them to the pre-defined [`null`](https://power-grid-model.readthedocs.io/en/stable/advanced_documentation/native-data-interface.html#basic-data-types) value so the core knows they should be ignored.

# Problem

While this design is very CPU-efficient, it can be memory-inefficient for several reasons:

1. If many attributes are unspecified, memory is wasted on storing many `null` values.
2. To align the data structure, the compiler adds padding, as shown above. That memory is never used.
3. If we are only interested in certain output attributes, we still need to allocate a buffer for all attributes. See the related issue #542.

There is a strong case to support columnar data buffers. We give two real-world examples of this issue.

## Example of update buffer

If we have an update buffer of 1000 scenarios of 1000 `sym_load`, the buffer size is `24 * 1000 * 1000 = 24,000,000` bytes. However, we might only need to specify `id` and `p_specified`. If we could provide these two arrays separately, the total buffer size would be `(8 + 4) * 1000 * 1000 = 12,000,000` bytes. That is a 50% reduction in memory footprint!

## Example of output buffer

If we get a result buffer of 1000 scenarios of 1000 `line`, the buffer size is `80 * 1000 * 1000 = 80,000,000` bytes. However, we might only need the `loading` output and not even `id`, since we already know the `id` order from the input. The buffer size would then be `8 * 1000 * 1000 = 8,000,000` bytes. We can save 90% of the memory footprint!
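The arithmetic of the two examples can be checked with a few lines of Python. The per-element sizes (24 and 80 bytes for the row-based structs, 4 bytes for `id`, 8 bytes for `p_specified` and `loading`) are taken from the numbers above; the helper names are only illustrative.

```python
N_SCENARIOS = 1000
N_ELEMENTS = 1000


def buffer_bytes(bytes_per_element):
    """Total buffer size for all scenarios and elements."""
    return bytes_per_element * N_SCENARIOS * N_ELEMENTS


def saving_percent(row_bytes, columnar_bytes):
    """Memory saved by the columnar layout, as a whole percentage."""
    return 100 * (row_bytes - columnar_bytes) // row_bytes


# Update buffer: full sym_load update struct vs. only id + p_specified.
update_row = buffer_bytes(24)
update_columnar = buffer_bytes(4 + 8)

# Output buffer: full line output struct vs. only loading.
output_row = buffer_bytes(80)
output_columnar = buffer_bytes(8)

print(update_row, update_columnar, saving_percent(update_row, update_columnar))
print(output_row, output_columnar, saving_percent(output_row, output_columnar))
```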

# Proposal

We propose to support columnar data buffers across the C-API (and further in the Python API). Both the PGM core and serialization need to support that.

## C-API

We already have the `dataset` concept at the C-API boundary. Therefore, this feature should not introduce any breaking change in the C-API. Concretely, we add functions named `PGM_dataset_*_add_attribute_buffer` that allow the user to add columnar attribute buffers to the dataset. The user can use the dataset as below:

```c++
// create a const dataset (non-batch, so the batch size is 1)
PGM_ConstDataset* dataset = PGM_create_dataset_const(handle, "input", 0, 1);

// add a row-based buffer for node
PGM_dataset_const_add_buffer(handle, dataset, "node", 5, 5, nullptr, node_buffer);

// add an empty buffer for line by passing nullptr as data, but with the size defined
// decision made: do it this way, because nullptr is a common way in a C API to signal that the buffer does not exist yet
PGM_dataset_const_add_buffer(handle, dataset, "line", 5, 5, nullptr, nullptr);
// then add the individual line attribute buffers
PGM_dataset_const_add_attribute_buffer(handle, dataset, "line", "id", line_id_buffer);
PGM_dataset_const_add_attribute_buffer(handle, dataset, "line", "r1", line_r1_buffer);

// add a row-based buffer for sym_load
PGM_dataset_const_add_buffer(handle, dataset, "sym_load", 5, 5, nullptr, sym_load_buffer);
// the following call should return an error: adding an attribute buffer is not allowed once the row-based buffer is set
PGM_dataset_const_add_attribute_buffer(handle, dataset, "sym_load", "p_specified", p_buffer);

// use the dataset for further actions, such as initializing the model or serialization
```
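The rule illustrated above (a component is either row-based or columnar, never both) can be sketched as a small toy model in Python. `ToyDataset` and its methods are hypothetical and only mimic the proposed behavior; they are not part of the C-API or the Python API.

```python
class ToyDataset:
    """Hypothetical stand-in for a PGM dataset; not the real API."""

    def __init__(self):
        self._row_buffers = {}        # component -> row-based buffer, or None
        self._attribute_buffers = {}  # component -> {attribute: column buffer}

    def add_buffer(self, component, data):
        # data=None registers the component without a row-based buffer,
        # mirroring the nullptr convention in the C++ example.
        self._row_buffers[component] = data
        self._attribute_buffers[component] = {}

    def add_attribute_buffer(self, component, attribute, data):
        if component not in self._row_buffers:
            raise KeyError(f"component {component!r} has no buffer registered")
        if self._row_buffers[component] is not None:
            raise ValueError(f"component {component!r} already has a row-based buffer")
        self._attribute_buffers[component][attribute] = data

    def is_columnar(self, component):
        return self._row_buffers[component] is None


dataset = ToyDataset()

# Columnar component: register without data, then add columns.
dataset.add_buffer("line", None)
dataset.add_attribute_buffer("line", "id", [1, 2, 3])
dataset.add_attribute_buffer("line", "r1", [0.1, 0.2, 0.3])

# Row-based component: adding a column afterwards is rejected.
dataset.add_buffer("sym_load", [(4, 1.0e3), (5, 2.0e3)])
try:
    dataset.add_attribute_buffer("sym_load", "p_specified", [1.0e3, 2.0e3])
    rejected = False
except ValueError:
    rejected = True
```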

## Python API

In the Python API, four non-breaking changes are expected.

1. Wherever a PGM dataset is expected, the user should be able to supply, for each component, either a numpy structured array (as returned by `initialize_array`) or a dictionary of homogeneous numpy arrays (e.g. `{"id": [1, 2], "u_rated": [150e3, 10e3]}`).
2. In the calculation functions, the user can specify the desired components and/or attributes in `output_component_types`. The Python wrapper needs to decide, per component, whether to create a structured array or a dictionary of homogeneous arrays. We need to figure out how to maintain backwards compatibility.
3. In deserialization, the user can specify whether they want row- or column-based arrays in the returned dataset. See the decision below for more information.
4. In serialization, if all arrays in the user-provided dataset are column-based, the user has to provide the dataset type because there is no way to deduce it. As long as there is at least one row-based array in the dataset, we can still deduce the dataset type.

Decision made on item 3 (deserialization):
Deserialization supports either row-based or column-based output, selected by an enum function argument. When deserializing to columnar data, the default is to deserialize all data present. The user can pass an optional function argument, called `filter`, to specify the desired components and attributes. In that case, deserialization is followed by a filter on those components and attributes. This behavior must be documented well, including the fact that providing a filter might result in loss of data.
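The filtering step can be sketched as below, assuming the deserialized columnar dataset is a plain dictionary of columns per component. The function `apply_filter`, the name `data_filter`, and the filter format (component mapped to a list of attributes, or `None` for all attributes) are hypothetical choices for illustration; the actual argument format is not fixed by this proposal.

```python
def apply_filter(dataset, data_filter=None):
    """Keep only the requested components and attributes.

    dataset:     {component: {attribute: column}}
    data_filter: {component: [attribute, ...] or None}; None keeps everything.
    """
    if data_filter is None:
        return dataset

    filtered = {}
    for component, attributes in data_filter.items():
        if component not in dataset:
            continue  # requested component is not present in the data
        columns = dataset[component]
        if attributes is None:
            filtered[component] = dict(columns)
        else:
            filtered[component] = {a: columns[a] for a in attributes if a in columns}
    return filtered


deserialized = {
    "node": {"id": [1, 2], "u_rated": [150e3, 10e3]},
    "line": {"id": [3], "r1": [0.1]},
}

# Only node ids are requested: u_rated and the whole line component are dropped.
only_node_ids = apply_filter(deserialized, {"node": ["id"]})
```

The example also shows the data loss mentioned above: anything not named in the filter is absent from the result.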

## Make id optional for batch update dataset in columnar format

Users will want to provide a columnar batch update dataset without the `id` for a certain component. In that case, it should be inferred that the elements updated via the columnar buffers are in exactly the same order as in the input data. This is a realistic use case: it saves the user the extra step of assigning exactly the same `id` values as in the input data. The following Python code should work:

```python
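import numpy as np
from power_grid_model import PowerGridModel

# input_data, n_step and n_sym_load are assumed to be defined elsewhere
model = PowerGridModel(input_data=input_data)
result = model.calculate_power_flow(
    update_data={
        # no "id" column: each row follows the sym_load order of input_data
        "sym_load": {"p_specified": np.random.randn(n_step, n_sym_load)}
    }
)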
```

# Implementation Proposal

To make this feature possible, the following implementation steps are proposed for the C++ core:

1. (done) Rename [`DatasetHandler`](https://github.com/PowerGridModel/power-grid-model/blob/3c9b5bbc01267338bf61bee86598c18909493edb/power_grid_model_c/power_grid_model/include/power_grid_model/auxiliary/dataset_handler.hpp) to `Dataset`.
2. (done) Remove the old [`Dataset`](https://github.com/PowerGridModel/power-grid-model/blob/3c9b5bbc01267338bf61bee86598c18909493edb/power_grid_model_c/power_grid_model/include/power_grid_model/auxiliary/dataset.hpp).
3. In the new `Dataset`, add buffer control and iteration functionality. It can detect whether a component buffer is row-based or column-based and, for a column-based buffer, generate temporary objects that hold the full struct for `MainModel` to consume.
4. (done) Rewrite `MainModel` to use the new `Dataset`. This also relates to #431.
5. `Serializer` should directly read the row-based and column-based buffers and serialize them to `msgpack` and `json`.
6. `Deserializer` should write the attributes in either row-based or column-based form, depending on which buffers are set in the `WritableDataset`.
7. Make `id` optional in the update dataset: in the main core, `is_update_independent` needs special treatment to make `id` an optional attribute in the batch update dataset.
    1. `is_update_independent` should be determined per component instead of for the whole dataset, so that each component can have its own `sequence`.
    1. For a certain component, if the buffer is row-based:
        1. If the `id` of the row-based buffer is not all `NaN`, we use the current logic to determine whether the component is independent.
        1. If the `id` of the row-based buffer is all `NaN`:
            1. If the buffer is not uniform, or the buffer is uniform but `elements_per_scenario` is not the same as the number of elements in the input data (in the model), an error should be raised.
            1. If the above check passes, we assume the component buffer is independent and generate a `sequence` from `0` to `n_comp` for this component. The update function consumes this `sequence`, so it does not need to do an `id` lookup.
    1. For a certain component, if the buffer is columnar, we do the following:
        1. If the `id` attribute buffer is provided **and** it is not all `NaN`, we look at `id` to judge whether the component is independent. There is no need to create proxy objects, which would be a waste of time; we look at the `id` buffer directly.
        1. If the `id` attribute buffer is not provided **or** it is provided but all values are `NaN`:
            1. If the buffer is not uniform, or the buffer is uniform but `elements_per_scenario` is not the same as the number of elements in the input data (in the model), an error should be raised.
            1. If the above check passes, we assume the component buffer is independent and generate a `sequence` from `0` to `n_comp` for this component. The update function consumes this `sequence`, so it does not need to do an `id` lookup.
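The per-component decision in the last step can be sketched in Python as below. This is a simplified illustration under several assumptions, not the C++ implementation: `resolve_component` is a hypothetical name, a missing `id` is represented by `None` instead of the `null` value, a non-uniform buffer is encoded as `elements_per_scenario == -1`, the "current logic" is reduced to "every scenario lists the same ids in the same order", and the generated `sequence` is taken as the half-open range `0 .. n_comp - 1`. Row-based and columnar buffers share the same branches, with `ids=None` standing for a columnar buffer without an `id` column.

```python
def resolve_component(ids, elements_per_scenario, n_comp):
    """Return (is_independent, sequence) for one component.

    ids:                   list of per-scenario id lists, or None if no id column exists
    elements_per_scenario: uniform element count, or -1 if not uniform (assumed encoding)
    n_comp:                number of elements of this component in the model
    """
    has_ids = ids is not None and any(
        value is not None for scenario in ids for value in scenario
    )

    if has_ids:
        # Simplified stand-in for the existing logic: independent when
        # every scenario addresses the same ids in the same order.
        is_independent = all(scenario == ids[0] for scenario in ids)
        return is_independent, None

    # No usable ids: the buffer must be uniform and cover every element.
    # A non-uniform buffer (-1) fails this check as well.
    if elements_per_scenario != n_comp:
        raise ValueError(
            "without ids, the update buffer must be uniform and match the model size"
        )

    # Positions replace the id lookup in the update function.
    return True, list(range(n_comp))


# Columnar buffer without an id column, 3 sym_load in the model.
no_id_result = resolve_component(None, 3, 3)

# Ids given and identical in every scenario: handled by the existing logic.
same_ids_result = resolve_component([[1, 2], [1, 2]], 2, 5)

# Ids given but in a different order per scenario: not independent.
mixed_ids_result = resolve_component([[1, 2], [2, 1]], 2, 5)
```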

