add first rfc
nkanazawa1989 committed Aug 25, 2022
1 parent e2502ff commit fdfb98a
Showing 1 changed file with 121 additions and 0 deletions.
121 changes: 121 additions & 0 deletions 0000-experiment-dataframe.md
@@ -0,0 +1,121 @@
# Dataframe for Qiskit Experiments

| **Status** | **Proposed** |
|:------------------|:---------------------------------------------|
| **RFC #** | #### |
| **Authors** | Naoki Kanazawa (nkanazawa1989@gmail.com) |
| **Deprecates** | N/A |
| **Submitted** | 2022-08-25 |
| **Updated** | 2022-08-25 |


## Summary
This RFC proposes a new internal data structure for Qiskit Experiments (QE).
The new data structure, based on the [pandas data frame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), aims to centralize all data generated by job execution, data processing, and subsequent analysis, and to let users resume or rerun analysis more intuitively.


## Motivation
With the current QE implementation, it is difficult for novice users to tune or customize the data analysis chain because of the complexity of the data structure. Data are often held in protected or private members, and every analysis step has its own data model. It is therefore hard to inspect the actual data, and almost impossible to resume analysis from a particular step without a hack. The new data structure gives users easy access to this internal data, with better handling and visibility.

## User Benefit
Experimentalists are the main beneficiaries of the new framework. They can try new analysis techniques directly on the experiment data without heavyweight changes such as modifying an analysis class implementation or implementing a new class. Experiment authors (developers) also benefit, especially when testing new analysis code.

## Design Proposal
The workflow of experiment analysis can be roughly divided into the following steps: (1) Extract data from the job `Result` object. (2) Pass the data to the data processing routine. This may be a dedicated [data processor](https://qiskit.org/documentation/experiments/stubs/qiskit_experiments.data_processing.DataProcessor.html) class, and it may perform discrimination, singular value decomposition, restless formatting, or computation of a probability or expectation value, depending on the configuration of the experiment. (3) Pass the processed data to the analysis routine. Some extra formatting may be performed here, such as computing a new quantity by combining two series, applying smoothing filters to the data, or averaging the outcomes over the same experiment settings.

In the current implementation, each of these steps has its own data model. In a typical [curve analysis](https://qiskit.org/documentation/experiments/apidocs/curve_analysis.html) workflow, for example, step 1 generates a `List[Dict[str, Any]]`, step 2 converts the data into an `ndarray` or [uncertainties UFloat](https://pythonhosted.org/uncertainties/) objects, and step 3 converts them into a dedicated `CurveData` class that manages x and y values with a series index. Some experiments consist of multiple sub-experiments (series) that are analyzed simultaneously, e.g. the [Ramsey XY](https://qiskit.org/documentation/experiments/stubs/qiskit_experiments.library.characterization.RamseyXY.html) experiment. These series are usually distinguished by circuit metadata, so each outcome value should be tied to the metadata of the circuit that produced it.

The pandas data frame fits this situation well. Each circuit execution generates a single row in the data frame, and each row consists of the outcome value along with the data state (raw, processed, formatted, etc.) and the circuit metadata. Such a data frame may look like:

<center>

index | outcome | slots | shots | status | xval | parameter | ...
----------|:---------:|:-----:|:-----:|-----------|:----:|:---------:|----
circuit-1 | CountData | 1 | 1024 | raw | 3 | X |
circuit-2 | CountData | 1 | 1024 | raw | 3 | Y |
circuit-3 | CountData | 1 | 1024 | raw | 4 | X |
...
data-1 | 1.3 | NA | 1024 | processed | 3 | X |
data-2 | 1.4 | NA | 1024 | processed | 3 | Y |
...
point-1 | 1.5 ± 0.2 | NA | 1024 | formatted | 3 | X |
point-2 | 1.6 ± 0.1 | NA | 1024 | formatted | 3 | Y |

</center>
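
As a rough illustration, part of the table above could be assembled with plain pandas as follows. This is a minimal sketch under assumptions: a `dict` of counts stands in for `CountData`, a `(value, stderr)` tuple stands in for a value with uncertainty, the count numbers are invented, and the column names are copied from the table rather than taken from any existing QE API.

```python
import pandas as pd

# Column names copied from the table above.
columns = ["outcome", "slots", "shots", "status", "xval", "parameter"]

# Hypothetical rows: one entry per circuit execution or derived data point.
rows = {
    "circuit-1": [{"0": 600, "1": 424}, 1, 1024, "raw", 3, "X"],
    "circuit-2": [{"0": 580, "1": 444}, 1, 1024, "raw", 3, "Y"],
    "data-1": [1.3, pd.NA, 1024, "processed", 3, "X"],
    "data-2": [1.4, pd.NA, 1024, "processed", 3, "Y"],
    "point-1": [(1.5, 0.2), pd.NA, 1024, "formatted", 3, "X"],
    "point-2": [(1.6, 0.1), pd.NA, 1024, "formatted", 3, "Y"],
}

# Each dictionary key becomes a row label; each list becomes one row.
frame = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# The "outcome" column holds heterogeneous Python objects (object dtype).
raw_rows = frame[frame["status"] == "raw"]
```

Note that the "outcome" column ends up with `object` dtype because it mixes count dictionaries, floats, and tuples, which is the price of keeping every analysis step in one table.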

Using a data frame brings several benefits.

#### 1. functionality

The data frame is widely used in the data science community and offers many useful functions for grouping, sorting, and filtering data. Experimentalists can easily find tutorials and user guides provided by external communities, which saves us documentation effort.

In particular, [groupby](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.get_group.html) offers a convenient syntax for extracting the data at a certain analysis step. For example, an experimentalist may prepare data for curve fitting and then test a custom algorithm of their own. With the example data above,

```python
# Get data frame from the experiment data object.
# This contains full data set.
frame = exp_data.frame()

# Get subset of data prepared for curve fitting.
fit_data = frame.groupby("status").get_group("formatted")

# Run custom fitter with typecasted numpy arrays.
xdata = fit_data["xval"].to_numpy()
ydata = fit_data["outcome"].to_numpy()
popt, pcov = my_curve_fitter(xdata, ydata)
```
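
For readers who want to try the pattern without a backend, here is a self-contained variant of the snippet above. It is only a sketch: the frame is filled with made-up values, and `numpy.polyfit` stands in for the hypothetical `my_curve_fitter`. Because the "outcome" column mixes raw and formatted entries, it has `object` dtype, so the sketch casts it to `float` explicitly.

```python
import numpy as np
import pandas as pd

# Made-up data: one raw row (a count dictionary) and four formatted points
# that lie on the line y = 2 * x + 1.
frame = pd.DataFrame(
    {
        "outcome": [{"0": 1, "1": 3}, 3.0, 5.0, 7.0, 9.0],
        "status": ["raw", "formatted", "formatted", "formatted", "formatted"],
        "xval": [1, 1, 2, 3, 4],
    }
)

# Keep only the rows prepared for curve fitting.
fit_data = frame.groupby("status").get_group("formatted")

# Cast explicitly, since the mixed "outcome" column is stored as objects.
xdata = fit_data["xval"].to_numpy(dtype=float)
ydata = fit_data["outcome"].to_numpy(dtype=float)

# Stand-in fitter: a first-order polynomial fit.
slope, intercept = np.polyfit(xdata, ydata, deg=1)
```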

#### 2. centralization

Because we can dynamically add arbitrary columns to the data frame, in principle we can manage all intermediate data generated by each analysis step with a single object. The data set at a certain step can be extracted with `groupby` on the "status" column, as shown above. Similarly, a particular subset of composite experiment data could be obtained through a "child" column. This suggests that the data frame could drastically simplify our code base.
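
A small sketch of this idea, using only standard pandas calls: a column is attached to an existing frame after the fact, and a boolean mask then selects one step of one child experiment. The "child" values and the data are invented for illustration; how QE would actually label composite data is not decided in this RFC.

```python
import pandas as pd

# Made-up frame holding two analysis steps for two x values.
frame = pd.DataFrame(
    {
        "outcome": [0.12, 0.15, 0.48, 0.51],
        "status": ["processed", "formatted", "processed", "formatted"],
        "xval": [3, 3, 4, 4],
    }
)

# Attach a new column dynamically; here a hypothetical child-experiment index.
frame["child"] = [0, 0, 1, 1]

# Select the formatted data that belongs to child experiment 1.
mask = (frame["status"] == "formatted") & (frame["child"] == 1)
subset = frame[mask]
```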

#### 3. portability

The data frame natively supports a [variety of file formats](https://pandas.pydata.org/docs/reference/io.html), from the popular CSV and JSON to more specialized ones such as HDF5 and SQL. This greatly enhances the portability of analysis data. In particular, the JSON format fits well with the IBM experiment service API, where we could upload these data as an extra artifact entry.
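
As an illustration of the JSON route, the sketch below round-trips a small frame through a JSON string with `DataFrame.to_json` and `pandas.read_json`. The data are made up, and the sketch assumes that every cell is already a plain number or string; objects such as count data or values with uncertainties would presumably need a custom encoding first.

```python
import io

import pandas as pd

# Made-up processed data with only JSON-friendly cell values.
frame = pd.DataFrame(
    {
        "outcome": [1.3, 1.4],
        "status": ["processed", "processed"],
        "xval": [3, 3],
        "parameter": ["X", "Y"],
    }
)

# Serialize to a JSON string; "split" stores columns, index, and data separately.
payload = frame.to_json(orient="split")

# Restore the frame from the string, e.g. after downloading an artifact.
restored = pd.read_json(io.StringIO(payload), orient="split")
```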


## Detailed Design

Even though this is a change to an internal data structure, it may affect community developers who write their own experiments on top of our base classes. The change should therefore be made step by step to guarantee backward compatibility.

#### In Qiskit Experiments 0.5

Add an `ExperimentData.frame()` method that returns a data frame. In this version, it coexists with the `ExperimentData.data()` method, which returns the conventional list of result dictionaries. However, the internal data representation `ExperimentData._result_data` is replaced with a [thread-safe data frame](https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe), i.e. a subclass of `ThreadSafeContainer`, and the conventional data structure is generated on the fly from the data frame when the `.data()` method is called.
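
One way to picture the thread-safe frame is a lock-guarded wrapper around a `pandas.DataFrame`. The sketch below is not the QE implementation: `LockedFrame` and its methods are hypothetical names, a plain `threading.RLock` replaces `ThreadSafeContainer`, and the legacy view is simply a list of row dictionaries.

```python
import threading

import pandas as pd


class LockedFrame:
    """Hypothetical lock-guarded holder for a pandas DataFrame."""

    def __init__(self):
        self._lock = threading.RLock()
        self._frame = pd.DataFrame()

    def add_rows(self, rows):
        """Append a list of row dictionaries while holding the lock."""
        new = pd.DataFrame(rows)
        with self._lock:
            # Skip the empty initial frame to avoid concatenating with it.
            frames = [self._frame, new] if len(self._frame) else [new]
            self._frame = pd.concat(frames, ignore_index=True)

    def snapshot(self):
        """Return a copy so callers never mutate the shared frame."""
        with self._lock:
            return self._frame.copy()

    def legacy_data(self):
        """Build a conventional list of dictionaries on the fly."""
        return self.snapshot().to_dict(orient="records")


# Four worker threads, each adding 25 made-up rows, mimic job callbacks.
holder = LockedFrame()
workers = [
    threading.Thread(
        target=holder.add_rows,
        args=([{"outcome": i, "status": "raw"}] * 25,),
    )
    for i in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Returning copies from `snapshot` keeps the lock scope small, at the cost of extra memory for large frames.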

The data processor class and processing nodes are also updated to leverage the data frame, but internal data processing can still rely on numpy data. A processing node can push processed data to the experiment data object. This data is stored internally in the data frame and can be retrieved later with the `.frame()` method. This allows the analysis class to use intermediate data for analysis and visualization at a later time. For example, the analysis class may plot the IQ plane when the data is available.
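
The flow described above might look like the following sketch, in which a processing step reads the raw rows, computes a probability, and appends the result as new rows tagged "processed". The function name `probability_of_one` and the data are illustrative assumptions; the real processing nodes and the way they push data are left open by this RFC.

```python
import pandas as pd

# Made-up raw rows; a count dictionary stands in for the job outcome.
raw = pd.DataFrame(
    {
        "outcome": [{"0": 600, "1": 424}, {"0": 512, "1": 512}],
        "shots": [1024, 1024],
        "status": ["raw", "raw"],
        "xval": [3, 4],
    }
)


def probability_of_one(counts, shots):
    """Hypothetical processing step: fraction of shots that returned '1'."""
    return counts.get("1", 0) / shots


# Derive processed rows from the raw rows, keeping the metadata columns.
processed = raw.copy()
processed["outcome"] = [
    probability_of_one(counts, shots)
    for counts, shots in zip(raw["outcome"], raw["shots"])
]
processed["status"] = "processed"

# "Push" the processed rows into the same frame that holds the raw data.
frame = pd.concat([raw, processed], ignore_index=True)

# Later, an analysis class can pick up the intermediate data again.
intermediate = frame[frame["status"] == "processed"]
```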

#### In Qiskit Experiments 0.6

Deprecate the `ExperimentData.data()` method. Accordingly, all analysis classes relying on the `.data()` call must be updated. Specifically, these are the analysis classes that don't use the data processor framework, such as tomography and quantum volume analysis. The change is straightforward:

```python
# Conventional code: loop over result dictionaries and sort by metadata.
counts_x, counts_y = [], []
for datum in experiment_data.data():
    metadata = datum["metadata"]
    if metadata["A"] == "X":
        counts_x.append(datum["counts"])
    else:
        counts_y.append(datum["counts"])
...

# New style: group rows by the metadata column "A".
counts_x = experiment_data.frame().groupby("A").get_group("X")["outcome"]
counts_y = experiment_data.frame().groupby("A").get_group("Y")["outcome"]
```

In addition, the `CurveAnalysis` class is updated to use the data frame. This will allow experimentalists to work with custom curve fitting algorithms easily.

#### In Qiskit Experiments 0.7

Completely remove the `ExperimentData.data()` method. Support saving the data frame as an artifact. Support from the experiment service team is essential to achieve this.


## Alternative Approaches
N/A

## Questions
N/A

## Future Extensions
- Upgrade the data model of composite experiments. `groupby` can drastically simplify the current implementation. Perhaps `child_data` will no longer be necessary because we can manage the entire experiment data with a single data frame.
- Some performance optimization might be required once the full feature set is implemented. Data frame operations can be very slow when used inefficiently.
