
Dataframe for Qiskit Experiments #28

Merged (14 commits) on Feb 6, 2023

Conversation

@nkanazawa1989 (Contributor) commented Aug 25, 2022

The complicated data structure inside the Qiskit Experiments classes, particularly ExperimentData, makes advanced usage of the experiment framework harder, and experimentalists may hesitate to use such a framework for their research projects. Based on our user survey, we propose a new experiment data structure built on top of the pandas DataFrame, one of the most widely used Python packages in the data science community (this also reduces the maintenance overhead of custom-built classes).

This PR focuses on updating analysis results. In the future we can also adopt dataframes for experiment (circuit) results, i.e. ExperimentData.data(), for better metadata management.

@mtreinish changed the title from "Datafarme for Qiskit Experiments" to "Dataframe for Qiskit Experiments" on Aug 25, 2022
- created_in_db (bool): Same as `AnalysisResultData._created_in_db`. A boolean value indicating whether this entry was loaded from the database or just created locally.
- result_id (str): Same as `AnalysisResultData.result_id`. A unique ID of this entry.
- backend (str): Same as `AnalysisResultData.backend_name`. Name of the associated backend.
- creation_datetime (datetime): Same as `AnalysisResultData.creation_datetime`. The datetime at which the experiment data received the job results.
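
For illustration, here is a minimal sketch of an analysis-result table with these columns; the values, and the extra `name`/`value` columns, are made-up assumptions for this example:

```python
import pandas as pd

# Hypothetical analysis-result table using the proposed columns above.
analysis_results = pd.DataFrame(
    [
        {
            "created_in_db": False,
            "result_id": "7b3a1f2c",  # illustrative unique ID
            "backend": "ibm_example",  # made-up backend name
            "creation_datetime": pd.Timestamp("2022-08-25 10:00:00"),
            "name": "T1",  # assumed result-name column
            "value": 120e-6,  # assumed result-value column
        }
    ]
)
print(analysis_results.dtypes)
```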
Member:

Should we have a modified field as well, for the case where a result is modified (e.g. if a user sets tags, verified, etc.)?

Contributor Author:

What is the purpose of the modified field? Is this metadata to control autosave?

- `created_in_db`: This has conventionally been managed internally in the `ExperimentData`.
- `experiment`: This may be stored in the `fit` field. Not only the experiment UUID but also a human-readable representation of the experiment is important to help experimentalists with filtering and sorting analysis parameters.

To reduce the amount of work done by QE, we suggest that the experiment service `qiskit-ibm-experiment` will implement
Member:

I would like the experiment service to have its own classes and functionality for saving/loading and updating/modifying (where allowed) raw data, analysis results, and artifacts (which I think could include figures), so that qiskit-experiments basically just initializes and populates the values of these classes to be saved, or modifies loaded ones, with as little special handling as possible.

Contributor:

I agree; the question is what the minimal interface qiskit-experiments uses would look like. I can make the changes in qiskit-ibm-experiment to accommodate (hopefully) anything we'd like.


## Migration plan

Even though this is a change of internal data structure, it may impact community developers who write their own experiments on top of our base classes. Thus, the change should be made step by step to guarantee backward compatibility.
Member:

I think this is where we should try to have fixed classes in the service/db for the relevant data types; the intention is that these should not be modified or subclassed, but worked with by other classes within qiskit-experiments or other frameworks.

Contributor:

Currently qiskit-ibm-experiment uses the ExperimentData and AnalysisResultData dataclasses, which are also used inside qiskit-experiments. I have no objection to modifying how these dataclasses look or are used, but this is really a question of what qiskit-experiments users would use.

We can also drop these classes altogether and stick with pandas dataframes (that's basically what the local DB is doing now). Again, the question is what users would prefer.

If there are only keyword arguments, we can directly populate the analysis result dataframe and bypass generation of the redundant `AnalysisResult`, which no longer exists in the `ExperimentData`.
Existing analysis classes in the Qiskit Experiments namespace will switch to this kwargs style; namely, they must directly populate analysis data from the `_run_analysis` method.
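
A minimal sketch of this kwargs style, assuming a hypothetical `add_analysis_results` method, a `BaseAnalysis` parent class, and a user-defined `fit_t1` helper (none of these names are confirmed by this RFC):

```python
# Illustrative only: BaseAnalysis, add_analysis_results, and fit_t1 are
# assumed names, not the finalized Qiskit Experiments API.
class T1Analysis(BaseAnalysis):
    def _run_analysis(self, experiment_data):
        fit_result = fit_t1(experiment_data.data())  # hypothetical fitting helper
        # Populate the analysis dataframe directly with keyword arguments,
        # bypassing construction of an intermediate AnalysisResult object.
        experiment_data.add_analysis_results(
            name="T1",
            value=fit_result["value"],
            quality=fit_result["quality"],
        )
```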

`ExperimentData.artifacts()` and `.add_artifacts()` methods are newly implemented with a new protected member `ExperimentData._artifacts`, which is a thread-safe list of artifacts.
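
A minimal sketch of such a thread-safe artifact container, assuming a simple lock-guarded list; the class and field names here are illustrative, not the actual implementation:

```python
import threading
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ArtifactData:
    # Illustrative artifact record; field names are assumptions.
    name: str
    data: Any
    artifact_id: str = ""

class ThreadSafeArtifactList:
    """A list of artifacts guarded by a lock, since analysis callbacks
    may add artifacts concurrently."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: List[ArtifactData] = []

    def append(self, artifact: ArtifactData) -> None:
        with self._lock:
            self._items.append(artifact)

    def copy(self) -> List[ArtifactData]:
        with self._lock:
            return list(self._items)
```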
Member:

If we define an artifact class in the service, could we migrate figure handling to use artifacts? Maybe this requires too many changes to the web UI to do in one go.

Contributor:

The service already knows how to handle figures, and should definitely be able to handle artifacts as well. I agree that delegating as much code as possible to the service is preferable.

- Cleanup the primary columns
- Add example codes to explain how to upgrade
- Update data flow of analysis result, keep analysis result container to decouple experiment data from analysis

#### Complement

As these examples show, experimentalists can distinguish the outcomes of batch experiments (with `experiment`) and parallel experiments (with `components`). Even though there are multiple entries for the identical parameter (example 3), they have different result and experiment IDs and run times. In other words, one can sort the entries by `run_time` to get the time-ordered sequence of `value` for trend analysis:
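
For example, a minimal pandas sketch, assuming the proposed `name`, `run_time`, and `value` columns:

```python
# Time-ordered sequence of T1 values for trend analysis.
t1_trend = (
    analysis_results[analysis_results["name"] == "T1"]
    .sort_values("run_time")["value"]
    .reset_index(drop=True)
)
```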

I'm thinking about cases of complicated nested composite experiments where looking at experiment and components is not enough to distinguish between different results. What the user might be interested in is something like "I want the result of the fourth T1 experiment that was run simultaneously on Q0 and Q10." Of course, this can be deduced by looking at all the columns, but are we expecting users to be familiar with pandas and write their own multi-column filters?

Contributor Author:

Good point. Such functionality can be provided through a pandas accessor, but the good thing about using pandas is that users can find plenty of well-written documentation on the web. As you know, we don't have enough capacity to maintain detailed documentation, and I'd like to avoid overengineering at this stage (even if we provide the logic, it would be tough for users to actually use it). Can you provide some configuration that might mess up the result table?
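
For reference, a sketch of what such a pandas accessor could look like; the accessor name `qe` and the filter helper are illustrative assumptions:

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("qe")
class ExperimentAccessor:
    """Convenience filters for the proposed analysis-result dataframe."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def filter(self, name=None, components=None) -> pd.DataFrame:
        out = self._df
        if name is not None:
            out = out[out["name"] == name]
        if components is not None:
            out = out[out["components"].apply(lambda c: list(c) == list(components))]
        return out

# Usage (hypothetical): analysis_results.qe.filter(name="T1", components=["Q0", "Q10"])
```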

Yeah, I don't think we should make more work for ourselves by implementing too many custom accessors, since pandas already has many features and is well documented. Perhaps a tutorial showing some particularly useful operations would be good enough. As we discussed, the main issue I can think of is that run_time can potentially be out of order relative to the actual execution time on the device. I can think of a few times that are relevant:

  1. time when job was submitted to the queue
  2. time when job actually started to run
  3. time when job finished running
  4. run_time as defined here, the time at which the experiment data received the job results

Usually 1 and 2 will be in the same order, but experimentalists who have access to the queue can potentially move jobs around in it. So for something like "get the result of the fourth T1 experiment", I wonder which time metric we should use, or whether we should record multiple times and ask the user to specify which one they want.

Member:

Maybe it is worth thinking a bit more about how we should record times.

For experiments I would expect the important times to be 2 and 3: when it started running on the device and when it finished running on the device. However, these times are generally specific to the backend data (count/IQ), not individual analysis results, so any results processed from the same batch of data, such as parallel experiments, will have the same times. I don't think the specific time at which ExperimentData received a job result is particularly relevant (unless the job contains no timing information for its execution).

Another idea would be to include a created_time for analysis results like you have for artifacts. I'm imagining this could be relevant if you re-ran analysis on existing data (which would have a fixed time for when it ran) and wanted to save both results to the DB (I'm not sure this is allowed through qiskit-experiments without the two results also having different experiment IDs, but you could probably do it directly through the service).

Contributor Author:

Thanks Chris. I noticed that the Job object reports multiple timestamps, i.e.

```python
job = backend.run(qc)       # backend and qc are assumed to be defined
job._wait_for_completion()  # private API; blocks until the job finishes
job.time_per_step()
```

This returns a dictionary:

```python
{'CREATING': datetime.datetime(2023, 1, 26, 15, 22, 4, 465000, tzinfo=tzlocal()),
 'CREATED': datetime.datetime(2023, 1, 26, 15, 22, 5, 813000, tzinfo=tzlocal()),
 'VALIDATING': datetime.datetime(2023, 1, 26, 15, 22, 6, 43000, tzinfo=tzlocal()),
 'VALIDATED': datetime.datetime(2023, 1, 26, 15, 22, 6, 629000, tzinfo=tzlocal()),
 'QUEUED': datetime.datetime(2023, 1, 26, 15, 22, 6, 728000, tzinfo=tzlocal()),
 'RUNNING': datetime.datetime(2023, 1, 26, 15, 22, 7, 752000, tzinfo=tzlocal()),
 'COMPLETED': datetime.datetime(2023, 1, 26, 15, 22, 19, 271000, tzinfo=tzlocal())}
```

I think we can pick the RUNNING time as run_time, and separately add created_time to cover your case of reanalysis. Updated the RFC with 9a62098.
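
For example, a small sketch of picking these timestamps, given the `time_per_step` dictionary shown above (the variable names are illustrative):

```python
from datetime import datetime, timezone

timestamps = job.time_per_step()           # job lifecycle timestamps, as above
run_time = timestamps["RUNNING"]           # when the job started on the device
created_time = datetime.now(timezone.utc)  # when this analysis entry is created
```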

nkanazawa1989 and others added 2 commits January 11, 2023 02:37
Co-authored-by: gadial <gadial@gmail.com>

### C. Updating existing results or artifacts.

An end user (experimentalist) can modify an analysis result or artifact after the experiment.
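
For instance, with the proposed dataframe a user could edit an entry in place; a sketch assuming the schema above (the `result_id` value and `verified` column are illustrative):

```python
# Correct a result's value after manual verification (illustrative).
mask = analysis_results["result_id"] == "7b3a1f2c"
analysis_results.loc[mask, "value"] = 118e-6
analysis_results.loc[mask, "verified"] = True
```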
Member:

Is this intended to replace the artifact with a new one using the same artifact_id, or to delete the existing one and add the replacement artifact with a new artifact_id?

Contributor Author:

I think it depends on the experiment service implementation, but I was thinking of replacing the existing entry with the same ID.

```python
print(error)
```

Note that the experiment service already supports an API to save artifacts. The payload must be in JSON format.
Member:

Should this JSON serialization happen in the service layer, so that the Python artifact object has a Python class as its value? If the latter, whose responsibility is the JSON serializer in this case: qiskit-experiments or the service?

Contributor Author:

Good point. I think this is the responsibility of the service, since a user may want to implement a custom (local) service capable of handling custom data formats. Indeed, the conventional IBM provider implements its own encoder rather than relying on the one in Qiskit Terra. I think we should move the current experiment encoder to the service.
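
A minimal sketch of the kind of custom encoder discussed here, assuming artifacts may contain numpy arrays (illustrative, not the actual qiskit-experiments encoder):

```python
import json
import numpy as np

class ExperimentEncoder(json.JSONEncoder):
    """Serialize artifact payloads that the stdlib json module cannot handle."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {"__type__": "ndarray", "value": obj.tolist()}
        if isinstance(obj, complex):
            return {"__type__": "complex", "value": [obj.real, obj.imag]}
        return super().default(obj)

payload = json.dumps({"fit_covariance": np.eye(2)}, cls=ExperimentEncoder)
```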

Contributor Author:

I clarified this in 3de30ba.

@chriseclectic (Member) left a comment:

This looks good to me to start moving to the next stage of prototyping.

@nkanazawa1989 nkanazawa1989 merged commit 21ff6eb into Qiskit:master Feb 6, 2023
@t-imamichi (Member):

polars might be a good alternative. I heard it is faster than pandas, e.g. https://h2oai.github.io/db-benchmark/

@nkanazawa1989 (Contributor Author) commented Feb 9, 2023

Thanks @t-imamichi for the suggestion. We are aware of that package but intentionally chose pandas for now. The number of analysis result entries in ExperimentData is estimated to be roughly ~100, so the performance gain from the Rust implementation is not that drastic, and we want to choose a popular package with lots of documentation so that we can alleviate maintenance/support overhead.

@1ucian0 added the RFC proposal label on Aug 28, 2023