
Dataframe for Qiskit Experiments #28

Merged (14 commits) on Feb 6, 2023

Conversation

@nkanazawa1989 (Contributor) commented Aug 25, 2022

The complicated data structure inside the Qiskit Experiments classes, particularly ExperimentData, makes advanced usage of the experiment framework harder, and experimentalists may hesitate to use such a framework for their research projects. Based on our user survey, we propose a new experiment data structure built on top of the pandas DataFrame, one of the most widely used Python packages in the data science community (this also reduces the maintenance overhead of custom-built classes).

This PR focuses on updating analysis results. In the future we can also adopt dataframes for experiment (circuit) results, i.e. ExperimentData.data(), for better metadata management.

@mtreinish changed the title from "Datafarme for Qiskit Experiments" to "Dataframe for Qiskit Experiments" on Aug 25, 2022
- created_in_db (bool): Same as `AnalysisResultData._created_in_db`. A boolean value indicating whether this entry was loaded from the database or just created locally.
- result_id (str): Same as `AnalysisResultData.result_id`. A unique ID of this entry.
- backend (str): Same as `AnalysisResultData.backend_name`. Name of the associated backend.
- creation_datetime (datetime): Same as `AnalysisResultData.creation_datetime`. The datetime at which the experiment data received the job results.
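
For illustration, here is a minimal sketch of an analysis-result table with these columns; the values, and the extra `name`/`value` columns, are made-up assumptions for this example:

```python
import pandas as pd

# Hypothetical analysis-result table using the proposed columns above.
analysis_results = pd.DataFrame(
    [
        {
            "created_in_db": False,
            "result_id": "7b3a1f2c",  # illustrative unique ID
            "backend": "ibm_example",  # made-up backend name
            "creation_datetime": pd.Timestamp("2022-08-25 10:00:00"),
            "name": "T1",  # assumed result-name column
            "value": 120e-6,  # assumed result-value column
        }
    ]
)
print(analysis_results.dtypes)
```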
Member:

Should we have a modified field as well, for the case where a result is modified (e.g. if a user sets tags, verified, etc.)?

Contributor Author:

What is the purpose of the modified field? Is this metadata to control autosave?

- `created_in_db`: This has conventionally been managed internally in the `ExperimentData`.
- `experiment`: This may be stored in the `fit` field. Not only the experiment UUID but also a human-readable representation of the experiment is important to help experimentalists with filtering and sorting analysis parameters.

To reduce the amount of work done by QE, we suggest that the experiment service `qiskit-ibm-experiment` will implement
Member:

I would like the experiment service to have its own classes and functionality for saving/loading and updating/modifying (where allowed) raw data, analysis results, and artifacts (which I think could include figures), so that qiskit-experiments basically just initializes and populates the values of these classes to be saved, or modifies loaded ones, with as little special handling as possible.

Contributor:

I agree; the question is what the minimal interface qiskit-experiments uses would look like. I can make the changes in qiskit-ibm-experiment to accommodate (hopefully) anything we'd like.


## Migration plan

Even though this is a change of internal data structure, it may impact community developers who write their own experiments on top of our base classes. Thus, the change should be made step by step to guarantee backward compatibility.
Member:

I think this is where we should try to have fixed classes in the service/db for the relevant data types; the intention is that these should not be modified or subclassed, but worked with by other classes within qiskit-experiments or other frameworks.

Contributor:

Currently qiskit-ibm-experiment uses the ExperimentData and AnalysisResultData dataclasses, which are also used inside qiskit-experiments. I have no objection to modifying how these dataclasses look or are used, but this is really a question of what qiskit-experiments users would use.

We can also drop these classes altogether and stick with pandas dataframes (that's basically what the local DB is doing now). Again, the question is what users would prefer.

If there are only keyword arguments, we can directly populate the analysis result dataframe and bypass generation of the redundant `AnalysisResult`, which no longer exists in the `ExperimentData`.
Existing analysis classes in the Qiskit Experiments namespace will switch to this kwargs style; namely, they must directly populate analysis data from the `_run_analysis` method.
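
A minimal sketch of this kwargs style, assuming a hypothetical `add_analysis_results` method, a `BaseAnalysis` parent class, and a user-defined `fit_t1` helper (none of these names are confirmed by this RFC):

```python
# Illustrative only: BaseAnalysis, add_analysis_results, and fit_t1 are
# assumed names, not the finalized Qiskit Experiments API.
class T1Analysis(BaseAnalysis):
    def _run_analysis(self, experiment_data):
        fit_result = fit_t1(experiment_data.data())  # hypothetical fitting helper
        # Populate the analysis dataframe directly with keyword arguments,
        # bypassing construction of an intermediate AnalysisResult object.
        experiment_data.add_analysis_results(
            name="T1",
            value=fit_result["value"],
            quality=fit_result["quality"],
        )
```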

`ExperimentData.artifacts()` and `.add_artifacts()` methods are newly implemented with a new protected member `ExperimentData._artifacts`, which is a thread-safe list of artifacts.
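
A minimal sketch of such a thread-safe artifact container, assuming a simple lock-guarded list; the class and field names here are illustrative, not the actual implementation:

```python
import threading
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ArtifactData:
    # Illustrative artifact record; field names are assumptions.
    name: str
    data: Any
    artifact_id: str = ""

class ThreadSafeArtifactList:
    """A list of artifacts guarded by a lock, since analysis callbacks
    may add artifacts concurrently."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: List[ArtifactData] = []

    def append(self, artifact: ArtifactData) -> None:
        with self._lock:
            self._items.append(artifact)

    def copy(self) -> List[ArtifactData]:
        with self._lock:
            return list(self._items)
```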
Member:

If we define an artifact class in the service, could we migrate figure handling to use artifacts? Maybe this requires too many changes to the web UI to do in one go.

Contributor:

The service already knows how to handle figures, and should definitely be able to handle artifacts as well. I agree that delegating as much code as possible to the service is preferable.

- Cleanup the primary columns
- Add example codes to explain how to upgrade
- Update data flow of analysis result, keep analysis result container to decouple experiment data from analysis

#### Complement

As these examples show, experimentalists can distinguish the outcomes of batch experiments (with `experiment`) and parallel experiments (with `components`). Even though there are multiple entries for the identical parameter (example 3), they have different result and experiment IDs and run times. In other words, one can sort the entries by `run_time` to get the time-ordered sequence of `value` for trend analysis:
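
For example, a minimal pandas sketch, assuming the proposed `name`, `run_time`, and `value` columns:

```python
# Time-ordered sequence of T1 values for trend analysis.
t1_trend = (
    analysis_results[analysis_results["name"] == "T1"]
    .sort_values("run_time")["value"]
    .reset_index(drop=True)
)
```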

I'm thinking about cases of complicated nested composite experiments where looking at experiment and components is not enough to distinguish between different results. What the user might be interested in is something like "I want the result of the fourth T1 experiment that was run simultaneously on Q0 and Q10." Of course, this can be deduced by looking at all the columns, but are we expecting users to be familiar with pandas and write their own multi-column filters?

Contributor Author:

Good point. Such functionality can be provided through a pandas accessor, but the good thing about using pandas is that users can find plenty of well-written documentation on the web. As you know, we don't have enough capacity to maintain detailed documentation, and I'd like to avoid overengineering at this stage (even if we provide the logic, it would be tough for users to actually use it). Can you provide some configuration that might mess up the result table?
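
For reference, a sketch of what such a pandas accessor could look like; the accessor name `qe` and the filter helper are illustrative assumptions:

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("qe")
class ExperimentAccessor:
    """Convenience filters for the proposed analysis-result dataframe."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def filter(self, name=None, components=None) -> pd.DataFrame:
        out = self._df
        if name is not None:
            out = out[out["name"] == name]
        if components is not None:
            out = out[out["components"].apply(lambda c: list(c) == list(components))]
        return out

# Usage (hypothetical): analysis_results.qe.filter(name="T1", components=["Q0", "Q10"])
```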

Yeah, I don't think we should make more work for ourselves by implementing too many custom accessors, since pandas already has many features and is well documented. Perhaps a tutorial showing some particularly useful operations would be good enough. As we discussed, the main issue I can think of is that run_time can potentially be out of order relative to the actual execution time on the device. I can think of a few times that are relevant:

  1. time when job was submitted to the queue
  2. time when job actually started to run
  3. time when job finished running
  4. run_time as defined here, the time at which the experiment data received the job results

Usually 1 and 2 will be in the same order, but experimentalists who have access to the queue can potentially move jobs around in it. So for something like "get the result of the fourth T1 experiment", I wonder which time metric we should use, or whether we should record multiple times and ask the user to specify which one they want.

Member:

Maybe it is worth thinking a bit more about how we should record times.

For experiments I would expect the important times to be 2 and 3: when it started running on the device and when it finished running on the device. However, these times are generally specific to the backend data (count/IQ), not individual analysis results, so any results processed from the same batch of data, such as parallel experiments, will have the same times. I don't think the specific time at which ExperimentData received a job result is particularly relevant (unless the job contains no timing information for its execution).

Another idea would be to include a created_time for analysis results like you have for artifacts. I'm imagining this could be relevant if you re-ran analysis on existing data (which would have a fixed time for when it ran) and wanted to save both results to the DB (I'm not sure this is allowed through qiskit-experiments without the two results also having different experiment IDs, but you could probably do it directly through the service).

Contributor Author:

Thanks Chris. I noticed that the Job object reports multiple timestamps, i.e.

```python
job = backend.run(qc)       # backend and qc are assumed to be defined
job._wait_for_completion()  # private API; blocks until the job finishes
job.time_per_step()
```

This returns a dictionary:

```python
{'CREATING': datetime.datetime(2023, 1, 26, 15, 22, 4, 465000, tzinfo=tzlocal()),
 'CREATED': datetime.datetime(2023, 1, 26, 15, 22, 5, 813000, tzinfo=tzlocal()),
 'VALIDATING': datetime.datetime(2023, 1, 26, 15, 22, 6, 43000, tzinfo=tzlocal()),
 'VALIDATED': datetime.datetime(2023, 1, 26, 15, 22, 6, 629000, tzinfo=tzlocal()),
 'QUEUED': datetime.datetime(2023, 1, 26, 15, 22, 6, 728000, tzinfo=tzlocal()),
 'RUNNING': datetime.datetime(2023, 1, 26, 15, 22, 7, 752000, tzinfo=tzlocal()),
 'COMPLETED': datetime.datetime(2023, 1, 26, 15, 22, 19, 271000, tzinfo=tzlocal())}
```

I think we can pick the RUNNING time as run_time, and separately add created_time to cover your case of reanalysis. Updated the RFC with 9a62098.
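
For example, a small sketch of picking these timestamps, given the `time_per_step` dictionary shown above (the variable names are illustrative):

```python
from datetime import datetime, timezone

timestamps = job.time_per_step()           # job lifecycle timestamps, as above
run_time = timestamps["RUNNING"]           # when the job started on the device
created_time = datetime.now(timezone.utc)  # when this analysis entry is created
```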

nkanazawa1989 and others added 2 commits January 11, 2023 02:37
Co-authored-by: gadial <gadial@gmail.com>

### C. Updating existing results or artifacts.

An end user (experimentalist) can modify an analysis result or artifact after the experiment.
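
For instance, with the proposed dataframe a user could edit an entry in place; a sketch assuming the schema above (the `result_id` value and `verified` column are illustrative):

```python
# Correct a result's value after manual verification (illustrative).
mask = analysis_results["result_id"] == "7b3a1f2c"
analysis_results.loc[mask, "value"] = 118e-6
analysis_results.loc[mask, "verified"] = True
```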
Member:

Is this intended to replace the artifact with a new one using the same artifact_id, or to delete the existing one and add the replacement artifact with a new artifact_id?

Contributor Author:

I think it depends on the experiment service implementation, but I was thinking of replacing the existing entry with the same ID.

```python
print(error)
```

Note that the experiment service already supports an API to save artifacts. The payload must be in JSON format.
Member:

Should this JSON serialization happen in the service layer, so that the Python artifact object has a Python class as its value? If the latter, whose responsibility is the JSON serializer in this case: qiskit-experiments or the service?

Contributor Author:

Good point. I think this is the responsibility of the service, since a user may want to implement a custom (local) service capable of handling custom data formats. Indeed, the conventional IBM provider implements its own encoder rather than relying on the one in Qiskit Terra. I think we should move the current experiment encoder to the service.
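
A minimal sketch of the kind of custom encoder discussed here, assuming artifacts may contain numpy arrays (illustrative, not the actual qiskit-experiments encoder):

```python
import json
import numpy as np

class ExperimentEncoder(json.JSONEncoder):
    """Serialize artifact payloads that the stdlib json module cannot handle."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {"__type__": "ndarray", "value": obj.tolist()}
        if isinstance(obj, complex):
            return {"__type__": "complex", "value": [obj.real, obj.imag]}
        return super().default(obj)

payload = json.dumps({"fit_covariance": np.eye(2)}, cls=ExperimentEncoder)
```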

Contributor Author:

I clarified this in 3de30ba.

@chriseclectic (Member) left a comment:

This looks good to me to start moving to the next stage of prototyping.

@nkanazawa1989 nkanazawa1989 merged commit 21ff6eb into Qiskit:master Feb 6, 2023
@t-imamichi (Member):

polars might be a good alternative. I heard it is faster than pandas, e.g. https://h2oai.github.io/db-benchmark/

@nkanazawa1989 (Contributor Author) commented Feb 9, 2023

Thanks @t-imamichi for the suggestion. We are aware of that package but intentionally chose pandas for now. The number of analysis result entries in ExperimentData is estimated to be roughly ~100, so the performance gain from the Rust implementation is not that drastic, and we want to choose a popular package with lots of documentation so that we can alleviate maintenance/support overhead.

@1ucian0 added the RFC proposal label on Aug 28, 2023