Deprecate outdated interfaces of pandas used in rank.py #134

Closed
FuryMartin opened this issue Aug 16, 2024 · 4 comments · Fixed by #135
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@FuryMartin
Contributor

FuryMartin commented Aug 16, 2024

What would you like to be added/modified:

Some outdated pandas interfaces are used in rank.py.

We can replace them by:

  1. Implementing _get_all() using a more modern interface.
  2. Using pd.concat() to merge the old and new DataFrames.

The modified method should remain compatible with the version of pandas corresponding to Python 3.6.

Why is this needed:

Ianvs was originally designed for pandas==1.1.5, but the latest version is now pandas==2.2.2. Across this major version update, several pandas interfaces have been deprecated or removed.

Continuing to use these old interfaces causes errors on Python>=3.8.

  1. pd.np was removed in pandas>=2.0.0:

    selected_df.index = pd.np.arange(1, len(selected_df) + 1)

    AttributeError: module 'pandas' has no attribute 'np'
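A minimal sketch of the fix, on a toy stand-in for the real selected_df: import numpy directly instead of going through the removed pd.np alias.

```python
import numpy as np
import pandas as pd

# toy stand-in for the real selected_df in rank.py
selected_df = pd.DataFrame({"acc": [0.6, 0.7, 0.5]})

# old, broken on pandas>=2.0:  selected_df.index = pd.np.arange(...)
selected_df.index = np.arange(1, len(selected_df) + 1)
print(selected_df.index.tolist())  # [1, 2, 3]
```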
  2. DataFrame.append was removed in pandas>=2.0.0:

    all_df = all_df.append(old_df)

    AttributeError: 'DataFrame' object has no attribute 'append'
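A sketch of the replacement on toy frames (hypothetical column names): pd.concat reproduces the old append behaviour.

```python
import pandas as pd

all_df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.87]})
old_df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.86]})

# old, removed in pandas>=2.0:  all_df = all_df.append(old_df)
all_df = pd.concat([all_df, old_df], ignore_index=True)
print(len(all_df))  # 2
```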
  3. Initializing rows of all_df with np.NAN causes string data to go missing:

    all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
| rank | algorithm | acc | edge-rate | paradigm | basemodel | apimodel | hard_example_mining | basemodel-model_name | basemodel-backend | basemodel-quantization | apimodel-model_name | time | url |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
|  1   |           | 0.6 |    0.6    |          |           |          |                     |                      |                   |                        |                     |      |     |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
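One way to sidestep the NaN-initialization problem, sketched here with a simplified two-column header (not the real self.all_df_header), is to fill each row in a single step from a dict, so string cells are never patched into a NaN-prefilled row:

```python
import pandas as pd

header = ["algorithm", "acc"]  # simplified stand-in for self.all_df_header
all_df = pd.DataFrame(columns=header)

# fill the whole row at once instead of NaN-initializing and patching cells
all_df.loc[0] = {"algorithm": "fpn_singletask_learning", "acc": 0.6}
print(all_df.loc[0, "algorithm"])  # fpn_singletask_learning
```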
  4. Setting a value via chained indexing such as df.loc[row_index][column] raises SettingWithCopyWarning; this pattern will stop working once Copy-on-Write becomes the default in pandas 3.0:

    all_df.loc[i][0] = algorithm.name

Lines 151, 154, 158, 165, and 167 have the same issue.

./ianvs/core/storymanager/rank/rank.py:148: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  all_df.loc[i][0] = algorithm.name
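The single-step form recommended by the warning, sketched on a toy frame:

```python
import pandas as pd

all_df = pd.DataFrame({"algorithm": [None], "paradigm": [None]})

# old, chained (may set on a copy):  all_df.loc[0]["algorithm"] = "fpn"
# new: one label-based assignment updates the original frame
all_df.loc[0, "algorithm"] = "fpn"
print(all_df.loc[0, "algorithm"])  # fpn
```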
There are also two readability issues:

  1. Assigning values one by one reduces code readability and can be simplified:

    all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
    # fill name column of algorithm
    algorithm = test_case.algorithm
    all_df.loc[i][0] = algorithm.name
    # fill metric columns of algorithm
    for metric_name in test_results[test_case.id][0]:
        all_df.loc[i][metric_name] = test_results[test_case.id][0].get(metric_name)
    # fill paradigm column of algorithm
    all_df.loc[i]["paradigm"] = algorithm.paradigm_type
    # fill module columns of algorithm
    for module_type, module in algorithm.modules.items():
        all_df.loc[i][module_type] = module.name
    # fill hyperparameters columns of algorithm modules
    hps = self._get_algorithm_hyperparameters(algorithm)
    # pylint: disable=C0103
    for k, v in hps.items():
        all_df.loc[i][k] = v
    # fill time and output dir of testcase
    all_df.loc[i][-2:] = [test_results[test_case.id][1], test_case.output_dir]

  2. Using whitespace as the separator in a CSV (Comma-Separated Values) file is unusual:

    all_df.to_csv(self.all_rank_file, index_label="rank", encoding="utf-8", sep=" ")
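Dropping sep=" " restores a genuine comma-separated file that round-trips through pd.read_csv without delim_whitespace. A sketch with a toy frame, using an in-memory buffer in place of all_rank_file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.87]}, index=[1])

buf = StringIO()
df.to_csv(buf, index_label="rank")  # default sep="," -> a real CSV

buf.seek(0)
round_trip = pd.read_csv(buf, index_col=0)
print(round_trip.loc[1, "algorithm"])  # fpn
```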

@FuryMartin
Contributor Author

FuryMartin commented Aug 16, 2024

The current implementation in Ianvs is:

def _get_all(self, test_cases, test_results) -> pd.DataFrame:
    all_df = pd.DataFrame(columns=self.all_df_header)
    for i, test_case in enumerate(test_cases):
        all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
        # fill name column of algorithm
        algorithm = test_case.algorithm
        all_df.loc[i][0] = algorithm.name
        # fill metric columns of algorithm
        for metric_name in test_results[test_case.id][0]:
            all_df.loc[i][metric_name] = test_results[test_case.id][0].get(metric_name)
        # fill paradigm column of algorithm
        all_df.loc[i]["paradigm"] = algorithm.paradigm_type
        # fill module columns of algorithm
        for module_type, module in algorithm.modules.items():
            all_df.loc[i][module_type] = module.name
        # fill hyperparameters columns of algorithm modules
        hps = self._get_algorithm_hyperparameters(algorithm)
        # pylint: disable=C0103
        for k, v in hps.items():
            all_df.loc[i][k] = v
        # fill time and output dir of testcase
        all_df.loc[i][-2:] = [test_results[test_case.id][1], test_case.output_dir]
    if utils.is_local_file(self.all_rank_file):
        old_df = pd.read_csv(self.all_rank_file, delim_whitespace=True, index_col=0)
        all_df = all_df.append(old_df)
    return self._sort_all_df(all_df, self._get_all_metric_names(test_results))

Here is a revised version of the implementation; all interfaces used are compatible with pandas==1.1.5:

    def _get_all(self, test_cases, test_results) -> pd.DataFrame:
        all_df = pd.DataFrame(columns=self.all_df_header)

        for i, test_case in enumerate(test_cases):
            algorithm = test_case.algorithm
            test_result = test_results[test_case.id][0]

            # add algorithm, paradigm, time, url of algorithm
            row_data = {
                "algorithm": algorithm.name,
                "paradigm": algorithm.paradigm_type,
                "time": test_results[test_case.id][1],
                "url": test_case.output_dir
            }

            # add metric of algorithm
            row_data.update(test_result)

            # add module of algorithm
            row_data.update({
                module_type: module.name
                for module_type, module in algorithm.modules.items()
            })

            # add hyperparameters of algorithm modules
            row_data.update(self._get_algorithm_hyperparameters(algorithm))

            # fill data
            all_df.loc[i] = row_data

        new_df = self._concat_existing_data(all_df)

        return self._sort_all_df(new_df, self._get_all_metric_names(test_results))

    def _concat_existing_data(self, new_df):
        if utils.is_local_file(self.all_rank_file):
            old_df = pd.read_csv(self.all_rank_file, index_col=0)
            new_df = pd.concat([old_df, new_df])
        return new_df

Compared to the current implementation, the revised one mainly:

  • Fills the DataFrame in one go using a dict-typed row_data.
  • Uses pd.concat to merge old_df and new_df.
  • Extracts the DataFrame-merging step into a separate function to make the logic of _get_all() clearer.

Additionally, I removed sep=" " from all CSV read and write functions.

@FuryMartin
Contributor Author

After fixing PCB-AoI's dependencies, I conducted experiments on this example with Python==3.6.13 to prove that our revisions do not introduce new compatibility issues.

The experimental results are as follows:

Run once:

f1_score_avg: 0.8568
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

Run twice:

f1_score_avg: 0.8635
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8707  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 23:59:15 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  3   | fpn_singletask_learning |  0.8635  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-17 00:02:22 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84 |
|  4   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

The resulting all_rank.csv is shown below:

rank,algorithm,f1_score,paradigm,basemodel,basemodel-momentum,basemodel-learning_rate,time,url
1,fpn_singletask_learning,0.8707,singletasklearning,FPN,0.95,0.1,2024-08-16 23:59:15,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84
2,fpn_singletask_learning,0.8694,singletasklearning,FPN,0.95,0.1,2024-08-16 22:54:24,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84
3,fpn_singletask_learning,0.8635,singletasklearning,FPN,0.5,0.1,2024-08-17 00:02:22,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84
4,fpn_singletask_learning,0.8568,singletasklearning,FPN,0.5,0.1,2024-08-16 22:57:38,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84

These results indicate that the revised version is functioning properly.

@MooreZheng MooreZheng added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Aug 29, 2024
@hsj576
Member

hsj576 commented Aug 30, 2024

The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

@FuryMartin
Contributor Author

FuryMartin commented Aug 30, 2024

> The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

OK, thanks for the review.
