Deprecate outdated interfaces of pandas used in rank.py #134

Closed
FuryMartin opened this issue Aug 16, 2024 · 4 comments · Fixed by #135
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@FuryMartin
Contributor

FuryMartin commented Aug 16, 2024

What would you like to be added/modified:

Some outdated pandas interfaces are used in rank.py.

We can replace them by:

  1. Implementing _get_all() using a more modern interface.
  2. Using pd.concat() to merge the old and new DataFrames.

The modified method should remain compatible with the version of pandas corresponding to Python 3.6.

Why is this needed:

Ianvs was originally designed for pandas==1.1.5, but the latest version is now pandas==2.2.2. Across this major version update, several pandas interfaces have been deprecated or removed.

Continuing to use these old interfaces causes errors on Python>=3.8.

  1. pd.np was removed in pandas>=2.0.0:

    selected_df.index = pd.np.arange(1, len(selected_df) + 1)

    AttributeError: module 'pandas' has no attribute 'np'
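A minimal sketch of the fix, on a toy stand-in for the real selected_df: import numpy directly instead of going through the removed pd.np alias.

```python
import numpy as np
import pandas as pd

# toy stand-in for the real selected_df in rank.py
selected_df = pd.DataFrame({"acc": [0.6, 0.7, 0.5]})

# old, broken on pandas>=2.0:  selected_df.index = pd.np.arange(...)
selected_df.index = np.arange(1, len(selected_df) + 1)
print(selected_df.index.tolist())  # [1, 2, 3]
```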
  2. DataFrame.append was removed in pandas>=2.0.0:

    all_df = all_df.append(old_df)

    AttributeError: 'DataFrame' object has no attribute 'append'
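A sketch of the replacement on toy frames (hypothetical column names): pd.concat reproduces the old append behaviour.

```python
import pandas as pd

all_df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.87]})
old_df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.86]})

# old, removed in pandas>=2.0:  all_df = all_df.append(old_df)
all_df = pd.concat([all_df, old_df], ignore_index=True)
print(len(all_df))  # 2
```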
  3. Initializing rows of all_df with np.NAN causes string data to go missing:

    all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
| rank | algorithm | acc | edge-rate | paradigm | basemodel | apimodel | hard_example_mining | basemodel-model_name | basemodel-backend | basemodel-quantization | apimodel-model_name | time | url |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
|  1   |           | 0.6 |    0.6    |          |           |          |                     |                      |                   |                        |                     |      |     |
+------+-----------+-----+-----------+----------+-----------+----------+---------------------+----------------------+-------------------+------------------------+---------------------+------+-----+
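One way to sidestep the NaN-initialization problem, sketched here with a simplified two-column header (not the real self.all_df_header), is to fill each row in a single step from a dict, so string cells are never patched into a NaN-prefilled row:

```python
import pandas as pd

header = ["algorithm", "acc"]  # simplified stand-in for self.all_df_header
all_df = pd.DataFrame(columns=header)

# fill the whole row at once instead of NaN-initializing and patching cells
all_df.loc[0] = {"algorithm": "fpn_singletask_learning", "acc": 0.6}
print(all_df.loc[0, "algorithm"])  # fpn_singletask_learning
```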
  4. Setting a value via chained indexing such as df.loc[row_index][column] raises SettingWithCopyWarning; this pattern will stop working once Copy-on-Write becomes the default in pandas 3.0:

    all_df.loc[i][0] = algorithm.name

Lines 151, 154, 158, 165, and 167 have the same issue.

./ianvs/core/storymanager/rank/rank.py:148: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  all_df.loc[i][0] = algorithm.name
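The single-step form recommended by the warning, sketched on a toy frame:

```python
import pandas as pd

all_df = pd.DataFrame({"algorithm": [None], "paradigm": [None]})

# old, chained (may set on a copy):  all_df.loc[0]["algorithm"] = "fpn"
# new: one label-based assignment updates the original frame
all_df.loc[0, "algorithm"] = "fpn"
print(all_df.loc[0, "algorithm"])  # fpn
```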
There are also two readability issues:

  1. Assigning values one by one reduces code readability and can be simplified:

    all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
    # fill name column of algorithm
    algorithm = test_case.algorithm
    all_df.loc[i][0] = algorithm.name
    # fill metric columns of algorithm
    for metric_name in test_results[test_case.id][0]:
        all_df.loc[i][metric_name] = test_results[test_case.id][0].get(metric_name)
    # fill paradigm column of algorithm
    all_df.loc[i]["paradigm"] = algorithm.paradigm_type
    # fill module columns of algorithm
    for module_type, module in algorithm.modules.items():
        all_df.loc[i][module_type] = module.name
    # fill hyperparameters columns of algorithm modules
    hps = self._get_algorithm_hyperparameters(algorithm)
    # pylint: disable=C0103
    for k, v in hps.items():
        all_df.loc[i][k] = v
    # fill time and output dir of testcase
    all_df.loc[i][-2:] = [test_results[test_case.id][1], test_case.output_dir]

  2. Using whitespace as the separator in a CSV (Comma-Separated Values) file is unusual:

    all_df.to_csv(self.all_rank_file, index_label="rank", encoding="utf-8", sep=" ")
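Dropping sep=" " restores a genuine comma-separated file that round-trips through pd.read_csv without delim_whitespace. A sketch with a toy frame, using an in-memory buffer in place of all_rank_file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"algorithm": ["fpn"], "f1_score": [0.87]}, index=[1])

buf = StringIO()
df.to_csv(buf, index_label="rank")  # default sep="," -> a real CSV

buf.seek(0)
round_trip = pd.read_csv(buf, index_col=0)
print(round_trip.loc[1, "algorithm"])  # fpn
```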

@FuryMartin
Contributor Author

FuryMartin commented Aug 16, 2024

The current implementation in Ianvs is:

def _get_all(self, test_cases, test_results) -> pd.DataFrame:
    all_df = pd.DataFrame(columns=self.all_df_header)
    for i, test_case in enumerate(test_cases):
        all_df.loc[i] = [np.NAN for i in range(len(self.all_df_header))]
        # fill name column of algorithm
        algorithm = test_case.algorithm
        all_df.loc[i][0] = algorithm.name
        # fill metric columns of algorithm
        for metric_name in test_results[test_case.id][0]:
            all_df.loc[i][metric_name] = test_results[test_case.id][0].get(metric_name)
        # fill paradigm column of algorithm
        all_df.loc[i]["paradigm"] = algorithm.paradigm_type
        # fill module columns of algorithm
        for module_type, module in algorithm.modules.items():
            all_df.loc[i][module_type] = module.name
        # fill hyperparameters columns of algorithm modules
        hps = self._get_algorithm_hyperparameters(algorithm)
        # pylint: disable=C0103
        for k, v in hps.items():
            all_df.loc[i][k] = v
        # fill time and output dir of testcase
        all_df.loc[i][-2:] = [test_results[test_case.id][1], test_case.output_dir]
    if utils.is_local_file(self.all_rank_file):
        old_df = pd.read_csv(self.all_rank_file, delim_whitespace=True, index_col=0)
        all_df = all_df.append(old_df)
    return self._sort_all_df(all_df, self._get_all_metric_names(test_results))

Here is a revised version of the implementation; all interfaces used are compatible with pandas==1.1.5:

    def _get_all(self, test_cases, test_results) -> pd.DataFrame:
        all_df = pd.DataFrame(columns=self.all_df_header)

        for i, test_case in enumerate(test_cases):
            algorithm = test_case.algorithm
            test_result = test_results[test_case.id][0]

            # add algorithm, paradigm, time, url of algorithm
            row_data = {
                "algorithm": algorithm.name,
                "paradigm": algorithm.paradigm_type,
                "time": test_results[test_case.id][1],
                "url": test_case.output_dir
            }

            # add metric of algorithm
            row_data.update(test_result)

            # add module of algorithm
            row_data.update({
                module_type: module.name
                for module_type, module in algorithm.modules.items()
            })

            # add hyperparameters of algorithm modules
            row_data.update(self._get_algorithm_hyperparameters(algorithm))

            # fill data
            all_df.loc[i] = row_data

        new_df = self._concat_existing_data(all_df)

        return self._sort_all_df(new_df, self._get_all_metric_names(test_results))

    def _concat_existing_data(self, new_df):
        if utils.is_local_file(self.all_rank_file):
            old_df = pd.read_csv(self.all_rank_file, index_col=0)
            new_df = pd.concat([old_df, new_df])
        return new_df

Compared to the current implementation, the revised one mainly:

  • Fills the DataFrame in one go using a dict-typed row_data.
  • Uses pd.concat to merge old_df and new_df.
  • Extracts the DataFrame-merging step into a separate function to make the logic of _get_all() clearer.

Additionally, I removed sep=" " from all CSV read and write functions.

@FuryMartin
Contributor Author

After fixing PCB-AoI's dependencies, I conducted experiments on this example with Python==3.6.13 to prove that our revisions do not introduce new compatibility issues.

The experimental results are as follows:

Run once:

f1_score_avg: 0.8568
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

Run twice:

f1_score_avg: 0.8635
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
| rank |        algorithm        | f1_score |      paradigm      | basemodel | basemodel-momentum | basemodel-learning_rate |         time        |                                           url                                            |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+
|  1   | fpn_singletask_learning |  0.8707  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 23:59:15 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84 |
|  2   | fpn_singletask_learning |  0.8694  | singletasklearning |    FPN    |        0.95        |           0.1           | 2024-08-16 22:54:24 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84 |
|  3   | fpn_singletask_learning |  0.8635  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-17 00:02:22 | ./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84 |
|  4   | fpn_singletask_learning |  0.8568  | singletasklearning |    FPN    |        0.5         |           0.1           | 2024-08-16 22:57:38 | ./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84 |
+------+-------------------------+----------+--------------------+-----------+--------------------+-------------------------+---------------------+------------------------------------------------------------------------------------------+

The resulting all_rank.csv is shown below:

rank,algorithm,f1_score,paradigm,basemodel,basemodel-momentum,basemodel-learning_rate,time,url
1,fpn_singletask_learning,0.8707,singletasklearning,FPN,0.95,0.1,2024-08-16 23:59:15,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a128-5be8-11ef-bf9b-755996a48c84
2,fpn_singletask_learning,0.8694,singletasklearning,FPN,0.95,0.1,2024-08-16 22:54:24,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c94-5bde-11ef-bf9b-755996a48c84
3,fpn_singletask_learning,0.8635,singletasklearning,FPN,0.5,0.1,2024-08-17 00:02:22,./workspace/benchmarkingjob/fpn_singletask_learning/08e9a129-5be8-11ef-bf9b-755996a48c84
4,fpn_singletask_learning,0.8568,singletasklearning,FPN,0.5,0.1,2024-08-16 22:57:38,./workspace/benchmarkingjob/fpn_singletask_learning/fdd65c95-5bde-11ef-bf9b-755996a48c84

These results indicate that the revised version is functioning properly.

@MooreZheng MooreZheng added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Aug 29, 2024
@hsj576
Member

hsj576 commented Aug 30, 2024

The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

@FuryMartin
Contributor Author

FuryMartin commented Aug 30, 2024

> The overall change looks good to me. Please talk about it at the next community meeting and see if anyone else has any questions.

OK, thanks for the review.
