
New datasets + reorganization of current benchmarks #153

Open
folivetti opened this issue Aug 30, 2023 · 5 comments
@folivetti (Contributor)

In a recent study (https://dl.acm.org/doi/abs/10.1145/3597312) I noticed that the differences between the top-N (N = 15 or more) algorithms on most datasets are insignificant. They only differ on a small selection of the Friedman datasets. Maybe it is a good idea to separate the comparison of the algorithms into different groups:

Given this, my other proposal is to add the benchmarks from those two competitions, plus the one proposed by @MilesCranmer. For the 2023 competition I can also generate datasets with different levels of noise and other nasty features! We can also grab other benchmark functions from multimodal optimization to create more of those.
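A minimal sketch of how such noisy variants could be generated (the function name and the noise parameterization as a fraction of the target's standard deviation are my own assumptions, not the actual generator):

```python
import numpy as np

def add_noise(y, noise_level, rng=None):
    """Return a copy of target y with Gaussian noise added.

    noise_level is a fraction of std(y), e.g. 0.1 for 10% noise.
    (Hypothetical parameterization; the competition generator may differ.)
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    return y + rng.normal(0.0, noise_level * np.std(y), size=len(y))
```

The same dataset could then be emitted at several noise levels (e.g. 0.0, 0.05, 0.1, 0.25) to form a difficulty ladder.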

@folivetti folivetti self-assigned this Feb 14, 2024
@folivetti (Contributor, Author)

As we discussed in our meeting, I'll make a first pass over the current results to identify potential datasets to remove from the benchmark and some ways to select and categorize the rest.

@folivetti (Contributor, Author)

@lacava @foolnotion @MilesCranmer
Some initial stats about PMLB! I've checked each dataset to see whether the target column was discrete or continuous and whether it had only positive values:

Count by domain: 
domain
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64

Count by type: 
type
continuous    89
discrete      33
Name: dataset, dtype: int64
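The domain/type check could be sketched roughly like this (the exact rules behind the counts above are my assumptions, not the actual script):

```python
import numpy as np

def classify_target(y):
    """Classify a target column by domain and type.

    Sketch only: "discrete" means all values are integers; the domain
    labels match the ones in the counts above ({0, 1}, Z, Z+, R, R+).
    """
    vals = np.unique(np.asarray(y, dtype=float))
    is_discrete = np.allclose(vals, np.round(vals))
    if set(vals) <= {0, 1}:
        domain = "{0, 1}"
    elif is_discrete:
        domain = "Z+" if vals.min() >= 0 else "Z"
    else:
        domain = "R+" if vals.min() >= 0 else "R"
    return domain, "discrete" if is_discrete else "continuous"
```

Running this over every PMLB target and feeding the results to `pandas.Series.value_counts()` would produce tables like the ones above.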

titanic and banana are actually classification problems. All the others whose domain differs from R may follow other distributions (Poisson, Gamma, etc.).

I've also checked the uniqueness of the Z+ datasets:

Uniqueness ratios below 100% of Z+ data: 
                     dataset  uniqueness_ratio  unique_values
0                   1027_ESL          0.018443              9
1                   1028_SWD          0.004000              4
2                   1029_LEV          0.005000              5
3                   1030_ERA          0.009000              9
4               1089_USCrime          0.893617             42
12                1595_poker          0.000010             10
14            195_auto_price          0.911950            145
15               197_cpu_act          0.006836             56
16                   201_pol          0.000733             11
17             207_autoPrice          0.911950            145
20              218_house_8L          0.089756           2045
22             227_cpu_small          0.006836             56
25           230_machine_cpu          0.555024            116
26       294_satellite_image          0.000932              6
29   485_analcatdata_vehicle          0.979167             47
32                519_vinnie          0.042105             16
34   523_analcatdata_neavote          0.080000              8
37                537_houses          0.186143           3842
40    556_analcatdata_apnea2          0.374737            178
41    557_analcatdata_apnea1          0.345263            164
43                   561_cpu          0.497608            104
44             562_cpu_small          0.006836             56
46               573_cpu_act          0.006836             56
47             574_house_16H          0.089756           2045
112      665_sleuth_case2002          0.129252             19
115        687_sleuth_ex1605          0.661290             41
116   690_visualizing_galaxy          0.643963            208
119      712_chscase_geyser1          0.225225             50
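The uniqueness ratio here is presumably the number of distinct target values divided by the number of rows; a minimal sketch:

```python
import pandas as pd

def uniqueness_ratio(y):
    """Fraction of distinct values in a target column.

    A ratio well below 1.0 with few unique values suggests an
    ordinal/multiclass target rather than a continuous one.
    """
    s = pd.Series(y)
    return s.nunique() / len(s)
```

For example, 1028_SWD has 4 unique values over 1000 rows, giving the 0.004 ratio in the table.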

Some of them look like multiclass problems. Plotting the median of medians of R^2 with error bars, we get:

[plot_domain: error-bar plot of R^2 grouped by target domain]

Indeed, most algorithms perform quite well on R and not as well on the other problems, possibly because they are only fitting least squares, disregarding other possibilities.

My suggestion is that we remove those from SRBench 3.0 and reintroduce them in SRBench 4.0 under a different track (non-Gaussian distributions).

In another analysis, I picked the top-10 algorithms w.r.t. r2_test and removed from the list of datasets those where:

  • the smallest r2_test among these 10 algorithms was greater than 0.9 (so all of them were competent enough to find the solution), or
  • the standard deviation of r2_test for these 10 algorithms was smaller than 0.05 (there's not much variation among the top-10, so maybe nobody can find a better solution).

With this procedure we end up with 85 datasets. If we only keep those with domain R, we end up with 58 datasets (all of them Friedman :-) ).
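The filtering procedure above could be sketched as follows (the column names `dataset`, `algorithm`, and `r2_test` are assumed, not taken from the actual results files):

```python
import pandas as pd

def filter_unresolved(results, top_algos, r2_thresh=0.9, std_thresh=0.05):
    """Keep datasets that the top algorithms neither all solve nor all tie on.

    Drops a dataset when min(r2_test) > r2_thresh (everyone solves it)
    or std(r2_test) < std_thresh (no variation among the top algorithms).
    `results` is a long-format DataFrame with columns dataset, algorithm,
    r2_test (assumed schema).
    """
    top = results[results["algorithm"].isin(top_algos)]
    stats = top.groupby("dataset")["r2_test"].agg(["min", "std"])
    keep = stats[(stats["min"] <= r2_thresh) & (stats["std"] >= std_thresh)]
    return keep.index.tolist()
```

This keeps exactly the datasets where there is still headroom for some algorithm to pull ahead.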

@gkronber

Some remarks on the uniqueness ratios of the Z+ datasets listed above:
  • 207_autoPrice and 195_auto_price are the same.
  • 573_cpu_act and 562_cpu_small are the same.
  • Source for datasets with _visualizing_ identifier is: William S. Cleveland: Visualizing Data
    • 690_visualizing_galaxy: target is "velocity relative to the earth"
  • Source for datasets with _analcatdata_ identifier is: Jeffrey S. Simonoff: Analyzing Categorical Data
    • 485_analcatdata_vehicle: "Margolis et al. (2000) analyzed data for 1991-1996 from the Fatality Analysis Reporting System, a nationwide registry of motor vehicle deaths in the United States. They reported deaths of children (younger than 16) cross-classified by type (passenger or pedestrian/bicyclist), gender, age, and whether the accident was alcohol related. The data are given in the file vehicle."
    • 523_analcatdata_neavote: "Many political, civic, and business organizations lobby state and federal legislators to try to influence their votes on key legislative matters. These organizations often track the voting patterns of legislators in the form of "report cards," which summarize the votes of legislators on key pieces of legislation. The file neavote summarizes the 1999 U.S. Senate legislative report card produced by the National Education Association (NEA, the nation's oldest and largest organization committed to advancing the cause of public education), based on 10 Senate votes during the year, and was obtained from the NEA's web site (www.nea.org)"
    • 557_analcatdata_apnea1: "An automated algorithm was also used to classify time periods into the same five categories using nasal pressure and mechanical respiratory input impedance." "The file apnea1 contains the results of comparisons of second-by-second evaluations by the two scorers for each of the 19 subjects". A more elaborate description is given in the book.
    • 556_analcatdata_apnea2: "The files apnea2 and apnea3 give results comparing the automated algorithm to scorer 1 and scorer 2, respectively. (The scorer values are not directly comparable to those in apnea1, because they were generated based on cross-validated test sets.)"

These files (and many others in PMLB) were most likely taken verbatim from StatLib (http://lib.stat.cmu.edu/datasets/) which references the sources.

Also relevant is the effort by @alexzwanenburg (EpistasisLab/pmlb#180) who invested a lot of time to identify duplicates and clean up some of the datasets in PMLB.

@folivetti (Contributor, Author)

Thanks @gkronber, this was going to be my next step (searching for duplicates), so this PR will make my life much easier :-)

@lacava (Member)

lacava commented Feb 21, 2024

There is a PR on the PMLB repo that fixes many of these issues; it might be a good place to start. Oh, I see @gkronber already mentioned this.
