Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add RandomUnderSampling functionality #89

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

vmpyr
Copy link

@vmpyr vmpyr commented Apr 8, 2023

Work underway. Have added functions to utils for future use as well

  • Use updated packages
  • Write the Random Under Sampler
  • Add tests

@vmpyr vmpyr marked this pull request as ready for review April 12, 2023 09:56
Copy link

@AshlinHarris AshlinHarris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I can't access the branch security settings for this repo, but I'm guessing that 100% codecov is a required test. Do you have test cases for each option? These would also be a great help for me, since I'm less familiar with this repo.

Also, I would change the version "0.9.0-dev".

Project.toml Outdated
@@ -1,23 +1,26 @@
name = "ClassImbalance"
uuid = "04a18a73-7590-580c-b363-eeca0919eb2a"
authors = ["Paul Stey <paulstey@gmail.com>", "Dilum Aluthge <dilum@aluthge.com>", "Brown Center for Biomedical Informatics <bcbi@brown.edu>"]
version = "0.8.7"
version = "0.8.8"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a minor release ("0.9.0" or "0.9.0-dev")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, I'll do that

replacement::Bool = false
) where T <: AbstractVector where S <: Integer where A <: Any
# check if X implements getobs
@assert Tables.istable(X) "$X is not implementing the MLUtils.jl getobs interface"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any example inputs? Here is what I tried to get around this @assert:

pkg> activate --temp
pkg> add https://github.com/vmpyr/ClassImbalance.jl#vmpyr/random_undersampler
# Example from README: 
julia> using ClassImbalance
julia> y = vcat(ones(20), zeros(180)); # 0 = majority, 1 = minority
julia> X = hcat(rand(200, 10), y)
julia> X2, y2 = smote(X, y, k = 5, pct_under = 100, pct_over = 200)
([0.8133030487718941 0.6399427612098323 … 0.18570608874157368 1.0; 0.7749414282075959 0.46482119778838893 … 0.17369768251844653 1.0; … ; 0.8348062387369848 0.04923338419004153 … 0.8706114262388865 1.0; 0.8582028352406533 0.39648688626915957 … 0.46337293622094944 1.0], [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0  …  1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
julia> using Tables
julia> random_undersampler(Tables.table(X),y)
(Tables.DictColumnTable with 40 rows, 11 columns, and schema:
 :Column1   Float64
 :Column2   Float64
 :Column3   Float64
 :Column4   Float64
 :Column5   Float64
 :Column6   Float64
 :Column7   Float64
 :Column8   Float64
 :Column9   Float64
 :Column10  Float64
 :Column11  Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

julia> random_undersampler(Tables.table(X2),y2)
(Tables.DictColumnTable with 80 rows, 11 columns, and schema:
 :Column1   Float64
 :Column2   Float64
 :Column3   Float64
 :Column4   Float64
 :Column5   Float64
 :Column6   Float64
 :Column7   Float64
 :Column8   Float64
 :Column9   Float64
 :Column10  Float64
 :Column11  Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I do have one I tried. I had .csv files for it too btw, you can find them here.

# import the required packages
import CSV
using ClassImbalance, DataFrames, DelimitedFiles

# load the data as defined in the task guidelines
# load X as a DataFrame
X = CSV.read("./data/pima-indians-diabetes-X.csv", DataFrame; header=["x1","x2","x3","x4","x5","x6","x7","x8"])
# load y as a CategoricalVector
y = CategoricalVector(vec(readdlm("./data/pima-indians-diabetes-y.csv", Int)))

# call the undersampler
Xover, yover = random_undersampler(X, y)

# print the results
show(Xover)
println()
show(IOContext(stdout, :limit => true), yover)
println()

# print the counts of the classes
println("initial count of class 0: ", count(y .== 0))
println("initial count of class 1: ", count(y .== 1))
println("count of class 0 after undersampling: ", count(yover .== 0))
println("count of class 1 after undersampling: ", count(yover .== 1))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I've written the function such that X contains only the features and not the labels. I'm writing a overloaded function for a data matrix containing both features and labels.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example, the function does work correctly anyway, you can access the columns and check the data or the count of it.

julia> Xover, yover = random_undersampler(Tables.table(X),y)
(Tables.DictColumnTable with 40 rows, 10 columns, and schema:
 :Column1   Float64
 :Column2   Float64
 :Column3   Float64
 :Column4   Float64
 :Column5   Float64
 :Column6   Float64
 :Column7   Float64
 :Column8   Float64
 :Column9   Float64
 :Column10  Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

julia> count(y .== 1)
20

julia> count(y .== 0)
180

julia> count(yover .== 0)
20

julia> count(yover .== 1)
20

julia> Xover
Tables.DictColumnTable with 40 rows, 10 columns, and schema:
 :Column1   Float64
 :Column2   Float64
 :Column3   Float64
 :Column4   Float64
 :Column5   Float64
 :Column6   Float64
 :Column7   Float64
 :Column8   Float64
 :Column9   Float64
 :Column10  Float64

julia> Xover.Column1
40-element Vector{Float64}:
 0.3952844545871751
 0.8704752637842621
 0.6724007567263786
 0.35193156297873285
 0.7208683859930942
 0.4178980161173723
 0.33217406300791696
 0.07224589352246602
 0.09320295190634142
 0.10316414877241253
 0.6166656578236519
 0.20575631236600955
 0.5488182158068842
 0.3156879765220615
 0.8174789837160467
 
 0.10887263478147902
 0.6100936867437683
 0.9340285563941725
 0.11318763378056362
 0.24546432607219804
 0.2788921342943955
 0.6103736533699433
 0.05058497458051314
 0.9006314393478084
 0.3487415101299778
 0.693372173075407
 0.4980414320194523
 0.9952081568741031
 0.8699331124771522
 0.11711882057394163

@DilumAluthge
Copy link
Member

Just a quick comment.

This package hasn't been updated in a while, so the CI in this repo is Travis CI. Of course, Travis CI no longer offers free CI for open source projects. So we need to migrate the CI to GitHub Actions CI.

@vmpyr @AshlinHarris Would you two be willing to work together to migrate CI to GitHub Actions? You can use https://discourse.julialang.org/t/easy-workflow-file-for-setting-up-github-actions-ci-for-your-julia-package/49765 as a starting point.

@AshlinHarris
Copy link

@DilumAluthge Could you please add me as an admin on this repo? I can't access the branch protection, but I'll probably need to remove the travis-ci requirement.

@DilumAluthge
Copy link
Member

Done.

@DilumAluthge
Copy link
Member

If possible, I'd recommend doing the Travis-to-GitHub-Actions migration in a separate PR. I tend to prefer multiple smaller PRs (instead of one large PR), because I find that the smaller PRs are each easier to review.

@vmpyr
Copy link
Author

vmpyr commented Apr 13, 2023

@AshlinHarris

I'm guessing that 100% codecov is a required test. Do you have test cases for each option?

Not as of now, I have written the functionality and will start writing the tests today most probably.

@DilumAluthge
Copy link
Member

Yeah, we should keep the requirement of having 100% code coverage.

@vmpyr
Copy link
Author

vmpyr commented Apr 13, 2023

@DilumAluthge

If possible, I'd recommend doing the Travis-to-GitHub-Actions migration in a separate PR.

Yes, I'll happily work on that. Will I be needed to have admin privileges as well? If so, please add me as admin too.

The repo needs a huge overhaul since it is very old now.

@DilumAluthge
Copy link
Member

DilumAluthge commented Apr 13, 2023

It'll be sufficient for @AshlinHarris to have admin permissions; there are relatively few settings that need to be modified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants