
new feature request - automatically fit multiple variables #27

Open
jkmackie opened this issue Feb 11, 2023 · 2 comments

@jkmackie commented Feb 11, 2023

I recommend that dfit.fit_transform(X) be extended to accept multiple variables. Each variable would be fitted individually:

- matrix rows = samples
- matrix columns = features (variables)

[feature_matrix image]

The proposed functionality mirrors the popular scikit-learn API. Here is an example of that API: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
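
For instance, here is a minimal sketch of that column-wise pattern using MinMaxScaler from the linked docs (each column is handled independently, which is the behavior proposed for dfit.fit_transform):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # rows = samples, columns = features

print(MinMaxScaler().fit_transform(X))  # each column is scaled to [0, 1] independently
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]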

Also, parallel processing across a multi-core CPU would be an awesome enhancement! :-)

Guillaume Lemaitre (https://github.com/glemaitre) committed code for sklearn.utils.parallel. He is a scikit-learn core developer and may be a good contact on how best to implement parallel processing in Python in 2023.
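
For reference, a minimal sketch of the Parallel/delayed pattern those utilities expose (sklearn.utils.parallel mirrors joblib's Parallel/delayed API while also propagating scikit-learn config to the workers; available in recent scikit-learn versions):

from sklearn.utils.parallel import Parallel, delayed

def square(x):
    return x * x

print(Parallel(n_jobs=2)(delayed(square)(i) for i in range(8)))
# [0, 1, 4, 9, 16, 25, 36, 49]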

@erdogant (Owner)

Great suggestion. I looked into it, but it is quite a lot of work to integrate such an approach. At the moment, all functions (figures, predict, etc.) are designed for single-column univariate input, not multi-column univariate input. I will put this on my never-ending, always-getting-longer todo list.
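
Something like the per-column loop below would be the easy part; making the figures, predict, etc. multi-column aware is the real work. (A rough sketch only; fit_transform_columns is a hypothetical name, not part of the distfit API.)

import numpy as np
from distfit import distfit

def fit_transform_columns(X):
    # Hypothetical wrapper: fit a separate univariate distfit model per column.
    X = np.asarray(X)
    models = []
    for j in range(X.shape[1]):
        dfit = distfit()
        dfit.fit_transform(X[:, j])  # each column is fitted individually
        models.append(dfit)
    return models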

@jkmackie (Author)

Thank you for the reply!

In the meantime, here is starter code to run a distfit exploratory data analysis across multiple cores. pandas DataFrames are used for readability. (The code can easily be tweaked to use NumPy instead.)

The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction. It has 6819 rows and 96 columns.

Note: error handling is required to run distfit on this dataset; certain columns will error out, with or without parallel processing.

# Starter code (written for a Jupyter notebook; display() is the IPython helper).
import numpy as np
import pandas as pd
import re
from distfit import distfit
from joblib import Parallel, delayed
import collections

pd.options.display.max_columns = 100  # show wide frames in full

# Numeric-only data from here (sign-in required):  
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2

#----------------------------------------------------------------------------------
# Clean up column names and lower memory usage.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")
df.columns = [re.sub(r"\s+", "_", c.strip()) for c in df.columns]  # strip and underscore-join column names

print('df shape:', df.shape)
display(df.tail(3))

for c in df.columns:
    df[c] = pd.to_numeric(df[c], downcast='float')  # downcast float64 -> float32 to lower memory


#----------------------------------------------------------------------------------
# Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
chunks = np.array_split(df, len(df.columns), axis=1)  # each chunk is a single column (distfit is univariate)
display(chunks[0].head())
display(chunks[1].head())

def get_distfit(chunk):
    """Fit a fresh distfit model to a one-column DataFrame chunk; return the best distribution name."""
    try:
        dfit = distfit()  # fresh instance per chunk so parallel workers share no state
        result = dfit.fit_transform(chunk.values.ravel(), verbose=30)  # flatten (n, 1) to 1-D
        return result['model']['name']
    except Exception:  # certain columns error out; see note above
        return 'ERROR'

with Parallel(n_jobs=-2, prefer="processes") as parallel:  # n_jobs=-2 leaves one core free
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)

display(list(zip(df.columns, results))[0:5])  # best distribution for the first five columns
display(sorted(collections.Counter(results).items(), key=lambda x: x[1], reverse=True))  # counts per distribution name


#----------------------------------------------------------------------------------
# Get the best distribution one column at a time (slower than the parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = []
for chunk in chunks:
    sequential_outputs.append(get_distfit(chunk))
display(list(zip(df.columns, sequential_outputs))[0:5])  # best distribution for the first five columns
display(sorted(collections.Counter(sequential_outputs).items(), key=lambda x: x[1], reverse=True))  # counts per distribution name
