Release Performance Improvement · dvgodoy/handyspark

Performance Improvements

Summaries are no longer computed when a HandyFrame is created.
Column statistics (q1, q3, median, percentile) now accept a precision argument (default = 0.01), so approximate statistics can be computed faster.
Stratify operations no longer use RDD methods; they rely on Spark's built-in DataFrame optimizer to deliver fast columnar statistics. Almost every stratify operation is substantially faster as a result.

Stratified transformers

The HandyImputer and HandyFencer transformers now store values for stratified operations using the column name as the first level of the dictionary and the filter clause as the second level. Version 0.1.0a1 used the inverse structure.

In version 0.1.0a1:

{'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
 'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
 'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
 'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
 'Pclass == "3" and Sex == "female"': {'Age': 21.75},
 'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}

In version 0.2.0a1:

{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
         'Pclass == "1" and Sex == "male"': 41.28138613861386,
         'Pclass == "2" and Sex == "female"': 28.722972972972972,
         'Pclass == "2" and Sex == "male"': 30.74070707070707,
         'Pclass == "3" and Sex == "female"': 21.75,
         'Pclass == "3" and Sex == "male"': 26.507588932806325}}
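
Code that stored or inspected the 0.1.0a1 dictionaries may need the two levels swapped. The following is a minimal sketch in plain Python, not part of HandySpark; the helper name invert_nested is hypothetical and the sample values are taken from the listings above.

```python
from collections import defaultdict

def invert_nested(old):
    """Swap the two levels of a nested dict:
    {clause: {column: value}} becomes {column: {clause: value}}."""
    new = defaultdict(dict)
    for clause, by_column in old.items():
        for column, value in by_column.items():
            new[column][clause] = value
    return dict(new)

# Two entries in the 0.1.0a1 layout (filter clause first, column second).
old_style = {
    'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
    'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
}

# Same values in the 0.2.0a1 layout (column first, filter clause second).
new_style = invert_nested(old_style)
print(new_style)
```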

Outlier detection and removal

Two new methods are available on both HandyFrame and HandyColumns objects for detecting and removing outliers based on Mahalanobis distance:

get_outliers: returns a Spark DataFrame containing all rows considered outliers
remove_outliers: returns a filtered Spark DataFrame where all outliers were removed

These methods consider only numeric columns. The threshold (default 99.9%) is converted into the corresponding chi-square critical value, which is then used to filter the rows.
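
To illustrate the idea, here is a minimal sketch in plain Python for the two-column case. It is not HandySpark's implementation: the function names and the toy data are made up, and it assumes that the squared Mahalanobis distance is compared against the chi-square quantile with as many degrees of freedom as columns. With two degrees of freedom the chi-square CDF is 1 - exp(-x / 2), so the critical value has a closed form; for more columns a quantile function such as scipy.stats.chi2.ppf would be needed.

```python
import math

def mahalanobis_sq_2d(rows):
    """Squared Mahalanobis distance of each (x, y) row from the sample mean."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in rows:
        dx, dy = x - mx, y - my
        # d^T S^-1 d, using the closed-form inverse of a 2x2 matrix.
        distances.append(
            (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        )
    return distances

def chi2_critical_2dof(threshold):
    """Chi-square quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - threshold)

# Toy data: a 5 x 5 grid of points plus one far-away row.
rows = [(float(i), float(j)) for i in range(5) for j in range(5)]
rows.append((50.0, -50.0))

critical = chi2_critical_2dof(0.999)  # threshold of 99.9%
distances = mahalanobis_sq_2d(rows)
outliers = [row for row, d in zip(rows, distances) if d > critical]
inliers = [row for row, d in zip(rows, distances) if d <= critical]
print(outliers)
```

In this sketch, outliers plays the role of the get_outliers result and inliers the role of the remove_outliers result.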

Binary classification metrics

The BinaryClassificationMetrics object was extended to accept a Spark DataFrame (previously only an RDD was supported), together with a scoreCol containing the vector of probabilities output by a classifier and a labelCol containing the true labels.

It exposes several methods that were not available in PySpark:

thresholds
roc
pr
fMeasureByThreshold
precisionByThreshold
recallByThreshold

It also implements some new methods:

getMetricsByThreshold: returns a Spark DataFrame with the metrics (FPR, recall, and precision) for each threshold
confusionMatrix: returns a DenseMatrix representing the confusion matrix for the given threshold
print_confusion_matrix: returns a formatted pandas DataFrame with the confusion matrix
plot_roc_curve
plot_pr_curve
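
As a rough illustration of what per-threshold metrics and a confusion matrix at a given threshold contain, here is a sketch in plain Python. It does not use Spark or HandySpark; the function names, the [[TN, FP], [FN, TP]] layout, the rule that a score greater than or equal to the threshold counts as positive, and the sample scores are all assumptions made for the example.

```python
def confusion_matrix(scores, labels, threshold):
    """Return [[TN, FP], [FN, TP]], predicting positive when score >= threshold."""
    tn = fp = fn = tp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

def metrics_by_threshold(scores, labels):
    """One row per distinct score, from the highest threshold to the lowest."""
    rows = []
    for threshold in sorted(set(scores), reverse=True):
        (tn, fp), (fn, tp) = confusion_matrix(scores, labels, threshold)
        rows.append({
            "threshold": threshold,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            # With no positive predictions, precision is taken as 1.0 here.
            "precision": tp / (tp + fp) if tp + fp else 1.0,
        })
    return rows

# Hypothetical positive-class probabilities and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

matrix = confusion_matrix(scores, labels, 0.5)
table = metrics_by_threshold(scores, labels)
for row in table:
    print(row)
```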

Information Theory

The HandyColumn object now exposes methods for computing entropy and mutual information:

entropy: returns a pandas Series with the entropy of the given columns
mutual_info: returns a pandas DataFrame with the mutual information between the given columns
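
For reference, both quantities can be sketched in plain Python for categorical values. This is not HandySpark's implementation: the function names mirror the methods above only for readability, the logarithm base (2, giving bits) is an assumption, and the column values are invented.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """Mutual information via I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical categorical columns with four rows each.
sex = ["male", "female", "male", "female"]
survived = [0, 1, 0, 1]   # fully determined by sex in this toy data
pclass = [1, 1, 2, 2]     # independent of sex in this toy data

print(entropy(sex))
print(mutual_info(sex, survived))
print(mutual_info(sex, pclass))
```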
