Discrepancy in features computed and number of clusters with split up data #87

marybethcassity · 2024-05-23T23:12:42Z

I've noticed a discrepancy in both the features computed and the number of clusters in the embedded space when a single csv is split into multiple csvs and later recombined in the BSOID app/algorithm. I think there are two reasons for this.

First, the adaptive filtering in bsoid_utilities/likelihoodprocessing.py is based on the distribution of the likelihoods in the file, so it would change with the length of the file. If the csv is split into multiple csvs, each piece is filtered against its own distribution, and the features calculated for a given data point may differ.
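To illustrate the first point, here is a minimal sketch of how a cutoff derived from a file's own likelihood distribution shifts when the file is split. The function `data_dependent_threshold` and its fixed-quantile rule are hypothetical stand-ins chosen for simplicity, not the actual rule in likelihoodprocessing.py, and the likelihood values are synthetic.

```python
import numpy as np


def data_dependent_threshold(likelihoods, q=0.1):
    # Hypothetical stand-in for an adaptive filter: a fixed quantile of the
    # file's own likelihoods. This is NOT B-SOiD's actual rule; it only
    # shares the property that the cutoff depends on the data in the file.
    return float(np.quantile(likelihoods, q))


rng = np.random.default_rng(0)
# Synthetic likelihoods: the first half of the recording is tracked poorly,
# the second half is tracked well.
first_half = rng.beta(2, 5, size=500)
second_half = rng.beta(8, 2, size=500)
whole = np.concatenate([first_half, second_half])

thr_whole = data_dependent_threshold(whole)
thr_first = data_dependent_threshold(first_half)
thr_second = data_dependent_threshold(second_half)

# The same frames (the first half) pass or fail the filter differently
# depending on whether they were processed alone or as part of the whole file.
kept_whole = int((first_half >= thr_whole).sum())
kept_split = int((first_half >= thr_first).sum())

print(f"threshold, whole file:  {thr_whole:.3f}")
print(f"threshold, first half:  {thr_first:.3f}")
print(f"threshold, second half: {thr_second:.3f}")
print(f"first-half frames kept: {kept_split} (split) vs {kept_whole} (whole)")
```

Any frame whose likelihood falls between the two cutoffs is treated differently in the two runs, so the features derived from it can differ as well.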

Second, the StandardScaler() in extract_features.py is applied to each csv separately. If the csv is split into multiple csvs, each one is scaled individually before the data is recombined, so the features calculated for each data point may differ.
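The second point can be reproduced with a toy example. The sketch below uses a plain NumPy z-score in place of StandardScaler() (both center each column and divide by the population standard deviation by default); the arrays `file_a` and `file_b` are made-up single-feature data, not real B-SOiD features.

```python
import numpy as np


def zscore(x):
    # Column-wise standardization with the population standard deviation
    # (ddof=0), mirroring StandardScaler's default behaviour.
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Hypothetical single-feature data from one recording split into two csvs.
file_a = np.array([[0.0], [1.0], [2.0]])
file_b = np.array([[10.0], [11.0], [12.0]])

# Scaling each file on its own, then recombining.
per_file = np.vstack([zscore(file_a), zscore(file_b)])

# Scaling the recombined data in one pass.
combined = zscore(np.vstack([file_a, file_b]))

print("per-file scaling:", per_file.ravel().round(3))
print("combined scaling:", combined.ravel().round(3))
```

With per-file scaling, the two files map onto identical values even though their raw ranges do not overlap, while scaling the combined data keeps them apart. The same raw data point therefore ends up with a different feature value depending on how the files were split.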

Therefore, even if the input data is identical and merely split across files (one csv split into multiple csvs that together contain the same pose data from the original mp4), the number of clusters used to train the random forest classifier will not match. What should I make of this? Does the embedded space still carry meaning about the behavior if it changes with factors such as file length and the way the data is combined? Is there anything I could be doing wrong to cause this discrepancy?

Note that this discrepancy arises before the UMAP embedding or HDBSCAN clustering. The computed features differ, which results in a different embedded space and therefore a different number of clusters used to train the random forest classifier.

marybethcassity changed the title ~~Discrepancy in number of clusters with split up data~~ Discrepancy in features computed and number of clusters with split up data May 23, 2024
