Discrepancy in features computed and number of clusters with split up data #87

Open
marybethcassity opened this issue May 23, 2024 · 0 comments

marybethcassity commented May 23, 2024

I've noticed a discrepancy in the computed features and in the number of clusters in the embedded space when a single csv is split into multiple csvs and later recombined in the BSOID app/algorithm. I think there are two reasons for this.

First, the adaptive filtering in bsoid_utilities/likelihoodprocessing.py is based on the distribution of the likelihoods within each file, so its outcome changes with the length of the file. If the csv is split into multiple csvs, each piece is filtered against its own distribution, and the features calculated for a given data point may differ.
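To illustrate the effect, here is a minimal sketch in plain Python. It does not reproduce the code in likelihoodprocessing.py; `percentile_threshold` is a hypothetical stand-in for any filter whose cutoff is derived from the likelihood distribution of one file, and the likelihood values are synthetic.

```python
import random

def percentile_threshold(likelihoods, q=0.1):
    """Hypothetical stand-in for an adaptive filter: the cutoff is the
    q-quantile of the likelihoods in the file being processed."""
    ordered = sorted(likelihoods)
    return ordered[int(q * (len(ordered) - 1))]

def kept_mask(likelihoods, threshold):
    """True where a frame's likelihood passes the cutoff."""
    return [p >= threshold for p in likelihoods]

random.seed(0)
# Synthetic likelihoods: a confident first half, a noisier second half.
first_half = [random.uniform(0.8, 1.0) for _ in range(500)]
second_half = [random.uniform(0.2, 1.0) for _ in range(500)]
full = first_half + second_half

# One cutoff for the whole file versus one cutoff per split file.
thr_full = percentile_threshold(full)
thr_first = percentile_threshold(first_half)
thr_second = percentile_threshold(second_half)

mask_full = kept_mask(full, thr_full)
mask_split = kept_mask(first_half, thr_first) + kept_mask(second_half, thr_second)

# Frames whose keep/replace decision depends on how the data was split.
n_differ = sum(a != b for a, b in zip(mask_full, mask_split))
print(thr_full, thr_first, thr_second, n_differ)
```

With a distribution-based cutoff like this one, the same frame can pass the filter in the single-file case and fail it in the split case, so the filtered pose data already differ before any features are computed.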

Second, the StandardScaler() in extract_features.py is applied to each csv separately. If the csv is split into multiple csvs, the scaling is performed on each csv individually before the data is recombined, so the features calculated for a given data point may differ.
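A small sketch of why per-file scaling matters, using only the standard library. The `standardize` helper is my own illustration (subtract the mean, divide by the population standard deviation, which is the transformation StandardScaler applies per feature by default); the two chunks are made-up values, not B-SOiD features.

```python
import statistics

def standardize(values):
    """Zero-mean, unit-variance scaling of one feature column
    (population standard deviation, i.e. ddof=0)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# The same recording, stored either as one file or as two pieces.
chunk_a = [1.0, 2.0, 3.0, 4.0]
chunk_b = [10.0, 12.0, 14.0, 16.0]

# Scaling each piece on its own, then concatenating ...
scaled_separately = standardize(chunk_a) + standardize(chunk_b)
# ... versus scaling the concatenated data once.
scaled_together = standardize(chunk_a + chunk_b)

# Largest per-point disagreement between the two procedures.
max_gap = max(abs(s - t) for s, t in zip(scaled_separately, scaled_together))
print(max_gap)
```

Each piece is centered on its own mean and stretched by its own spread, so any difference in level between pieces is erased when they are scaled separately but preserved when they are scaled together.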

Therefore, even when the input data is identical and merely split across files (one csv split into multiple csvs that together contain the same pose data from the original mp4), the number of clusters used to train the random forest classifier does not match. What should I make of this? Does the embedded space still carry meaning about the behavior if it changes with factors such as file length and the way the data is combined? Is there anything I could be doing wrong that would cause this discrepancy?

Note that this discrepancy arises before the UMAP embedding or HDBSCAN clustering. The computed features differ, which results in a different embedded space and therefore a different number of clusters used to train the random forest classifier.

@marybethcassity marybethcassity changed the title Discrepancy in number of clusters with split up data Discrepancy in features computed and number of clusters with split up data May 23, 2024