
Add other stats for low-order moments #2006

Open
xwu99 opened this issue Nov 29, 2021 · 3 comments

@xwu99 (Contributor) commented Nov 29, 2021

We are using oneDAL distributed algorithms to optimize Spark ML. Some metrics are missing; could you check whether the following stats can be added to distributed low-order moments (basic statistics)?

  • count
  • numNonzeros
  • weightSum
  • normL1
  • normL2

Check for details: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html
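
For reference, a minimal PySpark sketch (the toy data is made up) showing where these metrics come from on the Spark side, via the MultivariateStatisticalSummary API linked above:

```python
# Minimal PySpark sketch (assuming Spark >= 3.0) of the requested metrics
# on the Spark side; the input data below is just toy input for illustration.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

spark = SparkSession.builder.appName("low-order-moments-demo").getOrCreate()

rdd = spark.sparkContext.parallelize([
    Vectors.dense([1.0, 0.0, 3.0]),
    Vectors.dense([4.0, 5.0, 0.0]),
    Vectors.dense([7.0, 8.0, 9.0]),
])

summary = Statistics.colStats(rdd)   # MultivariateStatisticalSummary
print(summary.count())               # number of observations
print(summary.numNonzeros())         # per-column count of non-zero entries
print(summary.normL1())              # per-column L1 norm
print(summary.normL2())              # per-column L2 norm
print(summary.mean())                # already covered by oneDAL low-order moments
```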

@makart19 (Contributor) commented:

Clarification details from our discussion with Xiaochang:

  • count: [Xiaochang]: Users usually request several metrics rather than a single one, so it is convenient to get the observation count from the result along with the other metrics. Otherwise, extra coding effort is required on the user's side.
  • numNonzeros: [Xiaochang]: simply count the number of entries that are not 0.0.
  • weightSum: [Xiaochang]: Spark's dataframe has a separate weight column with a value per row.
    We need to investigate the possibility of adding a corresponding API to compute_input and compute_result.

We also need to check how much adding all these metrics affects the performance of the default case (when all metrics are calculated). See the rough sketch below.
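
A rough single-pass sketch in NumPy (not oneDAL code; `partial_stats`, `merge`, and the optional per-row weights are just illustrative names) of how the extra metrics could be accumulated block-wise alongside the existing sums and sums of squares, so the per-node partial results stay cheap to merge:

```python
# Rough NumPy sketch (not oneDAL code) of block-wise accumulation of the
# extra metrics in the same pass as the existing low-order moments.
import numpy as np

def partial_stats(block, weights=None):
    """Per-block partial results; `weights` is a hypothetical optional
    per-row weight column, as in Spark's dataframes."""
    w = np.ones(block.shape[0]) if weights is None else weights
    return {
        "count": block.shape[0],
        "weight_sum": w.sum(),
        "num_nonzeros": (block != 0.0).sum(axis=0),
        "norm_l1": np.abs(block).sum(axis=0),
        "sum_squares": (block ** 2).sum(axis=0),  # normL2 = sqrt of merged total
    }

def merge(a, b):
    # Partial results are additive, so merging across blocks/nodes is cheap.
    return {k: a[k] + b[k] for k in a}

blocks = [np.random.rand(100, 3), np.random.rand(50, 3)]
total = partial_stats(blocks[0])
for blk in blocks[1:]:
    total = merge(total, partial_stats(blk))
norm_l2 = np.sqrt(total["sum_squares"])
```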

@xwu99 (Contributor, Author) commented Dec 13, 2021

Thanks @makart19.
For the weight column, you could also consider supporting weighted points as a general feature across all algorithms, e.g. weighted points for KMeans. See Spark's KMeans, which has an optional weightCol that can be set (a small example follows below):
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html
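
For illustration, a small PySpark example (assuming Spark >= 3.0; the column names and data are made up) of the optional weightCol:

```python
# Small PySpark example (Spark >= 3.0) of weighted KMeans via weightCol;
# the "features"/"weight" columns and the data are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("weighted-kmeans-demo").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]), 2.0),
     (Vectors.dense([1.0, 1.0]), 2.0),
     (Vectors.dense([9.0, 8.0]), 1.0),
     (Vectors.dense([8.0, 9.0]), 1.0)],
    ["features", "weight"])

# Rows with larger weight pull the cluster centers harder.
kmeans = KMeans(k=2, seed=1, weightCol="weight")
model = kmeans.fit(df)
print(model.clusterCenters())
```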

@makart19 (Contributor) commented:

OK, we will consider weight support for other algorithms.
