Potentially surprising behavior with tech.v3.dataset.column-filters/categorical #255

Closed
bowbahdoe opened this issue Jun 19, 2021 · 2 comments

bowbahdoe commented Jun 19, 2021

Context

I'm just getting ramped up with the library, so I'm sure I'm not following the recommended path. One problem my brother has had when analyzing data in R is splitting the data into different subsets to run basic models on.

For example, given a dataset with columns "price", "A", "B", "C", "D", and "E", run linear regression to predict price using only A; A and B; A, B, and C; ... B and E; and so on.

This is my first draft for splitting a dataset into all the relevant combinations:

(require '[tech.v3.dataset :as ds])
(require '[tech.v3.dataset.column :as column])
(require '[tech.v3.dataset.column-filters :as cf])
(require '[clojure.math.combinatorics :as combo])
;; needed for string/join in column-combos
(require '[clojure.string :as string])

(def csv-data
  (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"))

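(defn column-combos
  "Returns one dataset per subset of df's columns that contains the price
  column and at least one other column."
  [df]
  (let [columns (ds/columns df)
        subsets (->> (combo/subsets columns)
                     (filter not-empty)
                     ;; keep only subsets that include the price column
                     (filter #(some #{(df "price")} %))
                     ;; and at least one other column
                     (filter #(> (count %) 1)))]
    (for [subset subsets]
      ;; name each dataset after the columns it contains
      (-> (ds/new-dataset {:dataset-name
                           (string/join " and "
                                        (map column/column-name subset))}
                          subset)
          ;; encode any categorical columns as numbers
          (ds/categorical->number cf/categorical)))))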
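
A variation on the subset step above, sketched over column names rather than column objects. `predictor-subsets` is a hypothetical helper name, not part of either library; only `combo/subsets` comes from clojure.math.combinatorics.

```clojure
(require '[clojure.math.combinatorics :as combo])

(defn predictor-subsets
  "Hypothetical helper: every non-empty subset of `predictors`,
  each with `target` prepended."
  [target predictors]
  (->> (combo/subsets predictors)
       (remove empty?)            ; drop the empty subset
       (map #(cons target %))))   ; always include the target column

(count (predictor-subsets "price" ["A" "B" "C" "D" "E"]))
;; => 31
```

Each resulting name sequence could then be handed to something like `ds/select-columns` to build the per-model dataset, which avoids comparing column objects for equality.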

What went wrong?

The issue is that cf/categorical, which I included in the test to encode the "stock" field in the example data as a number, returns nil if there are no categorical columns in the dataset. It is documented to "Return a dataset containing only the categorical columns.", so I expected it to return an empty dataset when no columns match.

(I didn't actually read the docstring first; I just assumed.)

So the working version of this function ended up needing this workaround:

(ds/categorical->number (comp (fn [ds]
                                (or ds
                                    (ds/new-dataset [])))
                              cf/categorical))))))
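
The same nil-to-default idea can be sketched as a small reusable wrapper in plain Clojure. `with-default` is a hypothetical name, not something the library provides.

```clojure
(defn with-default
  "Hypothetical helper: wraps `f` so that a nil result is replaced
  by `default`. Non-nil results pass through unchanged."
  [f default]
  (fn [x]
    (let [result (f x)]
      (if (nil? result) default result))))

;; A function that always returns nil now yields the default instead.
((with-default (constantly nil) []) :anything)
;; => []
```

Under that assumption, the workaround above would read as `(with-default cf/categorical (ds/new-dataset []))` passed to `ds/categorical->number`.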

cnuernber (Collaborator) commented

Interesting. Makes sense. Another option would be to first do a group-by and then run combo on the keys. Agreed that it could return an empty dataset. Filtering may be just as fast in most cases as the grouping and concat steps.

cnuernber (Collaborator) commented

Hmm, an empty dataset and nil should work the same. I wonder if it wouldn't be better to update lots of dataset functions so that nil and an empty dataset return the same value. row-count, column-count, columns, etc. should all be safe to call on nil; I have hit that before in a few cases.
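
A rough caller-side sketch of what "safe to call on nil" could look like. `safe-row-count` and `safe-column-count` are hypothetical wrappers written for illustration; they are not the library's implementation, and only `ds/row-count` and `ds/column-count` are taken from tech.v3.dataset.

```clojure
(require '[tech.v3.dataset :as ds])

(defn safe-row-count
  "Hypothetical wrapper: treats a nil dataset as having zero rows."
  [dataset]
  (if (nil? dataset) 0 (ds/row-count dataset)))

(defn safe-column-count
  "Hypothetical wrapper: treats a nil dataset as having zero columns."
  [dataset]
  (if (nil? dataset) 0 (ds/column-count dataset)))

(safe-row-count nil)
;; => 0
```

With wrappers like these, code that receives the nil result of a column filter would not need to special-case it before asking for counts.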
