[DYNOTEARS] TypeError: Index must be integers #86

Closed
LukaJakovljevic opened this issue Jan 5, 2021 · 6 comments
@LukaJakovljevic

Description

Hi, I get an error when running DYNOTEARS on a DataFrame.
It seems the method does not recognise that df.index holds integers.

Steps to Reproduce

  1. Run the second cell from the example in I have a question about Dynotears #74 (comment), i.e. the cell that calls dynotears.
  2. The same error occurs when applying from_numpy_dynamic or from_pandas_dynamic to other DataFrames whose indexes are integers in increasing order.

Expected Result

DYNOTEARS runs on the DataFrame without error.

Actual Result

TypeError: Index must be integers

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-82-5ce17a03d23b> in <module>
      1 from causalnex.structure.dynotears import from_pandas_dynamic
----> 2 g_learnt = from_pandas_dynamic(df,1,lambda_w=.1,lambda_a=.1,w_threshold=.1)
      3 g_learnt

~\anaconda3\envs\test_env\lib\site-packages\causalnex\structure\dynotears.py in from_pandas_dynamic(time_series, p, lambda_w, lambda_a, max_iter, h_tol, w_threshold, tabu_edges, tabu_parent_nodes, tabu_child_nodes)
     98     time_series = [time_series] if not isinstance(time_series, list) else time_series
     99 
--> 100     X, Xlags = DynamicDataTransformer(p=p).fit_transform(time_series, return_df=False)
    101 
    102     col_idx = {c: i for i, c in enumerate(time_series[0].columns)}

~\anaconda3\envs\test_env\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

~\anaconda3\envs\test_env\lib\site-packages\causalnex\structure\transformers.py in fit(self, time_series, return_df)
     88         """
     89         time_series = time_series if isinstance(time_series, list) else [time_series]
---> 90         self._check_input_from_pandas(time_series)
     91         self.columns = list(time_series[0].columns)
     92         self.return_df = return_df

~\anaconda3\envs\test_env\lib\site-packages\causalnex\structure\transformers.py in _check_input_from_pandas(self, time_series)
    203 
    204             if t.index.dtype != int:
--> 205                 raise TypeError("Index must be integers")
    206 
    207             if self.columns is not None:

TypeError: Index must be integers

Your Environment

  • CausalNex version used: 0.9.0
  • Python version used: 3.8.5
  • Operating system and version: Windows 10 Pro, x64
@GabrielAzevedoFerreiraQB
Contributor

GabrielAzevedoFerreiraQB commented Jan 6, 2021

Hi Luka,
It is strange: I tested the code and it worked here
image

Could you show the steps you are following, please?

A few (hopefully helpful) notes:

The index of the DataFrame you provide is quite important in the from_pandas_dynamic function: it represents the cadence of the time series in your data.

For example, row 0 represents time stamp 0, i.e. all the features observed at moment 0, or x_0. Row 1 represents x_1, and so on. Ideally we have a time series x_0, x_1, x_2, ... with occasional gaps where we don't have data for certain time stamps (e.g. x_0, x_1, x_2, x_5, x_6, x_7, ...). Your index represents this time series.

This means a couple of things:

  • If the index of the df is, for example, (0, 1, 3, 5), it means that you don't know what happens at time stamp 2, x_2 (it is missing information, and from_pandas_dynamic has a way of dealing with that). If you have (0, 2, 4, 6, ...), it means that you never have two consecutive events (you have x_0 but not x_1, ...), and the resulting network will be very different from the one you get with (0, 1, 2, 3, ...) as the index.

  • If your index is not an integer, there is no way for dynotears to determine which events are consecutive and which are not. This, then, generates an error.

  • Finally, if the index values are integers but not in order (for example, 0, 1, 2, 4, 3, 5), we throw an error for safety, since it is more natural to store a time series in increasing order of events. This avoids the case where the user does not pay attention to the index.
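
To make the three cases above concrete, here is a minimal sketch of how an integer index determines which rows have a known predecessor. This is not CausalNex's implementation: the helper name usable_rows, the ValueError for unordered indexes, and the use of pandas' is_integer_dtype are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def usable_rows(index: pd.Index, p: int = 1) -> list:
    """Hypothetical helper: time stamps t whose lags t-1, ..., t-p are all present."""
    # A non-integer index gives no notion of "consecutive" time stamps.
    if not pd.api.types.is_integer_dtype(index):
        raise TypeError("Index must be integers")

    values = index.to_numpy()
    # Assumed behaviour: reject indexes that are not strictly increasing.
    if np.any(np.diff(values) <= 0):
        raise ValueError("Index must be strictly increasing")

    stamps = values.tolist()
    present = set(stamps)
    return [t for t in stamps if all((t - k) in present for k in range(1, p + 1))]


dense = usable_rows(pd.Index([0, 1, 2, 3]))   # [1, 2, 3]: every row after the first has a predecessor
gappy = usable_rows(pd.Index([0, 1, 3, 5]))   # [1]: x_2 and x_4 are missing
even = usable_rows(pd.Index([0, 2, 4, 6]))    # []: no two consecutive time stamps
```

With a lag of p=1, the (0, 2, 4, 6) index leaves no row with a known predecessor, which is why the resulting network differs so much from the one learnt on (0, 1, 2, 3).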

@LukaJakovljevic
Author

LukaJakovljevic commented Jan 6, 2021

Hi @GabrielAzevedoFerreiraQB,

Thank you for the fast answer and explanation.

I ran exactly the same cells as in your example. This is what I get in cell [4] (as do a few people I recently asked to install the library):
Capture

I believe this is because df.index.dtype returns int64, which may have something to do with numpy.
Indeed, if you evaluate df.index.dtype == int after cell [3], you get False.
Line 204 of causalnex\structure\transformers.py compares the dtype to int, and that is where the error comes from.

When I change that line to if t.index.dtype != 'int64', everything works.

Could you change that part of the code so that the index is also allowed to be of type 'int64'?

Thanks,
Luka
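
The comparison described in this comment can be reproduced with a small sketch. It assumes that the result of comparing a NumPy dtype with the built-in int depends on the platform's default integer width (for example, NumPy 1.x on Windows uses a 32-bit default integer), and it shows two width-independent checks that exist in NumPy and pandas. The DataFrame below is illustrative data, not the one from the linked example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with an explicit 64-bit integer index.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=np.array([0, 1, 2], dtype="int64"),
)

# Platform-dependent: the built-in int is converted to NumPy's default
# integer dtype, which is not int64 everywhere, so this can be False.
matches_builtin_int = df.index.dtype == int

# Width-independent: both accept int32, int64 and other integer dtypes.
is_int_numpy = np.issubdtype(df.index.dtype, np.integer)
is_int_pandas = pd.api.types.is_integer_dtype(df.index)

print(df.index.dtype, matches_builtin_int, is_int_numpy, is_int_pandas)
```

Either of the last two checks accepts int64 without hard-coding a specific width, which is one way the suggested fix could be generalised.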

@GabrielAzevedoFerreiraQB
Contributor

GabrielAzevedoFerreiraQB commented Jan 7, 2021

That makes sense! We will make a change to allow int64 and possibly other integer types.
Thanks a lot for finding this bug!

For now, if you do not want to change the source code, I suggest trying df.index = df.index.astype(int) or
using the versions of pandas and numpy below.

image

@GabrielAzevedoFerreiraQB
Contributor

On that note, would you be able to share which pandas and numpy versions you are using?

@LukaJakovljevic
Author

You're welcome!

Here are the versions below:

P.S. I tried that and similar commands for changing the df.index type to plain int first.
The default int is int64, as you can see. I didn't find a way to make it plain int with these versions, which is why I ended up changing the code to accept this type; after that, everything worked.

Let me know if I can help with any further info.

Capture2

@oentaryorj oentaryorj added the bug Something isn't working label Jul 26, 2021
@oentaryorj
Contributor

More robust integer type checking has been implemented in this commit and will be available in the next CausalNex release.

@oentaryorj oentaryorj self-assigned this Sep 7, 2021
@qbphilip qbphilip mentioned this issue Nov 10, 2021