Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grammar reference #66

Open
teucer opened this issue Mar 4, 2022 · 19 comments
Open

Grammar reference #66

teucer opened this issue Mar 4, 2022 · 19 comments

Comments

@teucer
Copy link
Contributor

teucer commented Mar 4, 2022

I could not find any reference on the grammar that is supported. How does it compare to formulaic?

@tomicapretto
Copy link
Collaborator

You're right there's no formal grammar reference. The closest we have is the Getting started section in our docs.

Nevertheless, formulae and formulaic should agree in the design matrices that are constructed from the same formula. The internals are different, but the results should match.

One of the most relevant differences is that formulae implements the | operator to construct design matrices for random effects, while formulae does not (yet).

@teucer
Copy link
Contributor Author

teucer commented Mar 5, 2022

Thank you.

How does it compare to R? We are in the process of migrating an R package to python, it would be useful to know them.

@tomicapretto
Copy link
Collaborator

Could you tell me which features you're interested in? That could help to point you to relevant differences.

Formulae is very similar to model formulas in R. It still lacks the double pipe || operator, but maybe it is not relevant for you.

@teucer
Copy link
Contributor Author

teucer commented Mar 6, 2022

I was wondering about diff or lag operators for example. It is not clear to me how vector operators are handled.

@tomicapretto
Copy link
Collaborator

I think this is going to be clearer if you share a concrete R example with both the input and the output. In particular, how do you use diff() or lag() in a R formula?

@teucer
Copy link
Contributor Author

teucer commented Mar 6, 2022

Some thing like below:

log(y) ~ lag(log(y)) + diff(log(x))

As you can see diff and lag apply to the whole 'vector', they are not scalar functions.

@tomicapretto
Copy link
Collaborator

tomicapretto commented Mar 6, 2022

@teucer could you share a reproducible example? That's what I mean with both input and output. I understand the formula, but I don't know what is the expected behavior from that formula.

What I would like to have is the following

  • Sample input data
  • Sample formula
  • How the formula is used (e.g. is it used within lm() or within which modeling function?)
  • Sample output

As far as I know, the result of lag(x) is of length n, but the result of diff(x) is of length n - 1 so it is not clear to me how these are handled.

> x <- 1:5
> length(lag(x))
# [1] 5
> length(diff(x))
# [1] 4

@teucer
Copy link
Contributor Author

teucer commented Mar 7, 2022

Sorry for the lengthy discussion, I am not the author of the package that we are trying to convert. It seems that people have implemented their own lag and diff functions. The general question remains though, would something like below work?

# this is not a scalar function
import numpy as np

def lag(arr, num=1, fill_value=np.nan):
    if num >= 0:
        return np.concatenate((np.full(num, fill_value), arr[:-num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))
        
xs = np.arange(10)
print(lag(xs)) # > [nan  0.  1.  2.  3.  4.  5.  6.  7.  8.]

dm = design_matrices("y1 ~ lag(x)", data)

Now, it would be useful if we could pass our own additional transformations. I think you are using Environment to do that. What about passing a dictionary of transformations? It might be the case that we are defining them in another file.

# we want to avoid star import: "from .utils import *"
from .utils import lag 

def myfun(x):
    return x + 1

trans_dict = {"myfun":  myfun, "lag": lag}
 
dm = design_matrices("y1 ~ myfun(x)", data, transformations=trans_dict)

@tomicapretto
Copy link
Collaborator

Thanks for the comments clarifying my question!

  1. Formulae should work fine with array valued functions.
  2. The problem with the first lag() function is that it returns np.nan. Currently, Formulae either raises an error or drops rows containing missing values. We should add a case where we just want to keep missing values. See

if incomplete_rows_n > 0:
if na_action == "drop":
_log.info(
"Automatically removing %s/%s rows from the dataset.",
incomplete_rows_n,
data.shape[0],
)
data = data[~incomplete_rows]
else:
raise ValueError(f"'data' contains {incomplete_rows_n} incomplete rows.")

  1. At the moment, it is not possible to pass a dictionary of transformations to design_matrices(). Maybe you can modify the following dictionary

TRANSFORMS = {
"B": binary,
"binary": binary,
"C": C,
"I": I,
"offset": offset,
"p": proportion,
"prop": proportion,
"proportion": proportion,
"S": S,
"T": T,
}

which is what Formulae consumes when looking for internal transformations. However I don't think this is a clean solution.

  1. How many transformations do you have? If they are few, you could just do from .utils import lag, diff, etc and that would solve the problem.

@teucer
Copy link
Contributor Author

teucer commented Mar 7, 2022

  • 2: It would be good to keep them. We could ignore the rows with np.nan during estimation, but would need them later on.
  • 3: The only way to do is to "monkey patch" (?). Not my preferred approach, hence the suggestion. We could take the additional transformations and concatenate with TRANSFORMS.
  • 4: That is feasible, but I don't know how many we will end up with.

PS: could do PRs if required.

@tomicapretto
Copy link
Collaborator

It would be good if you could try to work on a PR for 2 (allowing to keep missing values) and 3, allowing to pass a dictionary with transformations.

For 3, have a look at

env = Environment.capture(env, reference=1)

and https://github.com/bambinos/formulae/blob/master/formulae/environment.py

and feel free to ask questions.

As a pointer, I would create a new Environment holding the transforms you want and./or use .with_outer_namespace().

@tomicapretto
Copy link
Collaborator

@teucer thanks for the contributions y #69 and #70. Is the development version working for your needs now?

@teucer
Copy link
Contributor Author

teucer commented Mar 9, 2022

Yes. Can we do a release?

@tomicapretto
Copy link
Collaborator

The development version as a lot of breaking changes. I want to double check a couple of things before doing a release.
Can you install from github in the meantime?

@teucer
Copy link
Contributor Author

teucer commented Mar 9, 2022

Ok. Is there an ETA?

PS: My main use case is behind corporate barriers, it would be difficult to install from github. We have an internal pypi proxy.

@tomicapretto
Copy link
Collaborator

I've been thinking about the release and I think it's not going to be a problem to have a release now. We need to update the Changelog first.

I just wanted to add tests for some features we don't have covered yet and test whether this development version was OK for what I want to do in Bambi. I guess I can do another release if I need to include more changes for Bambi.

@teucer
Copy link
Contributor Author

teucer commented Mar 10, 2022

Thank you for the support. Looking forward to the new release.

PS: I would be happy to further contribute if you have further tasks. If you add tasks and tag them (e.g. need contributor), it would be easier for me to pickup and send PRs.

@tomicapretto
Copy link
Collaborator

tomicapretto commented Mar 11, 2022

I'm having a problem building the docs, see https://github.com/bambinos/formulae/runs/5511845996?check_suite_focus=true

I don't know what's going on. I tried to reproduce the problem locally (same version of everything) but I couldn't.

Edit It drove me nuts.

I fixed the error by changing how formulae is installed within the environment where the action runs. See ed81e6e and https://github.com/bambinos/formulae/actions/runs/1969203706.

@tomicapretto
Copy link
Collaborator

@teucer there's a new release now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants