Var.vars should accept Vars that are themselves functions of multiple Vars #3
Data for any Var should be searched for in the following order:
This would lead to a recursive creation of Vars and their associated data until the data for the top-level Var is ultimately created.
@spencerkclark's comment on #44:
There is no such existing infrastructure to deal with any of these issues, but I'm definitely wanting to implement it. One potential (partial) route would be to have each model (or run or project) have a list of variables that it has native, and then somehow use that to determine which function is used to compute a given variable.
Keeping track of what has been computed (and easily accessing those computed variables) could be a useful front-end feature as well. Right now it is a little difficult to parse the directory structure that's created when a Calc is completed. I feel like you've given this problem (of how to store computed values) some thought in the past; if I recall correctly, you even considered trying to keep track of additional metadata, such as when a variable was computed and which version of a function was used. Have you thought about this more since the inclusion of xarray?
Thanks for bringing this thread/whole project back to life!
I really like this idea.
I agree. That directory structure is also ultimately arbitrary: where does one stop making new directories? E.g. why not have sub-directories of {$Variable} for the years, months, input data type, etc.? Ultimately, I think this should be replaced with a better serialization model.

In the past, I had been thinking about using a formal database, and that still may be part of the mix, but I think you're onto something with the xarray model. Emulating it would require some thought, though. Does the Run object have access, using dot notation, to every single saved computation? This could easily lead to thousands of attributes, which intuitively seems not good. Or does the Run object have each physical Var as one attribute (e.g.
The computation metadata ideas never got beyond "that would be cool someday" status (although that status remains), nor have I thought much about them since the switch to xarray. Yes, a re-assessment is sorely needed. Does that all make sense? Let me know your thoughts.
Very good point; a single attribute per every combination of variable / averaging-type would be very messy. I quite like your suggestion of keeping one physical Var per attribute at the Run level. In that form, a Run (among other things) could serve as a container of Datasets (which could be stored in separate files in a single folder associated with a given Run); in other words, keep the current directory structure, but replace each Var directory with a Dataset. A question that emerges for me is how far we can / should go to try and simplify the names of the attributes at the variable level.
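To make the one-physical-Var-per-attribute idea concrete, here is a minimal sketch (not the actual aospy API; the `Run` class, variable names, and reduction names are all hypothetical) of a Run that exposes one xarray Dataset per physical variable, with the Dataset's data variables holding the different reductions:

```python
import numpy as np
import xarray as xr

# Hypothetical sketch: a Run exposes one xarray Dataset per physical
# variable; each Dataset's data variables are that variable's reductions.
class Run:
    def __init__(self, name, datasets):
        self.name = name
        for var_name, ds in datasets.items():
            setattr(self, var_name, ds)

precip_ds = xr.Dataset({
    "precip_av": ("time", np.arange(3.0)),       # time average
    "precip_std": ("time", np.arange(3.0) * 2),  # standard deviation
})
run = Run("amip_run", {"precip": precip_ds})
print(list(run.precip.data_vars))  # ['precip_av', 'precip_std']
```

This keeps dot-notation access bounded to one attribute per physical Var, rather than one per variable/reduction combination.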
I've tried thinking about how we could do this below, but having thought through it more, I'm not super enthusiastic about what I came up with (see the drawbacks I list at the end). I'll leave it as a source for discussion, because I think we should consider making the attribute names simpler; at the moment it's just not clear to me how we could do it in the easiest way (i.e. use xarray as much as we can (don't reinvent the wheel), but use it for the right reasons). Let me know what you think! Thanks.

Regarding specifying the properties of a particular computed DataArray, it might be nicer if we could use the
For instance an alternative that xarray provides is that we could add a coordinate that would take String values:
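The original code snippet didn't survive here, but the idea could be sketched like this (a minimal illustration; the coordinate name `dtype_in_time` and the tag values are hypothetical):

```python
import numpy as np
import xarray as xr

# Tag a computed time series with the frequency of its input data by
# attaching a scalar, string-valued coordinate (names are hypothetical).
precip = xr.DataArray(
    np.ones(4),
    dims=["time"],
    coords={"time": np.arange(4)},
    name="precip",
).assign_coords(dtype_in_time="3hr")

ds = precip.to_dataset()
print(ds["precip"].coords["dtype_in_time"].item())  # 3hr
```

Because the coordinate is scalar, it records provenance without adding a dimension to the data.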
In this manner, a list of variable names in the Dataset returned by

A challenge associated with this method is that we likely wouldn't want to have multiple coordinates in a Dataset for a single attribute type. In other words, if we had a variable computed only using '3hr' data but wanted to add a variable computed using only 'monthly' data, we wouldn't want to have a coordinate
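The coordinate-proliferation concern can be made concrete: if the provenance tag is promoted to a shared dimension coordinate instead, merging variables with different tags pads each variable with NaNs (a sketch with made-up variable names):

```python
import xarray as xr

# Two variables whose provenance tags differ; promoting the tag to a
# dimension and merging forces NaN padding (names are made up).
precip = xr.DataArray(
    [1.0, 2.0], dims=["time"], coords={"time": [0, 1]}, name="precip"
).assign_coords(dtype_in_time="3hr").expand_dims("dtype_in_time")

olr = xr.DataArray(
    [3.0, 4.0], dims=["time"], coords={"time": [0, 1]}, name="olr"
).assign_coords(dtype_in_time="monthly").expand_dims("dtype_in_time")

ds = xr.merge([precip, olr])
# Each variable now spans both tags, and half of its values are NaN padding.
print(bool(ds["precip"].sel(dtype_in_time="monthly").isnull().all()))  # True
```

Every new tag value would multiply the NaN padding across all existing variables, which is the drawback described above.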
There may be some drawbacks with this method of doing things:
I agree, the current method of just appending strings to the file name is not a good way of storing the metadata. I think what matters most, and what is currently lacking, is having this metadata embedded within the file itself, as opposed to just in its file name. Right now the only information other than the coordinate arrays saved within the file is the variable name itself. So your proposal is definitely heading in the right direction. Getting this working seems more important than (and orthogonal to) the way the files are saved to disk in terms of directory structure etc.
Yes, this is untenable. Those parameter combinations for which data hasn't been generated should just be a single NaN (or None or False or something analogous), and those parameter combinations that do have data should have an array whose only dimensions are physically meaningful. This doesn't sound possible within the framework of xarray coords.

So I'm back to wondering about using a formal database? As you noted to me once before, it would get unwieldy to store the data within the DB itself, and as such the DB would instead hold paths/pointers of some kind to the actual data on disk. So then the problem of the directory structure, file names, etc. still exists. But effectively what we're doing is querying various categories of data, and it's all simply AND: get data with variable name == x AND year range == y AND dtype_in == z AND ... We don't necessarily need to expose the DB internals to the user -- i.e. no SQL knowledge necessary (or would we?). I am already out of my depth though; effectively zero meaningful experience working with databases.
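That AND-only query pattern could be sketched with Python's built-in sqlite3 (the table schema, column names, and paths here are hypothetical; as noted above, the rows hold pointers to files on disk rather than the data itself):

```python
import sqlite3

# Toy metadata table: each row maps query-able attributes to a file path
# (schema and names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calcs (var_name TEXT, yr_range TEXT, dtype_in TEXT, path TEXT)"
)
conn.executemany(
    "INSERT INTO calcs VALUES (?, ?, ?, ?)",
    [
        ("precip", "1983-1998", "3hr", "/archive/calc0001.nc"),
        ("precip", "1983-1998", "monthly", "/archive/calc0002.nc"),
        ("olr", "1983-1998", "3hr", "/archive/calc0003.nc"),
    ],
)

# "Get data with variable name == x AND year range == y AND dtype_in == z"
rows = conn.execute(
    "SELECT path FROM calcs WHERE var_name=? AND yr_range=? AND dtype_in=?",
    ("precip", "1983-1998", "3hr"),
).fetchall()
print(rows)  # [('/archive/calc0001.nc',)]
```

Wrapping queries like this in a helper function is what would let end users avoid SQL entirely.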
I think you may be right. Storing this textual metadata in a SQL database is likely the most straightforward way to serialize it in a query-able form (which is basically what I was struggling with in the above post). And, like you mention:
if we have some way of mapping metadata to filenames, it doesn't matter what we name the files at that point, as long as the names are unique.

So, running with this idea for the moment: in your experience working with databases, have you worked with an ORM (Object Relational Mapper) like SQLAlchemy in Python before? I think that may be the way we want to go. SQLAlchemy basically allows you to interface with a database purely within Python, so even within the codebase we have the option of not having to write any SQL (at least directly). If you don't use an ORM you'd have to pass all the SQL commands as strings to the database, which is a bit messy (especially if you are making queries with many (4+) conditions, which I suspect we may).

To move forward we would just need to decide on a data model. I don't think we'll have to worry about performance at all (since we likely won't have databases with more than a few thousand rows and we won't be making many (100+) queries at once), but from a querying perspective one data model might make things simpler than another. If we wanted each user to have a single database, some very basic options would be:
Option 1 (although it would be the simplest to implement) might be off the table if we want to store more metadata about Proj's, Models, and Runs other than just their names. Option 2 would handle that very well. I'd say I'm leaning towards option 2 at the moment, because it basically mimics the existing object hierarchy. Pending your thoughts, I might try and re-familiarize myself with SQL and SQLAlchemy this weekend by implementing a basic version of option 2 (but have it unlinked from
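A minimal SQLAlchemy sketch of the option-2 idea (one table per object type, linked by foreign keys) might look like this; all table, column, and object names here are hypothetical, not the eventual schema:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Run(Base):
    __tablename__ = "runs"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    calcs = relationship("Calc", back_populates="run")

class Calc(Base):
    __tablename__ = "calcs"
    id = Column(Integer, primary_key=True)
    var_name = Column(String)
    dtype_in = Column(String)
    path = Column(String)            # pointer to the data on disk
    run_id = Column(Integer, ForeignKey("runs.id"))
    run = relationship("Run", back_populates="calcs")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    run = Run(name="amip")
    run.calcs.append(
        Calc(var_name="precip", dtype_in="3hr", path="/archive/0001.nc")
    )
    session.add(run)
    session.commit()
    # Query purely in Python -- no hand-written SQL strings.
    found = session.query(Calc).filter_by(
        var_name="precip", dtype_in="3hr"
    ).one()
    print(found.path, found.run.name)
```

The relationships mean a Calc row "knows" its parent Run (and, in a fuller schema, its Model and Proj), which mirrors the nesting of the in-memory objects.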
Yes, I think we should be able to accomplish this by writing some methods that wrap database queries (and writes) such that no SQL or ORM knowledge would be required. How does all that sound to you?
This is great overall. I wasn't being modest before -- I have effectively zero meaningful experience working with DBs in Python -- so I will happily defer to you on those details. That being said, I have heard about SQLAlchemy in that Python podcast, and from that and your description it sounds like the right choice, especially given the pure Python interfacing you mentioned.
Yes I agree.
Yes, full speed ahead! I'll try to (for the first time) familiarize myself with SQLAlchemy as well at some point. And yes, best to get the basic mechanics down before trying to integrate into aospy.
No worries; while I've used SQLAlchemy before, I still have a lot to learn as well. It will be a challenge to integrate it into aospy.
Awesome! I put together a new repo with some work on doing this. I've tried to set things up so that it is fully self-contained, meaning that you can check out the repository and run the code locally by yourself.

I started by creating a "synthetic" version of the core aospy objects. When the main script is run, the Proj's, Models, and Runs are added to the database if needed. Then the "computations" are done; these Calc's are added to the database one by one. Each entry contains what its filename would have been if the computation were actually completed. Var entries are created as needed as Calcs are created.

Overall I'm encouraged. I think this should be doable, but there will be many kinks to work out along the way (and many decisions we'll have to make, particularly about the API). On that note, I've added an IPython notebook (which you can view on GitHub) to the repo, where we can share examples of how the API could work using the example objects and database.

Let me know your thoughts and if you think I should modify the setup in the experimental repo; I'm still a bit of a novice when it comes to setting up packages. Please let me know if you have any issues getting it up and running. As a side note, I think this would actually be a useful mode to run aospy in.
Wow, thanks for putting this together so quickly. A great first step. I forked it, cloned it, and ran it.

Thread on aospy-db basics: spencerkclark/aospy-db#4
#263 accomplishes the recursive functionality but does not implement saving/stashing data at intermediate steps for use by other Calcs and/or serialization. Those would be cool someday, but for now I think it's fine to leave this closed.
Currently, if Var is passed a value for `func` when instantiated, it then populates the `vars` attribute with the Var objects that are passed into the given function. But this is only supported for one level: the Var objects passed to the function must be stand-alone, i.e. they can't require a function of their own.

The result is the unnecessary proliferation of functions. For example, my flux divergence calculations currently have MSE, DSE, c_pT, gz, and general versions. Each one takes as arguments all the arguments required for the pre-calculation (e.g. temp, hght, and sphum for MSE), computes it by a call to that function, then calls the general function with that output as the argument.
Possible solution: when a Calc object encounters a Var with nested functions such as this, it should, from the bottom up, create Calc objects that return the full timeseries of the specified function. At the next level up, i.e. for the Calc taking in this computed data as one of its Var objects, the code that loads data from netCDF will have to be bypassed, since the timeseries you want is already in hand.
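That bottom-up recursion could be sketched as follows (the `Var` class and `compute` function here are illustrative stand-ins, not the actual aospy implementation, and the constants in the MSE lambda are just round numbers):

```python
# Illustrative stand-ins for aospy's Var/Calc machinery (not the real API).
class Var:
    def __init__(self, name, func=None, vars=None):
        self.name = name
        self.func = func        # function of other Vars, if any
        self.vars = vars or []  # the Vars that func takes as inputs

def compute(var, load_from_disk):
    """Bottom-up recursion: leaf Vars are loaded from netCDF; derived
    Vars compute each input Var first, then apply their function to the
    in-memory results, bypassing the netCDF-loading code path."""
    if var.func is None:
        return load_from_disk(var.name)
    return var.func(*(compute(v, load_from_disk) for v in var.vars))

# MSE built from three stand-alone Vars; flux divergence could then take
# the mse Var directly, with no MSE-specific wrapper function needed.
temp, hght, sphum = Var("temp"), Var("hght"), Var("sphum")
mse = Var("mse",
          func=lambda t, z, q: 1004.6 * t + 9.81 * z + 2.5e6 * q,
          vars=[temp, hght, sphum])

fake_disk = {"temp": 280.0, "hght": 1000.0, "sphum": 0.01}
print(compute(mse, fake_disk.__getitem__))  # ~316098.0
```

With this shape, arbitrarily deep nesting works for free, and a real Calc would only need to swap `load_from_disk` for its netCDF-loading path at the leaves.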