-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Share anonymised sales data on DFC API with authorised users #12831
Conversation
There will be lots and lots. The sales data root object is also the authenticated person. The data has its own URL (semantic id) which doens't need to contain the user id. The service object can also be tested more easily. I'm setting up the test data here.
@mkllnk it looks good to me! As I'm still learning a lot, can I just checking that this is a correct way of reading the data:
What's the period of time the user can see? The whole database? Thiking of it I would prefer we merge this and I ask INRAE to test in production, I feel they can easily mix up staging vs production. We have a MoU in place with them. Would that be ok for you? given it's feature toggled we can remove their access easily in case of a problem.
I think so too, but they will first pass it through a python script to clean the type of products. Eg.: we send them "chocolat au lait" which is chocolat with milk, but they are only looking for milk (lait). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice git history, makes it easy to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥇 Fantastic, I love how you thought outside the box to build this well!
@@ -0,0 +1,98 @@ | |||
# frozen_string_literal: true | |||
|
|||
# Represents a single row of the aggregated sales data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So glad you did this part, it probably would have taken me ages to figure out all the required relations.
# | ||
# { product_name: "Apple", ... } | ||
def label_row(row) | ||
labels.zip(row).to_h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried about indiscriminately zipping two arrays together, so I was wondering if there's a way to map by the column names.
There's discussion here about avoiding instantiating AR records, which you've rightly done:
https://stackoverflow.com/questions/20794398/how-can-i-have-activerecords-pluck-also-return-the-column-name-for-rendering-in
I like that your solution creates a short-lived hash to label the columns on-demand.
One person suggests using ActiveRecord::Base.connection.select_all
which provides a ActiveRecord::Result
. I haven't looked at the implementation of that class, but I suspect it does the same thing, so is probably equally efficient.
But it's probably not worth exploring further, moving on...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't know select_all
. I'll keep that in mind for another time. If it could do it in batches then I would probably prefer that. But my version occupies the least memory for grabbing all data at once, I think, because it's only keeping the values in an array and the keys are only added while processing one row, as you acknowledged above.
# Without any filters: | ||
expect(count_rows.call).to eq 1 | ||
|
||
# From today: | ||
expect(count_rows.call(start_date: today)).to eq 1 | ||
|
||
# Until today: | ||
expect(count_rows.call(end_date: today)).to eq 1 | ||
|
||
# Just today: | ||
expect(count_rows.call(start_date: today, end_date: today)).to eq 1 | ||
|
||
# Yesterday: | ||
expect(count_rows.call(start_date: yesterday, end_date: yesterday)).to eq 0 | ||
|
||
# Until yesterday: | ||
expect(count_rows.call(end_date: yesterday)).to eq 0 | ||
|
||
# From tomorrow: | ||
expect(count_rows.call(start_date: tomorrow)).to eq 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great coverage
(and it's so nice to see we get it all for free with ActiveRecord and Ranges!)
I believe this is ready to merge. |
@dacook what do you mean "at scale"? I didn't activated more than 3 enterprises during my test, indeed. I guess it is also linked to my question above: does it give the whole database? Socleo are specifying the date range in the URL they are querying if I'm not mistaken. If we can do that and what to ease the load, we could only show data since January 1st 2021. |
@RachL yes, I believe it would give any data from the database, for enterprises that have the connected app. But as you say, the researchers will be extracting data for certain periods of time, which will reduce the result set. If they have trouble loading, they can make smaller queries until it works (eg 1 month, or 1 week, or 1 day...) Note: we haven't enabled connected apps, and executed the rake task on fr_prod to enable the connected apps for all enterprises yet (at least I don't remember). I think we should do that soon so that we can test with the real data. Let me know if you're happy for me to run the rake task now. |
By default it returns all data. But it also supports the startDate and endDate params like in the spec and the Socleo example. If it crashes the server, we could add a default date range in case you forget to declare one. |
@mkllnk great! Can I confirm how the URL looks like when adding dates? I think I'm missing something: /api/dfc/affiliate_sales_data/startDate=01/01/2024&endDate=31/12/2024 |
@dacook we can run it before landing this PR in prod? If so yes please do, many thanks! |
The URL needs a I just tried to deploy to fr_staging to check this, but I guess it doesn't work after branches are merged. |
@RachL I've activated the connected app for all current enterprises on fr_staging, and fr_prod (5662 enterprises). Note that it won't be visible to users on prod until the checkbox is enabled: https://coopcircuits.fr/admin/connected_app_settings/edit
|
ℹ️ Funded feature code is #12543 [DFC] Anonymized orders endpoint
What? Why?
The French teams wants to shared price trend data with a research team at INRAE. This PR is adding a new DFC API endpoint publishing the required data in DFC format. This standard is used by other study participants and should make the processing easier.
I followed the example spec with a nested document structure. The DFC document is nested for objects without semantic ids which I all set to nil. Assigning ids would create a document with a flat graph structure. But I'm assuming that the data will be converted into a spreadsheet. Hence I query the data as table rows and process them entry by entry. I did notice though that another participant structured their document differently. Only the researchers can tell us if this data format is useful for them.
Many thanks to @anansilva for starting this piece of work. I started a new branch with a rewritten history to replace the original PR but many of the concepts come from Ana.
What should we test?
affiliate_sales_data
for your user./api/dfc/affiliate_sales_data
and observe the data.Release notes
Changelog Category (reviewers may add a label for the release notes):
The title of the pull request will be included in the release notes.
Dependencies
Documentation updates
You also find an example at
/api-docs/
.