Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share anonymised sales data on DFC API with authorised users #12831

Merged
merged 5 commits into from
Sep 3, 2024

Conversation

mkllnk
Copy link
Member

@mkllnk mkllnk commented Aug 30, 2024

ℹ️ Funded feature code is #12543 [DFC] Anonymized orders endpoint

What? Why?

The French teams wants to shared price trend data with a research team at INRAE. This PR is adding a new DFC API endpoint publishing the required data in DFC format. This standard is used by other study participants and should make the processing easier.

I followed the example spec with a nested document structure. The DFC document is nested for objects without semantic ids which I all set to nil. Assigning ids would create a document with a flat graph structure. But I'm assuming that the data will be converted into a spreadsheet. Hence I query the data as table rows and process them entry by entry. I did notice though that another participant structured their document differently. Only the researchers can tell us if this data format is useful for them.

Many thanks to @anansilva for starting this piece of work. I started a new branch with a rewritten history to replace the original PR but many of the concepts come from Ana.

What should we test?

  • Activate feature affiliate_sales_data for your user.
  • Activate connected app AffiliateSalesData for a distributor with orders.
  • Visit /api/dfc/affiliate_sales_data and observe the data.
  • You should find the required data:
    • enterprise (producers) postal code
    • enterprise (distributor) postal code
    • variant unit name
    • unit type
    • units
    • price
    • quantity sold
  • As another user you should not see any data (except a person object)
  • Data of other enterprises should not be published.
  • Paging has not been implemented.

Release notes

Changelog Category (reviewers may add a label for the release notes):

  • User facing changes
  • API changes (V0, V1, DFC or Webhook)
  • Technical changes only
  • Feature toggled

The title of the pull request will be included in the release notes.

Dependencies

Documentation updates

You also find an example at /api-docs/.

There will be lots and lots. The sales data root object is also the
authenticated person. The data has its own URL (semantic id) which
doens't need to contain the user id.

The service object can also be tested more easily. I'm setting up the
test data here.
@mkllnk mkllnk added the api changes These pull requests change the API and can break integrations label Aug 30, 2024
@mkllnk mkllnk self-assigned this Aug 30, 2024
@mkllnk mkllnk marked this pull request as ready for review August 30, 2024 05:33
@RachL RachL added the pr-staged-fr staging.coopcircuits.fr label Aug 30, 2024
@RachL
Copy link
Contributor

RachL commented Aug 30, 2024

@mkllnk it looks good to me! As I'm still learning a lot, can I just checking that this is a correct way of reading the data:

Name Value What it represent
dfc-b:hasAddress
@type "dfc-b:Address"
dfc-b:hasStreet ""
dfc-b:hasPostalCode "26110" Producer postal code
dfc-b:hasCity ""
dfc-b:hasCountry ""
dfc-b:latitude 0
dfc-b:longitude 0
dfc-b:region ""
dfc-b:logo ""
dfc-b:name ""
dfc-b:hasDescription ""
dfc-b:VATnumber ""
dfc-b:supplies
@type "dfc-b:SuppliedProduct"
dfc-b:name "Abricots" variant unit name
dfc-b:description ""
dfc-b:hasQuantity
@type "dfc-b:QuantitativeValue"
dfc-b:hasUnit "dfc-m:Gram" Unit type
dfc-b:value 1000 Units (1 kg of Abricots here)
dfc-b:alcoholPercentage 0
dfc-b:lifetime ""
dfc-b:usageOrStorageCondition ""
dfc-b:totalTheoreticalStock 0
dfc-b:concernedBy
@type "dfc-b:OrderLine"
dfc-b:description ""
dfc-b:quantity
@type "dfc-b:QuantitativeValue"
dfc-b:hasUnit "dfc-m:Piece"
dfc-b:value 118 Quantity sold?
dfc-b:hasPrice
@type "dfc-b:QuantitativeValue"
dfc-b:value 3.89 Price
dfc-b:partOf
@type "dfc-b:Order"**
dfc-b:orderNumber ""
dfc-b:date ""
dfc-b:belongsTo
@type "dfc-b:SaleSession"
dfc-b:beginDate ""
dfc-b:endDate ""
dfc-b:quantity 0
dfc-b:objectOf
@type "dfc-b:Coordination"
dfc-b:coordinatedBy
@type " dfc-b:Enterprise"
dfc-b:hasAddress
@type "dfc-b:Address"
dfc-b:hasStreet ""
dfc-b:hasPostalCode "38100" Distributor postal code
dfc-b:hasCity ""
dfc-b:hasCountry ""
dfc-b:latitude 0
dfc-b:longitude 0
dfc-b:region ""
dfc-b:logo ""
dfc-b:name ""
dfc-b:hasDescription ""
dfc-b:VATnumber

What's the period of time the user can see? The whole database?

Thiking of it I would prefer we merge this and I ask INRAE to test in production, I feel they can easily mix up staging vs production. We have a MoU in place with them. Would that be ok for you? given it's feature toggled we can remove their access easily in case of a problem.

But I'm assuming that the data will be converted into a spreadsheet

I think so too, but they will first pass it through a python script to clean the type of products. Eg.: we send them "chocolat au lait" which is chocolat with milk, but they are only looking for milk (lait).

@RachL RachL removed the pr-staged-fr staging.coopcircuits.fr label Aug 30, 2024
Copy link
Collaborator

@rioug rioug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice git history, makes it easy to follow.

Copy link
Member

@dacook dacook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇 Fantastic, I love how you thought outside the box to build this well!

@@ -0,0 +1,98 @@
# frozen_string_literal: true

# Represents a single row of the aggregated sales data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So glad you did this part, it probably would have taken me ages to figure out all the required relations.

#
# { product_name: "Apple", ... }
def label_row(row)
labels.zip(row).to_h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about indiscriminately zipping two arrays together, so I was wondering if there's a way to map by the column names.
There's discussion here about avoiding instantiating AR records, which you've rightly done:
https://stackoverflow.com/questions/20794398/how-can-i-have-activerecords-pluck-also-return-the-column-name-for-rendering-in

I like that your solution creates a short-lived hash to label the columns on-demand.

One person suggests using ActiveRecord::Base.connection.select_all which provides a ActiveRecord::Result. I haven't looked at the implementation of that class, but I suspect it does the same thing, so is probably equally efficient.

But it's probably not worth exploring further, moving on...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't know select_all. I'll keep that in mind for another time. If it could do it in batches then I would probably prefer that. But my version occupies the least memory for grabbing all data at once, I think, because it's only keeping the values in an array and the keys are only added while processing one row, as you acknowledged above.

Comment on lines +22 to +41
# Without any filters:
expect(count_rows.call).to eq 1

# From today:
expect(count_rows.call(start_date: today)).to eq 1

# Until today:
expect(count_rows.call(end_date: today)).to eq 1

# Just today:
expect(count_rows.call(start_date: today, end_date: today)).to eq 1

# Yesterday:
expect(count_rows.call(start_date: yesterday, end_date: yesterday)).to eq 0

# Until yesterday:
expect(count_rows.call(end_date: yesterday)).to eq 0

# From tomorrow:
expect(count_rows.call(start_date: tomorrow)).to eq 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great coverage
(and it's so nice to see we get it all for free with ActiveRecord and Ranges!)

@dacook
Copy link
Member

dacook commented Sep 2, 2024

I believe this is ready to merge.
I'm unclear if this has been tested at scale in staging. It might be possible to crash a puma worker in prod, but I think our setup is robust enough to recover from that anyway.

@RachL
Copy link
Contributor

RachL commented Sep 2, 2024

@dacook what do you mean "at scale"? I didn't activated more than 3 enterprises during my test, indeed.

I guess it is also linked to my question above: does it give the whole database?

Socleo are specifying the date range in the URL they are querying if I'm not mistaken. If we can do that and what to ease the load, we could only show data since January 1st 2021.

@dacook
Copy link
Member

dacook commented Sep 2, 2024

@RachL yes, I believe it would give any data from the database, for enterprises that have the connected app.
As the intention is to connect all enterprises first, this would be a very large result set.

But as you say, the researchers will be extracting data for certain periods of time, which will reduce the result set. If they have trouble loading, they can make smaller queries until it works (eg 1 month, or 1 week, or 1 day...)

Note: we haven't enabled connected apps, and executed the rake task on fr_prod to enable the connected apps for all enterprises yet (at least I don't remember). I think we should do that soon so that we can test with the real data. Let me know if you're happy for me to run the rake task now.

@mkllnk
Copy link
Member Author

mkllnk commented Sep 3, 2024

By default it returns all data. But it also supports the startDate and endDate params like in the spec and the Socleo example. If it crashes the server, we could add a default date range in case you forget to declare one.

@mkllnk mkllnk merged commit 3f1d99d into openfoodfoundation:master Sep 3, 2024
54 of 56 checks passed
@mkllnk mkllnk deleted the anonymous-orders branch September 3, 2024 00:59
@RachL
Copy link
Contributor

RachL commented Sep 3, 2024

@mkllnk great! Can I confirm how the URL looks like when adding dates? I think I'm missing something:

/api/dfc/affiliate_sales_data/startDate=01/01/2024&endDate=31/12/2024

@RachL
Copy link
Contributor

RachL commented Sep 3, 2024

Note: we haven't enabled connected apps, and executed the rake task on fr_prod to enable the connected apps for all enterprises yet (at least I don't remember). I think we should do that soon so that we can test with the real data. Let me know if you're happy for me to run the rake task now.

@dacook we can run it before landing this PR in prod? If so yes please do, many thanks!

@dacook dacook added the pr-staged-fr staging.coopcircuits.fr label Sep 3, 2024
@dacook
Copy link
Member

dacook commented Sep 3, 2024

The URL needs a ? before the parameters, so it would be like this:
https://staging.coopcircuits.fr/api/dfc/affiliate_sales_data?endDate=2024-12-31&startDate=2024-01-01
(I've also used the ISO date format which is unambiguous)

I just tried to deploy to fr_staging to check this, but I guess it doesn't work after branches are merged.

@dacook dacook removed the pr-staged-fr staging.coopcircuits.fr label Sep 3, 2024
@dacook
Copy link
Member

dacook commented Sep 3, 2024

@RachL I've activated the connected app for all current enterprises on fr_staging, and fr_prod (5662 enterprises). Note that it won't be visible to users on prod until the checkbox is enabled: https://coopcircuits.fr/admin/connected_app_settings/edit
Any new enterprises will need to be manually opted in.

openfoodnetwork@vps-d2442729:~/apps/openfoodnetwork/current$ RAILS_ENV=production bundle exec rake ofn:enterprises:activate_connected_app_type["AffiliateSalesData"] > log/activate_connected_app_type-"AffiliateSalesData".log
openfoodnetwork@vps-d2442729:~/apps/openfoodnetwork/current$ wc -l log/activate_connected_app_type-AffiliateSalesData.log
5662 log/activate_connected_app_type-AffiliateSalesData.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api changes These pull requests change the API and can break integrations
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

As a dedicated user, I can access a DFC API endpoint with fewer data and all enterprises
4 participants