Share anonymised sales data on DFC API with authorised users #12831

mkllnk · 2024-08-30T05:19:37Z

ℹ️ Funded feature code is #12543 [DFC] Anonymized orders endpoint

What? Why?

The French teams wants to shared price trend data with a research team at INRAE. This PR is adding a new DFC API endpoint publishing the required data in DFC format. This standard is used by other study participants and should make the processing easier.

I followed the example spec with a nested document structure. The DFC document is nested for objects without semantic ids which I all set to nil. Assigning ids would create a document with a flat graph structure. But I'm assuming that the data will be converted into a spreadsheet. Hence I query the data as table rows and process them entry by entry. I did notice though that another participant structured their document differently. Only the researchers can tell us if this data format is useful for them.

Many thanks to @anansilva for starting this piece of work. I started a new branch with a rewritten history to replace the original PR but many of the concepts come from Ana.

What should we test?

Activate feature affiliate_sales_data for your user.
Activate connected app AffiliateSalesData for a distributor with orders.
Visit /api/dfc/affiliate_sales_data and observe the data.
You should find the required data:
- enterprise (producers) postal code
- enterprise (distributor) postal code
- variant unit name
- unit type
- units
- price
- quantity sold
As another user you should not see any data (except a person object)
Data of other enterprises should not be published.
Paging has not been implemented.

Release notes

Changelog Category (reviewers may add a label for the release notes):

User facing changes
API changes (V0, V1, DFC or Webhook)
Technical changes only
Feature toggled

The title of the pull request will be included in the release notes.

Dependencies

Documentation updates

You also find an example at /api-docs/.

There will be lots and lots. The sales data root object is also the authenticated person. The data has its own URL (semantic id) which doens't need to contain the user id. The service object can also be tested more easily. I'm setting up the test data here.

RachL · 2024-08-30T14:01:32Z

@mkllnk it looks good to me! As I'm still learning a lot, can I just checking that this is a correct way of reading the data:

Name	Value	What it represent
dfc-b:hasAddress
@type	"dfc-b:Address"
dfc-b:hasStreet	""
dfc-b:hasPostalCode	"26110"	Producer postal code
dfc-b:hasCity	""
dfc-b:hasCountry	""
dfc-b:latitude	0
dfc-b:longitude	0
dfc-b:region	""
dfc-b:logo	""
dfc-b:name	""
dfc-b:hasDescription	""
dfc-b:VATnumber	""
dfc-b:supplies
@type	"dfc-b:SuppliedProduct"
dfc-b:name	"Abricots"	variant unit name
dfc-b:description	""
dfc-b:hasQuantity
@type	"dfc-b:QuantitativeValue"
dfc-b:hasUnit	"dfc-m:Gram"	Unit type
dfc-b:value	1000	Units (1 kg of Abricots here)
dfc-b:alcoholPercentage	0
dfc-b:lifetime	""
dfc-b:usageOrStorageCondition	""
dfc-b:totalTheoreticalStock	0
dfc-b:concernedBy
@type	"dfc-b:OrderLine"
dfc-b:description	""
dfc-b:quantity
@type	"dfc-b:QuantitativeValue"
dfc-b:hasUnit	"dfc-m:Piece"
dfc-b:value	118	Quantity sold?
dfc-b:hasPrice
@type	"dfc-b:QuantitativeValue"
dfc-b:value	3.89	Price
dfc-b:partOf
@type	"dfc-b:Order"**
dfc-b:orderNumber	""
dfc-b:date	""
dfc-b:belongsTo
@type	"dfc-b:SaleSession"
dfc-b:beginDate	""
dfc-b:endDate	""
dfc-b:quantity 0
dfc-b:objectOf
@type "dfc-b:Coordination"
dfc-b:coordinatedBy
@type "	dfc-b:Enterprise"
dfc-b:hasAddress
@type	"dfc-b:Address"
dfc-b:hasStreet	""
dfc-b:hasPostalCode	"38100"	Distributor postal code
dfc-b:hasCity	""
dfc-b:hasCountry	""
dfc-b:latitude	0
dfc-b:longitude	0
dfc-b:region	""
dfc-b:logo	""
dfc-b:name	""
dfc-b:hasDescription	""
dfc-b:VATnumber

What's the period of time the user can see? The whole database?

Thiking of it I would prefer we merge this and I ask INRAE to test in production, I feel they can easily mix up staging vs production. We have a MoU in place with them. Would that be ok for you? given it's feature toggled we can remove their access easily in case of a problem.

But I'm assuming that the data will be converted into a spreadsheet

I think so too, but they will first pass it through a python script to clean the type of products. Eg.: we send them "chocolat au lait" which is chocolat with milk, but they are only looking for milk (lait).

rioug

Nice git history, makes it easy to follow.

dacook

🥇 Fantastic, I love how you thought outside the box to build this well!

dacook · 2024-09-02T01:34:51Z

engines/dfc_provider/app/services/affiliate_sales_data_row_builder.rb

@@ -0,0 +1,98 @@
+# frozen_string_literal: true
+
+# Represents a single row of the aggregated sales data.


So glad you did this part, it probably would have taken me ages to figure out all the required relations.

dacook · 2024-09-02T01:51:37Z

engines/dfc_provider/app/services/affiliate_sales_query.rb

+    #
+    #   { product_name: "Apple", ... }
+    def label_row(row)
+      labels.zip(row).to_h


I was worried about indiscriminately zipping two arrays together, so I was wondering if there's a way to map by the column names.
There's discussion here about avoiding instantiating AR records, which you've rightly done:
https://stackoverflow.com/questions/20794398/how-can-i-have-activerecords-pluck-also-return-the-column-name-for-rendering-in

I like that your solution creates a short-lived hash to label the columns on-demand.

One person suggests using ActiveRecord::Base.connection.select_all which provides a ActiveRecord::Result. I haven't looked at the implementation of that class, but I suspect it does the same thing, so is probably equally efficient.

But it's probably not worth exploring further, moving on...

Ah, I didn't know select_all. I'll keep that in mind for another time. If it could do it in batches then I would probably prefer that. But my version occupies the least memory for grabbing all data at once, I think, because it's only keeping the values in an array and the keys are only added while processing one row, as you acknowledged above.

dacook · 2024-09-02T01:59:48Z

engines/dfc_provider/spec/services/affiliate_sales_query_spec.rb

+      # Without any filters:
+      expect(count_rows.call).to eq 1
+
+      # From today:
+      expect(count_rows.call(start_date: today)).to eq 1
+
+      # Until today:
+      expect(count_rows.call(end_date: today)).to eq 1
+
+      # Just today:
+      expect(count_rows.call(start_date: today, end_date: today)).to eq 1
+
+      # Yesterday:
+      expect(count_rows.call(start_date: yesterday, end_date: yesterday)).to eq 0
+
+      # Until yesterday:
+      expect(count_rows.call(end_date: yesterday)).to eq 0
+
+      # From tomorrow:
+      expect(count_rows.call(start_date: tomorrow)).to eq 0


Great coverage
(and it's so nice to see we get it all for free with ActiveRecord and Ranges!)

dacook · 2024-09-02T02:07:48Z

I believe this is ready to merge.
I'm unclear if this has been tested at scale in staging. It might be possible to crash a puma worker in prod, but I think our setup is robust enough to recover from that anyway.

RachL · 2024-09-02T08:13:13Z

@dacook what do you mean "at scale"? I didn't activated more than 3 enterprises during my test, indeed.

I guess it is also linked to my question above: does it give the whole database?

Socleo are specifying the date range in the URL they are querying if I'm not mistaken. If we can do that and what to ease the load, we could only show data since January 1st 2021.

dacook · 2024-09-02T23:56:57Z

@RachL yes, I believe it would give any data from the database, for enterprises that have the connected app.
As the intention is to connect all enterprises first, this would be a very large result set.

But as you say, the researchers will be extracting data for certain periods of time, which will reduce the result set. If they have trouble loading, they can make smaller queries until it works (eg 1 month, or 1 week, or 1 day...)

Note: we haven't enabled connected apps, and executed the rake task on fr_prod to enable the connected apps for all enterprises yet (at least I don't remember). I think we should do that soon so that we can test with the real data. Let me know if you're happy for me to run the rake task now.

mkllnk · 2024-09-03T00:59:11Z

By default it returns all data. But it also supports the startDate and endDate params like in the spec and the Socleo example. If it crashes the server, we could add a default date range in case you forget to declare one.

RachL · 2024-09-03T07:26:17Z

@mkllnk great! Can I confirm how the URL looks like when adding dates? I think I'm missing something:

/api/dfc/affiliate_sales_data/startDate=01/01/2024&endDate=31/12/2024

RachL · 2024-09-03T07:44:29Z

Note: we haven't enabled connected apps, and executed the rake task on fr_prod to enable the connected apps for all enterprises yet (at least I don't remember). I think we should do that soon so that we can test with the real data. Let me know if you're happy for me to run the rake task now.

@dacook we can run it before landing this PR in prod? If so yes please do, many thanks!

dacook · 2024-09-03T23:36:02Z

The URL needs a ? before the parameters, so it would be like this:
https://staging.coopcircuits.fr/api/dfc/affiliate_sales_data?endDate=2024-12-31&startDate=2024-01-01
(I've also used the ISO date format which is unambiguous)

I just tried to deploy to fr_staging to check this, but I guess it doesn't work after branches are merged.

dacook · 2024-09-03T23:50:11Z

@RachL I've activated the connected app for all current enterprises on fr_staging, and fr_prod (5662 enterprises). Note that it won't be visible to users on prod until the checkbox is enabled: https://coopcircuits.fr/admin/connected_app_settings/edit
Any new enterprises will need to be manually opted in.

openfoodnetwork@vps-d2442729:~/apps/openfoodnetwork/current$ RAILS_ENV=production bundle exec rake ofn:enterprises:activate_connected_app_type["AffiliateSalesData"] > log/activate_connected_app_type-"AffiliateSalesData".log
openfoodnetwork@vps-d2442729:~/apps/openfoodnetwork/current$ wc -l log/activate_connected_app_type-AffiliateSalesData.log
5662 log/activate_connected_app_type-AffiliateSalesData.log

mkllnk added 5 commits August 30, 2024 14:30

Add DFC API endpoint for sales data

4342d3b

Build DFC data for sales

bd16116

Publish data only of participating distributors

1016656

Filter sales data by dates

d52134d

mkllnk added the api changes These pull requests change the API and can break integrations label Aug 30, 2024

mkllnk self-assigned this Aug 30, 2024

mkllnk mentioned this pull request Aug 30, 2024

As a dedicated user, I can access a DFC API endpoint with fewer data and all enterprises #12544

Closed

mkllnk marked this pull request as ready for review August 30, 2024 05:33

RachL added the pr-staged-fr staging.coopcircuits.fr label Aug 30, 2024

RachL removed the pr-staged-fr staging.coopcircuits.fr label Aug 30, 2024

rioug approved these changes Sep 2, 2024

View reviewed changes

dacook approved these changes Sep 2, 2024

View reviewed changes

mkllnk merged commit 3f1d99d into openfoodfoundation:master Sep 3, 2024
54 of 56 checks passed

mkllnk deleted the anonymous-orders branch September 3, 2024 00:59

dacook added the pr-staged-fr staging.coopcircuits.fr label Sep 3, 2024

dacook removed the pr-staged-fr staging.coopcircuits.fr label Sep 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Share anonymised sales data on DFC API with authorised users #12831

Share anonymised sales data on DFC API with authorised users #12831

mkllnk commented Aug 30, 2024

RachL commented Aug 30, 2024

rioug left a comment

dacook left a comment

dacook Sep 2, 2024

dacook Sep 2, 2024

mkllnk Sep 3, 2024

dacook Sep 2, 2024

dacook commented Sep 2, 2024

RachL commented Sep 2, 2024

dacook commented Sep 2, 2024

mkllnk commented Sep 3, 2024

RachL commented Sep 3, 2024

RachL commented Sep 3, 2024

dacook commented Sep 3, 2024

dacook commented Sep 3, 2024

		@@ -0,0 +1,98 @@
		# frozen_string_literal: true

		# Represents a single row of the aggregated sales data.

Share anonymised sales data on DFC API with authorised users #12831

Share anonymised sales data on DFC API with authorised users #12831

Conversation

mkllnk commented Aug 30, 2024

What? Why?

What should we test?

Release notes

Dependencies

Documentation updates

RachL commented Aug 30, 2024

rioug left a comment

Choose a reason for hiding this comment

dacook left a comment

Choose a reason for hiding this comment

dacook Sep 2, 2024

Choose a reason for hiding this comment

dacook Sep 2, 2024

Choose a reason for hiding this comment

mkllnk Sep 3, 2024

Choose a reason for hiding this comment

dacook Sep 2, 2024

Choose a reason for hiding this comment

dacook commented Sep 2, 2024

RachL commented Sep 2, 2024

dacook commented Sep 2, 2024

mkllnk commented Sep 3, 2024

RachL commented Sep 3, 2024

RachL commented Sep 3, 2024

dacook commented Sep 3, 2024

dacook commented Sep 3, 2024