-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DS-3326] Add state to subplat subscriptions v2 #5836
base: main
Are you sure you want to change the base?
Conversation
9ee000e
to
95906b9
Compare
This comment has been minimized.
This comment has been minimized.
...-fx-data-shared-prod/subscription_platform_derived/stripe_subscriptions_history_v1/query.sql
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main thing we want to do for DS-3326 is include the US/Canadian states for subscriptions in the newer SubPlat ETLs:
...-fx-data-shared-prod/subscription_platform_derived/stripe_subscriptions_history_v1/query.sql
Outdated
Show resolved
Hide resolved
...-fx-data-shared-prod/subscription_platform_derived/stripe_subscriptions_history_v1/query.sql
Outdated
Show resolved
Hide resolved
...-fx-data-shared-prod/subscription_platform_derived/stripe_subscriptions_history_v1/query.sql
Outdated
Show resolved
Hide resolved
95906b9
to
0ddee82
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
c50a2f4
to
676e10f
Compare
This comment has been minimized.
This comment has been minimized.
676e10f
to
11a5e36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sean-rose I think I've covered the necessary tables so ready for review. I left some questions as well
@@ -51,6 +51,9 @@ fields: | |||
- name: country_name | |||
type: STRING | |||
mode: NULLABLE | |||
- name: country_state_code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat awkwardly named because I don't want it to be ambiguous and interpreted as the state of a subscription
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's less than ideal either way, but personally I'd lean toward naming these state_code
and we can clarify the meaning in the column description (FYI "status" is the term being used elsewhere in the ETLs for the state/status of a subscription).
Also, even when the other columns don't have descriptions please go ahead and add descriptions for these columns being added, as we should consistently clarify the meaning of the column and note that it's only populated for the US and Canada.
@@ -8,7 +8,7 @@ description: |- | |||
|
|||
This table's schema closely mirrors Stripe's customers API (https://stripe.com/docs/api/customers/object). | |||
Fields which Fivetran doesn't sync and fields we've specifically chosen not to include (e.g. PII) have been omitted. | |||
In particular, all address fields except country have been omitted because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. | |||
In particular, postal codes have been omitted from address because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line still relevant? From what I can tell, the source tables are moz-confidential
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line still relevant?
Yes, it's still relevant, and I'd recommend saying "all address fields except country and state have been omitted", because it's referring to not just what's omitted in the ETL but also what's omitted from the Fivetran sync (e.g. we've configured Fivetran to not sync street addresses).
In particular, postal codes have been omitted from address because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. | |
In particular, all address fields except country and state have been omitted because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. |
(the same applies to similar changes made in other metadata.yaml
files)
From what I can tell, the source tables are moz-confidential
The ultimate source tables for these ETLs are in moz-fx-data-shared-prod.stripe_external
and moz-fx-data-bq-fivetran.stripe
, which aren't Moz-confidential.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok I saw that the redash service account had access to stripe_external
so I assumed it was accessible from stmo but that role doesn't include data viewer permissions. I also didn't see that data platform group has special access
-- charges usually have postal code and are sometimes associated with | ||
-- a card that does not have a state or postal code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough about relationship between cards and charges to know if this is safe to do but I thought I'd add this here so it's easier to dicuss. From the cards that have postal codes, they almost always match the the one in the charge.
e.g. this returns two distinct cards
SELECT
charge.id,
card.id,
charge.billing_detail_address_postal_code,
card.address_zip,
FROM
`moz-fx-data-shared-prod.stripe_external.charge_v1` AS charge
LEFT JOIN
`moz-fx-data-shared-prod.stripe_external.card_v1` AS card
ON
(card.id = charge.card_id)
WHERE
charge.billing_detail_address_postal_code != card.address_zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I'd prioritize the charge's address over the card's address in COALESCE(...) AS latest_card_state
, since the charge presumably has the more up-to-date postal code.
And since there have been no cases thus far where a card has a postal code and an associated charge does not, you could opt to ignore the card postal codes and just use the charge postal codes (which is what I did in stripe_subscriptions_history_v1
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I didn't notice that the other table was already using charges. I'll remove the card one
history.subscription.status AS provider_status, | ||
history.subscription_first_active_at AS started_at, | ||
history.subscription.ended_at, | ||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly whitespace so using the hide whitespace option will make this easier to read
...-shared-prod/subscription_platform_derived/stripe_logical_subscriptions_history_v1/query.sql
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
@@ -51,6 +51,9 @@ fields: | |||
- name: country_name | |||
type: STRING | |||
mode: NULLABLE | |||
- name: country_state_code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's less than ideal either way, but personally I'd lean toward naming these state_code
and we can clarify the meaning in the column description (FYI "status" is the term being used elsewhere in the ETLs for the state/status of a subscription).
Also, even when the other columns don't have descriptions please go ahead and add descriptions for these columns being added, as we should consistently clarify the meaning of the column and note that it's only populated for the US and Canada.
@@ -8,7 +8,7 @@ description: |- | |||
|
|||
This table's schema closely mirrors Stripe's customers API (https://stripe.com/docs/api/customers/object). | |||
Fields which Fivetran doesn't sync and fields we've specifically chosen not to include (e.g. PII) have been omitted. | |||
In particular, all address fields except country have been omitted because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. | |||
In particular, postal codes have been omitted from address because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line still relevant?
Yes, it's still relevant, and I'd recommend saying "all address fields except country and state have been omitted", because it's referring to not just what's omitted in the ETL but also what's omitted from the Fivetran sync (e.g. we've configured Fivetran to not sync street addresses).
In particular, postal codes have been omitted from address because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. | |
In particular, all address fields except country and state have been omitted because Firefox Account user IDs are present and this is in a Mozilla-confidential dataset. |
(the same applies to similar changes made in other metadata.yaml
files)
From what I can tell, the source tables are moz-confidential
The ultimate source tables for these ETLs are in moz-fx-data-shared-prod.stripe_external
and moz-fx-data-bq-fivetran.stripe
, which aren't Moz-confidential.
- name: state | ||
type: STRING | ||
mode: NULLABLE | ||
description: Two-letter state/province code (only for US and CA). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This statement isn't entirely true as we aren't enforcing that they be two-letter codes, but I feel like we should, since the non-two-letter values generally look to be ones we don't want. I'll add suggestions in the relevant ETLs for how I think that should be done.
- I'd suggest spelling out "Canada" here and in other descriptions/comments to remove ambiguity (those of us in the US tend to most associate "CA" with California as that's its state code).
IF( | ||
customer.address.country IN ("US", "CA"), | ||
COALESCE( | ||
NULLIF(customer.address.state, ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLIF(customer.address.state, ""), | |
IF(LENGTH(customer.address.state) = 2, customer.address.state, NULL) |
A relevant comment when I added equivalent logic for US & Canadian state codes in the older VPN ETLs:
Most billing address state values for US and Canadian customers are two-letter codes, but there are some with state names instead, including some which are clearly not US/Canadian states. So although this excludes somewhat legit values like "Colorado" for a few customers, I felt it was worth it to exclude clearly wrong values like "Bayamón" and keep the data consistent for BI.
IF( | ||
customer.shipping.address.country IN ("US", "CA"), | ||
COALESCE( | ||
NULLIF(customer.shipping.address.state, ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLIF(customer.shipping.address.state, ""), | |
IF(LENGTH(customer.shipping.address.state) = 2, customer.shipping.address.state, NULL) |
While this isn't technically required at the moment since all existing customer shipping address state values for the US and Canada are two-letter codes like we want, I figure it's wise to guard against non-two-letter values possibly appearing in the future.
-- Prefer charges that succeeded and non-null country | ||
IF(charges.status = 'succeeded', 1, 2), | ||
cards.country DESC NULLS LAST, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The country-not-null prioritization should be done first, and we don't want to actually sort by the non-null country values.
-- Prefer charges that succeeded and non-null country | |
IF(charges.status = 'succeeded', 1, 2), | |
cards.country DESC NULLS LAST, | |
-- Prefer non-null country and charges that succeeded. | |
IF(cards.country IS NOT NULL, 1, 2), | |
IF(charges.status = 'succeeded', 1, 2), |
Technically this still wouldn't be exactly equivalent to cards.country IGNORE NULLS
, as it could select a row with a null cards.country
value. Luckily, due to how the state code joins have conditions on non-null cards.country
values the latest_card_state
value is guaranteed to also be null in that case, so it wouldn't actually produce incorrect results.
To be exactly equivalent to cards.country IGNORE NULLS
you'd have to do something like this:
ARRAY_AGG(
IF(
cards.country IS NOT NULL,
STRUCT(
cards.country AS latest_card_country,
COALESCE(
card_us_zip.state_code,
card_ca_post.province_code,
charge_us_zip.state_code,
charge_ca_post.province_code
) AS latest_card_state
),
NULL
) IGNORE NULLS
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the sql I was looking for. I didn't think to make the whole struct null
-- charges usually have postal code and are sometimes associated with | ||
-- a card that does not have a state or postal code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I'd prioritize the charge's address over the card's address in COALESCE(...) AS latest_card_state
, since the charge presumably has the more up-to-date postal code.
And since there have been no cases thus far where a card has a postal code and an associated charge does not, you could opt to ignore the card postal codes and just use the charge postal codes (which is what I did in stripe_subscriptions_history_v1
).
...-shared-prod/subscription_platform_derived/stripe_logical_subscriptions_history_v1/query.sql
Show resolved
Hide resolved
THEN STRUCT( | ||
NULLIF(history.subscription.customer.shipping.address.country, '') AS country_code, | ||
NULLIF( | ||
history.subscription.customer.shipping.address.state, | ||
'' | ||
) AS country_state_code | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the customer's (billing) address, not their shipping address, and we don't need to apply NULLIF(..., '')
to the state value since that's already done in stripe_customers_revised_changelog_v1
.
THEN STRUCT( | |
NULLIF(history.subscription.customer.shipping.address.country, '') AS country_code, | |
NULLIF( | |
history.subscription.customer.shipping.address.state, | |
'' | |
) AS country_state_code | |
) | |
THEN STRUCT( | |
NULLIF(history.subscription.customer.address.country, '') AS country_code, | |
history.subscription.customer.address.state AS country_state_code | |
) |
Speaking of which, it occurs to me that if you changed stripe_customers_revised_changelog_v1
to also apply NULLIF(..., '')
to the country values there, then the repetitive NULLIF(..., '')
logic for country values in this ETL could be removed.
charge_summaries.latest_card_state AS country_state_code | ||
) | ||
END | ||
-- SubPlat copies the PayPal billing agreement country to the customer's address. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- SubPlat copies the PayPal billing agreement country to the customer's address. | |
-- SubPlat copied the PayPal billing agreement country to the customer's address before we enabled Stripe Tax (FXA-5457). |
To match the equivalent comment in stripe_subscriptions_history_v1
.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+wc
-- Limit address data to country and state since the metadata includes FxA user IDs | ||
-- and this is in a Mozilla-confidential dataset. State is only for US and Canada. | ||
STRUCT( | ||
NULLIF(customer.address.country, ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLIF(customer.address.country, ""), | |
NULLIF(customer.address.country, "") AS country, |
customer.shipping.* REPLACE (STRUCT(customer.shipping.address.country) AS address) | ||
customer.shipping.* REPLACE ( | ||
STRUCT( | ||
NULLIF(customer.shipping.address.country, ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLIF(customer.shipping.address.country, ""), | |
NULLIF(customer.shipping.address.country, "") AS country, |
LIMIT | ||
1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend leaving LIMIT 1
in. While it isn't technically required, I believe it does help a bit with performance since BigQuery can just keep a single record per aggregation rather than aggregating them all and only using one.
Integration report for "🧹"
|
DS-3326
This adds state for US and Canada to derived stripe customer and downstream subscription tables
I tested this by changing the project qualifiers and deploying/running in a sandbox project. The results look fine on a cursory look
Checklist for reviewer:
<username>:<branch>
of the fork as parameter. The parameter will also show upin the logs of the
manual-trigger-required-for-fork
CI task together with more detailed instructions.For modifications to schemas in restricted namespaces (see
CODEOWNERS
):┆Issue is synchronized with this Jira Task