feat: list participants first seen from/to #232

bajtos · 2024-10-04T06:27:23Z

Links:

Get list of Filecoin wallets onboarded through Spark/Station space-meridian/roadmap#170

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

bajtos · 2024-10-04T06:31:17Z

I ran the query for 2024Q3 manually; it took 647ms to complete.

List of first-seen participant addresses (19,395 items):

https://gist.github.com/bajtos/0fb8985e3f5928c5c31ff7a28e1260ab

Query:

WITH participant_first_seen AS (
  SELECT participant_id, MIN(day) AS first_seen
  FROM daily_participants
  GROUP BY participant_id
)
SELECT participant_address, first_seen::TEXT
FROM participant_first_seen
LEFT JOIN participants ON participant_id = id
WHERE first_seen >= '2024-07-01' AND first_seen <= '2024-09-30'
ORDER BY first_seen ASC, participant_address ASC;

juliangruber

I don't understand why this is "first seen". We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well. Therefore I think first seen doesn't make sense, correct me if I'm wrong. What we actually want is a deduplicated list of participants seen in that time span, right?

bajtos · 2024-10-07T09:53:52Z

> I don't understand why this is "first seen". We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well. Therefore I think first seen doesn't make sense, correct me if I'm wrong. What we actually want is a deduplicated list of participants seen in that time span, right?

Here is my understanding:

Ideally, we want the list of new wallets created by Filecoin Station Desktop in 2024Q3.
The data we have is the list of wallets seen by Spark as participants. This data has daily granularity.

The query I implemented finds all participant addresses we have seen during 2024Q3 for the first time. I.e. addresses that are most likely new wallets created by Station Desktop or manually by Station Core operators.

> We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well.

As I understand our goal, this does matter. If we have a person who installed Station Desktop in January 2024, did not run it since then, and then started to run their Station again in July, then this is not a candidate for a new wallet created in 2024Q3.

Of course, one can argue that the list produced by my query is not final, and we still need to check each address to confirm it has no on-chain interactions before 2024Q3. In that light, your proposal works too.

I see two benefits of my solution:

1. It reduces the number of addresses we must check in the second step. To be fair, I don't know how significant this reduction will be. It may well be less than 10%.
2. It's an approximation, a value between "the number of active participants in Q3" and "the number of new wallets created in Q3".
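To make the difference between the two candidate queries concrete, here is a minimal sketch that runs both against a tiny in-memory table. It uses Python's `sqlite3` instead of our PostgreSQL server, the sample rows are made up, and the participant addresses join is left out for brevity; only the `daily_participants` table and its `day` and `participant_id` columns come from the query in this thread.

```python
import sqlite3

# Hypothetical sample data (not real participants).
ROWS = [
    ("2024-01-15", 1),  # participant 1 was first seen in January...
    ("2024-07-10", 1),  # ...and came back in Q3
    ("2024-08-02", 2),  # participant 2 appears for the first time in Q3
    ("2024-08-03", 2),
]

# "Participants first seen": the earliest day must fall inside the range.
FIRST_SEEN_SQL = """
WITH participant_first_seen AS (
  SELECT participant_id, MIN(day) AS first_seen
  FROM daily_participants
  GROUP BY participant_id
)
SELECT participant_id FROM participant_first_seen
WHERE first_seen >= ? AND first_seen <= ?
ORDER BY participant_id
"""

# "Active participants": any day inside the range counts, deduplicated.
ACTIVE_SQL = """
SELECT DISTINCT participant_id FROM daily_participants
WHERE day >= ? AND day <= ?
ORDER BY participant_id
"""


def run(sql, date_from, date_to):
    """Load the sample rows into a fresh in-memory database and run one query."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE daily_participants (day TEXT, participant_id INTEGER)")
    db.executemany("INSERT INTO daily_participants VALUES (?, ?)", ROWS)
    return [row[0] for row in db.execute(sql, (date_from, date_to))]


first_seen = run(FIRST_SEEN_SQL, "2024-07-01", "2024-09-30")
active = run(ACTIVE_SQL, "2024-07-01", "2024-09-30")
print(first_seen)  # [2]
print(active)      # [1, 2]
```

Participant 1 shows up only in the "active" list, which is exactly the returning-user case described above.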

One more thought: Since our current plan is to run this query only once per quarter, I am unsure if it makes sense to implement it in our REST API. (Also, considering the execution cost is already >600ms and will grow linearly with the number of new participants.)

@juliangruber please let me know what you prefer:

1. Keep the current query ("participants first seen") or your proposal ("active participants")
2. Expose this data via REST API (land this PR) or run the SQL query manually (close this PR as rejected).

juliangruber · 2024-10-07T11:13:59Z

Ok I understand now!

I assumed we wanted to know about participants onboarded through Spark for all of its duration, and not only since Q3 2024. Where does this date range come from?

bajtos · 2024-10-07T11:42:34Z

> I assumed we wanted to know about participants onboarded through Spark for all of its duration, and not only since Q3 2024. Where does this date range come from?

Ah, great call. We want to show our impact in our application for FIL-RetroPGF Round 2. I thought that means impact made in Q3.

I double-checked the requirements, and we need to show impact in Q2 & Q3.

Quoting from filecoin-project/community#714:

> Round 2 involves distributing 300K FIL to projects that have shown impact in the previous 6 months (April - September 2024).

bajtos · 2024-10-07T11:49:26Z

I re-ran the query for April to September 2024 and found 37,231 addresses.

See https://gist.github.com/bajtos/3b68c2f9a654bcc755fc5f428dfd37ba
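Since the date bounds are typed by hand for each run, a small helper can derive them from the quarters instead. This is only a sketch: `quarter_range` is a hypothetical name, not something that exists in our codebase, and it assumes the query keeps using inclusive `>=` / `<=` bounds.

```python
from datetime import date, timedelta


def quarter_range(year, first_quarter, last_quarter):
    """Return inclusive (from, to) ISO dates spanning the given quarters of one year."""
    if not 1 <= first_quarter <= last_quarter <= 4:
        raise ValueError("quarters must satisfy 1 <= first <= last <= 4")
    start = date(year, 3 * first_quarter - 2, 1)
    # Find the first day after the last quarter, then step back one day.
    if last_quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, 3 * last_quarter + 1, 1)
    end = next_start - timedelta(days=1)
    return start.isoformat(), end.isoformat()


print(quarter_range(2024, 2, 3))  # ('2024-04-01', '2024-09-30')
```

The two returned strings can be passed as the lower and upper bound of the `first_seen` filter.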

juliangruber · 2024-10-07T12:01:16Z

> Ah, great call. We want to show our impact in our application for FIL-RetroPGF Round 2. I thought that means impact made in Q3.
>
> I double-checked the requirements, and we need to show impact in Q2 & Q3.

Thanks for finding this out!

I agree that since we want to know the impact since Q2, it doesn't make sense to include addresses seen before Q2 👍

juliangruber · 2024-10-07T12:02:30Z

> One more thought: Since our current plan is to run this query only once per quarter, I am unsure if it makes sense to implement it in our REST API. (Also, considering the execution cost is already >600ms and will grow linearly with the number of new participants.)
>
> @juliangruber please let me know what you prefer:
>
> Expose this data via REST API (land this PR) or run the SQL query manually (close this PR as rejected).

What do you think about repackaging this PR as a CLI? Since it's an expensive query and we don't know whether anyone besides us needs it, the risk of exposing it as a REST endpoint outweighs the reward.

bajtos · 2024-10-07T12:09:48Z

> What do you think about repackaging this PR as a CLI? Since it's an expensive query and we don't know whether anyone besides us needs it, the risk of exposing it as a REST endpoint outweighs the reward.

Great idea!

Since the next step in space-meridian/roadmap#170 is to check whether each participant has transactions predating Spark/Station, would you mind taking my query and placing it into the tool you will build for this next step?

Things to consider if you want to run this query from your machine:

1. You need to set up a tunnel to our PG server, preferably the read-only replica. I use the following command:
   ```
   fly proxy 5455:5433 -a spark-db
   ```
2. Then you need to configure the PG client to connect through this tunnel and provide the correct credentials (you can find them in our shared 1Password vault).
   ```
   DATABASE_URL=postgres://credentials@localhost:5455/spark_evaluate
   ```

Considering the complexity, might it be better to run this query via the REST API? We can limit access to this REST API by requiring an authorization header.

juliangruber · 2024-10-07T14:29:29Z

> Considering the complexity, might it be better to run this query via the REST API? We can limit access to this REST API by requiring an authorization header.

Isn't this the same complexity we have for all other CLIs though?

feat: list participants first seen from/to

0efc2cc

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

bajtos requested a review from juliangruber October 4, 2024 06:27

bajtos mentioned this pull request Oct 4, 2024

Get list of Filecoin wallets onboarded through Spark/Station space-meridian/roadmap#170

Open

2 tasks

juliangruber requested changes Oct 4, 2024
