feat: list participants first seen from/to #232

Open: 1 commit to merge into main

bajtos (Member) commented Oct 4, 2024

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>
bajtos (Member, Author) commented Oct 4, 2024

I ran the query for 2024Q3 manually; it took 647ms to complete.

List of first-seen participant addresses (19,395 items):

https://gist.github.com/bajtos/0fb8985e3f5928c5c31ff7a28e1260ab

Query:

WITH participant_first_seen AS (
  SELECT participant_id, MIN(day) as first_seen
  FROM daily_participants
  GROUP BY participant_id
)
SELECT participant_address, first_seen::TEXT
FROM participant_first_seen
LEFT JOIN participants ON participant_id = id
WHERE first_seen >= '2024-07-01' AND first_seen <= '2024-09-30'
ORDER BY first_seen ASC, participant_address ASC;

juliangruber (Member) left a comment

I don't understand why this is "first seen". We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well. Therefore I think first seen doesn't make sense, correct me if I'm wrong. What we actually want is a deduplicated list of participants seen in that time span, right?

bajtos (Member, Author) commented Oct 7, 2024

> I don't understand why this is "first seen". We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well. Therefore I think first seen doesn't make sense, correct me if I'm wrong. What we actually want is a deduplicated list of participants seen in that time span, right?

Here is my understanding:

  • Ideally, we want the list of new wallets created by Filecoin Station Desktop in 2024Q3.
  • The data we have is the list of wallets seen by Spark as participants. This data has daily granularity.

The query I implemented finds all participant addresses that we saw for the first time during 2024Q3, i.e. addresses that are most likely new wallets created by Station Desktop or manually by Station Core operators.

> We want to get a list of all participant addresses seen in a time range. It doesn't matter whether they have been seen before that time range as well.

As I understand our goal, this does matter. If a person installed Station Desktop in January 2024, stopped running it, and then started running their Station again in July, their wallet is not a candidate for a new wallet created in 2024Q3.

Of course, one can argue that the list produced by my query is not final, and we still need to check each address to confirm that it has no on-chain interactions before 2024Q3. In that light, your proposal works too.

I see two benefits of my solution:

  • It reduces the number of addresses we must check in the second step. To be fair, I don't know how significant this reduction will be; it may well be less than 10%.
  • It's an approximation, a value between "the number of active participants in Q3" and "the number of new wallets created in Q3".
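
To make the difference between the two candidate queries concrete, here is a minimal sketch on made-up data. It uses Python with an in-memory SQLite database as a stand-in for the PostgreSQL server. The table and column names follow the query posted earlier in this thread; the three sample addresses and the ACTIVE query (one possible reading of the "deduplicated list of participants seen in that time span" proposal) are assumptions for illustration only.

```python
import sqlite3

# Toy schema mirroring the table and column names used in the query above.
# The rows are invented; SQLite stands in for PostgreSQL.
db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE participants (id INTEGER PRIMARY KEY, participant_address TEXT);
  CREATE TABLE daily_participants (day TEXT, participant_id INTEGER);
  INSERT INTO participants VALUES (1, '0xreturning'), (2, '0xnew'), (3, '0xold');
  INSERT INTO daily_participants VALUES
    ('2024-01-15', 1), ('2024-07-10', 1),  -- seen in January, back in July
    ('2024-08-01', 2), ('2024-08-02', 2),  -- first seen in Q3
    ('2024-02-01', 3);                     -- never seen in Q3
""")

# "Participants first seen": filter on the earliest day per participant.
FIRST_SEEN = """
  WITH participant_first_seen AS (
    SELECT participant_id, MIN(day) AS first_seen
    FROM daily_participants
    GROUP BY participant_id
  )
  SELECT participant_address
  FROM participant_first_seen
  LEFT JOIN participants ON participant_id = id
  WHERE first_seen >= ? AND first_seen <= ?
  ORDER BY participant_address
"""

# "Active participants": anyone seen on at least one day in the range.
ACTIVE = """
  SELECT DISTINCT participant_address
  FROM daily_participants
  LEFT JOIN participants ON participant_id = id
  WHERE day >= ? AND day <= ?
  ORDER BY participant_address
"""

q3 = ("2024-07-01", "2024-09-30")
first_seen = [row[0] for row in db.execute(FIRST_SEEN, q3)]
active = [row[0] for row in db.execute(ACTIVE, q3)]
print("first seen in Q3:", first_seen)
print("active in Q3:    ", active)
```

On this sample, the first-seen query returns only 0xnew, while the active-participants query also returns 0xreturning, the wallet that was already seen in January.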

One more thought: Since our current plan is to run this query only once per quarter, I am unsure if it makes sense to implement it in our REST API. (Also, considering the execution cost is already >600ms and will grow linearly with the number of new participants.)

@juliangruber please let me know what you prefer:

  1. Keep the current query ("participants first seen") or your proposal ("active participants")
  2. Expose this data via REST API (land this PR) or run the SQL query manually (close this PR as rejected).

juliangruber (Member) commented

Ok I understand now!

I assumed we wanted to know about participants onboarded through Spark for all of its duration, and not only since Q3 2024. Where does this data range come from?

bajtos (Member, Author) commented Oct 7, 2024

> I assumed we wanted to know about participants onboarded through Spark for all of its duration, and not only since Q3 2024. Where does this data range come from?

Ah, great call. We want to show our impact in our application for FIL-RetroPGF Round 2. I thought that means impact made in Q3.

I double-checked the requirements, and we need to show impact in Q2 & Q3.

Quoting from filecoin-project/community#714:

> Round 2 involves distributing 300K FIL to projects that have shown impact in the previous 6 months (April - September 2024).

bajtos (Member, Author) commented Oct 7, 2024

I re-ran the query for April to September 2024 and found 37,231 addresses.

See https://gist.github.com/bajtos/3b68c2f9a654bcc755fc5f428dfd37ba

juliangruber (Member) commented

> Ah, great call. We want to show our impact in our application for FIL-RetroPGF Round 2. I thought that means impact made in Q3.
>
> I double-checked the requirements, and we need to show impact in Q2 & Q3.

Thanks for finding this out!

I agree that since we want to know the impact since Q2, it doesn't make sense to include addresses seen before Q2 👍

juliangruber (Member) commented

> One more thought: Since our current plan is to run this query only once per quarter, I am unsure if it makes sense to implement it in our REST API. (Also, considering the execution cost is already >600ms and will grow linearly with the number of new participants.)
>
> @juliangruber please let me know what you prefer:
>
> Expose this data via REST API (land this PR) or run the SQL query manually (close this PR as rejected).

What do you think about repackaging this PR as a CLI? Since it's an expensive query and we don't know that someone besides us needs it, the risk as a REST endpoint is higher than the reward

bajtos (Member, Author) commented Oct 7, 2024

> What do you think about repackaging this PR as a CLI? Since it's an expensive query and we don't know that someone besides us needs it, the risk as a REST endpoint is higher than the reward

Great idea!

Since the next step in space-meridian/roadmap#170 is to check whether a participant has transactions predating Spark/Station, would you mind taking my query and placing it into the tool you will build for this next step?

Things to consider if you want to run this query from your machine:

  • You need to set up a tunnel to our PG server, preferably the read-only replica. I use the following command:

    fly proxy 5455:5433 -a spark-db
    
  • Then you need to configure the PG client to connect through this tunnel and provide the correct credentials (you can find them in our shared 1Password vault).

    DATABASE_URL=postgres://credentials@localhost:5455/spark_evaluate
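
As a rough illustration of what the CLI variant could look like, here is a minimal Python sketch that parses a date range and prepares the parameterized query. The flag names (--from, --to), the helper names, and the %(name)s placeholder style are assumptions, not an existing tool. Connecting to the database through the tunnel (using DATABASE_URL and a PostgreSQL driver) is deliberately left out.

```python
import argparse
from datetime import date

# Same query as posted earlier in this thread, with the date range
# turned into named parameters (placeholder style is an assumption).
QUERY = """
WITH participant_first_seen AS (
  SELECT participant_id, MIN(day) AS first_seen
  FROM daily_participants
  GROUP BY participant_id
)
SELECT participant_address, first_seen::TEXT
FROM participant_first_seen
LEFT JOIN participants ON participant_id = id
WHERE first_seen >= %(from)s AND first_seen <= %(to)s
ORDER BY first_seen ASC, participant_address ASC;
"""


def parse_args(argv):
    """Parse and validate the date range; exits with an error on bad input."""
    parser = argparse.ArgumentParser(
        description="List participants first seen in a date range"
    )
    # date.fromisoformat raises ValueError for malformed dates, which
    # argparse reports as a usage error.
    parser.add_argument("--from", dest="from_day", type=date.fromisoformat, required=True)
    parser.add_argument("--to", dest="to_day", type=date.fromisoformat, required=True)
    args = parser.parse_args(argv)
    if args.from_day > args.to_day:
        parser.error("--from must not be later than --to")
    return args


def build_params(args):
    """Turn parsed arguments into the query parameters."""
    return {"from": args.from_day.isoformat(), "to": args.to_day.isoformat()}


def main(argv):
    params = build_params(parse_args(argv))
    # A real tool would run QUERY with these params against the database
    # reachable via DATABASE_URL; this sketch only prints them.
    print(QUERY.strip())
    print(params)
    return params


# Demo invocation with explicit arguments; a real script would call
# main(sys.argv[1:]) instead.
main(["--from", "2024-04-01", "--to", "2024-09-30"])
```

Passing the dates as parameters instead of interpolating them into the SQL string keeps the query text fixed and lets the argument parser reject malformed or reversed ranges before anything is sent to the server.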
    

Considering the complexity, it may be better to run this query via the REST API? We can limit access to this REST API by requesting an authorization header.
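
For reference, gating an endpoint on an authorization header could look roughly like the Python sketch below. The Bearer scheme, the function name is_authorized, and the assumption that headers arrive as a dict with lowercase keys are all hypothetical; nothing here reflects how our REST API is actually implemented.

```python
import hmac


def is_authorized(headers, expected_token):
    """Return True when the Authorization header carries the expected bearer token.

    `headers` is assumed to be a dict with lowercase header names.
    """
    value = headers.get("authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how much of the token matched via timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())


print(is_authorized({"authorization": "Bearer example-token"}, "example-token"))
print(is_authorized({}, "example-token"))
```

A handler would call this check first and respond with a 401 or 403 status when it returns False, so the expensive query only runs for callers that hold the shared token.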

juliangruber (Member) commented

> Considering the complexity, it may be better to run this query via the REST API? We can limit access to this REST API by requesting an authorization header.

Isn't this the same complexity we have for all other CLIs though?
