StateGetBeaconEntry uses drand client to look up round instead of using tipset data #12414

Open
rvagg opened this issue Aug 23, 2024 · 12 comments · May be fixed by #12428

@rvagg
Member

rvagg commented Aug 23, 2024

Reported by @LesnyRumcajs:

  2024-08-23T08:06:24.759Z	WARN	rpc	go-jsonrpc@v0.6.0/handler.go:474	error in RPC call to 'Filecoin.StateGetBeaconEntry': drand failed Get request:
      github.com/filecoin-project/lotus/chain/beacon/drand.(*DrandBeacon).Entry.func1
          /opt/filecoin/chain/beacon/drand/drand.go:166
    - no valid clients
In more detail, running calibnet comparisons with forest on RPC calls:
2024-08-23T08:09:05.3627399Z * Status: Exited (1) About a minute ago
2024-08-23T08:09:05.3628163Z **********************************************************************
2024-08-23T08:09:05.4071544Z Request dump: Request { method_name: "Filecoin.BeaconGetEntry", params: Array [Number(10101)], result_type: PhantomData<serde_json::value::Value>, api_paths: V0, timeout: 120s }
2024-08-23T08:09:05.4073354Z ++ cat /data/lotus-token
2024-08-23T08:09:05.4079345Z + LOTUS_API_INFO=***
2024-08-23T08:09:05.4080139Z + FOREST_API_INFO=/dns/api-serve/tcp/3456/http
2024-08-23T08:09:05.4084839Z ++ ls /data/forest_snapshot_calibnet_2024-08-23_height_1902405.forest.car.zst
2024-08-23T08:09:05.4085830Z ++ tail -n 1
2024-08-23T08:09:05.4089242Z + forest-tool api compare /data/forest_snapshot_calibnet_2024-08-23_height_1902405.forest.car.zst --forest /dns/api-serve/tcp/3456/http --lotus *** --n-tipsets 10 --filter-file /data/filter-list-offline
2024-08-23T08:09:05.4091113Z Error: Some tests failed
2024-08-23T08:09:05.4091573Z Request params JSON: [10101]
2024-08-23T08:09:05.4098254Z Forest response: {
2024-08-23T08:09:05.4099907Z   "Data": "q4Ld1oDM4YtLFvAb8ovBzrh0GJTeiWqNj1CMDSs2RDeS1Jjp83aO7+rJcB9NO3QWE7fvPT++jjT1fJKyto4xkRkLIiTgtYxORj0KmfcmFRrv4uI3H5bub/Ak111wzf55",
2024-08-23T08:09:05.4106313Z   "Round": 2406612
2024-08-23T08:09:05.4120020Z }
2024-08-23T08:09:05.4120296Z 
2024-08-23T08:09:05.4121397Z Request dump: Request { method_name: "Filecoin.StateGetBeaconEntry", params: Array [Number(0)], result_type: PhantomData<serde_json::value::Value>, api_paths: V1, timeout: 120s }
2024-08-23T08:09:05.4123162Z Request params JSON: [0]
2024-08-23T08:09:05.4136112Z Forest response: {
2024-08-23T08:09:05.4137548Z   "Data": "kyxpp5dBVVIr3oJSXVA8caSMkJWpe24/20K60QAVPY/879dqhVmOpMs8Xvt9us3uAZeKCI1R1d6nfUq8qDuHgTN/eMFeahhTNEYhwQh2hRXQWs8j+YWdFFYUh5iotknN",
2024-08-23T08:09:05.4139047Z   "Round": 2396510
2024-08-23T08:09:05.4145758Z }
2024-08-23T08:09:05.4146028Z 
2024-08-23T08:09:05.4146829Z Request dump: Request { method_name: "Filecoin.StateGetBeaconEntry", params: Array [Number(1)], result_type: PhantomData<serde_json::value::Value>, api_paths: V1, timeout: 120s }
2024-08-23T08:09:05.4171856Z Request params JSON: [1]
2024-08-23T08:09:05.4172731Z Forest response: {
2024-08-23T08:09:05.4174834Z   "Data": "okAr/DU2fJ9WFpyqunhK3kkI6Z6dG5uZqBk+kfcEYXSTYIbHLL/y5ka0ZRopqAdEGAfODlP4wUaYnWh0c1COMWBtkfFGXBWGc15oU8LTDs2oZB9Sr/DNkfTQLJD9BtcA",
2024-08-23T08:09:05.4182001Z   "Round": 2396511
2024-08-23T08:09:05.4182605Z }

I believe this is due to removal of clients for looking up historic beacons.

Mainnet uses 3 different Drand networks: incentinet, mainnet and quicknet. We switched to quicknet in nv22. Calibnet only uses mainnet and quicknet. Client details for incentinet were removed in #10476, mainnet were removed more recently, in #12272.

I don't believe drand are on the hook for maintaining ancient beacons, incentinet is entirely gone afaik so you couldn't even look that far back. Mainnet probably too someday.

What confuses me about this API is that it takes an epoch and then does a remote look-up even though we should have it in the chainstore hanging off each block. TipSet->BlockHeader[]->BeaconEntries[]. I'm just not clear on which one we would return if we looked it up from there. Logic to build it is here:

func BeaconEntriesForBlock(ctx context.Context, bSchedule Schedule, nv network.Version, epoch abi.ChainEpoch, parentEpoch abi.ChainEpoch, prev types.BeaconEntry) ([]types.BeaconEntry, error) {

Perhaps there's null-round implications of trying to use the tipsets for this?
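
A minimal sketch of what reading the entry off the chain could look like, assuming simplified stand-in types. BeaconEntry, BlockHeader and TipSet here are hypothetical reductions of the Lotus types, the Parent pointer is an invented shortcut, and the "last entry of the first block" choice is only one possible answer to the question above, not what Lotus does:

```go
package main

import (
	"errors"
	"fmt"
)

// BeaconEntry, BlockHeader and TipSet are simplified stand-ins for the Lotus
// types of the same names; only the fields needed here are modelled.
type BeaconEntry struct {
	Round uint64
	Data  []byte
}

type BlockHeader struct {
	Height        int64
	BeaconEntries []BeaconEntry
}

type TipSet struct {
	Blocks []BlockHeader
	Parent *TipSet // hypothetical shortcut; Lotus loads parents from the chain store
}

// latestEntryFromTipSet returns the last beacon entry recorded in the first
// block of ts, falling back to ancestors when a tipset carries no entries.
func latestEntryFromTipSet(ts *TipSet) (BeaconEntry, error) {
	for cur := ts; cur != nil; cur = cur.Parent {
		if len(cur.Blocks) == 0 {
			continue
		}
		entries := cur.Blocks[0].BeaconEntries
		if len(entries) > 0 {
			return entries[len(entries)-1], nil
		}
	}
	return BeaconEntry{}, errors.New("no beacon entry found on chain")
}

func main() {
	// Illustrative values only, not real chain data.
	parent := &TipSet{Blocks: []BlockHeader{{Height: 10, BeaconEntries: []BeaconEntry{{Round: 110}}}}}
	child := &TipSet{Blocks: []BlockHeader{{Height: 11}}, Parent: parent}

	entry, err := latestEntryFromTipSet(child)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("round:", entry.Round)
}
```

This needs no remote call, but it says nothing yet about which entry is the right one when the requested epoch is a null round.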

@ribasushi
Collaborator

What confuses me about this API is that it takes an epoch and then does a remote look-up even though we should have it in the chainstore hanging off each block. TipSet->BlockHeader[]->BeaconEntries[]

@rvagg I was getting different results here on every round, not just null: https://github.com/data-preservation-programs/spade/blob/22dc7dc0ba81b/webapi/auth.go#L268-L280
It was right before nucleation-winter, so I never had a chance to dig further...

@ribasushi
Collaborator

Retested it now out of curiosity. It might be that I was tripped up by null and adjacent rounds which are not handled the same way. This seems like an error in StateGetBeaconEntry for historic rounds past finality: pretty sure it should return "last finalized beacon", instead of "drand chain beacon", to be consistent. Especially in cases where the block header contains both before/after beacons.

Test with recent null as of writing this (to be pasted into terminal with local running API)

for epoch in 4203296 4203297 4203298 4203299; do
  echo -e "===\n$epoch\n====\n"

  printf '{ "jsonrpc": "2.0", "id":1, "method": "Filecoin.BeaconGetEntry", "params": [ %d ] }' "$epoch" \
    | curl http://localhost:1234/rpc/v0 -sH "Content-Type: application/json" -d@/dev/stdin | jq .result

  printf '{ "jsonrpc": "2.0", "id":1, "method": "Filecoin.ChainGetTipSetByHeight", "params": [ %d, null ] }' "$epoch" \
    | curl http://localhost:1234/rpc/v0 -sH "Content-Type: application/json" -d@/dev/stdin | jq '.result.Blocks[0].BeaconEntries'

done

Result:

===
4203296
====

{
  "Round": 10533962,
  "Data": "tZIDFvTTM1Gke17XIOe8/MRsuKP/Kjz7+2XOMeiBK6YFfA60IwBgUwGbeO7Wqj+U"
}
[
  {
    "Round": 10533962,
    "Data": "tZIDFvTTM1Gke17XIOe8/MRsuKP/Kjz7+2XOMeiBK6YFfA60IwBgUwGbeO7Wqj+U"
  }
]
===
4203297
====

{
  "Round": 10533972,
  "Data": "jNCsF8Xog2iq+6YIWE8bm5nFfCbhTySDi8uPP3bwXaeSQVTZg0esSUtxe39g+A8y"
}
[
  {
    "Round": 10533962,
    "Data": "tZIDFvTTM1Gke17XIOe8/MRsuKP/Kjz7+2XOMeiBK6YFfA60IwBgUwGbeO7Wqj+U"
  }
]
===
4203298
====

{
  "Round": 10533982,
  "Data": "pOvWUP9EYoYo5lZ+sUgtV25Q9MAhDvlrKiRI8qGyZLxxlLin6TpsT3eo0NNRCOZ0"
}
[
  {
    "Round": 10533972,
    "Data": "jNCsF8Xog2iq+6YIWE8bm5nFfCbhTySDi8uPP3bwXaeSQVTZg0esSUtxe39g+A8y"
  },
  {
    "Round": 10533982,
    "Data": "pOvWUP9EYoYo5lZ+sUgtV25Q9MAhDvlrKiRI8qGyZLxxlLin6TpsT3eo0NNRCOZ0"
  }
]
===
4203299
====

{
  "Round": 10533992,
  "Data": "mNBq7iBvlj+0p8+qPeIIggFKvjytQcf5AjxbVjiunjXq4S3HgN4SPu4nGF7omnHS"
}
[
  {
    "Round": 10533992,
    "Data": "mNBq7iBvlj+0p8+qPeIIggFKvjytQcf5AjxbVjiunjXq4S3HgN4SPu4nGF7omnHS"
  }
]

@ribasushi
Collaborator

Behavior on calibnet with multiple consecutive null rounds:

===
1892388
====

{
  "Round": 10431542,
  "Data": "qN3de36QOxOL1kCZIJkCKuvC3ron+zDbF4+j6pCkK+bGrREzS2Uuou7TVbZ47e+8"
}
[
  {
    "Round": 10431542,
    "Data": "qN3de36QOxOL1kCZIJkCKuvC3ron+zDbF4+j6pCkK+bGrREzS2Uuou7TVbZ47e+8"
  }
]
===
1892389
====

{
  "Round": 10431552,
  "Data": "jhhtwfWcQ4I7venWqQDOblkkAwqiRRcs8SMOowpR49q2m7PVCZKnakLEFujfkHkM"
}
[
  {
    "Round": 10431542,
    "Data": "qN3de36QOxOL1kCZIJkCKuvC3ron+zDbF4+j6pCkK+bGrREzS2Uuou7TVbZ47e+8"
  }
]
===
1892390
====

{
  "Round": 10431562,
  "Data": "rFG33n7vPYlV7u2EQofKOgcI0si0WJbpQ3lMIH68Tb5bTDK3CzgCJozQ80HiI4Ie"
}
[
  {
    "Round": 10431542,
    "Data": "qN3de36QOxOL1kCZIJkCKuvC3ron+zDbF4+j6pCkK+bGrREzS2Uuou7TVbZ47e+8"
  }
]
===
1892391
====

{
  "Round": 10431572,
  "Data": "q4KugG+yOuE87Jy2sOQkVdmSWnYuV+94fYJj0xpllpe9RK5dwMWXnITpc+I/GHLa"
}
[
  {
    "Round": 10431552,
    "Data": "jhhtwfWcQ4I7venWqQDOblkkAwqiRRcs8SMOowpR49q2m7PVCZKnakLEFujfkHkM"
  },
  {
    "Round": 10431562,
    "Data": "rFG33n7vPYlV7u2EQofKOgcI0si0WJbpQ3lMIH68Tb5bTDK3CzgCJozQ80HiI4Ie"
  },
  {
    "Round": 10431572,
    "Data": "q4KugG+yOuE87Jy2sOQkVdmSWnYuV+94fYJj0xpllpe9RK5dwMWXnITpc+I/GHLa"
  }
]

@rvagg
Member Author

rvagg commented Aug 24, 2024

There's a whole infrastructure in the FVM randomness syscall package for picking out the beacon from the chain, entering here and forking depending on network version:

lotus/chain/rand/rand.go

Lines 180 to 190 in dbef5de

func (sr *stateRand) GetBeaconRandomness(ctx context.Context, filecoinEpoch abi.ChainEpoch) ([32]byte, error) {
	nv := sr.networkVersionGetter(ctx, filecoinEpoch)
	if nv >= network.Version14 {
		return sr.getBeaconRandomnessV3(ctx, filecoinEpoch)
	} else if nv == network.Version13 {
		return sr.getBeaconRandomnessV2(ctx, filecoinEpoch)
	}
	return sr.getBeaconRandomnessV1(ctx, filecoinEpoch)
}
The versions differ in the "lookback" param, which determines whether to go before or after a null round; that logic switches at nv13. I think maybe we could just wire this API up to that call since that's what on-chain actors are going to see anyway.

We just need a bit of historical knowledge of this API to inform the decision here. @Stebalien @Kubuxu might know whether it's safe to switch from getting the beacon from drand directly to getting it off what we have on the chain.

@rvagg
Member Author

rvagg commented Aug 29, 2024

Some historical explanation for the NV13 special casing: #3613

@jennijuju
Member

We discussed this offline - I think this api is implemented wrongly and I’d expect this to return what’s (beacon entries) in the block header.
@arajasek convince me otherwise!

@rvagg
Member Author

rvagg commented Aug 30, 2024

Some additional context:

StateGetRandomnessFromBeacon works as we expect it should, by using the chain beacons just like the get_beacon_randomness syscall, but it's designed to take a DomainSeparationTag and additional entropy, so it's not suitable for getting the raw beacon randomness.

StateGetBeaconEntry does have some additional documented properties which make it a bit trickier to implement than just getting historical chain randomness:

	// StateGetBeaconEntry returns the beacon entry for the given filecoin epoch. If
	// the entry has not yet been produced, the call will block until the entry
	// becomes available

The blocking behaviour for future randomness means that it is easier to implement as a call directly into drand.

But we end up with the property that the beacon entry returned from this will be different from historical beacon entries in a number of epoch ranges where we fetch them off chain, as used in StateGetRandomnessFromBeacon and get_beacon_randomness.

@rvagg
Member Author

rvagg commented Aug 30, 2024

A wrinkle:

lotus/miner/miner.go

Lines 271 to 280 in 4a4ddaa

// Ensure the beacon entry is available before finalizing the mining base.
_, err = m.api.StateGetBeaconEntry(ctx, prebase.TipSet.Height()+prebase.NullRounds+1)
if err != nil {
	log.Errorf("failed getting beacon entry: %s", err)
	if !m.niceSleep(time.Second) {
		continue minerLoop
	}
	continue
}

The block mining process relies on StateGetBeaconEntry to check that we even have a round available for the epoch being mined before we go ahead and mine. So, drand could be down and this call fails, in which case we have to skip an epoch and it becomes a null.

So, even if we decide to "fix" this API for historical epochs, we need it to query drand for >now. I'm not sure if it matters for =now though. We're always doing a +1 on the mining base, which is either chain head or the previously selected tipset if that was heavier. From what I can tell in that code, it may be calling StateGetBeaconEntry on a current, or even previous epoch, if the previously selected tipset is heavier than the currently reported head but it's from an earlier epoch.

But maybe that also doesn't matter if we're able to fetch the correct round from the tipset anyway, we know it's "available" and that's all the call is doing in this case, it's not even getting the value.

@rvagg
Member Author

rvagg commented Aug 30, 2024

Results of spending far too much time figuring out the behaviour of the beacon-from-chain behaviour:

Filecoin Drand Beacon Randomness from Chain

As used in the FVM syscall get_beacon_randomness and the StateGetRandomnessFromBeacon API.

Based on https://github.com/filecoin-project/lotus/blob/4a4ddaaeccc56bbd1e86db404726415921695da7/chain/rand/rand.go

Background

For each tipset, we expect all blocks to have the same BeaconEntries array. For most cases this will be a single BeaconEntry where the beacon round matches the epoch time. In the case where there are null rounds between tipsets, we expect the BeaconEntries array to include one BeaconEntry per missed epoch.

One special case in code is that when there is a change of beacons (drand networks) and the new beacon is "chained" (i.e. not quicknet, because Filecoin no longer tracks a connected chain of rounds), an entry will be recorded for both the previous and the new beacons in the new block header. This can be seen at the only such transition, Smoke(2) @ 51000, which has two entries recorded for the new "mainnet" beacon, rounds 146843 and 146844 (while the previous epoch has incentinet round 132084).

Since quicknet (Phoenix @ 3855480 (2024-04-11T15:00:00Z)), there is not a 1:1 correspondence between drand rounds and filecoin epochs since drand moves in 3s intervals and filecoin in 30s intervals. This means that we no longer store the full "chain" of drand rounds. It also means that any applications relying on specific drand rounds to be available via get_beacon_randomness or StateGetRandomnessFromBeacon can only rely on drand rounds that align with filecoin epochs.
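
To make the alignment point concrete, here is a rough sketch that maps a quicknet round back to the mainnet epoch that would carry it, or reports that no epoch does. The genesis timestamps and the one-epoch offset are assumptions that are not stated in this thread (they do reproduce the mainnet rounds quoted earlier, e.g. epoch 4203296 and round 10533962); epochForRound is a hypothetical helper, not a Lotus function:

```go
package main

import "fmt"

const (
	filGenesis      = 1598306400 // assumed Filecoin mainnet genesis (Unix seconds)
	filEpochSeconds = 30
	drandGenesis    = 1692803367 // assumed drand quicknet genesis (Unix seconds)
	drandPeriod     = 3
	phoenixEpoch    = 3855480 // first quicknet epoch, per the text above
)

// epochForRound returns the Filecoin epoch whose block headers would record
// the given quicknet round, or false if the round falls between epochs and
// is therefore never stored on chain.
func epochForRound(round uint64) (int64, bool) {
	if round == 0 {
		return 0, false
	}
	// Seconds between Filecoin genesis and the moment this round was produced.
	x := drandGenesis + int64(round-1)*drandPeriod - filGenesis
	if x < 0 {
		return 0, false
	}
	// Assumption: epoch e uses the latest round available at 30*(e-1) seconds
	// after Filecoin genesis, so look for a multiple of 30 in [x, x+3).
	m := (x + filEpochSeconds - 1) / filEpochSeconds * filEpochSeconds
	if m >= x+drandPeriod {
		return 0, false
	}
	epoch := m/filEpochSeconds + 1
	if epoch < phoenixEpoch {
		return 0, false // quicknet was not in use yet
	}
	return epoch, true
}

func main() {
	for _, r := range []uint64{10533962, 10533963, 10533972} {
		epoch, ok := epochForRound(r)
		fmt.Println(r, epoch, ok)
	}
}
```

Under these assumptions only one in every ten quicknet rounds maps to an epoch; the other nine are not reachable through the chain.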

Network versions 0-12: < Hyperdrive(13) @ 892800 (2021-06-30T22:00:00Z)

getBeaconRandomnessV1

  • Get tipset for the requested epoch.
    • Where the requested epoch is a null round, start from the previous non-null tipset.
  • Use GetLatestBeaconEntry on the selected tipset to return the last BeaconEntry on the block headers where there is a BeaconEntry.
    • Otherwise walks back the chain to find a tipset with at least one BeaconEntry.
  • Blake2b-256 hash the BeaconEntry.Data to get the randomness.

This results in a mismatch between the epoch and the beacon round for null rounds since we only look backward on the chain.

Network version 13: >= Hyperdrive(13) @ 892800 (2021-06-30T22:00:00Z) < Chocolate(14) @ 1231620 (2021-10-26T13:30:00Z)

getBeaconRandomnessV2

  • Get tipset for the requested epoch.
    • Where the requested epoch is a null round, start from the next non-null tipset.
  • Use GetLatestBeaconEntry on the selected tipset to return the last BeaconEntry on the block headers where there is a BeaconEntry.
    • Otherwise walks back the chain to find a tipset with at least one BeaconEntry.
  • Blake2b-256 hash the BeaconEntry.Data to get the randomness.

Additional context:

This still results in a mismatch between the epoch and the beacon round for null rounds since we only fetch the last BeaconEntry on the tipset we find, which will not match the beacon round.
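
The V1 and V2 rules described so far differ only in which direction they resolve a null round. A small sketch of both on a toy chain, assuming a flattened hypothetical tipSet model, made-up round numbers with one round per epoch, and leaving out the Blake2b-256 hashing step:

```go
package main

import (
	"errors"
	"fmt"
)

type beaconEntry struct {
	Round uint64
}

// tipSet is a hypothetical, flattened model: Height is the epoch and Entries
// is the BeaconEntries array shared by every block in the tipset.
type tipSet struct {
	Height  int64
	Entries []beaconEntry
}

// latestEntry mimics GetLatestBeaconEntry as described above: the last entry
// of chain[i], walking back while tipsets have no entries.
func latestEntry(chain []tipSet, i int) (beaconEntry, error) {
	for ; i >= 0; i-- {
		if n := len(chain[i].Entries); n > 0 {
			return chain[i].Entries[n-1], nil
		}
	}
	return beaconEntry{}, errors.New("no beacon entry found")
}

// entryV1 resolves a null epoch to the previous non-null tipset (nv0-12 rule).
// chain must be sorted by ascending height.
func entryV1(chain []tipSet, epoch int64) (beaconEntry, error) {
	for i := len(chain) - 1; i >= 0; i-- {
		if chain[i].Height <= epoch {
			return latestEntry(chain, i)
		}
	}
	return beaconEntry{}, errors.New("epoch precedes the chain")
}

// entryV2 resolves a null epoch to the next non-null tipset (nv13 rule).
func entryV2(chain []tipSet, epoch int64) (beaconEntry, error) {
	for i := range chain {
		if chain[i].Height >= epoch {
			return latestEntry(chain, i)
		}
	}
	return beaconEntry{}, errors.New("epoch is beyond the chain head")
}

func main() {
	// Epochs 101 and 102 are null rounds; tipset 103 carries their entries.
	chain := []tipSet{
		{Height: 100, Entries: []beaconEntry{{Round: 1000}}},
		{Height: 103, Entries: []beaconEntry{{Round: 1001}, {Round: 1002}, {Round: 1003}}},
	}
	v1, _ := entryV1(chain, 101)
	v2, _ := entryV2(chain, 101)
	fmt.Println("V1 round:", v1.Round) // 1000, one behind the epoch's own round 1001
	fmt.Println("V2 round:", v2.Round) // 1003, ahead of the epoch's own round 1001
}
```

Neither rule lands on round 1001 for the null epoch 101 in this toy chain, which is the mismatch described for both versions.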

Network versions 14+: >= Chocolate(14) @ 1231620 (2021-10-26T13:30:00Z)

getBeaconRandomnessV3

  • Get tipset for the requested epoch.
    • Where the requested epoch is a null round, start from the next non-null tipset.
  • Calculate beacon round for requested epoch.
    • Where network version >= Chocolate(14) @ 1231620 (2021-10-26T13:30:00Z) < OhSnap(15) @ 1594680 (2022-03-01T15:00:00Z).
      • Calculate drand round number for the epoch: (epoch - drand genesis) / drand period.
    • Where network version >= OhSnap(15) @ 1594680 (2022-03-01T15:00:00Z).
      • Calculate drand round number for the epoch: (epoch - drand genesis) / drand period + 1 (as per fix).
  • Search in the BeaconEntries for the tipset to find the BeaconEntry with the matching round number.
    • If there is no match, walk back to the previous tipset and repeat the search.
    • If we have walked back 20 tipsets and not found a match, return an error.
  • Blake2b-256 hash the BeaconEntry.Data to get the randomness.
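
The V3 steps can be sketched against the calibnet data posted earlier in this thread (tipsets 1892388 and 1892391, with 1892389 and 1892390 null). The calibnet and quicknet genesis timestamps and the one-epoch offset in roundForEpoch are assumptions not stated here, chosen because they reproduce the quoted rounds; the tipSet model and function names are hypothetical, only the later "+ 1" formula is shown, and hashing is omitted:

```go
package main

import (
	"errors"
	"fmt"
)

const (
	calibnetGenesis = 1667326380 // assumed calibnet genesis (Unix seconds)
	epochSeconds    = 30
	quicknetGenesis = 1692803367 // assumed drand quicknet genesis (Unix seconds)
	quicknetPeriod  = 3
	maxWalkback     = 20
)

type beaconEntry struct {
	Round uint64
}

// tipSet is a flattened, hypothetical model of a tipset and its shared
// BeaconEntries array.
type tipSet struct {
	Height  int64
	Entries []beaconEntry
}

// roundForEpoch applies the "+ 1" form of the formula above. The timestamp
// used is one epoch period before the epoch's own time (an assumption).
func roundForEpoch(epoch int64) uint64 {
	ts := calibnetGenesis + epoch*epochSeconds - epochSeconds
	if ts < quicknetGenesis {
		return 1
	}
	return uint64((ts-quicknetGenesis)/quicknetPeriod) + 1
}

// entryV3 starts at the first tipset at or after epoch and searches for the
// exact round, examining at most maxWalkback tipsets going backwards.
// chain must be sorted by ascending height.
func entryV3(chain []tipSet, epoch int64) (beaconEntry, error) {
	want := roundForEpoch(epoch)
	start := -1
	for i := range chain {
		if chain[i].Height >= epoch {
			start = i
			break
		}
	}
	if start < 0 {
		return beaconEntry{}, errors.New("epoch is beyond the chain head")
	}
	for i := start; i >= 0 && start-i < maxWalkback; i-- {
		for _, e := range chain[i].Entries {
			if e.Round == want {
				return e, nil
			}
		}
	}
	return beaconEntry{}, fmt.Errorf("round %d not found on chain", want)
}

func main() {
	chain := []tipSet{
		{Height: 1892388, Entries: []beaconEntry{{Round: 10431542}}},
		{Height: 1892391, Entries: []beaconEntry{{Round: 10431552}, {Round: 10431562}, {Round: 10431572}}},
	}
	for epoch := int64(1892388); epoch <= 1892391; epoch++ {
		e, err := entryV3(chain, epoch)
		fmt.Println(epoch, e.Round, err)
	}
}
```

With this rule the two null epochs resolve to rounds 10431552 and 10431562, the same rounds BeaconGetEntry reported for them above, rather than to whatever entry happens to be last on a neighbouring tipset.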

@rvagg
Member Author

rvagg commented Aug 30, 2024

Some more comments @ filecoin-project/FIPs#1051 (comment)

I would like to test this if we end up implementing this call as chain-beacon-lookup using the existing logic:

I don't have the means to test this at hand, but I believe that calling this for epoch 892765, a null round will get you drand round 988608 instead of 988609 [the tipset-backward rule], while epoch 1231599 will get you drand round 1327444 instead of 1327443 [the tipset-forward rule]

--

I think the path forward here is twofold:

  1. Use chain beacon entries for historical lookups <= chain head
  2. Use the live beacon for anything > chain head, including the blocking behaviour.
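
Roughly what I have in mind for that split, as a sketch only. chainSource and liveSource are hypothetical stand-ins for the chain lookup and the drand client, and demoLookup just wires them to fakes so the behaviour can be seen; none of this is the actual Lotus API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type beaconEntry struct {
	Round uint64
}

// chainSource and liveSource are hypothetical stand-ins for the on-chain
// lookup and the live drand client.
type chainSource func(epoch int64) (beaconEntry, error)
type liveSource func(ctx context.Context, epoch int64) <-chan beaconEntry

// getBeaconEntry uses the chain for epochs at or below the head and blocks
// on the live beacon for anything newer, until ctx is done.
func getBeaconEntry(ctx context.Context, head, epoch int64, chain chainSource, live liveSource) (beaconEntry, error) {
	if epoch <= head {
		return chain(epoch)
	}
	select {
	case e, ok := <-live(ctx, epoch):
		if !ok {
			return beaconEntry{}, errors.New("beacon closed without producing an entry")
		}
		return e, nil
	case <-ctx.Done():
		return beaconEntry{}, ctx.Err()
	}
}

// demoLookup runs getBeaconEntry against fake sources: the fake chain answers
// immediately, the fake beacon takes 20ms per epoch past the head.
func demoLookup(head, epoch, timeoutMillis int64) (uint64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()

	chain := func(e int64) (beaconEntry, error) {
		return beaconEntry{Round: uint64(1000 + e)}, nil
	}
	live := func(ctx context.Context, e int64) <-chan beaconEntry {
		out := make(chan beaconEntry, 1) // buffered so the goroutine never blocks
		go func() {
			time.Sleep(time.Duration(e-head) * 20 * time.Millisecond)
			out <- beaconEntry{Round: uint64(2000 + e)}
		}()
		return out
	}
	entry, err := getBeaconEntry(ctx, head, epoch, chain, live)
	return entry.Round, err
}

func main() {
	fmt.Println(demoLookup(100, 90, 500))  // historical: served from the fake chain
	fmt.Println(demoLookup(100, 101, 500)) // future: blocks briefly on the fake beacon
	fmt.Println(demoLookup(100, 200, 10))  // too far ahead: the context times out
}
```

The miner's availability check would still go through the live-beacon branch, since it always asks for an epoch past its mining base.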

The outstanding question I think is whether to use StateGetRandomnessFromBeacon for that historical look-up or roll a new one based on the >= Chocolate(14) 1231620 behaviour such that it should get you the right round as long as we can find it in the chain, i.e. you figure out what round number you want and then go fishing for that precise number. I'm just not sure how much it really matters for this external API, who's using it like this and who's going to be looking up such deep history? If anything I would imagine people using this to match the behaviour of the syscall, since you can't use StateGetRandomnessFromBeacon to match what the syscall is going to give you (the DomainSeparationTag gets in the way).

@rvagg
Member Author

rvagg commented Sep 2, 2024

Continuing my comment @ filecoin-project/FIPs#1051 (comment) about the 20 tipset walkback for >=nv14 if it doesn't find a matching round: I think the 20 is a mistake and probably pointless. #7376 introduced it but I think the logic is kind of copied from the GetLatestBeaconEntry code (which was moved from chain/store/rand.go to chain/store/store.go). In that case, it keeps looking for up to 20 tipsets previous if there are no beacon entries. I'm not sure why there wouldn't be; as far as I can tell there should always be at least one, so maybe this is a historical blip. Either way, for >=nv14, if you ask for a null round epoch then it'll fetch the next available tipset and then search through its beacons for the one you want. According to chain logic, the one you want should be there and if it's not then it should be an error. The 20 only delays that error while we walk back 20 tipsets in which we shouldn't find the matching Round anyway.

rvagg added a commit that referenced this issue Sep 3, 2024
Ref: #12414

Previously StateGetBeaconEntry would always try and use a drand beacon to get
the appropriate round. But as drand has shut down old beacons and we've
removed client details from Lotus, it has stopped working for historical
beacons.
This fix restores historical beacon entries by using the on-chain lookup,
however it now follows the rules used by StateGetRandomnessFromBeacon and the
get_beacon_randomness syscall which has some quirks with null rounds prior to
nv14. See #12414 (comment)
for specifics.

StateGetBeaconEntry still blocks for future epochs and uses live drand beacon
clients to wait for and fetch rounds as they are available.
rvagg added a commit that referenced this issue Sep 3, 2024
Fixes: #12414

rvagg added a commit that referenced this issue Sep 3, 2024
…pochs

Fixes: #12414

Stebalien assigned rvagg and unassigned Stebalien Sep 7, 2024
@rvagg
Member Author

rvagg commented Sep 9, 2024

This will be closed when I can get a review to get #12428 landed

rvagg added a commit that referenced this issue Sep 12, 2024
…pochs

Fixes: #12414

rvagg added a commit that referenced this issue Sep 17, 2024
…pochs

Fixes: #12414
