
feat: n historical states #6008

Closed

wants to merge 42 commits

Conversation

@twoeths (Contributor) commented Sep 29, 2023

Motivation

  • It's not sustainable to add more states to our cache; lodestar is not able to stay alive during a long non-finality period
  • Also, when memory is high, lodestar has bad performance due to a lot of gc calls (specifically in the state transition function)
  • We're now able to restore a state from bytes in a way that keeps only 1 base state across lodestar

Description

  • New loadState api to load state bytes into an existing TreeBacked state

    • Since it mostly shares the validators array, memory only increases by ~100MB
    • Time to load a state and do the hashTreeRoot is 2s to 3s with a mainnet state
  • Checkpoint state cache

    • Only store up to maxEpochsInMemory (default = 2) epochs; beyond that, persist the states to disk. Right now we persist 2 checkpoint states (spec + non-spec) to disk per epoch
    • Maintain an order of states to choose a good state to prune from memory (and persist to disk)
      • Persist to the file system through FilePersistentApis or to the db through DbPersistentApis, with cleanup at startup
      • By default, persist to the db
    • The cache may contain either a CachedBeaconState or a persistent key (see the sketch after this list)
    • Find the closest state and reload it from disk in the "getOrReload" apis based on the persistent key
    • Also maintain the plain "get" apis
    • Maintain the old implementation through MemoryCheckpointStateCache and the CheckpointStateCache interface
  • State cache

    • Store up to maxStates states (default = 32, but 4 worked on Holesky)
    • Maintain an order of states to make it easier to select a state to prune. The head state always stays on top
    • Maintain the old implementation through StateContextCache and the BlockStateCache interface
  • Regen

    • With reload capability, we can regen from whatever checkpoint state we have persisted. A reload takes 2s - 3s and rarely happens
      • For important flows, like "getPreState" or "updateHeadState", use the "getOrReload" apis which may reload checkpoint states from disk
      • For other flows, use the normal "get" apis
  • Feature flag:

    • nHistoricalStates flag to use the new PersistentCheckpointStateCache and LRUBlockStateCache
    • By default, go with the old implementation - MemoryCheckpointStateCache and StateContextCache
  • Shuffling cache

    • Populate the cache when we cross an epoch boundary
    • Used when reloading a state from disk, and for attestation verification
    • This is technically not part of this PR but a prerequisite; will make a separate PR for it
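
The sketch below illustrates the checkpoint state cache shape described above: each entry is either an in-memory state or a persistent key pointing at serialized bytes, and the "getOrReload" path promotes a persisted entry back into memory by reloading it from a close seed state. The type and method names here are simplified stand-ins for the PR's CachedBeaconStateAllForks, PersistentApis and loadState, not the actual implementation.

```ts
// Minimal sketch, not the real lodestar types: a cache entry is either a state in memory
// or a persistent key (db key / file path) of the serialized state.
type CachedStateSketch = {epoch: number; rootHex: string};
type PersistentKey = string;
type CacheItem = CachedStateSketch | PersistentKey;

interface PersistentApisSketch {
  read(key: PersistentKey): Promise<Uint8Array | null>;
}

class PersistentCheckpointStateCacheSketch {
  private readonly cache = new Map<string, CacheItem>();

  constructor(
    private readonly persistentApis: PersistentApisSketch,
    // stands in for loadState(config, seedState, stateBytes) + hashTreeRoot
    private readonly reloadFromBytes: (seed: CachedStateSketch, bytes: Uint8Array) => CachedStateSketch,
    private readonly findClosestInMemoryState: (cpKey: string) => CachedStateSketch
  ) {}

  // Cheap path: only returns states that are already in memory.
  get(cpKey: string): CachedStateSketch | null {
    const item = this.cache.get(cpKey);
    return item !== undefined && typeof item !== "string" ? item : null;
  }

  // Expensive path: may read persisted bytes and rebuild the state from a close seed state.
  async getOrReload(cpKey: string): Promise<CachedStateSketch | null> {
    const item = this.cache.get(cpKey);
    if (item === undefined) return null;
    if (typeof item !== "string") return item; // already in memory
    const bytes = await this.persistentApis.read(item);
    if (bytes === null) return null;
    const reloaded = this.reloadFromBytes(this.findClosestInMemoryState(cpKey), bytes);
    this.cache.set(cpKey, reloaded); // promote back to memory
    return reloaded;
  }
}
```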

Test result on Holesky

  • This branch was able to survive the non-finality period on Holesky with the default heap size config

    • Node 1 (maxEpochsInMemory=2 & maxStates=32): no reload happened
    • Node 2 (maxEpochsInMemory=0 & maxStates=4): reloaded > 15 times, average reload duration = 2s (this does not include the hashTreeRoot)
  • Used heap is reduced based on maxEpochsInMemory (for the checkpoint cache) and maxStates (for the state cache)

| tested node | used heap (GB) |
| --- | --- |
| regular holesky (when stable) | 4.7 to 5.7 |
| 2 epochs - 32 maxStates | 3.2 to 3.8 |
| 1 epoch - 16 maxStates | 2.95 to 3.7 |
| 0 epochs - 4 maxStates | 2.8 to 3.4 |

Some downsides with this approach

  • We are not able to access the finalized tree-backed state; we can only get the finalized state bytes from file
  • We have to store 2 checkpoint states per epoch, which takes 750ms * 2 = 1.5s. We chose to persist to disk at the last 1/3 of the 1st slot of the epoch (see the sketch below), but this could be improved to store only the spec checkpoints
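
A small sketch of the "last 1/3 of the slot" timing check mentioned above; SECONDS_PER_SLOT and the helper name are assumptions for illustration, not the actual clock code in the PR.

```ts
// Hypothetical helper: true once the wall clock is in the last 1/3 of the current slot.
const SECONDS_PER_SLOT = 12; // mainnet/holesky preset value, assumed here

function isLastThirdOfSlot(genesisTimeSec: number, nowSec: number): boolean {
  const secIntoSlot = (nowSec - genesisTimeSec) % SECONDS_PER_SLOT;
  return secIntoSlot >= (2 * SECONDS_PER_SLOT) / 3;
}
```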

part of #5968

TODOs

  • Use the shuffling cache for attestation verification
  • At each epoch we store both the spec checkpoint and the non-spec checkpoint and persist both of them. May want to only persist the spec checkpoint (it takes 750ms to 900ms to serialize 1 Holesky state)
  • Investigate this issue: with the default config of maxEpochsInMemory=2 and maxStates=32, heap size is a little smaller but rss increased more compared to unstable => this is due to the subscribeAllSubnets flag
  • Find the best default config => based on used heap, 2 epochs - 32 maxStates seems to be the best default config
  • Confirm same state type before reloading a state => modified loadState to bypass this check
  • Investigate the high state transition time and memory issue on the beta mainnet node => this is due to the subscribeAllSubnets flag, and partially ssz
  • Refactor to have a feature flag to turn this feature on/off as it may take weeks to test
  • Track epochs in memory and epochs on disk
  • Test and fix bugs

ensureDir,
};

const CHECKPOINT_STATES_FOLDER = "./unfinalized_checkpoint_states";
Contributor: why write in a random directory instead of writing into the db?

Contributor (Author):

at the beginning, I started with writing checkpoint states to fs to make it easier to debug; I was also worried about db size increase because 2 states are persisted per epoch.

Now I have switched to persisting 1 state per epoch, made the 2 persistence options db and fs, and made them configurable through a cli option. Since persisting to db does not seem to increase its size a lot, I will make the db option the default.
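
For illustration, a hedged sketch of what the two persistence backends discussed here could look like; FilePersistentApis and DbPersistentApis are the PR's names, but the interface and method set below are assumptions, not the actual code.

```ts
import {mkdir, readFile, writeFile, rm} from "node:fs/promises";

// Assumed minimal interface shared by the fs-backed and db-backed persistence options.
interface PersistentApisSketch {
  write(key: string, stateBytes: Uint8Array): Promise<void>;
  read(key: string): Promise<Uint8Array | null>;
  remove(key: string): Promise<void>;
}

class FilePersistentApisSketch implements PersistentApisSketch {
  constructor(private readonly folder: string) {}

  async write(key: string, stateBytes: Uint8Array): Promise<void> {
    await mkdir(this.folder, {recursive: true}); // no-op if the folder already exists
    await writeFile(`${this.folder}/${key}.ssz`, stateBytes);
  }

  async read(key: string): Promise<Uint8Array | null> {
    try {
      return await readFile(`${this.folder}/${key}.ssz`);
    } catch {
      return null; // treat a missing file as a cache miss
    }
  }

  async remove(key: string): Promise<void> {
    await rm(`${this.folder}/${key}.ssz`, {force: true});
  }
}
```

A db-backed implementation would have the same shape, writing the bytes under a dedicated bucket instead of a folder.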

@twoeths (Contributor, Author) commented Oct 9, 2023

ready for review; I would like to confirm the approach before I split this into smaller PRs to make it easier to review

@@ -35,8 +33,8 @@ import {from} from "multiformats/dist/types/src/bases/base.js";
*/
export class PersistentCheckpointStateCache implements CheckpointStateCache {
private readonly cache: MapTracker<string, CachedBeaconStateAllForks | StateFile>;
Contributor:
While typing the Map value as CachedBeaconStateAllForks | string works, you could be more explicit by defining a wrapper object. The number of items in this map is low (< 1000) so the wrapper does not add any performance penalty but helps readability and maintenance.

Something like:

private readonly cache: MapTracker<string, StateCacheItem>;

type StateCacheItem =
  | {type: StateCacheItemType.filePath, filepath: string}
  | {type: StateCacheItemType.cachedBeaconState, value: CachedBeaconStateAllForks}

Then in the consumer code you can be explicit about what to do in each case
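
A hedged sketch of what that consumer-side handling could look like; the enum values, the stand-in state type and the helper are illustrative, not code from the PR.

```ts
// Stand-in so the sketch is self-contained; the real type lives in lodestar.
type CachedBeaconStateAllForks = {slot: number};

enum StateCacheItemType {
  filePath = "file_path",
  cachedBeaconState = "cached_beacon_state",
}

type StateCacheItem =
  | {type: StateCacheItemType.filePath; filepath: string}
  | {type: StateCacheItemType.cachedBeaconState; value: CachedBeaconStateAllForks};

// The discriminant makes each branch explicit, instead of sniffing for a string value.
function describe(item: StateCacheItem): string {
  if (item.type === StateCacheItemType.filePath) {
    return `persisted state at ${item.filepath}`;
  }
  return `in-memory state at slot ${item.value.slot}`;
}
```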

this.maxEpochsInMemory = opts.maxEpochsInMemory;
// Specify different persistentApis for testing
this.persistentApis = persistentApis ?? FILE_APIS;
this.shufflingCache = shufflingCache;
this.getHeadState = getHeadState;
this.inMemoryKeyOrder = new LinkedList<string>();
this.inMemoryEpochs = new Set();
void ensureDir(CHECKPOINT_STATES_FOLDER);
Contributor:
This function can throw an error; don't void it

Comment on lines +116 to +117
* Get a state from cache, it will reload from disk.
* This is expensive api, should only be called in some important flows:
Contributor:
Suggested change
* Get a state from cache, it will reload from disk.
* This is expensive api, should only be called in some important flows:
* Get a state from cache, it may reload from disk.
* This is an expensive api, should only be called in some important flows:

this.logger.debug("Error reloading state from disk", {persistentKey}, e as Error);
return null;
}
return null;
Contributor:
unreachable code

*/
export async function getStateForAttestationVerification(
export async function getShufflingForAttestationVerification(
Contributor:
What happens if this function is called twice at the same time for a checkpoint that requires regen? What happens if we ever raise the concurrency of regen queue to more than 1?

I think there should be an added layer of caching in the shuffling cache tracking active regen requests as pending promises. When multiple callers want the same pending shuffle, return the promise to all to always guarantee that there will never be repeated regen for attestation verification.

Also, this code is extremely sensitive and can blow up Lodestar easily. Placing this code in the /validation folder as a random function hides how critical it is. This code should live closer to regen and have the locks explained in the paragraph above baked into it, with comments explaining their cost.
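
A minimal sketch of the pending-promise layer suggested here: concurrent callers asking for the same shuffling share one in-flight computation, so regen never runs twice for the same key. Names and types are illustrative, not the shuffling cache API.

```ts
type ShufflingSketch = {epoch: number; activeIndices: number[]}; // stand-in type

class ShufflingCacheSketch {
  private readonly resolved = new Map<string, ShufflingSketch>();
  private readonly pending = new Map<string, Promise<ShufflingSketch>>();

  async getOrBuild(key: string, build: () => Promise<ShufflingSketch>): Promise<ShufflingSketch> {
    const cached = this.resolved.get(key);
    if (cached !== undefined) return cached;

    // If a regen for this key is already in flight, hand every caller the same promise.
    const inFlight = this.pending.get(key);
    if (inFlight !== undefined) return inFlight;

    const promise = build()
      .then((shuffling) => {
        this.resolved.set(key, shuffling);
        return shuffling;
      })
      .finally(() => {
        this.pending.delete(key);
      });
    this.pending.set(key, promise);
    return promise;
  }
}
```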

Contributor (Author):
@dapplion thanks, this is a great point 👍, I addressed it in #6030

Comment on lines +593 to +624
let shufflingDependentRoot: RootHex;
if (blockEpoch === attEpoch) {
// current shuffling, this is equivalent to `headState.currentShuffling`
// given blockEpoch = attEpoch = n
// epoch: (n-2) (n-1) n (n+1)
// |-------|-------|-------|-------|
// attHeadBlock ------------------------^
// shufflingDependentRoot ------^
shufflingDependentRoot = chain.forkChoice.getDependentRoot(attHeadBlock, EpochDifference.previous);
} else if (blockEpoch === attEpoch - 1) {
// next shuffling, this is equivalent to `headState.nextShuffling`
// given blockEpoch = n-1, attEpoch = n
// epoch: (n-2) (n-1) n (n+1)
// |-------|-------|-------|-------|
// attHeadBlock -------------------^
// shufflingDependentRoot ------^
shufflingDependentRoot = chain.forkChoice.getDependentRoot(attHeadBlock, EpochDifference.current);
} else if (blockEpoch < attEpoch - 1) {
// this never happens with default chain option of maxSkipSlots = 32, however we still need to handle it
// check the verifyHeadBlockAndTargetRoot() function above
// given blockEpoch = n-2, attEpoch = n
// epoch: (n-2) (n-1) n (n+1)
// |-------|-------|-------|-------|
// attHeadBlock -----------^
// shufflingDependentRoot -----^
shufflingDependentRoot = attHeadBlock.blockRoot;
// use lodestar_gossip_attestation_head_slot_to_attestation_slot metric to track this case
} else {
// blockEpoch > attEpoch
// should not happen, handled in verifyAttestationTargetRoot
throw Error(`attestation epoch ${attEpoch} is before head block epoch ${blockEpoch}`);
}
Contributor:
Move all of this to a separate function, like getShufflingDependentRoot. When you have an if-else assignment pattern like this, it's always a good indication to move the code into a function and use the return flow control
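
A sketch of the extraction being suggested, using return flow control; the stand-in types below (RootHex, ProtoBlock, EpochDifference, ForkChoiceSketch) mirror the names in the quoted code but are simplified here, and this is not the actual change made in #6030.

```ts
type RootHex = string;
enum EpochDifference {
  current = 0,
  previous = 1,
}
type ProtoBlock = {blockRoot: RootHex};
type ForkChoiceSketch = {getDependentRoot(block: ProtoBlock, epochDiff: EpochDifference): RootHex};

function getShufflingDependentRoot(
  forkChoice: ForkChoiceSketch,
  attHeadBlock: ProtoBlock,
  blockEpoch: number,
  attEpoch: number
): RootHex {
  if (blockEpoch === attEpoch) {
    // current shuffling: dependent root is the last block of epoch attEpoch - 1
    return forkChoice.getDependentRoot(attHeadBlock, EpochDifference.previous);
  }
  if (blockEpoch === attEpoch - 1) {
    // next shuffling: dependent root is the last block of epoch attEpoch - 2
    return forkChoice.getDependentRoot(attHeadBlock, EpochDifference.current);
  }
  if (blockEpoch < attEpoch - 1) {
    // attestation head block is more than one epoch old: the block itself is the dependent root
    return attHeadBlock.blockRoot;
  }
  // blockEpoch > attEpoch: should not happen, handled in verifyAttestationTargetRoot
  throw Error(`attestation epoch ${attEpoch} is before head block epoch ${blockEpoch}`);
}
```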

Contributor (Author):
addressed in #6030 👍

*/
export function getAttestationDataSigningRoot(config: BeaconConfig, data: phase0.AttestationData): Uint8Array {
const slot = computeStartSlotAtEpoch(data.target.epoch);
// previously, we call `domain = config.getDomain(state.slot, DOMAIN_BEACON_ATTESTER, slot)`
Contributor:
I'm not convinced this change is correct

@@ -59,6 +59,7 @@ export enum Bucket {
// 54 was for bestPartialLightClientUpdate, allocate a fresh one
// lightClient_bestLightClientUpdate = 55, // SyncPeriod -> LightClientUpdate // DEPRECATED on v1.5.0
lightClient_bestLightClientUpdate = 56, // SyncPeriod -> [Slot, LightClientUpdate]
allForks_checkpointState = 57, // Root -> allForks.BeaconState
Contributor:
use the lowest available bucket

@dapplion (Contributor) commented Oct 9, 2023

@tuyennhv continued the review on a plane without internet, dumping inline comments taken on the code, please excuse the different format

lion@lion-tp:~/code/chainsafe/lodestar$ git diff
diff --git a/packages/beacon-node/src/chain/prepareNextSlot.ts b/packages/beacon-node/src/chain/prepareNextSlot.ts
index 42003052da..34e91bbc1f 100644
--- a/packages/beacon-node/src/chain/prepareNextSlot.ts
+++ b/packages/beacon-node/src/chain/prepareNextSlot.ts
@@ -84,6 +84,9 @@ export class PrepareNextSlotScheduler {
           clockSlot,
         });
         // It's important to still do this to get through Holesky unfinality time of low resouce nodes
+        // This sentence needs much more context, just saying it's important is not enough. Link to an issue or
+        // PR with the full context and summarize in 1 sentence here. Explain why this function appears two
+        // times in the same function.
         await this.prunePerSlot(clockSlot);
         return;
       }
@@ -176,6 +179,9 @@ export class PrepareNextSlotScheduler {
       }
 
       // do this after all as it's the lowest priority task
+      // LION: Calling this function does not have anything to do really with "prepareNextSlot". The scope creep of this function
+      // is bad, basically being a dump of every action that must be performed routinely of-slot. This class a whole is kind of
+      // useless it holds no state, and just subscribes to an event to run this function.
       await this.prunePerSlot(clockSlot);
     } catch (e) {
       if (!isErrorAborted(e) && !isQueueErrorAborted(e)) {
@@ -191,10 +197,12 @@ export class PrepareNextSlotScheduler {
    * However, it's not likely `inMemoryEpochs` is configured as 0, and this scenario rarely happen
    * since we only use `inMemoryEpochs = 0` for testing, if it happens it's a good thing because it helps us test the reload flow
    */
+  // This function only prunes checkpointStates, call it pruneCheckpointStateCache, rename to a generic "pruneX" latter when multiuse
   private prunePerSlot = async (clockSlot: Slot): Promise<void> => {
     // a contabo vpss can have 10-12 holesky epoch transitions per epoch when syncing, stronger node may have more
     // it's better to prune at the last 1/3 of every slot in order not to cache a lot of checkpoint states
     // at synced time, it's likely we only prune at the 1st slot of epoch, all other prunes are no-op
+    // LION: Where do you enforce that this function runs at the last 1/3 of the slot?
     const pruneCount = await this.chain.regen.pruneCheckpointStateCache();
     this.logger.verbose("Pruned checkpoint state cache", {clockSlot, pruneCount});
   };
diff --git a/packages/beacon-node/src/chain/stateCache/persistentCheckpointsCache.ts b/packages/beacon-node/src/chain/stateCache/persistentCheckpointsCache.ts
index 7ddf57eb2b..5f9b43799a 100644
--- a/packages/beacon-node/src/chain/stateCache/persistentCheckpointsCache.ts
+++ b/packages/beacon-node/src/chain/stateCache/persistentCheckpointsCache.ts
@@ -143,10 +143,13 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
       this.logger.verbose("Reload: read state failed", {persistentKey});
       return null;
     }
+    // Add white space to this function, space code out with comments where necessary
     this.logger.verbose("Reload: read state successfully", {persistentKey});
     this.metrics?.stateRemoveCount.inc({reason: RemovePersistedStateReason.reload});
+    // Why is the clock maybe nullish? Why does it use the current slot? If some value is nullish, what's the point of observing zero?
     this.metrics?.stateReloadSecFromSlot.observe(this.clock?.secFromSlot(this.clock?.currentSlot ?? 0) ?? 0);
     const closestState = findClosestCheckpointState(cp, this.cache) ?? this.getHeadState?.();
+    // Is this a possible state? If getHeadState always return a value, not
     if (closestState == null) {
       throw new Error("No closest state found for cp " + toCheckpointKey(cp));
     }
@@ -171,6 +174,9 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
       // don't prune from memory here, call it at the last 1/3 of slot 0 of an epoch
       return newCachedState;
     } catch (e) {
+      // This is not true, and this broad try catch for so many different steps is very bad.
+      // There should probably not be a try catch here at all. If it must be, do something meaningful,
+      // and wrap individual steps. Not all code from `loadCachedBeaconState` until end of function.
       this.logger.debug("Error reloading state from disk", {persistentKey}, e as Error);
       return null;
     }
@@ -279,6 +285,7 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
     // sort epochs in descending order, only consider epochs lte `epoch`
     const epochs = Array.from(this.epochIndex.keys())
       .sort((a, b) => b - a)
+    // filter first
       .filter((e) => e <= maxEpoch);
     for (const epoch of epochs) {
       if (this.epochIndex.get(epoch)?.has(rootHex)) {
@@ -288,6 +295,7 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
             return state;
           }
         } catch (e) {
+          // Why would this fail? Blind try catches are not okay
           this.logger.debug("Error get or reload state", {epoch, rootHex}, e as Error);
         }
       }
@@ -316,6 +324,7 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
   pruneFinalized(finalizedEpoch: Epoch): void {
     for (const epoch of this.epochIndex.keys()) {
       if (epoch < finalizedEpoch) {
+        // This can overwhelm the underlying persistent API by requesting many delete operations at once
         this.deleteAllEpochItems(epoch).catch((e) =>
           this.logger.debug("Error delete all epoch items", {epoch, finalizedEpoch}, e as Error)
         );
@@ -385,6 +394,7 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
       }
       if (firstEpoch === undefined) {
         // should not happen
+        // LION: why not? explain
         throw new Error("No epoch in memory");
       }
       // first loop to check if the 1st slot of epoch is a skipped slot or not
@@ -408,6 +418,7 @@ export class PersistentCheckpointStateCache implements CheckpointStateCache {
       // if not found firstSlotBlockRoot, first slot of state is skipped, we should persist the Previous Root Checkpoint State, where the root
       // is the last block slot root of pervious epoch. In this case Previous Root Checkpoint State would become the justified/finalized state.
       for (const rootHex of this.epochIndex.get(firstEpoch) ?? []) {
+        // rewrite this with direct assignments to toPersist and toDelete, no let
         let toPersist = false;
         let toDelete = false;
         if (firstSlotBlockRoot === undefined) {
@@ -487,6 +498,9 @@ export function findClosestCheckpointState(
   cp: CheckpointHex,
   cache: Map<string, CachedBeaconStateAllForks | PersistentKey>
 ): CachedBeaconStateAllForks | null {
+  // LION: This function must handle checkpoints on different branches
+  // Just picking the state with the closest epoch does not work in really bad cases
+  // Must add unit tests for this
   let smallestEpochDiff = Infinity;
   let closestState: CachedBeaconStateAllForks | null = null;
   for (const [key, value] of cache.entries()) {
diff --git a/packages/state-transition/src/util/loadState.ts b/packages/state-transition/src/util/loadState.ts
index 18eac53e6f..5d6a096585 100644
--- a/packages/state-transition/src/util/loadState.ts
+++ b/packages/state-transition/src/util/loadState.ts
@@ -32,6 +32,7 @@ export function loadState(
   const fieldRanges = stateType.getFieldRanges(dataView, 0, stateBytes.length);
   const allFields = Object.keys(stateType.fields);
   const validatorsFieldIndex = allFields.indexOf("validators");
+  // This call for a mainnet state is not free, why is it necessary to initialize all vectors to zero?
   const migratedState = stateType.defaultViewDU();
   // validators is rarely changed
   const validatorsRange = fieldRanges[validatorsFieldIndex];
@@ -42,7 +43,9 @@ export function loadState(
   );
   // inactivityScores
   // this takes ~500 to hashTreeRoot while this field is rarely changed
+  // LION: 500 what? use units
   const fork = getForkFromStateBytes(config, stateBytes);
+  // rename to seedStateFork
   const seedFork = config.getForkSeq(seedState.slot);
 
   let loadedInactivityScores = false;

@wemeetagain (Member) left a comment

posting some comments so far, will continue looking over things

@@ -68,6 +68,7 @@ export type StateCacheItem = {
/** Unix timestamp (ms) of the last read */
lastRead: number;
checkpointState: boolean;
persistentKey?: string;
Member:
Can you add a jsdoc comment?

@@ -0,0 +1,41 @@
import fs from "node:fs";
Member:
you can get promise versions of fs functions by using "node:fs/promises"

@dapplion (Contributor) commented Oct 10, 2023:

The file API has to go IMO. Managing files as part of the beacon node package, going around the DB, is a dangerous anti-pattern: it can break with permission issues, is not measured, and it's not proven to be more efficient than level db. It's definitely a bit of scope creep to introduce this in a sensitive change like this PR. Best to:

  • implement the core functionality of this PR re-using existing db mechanisms
  • prove that using level DB for temporal states is a problem
  • prove that persisting states as files is more efficient in disk usage

if (finalizedStateOrBytes instanceof Uint8Array) {
const slot = getStateSlotFromBytes(finalizedStateOrBytes);
await this.db.stateArchive.putBinary(slot, finalizedStateOrBytes);
this.logger.verbose("Archived finalized state bytes", {finalizedEpoch: finalized.epoch, slot, root: rootHex});
Member:
Log context is different from the one below

Suggested change
this.logger.verbose("Archived finalized state bytes", {finalizedEpoch: finalized.epoch, slot, root: rootHex});
this.logger.verbose("Archived finalized state bytes", {epoch: finalized.epoch, slot, root: rootHex});

} else {
// state
await this.db.stateArchive.put(finalizedStateOrBytes.slot, finalizedStateOrBytes);
this.logger.verbose("Archived finalized state", {epoch: finalized.epoch, root: rootHex});
Member:
Could consider logging slot here (same as above)

return toHexString(root);
}

async remove(persistentKey: PersistentKey): Promise<boolean> {
Member:
Why does this return a boolean instead of just void? It can never return false

Contributor (Author):
it's to conform to the FilePersistentApis; will check if it's really necessary to return a boolean

this.logger.verbose("Reload: read state successfully", {persistentKey});
this.metrics?.stateRemoveCount.inc({reason: RemovePersistedStateReason.reload});
this.metrics?.stateReloadSecFromSlot.observe(this.clock?.secFromSlot(this.clock?.currentSlot ?? 0) ?? 0);
const closestState = findClosestCheckpointState(cp, this.cache) ?? this.getHeadState?.();
Contributor (Author):
TODO: consider a better strategy than picking closest state based on epoch diff

for (const i of modifiedValidators) {
migratedState.validators.set(
i,
ssz.phase0.Validator.deserializeToViewDU(
Contributor (Author):
try to reuse the unchanged properties of the old validator

Contributor:
At least re-use pubkey and withdrawal_credentials which take the most memory of all
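
To illustrate the reuse strategy discussed in this thread, here is a generic, hedged sketch of "compare serialized bytes, reuse unchanged items, deserialize only modified ones"; the real loadValidators operates on ssz ViewDU validators, so the types and helper below are stand-ins, not the PR's code.

```ts
// Illustrative only: reuse an existing element when its serialized bytes are unchanged.
function loadItems<T>(
  seedItems: T[],
  serialize: (item: T) => Uint8Array,
  deserialize: (bytes: Uint8Array) => T,
  itemsBytes: Uint8Array[]
): {items: T[]; modifiedIndices: number[]} {
  const items: T[] = [];
  const modifiedIndices: number[] = [];
  for (let i = 0; i < itemsBytes.length; i++) {
    const seed = seedItems[i];
    if (seed !== undefined && equalBytes(serialize(seed), itemsBytes[i])) {
      items.push(seed); // unchanged: share the existing object / tree nodes
    } else {
      items.push(deserialize(itemsBytes[i])); // changed or new: deserialize fresh
      modifiedIndices.push(i);
    }
  }
  return {items, modifiedIndices};
}

function equalBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}
```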

const migratedState = stateType.defaultViewDU();
// validators is rarely changed
const validatorsRange = fieldRanges[validatorsFieldIndex];
const modifiedValidators = loadValidators(
Contributor (Author):
add benchmark results and explain how we came up with this strategy

config: ChainForkConfig,
seedState: BeaconStateAllForks,
stateBytes: Uint8Array
): MigrateStateOutput {
Contributor:
Consider moving the details of type manipulation to the package types/utils/container.ts

  const migratedState = deserializeContainerIgnoreFields(stateBytes, ["validators", "inactivityScores"]);

  // validators is rarely changed
  const validatorsBytes = sliceContainerField(stateType, "validators", stateBytes)
  const modifiedValidators = loadValidators(
    migratedState,
    seedState,
    validatorsBytes,
  );

* ║ ║ ^ ^ ║
* ╚════════════════════════════════════╝═══════════════╝
*/
export class PersistentCheckpointStateCache implements CheckpointStateCache {
Contributor (Author):
write unit tests for the weird case where we do multiple epoch transitions, no blocks in the middle

}
}

// if found firstSlotBlockRoot it means it's Current Root Checkpoint State and we should only persist that checkpoint as it's the state
Contributor (Author):
TODO: even if there is a Current Root Checkpoint State, it could be from an orphaned block so maybe we still have to store both checkpoints, but we could improve by:

  • storing the Previous Root Checkpoint State with state bytes
  • storing the Current Root Checkpoint State with only the root of the Previous Root Checkpoint State and the block at slot 0 of the epoch, and doing a replay when we need it

} else {
this.epochIndex.set(epoch, new Set([key]));
}
this.keyOrder.unshift(key);
Contributor (Author):
insert at the 2nd position; the 1st position is reserved for the head

const {epoch} = fromCheckpointKey(key);
if (isPersistentKey(stateOrPersistentKey)) {
persistCount++;
memoryEpochs.add(epoch);
Contributor (Author):
Suggested change
memoryEpochs.add(epoch);
persistentEpochs.add(epoch);

this.logger.verbose("Processed shuffling for next epoch", {parentEpoch, blockEpoch, slot: block.message.slot});

// This is the real check point state per spec because the root is in current epoch
// it's important to add this to cache, when chain is finalized we'll query this state later
const checkpointState = postState;
const cp = getCheckpointFromState(checkpointState);
Contributor (Author):
put this inside the if (block.message.slot % SLOTS_PER_EPOCH === 0) { condition below, otherwise we will get "Block error slot=452161 error=Checkpoint state slot must be first in an epoch" errors, as monitored on the test branch

// this is usually added when we prepare for next slot or validate gossip block
// then when we process the 1st block of epoch, we don't have to do state transition again
// This adds Previous Root Checkpoint State to the checkpoint state cache
// This may becomes the "official" checkpoint state if the 1st block of epoch is skipped
const checkpointState = postState;
const cp = getCheckpointFromState(checkpointState);
checkpointStateCache.add(cp, checkpointState);
Contributor (Author):
Suggested change
checkpointStateCache.add(cp, checkpointState);
if (shouldReload) {
checkpointStateCache.add(cp, checkpointState);
}

@twoeths (Contributor, Author) commented Jan 4, 2024

replaced by #6250

@twoeths closed this Jan 4, 2024
@twoeths deleted the tuyen/n_historical_states branch January 4, 2024 08:07