Not enough public IPs in farm despite having free public IPs #970

Closed
A-Harby opened this issue May 22, 2024 · 10 comments

A-Harby commented May 22, 2024

Describe the bug

Mainnet, 2.4.2.

The Dashboard gives the error "Not enough public IPs in farm" despite there being free public IPs on the farm.

To Reproduce

Steps to reproduce the behavior:
Just deploy any instance on the dashboard using IPv4.

Screenshots

[screenshots attached]

@AhmedHanafy725

On further investigation, these 2 IPs on farm ID 1 look free on GraphQL.

[screenshot: GraphQL result showing the two IPs as free]

However, they are reserved on the chain.

    {
      ip: 185.69.166.147/24
      gateway: 185.69.166.1
      contractId: 15946
    },
    {
      ip: 185.69.166.157/24
      gateway: 185.69.166.1
      contractId: 16178
    }

However, @sameh-farouk found that these contracts are deleted, yet they are still on the farm object and still reserving these IPs.

sameh-farouk self-assigned this May 29, 2024
sameh-farouk modified the milestones: 2.7.0, 2.8.0 May 29, 2024
sameh-farouk commented May 29, 2024

Update:
I attempted to debug this issue today.
I found that the calls that created and canceled these contracts were executed with runtime 116.
I inspected the codebase tagged with v2.1.4 to see if it contained any bugs, and I couldn't find anything obvious with the logic of reserving and unreserving IPs.
Digging more into what happened vs how it happened, I found that:

  • These contracts were created with 0 public IPs, so this shouldn't have triggered the reserve-IP logic.
  • There were no events releasing these IPs when these contracts were canceled, which makes sense, since these contracts had no public IPs reserved.

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftfchain.grid.tf%2Fws#/explorer/query/5844419

What could be causing these incorrect states? It's still not clear to me, but it's probably either:

  • Likely, an old migration that interfered with some of the public IPs' states
  • Less likely, a hidden bug in the runtime 116

Now, how to fix this:

  • The council can call force_reset_farm_ip to force unreserve these IPs

It's important to conduct checks to ensure that there are no similar cases with other IPs on different farms as well. I'll follow up.

@sameh-farouk

@LeeSmet @renauter Can we approve the suggested solution here so I can open tickets for ops to prepare and execute the related proposals on mainnet and free these IPs?
Otherwise, they will remain reserved forever.

renauter modified the milestones: 2.8.0, 2.9.0 Jun 3, 2024
LeeSmet commented Jun 3, 2024

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

renauter commented Jun 4, 2024

  • The council can call force_reset_farm_ip to force unreserve these IPs

I think I found the origin of the force_reset_farm_ip() creation.
See here:
#308
072bdc8

My guess is that when remove_contract() was called in some context, _free_ip() didn't work as expected.

sameh-farouk commented Jun 4, 2024

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

@LeeSmet I am currently in the process of debugging to identify the root cause of the issue. I agree with you that understanding the root cause is crucial. However, I would like to present a counter-argument to the concern that resetting these IPs could lead to more dangling state: these contracts have already been removed from the chain, and it doesn't seem logical to keep these IPs reserved for contracts that no longer exist.

Please note that the effect of the call is limited to the provided IP, and it only resets its contract id to 0. The code is straightforward, and I don’t foresee any harm in proceeding in this manner. We can certainly wait until the root cause is identified, but regardless, we will need to execute these calls eventually to free up these IPs, in addition to any other fixes.
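
For illustration, here is a minimal standalone model of that reset (struct and field names are assumed for this sketch; the real council call operates on the farm's on-chain storage):

// Simplified model of the data involved; field names are assumptions for this sketch.
#[derive(Debug)]
struct PublicIp {
    ip: String,
    contract_id: u64, // 0 means "not reserved"
}

#[derive(Debug)]
struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

// Clear the reservation for one specific IP on one farm, leaving everything else untouched.
fn force_reset_farm_ip(farm: &mut Farm, ip: &str) {
    if let Some(public_ip) = farm.public_ips.iter_mut().find(|p| p.ip == ip) {
        public_ip.contract_id = 0;
    }
}

fn main() {
    let mut farm = Farm {
        id: 1,
        public_ips: vec![
            PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
            PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
        ],
    };
    force_reset_farm_ip(&mut farm, "185.69.166.157/24");
    println!("{:?}", farm); // .157 is free again, .147 is still reserved
}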

Lastly, I would like to point out that freeing these IPs will not hinder any investigations since all activities are recorded on the blockchain. Therefore, I propose that we work in parallel by freeing these IPs while continuing to investigate this incident. This approach is particularly important as the issue partially affects deploying on mainnet.

sameh-farouk commented Jun 5, 2024

Alright, I’ve managed to find the root cause of this erroneous state.
Into the details: we begin with the IP 185.69.166.157 and contract 16178, which is reserving it on-chain.

The creation of contract 16178 occurred at block 5844419. As previously mentioned, it was created with 0 public IPs, so it shouldn't have reserved any IP.

[screenshot attached]

However, an examination of the state of farm 1 at the parent block shows that the IP was already reserved for contract 16178 before its creation.

I conducted a backward search to determine when exactly the IP was reserved for this specific contract and found that this occurred at block 5844377. I then began to scrutinize the events and calls at this block to understand how we arrived at this state after the block was finalized.

At this point in the block, the last contract ID was 16177. However, the block contained a failed transaction. This transaction attempted to create a node contract which, if successful, would have been assigned the next available ID, 16178.

[screenshot attached]

Upon examining the call arguments and decoding the error details, we can ascertain two things: the optional solution provider provided by the user was 2, and the error that caused the transaction to fail was SolutionProviderNotApproved. To verify this, I checked the solution provider with ID 2 at this block state.

{
  solutionProviderId: 2
  providers: [
    {
      who: 5HYbojcq9jBrxtuW3wfgMSuEq4G6BTMN2vW2ZDiYTKJXVEmF
      take: 50
    }
  ]
  description: zonaris is a decentralised blockchain node hosting solution built on the ThreeFold Grid. 
  link: https://www.zonaris.io/
  approved: false
}
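
For context, the validation that rejected this call presumably boils down to a check along these lines (a sketch with assumed storage and error names, except SolutionProviderNotApproved, which is the error observed above):

// Sketch only (assumed names): a contract referencing a solution provider is
// rejected unless that provider has been approved.
if let Some(provider_id) = solution_provider_id {
    let provider = SolutionProviders::<T>::get(provider_id)
        .ok_or(Error::<T>::NoSuchSolutionProvider)?;
    ensure!(provider.approved, Error::<T>::SolutionProviderNotApproved);
}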

Now, let’s piece together the final part of the puzzle: why did this failed transaction alter the state?

In older versions of Substrate, if a runtime function modified storage during an extrinsic, there was no automatic way to undo those changes; whatever had been written stayed written.

This led to a crucial “best practice” in runtime development: “verify first, write last”. This principle states that you should never return an Err from an extrinsic after you have modified storage, as those changes will persist, potentially leading to an inconsistent state.

Looking at create_node_contract: it first validates a few conditions and then calls the _create_contract function. This function starts modifying the state and reserves the IP, but only afterwards attempts to validate the solution provider. If that validation fails, it returns an error, leaving the IP reserved for a contract that was never created.

if let types::ContractData::NodeContract(ref mut nc) = contract_type {
    Self::_reserve_ip(id, nc)?; // storage is modified here: the IP is reserved
};

Self::validate_solution_provider(solution_provider_id)?; // fails with SolutionProviderNotApproved, but the reservation above persists
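
Following the "verify first, write last" rule, the same code would check the solution provider before touching storage, roughly like this (a sketch, not an actual change in the repository):

// Sketch: validating first means a failure here leaves no storage changes behind.
Self::validate_solution_provider(solution_provider_id)?;

if let types::ContractData::NodeContract(ref mut nc) = contract_type {
    Self::_reserve_ip(id, nc)?;
};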

To the best of my knowledge, this issue is no longer pertinent, as the behavior was changed in more recent releases of Substrate. Currently, the extrinsic execution logic discards any changes if an error occurs.

So in the most recent version of the TFChain runtime, this issue can't occur. It's also straightforward to reproduce and verify.
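
As a plain-Rust illustration of that transactional behavior (not Substrate code): stage the changes and commit them only if the whole operation succeeds, so a late validation failure leaves the original state untouched.

use std::collections::HashMap;

// Apply `op` to a staged copy of the state and commit it only on success,
// mirroring how modern extrinsic execution discards changes when an Err is returned.
fn transactional<E>(
    state: &mut HashMap<String, u64>,
    op: impl FnOnce(&mut HashMap<String, u64>) -> Result<(), E>,
) -> Result<(), E> {
    let mut staged = state.clone();
    op(&mut staged)?; // on Err, `staged` is dropped and `state` is untouched
    *state = staged;
    Ok(())
}

fn main() {
    // Contract id 0 means the IP is free.
    let mut state = HashMap::from([("185.69.166.157/24".to_string(), 0u64)]);

    // The operation "reserves" the IP first, then fails its validation step.
    let result: Result<(), &str> = transactional(&mut state, |s| {
        s.insert("185.69.166.157/24".to_string(), 16178);
        Err("SolutionProviderNotApproved")
    });

    assert!(result.is_err());
    // The reservation was rolled back: the IP is still free.
    assert_eq!(state["185.69.166.157/24"], 0);
    println!("state after failed call: {:?}", state);
}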

xmonader added the type_bug label Jun 5, 2024
renauter commented Jun 5, 2024

Great investigation! Thank you.

I am OK with the solution since the code is already in place (probably for the same reason) and has probably already been used to restore coherency of the storage state.

To evaluate the number of IPs which are reserved but should not be, an additional check could be added to the pallet-tfgrid live checking (#817). Then we could have a list of IPs to "free".
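
A rough sketch of what such a check could report, written as a standalone model with assumed field names (not pallet-tfgrid code): list every farm IP whose contract id is non-zero but no longer matches an existing contract.

use std::collections::HashSet;

// Simplified model of the on-chain data; field names are assumptions for this sketch.
struct PublicIp {
    ip: String,
    contract_id: u64,
}

struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

// Return (farm id, ip) pairs reserved by a contract that no longer exists.
fn dangling_reservations<'a>(
    farms: &'a [Farm],
    existing_contracts: &HashSet<u64>,
) -> Vec<(u32, &'a str)> {
    farms
        .iter()
        .flat_map(|farm| farm.public_ips.iter().map(move |ip| (farm.id, ip)))
        .filter(|(_, ip)| ip.contract_id != 0 && !existing_contracts.contains(&ip.contract_id))
        .map(|(farm_id, ip)| (farm_id, ip.ip.as_str()))
        .collect()
}

fn main() {
    let farms = vec![Farm {
        id: 1,
        public_ips: vec![
            PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
            PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
        ],
    }];
    // Neither contract exists on chain any more, so both IPs are reported.
    let existing: HashSet<u64> = HashSet::new();
    for (farm_id, ip) in dangling_reservations(&farms, &existing) {
        println!("farm {farm_id}: {ip} should be freed");
    }
}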

sameh-farouk commented Jun 6, 2024

Update:
A ticket was submitted for ops to handle freeing these two public IPs.

sameh-farouk commented Jun 6, 2024

Verified!

[screenshots attached]
