Not enough public IPs in farm despite having free public IPs #970

Closed
A-Harby opened this issue May 22, 2024 · 10 comments

A-Harby commented May 22, 2024

Describe the bug

Mainnet, 2.4.2.

The Dashboard gives the error "Not enough public IPs in farm" despite there being free public IPs on the farm.

To Reproduce

Steps to reproduce the behavior:
Just deploy any instance on the dashboard using IPv4.

Screenshots

[screenshots attached]

@AhmedHanafy725

On further investigation, these 2 IPs on farm ID 1 look free on GraphQL.

[screenshot: GraphQL result showing the two IPs as free]

However, they are reserved on the chain.

    {
      ip: 185.69.166.147/24
      gateway: 185.69.166.1
      contractId: 15946
    },
    {
      ip: 185.69.166.157/24
      gateway: 185.69.166.1
      contractId: 16178
    }

However, @sameh-farouk found that these contracts are deleted, yet they are still on the farm object and still reserving these IPs.

sameh-farouk self-assigned this May 29, 2024
sameh-farouk modified the milestones: 2.7.0, 2.8.0 May 29, 2024
sameh-farouk commented May 29, 2024

Update:
I attempted to debug this issue today.
I found that the calls that created and canceled these contracts were executed with runtime 116.
I inspected the codebase tagged with v2.1.4 to see if it contained any bugs, and I couldn't find anything obvious with the logic of reserving and unreserving IPs.
Digging more into what happened vs how it happened, I found that:

  • These contracts were created with 0 public IPs, so this shouldn't have triggered the reserve-IP logic.
  • There were no events releasing these IPs when these contracts were canceled, which makes sense, since these contracts had no public IPs reserved.

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftfchain.grid.tf%2Fws#/explorer/query/5844419

What could be causing these incorrect states? It's still not clear to me, but it's probably either:

  • Likely, an old migration that interfered with some of the public IPs' states
  • Less likely, a hidden bug in the runtime 116

Now, how to fix this:

  • The council can call force_reset_farm_ip to force unreserve these IPs

It's important to conduct checks to ensure that there are no similar cases with other IPs on different farms as well. I'll follow up.

@sameh-farouk

@LeeSmet @renauter Can we approve the suggested solution here so I can open tickets for ops to prepare and execute the related proposals on mainnet and free these IPs?
Otherwise, they will remain reserved forever.

renauter modified the milestones: 2.8.0, 2.9.0 Jun 3, 2024
LeeSmet commented Jun 3, 2024

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

renauter commented Jun 4, 2024

  • The council can call force_reset_farm_ip to force unreserve these IPs

I think I found the origin of the force_reset_farm_ip() creation.
See here:
#308
072bdc8

My guess is that when remove_contract() was called in some context, _free_ip() didn't work as expected.

sameh-farouk commented Jun 4, 2024

I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.

@LeeSmet I am currently in the process of debugging to identify the root cause of the issue. I agree with you that understanding the root cause is crucial. However, I would like to present a counter-argument to the concern that resetting these IPs could lead to more dangling state: these contracts have already been removed from the chain, and it doesn't seem logical to keep these IPs reserved for contracts that no longer exist.

Please note that the effect of the call is limited to the provided IP, and it only resets its contract id to 0. The code is straightforward, and I don’t foresee any harm in proceeding in this manner. We can certainly wait until the root cause is identified, but regardless, we will need to execute these calls eventually to free up these IPs, in addition to any other fixes.
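
For illustration, here is a minimal standalone model of that reset (struct and field names are assumed for this sketch; the real council call operates on the farm's on-chain storage):

// Simplified model of the data involved; field names are assumptions for this sketch.
#[derive(Debug)]
struct PublicIp {
    ip: String,
    contract_id: u64, // 0 means "not reserved"
}

#[derive(Debug)]
struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

// Clear the reservation for one specific IP on one farm, leaving everything else untouched.
fn force_reset_farm_ip(farm: &mut Farm, ip: &str) {
    if let Some(public_ip) = farm.public_ips.iter_mut().find(|p| p.ip == ip) {
        public_ip.contract_id = 0;
    }
}

fn main() {
    let mut farm = Farm {
        id: 1,
        public_ips: vec![
            PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
            PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
        ],
    };
    force_reset_farm_ip(&mut farm, "185.69.166.157/24");
    println!("{:?}", farm); // .157 is free again, .147 is still reserved
}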

Lastly, I would like to point out that freeing these IPs will not hinder any investigations since all activities are recorded on the blockchain. Therefore, I propose that we work in parallel by freeing these IPs while continuing to investigate this incident. This approach is particularly important as the issue partially affects deploying on mainnet.

sameh-farouk commented Jun 5, 2024

Alright, I’ve managed to find the root cause of this erroneous state.
Into the details: we begin with the IP 185.69.166.157 and contract 16178, which is reserving it on-chain.

The creation of contract 16178 occurred at block 5844419. As previously mentioned, it was created with 0 public IPs, so it shouldn't have reserved any IP.

[screenshot attached]

However, an examination of the state of farm 1 at the parent block shows that the IP was already reserved for contract 16178 before its creation.

I conducted a backward search to determine when exactly the IP was reserved for this specific contract and found that this occurred at block 5844377. I then began to scrutinize the events and calls at this block to understand how we arrived at this state after the block was finalized.

At this point in the block, the last contract ID was 16177. However, the block contained a failed transaction. This transaction attempted to create a node contract which, if successful, would have been assigned the next available ID, 16178.

[screenshot attached]

Upon examining the call arguments and decoding the error details, we can ascertain two things: the optional solution provider provided by the user was 2, and the error that caused the transaction to fail was SolutionProviderNotApproved. To verify this, I checked the solution provider with ID 2 at this block state.

{
  solutionProviderId: 2
  providers: [
    {
      who: 5HYbojcq9jBrxtuW3wfgMSuEq4G6BTMN2vW2ZDiYTKJXVEmF
      take: 50
    }
  ]
  description: zonaris is a decentralised blockchain node hosting solution built on the ThreeFold Grid. 
  link: https://www.zonaris.io/
  approved: false
}
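
For context, the validation that rejected this call presumably boils down to a check along these lines (a sketch with assumed storage and error names, except SolutionProviderNotApproved, which is the error observed above):

// Sketch only (assumed names): a contract referencing a solution provider is
// rejected unless that provider has been approved.
if let Some(provider_id) = solution_provider_id {
    let provider = SolutionProviders::<T>::get(provider_id)
        .ok_or(Error::<T>::NoSuchSolutionProvider)?;
    ensure!(provider.approved, Error::<T>::SolutionProviderNotApproved);
}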

Now, let’s piece together the final part of the puzzle: why did this failed transaction alter the state?

In older versions of Substrate, if a runtime function modified storage during an extrinsic, there was no automatic way to undo those changes; whatever had been written stayed written.

This led to a crucial “best practice” in runtime development: “verify first, write last”. This principle states that you should never return an Err from an extrinsic after you have modified storage, as those changes will persist, potentially leading to an inconsistent state.

Looking at create_node_contract: it first validates a few conditions and then calls the _create_contract function. This function starts modifying the state and reserves the IP, but only afterwards attempts to validate the solution provider. If that validation fails, it returns an error, leaving the IP reserved for a contract that was never created.

if let types::ContractData::NodeContract(ref mut nc) = contract_type {
    Self::_reserve_ip(id, nc)?; // storage is modified here: the IP is reserved
};

Self::validate_solution_provider(solution_provider_id)?; // fails with SolutionProviderNotApproved, but the reservation above persists
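
Following the "verify first, write last" rule, the same code would check the solution provider before touching storage, roughly like this (a sketch, not an actual change in the repository):

// Sketch: validating first means a failure here leaves no storage changes behind.
Self::validate_solution_provider(solution_provider_id)?;

if let types::ContractData::NodeContract(ref mut nc) = contract_type {
    Self::_reserve_ip(id, nc)?;
};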

To the best of my knowledge, this issue is no longer pertinent, as the behavior was changed in more recent releases of Substrate. Currently, the extrinsic execution logic discards any changes if an error occurs.

So in the most recent version of the TFChain runtime, this issue can't occur. It's also straightforward to reproduce and verify.
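
As a plain-Rust illustration of that transactional behavior (not Substrate code): stage the changes and commit them only if the whole operation succeeds, so a late validation failure leaves the original state untouched.

use std::collections::HashMap;

// Apply `op` to a staged copy of the state and commit it only on success,
// mirroring how modern extrinsic execution discards changes when an Err is returned.
fn transactional<E>(
    state: &mut HashMap<String, u64>,
    op: impl FnOnce(&mut HashMap<String, u64>) -> Result<(), E>,
) -> Result<(), E> {
    let mut staged = state.clone();
    op(&mut staged)?; // on Err, `staged` is dropped and `state` is untouched
    *state = staged;
    Ok(())
}

fn main() {
    // Contract id 0 means the IP is free.
    let mut state = HashMap::from([("185.69.166.157/24".to_string(), 0u64)]);

    // The operation "reserves" the IP first, then fails its validation step.
    let result: Result<(), &str> = transactional(&mut state, |s| {
        s.insert("185.69.166.157/24".to_string(), 16178);
        Err("SolutionProviderNotApproved")
    });

    assert!(result.is_err());
    // The reservation was rolled back: the IP is still free.
    assert_eq!(state["185.69.166.157/24"], 0);
    println!("state after failed call: {:?}", state);
}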

xmonader added the type_bug label Jun 5, 2024
renauter commented Jun 5, 2024

Great investigation! Thank you.

I am OK with the solution since the code is already in place (probably for the same reason) and has probably already been used to restore coherency of the storage state.

To evaluate the number of IPs which are reserved but should not be, an additional check could be added to the pallet-tfgrid live checking (#817). Then we could have a list of IPs to "free".
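
A rough sketch of what such a check could report, written as a standalone model with assumed field names (not pallet-tfgrid code): list every farm IP whose contract id is non-zero but no longer matches an existing contract.

use std::collections::HashSet;

// Simplified model of the on-chain data; field names are assumptions for this sketch.
struct PublicIp {
    ip: String,
    contract_id: u64,
}

struct Farm {
    id: u32,
    public_ips: Vec<PublicIp>,
}

// Return (farm id, ip) pairs reserved by a contract that no longer exists.
fn dangling_reservations<'a>(
    farms: &'a [Farm],
    existing_contracts: &HashSet<u64>,
) -> Vec<(u32, &'a str)> {
    farms
        .iter()
        .flat_map(|farm| farm.public_ips.iter().map(move |ip| (farm.id, ip)))
        .filter(|(_, ip)| ip.contract_id != 0 && !existing_contracts.contains(&ip.contract_id))
        .map(|(farm_id, ip)| (farm_id, ip.ip.as_str()))
        .collect()
}

fn main() {
    let farms = vec![Farm {
        id: 1,
        public_ips: vec![
            PublicIp { ip: "185.69.166.147/24".into(), contract_id: 15946 },
            PublicIp { ip: "185.69.166.157/24".into(), contract_id: 16178 },
        ],
    }];
    // Neither contract exists on chain any more, so both IPs are reported.
    let existing: HashSet<u64> = HashSet::new();
    for (farm_id, ip) in dangling_reservations(&farms, &existing) {
        println!("farm {farm_id}: {ip} should be freed");
    }
}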

sameh-farouk commented Jun 6, 2024

Update:
A ticket was submitted for ops to handle freeing these two public IPs.

sameh-farouk commented Jun 6, 2024

Verified!

[screenshots attached]
