Remove abusable code from subnet template #88

cxmplex · 2024-05-22T08:12:29Z

The default subnet code has been reused on major mainnet subnets leading to predictable issues with lack of checks towards stakeless validators, and a triggerable exception with a longer stacktrace that a normal blacklistexception.

Forcing an error in the blacklist_fn leads to an amplication of the normal blacklist response. For instance, a normal blacklist repsonse will state the short reason, i.e. not registered. The error blacklist response will include the full error message. Some error messages may be longer than others (and the ValueError one is), leading to an amplication in the normal response. The effect amplification is around 2-3x depending on the original return text of the blacklist_fn.

Utilizing these flaws, attackers were able to do two things:

Subnet 2 (missing stake check)

On subnet 2, a validators minimum stake was not being checked. As a result, a malicious miner was able to register a low stake validator, and prompt users with challenges. The purpose of these challenges was to slow miners average response time in comparison to their normal response time to a real validator. In addition, the attacker employed several sneaky methods such as monitoring other validator requests, in order to sync their request to a validators, forcing a concurrent request on the innocent miner.

As a result, I believe it is best practice to leave in a default stake blacklist as an example, and allow subnet devs to REMOVE this if they wish, not the other way around. In best practice, the safest solution should be offered as the example, not a barebones solution. If the goal was for subnet devs to inspect the blacklist function and tailor it to their use-case, this has been demonstrated to not happen on at least 6 subnets (likely more). In most subnets, a stakeless validator has no actual purpose of calling a miners synapse. In practice, this lack of a default example has been shown to cause issues in various subnets that are often not picked up on right away.

Subnet 31 (triggerable unhandled exception)

Other affected subnets with a similar issue to 31 included subnet 33, 28, 25, 16, 11, 5. Here's my leading theory on why this leads to exhaustion, rather than the normal blacklist behavior, which as @mjurbanski-reef has shown, already throws an exception normally. Let's compare both scenarios, specifically taking a look at the traceback that will be printed here:

https://github.com/opentensor/bittensor/blob/master/bittensor/axon.py#L959

Unhandled exception in blacklist_fn:

blacklist_fn -> blacklist -> dispatch

Normal handled exception stacktrace:

blacklist -> dispatch

As you can see, forcing an error in the blacklist_fn in the subnet (note not the blacklist fn in axon.py) leads to a larger stacktrace. This stacktrace would cost additional resources to gather and print, and would even possibly include stack trace information from the asyncio upstream, metagraph calls, etc. The stacktrace length should be roughly 3-5x the size due to the increased callstack.

While a normal exception thrown by blacklist with normal behavior (i.e. the person is blacklisted) leads to a much shorter stacktrace, only blacklist, then caught by dispatch.

It's my belief this is part one to why this leads to exhaustion. The second part, is the increased response size when throwing an error from the blacklist_fn. A combination of both means greater CPU/mem pressure, and greater networking pressure. When hundreds to thousands of requests are coming in, small inefficiencies like these can play a big part to exhausting your miner. From testing, removing the unhandled exception from blacklist_fn led to continued operation for the miner while being attacked, while not handling the exception and letting it pass downstream caused server reboots on various miners machines, and overall total DoS to the axon. Why this would happen? I can only guess differences in stacktrace/callstack length.

…hotkey exists if it exists in metagraph)

mjurbanski-reef

IMO it would be better just to handle these edge-cases explicitly, other wise you will get error in logs, where its clear it was just someone sending malformed requests. And if that was you, then you would also benefit from proper description of that.

neurons/miner.py

mjurbanski-reef · 2024-05-22T21:08:32Z

I must admit I don't get this

a) overhead of uncaught exception - isn't that exactly how blacklist operates? by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318
b) amplification of response - wasn't the response just string value of exception text? why was it bigger? Do you have example?

Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978
So it should make amplification even less possible

cxmplex · 2024-05-22T21:12:59Z

Logs show amplification in response size with ValueError response vs non registered rejection and increased delay in processing. After moving that line below, I experienced a 90% reduction in latency to only a few ms each request. The amount of requests was 1000s per second, so I think any additional overhead could cause issues.

…

On Wed, May 22, 2024 at 5:08 PM Maciej Urbański ***@***.***> wrote: I must admit I don't get this a) overhead of uncaught exception - isn't that exactly how blacklist operates? by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318 b) amplification of response - wasn't the response just string value of exception text? why was it bigger? Do you have example? Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978 So it should make amplification even less possible — Reply to this email directly, view it on GitHub <#88 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKRBZ2C6TY4NDES2JZGL3B3ZDUCONAVCNFSM6AAAAABIDDGRXSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRVG42DSNJUGI> . You are receiving this because you authored the thread.Message ID: ***@***.***>

Co-authored-by: Maciej Urbański <122983254+mjurbanski-reef@users.noreply.github.com>

cxmplex · 2024-05-23T00:48:40Z

Reopening soon with more comprehensive pull.

cxmplex · 2024-05-23T01:32:46Z

examples of the attacks:

2024-05-22 06:59:25.684 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49350 | 500 | None is not in list
2024-05-22 06:59:25.685 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49360 | 200 | Success
INFO: 104.54.246.177:49360 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.693 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49360 | 500 | None is not in list
2024-05-22 06:59:25.694 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49364 | 200 | Success
2024-05-22 06:59:25.701 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49364 | 500 | None is not in list
INFO: 104.54.246.177:49364 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.702 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49356 | 200 | Success
2024-05-22 06:59:25.710 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49356 | 500 | None is not in list
INFO: 104.54.246.177:49356 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.711 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49366 | 200 | Success
INFO: 104.54.246.177:49366 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.718 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49366 | 500 | None is not in list
2024-05-22 06:59:25.719 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49378 | 200 | Success
INFO: 104.54.246.177:49378 - "POST /Dummy HTTP/1.1" 500 Internal Server Error

example of missing minimum stake attack:

https://discord.com/channels/799672011265015819/1220504695404236800/1239769414585286738

cxmplex · 2024-05-23T01:38:47Z

@mjurbanski-reef

Addressed, and added a theory on why this might be. Also added in minimum stake example as well, with explanation on why I think this is needed (based on real world exploitation in sn2) and should be offered by default rather than simply mentioning that you may want to do this. This isn't a MAY, it's a MUST, in my opinion, hence the change in the documentation as well, unless otherwise demonstrated in their subnet design pattern.

Anyways, you shouldn't have unhandled exceptions (especailly of generic types like ValueErorr) that can be triggered by users even if they are effectively caught upstream.

mjurbanski-reef

Minimum stake and other blacklisting options are great and needed additions, no doubt about that :)
As for "response size amplification" if we talk about difference measure in <100Bytes, that is for sure not what caused the initial problem - and it will be hard to tell without an example of logs from miner that was DoS into restart with logs showing what was happening right before that.
In fact, the ~50B responses size shown by logs (it's not real response size, just size of Synapse object in memory) is what you get when the blacklist is functioning correctly.
The stacktrace being printed should be only slightly of different size (its 1 frame, and 3 extra lines being printed, so its more like 1/3 bigger, not 5x as you theorize). If that pushes it over the edge is hard to say, it is possible, while unlikely.
But I do think overeager exc trace is a problem, which hopefully can be addressed by less verbose handled synapse exceptions bittensor#1928 .

neurons/miner.py

cxmplex · 2024-05-24T14:45:40Z

I don't think 1/3 bigger is accurate FYI. @mjurbanski-reef

Perhaps I'm wrong, but you can't just count functions and say okay it's 1 extra. Its dependent on the total depth of nested calls and the complexity of the execution path of the function that throws the exception. Not like this matters too much, they straight up just shouldn't print a greedy stacktrace like that as you've mentioned, especially if its triggerable by an outside user.

remove vulnerabilities from default code

ce2a3a9

cxmplex mentioned this pull request May 22, 2024

fix error handling in blacklist processing opentensor/bittensor#1916

Closed

cxmplex added 3 commits May 22, 2024 04:52

this can be ignored if blacklist is fixed by my other pr

e09a5ed

log exception

0d045de

move vuln line below the register check (should handle checking that …

ca3ee79

…hotkey exists if it exists in metagraph)

cxmplex changed the title ~~Remove vulnerabilities/attack vectors from default code~~ Remove triggerable unhandled exceptions from default code May 22, 2024

mjurbanski-reef reviewed May 22, 2024

View reviewed changes

neurons/miner.py Outdated Show resolved Hide resolved

Update neurons/miner.py

624ebd7

Co-authored-by: Maciej Urbański <122983254+mjurbanski-reef@users.noreply.github.com>

Update miner.py

e804f0b

cxmplex closed this May 23, 2024

cxmplex reopened this May 23, 2024

cxmplex changed the title ~~Remove triggerable unhandled exceptions from default code~~ Remove abusable code from subnet template May 23, 2024

cxmplex added 4 commits May 22, 2024 21:26

add minimum stake config check

f9dd361

add in config option

a8feafc

remove extra line

8487a97

typo

3884162

add back in TODO

64f27a0

mjurbanski-reef mentioned this pull request May 23, 2024

less verbose handled synapse exceptions opentensor/bittensor#1928

Merged

mjurbanski-reef reviewed May 23, 2024

View reviewed changes

neurons/miner.py Outdated Show resolved Hide resolved

remove broad try except

9d96d66

gus-opentensor added the help wanted Extra attention is needed label Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove abusable code from subnet template #88

Remove abusable code from subnet template #88

cxmplex commented May 22, 2024 •

edited

Loading

mjurbanski-reef left a comment

mjurbanski-reef commented May 22, 2024

cxmplex commented May 22, 2024 via email

cxmplex commented May 23, 2024

cxmplex commented May 23, 2024

cxmplex commented May 23, 2024 •

edited

Loading

mjurbanski-reef left a comment •

edited

Loading

cxmplex commented May 24, 2024 •

edited

Loading

Remove abusable code from subnet template #88

Are you sure you want to change the base?

Remove abusable code from subnet template #88

Conversation

cxmplex commented May 22, 2024 • edited Loading

mjurbanski-reef left a comment

Choose a reason for hiding this comment

mjurbanski-reef commented May 22, 2024

cxmplex commented May 22, 2024 via email

cxmplex commented May 23, 2024

cxmplex commented May 23, 2024

cxmplex commented May 23, 2024 • edited Loading

mjurbanski-reef left a comment • edited Loading

Choose a reason for hiding this comment

cxmplex commented May 24, 2024 • edited Loading

cxmplex commented May 22, 2024 •

edited

Loading

cxmplex commented May 23, 2024 •

edited

Loading

mjurbanski-reef left a comment •

edited

Loading

cxmplex commented May 24, 2024 •

edited

Loading