Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove abusable code from subnet template #88

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

cxmplex
Copy link

@cxmplex cxmplex commented May 22, 2024

The default subnet code has been reused on major mainnet subnets leading to predictable issues with lack of checks towards stakeless validators, and a triggerable exception with a longer stacktrace that a normal blacklistexception.

Forcing an error in the blacklist_fn leads to an amplication of the normal blacklist response. For instance, a normal blacklist repsonse will state the short reason, i.e. not registered. The error blacklist response will include the full error message. Some error messages may be longer than others (and the ValueError one is), leading to an amplication in the normal response. The effect amplification is around 2-3x depending on the original return text of the blacklist_fn.

Utilizing these flaws, attackers were able to do two things:

Subnet 2 (missing stake check)

On subnet 2, a validators minimum stake was not being checked. As a result, a malicious miner was able to register a low stake validator, and prompt users with challenges. The purpose of these challenges was to slow miners average response time in comparison to their normal response time to a real validator. In addition, the attacker employed several sneaky methods such as monitoring other validator requests, in order to sync their request to a validators, forcing a concurrent request on the innocent miner.

As a result, I believe it is best practice to leave in a default stake blacklist as an example, and allow subnet devs to REMOVE this if they wish, not the other way around. In best practice, the safest solution should be offered as the example, not a barebones solution. If the goal was for subnet devs to inspect the blacklist function and tailor it to their use-case, this has been demonstrated to not happen on at least 6 subnets (likely more). In most subnets, a stakeless validator has no actual purpose of calling a miners synapse. In practice, this lack of a default example has been shown to cause issues in various subnets that are often not picked up on right away.

Subnet 31 (triggerable unhandled exception)

Other affected subnets with a similar issue to 31 included subnet 33, 28, 25, 16, 11, 5. Here's my leading theory on why this leads to exhaustion, rather than the normal blacklist behavior, which as @mjurbanski-reef has shown, already throws an exception normally. Let's compare both scenarios, specifically taking a look at the traceback that will be printed here:

https://github.com/opentensor/bittensor/blob/master/bittensor/axon.py#L959

Unhandled exception in blacklist_fn:

blacklist_fn -> blacklist -> dispatch

Normal handled exception stacktrace:

blacklist -> dispatch

As you can see, forcing an error in the blacklist_fn in the subnet (note not the blacklist fn in axon.py) leads to a larger stacktrace. This stacktrace would cost additional resources to gather and print, and would even possibly include stack trace information from the asyncio upstream, metagraph calls, etc. The stacktrace length should be roughly 3-5x the size due to the increased callstack.

While a normal exception thrown by blacklist with normal behavior (i.e. the person is blacklisted) leads to a much shorter stacktrace, only blacklist, then caught by dispatch.

It's my belief this is part one to why this leads to exhaustion. The second part, is the increased response size when throwing an error from the blacklist_fn. A combination of both means greater CPU/mem pressure, and greater networking pressure. When hundreds to thousands of requests are coming in, small inefficiencies like these can play a big part to exhausting your miner. From testing, removing the unhandled exception from blacklist_fn led to continued operation for the miner while being attacked, while not handling the exception and letting it pass downstream caused server reboots on various miners machines, and overall total DoS to the axon. Why this would happen? I can only guess differences in stacktrace/callstack length.

@cxmplex cxmplex changed the title Remove vulnerabilities/attack vectors from default code Remove triggerable unhandled exceptions from default code May 22, 2024
Copy link

@mjurbanski-reef mjurbanski-reef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would be better just to handle these edge-cases explicitly, other wise you will get error in logs, where its clear it was just someone sending malformed requests. And if that was you, then you would also benefit from proper description of that.

neurons/miner.py Outdated Show resolved Hide resolved
@mjurbanski-reef
Copy link

I must admit I don't get this

a) overhead of uncaught exception - isn't that exactly how blacklist operates? by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318
b) amplification of response - wasn't the response just string value of exception text? why was it bigger? Do you have example?

Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978
So it should make amplification even less possible

@cxmplex
Copy link
Author

cxmplex commented May 22, 2024 via email

Co-authored-by: Maciej Urbański <122983254+mjurbanski-reef@users.noreply.github.com>
@cxmplex
Copy link
Author

cxmplex commented May 23, 2024

Reopening soon with more comprehensive pull.

@cxmplex cxmplex closed this May 23, 2024
@cxmplex cxmplex reopened this May 23, 2024
@cxmplex cxmplex changed the title Remove triggerable unhandled exceptions from default code Remove abusable code from subnet template May 23, 2024
@cxmplex
Copy link
Author

cxmplex commented May 23, 2024

examples of the attacks:

2024-05-22 06:59:25.684 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49350 | 500 | None is not in list
2024-05-22 06:59:25.685 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49360 | 200 | Success
INFO: 104.54.246.177:49360 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.693 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49360 | 500 | None is not in list
2024-05-22 06:59:25.694 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49364 | 200 | Success
2024-05-22 06:59:25.701 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49364 | 500 | None is not in list
INFO: 104.54.246.177:49364 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.702 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49356 | 200 | Success
2024-05-22 06:59:25.710 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49356 | 500 | None is not in list
INFO: 104.54.246.177:49356 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.711 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49366 | 200 | Success
INFO: 104.54.246.177:49366 - "POST /Dummy HTTP/1.1" 500 Internal Server Error
2024-05-22 06:59:25.718 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49366 | 500 | None is not in list
2024-05-22 06:59:25.719 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49378 | 200 | Success
INFO: 104.54.246.177:49378 - "POST /Dummy HTTP/1.1" 500 Internal Server Error

example of missing minimum stake attack:

https://discord.com/channels/799672011265015819/1220504695404236800/1239769414585286738

@cxmplex
Copy link
Author

cxmplex commented May 23, 2024

@mjurbanski-reef

Addressed, and added a theory on why this might be. Also added in minimum stake example as well, with explanation on why I think this is needed (based on real world exploitation in sn2) and should be offered by default rather than simply mentioning that you may want to do this. This isn't a MAY, it's a MUST, in my opinion, hence the change in the documentation as well, unless otherwise demonstrated in their subnet design pattern.

Anyways, you shouldn't have unhandled exceptions (especailly of generic types like ValueErorr) that can be triggered by users even if they are effectively caught upstream.

Copy link

@mjurbanski-reef mjurbanski-reef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Minimum stake and other blacklisting options are great and needed additions, no doubt about that :)

  2. As for "response size amplification" if we talk about difference measure in <100Bytes, that is for sure not what caused the initial problem - and it will be hard to tell without an example of logs from miner that was DoS into restart with logs showing what was happening right before that.
    In fact, the ~50B responses size shown by logs (it's not real response size, just size of Synapse object in memory) is what you get when the blacklist is functioning correctly.

  3. The stacktrace being printed should be only slightly of different size (its 1 frame, and 3 extra lines being printed, so its more like 1/3 bigger, not 5x as you theorize). If that pushes it over the edge is hard to say, it is possible, while unlikely.
    But I do think overeager exc trace is a problem, which hopefully can be addressed by less verbose handled synapse exceptions bittensor#1928 .

neurons/miner.py Outdated Show resolved Hide resolved
@cxmplex
Copy link
Author

cxmplex commented May 24, 2024

I don't think 1/3 bigger is accurate FYI. @mjurbanski-reef

Perhaps I'm wrong, but you can't just count functions and say okay it's 1 extra. Its dependent on the total depth of nested calls and the complexity of the execution path of the function that throws the exception. Not like this matters too much, they straight up just shouldn't print a greedy stacktrace like that as you've mentioned, especially if its triggerable by an outside user.

@gus-opentensor gus-opentensor added the help wanted Extra attention is needed label Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants