Add find-large-objects subcommand to scrubber #8257

Merged

arpad-m merged 10 commits into main from arpad/scrubber_ls_larger

Jul 4, 2024

Member

arpad-m commented Jul 4, 2024

Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size.

To be used like:

AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas

Part of #5431
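
The filtering that `--min-size` and `--ignore-deltas` describe can be sketched as follows. This is a minimal illustration, not the code from this PR: the `LayerKind` and `ListedObject` types and the `find_large_objects` function are hypothetical names, and treating `--min-size` as an inclusive threshold is an assumption.

```rust
/// Kind of layer object. Hypothetical enum for this sketch; the real
/// scrubber's types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LayerKind {
    Image,
    Delta,
}

/// One listed object: its key, size in bytes, and layer kind (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ListedObject {
    pub key: String,
    pub size: u64,
    pub kind: LayerKind,
}

/// Keep objects of at least `min_size` bytes (assumed inclusive) and,
/// if `ignore_deltas` is set, skip delta layers.
pub fn find_large_objects(
    objects: &[ListedObject],
    min_size: u64,
    ignore_deltas: bool,
) -> Vec<ListedObject> {
    objects
        .iter()
        .filter(|o| o.size >= min_size)
        .filter(|o| !(ignore_deltas && o.kind == LayerKind::Delta))
        .cloned()
        .collect()
}
```

With `min_size = 250_000_000` and `ignore_deltas = true`, as in the command above, only image layers of 250 MB or more would remain in the result.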

arpad-m added 8 commits

July 4, 2024 02:57


          Add find-large-objects subcommand

b39cf76


          Add progress printing

c69fe6d


          Don't issue calls for each timeline, but do a per-tenant listing

a36c679

Before:
2024-07-04T01:16:22.446504Z  INFO Scanned 10 tenants, 88 objects. current <CENSORED>.
2024-07-04T01:16:27.463923Z  INFO Scanned 20 tenants, 352 objects. current <CENSORED>.
2024-07-04T01:16:31.973158Z  INFO Scanned 30 tenants, 499 objects. current <CENSORED>.
2024-07-04T01:18:16.523722Z  INFO Scanned 40 tenants, 19919 objects. current <CENSORED>.
2024-07-04T01:18:20.928643Z  INFO Scanned 50 tenants, 20018 objects. current <CENSORED>.

After:
2024-07-04T01:20:50.641208Z  INFO Scanned 10 shards, 97 objects. current <CENSORED>.
2024-07-04T01:20:52.687879Z  INFO Scanned 20 shards, 368 objects. current <CENSORED>.
2024-07-04T01:20:54.578079Z  INFO Scanned 30 shards, 522 objects. current <CENSORED>.
2024-07-04T01:21:05.835102Z  INFO Scanned 40 shards, 19952 objects. current <CENSORED>.
2024-07-04T01:21:08.150155Z  INFO Scanned 50 shards, 20060 objects. current <CENSORED>.

So the time between the 10 and 50 progress marks drops from about 118s to about 18s.
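
The 118s and 18s figures can be checked from the first and last log timestamps in each run. The helpers below are a small sketch written for this check (the function names are made up here); they only handle timestamps on the same day in the format shown in the logs.

```rust
/// Parse the time-of-day part of a timestamp such as
/// `2024-07-04T01:16:22.446504Z` into seconds since midnight.
/// Returns `None` if the text does not have that shape.
pub fn seconds_of_day(ts: &str) -> Option<f64> {
    let time = ts.split('T').nth(1)?.trim_end_matches('Z');
    let mut parts = time.split(':');
    let h: f64 = parts.next()?.parse().ok()?;
    let m: f64 = parts.next()?.parse().ok()?;
    let s: f64 = parts.next()?.parse().ok()?;
    Some(h * 3600.0 + m * 60.0 + s)
}

/// Elapsed seconds between two timestamps that fall on the same day.
pub fn elapsed_secs(start: &str, end: &str) -> Option<f64> {
    Some(seconds_of_day(end)? - seconds_of_day(start)?)
}
```

Applied to the "Before" run (01:16:22.446504 to 01:18:20.928643) this gives roughly 118.5 seconds, and for the "After" run (01:20:50.641208 to 01:21:08.150155) roughly 17.5 seconds. Both intervals cover only the span between the first and last progress lines shown, not the whole run.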


          Improve printing

8fc02d1


          Add kind info

139473d


          Add way to ignore large delta layers

0dedfc5


          Remove early exit limit

dff1afc


          Print less often

arpad-m requested review from jcsp and problame

July 4, 2024 01:48

arpad-m mentioned this pull request

Epic: pageserver image layer compression #5431

Closed

30 tasks

github-actions bot commented Jul 4, 2024 (edited)

3006 tests run: 2891 passed, 0 failed, 115 skipped (full report)

Flaky tests (1)

Postgres 15

test_delete_timeline_client_hangup: debug

Code coverage* (full report)

functions: 32.6% (6933 of 21250 functions)
lines: 50.1% (54447 of 108781 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results (4361c69 at 2024-07-04T15:02:02.447Z).

problame reviewed

storage_scrubber/src/find_large_objects.rs (resolved)

problame reviewed

storage_scrubber/src/find_large_objects.rs (resolved)

problame requested changes

storage_scrubber/src/find_large_objects.rs (outdated, resolved)

storage_scrubber/src/find_large_objects.rs (resolved)

problame approved these changes

problame reviewed

storage_scrubber/src/find_large_objects.rs (resolved)

arpad-m added 2 commits

July 4, 2024 15:45


          Improve parsing (now it also supports generation numbers)

fe2280c


          Use expect instead of graceful error

4361c69

arpad-m enabled auto-merge (squash)

July 4, 2024 14:08

arpad-m merged commit e579bc0 into main

58 checks passed

arpad-m deleted the arpad/scrubber_ls_larger branch

July 4, 2024 15:07

arpad-m mentioned this pull request

Add concurrency to the find-large-objects scrubber subcommand #8291

Merged

arpad-m added a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

0a937b7

The find-large-objects scrubber subcommand is quite fast if you run it
in an environment with low latency to the S3 bucket (say an EC2 instance
in the same region). However, the higher the latency gets, the slower
the command becomes. Therefore, add a concurrency param and make it
parallelized. This doesn't change that general relationship, but it at
least lets us issue multiple requests in parallel, which should make the
command faster.

Running with concurrency of 64 (default):

```
2024-07-05T17:30:22.882959Z  INFO lazy_load_identity [...]
[...]
2024-07-05T17:30:28.289853Z  INFO Scanned 500 shards. [...]
```

With concurrency of 1, simulating state before this PR:

```
2024-07-05T17:31:43.375153Z  INFO lazy_load_identity [...]
[...]
2024-07-05T17:33:51.987092Z  INFO Scanned 500 shards. [...]
```

In other words, the time to list 500 shards drops from 2:08 minutes
to 6 seconds.

Follow-up of  #8257, part of #5431
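
The idea of a concurrency parameter that bounds how many listings run at once can be sketched as below. This is an illustration under stated assumptions, not the implementation from #8291: it uses plain `std::thread` workers pulling from a shared queue, while the actual scrubber may well use an async runtime instead. `scan_shards` and the `list_shard` callback are hypothetical names, with `list_shard` standing in for a slow, latency-bound S3 listing call that returns an object count.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Run `list_shard` over all `shards` using at most `concurrency` worker
/// threads. Returns (number of shards scanned, sum of the returned counts).
pub fn scan_shards<F>(shards: Vec<String>, concurrency: usize, list_shard: F) -> (usize, u64)
where
    F: Fn(&str) -> u64 + Sync,
{
    let queue = Mutex::new(VecDeque::from(shards));
    let totals = Mutex::new((0usize, 0u64));

    thread::scope(|scope| {
        // Always start at least one worker, even if 0 is passed.
        for _ in 0..concurrency.max(1) {
            scope.spawn(|| loop {
                // Take the next shard; the queue lock is released at the end
                // of this statement, before the slow call below.
                let next = queue.lock().unwrap().pop_front();
                let Some(shard) = next else { break };
                let objects = list_shard(&shard);
                let mut t = totals.lock().unwrap();
                t.0 += 1;
                t.1 += objects;
            });
        }
    });

    totals.into_inner().unwrap()
}
```

Calling `scan_shards(shards, 64, ...)` mirrors the default of 64 mentioned above, and passing `1` degenerates to the sequential behavior. Per-request latency is unchanged either way; the gain comes only from overlapping the waits.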

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

0d5b1a9

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

8b90865

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

dfc6966

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

4bbd38c

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

5431c9c

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

e896cbe

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

112aef1

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

60ef903

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

2e67e48

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

81a28a7

VladLazar pushed a commit that referenced this pull request


          Add find-large-objects subcommand to scrubber (#8257)

bd2046e

VladLazar pushed a commit that referenced this pull request


          Add concurrency to the find-large-objects scrubber subcommand (#8291)

36b790f
