
RocksDBException thrown for .ldb file not found when calling ingestExternalFile() #10258

Open
dparrella opened this issue Jun 27, 2022 · 9 comments

@dparrella

dparrella commented Jun 27, 2022


Expected behavior

ingestExternalFile() should be able to successfully ingest SST files.

Actual behavior

RocksDBException is thrown indicating that a .ldb file could not be found.

The error message is:

While open a file for random read: /<path_obscured>/002462.ldb: No such file or directory

Steps to reproduce the behavior

It is unknown what exact conditions cause this behavior. The database contains about 28 million keys. Storage utilization is not an issue.

We periodically (about every 30 minutes) refresh data and publish the availability of new SST files to be ingested. At that time we attempt to ingest the new SST file(s).

We are using version 7.2.2.

@dparrella
Author

This also appears to be affecting background compactions:

[/db_impl/db_impl_compaction_flush.cc:2925] Waiting after background compaction error: IO error: No such file or directory: While open a file for random read: /<path_obscured>/002462.ldb: No such file or directory, Accumulated background error counts: 1

@ajkr
Contributor

ajkr commented Jun 28, 2022

I guess the file /<path_obscured>/002462.sst is also missing. ".sst" is the normal suffix but it looks like we try the ".ldb" suffix when a file is not found.
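To make the fallback concrete, here is a small illustrative sketch (not RocksDB's actual code, and `TableFileLocator` is a hypothetical name): when the table file is missing under the normal `.sst` suffix, the lookup retries with the legacy LevelDB `.ldb` suffix before giving up, which is why the error message reports the `.ldb` path.

```java
import java.io.File;
import java.util.Optional;

// Hypothetical sketch of the suffix-fallback lookup described above:
// try "<number>.sst" first, then the legacy "<number>.ldb" suffix.
public class TableFileLocator {
    public static Optional<File> locate(File dbDir, long fileNumber) {
        String base = String.format("%06d", fileNumber);
        File sst = new File(dbDir, base + ".sst"); // normal suffix
        if (sst.exists()) {
            return Optional.of(sst);
        }
        File ldb = new File(dbDir, base + ".ldb"); // legacy LevelDB fallback
        return ldb.exists() ? Optional.of(ldb) : Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("db").toFile();
        new File(dir, "002462.ldb").createNewFile();
        System.out.println(locate(dir, 2462)); // finds the .ldb fallback
    }
}
```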

Do you have an example ingestion record in your LOG file where the ingested file later cannot be found? An ingestion record looks something like this:

2022/06/28-16:50:07.053178 373471 [db/external_sst_file_ingestion_job.cc:499] [AddFile] External SST file ./tmp.sst was ingested in L6 with path ./tmp-db/000012.sst (global_seqno=0)
2022/06/28-16:50:07.053186 373471 EVENT_LOG_v1 {"time_micros": 1656460207053174, "event": "ingest_finished", "files_ingested": ["file", "./tmp-db/000012.sst", "level", 6], "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
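If it helps to dig such records out of a large LOG file, here is a small self-contained helper (my own sketch, not part of RocksDB; `IngestLogScanner` is a hypothetical name) that extracts the DB-internal path assigned to each ingested file from `[AddFile]` lines of the format shown above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scan RocksDB LOG lines for [AddFile] ingestion records and pull out
// the path each ingested file was assigned inside the DB directory.
public class IngestLogScanner {
    // Matches the "with path <file> (global_seqno=" part of an [AddFile] line.
    private static final Pattern INGESTED_PATH =
            Pattern.compile("\\[AddFile\\].*with path (\\S+) \\(global_seqno=");

    public static List<String> ingestedPaths(List<String> logLines) {
        List<String> paths = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = INGESTED_PATH.matcher(line);
            if (m.find()) {
                paths.add(m.group(1));
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "2022/06/28-16:50:07.053178 373471 [db/external_sst_file_ingestion_job.cc:499] "
                + "[AddFile] External SST file ./tmp.sst was ingested in L6 "
                + "with path ./tmp-db/000012.sst (global_seqno=0)");
        System.out.println(ingestedPaths(lines)); // prints [./tmp-db/000012.sst]
    }
}
```

Cross-checking the paths this reports against the files actually present in the DB directory would show which ingested files later went missing.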

@ajkr ajkr added question waiting Waiting for a response from the issue creator. labels Jun 28, 2022
@dparrella
Author

@ajkr thanks for the reply. Yes, that's correct; I found a Stack Overflow article that mentioned .ldb is used as a fallback when the .sst file is not found.

Unfortunately we don't have those logs available. Oddly when we looked at the local directory all we saw were the .sst files and the MANIFEST.

We use rocksdbjni and have a Logger implementation we use to get logs in our Java application. We found the logs to be very noisy so we have the InfoLogLevel set to ERROR_LEVEL.

Is there a callback for this event we might be able to use so we can make sure this gets logged? Also - any idea why the log files would have been missing?

@ajkr ajkr added up-for-grabs Up for grabs and removed waiting Waiting for a response from the issue creator. labels Jun 29, 2022
@ZenoTan

ZenoTan commented Jul 7, 2022

Hi, I also ran into a similar issue when I tried to open the DB after ingestion:

Open DB failed: /tmp/test_client/features.dbCorruption: Corruption: IO error: No such file or directory: While open a file for random read: /tmp/test_client/features.db/000044.ldb: No such file or directory in file /tmp/test_client/features.db/MANIFEST-000009

I generated multiple SST files (****.sst) and ingested them with:

rocksdb::DB* db;
rocksdb::Options options;
rocksdb::Status status = rocksdb::DB::Open(options, path, &db);
if (!status.ok()) {
  LOG(FATAL) << "Failed to open rocksdb: " << status.ToString();
}

rocksdb::IngestExternalFileOptions ingest_options;
status = db->IngestExternalFile(thread_path_names, ingest_options);
if (!status.ok()) {
  LOG(FATAL) << "Failed to add file to rocksdb: " << status.ToString();
}

status = db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
if (!status.ok()) {
  LOG(FATAL) << "Failed to compact rocksdb: " << status.ToString();
}

status = db->Flush(rocksdb::FlushOptions());
if (!status.ok()) {
  LOG(FATAL) << "Failed to flush rocksdb: " << status.ToString();
}

status = db->Close();
if (!status.ok()) {
  LOG(FATAL) << "Failed to close rocksdb: " << status.ToString();
}

After ingestion, the directory has many 0000***.sst files, which looks correct, but opening the DB again throws the same error @dparrella reported.

Checking the LOG shows:

2022/07/07-15:10:22.273535 2360928 [db/external_sst_file_ingestion_job.cc:490] [AddFile] External SST file /tmp/test_client/features.db/thread_tmp/tmp_31.sst was ingested in L6 with path /tmp/test_client/features.db/000044.sst (global_seqno=0)
2022/07/07-15:10:22.273545 2360928 EVENT_LOG_v1 {"time_micros": 1657206622273460, "event": "ingest_finished", "files_ingested": ["file", "/tmp/test_client/features.db/000013.sst", "level", 6, "file", "/tmp/test_client/features.db/000014.sst", "level", 6, "file", "/tmp/test_client/features.db/000015.sst", "level", 6, "file", "/tmp/test_client/features.db/000016.sst", "level", 6, "file", "/tmp/test_client/features.db/000017.sst", "level", 6, "file", "/tmp/test_client/features.db/000018.sst", "level", 6, "file", "/tmp/test_client/features.db/000019.sst", "level", 6, "file", "/tmp/test_client/features.db/000020.sst", "level", 6, "file", "/tmp/test_client/features.db/000021.sst", "level", 6, "file", "/tmp/test_client/features.db/000022.sst", "level", 6, "file", "/tmp/test_client/features.db/000023.sst", "level", 6, "file", "/tmp/test_client/features.db/000024.sst", "level", 6, "file", "/tmp/test_client/features.db/000025.sst", "level", 6, "file", "/tmp/test_client/features.db/000026.sst", "level", 6, "file", "/tmp/test_client/features.db/000027.sst", "level", 6, "file", "/tmp/test_client/features.db/000028.sst", "level", 6, "file", "/tmp/test_client/features.db/000029.sst", "level", 6, "file", "/tmp/test_client/features.db/000030.sst", "level", 6, "file", "/tmp/test_client/features.db/000031.sst", "level", 6, "file", "/tmp/test_client/features.db/000032.sst", "level", 6, "file", "/tmp/test_client/features.db/000033.sst", "level", 6, "file", "/tmp/test_client/features.db/000034.sst", "level", 6, "file", "/tmp/test_client/features.db/000035.sst", "level", 6, "file", "/tmp/test_client/features.db/000036.sst", "level", 6, "file", "/tmp/test_client/features.db/000037.sst", "level", 6, "file", "/tmp/test_client/features.db/000038.sst", "level", 6, "file", "/tmp/test_client/features.db/000039.sst", "level", 6, "file", "/tmp/test_client/features.db/000040.sst", "level", 6, "file", "/tmp/test_client/features.db/000041.sst", "level", 6, "file", 
"/tmp/test_client/features.db/000042.sst", "level", 6, "file", "/tmp/test_client/features.db/000043.sst", "level", 6, "file", "/tmp/test_client/features.db/000044.sst", "level", 6], "lsm_state": [0, 0, 0, 0, 0, 0, 32]}

I'm using version 7.2.2 sha256 928cbd416c053. I could reproduce the above error with ~100 keys.

@ZenoTan

ZenoTan commented Jul 7, 2022

Here is the state after ingestion (before open DB crashes):

ls -l  /tmp/test_client/features.db/
total 412
-rw-r--r-- 1 ubuntu ubuntu     0 Jul  7 15:30 000005.log
-rw-r--r-- 1 ubuntu ubuntu     0 Jul  7 15:30 000010.log
-rw-r--r-- 1 ubuntu ubuntu  1131 Jul  7 15:30 000013.sst
-rw-r--r-- 1 ubuntu ubuntu  1112 Jul  7 15:30 000014.sst
-rw-r--r-- 1 ubuntu ubuntu  1116 Jul  7 15:30 000015.sst
-rw-r--r-- 1 ubuntu ubuntu  1124 Jul  7 15:30 000016.sst
-rw-r--r-- 1 ubuntu ubuntu  1126 Jul  7 15:30 000017.sst
-rw-r--r-- 1 ubuntu ubuntu  1124 Jul  7 15:30 000018.sst
-rw-r--r-- 1 ubuntu ubuntu  1170 Jul  7 15:30 000019.sst
-rw-r--r-- 1 ubuntu ubuntu  1143 Jul  7 15:30 000020.sst
-rw-r--r-- 1 ubuntu ubuntu  1118 Jul  7 15:30 000021.sst
-rw-r--r-- 1 ubuntu ubuntu  1130 Jul  7 15:30 000022.sst
-rw-r--r-- 1 ubuntu ubuntu  1111 Jul  7 15:30 000023.sst
-rw-r--r-- 1 ubuntu ubuntu  1111 Jul  7 15:30 000024.sst
-rw-r--r-- 1 ubuntu ubuntu  1133 Jul  7 15:30 000025.sst
-rw-r--r-- 1 ubuntu ubuntu  1118 Jul  7 15:30 000026.sst
-rw-r--r-- 1 ubuntu ubuntu  1114 Jul  7 15:30 000027.sst
-rw-r--r-- 1 ubuntu ubuntu  1122 Jul  7 15:30 000028.sst
-rw-r--r-- 1 ubuntu ubuntu  1113 Jul  7 15:30 000029.sst
-rw-r--r-- 1 ubuntu ubuntu  1122 Jul  7 15:30 000030.sst
-rw-r--r-- 1 ubuntu ubuntu  1118 Jul  7 15:30 000031.sst
-rw-r--r-- 1 ubuntu ubuntu  1108 Jul  7 15:30 000032.sst
-rw-r--r-- 1 ubuntu ubuntu  1140 Jul  7 15:30 000033.sst
-rw-r--r-- 1 ubuntu ubuntu  1118 Jul  7 15:30 000034.sst
-rw-r--r-- 1 ubuntu ubuntu  1106 Jul  7 15:30 000035.sst
-rw-r--r-- 1 ubuntu ubuntu  1121 Jul  7 15:30 000036.sst
-rw-r--r-- 1 ubuntu ubuntu  1126 Jul  7 15:30 000037.sst
-rw-r--r-- 1 ubuntu ubuntu  1121 Jul  7 15:30 000038.sst
-rw-r--r-- 1 ubuntu ubuntu  1134 Jul  7 15:30 000039.sst
-rw-r--r-- 1 ubuntu ubuntu  1122 Jul  7 15:30 000040.sst
-rw-r--r-- 1 ubuntu ubuntu  1107 Jul  7 15:30 000041.sst
-rw-r--r-- 1 ubuntu ubuntu  1116 Jul  7 15:30 000042.sst
-rw-r--r-- 1 ubuntu ubuntu  1122 Jul  7 15:30 000043.sst
-rw-r--r-- 1 ubuntu ubuntu  1853 Jul  7 15:30 000044.sst
-rw-r--r-- 1 ubuntu ubuntu    16 Jul  7 15:30 CURRENT
-rw-r--r-- 1 ubuntu ubuntu    36 Jul  7 15:30 IDENTITY
-rw-r--r-- 1 ubuntu ubuntu     0 Jul  7 15:30 LOCK
-rw-r--r-- 1 ubuntu ubuntu 32331 Jul  7 15:30 LOG
-rw-r--r-- 1 ubuntu ubuntu 18254 Jul  7 15:30 LOG.old.1657207808147946
-rw-r--r-- 1 ubuntu ubuntu  2231 Jul  7 15:30 MANIFEST-000009
-rw-r--r-- 1 ubuntu ubuntu  6583 Jul  7 15:30 OPTIONS-000007
-rw-r--r-- 1 ubuntu ubuntu  6578 Jul  7 15:30 OPTIONS-000012

@ajkr

@riversand963
Contributor

If you don't close the db after ingestion, can you access the data in the ingested files?
Do you use the same options and dbname to re-open the db?

@ZenoTan

ZenoTan commented Jul 7, 2022

If you don't close the db after ingestion, can you access the data in the ingested files?

Yes, I could read values right before calling db->Close().

Do you use the same options and dbname to re-open the db?

It turns out I had another DB pointer open on the same path; after removing it, everything works well.

Thanks!

@dparrella
Author

dparrella commented Jul 25, 2022

We started seeing this issue again today. So far our only way to resolve this is to restart the containers running our application, which clears the ephemeral storage we use for RocksDB. When new containers replace the old ones, we effectively start with a clean database.

Before we ingest SST files, we pause background work. Once the files have been ingested we continue background work. Could this be where the database is getting corrupted?

Here's the Java code we use to ingest. Note that only one thread can ingest files to a database at a time.

    public synchronized void ingestFiles(final List<Path> sstFiles) {
        Preconditions.checkCollectionNotNullOrEmpty(sstFiles, "sstFiles cannot be empty");

        final List<String> filePaths = sstFiles.stream()
                .map(Path::toString)
                .collect(Collectors.toList());

        log.info("Starting ingestion of [{}] SST files: {}", filePaths.size(), filePaths);

        try (IngestExternalFileOptions options = new IngestExternalFileOptions()) {
            // Pause background compactions to improve ingestion performance.
            log.info("Pausing background work before file ingestion.");
            this.rocksDB.pauseBackgroundWork();
            this.rocksDB.ingestExternalFile(filePaths, options);
        } catch (final RocksDBException e) {
            throw new IllegalStateException("Failed to ingest files: " + filePaths, e);
        } finally {
            log.info("Continuing background work after file ingestion.");
            try {
                this.rocksDB.continueBackgroundWork();
            } catch (final RocksDBException e) {
                throw new IllegalStateException("Failed to continue background work.", e);
            }
        }

        log.info("Completed ingestion of [{}] SST files.", filePaths.size());
    }

Does anyone have any suggestions for what we might be doing wrong here? Is there anything we could do to try to recover and repair the database when we see this issue?

@btiwana12

It also happens for us: Can't access /000041.sst: IO error: No such file or directory while stat a file for size
