
Upon restart, FSCrawler deletes and reindexes even though no new files are added. #1941

Open
ScottCov opened this issue Sep 26, 2024 · 10 comments
Labels
check_for_bug Needs to be reproduced

Comments

@ScottCov

Describe the bug

I run FSCrawler continuously. What I find is that if I stop it and then restart it, it proceeds to delete and reindex the documents which are already indexed. Specifically, the number of indexed documents doesn't change, but FSCrawler appears to be deleting them and then adding them again even though there are no new ones. To be clear, I just stopped the Docker container/Elasticsearch and restarted it.

Job Settings

---
name: "job_name"
fs:
  #url: "/mnt/cloud/cases"
  url: "/tmp/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false 
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    #path: "/usr/bin/"
    #data_path: "/usr/share/tesseract-ocr/5/tessdata/"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  pipeline: "fscrawler-copy"
  nodes:
  - url: "https://192.168.1.199:9200"
  # - url: "https://192.168.1.196:9200"
  # - url: "https://192.168.1.198:9200"
  # - url: "https://192.168.1.200:9200"
  # - url: "https://192.168.1.201:9200"
  username: "elastic"
  password: "Dynaco123$"
  bulk_size: 100
  ssl_verification: false


Logs

14:16:30,482 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [364.4mb/5.8gb=6.07%], RAM [7.2gb/23.4gb=30.92%], Swap [22.3gb/22.3gb=100.0%].
14:16:30,816 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:16:30,817 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
14:16:30,942 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,702 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,711 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,827 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,855 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [15m]
14:16:32,038 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Expected behavior

I wouldn't expect any reindexing, as no new documents were added to the folder.
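
One way to see this from the Elasticsearch side is to watch the document count across a restart. A sketch, assuming the default index, which FSCrawler names after the job (job_name here), and with credentials elided:

# Run before stopping and after restarting the crawler; -k matches
# ssl_verification: false in the settings above.
curl -k -u elastic:<password> "https://192.168.1.199:9200/job_name/_count?pretty"

If documents are deleted and re-added, the count dips and recovers between checks.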

Versions:

  • OS: Debian 12
  • FSCrawler: 2.10-SNAPSHOT (Docker image)


@ScottCov added the check_for_bug (Needs to be reproduced) label Sep 26, 2024
@dadoonet
Owner

Could you run this again with a single document, with trace mode on?

@ScottCov
Author

ScottCov commented Sep 26, 2024

I am pretty sure I originally started it with debug:
--env DOC_LEVEL=debug

Would I start it again with --env LOG_LEVEL=trace? And do I just remove the files in the watch folder and put one in? Or I guess I could just change the folder in the yml file?

@dadoonet
Owner

Yeah. Look at https://fscrawler.readthedocs.io/en/latest/admin/logger.html for details.

Alternatively, --trace is still supported but deprecated.

And yes, using a new dir would help. But you need to run with --restart.
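
Per the linked logger docs, LOG_LEVEL and DOC_LEVEL are Java system properties, so in Docker they would be passed through FS_JAVA_OPTS with a -D prefix. A minimal sketch, assuming the image honors FS_JAVA_OPTS; the mounts are illustrative:

docker run -it --rm \
     --env FS_JAVA_OPTS="-DLOG_LEVEL=trace" \
     -v ~/.fscrawler:/root/.fscrawler \
     -v ~/tmp:/tmp/es:ro \
     dadoonet/fscrawler fscrawler job_name --restart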

@ScottCov
Author

ScottCov commented Sep 26, 2024

OK, I am using Docker, so I don't know whether --trace/--restart apply here. How do I restart the Docker container with different logging options? I think I have to recreate the container, no?

@dadoonet
Owner

I guess something like this:

docker run -it --rm \
     -v ~/.fscrawler:/root/.fscrawler \
     -v ~/tmp:/tmp/es:ro \
     dadoonet/fscrawler fscrawler job_name --trace --restart

@ScottCov
Author

docker run -d --env FS_JAVA_OPTS=-LOG_LEVEL=trace --name fscrawler -v /home/serveracct/logs/log1:/usr/share/fscrawler/logs -v /home/serveracct/logs/log2:/tmp -v ~/.fscrawler:/root/.fscrawler -v /mnt/cloud/cases/test:/tmp/es:ro dadoonet/fscrawler fscrawler job_name --trace --restart

serveracct@planck:/mnt/cloud/cases$ docker: Error response from daemon: Conflict. The container name "/fscrawler" is already in use by container "e3ed43fb9317fa65374564b70e5a1c79bfd5cbbae63de59b19d02ddcd6b0fe8b". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
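
For what it's worth, the conflict is standard Docker behavior: a stopped container keeps its name, exactly as the error message says. A sketch of the way past it, which would also restore the -D prefix that Java system properties passed via FS_JAVA_OPTS need:

# Remove the stale container (stopped containers still hold their names),
# then rerun the docker run command above with FS_JAVA_OPTS="-DLOG_LEVEL=trace".
docker rm -f fscrawler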

@ScottCov
Author

I guess I could just restart without actually naming it and we'd just have another container?

@dadoonet
Owner

Maybe. I'm not that good with Docker 😅

@ScottCov
Author

OK, if I change the folder and restart FSCrawler, will it then delete my old documents from Elasticsearch?

@dadoonet
Owner

No. --restart just removes the status file which otherwise exists in the job dir.
It does not remove anything in Elasticsearch.
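
As a concrete illustration, assuming the default job directory layout described in the FSCrawler docs, --restart is roughly equivalent to deleting the job's status file by hand before starting:

# Force a full rescan on the next run; Elasticsearch data is untouched.
rm ~/.fscrawler/job_name/_status.json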
