
Upon restart, FSCrawler deletes and reindexes even though no new files are added. #1941

Open
ScottCov opened this issue Sep 26, 2024 · 10 comments
Labels
check_for_bug Needs to be reproduced

Comments

@ScottCov

Describe the bug

I run FSCrawler continuously. What I find is that if I stop it and then restart it, it proceeds to delete and reindex the documents which are already indexed. Specifically, the number of indexed documents doesn't change, but FSCrawler appears to be deleting them and then adding them again even though there are no new ones. To be clear, I just stopped the Docker container/Elasticsearch and restarted it.

Job Settings

---
name: "job_name"
fs:
  #url: "/mnt/cloud/cases"
  url: "/tmp/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false 
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    #path: "/usr/bin/"
    #data_path: "/usr/share/tesseract-ocr/5/tessdata/"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  pipeline: "fscrawler-copy"
  nodes:
  - url: "https://192.168.1.199:9200"
  # - url: "https://192.168.1.196:9200"
  # - url: "https://192.168.1.198:9200"
  # - url: "https://192.168.1.200:9200"
  # - url: "https://192.168.1.201:9200"
  username: "elastic"
  password: "Dynaco123$"
  bulk_size: 100
  ssl_verification: false


Logs

14:16:30,482 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [364.4mb/5.8gb=6.07%], RAM [7.2gb/23.4gb=30.92%], Swap [22.3gb/22.3gb=100.0%].
14:16:30,816 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:16:30,817 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
14:16:30,942 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,702 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,711 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,827 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,855 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [15m]
14:16:32,038 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Expected behavior

I wouldn't expect any reindexing, as no new documents were added to the folder.
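
One way to see this from the Elasticsearch side is to watch the document count across a restart. A sketch, assuming the default index, which FSCrawler names after the job (job_name here), and with credentials elided:

# Run before stopping and after restarting the crawler; -k matches
# ssl_verification: false in the settings above.
curl -k -u elastic:<password> "https://192.168.1.199:9200/job_name/_count?pretty"

If documents are deleted and re-added, the count dips and recovers between checks.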

Versions:

  • OS: Debian 12
  • FSCrawler: 2.10-SNAPSHOT (Docker image)


@ScottCov added the check_for_bug (Needs to be reproduced) label Sep 26, 2024
@dadoonet
Owner

Could you run this again with a single document, with trace mode on?

@ScottCov
Author

ScottCov commented Sep 26, 2024

I am pretty sure I originally started it with debug:
--env DOC_LEVEL=debug

Would I start it again with --env LOG_LEVEL=trace? And do I just remove the files in the watch folder and put one in? Or I guess I could just change the folder in the yml file?

@dadoonet
Owner

Yeah. Look at https://fscrawler.readthedocs.io/en/latest/admin/logger.html for details.

Alternatively, --trace is still supported but deprecated.

And yes, using a new dir would help. But you need to run with --restart.
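
Per the linked logger docs, LOG_LEVEL and DOC_LEVEL are Java system properties, so in Docker they would be passed through FS_JAVA_OPTS with a -D prefix. A minimal sketch, assuming the image honors FS_JAVA_OPTS; the mounts are illustrative:

docker run -it --rm \
     --env FS_JAVA_OPTS="-DLOG_LEVEL=trace" \
     -v ~/.fscrawler:/root/.fscrawler \
     -v ~/tmp:/tmp/es:ro \
     dadoonet/fscrawler fscrawler job_name --restart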

@ScottCov
Author

ScottCov commented Sep 26, 2024

OK, I am using Docker, so I don't know whether --trace/--restart apply here. How do I restart the Docker container with different logging options? I think I have to recreate the container, no?

@dadoonet
Owner

I guess something like this:

docker run -it --rm \
     -v ~/.fscrawler:/root/.fscrawler \
     -v ~/tmp:/tmp/es:ro \
     dadoonet/fscrawler fscrawler job_name --trace --restart

@ScottCov
Author

docker run -d --env FS_JAVA_OPTS=-LOG_LEVEL=trace --name fscrawler -v /home/serveracct/logs/log1:/usr/share/fscrawler/logs -v /home/serveracct/logs/log2:/tmp -v ~/.fscrawler:/root/.fscrawler -v /mnt/cloud/cases/test:/tmp/es:ro dadoonet/fscrawler fscrawler job_name --trace --restart

serveracct@planck:/mnt/cloud/cases$ docker: Error response from daemon: Conflict. The container name "/fscrawler" is already in use by container "e3ed43fb9317fa65374564b70e5a1c79bfd5cbbae63de59b19d02ddcd6b0fe8b". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
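
For what it's worth, the conflict is standard Docker behavior: a stopped container keeps its name, exactly as the error message says. A sketch of the way past it, which would also restore the -D prefix that Java system properties passed via FS_JAVA_OPTS need:

# Remove the stale container (stopped containers still hold their names),
# then rerun the docker run command above with FS_JAVA_OPTS="-DLOG_LEVEL=trace".
docker rm -f fscrawler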

@ScottCov
Author

I guess I could just restart without actually naming it and we'd just have another container?

@dadoonet
Owner

Maybe. I'm not that good with Docker 😅

@ScottCov
Author

OK, if I change the folder and restart FSCrawler, will it then delete my old documents from Elasticsearch?

@dadoonet
Owner

No. --restart just removes the status file which otherwise exists in the job dir.
It does not remove anything in Elasticsearch.
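
As a concrete illustration, assuming the default job directory layout described in the FSCrawler docs, --restart is roughly equivalent to deleting the job's status file by hand before starting:

# Force a full rescan on the next run; Elasticsearch data is untouched.
rm ~/.fscrawler/job_name/_status.json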
