
documents.log is empty, but documents are getting sent to my index #1667

Open
UltraSalem opened this issue Jun 4, 2023 · 5 comments
Labels
check_for_bug Needs to be reproduced

Comments

@UltraSalem

Describe the bug

Running via docker-compose, with the logging directory set in the docker-compose file. fscrawler.log gets populated and rotated, but documents.log in that same folder stays empty.

Job Settings

$ cat config/whitedwarfscryer/_settings.yaml

---
name: "whitedwarfscryer"
fs:
  indexed_chars: -1
  continue_on_error: true
  add_filesize: true
  store_source: false
  index_content: true
  filename_as_id: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
    - url: "http://elasticsearch:9200"
  username: "[redacted]"
  password: "[redacted]"
  ssl_verification: false
  bulk_size: 200
  flush_interval: "5s"
  byte_size: "25mb"

$ cat docker-compose.yml

version: '3'
services:
  fscrawler:
    image: dadoonet/fscrawler
    container_name: fscrawler
    volumes:
      - "/zdata/zsalem/Downloads/death stuffs/Games/WhiteDwarfs/first200:/tmp/es:ro"
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
    command: fscrawler whitedwarfscryer --loop 1 
    networks:
      - es_network
networks:
  es_network:
    external:
      name: es_network

Logs

$ docker logs fscrawler

SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.20.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.

$ cat fscrawler.log

03:57:45,482 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'

03:57:45,497 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [500.2mb/7.8gb=6.25%], RAM [5.1gb/31.2gb=16.49%], Swap [7.9gb/7.9gb=100.0%].
03:57:45,723 WARN  [f.p.e.c.f.c.FsCrawlerCli] `url` is not set. Please define it. Falling back to default: [/tmp/es].
03:57:45,731 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:57:45,809 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
03:57:46,143 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.7.1
03:57:46,146 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
03:57:46,178 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.7.1
03:57:46,198 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [whitedwarfscryer] for [/tmp/es] every [15m]
03:57:58,595 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Expected behavior

I expect that as fscrawler runs inside this container, documents.log would be populated. It looks like it was created the first time I ran the container a week ago, but it has never contained anything, even though my index is being populated successfully and fscrawler.log is written and rotated. Assuming the documents are scanned in alphabetical order (I could not find any info in the docs, but Bard said it was alphabetical first...?), the first file indexed should be White Dwarf Magazine Issue 001 - Jun 1977 (UK)-001.pdf. Not all documents are making it into my index, so I suspect some are erroring out, but I can't see which ones, and I don't want to manually check 13,000 documents. Hence looking for documents.log info.
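In the meantime, one workaround for finding the failures without checking 13,000 files by hand is to pull the indexed filenames out of Elasticsearch and diff them against the files on disk. This is just a sketch, not an fscrawler feature; the index name (`whitedwarfscryer`) and the `file.filename` field come from the job settings above and fscrawler's default mapping, and the example inputs are made up:

```python
from pathlib import Path

def missing_from_index(local_names, indexed_names):
    """Return the local filenames that never made it into the index."""
    return sorted(set(local_names) - set(indexed_names))

# In practice, local_names would come from the crawled folder, e.g.
#   [p.name for p in Path("/tmp/es").iterdir() if p.suffix == ".pdf"]
# and indexed_names from a search/scroll over the "file.filename" field
# of the "whitedwarfscryer" index. Hypothetical sample data here:
local_names = ["a.pdf", "b.pdf", "c.pdf"]
indexed_names = ["a.pdf", "c.pdf"]
print(missing_from_index(local_names, indexed_names))  # ['b.pdf']
```

Anything printed by that diff would be a candidate for a failed or skipped document.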

-r--r--r-- 1 root  root     0 May 29 23:44 documents.log
-rw-r--r-- 1 root  root  1.1K May 30 00:21 fscrawler-2023-05-29-1.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:31 fscrawler-2023-05-29-2.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:32 fscrawler-2023-05-29-3.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 00:39 fscrawler-2023-05-29-4.log.gz
-rw-r--r-- 1 root  root  1.2K May 30 01:56 fscrawler-2023-05-29-5.log.gz
-rw-r--r-- 1 root  root  1.1K May 30 02:05 fscrawler-2023-05-29-6.log.gz
-rw-r--r-- 1 root  root  1.1K May 31 03:06 fscrawler-2023-05-29-7.log.gz
-rw-r--r-- 1 root  root   154 Jun  3 14:38 fscrawler-2023-05-30-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 14:38 fscrawler-2023-06-03-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 14:39 fscrawler-2023-06-03-2.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 17:45 fscrawler-2023-06-03-3.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  3 18:01 fscrawler-2023-06-03-4.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 00:53 fscrawler-2023-06-03-5.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 13:25 fscrawler-2023-06-03-6.log.gz
-rw-r--r-- 1 root  root   829 Jun  4 13:47 fscrawler-2023-06-04-1.log.gz
-rw-r--r-- 1 root  root  1.1K Jun  4 13:57 fscrawler-2023-06-04-2.log.gz
-rw-r--r-- 1 root  root  3.2K Jun  4 13:57 fscrawler.log

Versions:

  • OS: Ubuntu 22.04
  • Version 2.10 snapshot

Attachment

Attempting to attach the document that should have been scanned first, but does not appear in my index.

@UltraSalem added the check_for_bug (Needs to be reproduced) label Jun 4, 2023
@UltraSalem
Author

OK, that attachment got inserted somewhere I wasn't expecting, sorry! But it's still a fairly relevant spot, at least :)

@UltraSalem
Author

OK, the document White.Dwarf.Magazine.Issue.001.-.Jun.1977.UK.-001.pdf is now in my index, now that the job has completed (13,425 documents). Which is weird, as it has the oldest created date and the first name in alphabetical order of all the documents, so it should have been the first document in there, not indexed after some 7,000 other documents.

documents.log is still empty.

$ cat logs/fscrawler.log

05:58:06,284 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
05:58:06,415 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [whitedwarfscryer] stopped
05:58:06,418 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [whitedwarfscryer] stopped

@dadoonet
Owner

dadoonet commented Jun 5, 2023

So what should be a good order in your opinion?

For some use cases, I have the feeling that the most recent documents are the most relevant vs the oldest. What do you think?

@UltraSalem
Author

I think oldest file first, by the date/time it arrived in the scanned folder (last modified date, maybe?). Users will expect first in, first out for the index when they're using it. So if a set of files gets written into the monitored folders over the day, the user would expect the first ones that went in to appear in the index first.

That's my thinking, anyway! I don't really mind, as long as I can find it documented somewhere. I can mess around with data prep to get the order I need if I have a particular requirement, as long as I know what I'm aiming for.
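The first-in-first-out ordering suggested above would amount to sorting the candidate files by modification time before indexing. A minimal sketch (not fscrawler's actual behavior; the demo files and mtimes are made up):

```python
import os
import tempfile

def fifo_order(paths):
    """Sort file paths oldest-modified first (first in, first out)."""
    return sorted(paths, key=os.path.getmtime)

# Hypothetical demo: three files whose mtimes are set explicitly so that
# "arrival order" is b.pdf, then a.pdf, then c.pdf.
with tempfile.TemporaryDirectory() as d:
    names = ["b.pdf", "a.pdf", "c.pdf"]
    for i, name in enumerate(names):
        p = os.path.join(d, name)
        open(p, "w").close()
        os.utime(p, (1000 + i, 1000 + i))  # older timestamp = arrived earlier
    ordered = [os.path.basename(p)
               for p in fifo_order(os.path.join(d, n) for n in names)]
    print(ordered)  # ['b.pdf', 'a.pdf', 'c.pdf']
```

Note that alphabetical order would have put a.pdf first, whereas mtime order preserves arrival order.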

@ScottCov

ScottCov commented Aug 3, 2023

I am experiencing the same issue with documents.log using Docker, although my documents.log file does record errors; it just isn't recording the documents that were indexed:

2023-08-02 08:38:31,003 [ERROR] [603.pdf][/23-90020/603.pdf] Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
2023-08-02 16:22:08,837 [ERROR] [859-9.pdf][/23-90020/859-9.pdf] Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
