Fixed a bug where, in the reader pipeline, the document count is always less than the actual number of documents by the number of files. #286

lyuwen · 2024-09-10T03:35:02Z

I came across this bug while comparing datasets that contain the same number of documents but are split into different numbers of shards. After inspecting the code, I found that in the snippet below the number sent to the documents stat is actually the index of the last document in a given file, which is one less than the number of documents in that file.

datatrove/src/datatrove/pipeline/readers/base.py

Lines 193 to 204 in c2fc902

    for di, document in enumerate(self.read_file(filepath)):
        if skipped < self.skip:
            skipped += 1
            continue
        if self.limit != -1 and li >= self.limit:
            break
        yield document
        doc_pbar.update()
        li += 1
    file_pbar.update()
    self.stat_update("documents", value=di, unit="input_file")
    if self.limit != -1 and li >= self.limit:

Therefore, I'm proposing a simple fix that just increments the variable di when it's non-zero.
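To make the off-by-one concrete, here is a minimal standalone sketch (not the datatrove code itself). The helper `last_index` and the toy shard lists are hypothetical; they only mimic reporting the final `enumerate` index as the per-file count, as the excerpt above does.

```python
def last_index(docs):
    """Return the final enumerate index, mirroring the value passed to the stat."""
    di = 0  # assumption: fallback so an empty file does not leave di undefined
    for di, _ in enumerate(docs):
        pass
    return di


# The same four documents, stored as one shard and as two shards (toy data).
shards_a = [["d0", "d1", "d2", "d3"]]
shards_b = [["d0", "d1"], ["d2", "d3"]]

reported_a = sum(last_index(f) for f in shards_a)  # 3: one file, one missing doc
reported_b = sum(last_index(f) for f in shards_b)  # 2: two files, two missing docs
print(reported_a, reported_b)
```

Each file loses exactly one document from the total, which is why datasets with identical content but different shard counts report different numbers.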


guipenedo · 2024-09-10T21:57:15Z

hi, thanks for the PR
what if there are actually 0 documents? won't this count them as "1" now?
maybe we can just add another variable to count the actual nb of documents? I could see the current version also having issues if we were to use skip

lyuwen · 2024-09-10T23:17:04Z

I used the statement

di += min(1, di)

so that if there are 0 documents, di stays 0: it is not incremented and the stat remains 0.

lyuwen · 2024-09-10T23:20:29Z

Ah, I guess this way my edit would not count correctly when there is exactly 1 document 🤦
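A quick standalone check of that edge case, as a sketch only: `patched_count` is a hypothetical helper that applies `di += min(1, di)` after a loop over `n` dummy documents, with `di` initialized to 0 as an assumption.

```python
def patched_count(docs):
    """Apply the `di += min(1, di)` adjustment to the last enumerate index."""
    di = 0  # assumption: di starts at 0 so the empty case is defined
    for di, _ in enumerate(docs):
        pass
    di += min(1, di)
    return di


# Map the true document count n to the count the adjustment would report.
counts = {n: patched_count(range(n)) for n in range(4)}
print(counts)
```

With exactly one document the last index is 0, so nothing is added and the file is reported as empty; for two or more documents the result is correct.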

lyuwen · 2024-09-10T23:22:58Z

Then I'd suggest initializing di to -1 and always incrementing it after the loop.

lyuwen · 2024-09-11T03:36:16Z

Never mind, an extra variable is needed to correctly account for the skip.

Now using a separate variable `ndocs` to count the number of docs yielded.
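A rough sketch of the idea, written as a plain function rather than the actual reader method: only the name `ndocs` and the skip/limit logic come from this thread, while `read_all`, its return values, and the toy data are assumptions for illustration.

```python
def read_all(files, skip=0, limit=-1):
    """Collect documents across files and count, per file, how many were yielded."""
    yielded, per_file = [], []
    skipped = 0  # documents skipped so far, across all files
    li = 0       # documents yielded so far, across all files
    for docs in files:
        ndocs = 0  # documents yielded from this file only
        for document in docs:
            if skipped < skip:
                skipped += 1
                continue
            if limit != -1 and li >= limit:
                break
            yielded.append(document)
            ndocs += 1
            li += 1
        per_file.append(ndocs)  # stands in for the per-file "documents" stat
        if limit != -1 and li >= limit:
            break
    return yielded, per_file


files = [["a", "b", "c"], ["d", "e"]]
print(read_all(files))           # no skip, no limit
print(read_all(files, skip=1))   # first document skipped
print(read_all(files, limit=4))  # stop after four documents
```

Because `ndocs` is incremented only when a document is actually yielded, skipped documents and documents past the limit are not counted, and an empty file reports 0.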

guipenedo · 2024-09-11T11:35:28Z

LGTM, thanks again!

fixed a bug that in reader pipeline, the document count is always less than the actual number of documents by the number of files.

acb7d72

Update document counting based on advice from @guipenedo

177af1f

Now use a separate variable `ndocs` to count number of docs yielded.

guipenedo merged commit 9142e3e into huggingface:main Sep 11, 2024
4 checks passed
