fread fails with uneven number of columns when max columns in final row (with fill=TRUE and col.names set) #2691

Closed
Tracked by #3189
alexdthomas opened this issue Mar 20, 2018 · 3 comments · Fixed by #5119
@alexdthomas

This may be related to issue #1812, but as that one does not have a reproducible example to confirm, I thought it would be more appropriate to open a new issue.

When a file with an uneven number of columns has its widest row (the one with the maximum number of columns) as the final row, fread fails with the following error:

Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

This occurs even with fill = TRUE and with col.names set to the maximum number of column names.

Here is a small example:

library(data.table)
# Ragged file: the row with the most fields (7) is the last one
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
# Number of fields in the widest row
max.fields <- max(count.fields("foo", sep = ","))
fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", 1:max.fields, sep = ""))

However, when the row with the maximum number of fields is moved to the middle of the file (in this example row 6), fread behaves as expected.

text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))

I included this caveat in my answer to this Stack Overflow question.
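
On versions where the fread call above still fails, one possible workaround is to fall back to base R's read.table, which accepts the same count.fields result through col.names. The following is a minimal sketch of that idea, not a fix for fread itself. The temporary file path and the names `rows`, `path` and `df` are illustrative, and strip.white = TRUE is an added assumption to drop the space after each comma.

```r
# Same ragged data as in the report, with the widest row last.
rows <- c(
  "12223, University",
  "12227, bridge, Sky",
  "12828, Sunset",
  "13801, Ground",
  "14853, Tranceamerica",
  "14854, San Francisco",
  "15595, shibuya, Shrine",
  "16126, fog, San Francisco",
  "16520, California, ocean, summer, golden gate, beach, San Francisco"
)
path <- tempfile(fileext = ".csv")  # illustrative path, not the "foo" file above
writeLines(rows, path)

# The widest row decides how many columns to allocate.
max.fields <- max(count.fields(path, sep = ","))

# fill = TRUE pads short rows; col.names fixes the column count up front.
df <- read.table(
  path,
  header = FALSE,
  sep = ",",
  fill = TRUE,
  strip.white = TRUE,
  stringsAsFactors = FALSE,
  col.names = paste0("V", seq_len(max.fields))
)
dim(df)
```

Short rows end up with empty strings in the character columns, which can be converted to NA afterwards if needed.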

laptop session info

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_2.2.1        microbenchmark_1.4-4 data.table_1.10.4-3 

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.5.0     compiler_3.4.1   lazyeval_0.2.1   plyr_1.8.4       tools_3.4.1      pillar_1.2.1    
 [8] gtable_0.2.0     tibble_1.4.2     Rcpp_0.12.15     grid_3.4.1       rlang_0.2.0      munsell_0.4.3 

Also tested on this machine, with the same results:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vegan_2.4-4       lattice_0.20-35   permute_0.9-4     ggplot2_2.2.1     data.table_1.10.4 reshape2_1.4.3   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     cluster_2.0.6    magrittr_1.5     MASS_7.3-49      munsell_0.4.3    colorspace_1.3-2
 [7] rlang_0.1.6      stringr_1.2.0    plyr_1.8.4       tools_3.4.4      parallel_3.4.4   grid_3.4.4      
[13] gtable_0.2.0     nlme_3.1-131.1   mgcv_1.8-23      digest_0.6.15    yaml_2.1.14      lazyeval_0.2.1  
[19] tibble_1.4.2     Matrix_1.2-11    labeling_0.3     stringi_1.1.6    compiler_3.4.4   pillar_1.1.0    
[25] scales_0.5.0
@MichaelChirico MichaelChirico changed the title fread fails with uneven number of columns when max collumns in final row (with fill=TRUE and col.names set) fread fails with uneven number of columns when max columns in final row (with fill=TRUE and col.names set) Feb 19, 2019
@jangorecki jangorecki removed the High label Jun 3, 2020
@tlapak
Contributor

tlapak commented Jun 15, 2021

Both pieces of code produce identical output, up to row order, for me on the current CRAN version (1.14.0).

@Rajdeep-689

Hi Team,

I have the same problem. I have multiple .csv files in a directory and I read them by iterating over the file list. The code and the warnings are below. I have used fill=TRUE, but it does not seem to have any effect. Can someone please guide me?

Code:
setwd('E:/SOH-WORKING/CSV')
content <- rbindlist(
  lapply(
    list.files(path = 'E:/SOH-WORKING/CSV', pattern = "*.csv"),
    fread,
    select = c('#LOCATION', 'DIV_NAME', 'GROUP_NAME', 'DEPT_NAME', 'CLASS_NAME', 'SUB_NAME', 'ITEM_DESC', 'SEASON_DESC', 'STYLE_DESC', 'COLOR_DESC', 'SIZE_DESC', 'AVAILABLE_QTY')
  ),
  use.names = TRUE, fill = TRUE
)

Log:
Warning messages:
1: In FUN(X[[i]], ...) :
Stopped early on line 84. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
2: In FUN(X[[i]], ...) :
Stopped early on line 20. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
3: In FUN(X[[i]], ...) :
Stopped early on line 72. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
4: In FUN(X[[i]], ...) :
Stopped early on line 119. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
5: In FUN(X[[i]], ...) :
Stopped early on line 218. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
6: In FUN(X[[i]], ...) :
Stopped early on line 60. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
7: In FUN(X[[i]], ...) :
Stopped early on line 53. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
8: In FUN(X[[i]], ...) :
Stopped early on line 253. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:
9: In FUN(X[[i]], ...) :
Stopped early on line 214. Expected 96 fields but found 97. Consider fill=TRUE and comment.char=. First discarded non-empty line:

Please help me if there is any workaround.
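
One thing worth checking in the code above: the warnings are raised by fread (called as FUN inside lapply), while fill=TRUE is written after the closing parenthesis of lapply and is therefore an argument of rbindlist. There it fills missing columns across tables, not short rows within a file. Below is a minimal sketch of passing fill=TRUE to fread itself. The toy files and the column names A, B and C are hypothetical stand-ins, and whether this resolves the 96-versus-97 field mismatch in the real files cannot be determined from this thread.

```r
library(data.table)

# Toy directory with two ragged CSV files (hypothetical stand-ins for the real data).
csv_dir <- tempfile("csv_")
dir.create(csv_dir)
ragged <- c("A,B,C", "1,2,3", "4,5", "6,7,8")  # second data row is one field short
writeLines(ragged, file.path(csv_dir, "one.csv"))
writeLines(ragged, file.path(csv_dir, "two.csv"))

# pattern is a regular expression, so "\\.csv$" matches names ending in .csv;
# full.names = TRUE avoids relying on setwd().
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

content <- rbindlist(
  lapply(files, fread, fill = TRUE, select = c("A", "C")),  # fill reaches fread here
  use.names = TRUE,
  fill = TRUE  # this one only aligns columns across the per-file tables
)
print(content)
```

Extra arguments given to lapply after the function are forwarded to every fread call, so each file is read with the same fill and select settings before the tables are stacked.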

@ben-schwen ben-schwen added this to the 1.16.0 milestone Jan 5, 2024
@ben-schwen
Member

#5119 added the examples as test cases. Both now work with fread(file, fill=TRUE).
