
duplicate 'row.names' are not allowed #123

Open
stemangiola opened this issue Jan 10, 2024 · 6 comments

Comments

@stemangiola
Owner

Hi @multimeric ,

I get this error for this query

  library(CuratedAtlasQueryR)  # For accessing curated atlas data
  library(tidyverse)          # For data manipulation and visualization
  
  # Retrieve and process the metadata
  get_metadata() |> 
    # Filter for specific conditions
    filter(
      tissue_harmonised == "blood",          # Select only 'blood' tissue
      cell_type_harmonised == "b memory",    # Select only 'b memory' cell type
      disease %in% c("normal", "COVID-19")   # Filter for 'normal' and 'COVID-19' diseases
    ) |> 
    # Convert the data frame to a tibble for better handling
    as_tibble() |> 
    # Nest data excluding sample and disease columns
    nest(data_cells = -c(sample_, disease)) |> 
    # Add a new column 'n' that contains the row count of each nested dataframe
    mutate(n = map_int(data_cells, nrow)) |> 
    # Filter out groups with less than 10 rows
    filter(n > 9) |> 
    # Nest the data again, this time excluding the disease column
    nest(data_samples = -disease) |> 
    # Add columns for lower and upper count thresholds
    mutate(count_low = c(10, 10), count_high = c(500, 10)) |> 
    # Apply function to each row using parallel mapping
    mutate(data_samples = pmap(
      list(data_samples, count_low, count_high),
      ~ bind_rows(
        # Select nine samples closest to the lower count threshold
        ..1 |> 
          arrange(abs(n-..2)) |> 
          head(9),
        
        # Select one sample closest to the higher count threshold
        ..1 |> 
          arrange(abs(n-..3)) |> 
          head(1)
      )
    )) |> 
    # Unnest the nested 'data_samples' dataframe
    unnest(data_samples) |> 
    # Group by 'disease' and sort each group by 'n'
    with_groups(disease, ~ .x |> arrange(n)) |> 
    # Finally, unnest the 'data_cells' to expand the nested data
    unnest(data_cells) |> 
    get_single_cell_experiment()
! Some cells were filtered out while loading 503aba0168fd5b11b6719b7cf61126bf because of extremely low counts. The number of cells in the SingleCellExperiment will be less than the number of cells you have selected from the metadata.
Error in `map2()`:
In index: 1.
With name: counts.
Caused by error in `dplyr::summarise()`:
In argument: `sces = list(...)`.
In group 6: `file_id_db = "503aba0168fd5b11b6719b7cf61126bf"`.
Caused by error in `.rowNamesDF<-`:
! duplicate 'row.names' are not allowed
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In .f(.x[[i]], ...) : NAs introduced by coercion to integer range
> rlang::last_trace()
<error/purrr_error_indexed>
Error in `map2()`:
In index: 1.
With name: counts.
Caused by error in `dplyr::summarise()`:
In argument: `sces = list(...)`.
In group 6: `file_id_db = "503aba0168fd5b11b6719b7cf61126bf"`.
Caused by error in `.rowNamesDF<-`:
! duplicate 'row.names' are not allowed
---
Backtrace:
  1. ├─CuratedAtlasQueryR::get_single_cell_experiment(...)
  2. │ └─purrr::imap(...)
  3. │   └─purrr::map2(.x, vec_index(.x), .f, ...)
  4. │     └─purrr:::map2_("list", .x, .y, .f, ..., .progress = .progress)
  5. │       ├─purrr:::with_indexed_errors(...)
  6. │       │ └─base::withCallingHandlers(...)
  7. │       ├─purrr:::call_with_cleanup(...)
  8. │       └─CuratedAtlasQueryR (local) .f(.x[[i]], .y[[i]], ...)
  9. │         ├─base::do.call(...)
 10. │         ├─dplyr::pull(...)
 11. │         ├─dplyr::summarise(...)
 12. │         └─dplyr:::summarise.grouped_df(...)
 13. │           └─dplyr:::summarise_cols(.data, dplyr_quosures(...), by, "summarise")
 14. │             ├─base::withCallingHandlers(...)
 15. │             └─dplyr:::map(quosures, summarise_eval_one, mask = mask)
 16. │               └─base::lapply(.x, .f, ...)
 17. │                 └─dplyr (local) FUN(X[[i]], ...)
 18. │                   └─mask$eval_all_summarise(quo)
 19. │                     └─dplyr (local) eval()
 20. └─CuratedAtlasQueryR:::group_to_sce(...)
 21.   ├─methods::as(...)
 22.   │ └─methods:::.class1(object)
 23.   └─tibble::column_to_rownames(...)
 24.     └─base::`rownames<-`(`*tmp*`, value = .data[[var]])
 25.       ├─base::`row.names<-`(`*tmp*`, value = value)
 26.       └─base::`row.names<-.data.frame`(`*tmp*`, value = value)
 27.         └─base::`.rowNamesDF<-`(x, value = value)
 28.           └─base::stop("duplicate 'row.names' are not allowed")
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRblas.so 
LAPACK: /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3          forcats_1.0.0            stringr_1.5.1            dplyr_1.1.4             
 [5] purrr_1.0.2              readr_2.1.4              tidyr_1.3.0              tibble_3.2.1            
 [9] ggplot2_3.4.4            tidyverse_2.0.0          CuratedAtlasQueryR_1.1.1

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.21            splines_4.3.0               later_1.3.1                
  [4] bitops_1.0-7                polyclip_1.10-6             fastDummies_1.7.3          
  [7] lifecycle_1.0.4             StanHeaders_2.33.1.9000     globals_0.16.2             
 [10] processx_3.8.2              lattice_0.22-5              MASS_7.3-60                
 [13] magrittr_2.0.3              plotly_4.10.3               httpuv_1.6.12              
 [16] Seurat_5.0.1                sctransform_0.4.1           spam_2.10-0                
 [19] sp_2.1-2                    pkgbuild_1.4.2              spatstat.sparse_3.0-3      
 [22] reticulate_1.34.0           cowplot_1.1.1               pbapply_1.7-2              
 [25] DBI_1.1.3                   minqa_1.2.6                 RColorBrewer_1.1-3         
 [28] abind_1.4-5                 zlibbioc_1.48.0             GenomicRanges_1.54.1       
 [31] Rtsne_0.16                  RCurl_1.98-1.12             BiocGenerics_0.48.1        
 [34] GenomeInfoDbData_1.2.11     IRanges_2.36.0              S4Vectors_0.40.2           
 [37] ggrepel_0.9.4               inline_0.3.19               irlba_2.3.5.1              
 [40] listenv_0.9.0               spatstat.utils_3.0-4        goftest_1.2-3              
 [43] RSpectra_0.16-1             spatstat.random_3.2-2       fitdistrplus_1.1-11        
 [46] parallelly_1.36.0           leiden_0.4.3.1              codetools_0.2-19           
 [49] DelayedArray_0.28.0         tidyselect_1.2.0            farver_2.1.1               
 [52] lme4_1.1-35.1               matrixStats_1.2.0           stats4_4.3.0               
 [55] spatstat.explore_3.2-5      duckdb_0.9.2-1              jsonlite_1.8.8             
 [58] ellipsis_0.3.2              progressr_0.14.0            ggridges_0.5.4             
 [61] survival_3.5-7              tools_4.3.0                 ica_1.0-3                  
 [64] Rcpp_1.0.11                 glue_1.6.2                  gridExtra_2.3              
 [67] SparseArray_1.2.2           MatrixGenerics_1.14.0       GenomeInfoDb_1.38.2        
 [70] HDF5Array_1.30.0            withr_2.5.2                 loo_2.6.0                  
 [73] numDeriv_2016.8-1.1         fastmap_1.1.1               boot_1.3-28.1              
 [76] rhdf5filters_1.14.0         fansi_1.0.5                 callr_3.7.3                
 [79] digest_0.6.33               timechange_0.2.0            R6_2.5.1                   
 [82] mime_0.12                   colorspace_2.1-0            scattermore_1.2            
 [85] tensor_1.5                  spatstat.data_3.0-3         utf8_1.2.4                 
 [88] generics_0.1.3              data.table_1.14.8           prettyunits_1.2.0          
 [91] httr_1.4.7                  htmlwidgets_1.6.3           S4Arrays_1.2.0             
 [94] uwot_0.1.16                 pkgconfig_2.0.3             gtable_0.3.4               
 [97] blob_1.2.4                  lmtest_0.9-40               SingleCellExperiment_1.24.0
[100] XVector_0.42.0              htmltools_0.5.7             dotCall64_1.1-1            
[103] Biobase_2.62.0              SeuratObject_5.0.1          scales_1.3.0               
[106] png_0.1-8                   nanonext_0.11.0             rstudioapi_0.15.0          
[109] tzdb_0.4.0                  reshape2_1.4.4              nlme_3.1-164               
[112] curl_5.2.0                  nloptr_2.0.3                crew_0.7.0                 
[115] zoo_1.8-12                  rhdf5_2.46.0                KernSmooth_2.23-22         
[118] parallel_4.3.0              miniUI_0.1.1.1              pillar_1.9.0               
[121] grid_4.3.0                  vctrs_0.6.5                 RANN_2.6.1                 
[124] promises_1.2.1              dbplyr_2.4.0                xtable_1.8-4               
[127] cluster_2.1.6               cli_3.6.2                   compiler_4.3.0             
[130] rlang_1.1.2                 crayon_1.5.2                crew.cluster_0.1.4         
[133] future.apply_1.11.0         ps_1.7.5                    plyr_1.8.9                 
[136] stringi_1.8.2               rstan_2.32.3                viridisLite_0.4.2          
[139] deldir_2.0-2                QuickJSR_1.0.8              assertthat_0.2.1           
[142] getip_0.1-3                 lmerTest_3.1-3              munsell_0.5.0              
[145] lazyeval_0.2.2              spatstat.geom_3.2-7         V8_4.4.0                   
[148] Matrix_1.6-4                RcppHNSW_0.5.0              hms_1.1.3                  
[151] patchwork_1.1.3             future_1.33.0               Rhdf5lib_1.24.0            
[154] shiny_1.8.0                 SummarizedExperiment_1.32.0 ROCR_1.0-11                
[157] mirai_0.11.3                igraph_1.5.0.1              RcppParallel_5.1.7 
@multimeric
Collaborator

This error suggests that you have multiple rows in your input metadata with the same cell ID (i.e. the `cell_` column). Can you double-check that your query isn't producing duplicates? For example, is it possible your `pmap` call is binding multiple rows with the same cell ID?

@stemangiola
Owner Author

Sorry, I should have tested it before; you were right.

If it's not too annoying, we could capture this error with a more informative message:

CuratedAtlasQueryR says: ...... Please check that your input metadata does not include duplicated elements in the `cell_` column. For example, execute `<your input metadata> |> count(cell_, name = "number_of_cell_id_instances") |> filter(number_of_cell_id_instances > 1)`
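
For reference, a toy version of that check on a plain tibble (the `cell_` column name is assumed, matching the package metadata):

```r
library(dplyr)

# Toy metadata with one duplicated cell ID
metadata <- tibble(
  cell_   = c("AAAC-1", "AAAC-1", "TTTG-2"),
  sample_ = c("s1", "s2", "s3")
)

# Proposed diagnostic: list cell IDs that appear more than once
metadata |>
  count(cell_, name = "number_of_cell_id_instances") |>
  filter(number_of_cell_id_instances > 1)
#> one row: cell_ = "AAAC-1", number_of_cell_id_instances = 2
```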

@multimeric
Collaborator

Would you rather I test the input data frame for duplicates (big performance implications), or just catch errors resulting from the code where I try to set the row names, and throw a better error message?
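
The second option could look something like this (a sketch only, not the package's actual code; the helper name is hypothetical):

```r
# Sketch: wrap the row-name assignment and rethrow a more actionable
# error when the cell IDs are duplicated (helper name is made up)
set_cell_rownames <- function(df, var = "cell_") {
  tryCatch(
    tibble::column_to_rownames(df, var = var),
    error = function(e) {
      if (grepl("duplicate 'row.names'", conditionMessage(e), fixed = TRUE)) {
        stop(
          "Your input metadata contains duplicated values in the `", var,
          "` column. Inspect them with:\n",
          "  metadata |> count(", var, ") |> filter(n > 1)",
          call. = FALSE
        )
      }
      stop(e)
    }
  )
}
```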

@stemangiola
Owner Author

stemangiola commented Jan 11, 2024

Would

input |> pull(cell_) |> duplicates() |> length() > 0

take long for 100M rows?

or faster methods here

https://stackoverflow.com/questions/37148567/fastest-way-to-remove-all-duplicates-in-r

or just to check if duplicates exist -> anyDuplicated

https://stackoverflow.com/questions/5263498/how-to-test-whether-a-vector-contains-repetitive-elements

...

But maybe catching the error is actually the right thing to do, as that is exactly what we are doing: replacing one error with another.
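
For what it's worth, `anyDuplicated()` short-circuits at the first duplicate it finds and returns `0` when there is none, so an upfront check on an in-memory table could be as cheap as (sketch; `input` assumed to be a data frame already collected into R):

```r
# Fail fast before any expensive download/assembly work
if (anyDuplicated(input$cell_) > 0) {
  stop("Duplicated cell IDs found in the `cell_` column of the input metadata.")
}
```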

@multimeric
Collaborator

Yeah the performance hit probably won't be too bad compared to the time it takes to actually download and process the data. I think the best function to use to detect duplicates would be one that dbplyr supports so it can be run in the database instead of purely in R.

Up to you though.
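
A duplicate check written with plain dplyr verbs should translate via dbplyr and run inside the database (sketch; `metadata` may be a remote `tbl` or an in-memory tibble):

```r
library(dplyr)

# count() + filter() translate to GROUP BY ... HAVING style SQL,
# so for a remote table this work stays in the database
duplicated_cells <- metadata |>
  count(cell_) |>
  filter(n > 1)

# For a remote table the query stays lazy until collected:
# duplicated_cells |> head(5) |> collect()
```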

@stemangiola
Owner Author

The input could easily be a tibble, in case you manipulate it first.

I think catching the error is the most transparent thing we can do.
