Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove non-API calls in C #6180

Open
5 of 11 tasks
MichaelChirico opened this issue Jun 13, 2024 · 17 comments
Open
5 of 11 tasks

Remove non-API calls in C #6180

MichaelChirico opened this issue Jun 13, 2024 · 17 comments
Milestone

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Jun 13, 2024

We are getting these NOTEs on CRAN, e.g.

Found non-API calls to R: ‘SETLENGTH’, ‘SET_S4_OBJECT’,
‘SET_TRUELENGTH’, ‘UNSET_S4_OBJECT’

I'm really skeptical about the SETLENGTH and SET_TRUELENGTH parts, which have long been a key underlying feature of data.table internals. I propose we try and keep using those unless CRAN really forces our hands. It be really great if someone wanted to do a performance comparison of the package with/without those calls.

Anyway, it should be easier to remove the non-API S4 calls.

The list does keep changing as this is an actively evolving piece of r-devel. Here's the current list of things we should (eventually) address:

  • LEVELS
  • NAMED
  • SETLENGTH
  • SET_GROWABLE_BIT
  • SET_S4_OBJECT
  • SET_TRUELENGTH
  • SET_TYPEOF
  • STRING_PTR
  • TRUELENGTH
  • UNSET_S4_OBJECT
  • isFrame
@MichaelChirico MichaelChirico added this to the 1.16.0 milestone Jun 13, 2024
@tdhock
Copy link
Member

tdhock commented Jun 13, 2024

there is a web page which lists all of the API calls, grouped into 3 categories
https://yutannihilation.github.io/R-fun-API/
see also the R-devel thread which is linked from that page

@ecoRoland2
Copy link

You should probably be proactive and contact R-core regarding your need of an API to SETLENGTH and SET_TRUELENGTH. I would hate seeing slower performance in data.table. It could cause some difficult decisions at my day job.

@MichaelChirico
Copy link
Member Author

if this is important for you (or anyone else reading) please take the time to construct a performance comparison.

@ecoRoland2
Copy link

I would but unfortunately I think you need at least basic knowledge of C to do that.

@ben-schwen
Copy link
Member

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment. But there is no safe way to handle SET_TRUELENGTH right?

@tdhock
Copy link
Member

tdhock commented Jun 14, 2024

Where is SET_TRUELENGTH used? https://github.com/search?q=repo%3ARdatatable%2Fdata.table%20SET_TRUELENGTH&type=code
one place is in rbindlist which is a pretty important function for performance.
@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

@ben-schwen
Copy link
Member

ben-schwen commented Jun 14, 2024

Where is SET_TRUELENGTH used?

We use it as part of memrecycle and subsetting so its basically in most operations :D

But we are not alone with this problem r-lib/vctrs#1933

@Anirban166
Copy link
Member

@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

Sure, but I'm wondering what the tests will be focused on (since this affects multiple functions) and then what about SETLENGTH?

@MichaelChirico
Copy link
Member Author

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment.

I don't see lengthgets mentioned in WRE.

By the same definition, touching S4 objects from C is also non-API! 🤔

@MichaelChirico
Copy link
Member Author

I see, lengthgets is just what length<- calls under the hood. But that induces a copy:

a = 1:10
address(a)
# [1] "0x5651b8564b30"
length(a) <- 20
address(a)
# [1] "0x5651b8ab5c68"

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jun 20, 2024

It looks like more of our usage has been marked as non-API in the meantime:

Result: NOTE 
  File ‘data.table/libs/data_table.so’:
    Found non-API calls to R: ‘LEVELS’, ‘NAMED’, ‘SETLENGTH’,
      ‘SET_GROWABLE_BIT’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’,
      ‘UNSET_S4_OBJECT’

new:

LEVELS():

#define IS_UTF8(x) (LEVELS(x) & 8)
#define IS_ASCII(x) (LEVELS(x) & 64)
#define IS_LATIN(x) (LEVELS(x) & 4)

#define ENC_KNOWN(x) (LEVELS(x) & 12)

NAMED():

#ifndef MAYBE_SHARED
# define MAYBE_SHARED(x) (NAMED(x) > 1)
#endif
#ifndef MAYBE_REFERENCED
# define MAYBE_REFERENCED(x) ( NAMED(x) > 0 )
#endif

data.table/src/assign.c

Lines 548 to 552 in fab93a9

i+1, NAMED(thisvalue), MAYBE_SHARED(thisvalue), length(values), length(cols));
}
thisvalue = copyAsPlain(thisvalue); // PROTECT not needed as assigned as element to protected list below.
} else {
if (verbose) Rprintf(_("Direct plonk of unnamed RHS, no copy. NAMED==%d, MAYBE_SHARED==%d\n"), NAMED(thisvalue), MAYBE_SHARED(thisvalue)); // e.g. DT[,a:=as.character(a)] as tested by 754.5

SET_GROWABLE_BIT():

#if !defined(R_VERSION) || R_VERSION < R_Version(3, 4, 0)
# define SET_GROWABLE_BIT(x) // #3292
#endif

SET_GROWABLE_BIT(VECTOR_ELT(DT,i)); // #3292

@tdhock
Copy link
Member

tdhock commented Jun 21, 2024

@Jean-Romain
Copy link

Jean-Romain commented Jun 21, 2024

@tdhock there is nothing about SETLENGTH and SET_TRUELENGTH in this new section. I'm in trouble with that one too. I can remove the feature that uses it but I'm wondering how data.table team will handle as it is a key feature for data.table. See also https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html

@jangorecki
Copy link
Member

I can imagine group by, when not gforce optimized, is setting true length as well. It allocates memory for a biggest group and then reuses that space for smaller groups as well.
Similarly frollapply is doing the same for adaptive=TRUE (possibly not yet in master).

@ben-schwen
Copy link
Member

This issue also makes our CI fail since we get now a new 3rd NOTE: Compiled code should not call non-API entry points in R. Should we alter the CI to allow 3 notes on dev?

@jangorecki
Copy link
Member

Yes, for the moment till we address this issue. CI is used to spot new issues, this one is already known.

@MichaelChirico
Copy link
Member Author

After #6312 and #6313 I'll bump the milestone here to the next release, I think we've put in good effort and definitely applied the fixes they recommend in R-exts. The others remain more difficult to satisfy & can be handled later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants