Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove PdfPages::delete_page_at_index() in favour of PdfPage::delete(). #67

Closed
ajrcarey opened this issue Feb 4, 2023 · 11 comments
Closed
Assignees

Comments

@ajrcarey
Copy link
Owner

ajrcarey commented Feb 4, 2023

If a reference to a PdfPage is retrieved using PdfPages::get(), then that page is deleted from the document using PdfPages::delete_page_at_index(), we still possess a (now invalid) PdfPage reference. This is effectively a use-after-free violation. Avoid this by removing the PdfPages::delete_page_at_index() function and instead use a PdfPage::delete() function that consumes mut self. This ensures the caller cannot hold an invalid reference to a page after it is deleted.

@ajrcarey
Copy link
Owner Author

ajrcarey commented Feb 4, 2023

Added PdfPages::delete_page_at_index() to deprecated items list as part of #36.

@ajrcarey
Copy link
Owner Author

ajrcarey commented Feb 5, 2023

Added PdfPages::delete_page_range() to deprecated items list as part of #36.

@ajrcarey
Copy link
Owner Author

ajrcarey commented Feb 5, 2023

Actually, if PdfDocument::pages() were to return a &PdfPages / &mut PdfPages reference (rather than the owned PdfPages instance it returns at the moment), and if PdfPages::get() were to return a &PdfPage / &mut PdfPage reference (rather than the owned PdfPage instance it returns at the moment), then it might be safe to reinstate these functions, as Rust would be able to manage the lifetimes safely. This could be done as part of #47. For now, though, deprecating the functions is the safe move.

@ajrcarey
Copy link
Owner Author

ajrcarey commented Feb 5, 2023

Deprecated functions in PdfPages. Added PdfPage::delete() function, taking advantage of new PdfPageIndexCache introduced in #60. Removed references to deprecated functions from examples/concat.rs and examples/copy_objects.rs. Removed reference to deprecated function from PdfPageIndexCache unit tests. Updated documentation. Package for inclusion in release 0.7.30.

@mara004
Copy link

mara004 commented Mar 23, 2023

As PDFium only provides a document-level page deleter, how did you implement this?
How can you be sure to get the right index for a PdfPage?

mara004 added a commit to pypdfium2-team/pypdfium2 that referenced this issue Mar 23, 2023
@ajrcarey
Copy link
Owner Author

Page indices of each FPDF_PAGE in each open FPDF_DOCUMENT are cached as the page references are instantiated by Pdfium, either when a caller retrieves a page or creates a new page. This cache of page indices is used from within PdfPage::delete(). The implementation is at https://github.com/ajrcarey/pdfium-render/blob/master/src/page_index_cache.rs.

@mara004
Copy link

mara004 commented Mar 23, 2023

I see. We can't do this in pypdfium2, as we also expose the raw pdfium API, so the caller might do operations that change page indices without helper classes knowing.
In general, I would find it safer if pdfium could just provide something like your get_index_for_page() natively to reduce the risk of a page index tracker getting out of sync.

@ajrcarey
Copy link
Owner Author

ajrcarey commented Mar 23, 2023

I suspect it would be difficult to manage in a garbage-collected language because the point at which references are dropped is likely non-deterministic, making it difficult to know when to update any cache. Rust's memory model proves advantageous in this case.

@mara004
Copy link

mara004 commented Mar 23, 2023

Sorry, I don't understand yet. What has this got to do with object references?
Apart from that, attaching code to be run on garbage collection of an object is possible in Python.

@ajrcarey
Copy link
Owner Author

When a caller calls FPDF_ClosePage() (or whatever the function is called), any reference to that FPDF_PAGE object that is still held by any other caller is no longer valid (a "dangling" reference). Garbage collection is not an ideal way to determine when to tidy up these dangling references, because there will be a non-deterministic delay between invalidating the reference and dropping any held copies.

However, like you say, you expose the raw Pdfium API, so the point is somewhat moot as I imagine you probably don't have a way of intercepting when callers are creating and releasing FPDF_* object references.

@mara004
Copy link

mara004 commented Mar 24, 2023

Yes, the non-deterministic nature of garbage collection is a problem, see this section of pypdfium2's readme.

However, I think we're kind of drifting away from the initial point. What I meant was just that a page object -> index mapping can be slightly dangerous if not implemented natively in PDFium. If there is any change to page order the mapping fails to track, then page.delete() might remove a different page than intended. It will be OK if your code is correct and you do not expose the raw API, though. (Except that it also adds additional, hidden complexity/expense on top of pdfium's API.)

In pypdfium2 this is something we cannot do, as callers are encouraged to use helpers in conjunction with the raw API, and the raw API may potentially be used to do arbitrary modifications to page order.

There's not much point in attempting to intercept raw API calls, because (a) we want to make the raw API available to the caller as-is (it would be confusing if calling a raw function would implicitly trigger something) (b) new APIs that change page order might be added to pdfium, so a page cache would imply the same risks, and (c) the purpose does not justify the complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants