Updated next_partname generation #646

Open · wants to merge 1 commit into master

Conversation

@simkusr commented Sep 2, 2020

Partnames are now generated on the fly and cached for the next iteration of slides. This update solves the bottleneck mentioned in #644.

@simkusr (Author) commented Sep 2, 2020

So what I did is: on each loop iteration I create a new key in self.partnames corresponding to the segment of the received tmpl that comes after /ppt/, and increment it by 1. This avoids having to loop over all existing partnames each time to identify which partname is available next.

I know it could be improved further; I just need some tips on what I have missed in this improvement...
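
Assembled from the diff fragments reviewed below, the change amounts to roughly the following (an illustrative reconstruction for readers of this thread, not the exact committed code):

import os
from collections import defaultdict

from pptx.opc.packuri import PackURI


class OpcPackage(object):

    def __init__(self):
        super(OpcPackage, self).__init__()
        # cache: number of partnames issued so far, keyed by part type (e.g. "slides")
        self.partnames = defaultdict(int)

    def next_partname(self, tmpl):
        # tmpl is like "/ppt/slides/slide%d.xml"; take the segment after "/ppt/"
        # (note: partnames always use "/", so splitting on os.sep assumes a POSIX separator)
        name = tmpl.split(os.sep)[2]
        self.partnames[name] += 1
        candidate_partname = tmpl % self.partnames[name]
        return PackURI(candidate_partname)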

@scanny (Owner) left a comment:

Here's some feedback. Note you are unlikely to get this committed, so you should probably focus on getting it working well enough for your purposes in a local spike.

@@ -10,3 +10,4 @@ _scratch/
/spec/gen_spec/spec*.db
tags
/tests/debug.py
venv
@scanny (Owner):

A git-ignore entry particular to your personal development environment belongs in your own global gitignore file, not the project's. You'll find that somewhere like ~/.config/git/ignore, or perhaps configured via ~/.gitconfig.

@simkusr (Author):

Thank you for that, I will check it and update.

@@ -26,6 +29,7 @@ class OpcPackage(object):

def __init__(self):
super(OpcPackage, self).__init__()
self.partnames = defaultdict(int)
@scanny (Owner):

Avoid adding instance variables to the interface of the class (in other words, make instance variables private; if you need the value outside the class, expose it through a property). So this would be self._partnames.
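
A minimal sketch of that suggestion (illustrative only; add the property only if callers outside the class actually need the value):

from collections import defaultdict


class OpcPackage(object):

    def __init__(self):
        super(OpcPackage, self).__init__()
        self._partnames = defaultdict(int)  # private: not part of the class's public interface

    @property
    def partnames(self):
        """Expose the cache to outside callers only if genuinely needed."""
        return self._partnames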

return PackURI(candidate_partname)
raise Exception("ProgrammingError: ran out of candidate_partnames")
name = tmpl.split(os.sep)[2]
self.partnames[name] += 1
@scanny (Owner):

I'm not seeing where ._partnames gets initialized with the part-names already present in the package. I think a better strategy would be to make ._partnames an @lazyproperty (rather than an instance variable) and do the initialization in that lazyproperty (which only gets executed the first time it is called).

if candidate_partname not in partnames:
return PackURI(candidate_partname)
raise Exception("ProgrammingError: ran out of candidate_partnames")
name = tmpl.split(os.sep)[2]
@scanny (Owner):

This seems unreliable. Are you sure you can count on the name always being the third item? Better to figure it is the last item, like tmpl.split(os.sep)[-1]. Also, I'm not sure there's a compelling reason to split out the name at all. What is lost by keeping the path in there? The more important thing would be to split before the %d so you could match existing part-names with something like part_name.startswith("/ppt/slides/slide"), although be careful because /ppt/slides/slideMaster42.xml also starts with "/ppt/slides/slide". You'll probably need to use a regular expression or similar method to separate out the numeric suffix.

name = tmpl.split(os.sep)[2]
self.partnames[name] += 1
candidate_partname = tmpl % self.partnames[name]
return PackURI(candidate_partname)

@scanny (Owner):

Where are the tests? You would have found most of these problems with good tests.

@Tzahi12345:
Just curious, why wouldn't these changes get merged (even if your comments were addressed)?

I understand that a PowerPoint with thousands of slides isn't a case that you should overhaul a bunch of code for, but if it's a simple addition I'm wondering what the hesitation is.


I'll share my case since it's a bit relevant.

I'm generating a PowerPoint 12k slides large. To do so I'm creating plots and adding the plots to the slides, 60 slides at a time, in a pipeline. So, for example, while I'm generating plots for slides 60-119, I simultaneously create slides 0-59. In the next iteration I'll generate plots for slides 120-179 and create slides 60-119 with the plots I just generated, and so on.

It always takes longer to create the slides than to create the plots, making me wait for the former on each iteration. The reason I share this is to explain the following graph, which shows the difference in time between the slide and plot generation:

[image: graph of the time difference between slide generation and plot generation per iteration]

At first, creating the first 60 slides only takes 30s longer than creating the plots; the delay stays under a minute for about 3000 slides, but thereafter quickly grows to unmanageable levels. I took the integral of the trendline, and the total delay adds up to 24 hours. If it were constant at 30s, it would be just an hour and a half.

Do you have any recommendations here? Perhaps I could create a bunch of smaller PowerPoints and manually stitch them together; it would definitely shave off a bunch of time, but it requires manual work in a process I was trying to automate. Hope this comment wasn't too long-winded; I just wanted to explain the sticky situation I'm in.

@scanny (Owner) commented Jul 21, 2021

Hi @Tzahi12345, the performance problem here arises from the .next_partname() method having O(N^2) time complexity. On its first execution it finds zero existing partnames, the second finds one, and so on, producing n(n-1)/2 work in total, where n is the number of calls.

The reason it does this is that the "state" of partname assignments is reflected only in the current underlying XML, or package-state anyway. So each time a new partname is requested, that full state needs to be consulted to rule out any duplicates.

Any caching of that partname-assignments state in a more efficient data structure (like a dict or set having O(1) time complexity) risks getting out of synch with the actual package state. You would have to guarantee that every new partname assignment "registered" itself with that redundant record/cache.

This is probably possible, but would require substantial reasoning about the current code and where to place such a "one-and-only-one-per-open-package" data structure and the code that updates it. Do we know that every new part calls this method, for example? And what about when parts are removed?

Another possible way would be to give parts partnames that were not sequential, like maybe SHA1 hashes or hash prefixes computed from something including datetime or whatever. That would be ugly but I don't think it would violate the OPC spec.
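
A rough illustration of that idea (a hypothetical helper, not anything python-pptx provides; the prefix/extension convention is an assumption):

import hashlib
import os
from datetime import datetime


def hashed_partname(prefix, ext=".xml"):
    """Return a partname like "/ppt/slides/slide_1a2b3c4d.xml" using a hash
    suffix instead of a sequential number (hypothetical sketch only)."""
    seed = datetime.utcnow().isoformat().encode("utf-8") + os.urandom(8)
    return "%s_%s%s" % (prefix, hashlib.sha1(seed).hexdigest()[:8], ext)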

Another approach is to set an "I'll be really careful" flag and allow a cache to be built that could possibly become invalid if you didn't follow a prescribed procedure to request new partnames. We did something similar to this for adding new shapes to a slide when a sponsor needed to add thousands of shapes to each slide. I think we called it turbo mode, so that might be worth a search.

Anyway, I suppose the main reason is that it's not a big enough problem for someone with project funding to have decided they're willing to sponsor it.

If you wanted to "fix" it for your own purposes, you could fork python-pptx and use a local patched version that takes roughly the approach sketched out at the top of this thread. I suspect you can be pretty confident that the OpcPackage instance for each .pptx file (package) you "open" is effectively a singleton for that package, so if you cache the partnames there somehow then you should be able to generate new ones in O(1) time. Theoretically you also need to detect when partnames are removed, but I don't think that matters enough to bother with. The main thing is not to assign the same name twice; I don't expect that re-using abandoned names is anywhere close to worth the trouble.

One other approach might be to monkey-patch OpcPackage to add the cache data-structure to it and "replace" its .next_partname() method with one you write yourself. If you could get that to work it would be a lot less trouble than a fork. The main thing to get rid of in that method is the partnames = [part.partname for part in self.iter_parts()] line which is where the O(N^2) comes from. You still need to do that once to initialize the partname cache, but not every time a new one is created.
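
A hedged sketch of that monkey-patch idea, assuming a one-time cache initialization from iter_parts(); the helper names and the regex are illustrative, and whether patching the class is safe for your workload is exactly the "be really careful" caveat from earlier:

import re
from collections import defaultdict

from pptx.opc.package import OpcPackage
from pptx.opc.packuri import PackURI

_NUMBERED = re.compile(r"^(?P<stem>.*?)(?P<n>\d+)?(?P<ext>\.\w+)$")


def _patched_next_partname(self, tmpl):
    """O(1) replacement for OpcPackage.next_partname() (use at your own risk)."""
    if not hasattr(self, "_partname_counts"):
        # one-time O(N) scan to seed the cache with names already in the package
        counts = defaultdict(int)
        for part in self.iter_parts():
            m = _NUMBERED.match(str(part.partname))
            if m and m.group("n"):
                key = m.group("stem") + "%d" + m.group("ext")
                counts[key] = max(counts[key], int(m.group("n")))
        self._partname_counts = counts
    self._partname_counts[tmpl] += 1
    return PackURI(tmpl % self._partname_counts[tmpl])


OpcPackage.next_partname = _patched_next_partname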

@Tzahi12345:
> Another approach is to set an "I'll be really careful" flag and allow a cache to be built that could possibly become invalid if you didn't follow a prescribed procedure to request new partnames. We did something similar to this for adding new shapes to a slide when a sponsor needed to add thousands of shapes to each slide. I think we called it turbo mode, so that might be worth a search.

I think this may be the best/safest approach here. I understand the issues with the cache; one thing I'm curious about is why it might get out of sync. I'm not familiar with the codebase, but if all add/remove part operations use the same function, every addition/removal of a part could update the cache, so there must be something I'm missing here.

Anyway, I do understand why it's not high priority and why it might never get merged; it's not a highly requested feature, nor something a sponsor is requesting. I did try out @simkusr's fork and it worked well -- I'm not doing anything fancy, so I'm not surprised. I'll keep using that for the time being. Thank you for the lengthy answer!

@scanny (Owner) commented Jul 22, 2021

@Tzahi12345 yes, I think you're right. I don't know of any place that adds a part that doesn't use package.next_partname(). I think we've gotten that factored pretty well. If I were building a production capability I wouldn't stop at believing and would proceed to proving it to myself, but we're not talking about that at the moment.

On the other hand, if I were only doing it for my own uses, I'd definitely implement it first and give it a try. If it worked for me I'd leave it at that.

I'd start by adding a max_partname dict to OpcPackage that gets initialized with the maximum partname number found for each partname template (like "ppt/slides/slide%d.xml", I vaguely remember) in the initial package, and then gets used to look up and increment the max for each new partname that is requested.

So the (O(1)) implementation of .next_partname would become something like:

@lazyproperty
def _max_partnames(self):
    """defaultdict(int) mapping partname templates to max count so far.

    ... more details needed to understand proper use ... all keys work, value is zero for template with
    no partname so far, etc. ...
    """
    max_partnames = collections.defaultdict(int)
    # --- initialize partname-number cache ---
    for part in self.iter_parts():
        tmpl, n = parse_partname(part.partname)
        max_partnames[tmpl] = max(max_partnames[tmpl], n)
    return max_partnames

def next_partname(self, tmpl):
    """Return next available |PackURI| partname matching `tmpl`."""
    n = self._max_partnames[tmpl] + 1
    self._max_partnames[tmpl] = n
    return PackURI(tmpl % n)

You'll need to work out your own implementation of parse_partname(). Shouldn't be too hard: split at the period and look for the digits at the end of the first piece. Maybe use a regex.
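
For example, a parse_partname() along these lines would do (a sketch, assuming every numbered partname ends with digits immediately before the extension):

import re

_PARTNAME = re.compile(r"^(?P<prefix>.+?)(?P<num>\d+)(?P<ext>\.\w+)$")


def parse_partname(partname):
    """Split "/ppt/slides/slide9.xml" into ("/ppt/slides/slide%d.xml", 9).

    Partnames with no numeric suffix map to (partname, 0), so they never
    produce a usable template but also never collide with one.
    """
    m = _PARTNAME.match(str(partname))
    if m is None:
        return str(partname), 0
    return "%s%%d%s" % (m.group("prefix"), m.group("ext")), int(m.group("num"))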

@Tzahi12345:
Thanks for the tips! I'll see if I can get the time to implement this -- I probably will soon, as this PR isn't as great as I thought (there was still an O(n^2) issue). In orange you can see the performance of this PR.

[image: graph of per-iteration delay, with this PR's performance shown in orange]

Over the course of the 27-hour job, this only saves me 3.5 hours. I haven't looked through the codebase, so I'm not sure where your implementation and @simkusr's differ, but these are the results I saw. Worst case, there's some other scaling issue hiding there, but since this issue was found through profiling, I'll assume there isn't.

@scanny (Owner) commented Jul 23, 2021

@Tzahi12345 try skipping adding the image and see how the time changes. Still generate it, just comment out the .add_picture() call and see how the curve looks.

There's potentially quadratic time in .add_picture() because it checks to make sure that the picture isn't already in the package (PowerPoint only stores the same image once).

Also, how big is the file that you generate and how much memory does your machine have? The whole file times a significant multiplier (at least 3, maybe 8) is being built up in memory until you save. You could just be swapping to disk after a certain point.

@scanny (Owner) commented Jul 23, 2021

Btw, I'm not seeing anything O(N^2) in the .next_partname of this PR. I don't think it would work exactly as it is (it doesn't load the partnames that already exist in the package at initial load time), but it should run in constant time (O(1)).

@MartinPacker:
Is it feasible to disembowel :-) add_picture() in a local fork to remove the check? Similarly, is it feasible to add a "check for existence" boolean parameter that defaults to True?

@scanny (Owner) commented Jul 23, 2021

@MartinPacker such a thing could make sense. I think there would be better options, like caching the SHA1 hash of all current images in a set or dict to allow O(1) lookup. But right now I'm not sure that's where the time is going. I can't see it taking even a second to search through 12,000 objects for a constant (already computed) SHA1. I'm thinking the time has to be going somewhere else. My money is on memory being swapped to disk at the moment. We've got something like 1000 seconds to account for on each slide.
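
As a sketch of that "cache the SHA1s" idea (illustrative class and names, not the python-pptx API):

class ImageSha1Cache(object):
    """Hypothetical O(1) duplicate-image lookup keyed by SHA1 digest."""

    def __init__(self, image_parts):
        # one-time O(N) build: map each existing image part's SHA1 to the part
        self._by_sha1 = {part.sha1: part for part in image_parts}

    def find(self, sha1):
        """Return the already-present image part for `sha1`, or None."""
        return self._by_sha1.get(sha1)

    def add(self, image_part):
        self._by_sha1[image_part.sha1] = image_part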

@Tzahi12345:

I re-ran and checked for hard faults after a noticeable increase in delay (slide index 5000 or so) and there was still a decent amount of memory free.

[image: memory-usage screenshot taken around slide index 5000]

Some stats:

  • 16GB of RAM
    • Around 8-10GB free at any given time
  • Each slide uses ~70-80kB

But, I did extrapolate out to slide 12000 and it looks like I'd only have about 400MB left of memory by the end of it, so I could imagine that would contribute to the latency.

I also ran without add_picture() and there was no scaling in the delay at all; see the short line in gray:

[image: graph showing no growth in delay when add_picture() is skipped (short gray line)]

@scanny (Owner) commented Jul 24, 2021

@Tzahi12345 okay, this finding is very revealing. So my next suspect is this method: https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L171-L183

It iterates through each of the package relationships before each image-add operation to see if the same image already exists. That way, if you're using the same image on every slide or whatever, you only embed one copy of it in the .pptx package.

So that's O(N) with N roughly the number of parts. Each slide is a part and each image is a part, so that probably gets up to 2500 items to iterate through for each added image by the end. Total iterations would be N(N-1)/2, which is around 3 million. It's still hard for me to believe that would take hours, but it's a worthy suspect.

To test this hypothesis, we can make that ._find_by_sha1() method just always immediately return None (not found) and try the measurements again (restoring the .add_picture() calls). That should definitely either identify the suspect or clearly indicate the likely culprit.
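
One quick way to run that experiment without editing the installed library (a test-only monkey-patch; it disables image de-duplication, so the saved file will be larger; the private attribute names are assumptions to verify against the linked pptx/package.py for your installed version):

from pptx import Presentation

prs = Presentation()

# test-only: replace the duplicate-image lookup with an always-miss stub
# (assumptions: the Package object is reachable via prs.part.package and keeps
# its image-part collection on a private _image_parts attribute)
prs.part.package._image_parts._find_by_sha1 = lambda sha1: None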

Btw, what is your CPU doing while this process is running? Is it pinned at 100% (of one core) or is it lollygagging around waiting for disk access or something?

@scanny (Owner) commented Jul 24, 2021

Actually, you know what? This __iter__() method that ._find_by_sha1() uses is itself quadratic, so that would make the ._find_by_sha1() quartic (O(N^4)). I'm betting this is exactly where the problem lies. Fortunately that's an easy one-line fix to bring that down to O(1) (still quadratic overall, but that's a big improvement). Just change line 145 from image_parts = [] to image_parts = set().

https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L141-L155

Where quadratic time is about 3 million time-units, quartic time is more like 5 trillion time-units. That I can totally believe takes hours, no matter how small "time-unit" is :)
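
For reference, the fix amounts to something like this in that __iter__() (paraphrased from the linked source; RT is pptx.opc.constants.RELATIONSHIP_TYPE, and surrounding details may differ slightly by version):

def __iter__(self):
    """Generate a reference to each distinct image part in the package."""
    image_parts = set()  # was: image_parts = [] -- membership test drops from O(N) to O(1)
    for rel in self._package.iter_rels():
        if rel.is_external or rel.reltype != RT.IMAGE:
            continue
        image_part = rel.target_part
        if image_part in image_parts:
            continue
        image_parts.add(image_part)
        yield image_part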

@MartinPacker:
(Sorry if this is parenthetic but I want to check one thing that will affect my project - md2pptx).

So, @scanny, it seems you're saying that the PowerPoint format allows reuse of graphics - where only one copy is stored and the slides that need it point to it.

I was considering - in md2pptx - trying to avoid the creation of duplicate graphics where they are used more than once in a presentation. It seems I don't need to - if python-pptx is deduping for me.

I do hope this fix of using a set works. That would completely obviate the need for me to dedup in my code - even if I could figure out how to.

@scanny (Owner) commented Jul 24, 2021

@MartinPacker Yes, that's my understanding of PowerPoint's behavior, and that's how I implemented image storage in python-pptx. It's easily verified by taking a single image, placing it multiple times on one slide and then multiple times on another slide, and then unzipping the .pptx file (or just listing its contents with unzip -l my_prs.pptx) to see how many times that image file appears in the package.

I don't recall how thoroughly I experimented with PowerPoint on that initially. I may have just tried it all on the same slide, so those would be interesting results to reproduce. But I saw it was clearly not duplicating the image in at least same-slide circumstances, so I just went with "save one time only". The size of a PowerPoint file is generally determined primarily by the size of the images it contains, so it didn't take much to make me believe its designers would have taken the time to implement this space optimization.

The de-duping does make things a little more complicated though, as one might expect. Basically the strategy I used is to hash each image binary (e.g. .jpg "file") and use that to detect duplicates. Then each new one needs to be distinct from all the existing ones before it is added, otherwise we just point to the existing one. That's what the .get_or_add_image_part() bit is about.

A little fancier is getting rid of one when the last reference is deleted. I think that happens automatically on save, like that image part becomes an orphaned node when the last reference is removed and only connected nodes in the part graph end up in the saved package.

Anyway, all that should be transparent to your app. Note that changing the [] to set() just makes lookup in that data structure O(1) instead of O(N). Its members should already be unique; we're only interested in the constant-time lookup.

I believe further time optimizations are possible, but perhaps not without risk or broader change. I'm betting this one-liner will reduce the time by a very large amount and then we can see if we need to go further. I would commit that [] => set() change at the next opportunity either way. That was just my 10-years-ago-self not being fully sensitive to asymptotic time behaviors.
