Updated next_partname generation #646

Open · wants to merge 1 commit into master

Conversation

@simkusr commented Sep 2, 2020

Partnames are now generated on the fly and cached for the next iteration of slides. This update solves the bottleneck mentioned in #644.

@simkusr (Author) commented Sep 2, 2020

So what I did is: on each loop iteration I create a new key in self.partnames corresponding to the segment of the received tmpl that comes after /ppt/, and increment it by 1. This avoids having to loop over all existing partnames each time to identify which partname is available next.

I know it could be improved further; I just need some tips on what I have missed in this improvement...
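
Assembled from the diff fragments reviewed below, the change amounts to roughly the following (an illustrative reconstruction for readers of this thread, not the exact committed code):

import os
from collections import defaultdict

from pptx.opc.packuri import PackURI


class OpcPackage(object):

    def __init__(self):
        super(OpcPackage, self).__init__()
        # cache: number of partnames issued so far, keyed by part type (e.g. "slides")
        self.partnames = defaultdict(int)

    def next_partname(self, tmpl):
        # tmpl is like "/ppt/slides/slide%d.xml"; take the segment after "/ppt/"
        # (note: partnames always use "/", so splitting on os.sep assumes a POSIX separator)
        name = tmpl.split(os.sep)[2]
        self.partnames[name] += 1
        candidate_partname = tmpl % self.partnames[name]
        return PackURI(candidate_partname)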

@scanny (Owner) left a comment:

Here's some feedback. Note you are unlikely to get this committed, so you should probably focus on getting it working well enough for your purposes in a local spike.

@@ -10,3 +10,4 @@ _scratch/
/spec/gen_spec/spec*.db
tags
/tests/debug.py
venv
@scanny (Owner):

A git-ignore entry particular to your personal development environment belongs in your own global gitignore file, not the project's. You'll find that somewhere like ~/.config/git/ignore, or perhaps configured via ~/.gitconfig.

@simkusr (Author):

Thank you for that, I will check it and update.

@@ -26,6 +29,7 @@ class OpcPackage(object):

def __init__(self):
super(OpcPackage, self).__init__()
self.partnames = defaultdict(int)
@scanny (Owner):

Avoid adding instance variables to the interface of the class (in other words, make instance variables private; if you need the value outside the class, expose it through a property). So this would be self._partnames.
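
A minimal sketch of that suggestion (illustrative only; add the property only if callers outside the class actually need the value):

from collections import defaultdict


class OpcPackage(object):

    def __init__(self):
        super(OpcPackage, self).__init__()
        self._partnames = defaultdict(int)  # private: not part of the class's public interface

    @property
    def partnames(self):
        """Expose the cache to outside callers only if genuinely needed."""
        return self._partnames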

return PackURI(candidate_partname)
raise Exception("ProgrammingError: ran out of candidate_partnames")
name = tmpl.split(os.sep)[2]
self.partnames[name] += 1
@scanny (Owner):

I'm not seeing where ._partnames gets initialized with the part-names already present in the package. I think a better strategy would be to make ._partnames an @lazyproperty (rather than an instance variable) and do the initialization in that lazyproperty (which only gets executed the first time it is called).

if candidate_partname not in partnames:
return PackURI(candidate_partname)
raise Exception("ProgrammingError: ran out of candidate_partnames")
name = tmpl.split(os.sep)[2]
@scanny (Owner):

This seems unreliable. Are you sure you can count on the name always being the third item? Better to figure it is the last item, like tmpl.split(os.sep)[-1]. Also, I'm not sure there's a compelling reason to split out the name at all. What is lost by keeping the path in there? The more important thing would be to split before the %d so you could match existing part-names with something like part_name.startswith("/ppt/slides/slide"), although be careful because /ppt/slides/slideMaster42.xml also starts with "/ppt/slides/slide". You'll probably need to use a regular expression or similar method to separate out the numeric suffix.

name = tmpl.split(os.sep)[2]
self.partnames[name] += 1
candidate_partname = tmpl % self.partnames[name]
return PackURI(candidate_partname)

@scanny (Owner):

Where are the tests? You would have found most of these problems with good tests.

@Tzahi12345:
Just curious, why wouldn't these changes get merged (even if your comments were addressed)?

I understand that a PowerPoint with thousands of slides isn't a case that you should overhaul a bunch of code for, but if it's a simple addition I'm wondering what the hesitation is.


I'll share my case since it's a bit relevant.

I'm generating a PowerPoint 12k slides large. To do so I'm creating plots and adding the plots to the slides, 60 slides at a time, in a pipeline. So, for example, while I'm generating plots for slides 60-119, I simultaneously create slides 0-59. In the next iteration I'll generate plots for slides 120-179 and create slides 60-119 with the plots I just generated, and so on.

It always takes longer to create the slides than to create the plots, making me wait for the former on each iteration. The reason I share this is to explain the following graph, which shows the difference in time between the slide and plot generation:

[image: graph of the time difference between slide generation and plot generation per iteration]

At first, creating the first 60 slides only takes 30s longer than creating the plots; the delay stays under a minute for about 3000 slides, but thereafter quickly grows to unmanageable levels. I took the integral of the trendline, and the total delay adds up to 24 hours. If it were constant at 30s, it would be just an hour and a half.

Do you have any recommendations here? Perhaps I could create a bunch of smaller PowerPoints and manually stitch them together; it would definitely shave off a bunch of time, but it requires manual work in a process I was trying to automate. Hope this comment wasn't too long-winded; I just wanted to explain the sticky situation I'm in.

@scanny (Owner) commented Jul 21, 2021

Hi @Tzahi12345, the performance problem here arises from the .next_partname() method having O(N^2) time complexity. On its first execution it finds zero existing partnames, the second finds one, and so on, producing n(n-1)/2 work in total, where n is the number of calls.

The reason it does this is that the "state" of partname assignments is reflected only in the current underlying XML, or package-state anyway. So each time a new partname is requested, that full state needs to be consulted to rule out any duplicates.

Any caching of that partname-assignments state in a more efficient data structure (like a dict or set having O(1) time complexity) risks getting out of synch with the actual package state. You would have to guarantee that every new partname assignment "registered" itself with that redundant record/cache.

This is probably possible, but would require substantial reasoning about the current code and where to place such a "one-and-only-one-per-open-package" data structure and the code that updates it. Do we know that every new part calls this method, for example? And what about when parts are removed?

Another possible way would be to give parts partnames that were not sequential, like maybe SHA1 hashes or hash prefixes computed from something including datetime or whatever. That would be ugly but I don't think it would violate the OPC spec.
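
A rough illustration of that idea (a hypothetical helper, not anything python-pptx provides; the prefix/extension convention is an assumption):

import hashlib
import os
from datetime import datetime


def hashed_partname(prefix, ext=".xml"):
    """Return a partname like "/ppt/slides/slide_1a2b3c4d.xml" using a hash
    suffix instead of a sequential number (hypothetical sketch only)."""
    seed = datetime.utcnow().isoformat().encode("utf-8") + os.urandom(8)
    return "%s_%s%s" % (prefix, hashlib.sha1(seed).hexdigest()[:8], ext)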

Another approach is to set an "I'll be really careful" flag and allow a cache to be built that could possibly become invalid if you didn't follow a prescribed procedure to request new partnames. We did something similar to this for adding new shapes to a slide when a sponsor needed to add thousands of shapes to each slide. I think we called it turbo mode, so that might be worth a search.

Anyway, I suppose the main reason is that it's not a big enough problem for someone with project funding to have decided they're willing to sponsor it.

If you wanted to "fix" it for your own purposes, you could fork python-pptx and use a local patched version that takes roughly the approach sketched out at the top of this thread. I suspect you can be pretty confident that the OpcPackage instance for each .pptx file (package) you "open" is effectively a singleton for that package, so if you cache the partnames there somehow then you should be able to generate new ones in O(1) time. Theoretically you also need to detect when partnames are removed, but I don't think that matters enough to bother with. The main thing is not to assign the same name twice; I don't expect that re-using abandoned names is anywhere close to worth the trouble.

One other approach might be to monkey-patch OpcPackage to add the cache data-structure to it and "replace" its .next_partname() method with one you write yourself. If you could get that to work it would be a lot less trouble than a fork. The main thing to get rid of in that method is the partnames = [part.partname for part in self.iter_parts()] line which is where the O(N^2) comes from. You still need to do that once to initialize the partname cache, but not every time a new one is created.
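
A hedged sketch of that monkey-patch idea, assuming a one-time cache initialization from iter_parts(); the helper names and the regex are illustrative, and whether patching the class is safe for your workload is exactly the "be really careful" caveat from earlier:

import re
from collections import defaultdict

from pptx.opc.package import OpcPackage
from pptx.opc.packuri import PackURI

_NUMBERED = re.compile(r"^(?P<stem>.*?)(?P<n>\d+)?(?P<ext>\.\w+)$")


def _patched_next_partname(self, tmpl):
    """O(1) replacement for OpcPackage.next_partname() (use at your own risk)."""
    if not hasattr(self, "_partname_counts"):
        # one-time O(N) scan to seed the cache with names already in the package
        counts = defaultdict(int)
        for part in self.iter_parts():
            m = _NUMBERED.match(str(part.partname))
            if m and m.group("n"):
                key = m.group("stem") + "%d" + m.group("ext")
                counts[key] = max(counts[key], int(m.group("n")))
        self._partname_counts = counts
    self._partname_counts[tmpl] += 1
    return PackURI(tmpl % self._partname_counts[tmpl])


OpcPackage.next_partname = _patched_next_partname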

@Tzahi12345:
> Another approach is to set an "I'll be really careful" flag and allow a cache to be built that could possibly become invalid if you didn't follow a prescribed procedure to request new partnames. We did something similar to this for adding new shapes to a slide when a sponsor needed to add thousands of shapes to each slide. I think we called it turbo mode, so that might be worth a search.

I think this may be the best/safest approach here. I understand the issues with the cache; one thing I'm curious about is why it might get out of sync. I'm not familiar with the codebase, but if all add/remove part operations use the same function, every addition/removal of a part could update the cache, so there must be something I'm missing here.

Anyway, I do understand why it's not high priority and why it might never get merged; it's not a highly requested feature, nor something a sponsor is requesting. I did try out @simkusr's fork and it worked well -- I'm not doing anything fancy, so I'm not surprised. I'll keep using that for the time being. Thank you for the lengthy answer!

@scanny (Owner) commented Jul 22, 2021

@Tzahi12345 yes, I think you're right. I don't know of any place that adds a part that doesn't use package.next_partname(). I think we've gotten that factored pretty well. If I were building a production capability I wouldn't stop at believing and would proceed to proving it to myself, but we're not talking about that at the moment.

On the other hand, if I were only doing it for my own uses, I'd definitely implement it first and give it a try. If it worked for me I'd leave it at that.

I'd start by adding a max_partname dict to OpcPackage that gets initialized with the maximum partname number found for each partname template (like "ppt/slides/slide%d.xml", I vaguely remember) in the initial package, and then gets used to look up and increment the max for each new partname that is requested.

So the (O(1)) implementation of .next_partname would become something like:

@lazyproperty
def _max_partnames(self):
    """defaultdict(int) mapping partname templates to max count so far.

    ... more details needed to understand proper use ... all keys work, value is zero for template with
    no partname so far, etc. ...
    """
    max_partnames = collections.defaultdict(int)
    # --- initialize partname-number cache ---
    for part in self.iter_parts():
        tmpl, n = parse_partname(part.partname)
        max_partnames[tmpl] = max(max_partnames[tmpl], n)
    return max_partnames

def next_partname(self, tmpl):
    """Return next available |PackURI| partname matching `tmpl`."""
    n = self._max_partnames[tmpl] + 1
    self._max_partnames[tmpl] = n
    return PackURI(tmpl % n)

You'll need to work out your own implementation of parse_partname(). Shouldn't be too hard: split at the period and look for the digits at the end of the first piece. Maybe use a regex.
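
For example, a parse_partname() along these lines would do (a sketch, assuming every numbered partname ends with digits immediately before the extension):

import re

_PARTNAME = re.compile(r"^(?P<prefix>.+?)(?P<num>\d+)(?P<ext>\.\w+)$")


def parse_partname(partname):
    """Split "/ppt/slides/slide9.xml" into ("/ppt/slides/slide%d.xml", 9).

    Partnames with no numeric suffix map to (partname, 0), so they never
    produce a usable template but also never collide with one.
    """
    m = _PARTNAME.match(str(partname))
    if m is None:
        return str(partname), 0
    return "%s%%d%s" % (m.group("prefix"), m.group("ext")), int(m.group("num"))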

@Tzahi12345:
Thanks for the tips! I'll see if I can get the time to implement this -- I probably will soon, as this PR isn't as great as I thought (there was still an O(n^2) issue). In orange you can see the performance of this PR.

[image: graph of per-iteration delay, with this PR's performance shown in orange]

Over the course of the 27-hour job, this only saves me 3.5 hours. I haven't looked through the codebase, so I'm not sure where your implementation and @simkusr's differ, but these are the results I saw. Worst case, there's some other scaling issue hiding there, but since this issue was found through profiling, I'll assume there isn't.

@scanny (Owner) commented Jul 23, 2021

@Tzahi12345 try skipping adding the image and see how the time changes. Still generate it, just comment out the .add_picture() call and see how the curve looks.

There's potentially quadratic time in .add_picture() because it checks to make sure that the picture isn't already in the package (PowerPoint only stores the same image once).

Also, how big is the file that you generate and how much memory does your machine have? The whole file times a significant multiplier (at least 3, maybe 8) is being built up in memory until you save. You could just be swapping to disk after a certain point.

@scanny (Owner) commented Jul 23, 2021

Btw, I'm not seeing anything O(N^2) in the .next_partname of this PR. I don't think it would work exactly as it is (it doesn't load the partnames that already exist in the package at initial load time), but it should run in constant time (O(1)).

@MartinPacker:
Is it feasible to disembowel :-) add_picture() in a local fork to remove the check? Similarly, is it feasible to add a "check for existence" boolean parameter that defaults to True?

@scanny (Owner) commented Jul 23, 2021

@MartinPacker such a thing could make sense. I think there would be better options, like caching the SHA1 hash of all current images in a set or dict to allow O(1) lookup. But right now I'm not sure that's where the time is going. I can't see it taking even a second to search through 12,000 objects for a constant (already computed) SHA1. I'm thinking the time has to be going somewhere else. My money is on memory being swapped to disk at the moment. We've got something like 1000 seconds to account for on each slide.
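
As a sketch of that "cache the SHA1s" idea (illustrative class and names, not the python-pptx API):

class ImageSha1Cache(object):
    """Hypothetical O(1) duplicate-image lookup keyed by SHA1 digest."""

    def __init__(self, image_parts):
        # one-time O(N) build: map each existing image part's SHA1 to the part
        self._by_sha1 = {part.sha1: part for part in image_parts}

    def find(self, sha1):
        """Return the already-present image part for `sha1`, or None."""
        return self._by_sha1.get(sha1)

    def add(self, image_part):
        self._by_sha1[image_part.sha1] = image_part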

@Tzahi12345:

I re-ran and checked for hard faults after a noticeable increase in delay (slide index 5000 or so) and there was still a decent amount of memory free.

[image: memory-usage screenshot taken around slide index 5000]

Some stats:

  • 16GB of RAM
    • Around 8-10GB free at any given time
  • Each slide uses ~70-80kB

But, I did extrapolate out to slide 12000 and it looks like I'd only have about 400MB left of memory by the end of it, so I could imagine that would contribute to the latency.

I also ran without add_picture() and there was no scaling in the delay at all; see the short line in gray:

[image: graph showing no growth in delay when add_picture() is skipped (short gray line)]

@scanny (Owner) commented Jul 24, 2021

@Tzahi12345 okay, this finding is very revealing. So my next suspect is this method: https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L171-L183

It iterates through each of the package relationships before each image-add operation to see if the same image already exists. That way, if you're using the same image on every slide or whatever, you only embed one copy of it in the .pptx package.

So that's O(N) with N roughly the number of parts. Each slide is a part and each image is a part, so that probably gets up to 2500 items to iterate through for each added image by the end. Total iterations would be N(N-1)/2, which is around 3 million. It's still hard for me to believe that would take hours, but it's a worthy suspect.

To test this hypothesis, we can make that ._find_by_sha1() method just always immediately return None (not found) and try the measurements again (restoring the .add_picture() calls). That should definitely either identify the suspect or clearly indicate the likely culprit.
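
One quick way to run that experiment without editing the installed library (a test-only monkey-patch; it disables image de-duplication, so the saved file will be larger; the private attribute names are assumptions to verify against the linked pptx/package.py for your installed version):

from pptx import Presentation

prs = Presentation()

# test-only: replace the duplicate-image lookup with an always-miss stub
# (assumptions: the Package object is reachable via prs.part.package and keeps
# its image-part collection on a private _image_parts attribute)
prs.part.package._image_parts._find_by_sha1 = lambda sha1: None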

Btw, what is your CPU doing while this process is running? Is it pinned at 100% (of one core) or is it lollygagging around waiting for disk access or something?

@scanny (Owner) commented Jul 24, 2021

Actually, you know what? This __iter__() method that ._find_by_sha1() uses is itself quadratic, so that would make the ._find_by_sha1() quartic (O(N^4)). I'm betting this is exactly where the problem lies. Fortunately that's an easy one-line fix to bring that down to O(1) (still quadratic overall, but that's a big improvement). Just change line 145 from image_parts = [] to image_parts = set().

https://github.com/scanny/python-pptx/blob/master/pptx/package.py#L141-L155

Where quadratic time is about 3 million time-units, quartic time is more like 5 trillion time-units. That I can totally believe takes hours, no matter how small "time-unit" is :)
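
For reference, the fix amounts to something like this in that __iter__() (paraphrased from the linked source; RT is pptx.opc.constants.RELATIONSHIP_TYPE, and surrounding details may differ slightly by version):

def __iter__(self):
    """Generate a reference to each distinct image part in the package."""
    image_parts = set()  # was: image_parts = [] -- membership test drops from O(N) to O(1)
    for rel in self._package.iter_rels():
        if rel.is_external or rel.reltype != RT.IMAGE:
            continue
        image_part = rel.target_part
        if image_part in image_parts:
            continue
        image_parts.add(image_part)
        yield image_part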

@MartinPacker:
(Sorry if this is parenthetic but I want to check one thing that will affect my project - md2pptx).

So, @scanny, it seems you're saying that the PowerPoint format allows reuse of graphics - where only one copy is stored and the slides that need it point to it.

I was considering - in md2pptx - trying to avoid the creation of duplicate graphics where they are used more than once in a presentation. It seems I don't need to - if python-pptx is deduping for me.

I do hope this fix of using a set works. That would completely obviate the need for me to dedup in my code - even if I could figure out how to.

@scanny (Owner) commented Jul 24, 2021

@MartinPacker Yes, that's my understanding of PowerPoint's behavior, and that's how I implemented image storage in python-pptx. It's easily verified by taking a single image, placing it multiple times on one slide and then multiple times on another slide, and then unzipping the .pptx file (or just listing its contents with unzip -l my_prs.pptx) to see how many times that image file appears in the package.

I don't recall how thoroughly I experimented with PowerPoint on that initially. I may have just tried it all on the same slide, so those would be interesting results to reproduce. But I saw it was clearly not duplicating the image in at least same-slide circumstances, so I just went with "save one time only". The size of a PowerPoint file is generally determined primarily by the size of the images it contains, so it didn't take much to make me believe its designers would have taken the time to implement this space optimization.

The de-duping does make things a little more complicated though, as one might expect. Basically the strategy I used is to hash each image binary (e.g. .jpg "file") and use that to detect duplicates. Then each new one needs to be distinct from all the existing ones before it is added, otherwise we just point to the existing one. That's what the .get_or_add_image_part() bit is about.

A little fancier is getting rid of one when the last reference is deleted. I think that happens automatically on save, like that image part becomes an orphaned node when the last reference is removed and only connected nodes in the part graph end up in the saved package.

Anyway, all that should be transparent to your app. Note that changing the [] to set() just makes lookup in that data structure O(1) instead of O(N). Its members should already be unique; we're only interested in the constant-time lookup.

I believe further time optimizations are possible, but perhaps not without risk or broader change. I'm betting this one-liner will reduce the time by a very large amount and then we can see if we need to go further. I would commit that [] => set() change at the next opportunity either way. That was just my 10-years-ago-self not being fully sensitive to asymptotic time behaviors.
