support HyperTile optimization #13948

Merged
merged 9 commits into from
Nov 26, 2023


aria1th
Collaborator

@aria1th aria1th commented Nov 11, 2023

https://github.com/tfernd/HyperTile

Description

HyperTile is an optimization based on yet another form of split attention.

Currently, testing it alongside other extensions such as ControlNet requires a modified repository.

For that reason requirements.txt is not modified here; HyperTile has to be installed manually:

pip install git+https://github.com/tfernd/HyperTile

Screenshots/videos:

The tests cover three environments:

  1. 512x512 (and 2x latent upscale)
  2. 768x768 (and 2x latent upscale)
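  3. OpenPose ControlNet with 512x768 2x latent upscale

| HyperTile | 512x512 (it/s) | 512x512, 2x latent (it/s) | 768x768 (it/s) | 768x768, 2x latent (it/s) |
|---|---|---|---|---|
| NO | 6.02 | 3.64 | 5.56 | 1.03 |
| YES | 6.04 | 4.65 | 5.86 | 2.05 |
| NO (b2) | 4.97 | 1.7 | 3.6 | 0.45045 |
| YES (b2) | 5.38 | 2.43 | 4.26 | 0.943396 |
| B1 improvement | 0.33% | 27.75% | 5.40% | 99.03% |
| B2 improvement | 8.25% | 42.94% | 18.33% | 109.43% |

| ControlNet | HyperTile | Speed |
|---|---|---|
| 512x768 OpenPose, 2x latent | NO | 1.33 it/s |
| 512x768 OpenPose, 2x latent | YES | 1.92 it/s |
| Improvement | | 44.36% |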
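
The improvement percentages in the table above appear to be the relative change in it/s between the HyperTile run and the baseline run. A minimal sketch of that arithmetic, assuming this is how the figures were derived (the function name is illustrative only):

```python
def improvement_percent(baseline_its: float, hypertile_its: float) -> float:
    """Relative speed-up of the HyperTile run over the baseline, in percent."""
    return (hypertile_its / baseline_its - 1.0) * 100.0


# 768x768 with 2x latent upscale, values taken from the table above
print(f"{improvement_percent(1.03, 2.05):.2f}%")
# ControlNet OpenPose, 512x768 with 2x latent upscale
print(f"{improvement_percent(1.33, 1.92):.2f}%")
```

With the posted numbers this reproduces the 99.03% and 44.36% entries.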

The tests used the animefull-latest-pruned model with the prompt 1girl and the negative prompt easynegative.
They were run on an RTX 4090 with an i7-8700 and DDR4-3200 RAM, with a batch count of 16 for each test.

The patch makes HyperTile deterministic.
Here is the behavior with and without HyperTile:

Without HyperTile - original image. The result should be reproducible without the patch.

07456-2448603727-1girl

With HyperTile - tested twice with the same seed. The result is slightly different from the original, but deterministic.

07458-2448603727-1girl
07457-2448603727-1girl
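
How tiled attention can differ from the untiled result yet stay reproducible can be illustrated with a small sketch. It assumes that the tile size is chosen at random among divisors of the latent dimension and that the random generator is seeded from the image seed. Both points are assumptions made for illustration, and `pick_tile_count` and `tile_plan` are hypothetical helpers, not the functions used in this patch.

```python
import random


def pick_tile_count(dim: int, min_tile: int, max_options: int, rng: random.Random) -> int:
    """Pick how many tiles to split `dim` into, choosing among a few valid divisors."""
    # Tile sizes that divide `dim` evenly and are at least `min_tile`, smallest first
    tile_sizes = [t for t in range(min(min_tile, dim), dim + 1) if dim % t == 0]
    # Keep only the first `max_options` candidates and convert them to tile counts
    candidates = [dim // t for t in tile_sizes[:max(1, max_options)]]
    return rng.choice(candidates)


def tile_plan(seed: int, steps: int) -> list[int]:
    """Tile counts for a run; a generator seeded per run keeps the plan repeatable."""
    rng = random.Random(seed)
    return [pick_tile_count(128, 16, 3, rng) for _ in range(steps)]


seed = 2448603727  # seed taken from the sample file names above
print(tile_plan(seed, 8) == tile_plan(seed, 8))  # same seed, same plan
```

With an unseeded global generator the plan would change between runs, which is the kind of non-determinism a seeded generator avoids.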

TODO: we need infotext for HyperTile enabled / disabled.

Checklist:

@aria1th aria1th marked this pull request as draft November 11, 2023 14:41
@AndreyRGW
Contributor

Finally someone decided to make HyperTile for Automatic1111!

@aria1th
Collaborator Author

aria1th commented Nov 12, 2023

Note:
There is a problem with non-standard heights / widths; HyperTile usually supports multiples of 128.

With the plain txt2img process, some shapes such as 512x704 just work.

Unfortunately, for the img2img process the image has to be resized to multiples of 128 first, or you will see artifacts like these:
07515-2209052780-1girl

(Well, it's kinda beautiful(?) but still)

Thus we need more testing for this.

Still, it should not cause any problems with extensions or other things.
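
The 128-multiple constraint described in this comment could be handled by snapping the requested size before img2img. A minimal sketch, assuming that rounding to the nearest multiple is acceptable for the workflow; `snap_to_multiple` is a hypothetical helper and not part of this PR:

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round `value` to the nearest positive multiple of `multiple` (ties round up)."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


for width, height in [(512, 704), (576, 768), (1000, 1500)]:
    print((width, height), "->", (snap_to_multiple(width), snap_to_multiple(height)))
```

Integer arithmetic is used instead of `round()` so that halfway cases such as 704 always round the same way.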

@aria1th aria1th marked this pull request as ready for review November 12, 2023 13:35
@gel-crabs
Contributor

gel-crabs commented Nov 14, 2023

Hey, if anyone wants to test this on SDXL, I created an amateur port of hypertile.py to use the SDXL depth layers.

I spent a lot of time tuning the numbers, tile sizes and whatnot over the past few days, and I think I've found the best settings, i.e. best performance, no artifacts, and only a slight change in the result for a given seed compared to running without it.

I also added the LDM key for the VAE. The original only had the diffusers key (A1111 only uses LDM/SGM and not diffusers), so this PR wasn't tiling the VAE at all. The VAE hasn't changed across SD versions, so this is relevant for both 1.5 and SDXL.

Big thank you to the creator of https://github.com/arenasys/stable-diffusion-webui-model-toolkit as the components directory in that repo is the only place I could find any info about what layers 1.5 and XL use.

hypertile.py.txt (Rename to hypertile.py in the modules directory)

This only works for SDXL and not 1.5. Without it I get 1.7 it/s, with it I get 1.82 it/s, with no loss to quality or determinism. Definitely worth it.

(Oh yeah, forgot to mention, I also commented out the line that prints every layer it hijacks to the console. SDXL has a LOT of layers.)

@aria1th
Collaborator Author

aria1th commented Nov 15, 2023

SD Base (1.4-1.5)
512x768 1.5x Latent - 5.61it/s
Without - 4.36it/s

SD XL
512x768 1.5x - 2.71it/s
Without - 2.60it/s

(3 pass, batch count 6)

Co-Authored-By: Kieran Hunt <kph@hotmail.ca>
@gel-crabs
Contributor

Also note that increasing the max_depth further increases it/s, as it hijacks more layers.

I'm not sure about 1.5, but on SDXL I've gotten the best results with max_depth 2 (max_depth 1 is about the same speed as max_depth 2, but with a reduction in quality).

@aria1th
Collaborator Author

aria1th commented Nov 16, 2023

The tile size / depth / etc. options will be added to the settings soon ™️
(Except for auto-determining the largest tile size; I guess the current implementation is correct for that.)

Also, I found vladmandic's older implementation, which says it is not compatible with ToMe and other extensions of that type. I guess it can just work if we apply the HyperTile hijack at the last moment; confirmed with ToMe ratio 0.3, etc.

Thus, if anyone finds a bug, please ping me.

@gel-crabs
Contributor

I'm unable to get the newest version of the patch to work (on SDXL at least), and I'm unsure what's causing it.

I'm pretty sure I was sleepy while implementing this
@aria1th
Collaborator Author

aria1th commented Nov 17, 2023

@gel-crabs Thank you for the comment! You're right; I confirmed the issue came from typos introduced during refactoring. (I may have to refactor again...)

A. The options were inverted, so enabling HyperTile actually disabled it...
B. The hijack was only working for the VAE, so there was only a very minor speed improvement. It is now fixed.

Confirmed working for SD Base 1.5 now.
768x768 2x upscale 3 images - 2.15it/s vs 1.60it/s
512x512 2x upscale 6 images - 4.95it/s vs 3.72it/s

SD XL - 768x768 depth 0
2.73it/s vs 2.58it/s
(Not so dramatic maybe?)

@gel-crabs
Contributor

It works! With the newest commits and full max_depth, my it/s now goes from 1.7 to 1.88. Not bad at all!

If I'm able to find any information about SDXL depth layers in diffusers, I will hook it up in case A1111 gets diffusers support in the future (plz, I need inpaint)

@AUTOMATIC1111 AUTOMATIC1111 merged commit fd8674a into AUTOMATIC1111:dev Nov 26, 2023
3 checks passed
@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Nov 26, 2023

I wanted this to work without changes to processing.py so I partially reworked the file into a built-in extension; additionally added an option to only apply unet hypertile to a hires fix pass. Still no infotext params - adding them is easy but I think before that reasonable defaults should be figured out - ones that give most speed improvement with least image difference.

@AUTOMATIC1111
Owner

| Settings | 1024x1024, it/s | 1600x1600, it/s |
|---|---|---|
| without hypertile | 3.68 | 1.03 |
| ht d=3, tile=256, s=3 | 4.68 | 2.33 |
| ht d=2, tile=256, s=3 | 4.74 | 2.31 |
| ht d=1, tile=256, s=3 | 4.72 | 2.34 |
| ht d=0, tile=256, s=3 | 4.86 | 2.15 |
| ht d=3, tile=128, s=3 | 4.84 | 2.35 |
| ht d=3, tile=64, s=3 | 5.64 | 2.32 |
| ht d=3, tile=512, s=3 | 4.42 | 2.33 |
| ht d=3, tile=512, s=0 | --- | --- |
| ht d=3, tile=512, s=1 | 5.39 | 2.54 |
| ht d=3, tile=512, s=2 | 5.41 | 2.32 |
| ht d=3, tile=512, s=4 | 4.75 | 2.33 |
| ht d=3, tile=512, s=5 | 4.72 | 2.18 |
| ht d=3, tile=512, s=6 | 4.73 | 1.94 |
| ht d=3, tile=64, s=1 | 5.98 | 2.57 |
| ht d=2, tile=64, s=1 | 6.09 | 2.5 |
| ht d=1, tile=64, s=1 | 6.02 | 2.51 |
| ht d=0, tile=64, s=1 | 2.95 | 2.15 |
| ht d=0, tile=512, s=1 | 5.52 | 2.17 |
EXCEL_CxbXlESSys

@w-e-w w-e-w mentioned this pull request Dec 4, 2023
@FurkanGozukara

FurkanGozukara commented Dec 4, 2023

ok i found it. how do we use it?

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

@FurkanGozukara go to Settings - Hypertile options, enable the optimizations (and set swap size to a large value like 12 for safety) - then there you go

@FurkanGozukara

> @FurkanGozukara go to Settings - Hypertile options, enable the optimizations (and set swap size to a large value like 12 for safety) - then there you go

thank you. i am testing right now SDXL on RTX 3060 - 12 GB - i don't see any difference in speed for 1024x1024

outputs changing

what does each option do

depth
swap size
max tile size

@ArxFusion

Sorry but I don't quite understand how this works, is this for txt2img with hires fix or only img2img upscale? And as for the options, do I enable Enable Hypertile U-Net, Enable Hypertile U-Net for hires fix second pass and Enable Hypertile VAE? I have tried to use it with all the options enabled and one-by-one for txt2img, I do not see any real difference in speed or image quality....unless I am doing something wrong. There are some very minor changes in 1.5 but none that I can see in SDXL.

@FurkanGozukara

i got very little speed improvement

testing with RTX 3090 TI

from 1280x1024 to 2176x1740

without hyper tile : 1.13 second / it - second pass
with hyper tile : 1.07 second / it - second pass

nothing like @AUTOMATIC1111 provided table above

tested settings

image

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

@ArxFusion @FurkanGozukara
The speed-up only appears when the GPU is struggling with large image tiles, which usually means high resolutions. For normal cases, as tested, it is not really noticeable.

Thus the options are separated into 'first pass', 'hires pass' and 'VAE stage', to be used for the corresponding bottlenecks.

(In other words, if you just use 512x512 then you usually don't need it)

Depth option - noticeable if you are creating gigantic images (well, depends on your ratio...)

Max tile size - larger is better (adjusted by ratio)

Swap size - smaller is usually faster but can produce artifacts, so there is a trade-off between speed and aesthetic score.
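
Why the gain mostly shows up at high resolution can be seen from a rough cost estimate. Self-attention compares every token with every other token, so its cost grows with the square of the token count, and splitting the tokens into independent tiles divides that cost by the number of tiles. The sketch below assumes the usual 8x latent downscale of Stable Diffusion and counts only query-key pairs at the highest-resolution attention level; the 2x2 tiling is illustrative and not the value chosen by this patch.

```python
def attention_pairs(width: int, height: int, tiles_w: int = 1, tiles_h: int = 1) -> int:
    """Number of query-key pairs in one self-attention layer at latent resolution."""
    latent_w, latent_h = width // 8, height // 8  # assumed 8x VAE downscale
    tokens_per_tile = (latent_w // tiles_w) * (latent_h // tiles_h)
    # Each tile attends only within itself
    return tiles_w * tiles_h * tokens_per_tile ** 2


for size in (512, 1024, 1600):
    full = attention_pairs(size, size)
    tiled = attention_pairs(size, size, tiles_w=2, tiles_h=2)
    print(f"{size}x{size}: {full:,} pairs untiled, {tiled:,} with 2x2 tiles")
```

The ratio is 4x at every size in this toy model, but the absolute amount of attention work removed is roughly 95 times larger at 1600x1600 than at 512x512, which is one plausible reason small images see little benefit.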

@FurkanGozukara

> @ArxFusion @FurkanGozukara The speed-up only appears when the GPU is struggling with large image tiles, which usually means high resolutions. For normal cases, as tested, it is not really noticeable.
>
> Thus the options are separated into 'first pass', 'hires pass' and 'VAE stage', to be used for the corresponding bottlenecks.
>
> (In other words, if you just use 512x512 then you usually don't need it)
>
> Depth option - noticeable if you are creating gigantic images (well, depends on your ratio...)
>
> Max tile size - larger is better (adjusted by ratio)
>
> Swap size - smaller is usually faster but can produce artifacts, so there is a trade-off between speed and aesthetic score.

thank you, i tested like this. shouldn't i see a big speed improvement in the hires fix pass?

image

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

@FurkanGozukara Did you get hit with any memory problem while generating images?

But as mentioned above, SD XL does not show that dramatic improvement compared to 1.5-type models, as expected.

There are too many layers in SD XL, which can be the cause for this issue... (comfyUI shows same behavior - afaik it does not do anything for SD XL)

@FurkanGozukara

FurkanGozukara commented Dec 4, 2023

> @FurkanGozukara Did you get hit with any memory problem while generating images?
>
> But as mentioned above, SD XL does not show that dramatic improvement compared to 1.5-type models, as expected.
>
> There are too many layers in SD XL, which can be the cause for this issue... (comfyUI shows same behavior - afaik it does not do anything for SD XL)

I see. Well, I tested without hitting any VRAM limit. I have 24 GB VRAM with an RTX 3090 TI.

For SD 1.5, where can we use this? I mean, when we go to a higher resolution it produces garbage. So in which cases could we use it?

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

@FurkanGozukara It can be used for 1024x1024, 1600x1600 - or be combined with kohya's hires fix too, or even extreme high resolution with low denoise strength (which was the main purpose of the original author).

@FurkanGozukara

> @FurkanGozukara It can be used for 1024x1024, 1600x1600 - or be combined with kohya's hires fix too, or even extreme high resolution with low denoise strength (which was the main purpose of the original author).

can you show a screenshot of such sd 1.5 settings so i can test here

like generating a 1600x1600 image with sd 1.5

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

image

The simple setting ™️ for 1.5 will be like this - (unfortunately I'm currently running training, so I can't take screenshot of speed)

@FurkanGozukara

i see, thanks. yes, i also saw some real improvement with sd 1.5. TensorRT brings more improvement. will this work with TensorRT? @aria1th

@FurkanGozukara

i will combine and test with TensorRT. if both work together, that would be a huge speed improvement for SD 1.5

@aria1th
Collaborator Author

aria1th commented Dec 4, 2023

@FurkanGozukara Yes, but note that TensorRT requires code to be 'included' to compile, thus HyperTile has to be part of the model itself... but still I'll say it is barely possible.

@FurkanGozukara

> @FurkanGozukara Yes, but note that TensorRT requires code to be 'included' to compile, thus HyperTile has to be part of the model itself... but still I'll say it is barely possible.

ok i see. ty

@ArxFusion

Thank you for updating the text with some info on the various options. I can see that for SDXL it's not really working, but for 1.5 there are some slight improvements. So far it's only really noticeable when running at more than 30 steps, and it seems deterministic on your own hardware. I do notice that the generated images can deviate between settings, some for the better and some for the worse depending on the options selected. I won't post my results since I'm not sure how to even benchmark this, and I feel everyone will have different experiences.

@zcatharisis

zcatharisis commented Dec 6, 2023

Since Hypertile is intended for large images usually, could an option be added so that it's only enabled for hiresfix pass and img2img? I tried doing it myself but I didn't understand enough of the code to guess where to change.

@aria1th
Collaborator Author

aria1th commented Dec 6, 2023

@zcatharisis Yes, Enable Hypertile for Unet second pass will exclusively allow Hypertile to be used for hires.fix.

Img2Img is, though, a first pass.

@zcatharisis

zcatharisis commented Dec 7, 2023

Sorry, I didn't make myself clear; the option would be to enable hiresfix pass and img2img passes only, while it is disabled in regular txt2img first pass. Sometimes I flip flop around upscaling by 3x in img2img, then generating a 576x768 image in txt2img, and back. It's a bit of a pain having to turn hypertile on for img2img and off for txt2img (since as you pointed out, it generates artifacts if the resolution isn't 128-multiple).

EDIT: Or maybe a cleaner implementation would be to detect the resolution of the image being generated and toggle Hypertile on or off accordingly? For example, it is disabled at 1024x1024 and below and enabled above that resolution?
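
The resolution-based toggle suggested here could look roughly like the following. This is a sketch of the idea only: `should_enable_hypertile` and its pixel-count threshold are hypothetical and do not correspond to an existing option in the web UI.

```python
def should_enable_hypertile(width: int, height: int, min_pixels: int = 1024 * 1024) -> bool:
    """Enable HyperTile only for images with more than `min_pixels` pixels."""
    return width * height > min_pixels


# A 576x768 txt2img first pass stays untiled; its 3x img2img upscale gets tiled
print(should_enable_hypertile(576, 768))
print(should_enable_hypertile(576 * 3, 768 * 3))
```

Comparing total pixel count rather than each side separately is a design choice of this sketch, so that non-square sizes are treated consistently.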

@w-e-w w-e-w mentioned this pull request Dec 16, 2023
7 participants