acgan: Add batch normalization to the Generator, etc #8616
Conversation
Generated images are better than the previous best (#8482 (comment)). The biggest difference is much larger diversity in line thickness.
Note that if you want to completely freeze a model that has BN layers (like here), you need to do more than just set it to non-trainable; you also need to disable its updates. Otherwise the batch statistics will still get updated during training. This is not a bug; it is simply a manifestation of the fact that non-backprop updates and layer trainability are independent (for instance, if you set a stateful RNN to non-trainable, that will not freeze its state). It is possible that we should have a
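To illustrate the point, here is a minimal sketch (assuming the Keras 2 API of that era; in recent versions, setting trainable = False on a BatchNormalization layer also switches it to inference mode, so the outcome differs):

```python
import numpy as np
from keras.layers import Input, Dense, BatchNormalization
from keras.models import Model

inp = Input(shape=(8,))
bn_layer = BatchNormalization()
hidden = bn_layer(inp)
out = Dense(1)(hidden)
model = Model(inp, out)

bn_layer.trainable = False       # freezes only gamma/beta (the trainable weights)
model.compile(optimizer='sgd', loss='mse')

before = bn_layer.get_weights()  # [gamma, beta, moving_mean, moving_variance]
model.train_on_batch(np.random.rand(32, 8), np.random.rand(32, 1))
after = bn_layer.get_weights()

# gamma/beta stay fixed, but the moving statistics may still have been updated,
# because those updates are not part of backprop.
print(np.allclose(before[2], after[2]), np.allclose(before[3], after[3]))
```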
Could you elaborate on that, please? Do you mean setting every layer's trainable attribute to False?
No. You'll need to clear the attribute
I don't think it ever makes sense to freeze those 2 trainable parameters while updating the 2 non-trainable parameters. This approach should properly be called "Breaking Deep Network Training by Introducing Uncorrectable Internal Covariate Shift". In terms of calling things trainable or non-trainable, I don't see what difference it makes conceptually whether the parameter updates happen during forward prop or backprop. I think those non-trainable weights should always be frozen as well, and renamed _trainable_ when they aren't frozen. This issue does not affect this PR (which has BN only on the Generator, which is never frozen), but may be the reason why my attempts to put BN into the Discriminator broke things very badly. The following workaround #4762 (comment) seems to be working for the GAN use case, and I'll try it, but not for a while.
@ozabluda Jeremy Howard did a study on this recently ("should we update the batch statistics during fine-tuning?") and the answer was "it depends". It's not clear-cut. Beyond batch norm specifically, you seem confused by the difference between trainability and stateful behavior. Setting a layer to non-trainable means its trainable weights will not be taken into account during training. It does not affect the parts of the layer's state that are independent from training. For instance, a layer that maintains a counter that is incremented by one with every batch will not stop doing that if you set it to non-trainable. If you want to run your BN layers in inference mode, the way to do it is to pass a static boolean as the training argument in the call:
y = BatchNormalization()(x, training=False)
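A small sketch of that pattern (layer sizes and names here are made up, not from the example script):

```python
from keras.layers import Input, Dense, BatchNormalization
from keras.models import Model

x = Input(shape=(16,))
h = Dense(32, activation='relu')(x)
# Static boolean: this BN call always uses the moving statistics and
# registers no update ops, regardless of the learning phase.
h = BatchNormalization()(h, training=False)
y = Dense(1, activation='sigmoid')(h)
frozen_bn_model = Model(x, y)

# In the Keras 2 API of that era, no BN update ops should be collected here.
print(len(frozen_bn_model.updates))
```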
Unrelated: @ozabluda, I don't know of any references. Tbh, it always just seemed more natural on an intuitive basis.
@fchollet, I can't find that study by Jeremy Howard. Is there a recommended way to freeze those two non_trainable weights that would work for the acgan example, if I put BatchNormalization into the Discriminator? Right now, the best advice seems to be to go through the discriminator model, immediately after. Also see tensorflow/tensorflow#10857
Yes, please!
There is no "clean" way at the moment, but we need one. The simplest would be a layer/model attribute that regulates whether or not the layer/model will always return its updates. To make sure we have separation of concerns between trainability and updates, it's probably best for this attribute to only act on updates (no effect on trainability) and to explicitly mention updates in the name. Like,
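Purely for illustration, a hypothetical sketch of what such an attribute might look like (the name collect_updates is invented here and is not a real Keras API):

```python
# Hypothetical, invented attribute name -- not actual Keras API.
discriminator.trainable = True          # trainability is untouched
discriminator.collect_updates = False   # but BN moving statistics would stay fixed
```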
FWIW, none of the following works to freeze those 2 weights in the BatchNormalization I put into the Discriminator in the acgan example: immediately after
before and after
@fchollet I agree,
@ahundt, I have no immediate plans to work on it, since I don't understand that part of the Keras code, and therefore the suggestions (as of now). I also don't understand the performance implications, for example compared to native tensorflow/tensorflow#12580. So if you feel like doing it, it would be great. For reference, this is how I check if freezing actually works in the acgan example after putting
A much simpler example can't be made, because this is the use case for which none of the hacks, etc. work, but see the example in #8676 (put
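A hedged sketch of one way such a check can be done (it assumes the variable names of the example script -- discriminator, combined, batch_size, latent_size -- and is illustrative rather than the exact code used):

```python
import numpy as np

# Collect the BatchNormalization layers of the (supposedly frozen) discriminator.
bn_layers = [l for l in discriminator.layers
             if l.__class__.__name__ == 'BatchNormalization']
# get_weights() returns [gamma, beta, moving_mean, moving_variance]
before = [l.get_weights() for l in bn_layers]

# Train the stacked generator+discriminator model for one batch.
noise = np.random.uniform(-1, 1, (batch_size, latent_size))
sampled_labels = np.random.randint(0, 10, (batch_size, 1))
combined.train_on_batch([noise, sampled_labels],
                        [np.ones(batch_size), sampled_labels])

after = [l.get_weights() for l in bn_layers]
for b, a in zip(before, after):
    # If freezing works, the moving statistics (indices 2 and 3) are unchanged.
    print(np.allclose(b[2], a[2]), np.allclose(b[3], a[3]))
```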
@ozabluda sorry, I meant to @-mention fchollet on that previous post; I was addressing the feature he was suggesting. I've edited it in now.
@fchollet Also, I'm not sure I mentioned this, but freezing batch normalization updates is quite helpful when fine-tuning segmentation problems from pre-trained weights. I think something like this modified BatchNormalization class with freeze would do the trick. Would it be acceptable to place the
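For reference, a rough sketch of how such a freeze switch is commonly implemented (an assumption about the linked class, not a copy of it): force inference-mode behavior so the moving statistics are used but never updated.

```python
from keras.layers import BatchNormalization

class FreezableBatchNormalization(BatchNormalization):
    """BatchNormalization with a freeze switch (illustrative sketch only)."""

    def __init__(self, freeze=False, **kwargs):
        super(FreezableBatchNormalization, self).__init__(**kwargs)
        self.freeze = freeze

    def call(self, inputs, training=None):
        if self.freeze:
            # Force inference mode: use the moving statistics and register
            # no update ops, so the statistics never change.
            return super(FreezableBatchNormalization, self).call(inputs,
                                                                 training=False)
        return super(FreezableBatchNormalization, self).call(inputs,
                                                             training=training)

    def get_config(self):
        config = super(FreezableBatchNormalization, self).get_config()
        config['freeze'] = self.freeze
        return config
```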
For the record, on master you can now set
@lukedeo would you like to review this PR?
yep @fchollet, can check by EOW
@ahundt @ozabluda FYI, I think the
Thus I am reverting
@lukedeo if you're still available to review this PR, we're waiting for your input. Otherwise, please let us know and we will find another reviewer. Thanks!
@fchollet thanks for the update, the new functionality you describe is the behavior I imagined when I first saw the name of the flag a year ago when I was first learning the code base. I look forward to using it!
For reference, the changes described in #8616 (comment) were made in 24246ea.
Specifically, the reverse commit that makes
Hey @fchollet, back from vacation. Will look today - sorry.
@fchollet this seems fine to me overall. @ozabluda one question, though not necessarily needed for this PR, is whether we should also add BN to the discriminator, since it's technically closer to the paper. If we see good performance, which we do, we don't necessarily need it, but I'm just raising it since people might question it, now that I think about it. The one place I could see this maybe helping is if people want to adapt this script to CIFAR10 or that ilk.
I tried adding BN to the discriminator (the very first message in this PR, item 1). With recent fixes to the BN, maybe I'll try it again later. An attempt to adapt it to CIFAR-10 is going on here: #8937
I've created animations of acgan training. Resolutions are 1080p (2K), 1440p (3K), 2160p (4K), and 7680p (8K). There are two types of video. Note that on YouTube you can change the playback speed and, when paused, go frame-by-frame with ',' and '.'. Real images are on the bottom; fake generated ones are on top. Epoch/iteration is shown in the gray bar in between. For every column of 10 digits (0-9), the latent noise vector is the same (the class is different). For every row of the same digit, the latent noise vector is different (the class is the same). The discriminator's probability real/fake is shown as a grayscale square around each digit. It's scaled such that p=0.5 is pure black and p=0.75 is pure white, with p<0.5 and p>0.75 clipped to the min and max values; otherwise one wouldn't be able to see what is going on. Staring at these videos for a while falsified a lot of my preconceived notions of what the GAN is actually doing. Too many to describe here. I highly recommend anyone who is interested to stare for a while as well.
@ozabluda can you fix the YouTube link? It doesn't seem to work for me.
@ahundt, fixed.
thanks! looks neat
Two quick things:
P.S. I've added an acgan0 8K video - 30 min, 100 GB, baby!
I present you with: 🥇 Awarded for the highest-resolution 28x28 image of all time. 👍 P.S. Any hints on what rendering tools you used to generate these? Sounds like something useful.
Just ffmpeg. The code to generate PNGs is an embarrassing mess; not sure if it's worthwhile to clean it up and add it to the example.
A PNG utility could go in some non-Keras repo or keras-contrib.
@ahundt
1. Add batch normalization to the Generator. This makes the example closer to the referenced paper and improves generated images. Adding batch normalization to the Discriminator breaks training so badly that I suspect a bug (maybe "Add tests exposing BatchNormalization bug(s)? when used in GANs" #5647 is fixed incompletely or something). Not adding batch normalization to the Discriminator also side-steps the issue of correlation of samples within a batch (https://github.com/soumith/ganhacks#4-batchnorm).
2. Use one-sided soft labels and a harder soft_one=0.95 (vs 0.9). The referenced paper says they don't need one-sided soft labels. This example also doesn't "need" them any more, but the generated images are better. (A short label-smoothing sketch follows this list.)
3. Add reference to a paper.
4. Increase epoch=100 from 50, as good images often appear between epochs 50 and 100. Note that the training time per epoch is half that of the original example, after 67cd3b0.
5. Increase output precision of various losses from 2 decimal digits to 4. You can't really tell what is going on with just 2.
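A hedged sketch of the one-sided label-smoothing idea from item 2 (variable names such as soft_one and batch_size follow the spirit of the example script but are assumptions here, not the exact code):

```python
import numpy as np

soft_one = 0.95   # "harder" one-sided soft label (previously 0.9)
batch_size = 100

# Discriminator targets: real images get the softened label, fake images keep
# a hard 0 -- the smoothing is one-sided.
y_real = np.full(batch_size, soft_one)
y_fake = np.zeros(batch_size)
y = np.concatenate([y_real, y_fake])

# When training the generator through the combined model, the generator is
# rewarded for making the discriminator output the soft "real" label.
trick = np.full(2 * batch_size, soft_one)
```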
@lukedeo, I see a lot of examples online which use embedding with Hadamard, but do you know of any paper(s) we can reference? I haven't seen it in any of the GAN papers. I really like embedding with Hadamard, as replacing them would require multiple (3-5?) additional layers, but to be thorough I did make a half-hearted attempt to remove them (to make it closer to the acgan paper), just to see if I could, and failed (generated images are much worse).
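For context, a minimal sketch of the embedding-with-Hadamard conditioning being discussed (shapes and names are illustrative; the actual example script may differ slightly):

```python
from keras.layers import Input, Embedding, Flatten, multiply

latent_size = 100
num_classes = 10

latent = Input(shape=(latent_size,))
image_class = Input(shape=(1,), dtype='int32')

# Learn a dense vector per class, then condition the generator by taking an
# element-wise (Hadamard) product with the latent noise vector.
cls = Flatten()(Embedding(num_classes, latent_size,
                          embeddings_initializer='glorot_normal')(image_class))
h = multiply([latent, cls])  # shape: (batch, latent_size)
```

Compared with concatenating a one-hot class vector and stacking extra Dense layers, the Hadamard product keeps the conditioning in the same dimensionality as the latent vector.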