
Trained for 17K iters on COCO2014, OpenImages and OpenAI's Blog Images #100

Closed

afiaka87 opened this issue Mar 18, 2021 · 19 comments

afiaka87 (Contributor) commented Mar 18, 2021

In case you haven't read my usual disclaimer: this dataset is weird. The repetition in the OpenAI blog images causes those to be highly overfit (mannequins), while the remainder of the dataset is much more diverse, which dalle-pytorch doesn't manage to capture very well here. Also, keep in mind - this isn't even a full epoch. Just having fun. Try not to evaluate this as representative of dalle-pytorch's current capabilities.

[image: closetotheend]

Hey everyone. @lucidrains recently got the new, lighter pretrained VAE from the taming-transformers group working in dalle-pytorch. It uses substantially less memory and compute. I decided to take all the datasets I've collected thus far, put them in a single folder on an A100, and train dalle-pytorch for several hours.
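
(For anyone curious, the rough shape of such a run, per the dalle-pytorch README of that era, is sketched below; the hyperparameters are illustrative rather than the exact ones used here, and the VQGanVAE1024 wrapper may be named differently in newer versions.)

```python
import torch
from dalle_pytorch import VQGanVAE1024, DALLE

vae = VQGanVAE1024()          # pretrained taming-transformers VQGAN wrapper

dalle = DALLE(
    dim = 512,
    vae = vae,                # image sequence length / token count are inferred from the VAE
    num_text_tokens = 10000,  # text vocabulary size
    text_seq_len = 256,
    depth = 16,               # illustrative; deeper is better if it fits in VRAM
    heads = 16,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

# one toy step; a real run loops over (caption, image) batches from the dataset folder
text   = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask   = torch.ones_like(text).bool()

loss = dalle(text, images, mask = mask, return_loss = True)
loss.backward()
```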

Here are the results:

https://wandb.ai/afiaka87/OpenImagesV6/reports/Training-on-COCO-OpenImage-Blogpost--Vmlldzo1NDE3NjU

I'm exhausted so that's all for now, but please click the link and have a look at the thousands of reconstructions it made (and the horrible captions from the "Localized Narratives" dataset I got from Google). I'll be updating this post with more info throughout the day.

rom1504 (Contributor) commented Mar 18, 2021

so mostly it doesn't work very well?
I guess a better dataset / training for longer is needed?

afiaka87 (Author) commented Mar 18, 2021

> so mostly it doesn't work very well?
> I guess a better dataset / training for longer is needed?

edit: Happy to learn I'm probably wrong about this dataset being bad. It's got some garbage in it, but any sufficiently large dataset will these days. What we need now is more data, more compute, larger batch sizes, and higher depth.

Original:

Who knows what would happen after a full 17k iters * 20 epochs. This dataset is pretty bad. Seriously, go read some of the prompts from the labelled annotations; they very often don't make sense. They do include mouse positions as well, though, and I now realize it's more of a dataset for developing good image segmentation techniques.

The largest portion of the dataset is generated images containing lots of mistakes. Apparently even the top 32 of 512 generations from DALL-E proper will produce something totally incorrect something like 20% of the time. Garbage in, garbage out. They definitely curated the examples on the front page even though they claim they didn't (even after having CLIP re-rank the images for them, which is basically automated curation).

Having said that, yeah, we still need a bigger dataset. OpenAI used an extremely large dataset, and this doesn't get anywhere close to that. They also used a higher quality VAE and a batch size of 512... These things aren't going to be possible without mesh-dalle. Hopefully we can continue to find techniques that get better results out of smaller datasets as well.

But yeah, please don't take this as some sort of scientific baseline for "how good dalle-pytorch is". It's a bad dataset with bad captions, flooded with images that are likely to make CLIP very happy even when they contain mistakes, since they were partially generated using that very same CLIP. The only reasonable data in here is the COCO2014 set, and that's only 200k images out of ~1.6 million.

afiaka87 (Author) commented Mar 18, 2021

@lucidrains Until we can go through and really clean the hell out of the prompts, I'd advise staying away from OpenImages "Localized Narratives" for this. The phrasing is too verbose, distracted, and wandering, and it contains enough mistakes that I see gibberish about 5% of the time and only potentially coherent text the other 95% of the time... It's... pretty bad. At least compared to the claims they make on the project's front page. It was really annoying to download all 100 GiB of that only to find out it was so poorly labeled.

I also started two training sessions last night that are still running, but only on COCO and the blog post images. I'll post the results later today. In the meantime, enjoy this mannequin:

a male mannequin dressed in a blue and black bomber jacket and brown pleated trousers
[image: mannequin_from_1024]

sorrge (Contributor) commented Mar 18, 2021

You said OpenAI used a higher quality VAE. Didn't they release the weights for it?
Also, I don't agree with your assessment of the Localized Narratives captions. They are decent. They at least mention the most important objects. Importantly, they are made for this dataset, with training in mind. The "wild" captions scraped from the internet, which OpenAI used, are much worse, because they are made by people with no intention of describing the picture accurately.

lucidrains (Owner) commented Mar 18, 2021

@sorrge they are released, and you can even start training with them in this repo! https://github.com/lucidrains/dalle-pytorch#openais-pretrained-vae

sorrge (Contributor) commented Mar 18, 2021

Thanks. Is there a reason to believe that the OpenAI VAE is better than the Taming Transformers one that @afiaka87 used? Besides the range of token values. Did somebody compare their reconstructions?

lucidrains (Owner) commented:

@sorrge #86 (comment) yea, the mannequins look quite good, at least

afiaka87 (Author) commented Mar 18, 2021

> You said OpenAI used a higher quality VAE. Didn't they release the weights for it?

Yes, they did. You can train DALLE-pytorch with it. It's something of a VRAM hog, though, and the taming-transformers VAE shows decent accuracy at a much lower runtime/memory cost because it only uses 1024 tokens. It's impressive work. There are documented issues with it not picking up certain details in reconstructions as well as OpenAI's VAE can, so it's not perfect, but it helps quite a bit in terms of actually being able to train DALLE-pytorch.
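
(A quick way to eyeball the reconstruction gap is to round-trip the same image through both wrappers; a minimal sketch, assuming both expose `get_codebook_indices` and `decode` as in the dalle-pytorch source around this time, with a random tensor standing in for a real image batch.)

```python
import torch
from dalle_pytorch import OpenAIDiscreteVAE, VQGanVAE1024

openai_vae = OpenAIDiscreteVAE()   # downloads OpenAI's pretrained dVAE weights
vqgan_vae  = VQGanVAE1024()        # downloads the taming-transformers VQGAN weights

images = torch.rand(1, 3, 256, 256)   # stand-in for a real image batch in [0, 1]

with torch.no_grad():
    for name, vae in [('openai', openai_vae), ('vqgan', vqgan_vae)]:
        codes = vae.get_codebook_indices(images)   # discrete image tokens
        recon = vae.decode(codes)                  # reconstructed image batch
        print(name, 'seq len:', codes.shape[-1], 'recon:', tuple(recon.shape))
```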

> Also, I don't agree with your assessment of the Localized Narratives captions. They are decent. They at least mention the most important objects. Importantly, they are made for this dataset, with training in mind. The "wild" captions scraped from the internet, which OpenAI used, are much worse, because they are made by people with no intention of describing the picture accurately.

That's totally fair. Dealing with some of this stuff and getting a bad result can be frustrating, and that may color my opinions, unfortunately. I have to say, though, I've been messing with OpenAI's pretrained ViT-B/32 CLIP for quite a while and it has just never been well-suited to these types of prompts, even when they are properly written. It tries its best to maximize the features in context, but sometimes it just doesn't know enough about that many tokens in that order to relay anything more than a few words.

I think you'd need to train a custom CLIP on this data to get it to work the way you're thinking (where it fills in every little detail in the prompt). Which is a fantastic idea actually!
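
(dalle-pytorch does ship a trainable CLIP for exactly this; a sketch following the README-style usage of that era, with made-up batch tensors standing in for real caption/image pairs from this dataset, and argument names that may have changed in later versions.)

```python
import torch
from dalle_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,   # should match the tokenizer used for the captions
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
)

# toy batch; in practice this would iterate over the Localized Narratives captions and images
text   = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask   = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()
```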

afiaka87 (Author) commented:

@sorrge @lucidrains That dataset is pretty cool in that it includes mouse traces from the labelers. They were required to highlight the region they were talking about as they described it aloud, and the audio is transcribed with timing information that can be used to look up roughly "where" in the image each word is meant to go. This has obvious implications for segmentation (which they mention as their motivation). But is there any way we could train on that information in dalle-pytorch's transformer? It's essentially a mapping from each token in the "Localized Narrative" to the relevant region of the image.

I could think of some prompt engineering tricks but that would require...prompt engineering.
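
(Purely as a thought experiment on that prompt-engineering angle: one crude trick would be to collapse each word's trace points into a coarse grid cell and emit it as an extra pseudo-token for the text encoder. The field names below are placeholders, not the actual Localized Narratives schema.)

```python
# Hypothetical sketch: turn per-word trace points into coarse location tags
# appended to each word, e.g. "dog<r1c2>", as extra text tokens.

def region_tag(points, grid=3):
    """Map normalized (x, y) trace points to a coarse grid cell like 'r1c2'."""
    xs = [p['x'] for p in points]
    ys = [p['y'] for p in points]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    return f"r{row}c{col}"

def caption_with_regions(timed_words):
    """timed_words: list of {'word': str, 'points': [{'x': float, 'y': float}, ...]} (placeholder schema)."""
    out = []
    for w in timed_words:
        tag = region_tag(w['points']) if w.get('points') else ''
        out.append(f"{w['word']}<{tag}>" if tag else w['word'])
    return ' '.join(out)
```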

sorrge (Contributor) commented Mar 18, 2021

@afiaka87 Their CLIP is trained on a gigantic dataset (400M image-caption pairs IIRC). Surely there was a lot of garbage in there. It may be confused by the format "in this image we can see", because that's not how people usually annotate their pictures. But it doesn't matter for DALL-E, does it? For post-filtering it will still work, because you would use "normal" prompts for generation, which CLIP can understand.
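
(Concretely, that post-filtering step is just ranking generations by CLIP similarity against the plain prompt; a sketch with the openai/CLIP package, where the prompt and image paths are hypothetical.)

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

prompt = "a male mannequin dressed in a blue and black bomber jacket"
image_paths = ['gen_0.png', 'gen_1.png', 'gen_2.png']  # hypothetical generated samples

images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
text = clip.tokenize([prompt]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)   # cosine similarity per generation

ranking = scores.argsort(descending=True)   # best-matching generations first
```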

afiaka87 (Author) commented Mar 18, 2021

> @afiaka87 Their CLIP is trained on a gigantic dataset (400M image-caption pairs IIRC). Surely there was a lot of garbage in there. It may be confused by the format "in this image we can see", because that's not how people usually annotate their pictures. But it doesn't matter for DALL-E, does it? For post-filtering it will still work, because you would use "normal" prompts for generation, which CLIP can understand.

Yeah, I'm out of my depth on this one. @lucidrains?

Edit: all I know is that I've had trouble with it, like, anecdotally. I'm relatively new to machine learning, though, so I don't have the full depth of understanding needed, and you could very well be correct!

If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.

sorrge (Contributor) commented Mar 18, 2021

DALL-E will probably just learn to ignore the "In this image we can see" beginning and use the list of things that follows as clues for what should be included.
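
(If one didn't want to rely on the model learning that, a trivial preprocessing pass could also strip those openings before training; a hypothetical helper that only covers a few common variants.)

```python
import re

# Hypothetical cleanup: strip the boilerplate openings that Localized Narratives
# captions tend to share, so the model spends capacity on the object list instead.
BOILERPLATE = re.compile(
    r'^(in this (image|picture) (we can see|there (is|are))\s*)', re.IGNORECASE
)

def clean_caption(caption: str) -> str:
    return BOILERPLATE.sub('', caption).strip()

print(clean_caption("In this image we can see a dog on a couch near a window."))
# -> "a dog on a couch near a window."
```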

afiaka87 (Author) commented Mar 18, 2021

> If that's the case, do you think it's just a matter of needing to scale up the batch size and the size of the dataset? I'm getting okay-ish representations on these simpler datasets, but this one seemed like it wasn't going to converge anytime soon.

edit: It could also be that the smaller VAE's errors accumulate more on this dataset? No idea.

sorrge (Contributor) commented Mar 18, 2021

Yes, the size of the dataset and the depth of the model are the keys, per OpenAI's paper. That was the main point, as in their other notable works: how far can the model be pushed. So, if we want quality, we need to match the effort.

In this attempt that you made here, the repetition in the captions (from the blog post) likely caused some overfitting. For example, the mannequins are relatively similar in both captions and images, and it learned them the best. To train a powerful model, we need more diversity in the data.

afiaka87 (Author) commented Mar 18, 2021

> Yes, the size of the dataset and the depth of the model are the keys, per OpenAI's paper. That was the main point, as in their other notable works: how far can the model be pushed. So, if we want quality, we need to match the effort.
>
> In this attempt that you made here, the repetition in the captions (from the blog post) likely caused some overfitting. For example, the mannequins are relatively similar in both captions and images, and it learned them the best. To train a powerful model, we need more diversity in the data.

Thanks, that's helpful. I guess the main issue there is the obvious lack of compute. Without finding further optimizations (such as the 1024-token model), we're looking at year-long training times. Anyway, that's always been obvious.

As for the depth - I continue to shoot for 64 (which surprisingly fits in VRAM at a batch size of <12 with the 1024 VAE). It does indeed produce higher quality images. Do you think training on this same dataset with a depth of 64 is worthwhile?
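
(For reference, a depth-64 configuration along those lines might look like the sketch below; it assumes the `reversible` flag dalle-pytorch exposes for trading compute for memory, and the numbers are illustrative rather than this run's exact settings.)

```python
from dalle_pytorch import VQGanVAE1024, DALLE

# illustrative depth-64 configuration; reversible layers recompute activations
# in the backward pass, which is what makes this depth reachable at small batch sizes
dalle = DALLE(
    dim = 512,
    vae = VQGanVAE1024(),
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,
    heads = 16,
    dim_head = 64,
    reversible = True
)
```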

sorrge (Contributor) commented Mar 18, 2021

I'd at least wait for the WIT data, which should come out in a few days. That's ~11M images with reasonably good captions from Wikipedia. It will be a dramatic jump forward from the current dataset.

afiaka87 (Author) commented Mar 18, 2021

Cool, thanks. This compute is expensive, and it's very useful for me to know when something is a waste of effort/money and when it isn't.

At any rate, I just got some stimulus money and decided to invest a couple hundred dollars of it in GPU compute on vast.ai so I can have an actual stable development environment for a while. This is all a good learning experience for me whether I get good results or not. If you have an idea for a dataset to train on, or need compute for debugging a new feature, do let me know and I'll see if I have any compute available.

robvanvolt (Contributor) commented Mar 18, 2021

I'm still waiting for my rig to get shipped; until then I can only comment on "metadata". But I think you're doing a really good job, @afiaka87! Even with the "bad" dataset, the results seem promising, and the taming-transformers VAE speeds things up!

Things to look forward to:

Moreover, we should start a list of big datasets that might fit DALL-E training:

afiaka87 (Author) commented:

Please check the discussions tab for information on my training efforts:
#106
