[Re]generative Diffusion models by Benjamin Harbakk
Learn how companies that sell Generative AI programs are lying to you. Whenever a company tells you that its models are ethical and legally licensed, it is referring to its finetuning data, not the pretraining data, which is what programs like Midjourney and Stable Diffusion rely on to be so “good”. Both rely heavily on the LAION dataset.
This blog post was written by Benjamin Harbakk (@Stealcase) with information provided by Margaret Mitchell (@MMitchell).
[Re]generative Diffusion models
This document attempts to give an overview of the often confusing landscape of popular diffusion models. Diffusion is NOT the only method used to generate image data, but it has become the most popular one because of the flexibility it offers and the results of some of the larger models released on the web.
Terminology
Pre-training: Synonymous with the term “training”. Usually involves running a massive amount of data through a mathematical formula, and comparing the result of that formula with the expected and desired result. Depending on how much the result deviates from the expected result, the internal formula parameters (weights) are adjusted.
For reference: Stable Diffusion 2.1 needed (as a conservative estimate) roughly 200,000 A100 GPU-hours to train, and around 1.2+ billion image/text pairs. (That is about 22 years of compute on a single GPU.)
Training Stable Diffusion 1.5 allegedly cost $600k in compute.
Training a diffusion model from scratch is currently not feasible for a regular consumer, and requires machine learning expertise as well as a budget for errors and experimentation.
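The "22 years on a single GPU" figure above can be checked with a few lines of arithmetic. This is only a sketch: the GPU-hour number is the estimate quoted above, and the 256-GPU cluster size is a hypothetical value chosen for illustration, not a detail of how any real model was trained.

```python
# Convert the GPU-hour estimate quoted above into wall-clock time.
GPU_HOURS = 200_000        # conservative estimate quoted above for Stable Diffusion 2.1
HOURS_PER_YEAR = 24 * 365  # ignoring leap years


def gpu_hours_to_years(gpu_hours: float) -> float:
    """Years a single GPU would need if it ran around the clock."""
    return gpu_hours / HOURS_PER_YEAR


def days_on_cluster(gpu_hours: float, gpu_count: int) -> float:
    """Days needed if the work splits perfectly across gpu_count GPUs (an idealisation)."""
    return gpu_hours / gpu_count / 24


years = gpu_hours_to_years(GPU_HOURS)
cluster_days = days_on_cluster(GPU_HOURS, 256)  # 256 is a hypothetical cluster size

print(f"{years:.1f} years on a single GPU")
print(f"{cluster_days:.1f} days on a hypothetical 256-GPU cluster")
```

Even with hundreds of GPUs running in parallel, a single training run takes on the order of a month, which is why pre-training is out of reach for individuals.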
Weights: the internal parameters in the formula that affect the final output. These dictate what part of the input is the most important, and what parts are the least important.
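The loop described under Pre-training (run data through a formula, compare with the expected result, adjust the weights) can be sketched in miniature. This is a toy with two weights and made-up data, not how a diffusion model is actually implemented; the function names are illustrative only.

```python
# Toy "formula": prediction = w * x + b. Real diffusion models have on the
# order of a billion weights; this one has two.


def predict(weights, x):
    return weights["w"] * x + weights["b"]


def train(weights, data, learning_rate=0.01, epochs=200):
    """Adjust the weights after every example, in proportion to the error."""
    weights = dict(weights)  # copy so the caller's weights are left untouched
    for _ in range(epochs):
        for x, expected in data:
            # How far did the formula's result deviate from the expected result?
            error = predict(weights, x) - expected
            # Gradient of half the squared error with respect to each weight.
            weights["w"] -= learning_rate * error * x
            weights["b"] -= learning_rate * error
    return weights


# Made-up training data that follows expected = 3 * x + 1.
data = [(x, 3 * x + 1) for x in range(-5, 6)]
trained = train({"w": 0.0, "b": 0.0}, data)
print(trained)
```

The returned dictionary plays the role of a checkpoint: once training is done, the weights are all that is needed to produce outputs.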
Pretrained model: Synonymous with the term “Checkpoint”. This is the resulting “AI Model”. A good way of thinking of this is: the final mathematical formula that takes input (text or images) and produces an output.
Finetuning: using an existing pre-trained model and feeding it a MUCH smaller dataset to focus the weights in a certain direction. Finetuning datasets can be as small as 10 images, or as large as 40,000. This is still significantly smaller than the pre-training dataset, and is accessible to consumers: a consumer can feasibly finetune an existing model on a single graphics card. When someone says they “trained their own model”, they are usually using the wrong terminology and have actually finetuned an existing model. For comparison:
Pretrained Firefly: 300+ million images (at the time of training, Adobe Stock contained more than 300 million images)
Pretrained Midjourney: 400+ million images
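To make the difference concrete, here is a toy sketch of finetuning: the same kind of training loop, but started from already trained weights and fed only ten examples. The two-weight formula and all of the data are made up for illustration; real finetuning works on image models with vastly more weights.

```python
def predict(weights, x):
    return weights["w"] * x + weights["b"]


def train(weights, data, learning_rate=0.01, epochs=20):
    """Return a copy of the weights after nudging them towards the data."""
    weights = dict(weights)  # leave the starting weights untouched
    for _ in range(epochs):
        for x, expected in data:
            error = predict(weights, x) - expected
            weights["w"] -= learning_rate * error * x
            weights["b"] -= learning_rate * error
    return weights


def mean_squared_error(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


# "Pre-training": 1001 made-up examples of a general pattern (y = 3x + 1).
pretraining_data = [(k / 100, 3 * (k / 100) + 1) for k in range(-500, 501)]
pretrained = train({"w": 0.0, "b": 0.0}, pretraining_data)

# "Finetuning": ten made-up examples of a slightly different pattern (y = 3x + 2).
finetuning_data = [(k / 2, 3 * (k / 2) + 2) for k in range(10)]
loss_before = mean_squared_error(pretrained, finetuning_data)
finetuned = train(pretrained, finetuning_data, epochs=500)
loss_after = mean_squared_error(finetuned, finetuning_data)

print(len(pretraining_data), len(finetuning_data))
print(loss_before, loss_after)
```

The finetuned weights only exist because the pretrained weights were there to start from, which is the point of the distinction drawn above.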
The distinction between pretraining and finetuning is not yet well understood by the media, and as such very few articles make the distinction. Machine learning researchers like Margaret Mitchell (who works at Hugging Face, the most popular site for downloading pre-trained models) explain it briefly here:
Stable Diffusion
Stable Diffusion is a family of pretrained models developed mainly by, or in collaboration with, Stability AI. Runway ML has also been involved with Stable Diffusion.
They are by far the most popular image generators on the market, since the
Connected to Stability AI
Uses LAION 5b (and subsets, like the “English only” subset, and the “Higher than 0.5 aesthetic score” subset)
Uses CLIP or OpenCLIP
Usually “Open Source”: released online and possible to run locally
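The subsets mentioned in the list above are, conceptually, filters over the dataset's metadata. The sketch below shows the idea only: the record fields, the placeholder rows, and the function names are all hypothetical, and the 0.5 threshold is simply the figure quoted above, not a value taken from the actual LAION tooling.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical metadata fields; the real dataset schema may differ.
    url: str
    caption: str
    language: str
    aesthetic_score: float


def english_only(records):
    """Keep only the records whose caption language is English."""
    return [r for r in records if r.language == "en"]


def above_aesthetic_score(records, threshold=0.5):
    """Keep only the records scored higher than the threshold."""
    return [r for r in records if r.aesthetic_score > threshold]


# Placeholder rows, invented purely to exercise the filters.
records = [
    Record("https://example.com/1.jpg", "a castle at sunset", "en", 0.8),
    Record("https://example.com/2.jpg", "ein Schloss", "de", 0.9),
    Record("https://example.com/3.jpg", "blurry receipt", "en", 0.2),
]

english = english_only(records)
english_and_aesthetic = above_aesthetic_score(english)
print(len(records), len(english), len(english_and_aesthetic))
```

Note that a subset is still drawn from the same scraped data; filtering changes which rows are used, not where they came from.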
Pre-trained Models
The documentation of these models can be found below, where the usage of LAION 5b is documented.
Stable Diffusion SDXL (This model is only accessible through an API or web platforms)
Stable Diffusion 2.0 (Trained on LAION 5b)
Stable Diffusion 1.5 (Trained on LAION 2b English, including NSFW)
Finetuned models
Most finetuning is done using Stable Diffusion 1.5, since:
This model was one of the first released openly.
This model was trained on an “unfiltered” version of LAION 5b (NSFW images were not filtered out).
The model has high performance in producing photorealistic images.
The most popular website to download finetuned models is https://civitai.com/. WARNING: website contains NSFW.
Since finetuning can be done on consumer GPUs, many users are downloading artists' online portfolios and finetuning models on them. Here, a user has finetuned a Stable Diffusion 1.5 model on Sam Yang’s art:
https://civitai.com/models/6638/samdoesarts-sam-yang-style-lora
https://www.instagram.com/samdoesarts/
Note that the file size of this finetune is 144 MB. To run it, you still need the 4.5 GB Stable Diffusion model as a base.
Technical details
There are several different methods of finetuning.
Hypernetworks
LoRA
Textual Inversion
Some methods involve “overwriting” a pre-trained model with new model weights (often called “merged checkpoints”), but more modern methods involve creating smaller files that can be “overlaid” on an existing pre-trained model (like SD 1.5).
Here is a guide that explains one of these methods:
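The "overlay" idea can be illustrated with a LoRA-style low-rank update, where a small pair of matrices is added on top of an untouched base layer. This is a simplified sketch in plain Python: the matrices are made up, the 768x768 layer size and rank 8 are hypothetical numbers picked for the comparison, and hypernetworks and textual inversion work differently.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_overlay(base, up, down, scale=1.0):
    """Return base + scale * (up @ down); the base weights are not modified."""
    delta = matmul(up, down)
    return [[base[i][j] + scale * delta[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]


# Tiny made-up example: a 4x4 base layer and a rank-1 overlay (4x1 times 1x4).
base = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]
up = [[0.5], [0.0], [0.0], [0.0]]
down = [[0.0, 2.0, 0.0, 0.0]]

patched = apply_overlay(base, up, down)

# Storage comparison for one hypothetical 768x768 layer with overlay rank 8.
size, rank = 768, 8
full_params = size * size         # every weight of the layer
overlay_params = 2 * size * rank  # the two thin matrices of the overlay
print(full_params, overlay_params, full_params / overlay_params)
```

Because only the two thin matrices are stored, the overlay file is a small fraction of the base model's size, but it is useless without the base weights it is added to.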
https://aituts.com/textual-inversion/
Midjourney
Midjourney is a series of closed-source pre-trained models produced by Midjourney, trained on “...pretty much the entire internet” (David Holz).
Pretrained Models
Midjourney models are accessible through their Discord Server, and have different versions:
Midjourney Niji (Anime style model)
Midjourney’s weights are noticeably stronger and more focused on artistic outputs than Stable Diffusion. Prompting even with single words will result in aesthetic-looking characters.
Data Sources
Midjourney models are trained on LAION 400m (this is mentioned multiple times in their Discord, but is not commonly found on the open internet) and a number of other datasets that have NOT been disclosed. They have never explicitly said they trained subsequent models on LAION 5b, but it is heavily implied by multiple moderators.
We can infer many of the sources from some of the popular prompts used in the Discord server. Users often reference ArtStation, CGSociety, Behance, Pinterest, DeviantArt, Concept Art World, Flickr, and artists' names directly in their prompts in order to get better results.
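One way to make this kind of inference more systematic would be to count how often each site name appears in a collection of prompts. The sketch below assumes you already have prompts as plain strings; the three example prompts are invented placeholders, not real prompts from the Discord server, and the function name is illustrative.

```python
from collections import Counter

# Site names taken from the paragraph above, lowercased for matching.
SOURCES = ["artstation", "cgsociety", "behance", "pinterest",
           "deviantart", "conceptartworld", "flickr"]


def count_source_mentions(prompts):
    """Count how many prompts mention each site name (case-insensitive)."""
    counts = Counter()
    for prompt in prompts:
        lowered = prompt.lower()
        for source in SOURCES:
            if source in lowered:
                counts[source] += 1
    return counts


# Invented placeholder prompts, used only to show the counting.
prompts = [
    "portrait of a knight, trending on ArtStation",
    "forest landscape, artstation, behance",
    "product photo, flickr",
]

mentions = count_source_mentions(prompts)
print(mentions.most_common())
```

A high count shows only that users believe the name improves results; it is a hint about the training data, not proof of it.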