First, I set the random seed to 180, which is used throughout the rest of the project. Then I ran the model in stage 1 and stage 2 with the three prompts given in the project example. Here are my results.
In the first row, the images are the results from the stage-1 model; their prompts are "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship". The second row shows the results from the stage-2 model, with the same prompts as the first row.
First, I resized the image to 64 × 64 to fit the model's input requirement. According to the forward-process equation:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $x_0$ is the clean image, $\epsilon$ can be drawn with torch.randn_like, and $\bar{\alpha}_t$ (the cumulative product of the alphas) can be taken from the model itself, I generated three noisy images at timesteps [250, 500, 750].
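A minimal sketch of this forward (noising) process, assuming `alphas_cumprod` is the $\bar{\alpha}$ schedule tensor taken from the model's scheduler (the function and argument names here are my own):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    abar_t = alphas_cumprod[t]              # cumulative product of alphas at timestep t
    eps = torch.randn_like(x0)              # eps ~ N(0, I)
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return x_t, eps
```

Calling this for t in [250, 500, 750] gives progressively noisier versions of the same image.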
Using two different methods (Gaussian blur and one-step UNet denoising), I got the following results. The first row contains the results from one-step UNet denoising, which are clearly better than the results from Gaussian blur.
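A sketch of the one-step denoise, assuming a diffusers-style UNet whose call returns the noise estimate in `.sample` (the names and the blur kernel size below are assumptions, not my exact settings). It simply inverts the forward equation using the predicted noise:

```python
import torch
import torchvision.transforms.functional as TF

def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
    """Estimate the clean image from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    with torch.no_grad():
        eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    # Invert x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps for x0
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()

# Classical baseline for comparison (kernel size chosen arbitrarily here)
def gaussian_denoise(x_t, kernel_size=7):
    return TF.gaussian_blur(x_t, kernel_size)
```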
These are the denoising results at t = 90, 240, 390, 540, and 890.
The first one is the raw image, the second is the result of one-step denoising, the third is from iterative denoising, and the last is Gaussian-blur denoising. As we can see, iterative denoising retains more of the image's detail.
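One update of the iterative denoising loop can be sketched as below, following the DDPM-style posterior-mean formula with strided timesteps (a sketch under my naming: `x0_hat` is the current clean-image estimate recovered from the noise prediction, and `noise` stands in for the added variance term $v_\sigma$):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod, noise):
    """Move from timestep t to an earlier timestep t_prev (t_prev < t)."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev              # effective per-step alpha
    beta = 1 - alpha
    return (abar_prev.sqrt() * beta / (1 - abar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t \
         + noise                            # variance term v_sigma
```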
Using the prompt 'a high quality photo', the model can generate images from pure random noise. Some of them look a bit strange.
By introducing classifier-free guidance (CFG) into iterative denoising, we can improve the image quality. The conditional prompt is 'a high quality photo' and the unconditional prompt is '' (the null prompt).
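A sketch of the CFG noise combination, under the same UNet-call assumption as above (the guidance scale here is illustrative, not necessarily the value I used):

```python
def cfg_noise(unet, x_t, t, cond_embeds, uncond_embeds, guidance_scale=7.0):
    """eps = eps_uncond + gamma * (eps_cond - eps_uncond), with gamma > 1."""
    with torch.no_grad():
        eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample
        eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```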
To implement the SDEdit algorithm, I used a list of starting indices [1, 3, 5, 6, 10, 20]; a larger starting index adds less noise, so it should produce an image more similar to the original than a smaller one does.
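A sketch of SDEdit under these assumptions: `strided_timesteps` is the descending timestep list from iterative denoising, `forward()` is the noising sketch from earlier, and `denoise_from` is a hypothetical helper that runs the iterative-denoising loop starting at index `i_start`:

```python
def sdedit(x0, i_start, strided_timesteps, alphas_cumprod, denoise_from):
    """Noise a real image to an intermediate timestep, then denoise it back."""
    t = strided_timesteps[i_start]            # larger i_start -> smaller t -> less noise
    x_t, _ = forward(x0, t, alphas_cumprod)   # forward() from the earlier sketch
    return denoise_from(x_t, i_start)
```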
After part 1.7, I tried the same algorithm on non-realistic images and got fine results.
In this experiment, I used the RePaint technique to repair the top of a bell tower. A binary mask delineates the areas in need of restoration: during each step of the denoising diffusion loop, areas outside the mask are forced to match the original image content, while areas inside the mask are updated by the generative model.
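The per-step projection can be sketched as below, reusing the `forward()` noising sketch from earlier; `mask` is 1 where content should be regenerated (names are my own):

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """RePaint-style projection applied after each denoising update."""
    x_orig_t, _ = forward(x_orig, t, alphas_cumprod)   # original, noised to the same t
    return mask * x_t + (1 - mask) * x_orig_t          # keep generation only inside the mask
```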
By using different prompts, we can push the model toward a desired output: for example, using "a rocket" on the Campanile or "a photo of dog" on Captain Picard.
Using two different prompts, we get noise_est1 and noise_est2: noise_est2 is computed on the vertically flipped image, flipped back, and averaged with noise_est1 to give the conditional noise for the visual anagram. We then handle the noise as in CFG to get the final result.
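A sketch of the anagram noise estimate (a vertical flip via torch.flip; the UNet call convention is the same assumption as in the earlier sketches):

```python
import torch

def anagram_noise(unet, x_t, t, embeds1, embeds2):
    """Average the upright estimate with a flip-evaluate-unflip estimate."""
    with torch.no_grad():
        eps1 = unet(x_t, t, encoder_hidden_states=embeds1).sample
        eps2 = unet(torch.flip(x_t, dims=[-2]), t, encoder_hidden_states=embeds2).sample
    eps2 = torch.flip(eps2, dims=[-2])   # flip the second estimate back upright
    return (eps1 + eps2) / 2
```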
Using two different prompts, we again get noise_est1 and noise_est2. A Gaussian blur extracts the low frequencies of each estimate, and the formula high_freq = raw_noise - low_freq_noise gives the high-frequency part; combining the low frequencies of one estimate with the high frequencies of the other yields the hybrid noise.
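A sketch of the hybrid-noise combination (the blur kernel size and sigma below are illustrative guesses, not my exact values):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, embeds_low, embeds_high, kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's estimate, high frequencies from the other's."""
    with torch.no_grad():
        eps1 = unet(x_t, t, encoder_hidden_states=embeds_low).sample
        eps2 = unet(x_t, t, encoder_hidden_states=embeds_high).sample
    low = TF.gaussian_blur(eps1, kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size, sigma=sigma)  # high = raw - low
    return low + high
```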
The following results are 'skull and waterfall', 'rocket and pencil', and 'a man with a hat and a dog'.
Training loss graph (I changed some of the default settings, such as batch size and learning rate, to achieve a smaller loss):
The top one is the result after epoch 1; the bottom one is the result after epoch 5.
The results below show denoising at different σ values across epochs: the first image is from epoch 1 and the second is from epoch 5.
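The out-of-distribution test simply re-noises the same test images at each σ, assuming the additive noising model z = x + σε used for training (a sketch with my own naming):

```python
import torch

def noise_at_sigma(x, sigma):
    """Additive Gaussian noising z = x + sigma * eps, used to probe the denoiser."""
    return x + sigma * torch.randn_like(x)
```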
With only time conditioning, it is hard to control what the model generates, so the results are random. I drew 10 samples for visualization every 5 epochs. Here are the training log and the results.
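A sketch of the sampling loop for the time-conditioned model, assuming `unet(x, t)` predicts the noise ε and `betas` is the variance schedule (function and argument names are mine):

```python
import torch

@torch.no_grad()
def sample(unet, betas, shape, device):
    """DDPM ancestral sampling from pure noise."""
    alphas = 1.0 - betas
    abars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = unet(x, t_batch)
        # Posterior mean: remove the predicted noise for this step
        x = (x - (1 - alphas[t]) / (1 - abars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # add sampling noise
    return x
```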
With class-label embeddings as additional conditioning, it becomes easy to control what the model generates. I generated the digits 0-9 every 5 epochs for visualization. Here are the training log and the results.
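The class-conditioned sampler is the same loop plus CFG, assuming `unet(x, t, c)` takes a one-hot class vector and that a zero vector acts as the unconditional input (the guidance scale here is a guess):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_class_conditioned(unet, betas, labels, device, guidance_scale=5.0):
    """Class-conditioned DDPM sampling with classifier-free guidance."""
    alphas = 1.0 - betas
    abars = torch.cumprod(alphas, dim=0)
    n = labels.shape[0]
    c = F.one_hot(labels, num_classes=10).float().to(device)
    x = torch.randn(n, 1, 28, 28, device=device)              # MNIST-sized samples
    for t in reversed(range(len(betas))):
        t_batch = torch.full((n,), t, device=device)
        eps_cond = unet(x, t_batch, c)
        eps_uncond = unet(x, t_batch, torch.zeros_like(c))    # null condition
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = (x - (1 - alphas[t]) / (1 - abars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```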