
Stable Diffusion XL

Posted on: February 12, 2024 at 08:00 PM

I’ve been fascinated with text-to-image generators ever since first hearing about DALL-E. I was excited to try them myself when DALL-E 2 came out, and used it to, among other things, create featured images for poetry.

I was impressed by the imagery generated by MidJourney as it flooded Reddit and Instagram and was used by artists in live performance:

[Instagram embed: a post shared by Terence Tuhinanshu (@rajadain)]

I continue to be impressed by AI artists who can create both extremely realistic images:

[Instagram embed: a post shared by Terence Tuhinanshu (@rajadain)]

and absolutely fantastical ones:

[Instagram embed: a post shared by Terence Tuhinanshu (@rajadain)]

and I wanted to play with it myself. But I could never get MidJourney to work for me on the free tier, and DALL-E 2 was too limited.

Eventually, we got Stable Diffusion XL, an open source model that lets users create images from prompts locally on their own computers. Additional tooling like Automatic1111, DiffusionBee, and Fooocus has made the process significantly simpler, to the point that these are almost no-code solutions. I ended up installing Fooocus on my Windows computer back in December, and have played with it on and off since.

My attention was recently renewed when I started seeing ads for tools like Artisse and MyMood AI, which seem to use Stable Diffusion XL under the hood with Low Rank Adaptations (LoRAs) trained on user-submitted faces. Most of their styles focus on sexy and glamorous female poses in various outfits and settings, which serves both the aspirationally glamorous and the soft-deepfake-curious demographics. It also keeps their training targets narrow. I tried out Artisse, which, given a set of 10 photos, can train a model in about 20 minutes and then produce remarkable results by putting that learned face onto generated photos, letting you control various aspects of the body, from complexion to curviness. From their prompts, I could tell they were using Stable Diffusion XL LoRAs. So I set out to see if I could do this myself.
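As an aside, the reason this is cheap enough to offer as a consumer app comes down to how LoRAs work (my gloss, not anything Artisse publishes): rather than fine-tuning every weight in the network, a LoRA freezes each original weight matrix W and learns a small low-rank update, so the effective weight becomes roughly

W′ = W + s · B · A

where A and B are two thin matrices of rank r (typically somewhere in the 4–128 range) and s is the strength knob that tools like Fooocus expose as a slider. Since only A and B are trained, a handful of photos and a single consumer GPU are enough.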

After some research, I found this handy guide by All Your Tech AI, which introduced me to Hugging Face’s AutoTrain Advanced. While most of the documentation encourages you to run it on Google Colab or HF Spaces, I was able to run it locally once I’d installed the appropriate CUDA drivers and Python libraries. Local training is much slower: it takes about two hours on my Nvidia 3080 and uses up all 10GB of VRAM for my ~8 training images. I trained the LoRA using Stable Diffusion XL Base as the base model, with some recent photos of me. Two hours later, I copied the .safetensors file into Fooocus’s loras/ directory, ran Fooocus, and selected Advanced. Under Model I selected my new LoRA, set the strength to 0.95, and asked it to generate a photo of rajadain, the trigger word for the LoRA:

prompt: photo of rajadain

This was very impressive! It looks almost exactly like me, but it’s not me, and it’s not exactly any picture in the dataset. The model is already shifting my likeness in new directions, yet it faithfully puts my face onto the generated image.
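Fooocus does all of this through its UI, but the generation step maps neatly onto Hugging Face’s diffusers library. Here’s a minimal sketch of the equivalent code, assuming the .safetensors file produced by AutoTrain (the file path is hypothetical; the 0.95 scale mirrors my Fooocus strength setting):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the same base model the LoRA was trained against.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Apply the LoRA produced by AutoTrain (hypothetical path).
pipe.load_lora_weights("Fooocus/models/loras/rajadain.safetensors")

# "rajadain" is the trigger word the LoRA was trained on; the scale
# plays the role of Fooocus's LoRA strength slider.
image = pipe(
    "photo of rajadain",
    cross_attention_kwargs={"scale": 0.95},
).images[0]
image.save("rajadain.png")
```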

Next I wanted to try out a few different art styles. This led to some unexpected results. When I tried Pre-Raphaelite painting, I got a European version of myself:

When trying out the Ukiyo-E style, I got a Japanese version:

In an attempt to nudge it towards a more Indian look, I overshot:

Eventually, by tweaking a few different settings including the negative prompt and the LoRA strength, I was able to get something pretty close:
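For anyone curious what that tweaking translates to outside the Fooocus UI, here’s a rough diffusers equivalent, continuing from the sketch above (the prompt text and values are illustrative, not the exact ones I used):

```python
# Reusing the pipeline and LoRA from the earlier sketch. The two main
# knobs are the negative prompt and the LoRA strength; these values
# are illustrative, not the exact ones I landed on.
image = pipe(
    "photo of rajadain, ukiyo-e style portrait of an Indian man",
    negative_prompt="european, japanese",     # steer away from the style's default look
    cross_attention_kwargs={"scale": 0.8},    # lower strength = freer styling
    guidance_scale=7.0,
).images[0]
image.save("rajadain-ukiyoe.png")
```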

It was also fun to discover that certain prompts and styles have a significant female bias, and will produce a girl version of me until explicitly prompted otherwise. For example, this is rajadain in a blue scarf:

and a handsome rajadain man in a blue scarf:

Similarly, when using prompts like “high fashion” or “fashion shoot”, the bias is female until corrected:

That last image shows the kind of AI artifacting that can happen when the parameters clash: I had to turn the LoRA strength up to get a better resemblance, but then my face ended up being forced onto an otherwise incongruous composition.

One reason I got female outputs above is that I had trained the model to respond to “rajadain”, a trigger word the model knows nothing about. One tip I’ve heard is to use a trigger phrase that gives the model an inkling of what you want, such as the name of a celebrity who looks a bit like you. I trained a second model with “hrithik roshan man”, and got significantly handsomer results:

However, you can see that my face came out longer as the model tried to blend it with Hrithik’s distinctive vertical visage.

Here are some others that I thought turned out pretty well:

Overall, I’ve had a lot of fun making these. I don’t know if there’s a use case for this beyond that. I don’t think I’d use one of these for my profile picture, because while some of them are really cool, none of them are me, and so they lack the personal significance necessary for that purpose. I do think there are some great hairstyles in there, though: I may take some of these to my stylist the next time I need a haircut!