Google Whisk: AI Image Generation Tool Guide & Tutorial
Learn how to create unique AI-generated images by combining subjects, scenes, and styles
What is Google Whisk?
Whisk is an experimental AI tool that Google released in December 2024. Unlike traditional text-to-image tools, which rely on written prompts, Whisk lets users prompt with images: you supply pictures for the subject, scene, and style, and Whisk combines them into a new image.

Key Components
Subject
The main focus of your generated image - from characters and objects to complex combinations. Examples include vintage phones, furniture pieces, or fantasy characters.
Scene
The environment or context where your subject appears. This could be anything from fashion runways to holiday cards, allowing for creative character placement and interactions.
Style
Define the aesthetic direction, materials, or techniques for your creation. Enhance your vision by specifying style preferences in the main prompt box.
Guide
Drag and drop an image, or upload it from a folder. You can also create a simple reference from a text prompt, … or have Whisk seed a couple of ideas by selecting "inspire me" or using the "roll the dice" feature. The system will bring those together in creative remixes.

See what Whisk comes up with, and keep riffing! You can also throw in some light guidance to play around with details and keep your imagination going.
"The robot is running"
"Make the characters eat ice-cream"
"The dinosaur and the cat are high fiving!"
"Make sure the enamel pin is round."
"Adjust the color scheme to follow a pastel palette"
If the generated image looks a bit different from what you imagined, you can click the "refine" button to enter refine mode and make small to medium adjustments to get it closer to what you originally wanted.

At any time, you can click the image to see the under-the-hood prompts.

How it works
To whisk elements from different images together, Whisk first needs to develop an understanding of each image you reference. This is where Gemini's multimodal understanding comes in. When you upload an image, Whisk uses Gemini to interpret it and generate a text description (or caption) of it. In other words, it translates the image to text (I2T). These descriptions are meant to capture the essence of your references, not to replicate the originals, which makes it easier to remix ideas.
These captions are then used to write a detailed prompt, shaped by your guidance, which is passed to Imagen 3, Google's image generation model, to generate an image. In other words, the text is translated back to an image (T2I).
This process helps Whisk better understand and represent the ideas you're forming, and iterate while conversing with you.
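The two-step flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Whisk's actual implementation: the function names (`caption_image`, `build_prompt`), the `Captions` type, the canned captions, and the prompt layout are all hypothetical, and the final text-to-image call is left out because Whisk does not expose one here.

```python
from dataclasses import dataclass


@dataclass
class Captions:
    """Text descriptions for the three reference images."""
    subject: str
    scene: str
    style: str


def caption_image(image_label: str) -> str:
    """Hypothetical stand-in for the image-to-text (I2T) step.

    A real system would send the image to a multimodal model. Here we
    look up a canned caption so the sketch runs offline.
    """
    canned = {
        "robot.png": "a small round robot with an antenna",
        "beach.png": "a sunny beach with palm trees",
        "pin.png": "a glossy enamel pin",
    }
    return canned[image_label]


def build_prompt(captions: Captions, guidance: str = "") -> str:
    """Combine the captions and optional user guidance into one prompt.

    The resulting string is what would be handed to a text-to-image
    (T2I) model; the layout used here is an assumption.
    """
    parts = [
        f"Subject: {captions.subject}.",
        f"Scene: {captions.scene}.",
        f"Style: {captions.style}.",
    ]
    if guidance:
        parts.append(f"Guidance: {guidance}")
    return " ".join(parts)


captions = Captions(
    subject=caption_image("robot.png"),
    scene=caption_image("beach.png"),
    style=caption_image("pin.png"),
)
prompt = build_prompt(captions, "The robot is running")
print(prompt)
```

The point of the sketch is that the image model never sees the reference images themselves, only the text derived from them, which is why clicking an image to inspect the under-the-hood prompt is useful when a result surprises you.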
Character inconsistency
Whisk extracts only a few key characteristics from the image you provide to guide the model. The goal is not to create an exact replica, but rather to capture the essence of the subject.
Therefore, the generated image may differ in appearance. For example, the generated subject might be of a different height, weight, or have a different hairstyle or skin tone. These features may be crucial to the unique identity of your character. To achieve a result closer to your vision, you can provide more detailed prompts and refine your instructions.
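A toy model can make this concrete. The sketch below is an assumption-laden illustration, not a description of how Whisk selects characteristics: the attribute names, the `summarize`, `with_guidance`, and `unconstrained` helpers, and the choice of which traits survive are all hypothetical. It shows that any trait missing from the caption is left for the image model to decide, and that extra guidance pins it down again.

```python
def summarize(attributes: dict, keep: tuple) -> dict:
    """Keep only a few key characteristics, as a caption would."""
    return {name: value for name, value in attributes.items() if name in keep}


def with_guidance(caption: dict, guidance: dict) -> dict:
    """Add user-supplied details on top of the caption."""
    merged = dict(caption)
    merged.update(guidance)
    return merged


def unconstrained(reference: dict, prompt: dict) -> list:
    """Traits of the reference that the prompt says nothing about."""
    return sorted(name for name in reference if name not in prompt)


# Hypothetical traits of a reference character.
reference = {
    "species": "cat",
    "fur": "orange tabby",
    "height": "tall",
    "accessory": "red scarf",
}

caption = summarize(reference, keep=("species", "accessory"))
free_before = unconstrained(reference, caption)

refined = with_guidance(caption, {"fur": "orange tabby"})
free_after = unconstrained(reference, refined)

print(free_before)
print(free_after)
```

In this toy setup the fur pattern and height can vary until the guidance mentions them, which mirrors the advice above to describe identity-defining features explicitly.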