Google Whisk: AI Image Generation Tool Guide & Tutorial

Learn how to create unique AI-generated images by combining subjects, scenes, and styles

What is Google Whisk?

Whisk is an experimental AI tool released by Google in December 2024. Unlike traditional text-to-image tools, which start from a written prompt, Whisk lets users prompt with images: you supply reference images and Whisk generates new images from them.

Generate visuals by combining subject, scene, and style images with Google Whisk
Figure 1: Main interface of Google Whisk showing the three-component system

Key Components

Subject

The main focus of your generated image - from characters and objects to complex combinations. Examples include vintage phones, furniture pieces, or fantasy characters.

Scene

The environment or context where your subject appears. This could be anything from fashion runways to holiday cards, allowing for creative character placement and interactions.

Style

The aesthetic direction, materials, or techniques for your creation. You can also specify style preferences in the main prompt box.

Guide

Drag and drop an image, or upload one from a folder. You can also create a simple reference from a text prompt, … or have Whisk seed a couple of ideas by selecting "inspire me" or using the "roll the dice" feature. The system will bring those together in creative remixes.

Interface showing drag and drop areas for subject, scene, and style images
Figure 2: Drag and drop interface for uploading reference images

See what Whisk comes up with, and keep riffing! You can also throw in some light guidance to play around with details and keep your imagination going.

"The robot is running"

"Make the characters eat ice-cream"

"The dinosaur and the cat are high fiving!"

"Make sure the enamel pin is round."

"Adjust the color scheme to follow a pastel palette"

If the generated image looks a bit different from what you imagined, you can click the "refine" button to enter refine mode and make small to medium adjustments to get it closer to what you originally wanted.

Refine mode controls for adjusting generated visuals
Figure 3: Refine mode interface for fine-tuning generated images

At any time, you can click the image to see the under-the-hood prompts.

Under-the-hood prompts used by Whisk
Figure 4: Behind-the-scenes view of Whisk's prompt generation system

How it works

To whisk elements from different images together, Whisk first needs to understand each image you reference. This is where Gemini's multimodal understanding comes in. When you upload an image, Whisk uses Gemini to interpret it and generate a text description (or caption) of it. In other words, it translates the image to text (I2T). These descriptions are meant to capture the essence of your references rather than replicate the originals, which makes it easier to remix ideas.

These captions are then used to write a detailed prompt, shaped by your guidance, which is passed to Imagen 3, Google's image generation model, to generate an image. In other words, the text is translated back to an image (T2I).

This process helps Whisk better understand and represent the ideas you're forming, and lets it iterate while conversing with you.
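The image-to-text and text-to-image steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Whisk's actual implementation: `caption_image`, `compose_prompt`, and `References` are hypothetical names, the captions are hard-coded stand-ins for what Gemini would produce, and the final image-generation call is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the image-to-text (I2T) step. In Whisk this is
# handled by Gemini; here the captions are hard-coded for illustration.
FAKE_CAPTIONS = {
    "robot.png": "a small tin robot with round eyes",
    "beach.png": "a sunny beach with palm trees",
    "felt.png": "stitched felt craft with soft textures",
}


def caption_image(path: str) -> str:
    """Return a short caption for a reference image (hypothetical I2T step)."""
    return FAKE_CAPTIONS[path]


@dataclass
class References:
    """The three reference images a user supplies."""
    subject: str
    scene: str
    style: str


def compose_prompt(refs: References, guidance: str = "") -> str:
    """Merge the three captions and optional guidance into one T2I prompt."""
    parts = [
        f"Subject: {caption_image(refs.subject)}",
        f"Scene: {caption_image(refs.scene)}",
        f"Style: {caption_image(refs.style)}",
    ]
    if guidance:
        parts.append(f"Guidance: {guidance}")
    return ". ".join(parts) + "."


refs = References("robot.png", "beach.png", "felt.png")
prompt = compose_prompt(refs, "The robot is running")
print(prompt)
```

The point of the sketch is that the image model only ever sees text: whatever the captions leave out of the reference images never reaches the generation step.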

Character inconsistency

Whisk extracts only a few key characteristics from the image you provide to guide the model. The goal is not to create an exact replica, but rather to capture the essence of the subject.

Therefore, the generated image may differ in appearance. For example, the generated subject might have a different height, weight, hairstyle, or skin tone. These features may be crucial to the unique identity of your character. To achieve a result closer to your vision, you can provide more detailed prompts and refine your instructions.
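A toy model can make this concrete. The sketch below is an assumption-laden illustration, not a description of how Whisk selects characteristics: the trait names, the `KEPT_KEYS` choice, and both helper functions are hypothetical. It shows that any trait missing from both the caption and your own guidance is left for the image model to decide.

```python
# Traits a viewer might notice in the reference image (illustrative only).
reference_traits = {
    "type": "knight",
    "armor": "silver plate armor",
    "hair": "short red hair",
    "height": "tall",
}

# Hypothetical: the caption keeps only a few key traits.
KEPT_KEYS = ("type", "armor")


def lossy_caption(traits: dict) -> dict:
    """Keep only the key traits, mimicking an 'essence only' caption."""
    return {k: traits[k] for k in KEPT_KEYS if k in traits}


def unspecified(traits: dict, caption: dict, extra_details: dict) -> list:
    """Traits that neither the caption nor the user's extra details pin down."""
    return sorted(
        k for k in traits if k not in caption and k not in extra_details
    )


caption = lossy_caption(reference_traits)

# With no extra guidance, hair and height are free to vary.
print(unspecified(reference_traits, caption, {}))
# → ['hair', 'height']

# Stating the hairstyle in the prompt pins that trait down.
print(unspecified(reference_traits, caption, {"hair": "short red hair"}))
# → ['height']
```

In practice this means listing the traits that matter to your character in the prompt rather than relying on the reference image to carry them.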