Building GeminiKit: Automating Google Gemini Image Generation
Context
I was building a content pipeline, and one component needed high-quality AI images. Gemini's image generation is genuinely impressive, better than most alternatives I tested. But like many of Google's consumer AI products, the web interface has no public API behind it.
The official path is the Gemini API, which does exist — but at the time I was building this, the image generation capabilities there were limited or gated. The web interface was the only way to access the full model. So I automated it.
GeminiKit uses Playwright to drive the Gemini web app, maintains persistent login sessions, and returns the full-resolution image. It also optionally removes SynthID watermarks using a Florence-2 + LaMa inpainting pipeline.
Disclaimer: Automating Gemini's web interface likely violates Google's Terms of Service. Your account could be suspended. Use this for experimentation and personal projects only — not production systems.
Why Playwright, Not Puppeteer
For SunoKit (my Suno automation library), I used rebrowser-puppeteer-core specifically to patch CDP exposure and evade bot detection. Suno's anti-bot measures are real but relatively standard.
Google is a different story. Google runs some of the most sophisticated bot detection in the industry. No amount of CDP patching was going to reliably fool it. So I chose a different strategy: don't fight the fingerprinting — use a legitimate persistent session instead.
The key insight is that a saved browser profile with real Google cookies looks far more like a human session than any patched browser does. Google issued those cookies, they're valid, and the session is genuine. Playwright's launchPersistentContext is the right API for this:
this.context = await chromium.launchPersistentContext(this.userDataDir, {
  headless: options.headless ?? false,
  viewport: { width: 1280, height: 800 },
  args: [
    '--disable-blink-features=AutomationControlled',
    '--no-sandbox',
  ],
});
launchPersistentContext is a first-class Playwright API that writes the entire browser profile — cookies, localStorage, IndexedDB, credentials — to a directory on disk. Subsequent launches read it back. One manual login, then headless forever (until the session expires or Google invalidates it).
The --disable-blink-features=AutomationControlled flag is still necessary. With it, navigator.webdriver no longer reports true to JavaScript on the page. Without it, the browser openly announces that it's being automated.
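As a rough sketch of how those launch options could be assembled in one place: the helper below is hypothetical (buildLaunchOptions and the noSandbox option are not part of GeminiKit). It keeps the AutomationControlled flag always on but makes --no-sandbox opt-in, since that flag disables Chromium's sandbox and is usually only needed in restricted environments such as some containers.

```typescript
// Hypothetical helper, not part of GeminiKit: builds the options object
// passed as the second argument to chromium.launchPersistentContext().
interface ConnectOptions {
  headless?: boolean;
  /** Assumption: opt in only where Chromium's sandbox cannot run. */
  noSandbox?: boolean;
}

function buildLaunchOptions(options: ConnectOptions = {}) {
  // Always stop navigator.webdriver from reporting true.
  const args = ['--disable-blink-features=AutomationControlled'];
  if (options.noSandbox) {
    args.push('--no-sandbox');
  }
  return {
    headless: options.headless ?? false, // headed by default, for the first login
    viewport: { width: 1280, height: 800 },
    args,
  };
}
```

It would be used as chromium.launchPersistentContext(userDataDir, buildLaunchOptions({ headless: true })).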
The Activity Page Problem
After a few days of use, I noticed a strange failure mode: GeminiKit would launch and navigate successfully, but the chat input was unreachable. Debugging showed that the browser was starting with multiple tabs already open, specifically an "activity" tab and sometimes a "what's new" page. My code was operating on the first page, which wasn't the Gemini chat.
The fix was to close all existing pages immediately after connecting and start fresh:
async connect(options: { headless?: boolean } = {}): Promise<void> {
  // ...launch persistent context...

  // Close all tabs Chrome reopened from the previous session
  const existingPages = this.context.pages();
  for (const page of existingPages) {
    await page.close();
  }

  // Create one clean page
  this._page = await this.context.newPage();
}
This solved it. Chrome's session restore was reopening whatever tabs were open when the browser last closed. Explicitly closing them before creating a new page ensures a clean starting state every time.
Login Detection
Gemini's login flow is a Google OAuth chain. After authenticating, you land on the Gemini chat interface. But "landed on the chat interface" isn't a single DOM condition — Gemini has rewritten parts of its UI multiple times, and different versions use different elements for the chat input.
Rather than betting on one selector, I check several:
const loggedIn = await page.evaluate(() => {
  if (window.location.href.includes('accounts.google.com')) return false;
  return !!(
    document.querySelector('rich-textarea') ||
    document.querySelector('div[contenteditable="true"]') ||
    document.querySelector('.ql-editor') ||
    document.querySelector('textarea')
  );
});
If any of these exists and we're not on an accounts.google.com URL, we're in. This has survived several Gemini UI updates without breaking.
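The same check can be pulled out into a pure function, which makes it testable without a browser. This is only a sketch: isLoggedIn and CHAT_INPUT_SELECTORS are illustrative names, and comparing the parsed hostname (rather than searching the whole URL string) is my assumption about what is safer, since a Gemini URL could carry accounts.google.com inside a query parameter.

```typescript
// Illustrative names; the selector list is the one from the snippet above.
const CHAT_INPUT_SELECTORS = [
  'rich-textarea',
  'div[contenteditable="true"]',
  '.ql-editor',
  'textarea',
] as const;

function isLoggedIn(
  currentUrl: string,
  hasElement: (selector: string) => boolean,
): boolean {
  let hostname: string;
  try {
    hostname = new URL(currentUrl).hostname;
  } catch {
    return false; // not a parseable URL, so certainly not the chat page
  }
  // Still somewhere in the Google sign-in chain.
  if (hostname === 'accounts.google.com') return false;
  // Any one known chat-input element is enough.
  return CHAT_INPUT_SELECTORS.some((selector) => hasElement(selector));
}
```

In the page it would be called as isLoggedIn(window.location.href, (s) => document.querySelector(s) !== null). Because page.evaluate serializes its callback and runs it in the browser, the helper has to be defined inside that callback (or injected into the page) rather than referenced from Node-side scope.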
For the first-run login flow, I poll every 5 seconds for up to 5 minutes, waiting for the user to complete the OAuth flow in the headed browser window:
for (let i = 0; i < 60; i++) {
  await page.waitForTimeout(5000);
  if (page.url().includes('gemini.google.com')) {
    const ready = await page.evaluate(() =>
      !!(document.querySelector('rich-textarea') || /* ... */)
    );
    if (ready) return; // Login confirmed
  }
}
throw new AuthenticationError('Login timeout after 5 minutes.');
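The poll-then-timeout shape above shows up again later for image detection, so it can be factored into a small helper. The sketch below is hypothetical (pollUntil is not a GeminiKit or Playwright function). Unlike the loop above, it checks once before the first sleep, so an already-logged-in session returns immediately.

```typescript
// Hypothetical helper: resolves true as soon as check() passes,
// or false once the next sleep would overrun the deadline.
async function pollUntil(
  check: () => Promise<boolean>,
  opts: { intervalMs: number; timeoutMs: number },
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  while (true) {
    if (await check()) return true;
    if (Date.now() + opts.intervalMs > deadline) return false;
    await sleep(opts.intervalMs);
  }
}
```

With this, the login wait could become pollUntil(checkReady, { intervalMs: 5000, timeoutMs: 300000 }), where checkReady is whatever readiness check you already run in the page, followed by throwing AuthenticationError when the result is false.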
Consent Pages and Overlays
Google loves consent dialogs. Cookie banners, privacy disclosures, "human review" notices. Each one blocks interaction with the actual UI. GeminiKit handles these proactively before attempting to type a prompt.
The approach is straightforward: look for common consent button patterns and click them if found. Playwright's getByRole locator is ideal here — it's more resilient than CSS selectors because it matches on accessible name, not implementation detail:
private async dismissOverlays(page: Page): Promise<void> {
  const acceptButtons = [
    page.getByRole('button', { name: /accept all/i }),
    page.getByRole('button', { name: /agree/i }),
    page.getByRole('button', { name: /got it/i }),
    page.getByRole('button', { name: /continue/i }),
  ];

  for (const button of acceptButtons) {
    // isVisible() returns immediately; Playwright ignores a timeout option here
    if (await button.isVisible().catch(() => false)) {
      await button.click();
      await page.waitForTimeout(500);
    }
  }
}
This runs before every generation. If no overlays are present, the checks are fast no-ops.
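One thing worth watching with unanchored patterns: /agree/i also matches a button labelled "Disagree", and /continue/i matches any label that merely contains the word. A possible tightening, sketched below, is a single anchored pattern. The exact label list is an assumption ("i agree" in particular is my guess at a variant, not something taken from Gemini's UI), so treat it as a starting point.

```typescript
// Anchored so the whole accessible name must match, not just a substring.
// Label list is an assumption; adjust it to the dialogs you actually see.
const CONSENT_LABEL = /^\s*(accept all|i agree|agree|got it|continue)\s*$/i;

function isConsentLabel(accessibleName: string): boolean {
  return CONSENT_LABEL.test(accessibleName);
}
```

The same regular expression can be passed straight to Playwright as page.getByRole('button', { name: CONSENT_LABEL }), which replaces the four separate locators with one.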
Image Detection and Download
Detecting when Gemini has finished generating an image is harder than it sounds. The UI shows a spinner, then a loading state, then eventually renders an image inside a response card. The challenge: there are placeholder images in the UI, thumbnail previews, and UI chrome that could all match a naive img selector.
The reliable signal is a large image element inside a clickable button — Gemini always wraps the generated image in a button that triggers the download flow. I wait for that pattern to appear:
while (Date.now() - startTime < timeout) {
  const imageFound = await page.evaluate(() => {
    const buttons = document.querySelectorAll('button');
    for (const btn of buttons) {
      const img = btn.querySelector('img');
      if (img && img.naturalWidth > 200 && img.naturalHeight > 200) {
        return true;
      }
    }
    return false;
  });
  if (imageFound) break;
  await page.waitForTimeout(1000);
}
The 200-pixel minimum on both naturalWidth and naturalHeight filters out UI icons and thumbnails. In practice, only a real generated image passes it.
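If more than one image clears the threshold (say, an earlier generation in the same conversation), it helps to choose deliberately instead of stopping at the first hit. Here is a minimal sketch of that selection as a pure function; ImageCandidate and pickGeneratedImage are illustrative names, and preferring the largest area is my assumption, not GeminiKit's behavior.

```typescript
interface ImageCandidate {
  src: string;
  naturalWidth: number;
  naturalHeight: number;
}

// Keeps only images larger than minSide on both axes, then returns the
// one with the largest pixel area (undefined if none qualifies).
function pickGeneratedImage(
  candidates: ImageCandidate[],
  minSide = 200,
): ImageCandidate | undefined {
  return candidates
    .filter((c) => c.naturalWidth > minSide && c.naturalHeight > minSide)
    .sort(
      (a, b) =>
        b.naturalWidth * b.naturalHeight - a.naturalWidth * a.naturalHeight,
    )[0];
}
```

The candidates would be gathered inside page.evaluate by mapping each img found inside a button to its src, naturalWidth and naturalHeight.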
Downloading uses Playwright's download event. When you click the download button, Playwright captures the resulting download instead of showing a save dialog, and saveAs writes the file to a path you control:
const [download] = await Promise.all([
  page.waitForEvent('download'),
  downloadButton.click(),
]);
await download.saveAs(outputPath);
This is cleaner than the CDN approach I used in SunoKit. Playwright's download interception handles whatever URL scheme the button triggers, regardless of whether it's a direct file link or a blob URL.
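Since the dimension reader later in this post handles both PNG and JPEG, the downloaded file's real format may not always match the extension in outputPath. One hedge is to borrow the extension from download.suggestedFilename(), which Playwright exposes on the Download object. The helper below is a hypothetical sketch of that idea and not something GeminiKit does.

```typescript
import * as path from 'node:path';

// Hypothetical helper: if the browser's suggested filename has a different
// extension than the requested output path, use the suggested one.
function withSuggestedExtension(
  outputPath: string,
  suggestedFilename: string,
): string {
  const suggestedExt = path.extname(suggestedFilename).toLowerCase();
  if (!suggestedExt) return outputPath; // nothing to go on

  const parsed = path.parse(outputPath);
  if (parsed.ext.toLowerCase() === suggestedExt) return outputPath;

  // Rebuild the path with the same directory and base name.
  return path.format({ dir: parsed.dir, name: parsed.name, ext: suggestedExt });
}
```

It would sit just before the save: download.saveAs(withSuggestedExtension(outputPath, download.suggestedFilename())).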
SynthID Watermarks
Google embeds an invisible watermark called SynthID into every Gemini-generated image. SynthID is an imperceptible watermark: it encodes information in the pixel structure of the image in a way that's invisible to the human eye but detectable by Google's verification tools.
For most use cases this doesn't matter. But if you're using generated images in contexts where you need clean originals, it's relevant.
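GeminiKit optionally integrates with WatermarkRemover-AI, which uses Florence-2 to detect a watermark region and LaMa to inpaint it. Because the pipeline only repaints the region it detects, it is suited to localized marks; a signal spread imperceptibly across the whole image would not be expected to disappear through region inpainting. The result isn't perfect: inpainting leaves subtle artifacts if the watermark region is in a visually complex area, but it's surprisingly effective for large flat regions.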
private async removeWatermark(filePath: string): Promise<void> {
  // Florence-2 detection + LaMa inpainting
  const removerDir = path.join(os.homedir(), 'code', 'WatermarkRemover-AI');
  const venvPython = path.join(removerDir, 'venv', 'bin', 'python');
  const script = path.join(removerDir, 'remwm.py');

  if (!fs.existsSync(script)) {
    console.warn('Watermark removal skipped: WatermarkRemover-AI not found');
    return;
  }

  const tmpOut = filePath + '.clean.png';
  await execFileAsync(venvPython, [script, filePath, tmpOut], { timeout: 120000 });

  if (fs.existsSync(tmpOut)) {
    fs.renameSync(tmpOut, filePath); // Replace original with cleaned version
  }
}
The watermark removal is optional, and most users won't need or want it. When WatermarkRemover-AI isn't installed, the step logs a warning and is skipped.
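The paths above assume a POSIX-style virtual environment under ~/code. On Windows, a Python venv puts the interpreter at venv\Scripts\python.exe instead. A small resolver, sketched below with hypothetical names (resolveRemoverPaths is not part of GeminiKit), would cover both layouts and let the install directory come from configuration.

```typescript
import * as path from 'node:path';

interface RemoverPaths {
  python: string;
  script: string;
}

// Hypothetical helper: locates the venv interpreter and remwm.py under a
// given WatermarkRemover-AI checkout, for POSIX and Windows venv layouts.
function resolveRemoverPaths(
  removerDir: string,
  platform: string = process.platform,
): RemoverPaths {
  const python =
    platform === 'win32'
      ? path.join(removerDir, 'venv', 'Scripts', 'python.exe')
      : path.join(removerDir, 'venv', 'bin', 'python');
  return { python, script: path.join(removerDir, 'remwm.py') };
}
```

The directory could then be read from an environment variable of your choosing (the name is up to you) with the ~/code/WatermarkRemover-AI location as the fallback.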
Reading Image Dimensions Without a Library
One small detail I'm proud of: GeminiKit reads image dimensions from the file directly, without pulling in an image processing library. PNG stores them at fixed byte offsets in the header; JPEG stores them in a start-of-frame segment that has to be located first:
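private async getImageDimensions(filePath: string): Promise<{ width: number; height: number }> {
  const fd = fs.openSync(filePath, 'r');
  const header = Buffer.alloc(24);
  try {
    fs.readSync(fd, header, 0, 24, 0);
  } finally {
    fs.closeSync(fd); // release the descriptor even if the read throws
  }

  // PNG: magic bytes 0x89 0x50, width at bytes 16-19, height at bytes 20-23
  if (header[0] === 0x89 && header[1] === 0x50) {
    return {
      width: header.readUInt32BE(16),
      height: header.readUInt32BE(20),
    };
  }

  // JPEG: walk the segments after SOI (0xFF 0xD8) until a start-of-frame marker.
  // SOF markers are 0xC0-0xCF except 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC), so
  // progressive files (SOF2) are covered and payload bytes are never misread.
  const buf = fs.readFileSync(filePath);
  if (buf[0] !== 0xff || buf[1] !== 0xd8) return { width: 0, height: 0 };
  let i = 2;
  while (i + 9 <= buf.length && buf[i] === 0xff) {
    const marker = buf[i + 1];
    if (marker === 0xff) { i += 1; continue; } // fill byte before a marker
    if (marker >= 0xc0 && marker <= 0xcf && marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc) {
      return {
        height: buf.readUInt16BE(i + 5),
        width: buf.readUInt16BE(i + 7),
      };
    }
    i += 2 + buf.readUInt16BE(i + 2); // skip the marker plus its length-prefixed payload
  }
  return { width: 0, height: 0 };
}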
Zero dependencies for this. For PNG, reading 24 bytes from the file header is faster than spinning up sharp or jimp for what is ultimately a two-integer answer. The JPEG path reads the whole file, but still avoids decoding any pixels.
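To make the PNG layout concrete, here is a standalone sketch that parses the first 24 bytes from memory. It is stricter than the method above: it checks the full 8-byte signature and that the first chunk is IHDR. readPngDimensions is an illustrative name, and it works on a plain Uint8Array so it runs outside Node as well.

```typescript
// Byte layout: 0-7 signature, 8-11 chunk length, 12-15 chunk type ("IHDR"),
// 16-19 width, 20-23 height (both big-endian unsigned 32-bit).
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

function readPngDimensions(
  bytes: Uint8Array,
): { width: number; height: number } | null {
  if (bytes.length < 24) return null;
  if (!PNG_SIGNATURE.every((b, i) => bytes[i] === b)) return null;

  const chunkType = String.fromCharCode(bytes[12], bytes[13], bytes[14], bytes[15]);
  if (chunkType !== 'IHDR') return null;

  // DataView reads big-endian by default, which is what PNG uses.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { width: view.getUint32(16), height: view.getUint32(20) };
}
```

Feeding it a hand-built header with width bytes 00 00 04 00 and height bytes 00 00 03 00 yields 1024 by 768.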
Using It
npm install gemgen   # published as gemgen

import { generateImage } from 'gemgen';

// First run: headless: false to log in manually
await generateImage('a neon-lit cyberpunk street at night', './output.png', {
  headless: false,
});

// Subsequent runs: headless by default
await generateImage('mountain range at golden hour', './mountain.png');

// Full client control for batches
import { GeminiClient } from 'gemgen';

const client = new GeminiClient();
await client.connect({ headless: true });
await client.generateImage('prompt one', './img1.png');
await client.generateImage('prompt two', './img2.png');
await client.disconnect();
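For longer batches, one failed prompt shouldn't leave the browser running or abort the remaining jobs. A possible wrapper is sketched below. runBatch, ImageClient, BatchJob and BatchResult are hypothetical names; the interface only mirrors the three GeminiClient methods used above, and the return type of generateImage is left as unknown because the post doesn't show it.

```typescript
// Minimal shape of the client, based only on the calls shown above.
interface ImageClient {
  connect(options?: { headless?: boolean }): Promise<void>;
  generateImage(prompt: string, outputPath: string): Promise<unknown>;
  disconnect(): Promise<void>;
}

interface BatchJob { prompt: string; outputPath: string }
interface BatchResult { job: BatchJob; ok: boolean; error?: string }

async function runBatch(
  client: ImageClient,
  jobs: BatchJob[],
): Promise<BatchResult[]> {
  const results: BatchResult[] = [];
  await client.connect({ headless: true });
  try {
    for (const job of jobs) {
      try {
        await client.generateImage(job.prompt, job.outputPath);
        results.push({ job, ok: true });
      } catch (err) {
        // Record the failure and carry on with the next prompt.
        const error = err instanceof Error ? err.message : String(err);
        results.push({ job, ok: false, error });
      }
    }
  } finally {
    await client.disconnect(); // always close the browser, even on errors
  }
  return results;
}
```

Called as runBatch(new GeminiClient(), jobs), it returns one result per job so failed prompts can be retried afterwards.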
The source is on GitHub under MIT.
Compared to SunoKit
Both libraries automate AI generation through browser automation, but they took different approaches to the same problem:
| | GeminiKit | SunoKit |
|---|---|---|
| Runtime | Playwright | rebrowser-puppeteer |
| Anti-bot | Persistent session (cookies) | CDP patch + real Chrome |
| Auth | Google OAuth (higher stakes) | Suno OAuth |
| Download | Playwright Download API | CDN URL |
| Tests | Manual test script | 85 unit tests + smoke tests |
The biggest practical difference: losing your Suno account is an inconvenience. Losing your Google account means losing Gmail, Drive, Calendar, and every other service tied to it. I run GeminiKit with a dedicated Google account, not my main one.
Both tools exist for the same reason: the web interface has capabilities the official API doesn't. Until the APIs catch up, browser automation is the pragmatic path.