muapi/gemini-omni-image-to-video

Image to Video

Gemini Omni Image to Video — animate one or more reference images with a text prompt. Unified reasoning across modalities preserves subject identity and generates synchronized audio natively.

Overview

About this model

Gemini Omni Image to Video animates one or more reference images with a text prompt using Google's natively multimodal any-to-any model. Subject identity is preserved across frames while synchronized audio — dialogue, ambient sound, and music — is generated natively in the same forward pass.

  1. Product animation: Bring product photos to life with motion and ambient sound for social or advertising use.
  2. Character animation: Animate a character reference image with a prompt describing their action and setting.
  3. Storyboarding: Convert concept art into short video previews with narration and sound design.
  4. Social content: Turn still images into engaging vertical or widescreen clips for TikTok, Reels, or Shorts.

Pricing & Value

Cost analysis

muapi: $0.90–$1.80 (720p/1080p) · $2.10–$3.00 (4K)

Price scales with duration (4–10 s) and resolution. Synchronized audio is included at no extra charge.

Fal.ai: Not available

Gemini Omni Image to Video is not currently available on Fal.ai.

Replicate: Not available

Gemini Omni Image to Video is not currently available on Replicate.

* Competitor pricing is estimated based on similar model architectures and usage tiers.
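
As a rough budgeting aid, the muapi ranges quoted above can be kept in a small lookup. The Python sketch below only restates the published lower and upper bounds per resolution tier. It does not estimate the price of a specific duration, because this page gives no per-second rate. The names PRICE_RANGES_USD, price_range, and batch_bounds are illustrative and not part of any muapi SDK.

```python
# Published muapi price ranges in USD, keyed by resolution tier.
# The lower bound applies to the shortest clip (4 s), the upper to the longest (10 s).
PRICE_RANGES_USD = {
    "720p": (0.90, 1.80),
    "1080p": (0.90, 1.80),  # same price as 720p
    "4K": (2.10, 3.00),
}


def price_range(resolution: str) -> tuple[float, float]:
    """Return the (min, max) price for one clip at the given resolution."""
    try:
        return PRICE_RANGES_USD[resolution]
    except KeyError:
        raise ValueError(f"unknown resolution: {resolution!r}") from None


def batch_bounds(resolution: str, clips: int) -> tuple[float, float]:
    """Return the (min, max) total cost for a batch of clips."""
    if clips < 0:
        raise ValueError("clips must be non-negative")
    low, high = price_range(resolution)
    return round(low * clips, 2), round(high * clips, 2)
```

For example, batch_bounds("4K", 10) gives the cheapest and most expensive totals for ten 4K clips. The actual total depends on the durations chosen.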

Technical Details

Configuration schema

Prompt: string

Text description of the desired motion and scene. Gemini Omni supports rich multimodal prompts including camera direction, dialogue, and ambient audio cues.

Default Value: The subject slowly turns to face the camera as golden-hour light sweeps across the scene, leaves rustling in the breeze.
Reference Images: array

Upload 1–7 reference images for the video. Maximum 20 MB each.

Default Value: -
Duration (seconds): Enum (4 options)

Duration of the generated video in seconds: 4, 6, 8, or 10.

Default Value: 8
Resolution: Enum (3 options)

Output video resolution: 720p, 1080p, or 4K. 720p and 1080p are the same price; 4K costs more.

Default Value: 1080p
Aspect Ratio: Enum (2 options)

Output video aspect ratio: 16:9 or 9:16.

Default Value: 16:9
Audio IDs: array

Up to 3 voice profile IDs returned by the Gemini Omni Audio endpoint.

Default Value: -
Seed: int

Random seed (0–2147483647). Fix the seed for reproducibility; results may still vary due to model stochasticity.

Default Value: 0
Character IDs: array

Up to 3 character IDs from Gemini Omni Character to feature in the video.

Default Value: -
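
The schema above can be mirrored in a small client-side check before a request is sent. The Python sketch below is one possible way to do that, not the official client. Only image_urls is named on this page. The other JSON field names (prompt, duration, resolution, aspect_ratio, seed, audio_ids, character_ids) and the exact enum spellings are assumptions and should be confirmed against the API reference.

```python
from __future__ import annotations

from typing import Any

# Allowed values as listed on this page.
DURATIONS = (4, 6, 8, 10)
RESOLUTIONS = ("720p", "1080p", "4K")  # assumption: exact enum spellings
ASPECT_RATIOS = ("16:9", "9:16")
MAX_SEED = 2147483647
MAX_IMAGES = 7  # schema says 1-7; the usage steps say 1-5, so the server may be stricter
MAX_IDS = 3     # limit for both audio IDs and character IDs


def build_payload(
    prompt: str,
    image_urls: list[str],
    duration: int = 8,
    resolution: str = "1080p",
    aspect_ratio: str = "16:9",
    seed: int = 0,
    audio_ids: list[str] | None = None,
    character_ids: list[str] | None = None,
) -> dict[str, Any]:
    """Validate inputs against the documented ranges and return a request body."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 1 <= len(image_urls) <= MAX_IMAGES:
        raise ValueError(f"expected 1 to {MAX_IMAGES} reference images")
    if duration not in DURATIONS:
        raise ValueError(f"duration must be one of {DURATIONS}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    for name, ids in (("audio_ids", audio_ids), ("character_ids", character_ids)):
        if ids is not None and len(ids) > MAX_IDS:
            raise ValueError(f"{name} accepts at most {MAX_IDS} entries")

    # Field names below are hypothetical, except image_urls.
    payload: dict[str, Any] = {
        "prompt": prompt,
        "image_urls": list(image_urls),
        "duration": duration,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "seed": seed,
    }
    # Optional arrays are omitted when empty rather than sent as null.
    if audio_ids:
        payload["audio_ids"] = list(audio_ids)
    if character_ids:
        payload["character_ids"] = list(character_ids)
    return payload
```

Calling build_payload("A cat stretches", ["https://example.com/cat.png"]) returns a body that uses the documented defaults (8 seconds, 1080p, 16:9, seed 0).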

Implementation Guide

Developer documentation

How to Use Gemini Omni Image to Video

  1. Upload reference images. Provide 1–5 images via image_urls. Each image acts as a visual anchor, and the model preserves subject identity across frames.

  2. Write a motion and scene prompt. Describe what happens in the video: motion, setting, lighting, and audio cues. Example: 'The subject slowly turns to face the camera as golden-hour light sweeps across the scene, leaves rustling in the breeze.'

  3. Choose duration and resolution. Pick 4, 6, 8, or 10 seconds. Choose 720p or 1080p (same price), or 4K for higher-resolution output.

  4. Pick an aspect ratio.

    • 16:9 (widescreen, cinematic)
    • 9:16 (vertical, mobile-first)

  5. Submit and poll. POST to /api/v1/gemini-omni-image-to-video and poll GET /api/v1/predictions/{request_id}/result until status is completed.
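
The submit-and-poll step can be sketched in Python with only the standard library. The two endpoint paths and the completed status come from the steps above. Everything else is an assumption made for illustration: the host in BASE_URL, the x-api-key header, the request_id key in the POST response, and the failed status value. Check each against the API reference before relying on it.

```python
from __future__ import annotations

import json
import time
import urllib.request
from typing import Any, Callable

BASE_URL = "https://api.muapi.ai"  # assumption: host is not stated on this page
SUBMIT_PATH = "/api/v1/gemini-omni-image-to-video"
RESULT_PATH = "/api/v1/predictions/{request_id}/result"


def http_json(method: str, url: str, api_key: str, body: dict | None = None) -> dict:
    """Send a JSON request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Content-Type", "application/json")
    req.add_header("x-api-key", api_key)  # assumption: auth header name
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit(payload: dict[str, Any], api_key: str, send: Callable[..., dict] = http_json) -> str:
    """POST the generation request and return its request ID."""
    resp = send("POST", BASE_URL + SUBMIT_PATH, api_key, payload)
    return resp["request_id"]  # assumption: key mirrors the {request_id} path parameter


def wait_for_result(
    request_id: str,
    api_key: str,
    send: Callable[..., dict] = http_json,
    interval: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the result endpoint until status is completed, or give up."""
    url = BASE_URL + RESULT_PATH.format(request_id=request_id)
    for _ in range(max_attempts):
        result = send("GET", url, api_key)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumption: name of the failure status
            raise RuntimeError(f"generation failed: {result}")
        sleep(interval)
    raise TimeoutError(f"no result after {max_attempts} polls")
```

The send and sleep parameters exist so the flow can be exercised with a stub instead of the network. In normal use, call submit(payload, api_key) and pass the returned ID to wait_for_result.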

Common Questions

Frequently asked

How many reference images can I provide?

Between 1 and 5 images. Each image counts as 1 unit toward the 7-unit input capacity; each video counts as 2 units and each character ID as 1.
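
The unit arithmetic in this answer can be checked with a few lines of Python. The weights and the 7-unit capacity are taken from the answer above. The function names are illustrative, and the sketch assumes all three input types draw on the same capacity.

```python
CAPACITY_UNITS = 7

# Units consumed by each input type, as stated in the answer above.
UNIT_COST = {"image": 1, "video": 2, "character_id": 1}


def units_used(images: int = 0, videos: int = 0, character_ids: int = 0) -> int:
    """Return the total capacity units consumed by a set of inputs."""
    return (
        images * UNIT_COST["image"]
        + videos * UNIT_COST["video"]
        + character_ids * UNIT_COST["character_id"]
    )


def fits_capacity(images: int = 0, videos: int = 0, character_ids: int = 0) -> bool:
    """Return True when the inputs stay within the 7-unit capacity."""
    return units_used(images, videos, character_ids) <= CAPACITY_UNITS
```

For instance, 3 images, 1 video, and 2 character IDs use 3 + 2 + 2 = 7 units, which exactly fills the capacity; adding one more image would exceed it.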

Does it preserve the subject's appearance across frames?

Yes — Gemini Omni Image to Video is designed to maintain subject identity and appearance from the reference images throughout the generated clip.

Is audio generated automatically?

Yes — synchronized dialogue, ambient sound, and music are generated natively alongside the video in the same forward pass.