🎬 KLING AI ⏱ 11 min read 🔊 Kling Video

Kling Video — Technical Guide

Generate AI videos from text or photos — with built-in spoken dialogue, multi-shot storyboards, camera control, and character elements for consistent identity across scenes

Kling Video is the core video generation tool. Describe a scene in text and the AI creates a video from scratch, or upload a photo and the AI animates it into motion. Characters can speak with synchronized lip movements, backgrounds move naturally, and camera angles follow your direction — all generated by AI in seconds.

What sets this apart from simpler video tools is built-in native audio. Write dialogue in your prompt using voice references, and the characters actually speak in the generated video with their lips perfectly synced. No separate lip sync step needed — the video comes out with voice, sound, and visuals together.

Multi-shot mode lets you build storyboard sequences of up to 6 scenes in a single generation. Each scene gets its own prompt and duration, creating a mini narrative — an opening shot, a reaction, a scene change, a close-up, a reveal. You can write each scene yourself or let the AI split your prompt into optimal shots automatically.

Elements let you reference pre-trained characters so the AI knows exactly what they look like. Voice references let you assign specific voices to characters in dialogue. Camera controls give you push-ins, pans, tilts, orbits, and crane shots. Start and end frame mode lets you define the first and last frame of the video, and the AI generates the transition between them.

Nine model versions give you options from fast drafts to maximum cinematic quality, with v3 offering the latest capabilities and highest fidelity.

✦ Best Results Tips

🎬 Describe Action, Not Just Appearance

Your prompt should describe what happens in the video: movement, gestures, expressions, camera motion. "A woman walks toward the camera smiling as wind blows her hair" produces a dynamic video. "A beautiful woman standing still" produces a static one.

🔊 Pro Mode for Sound

Native audio — where characters actually speak in the video — requires Professional mode on v2.6 or later. Standard mode generates silent video only. Always use Pro mode when you want dialogue or sound effects.
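As a rough illustration, the audio rules in this guide can be written as a pre-flight check. This is only a sketch: the function name, the "professional" mode string, and the idea of checking settings locally are assumptions for illustration, not part of any Kling API. The rules themselves come from this guide: native audio exists only on v2.6, v3, and v3 Omni, it needs Professional mode, and v2.6 cannot combine an end frame with audio.

```python
# Models that this guide lists as having native audio.
AUDIO_MODELS = {"kling-v2-6", "kling-v3", "kling-v3-omni"}


def audio_problem(model, mode, end_frame=False):
    """Return a reason native audio is unavailable, or None if no rule is violated.

    The mode strings ("standard", "professional") are hypothetical labels.
    """
    if model not in AUDIO_MODELS:
        return f"{model} has no native audio"
    if mode != "professional":
        return "native audio requires Professional mode"
    if model == "kling-v2-6" and end_frame:
        return "v2.6 cannot combine an end frame with audio"
    return None


print(audio_problem("kling-v2-6", "standard"))
print(audio_problem("kling-v3", "professional"))
```

A check like this is cheap to run before spending credits on a generation that would come back silent.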

📸 Use a Photo for Consistent Characters

Image-to-video mode gives you control over exactly how the character looks. Upload a photo of your character and describe the action — the AI animates that specific person rather than imagining one from text alone.

🎞️ Multi-Shot for Storytelling

Use multi-shot mode to create 2 to 6 scene sequences. Each shot can have a different angle, action, and location — turning a simple prompt into a mini narrative with scene changes and visual variety.
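One way to sanity-check a storyboard before submitting it is sketched below. The 2 to 6 shot limit comes from this guide. Treating the 3 to 15 second range as a limit on the total of all shots is an assumption, and the `Shot` class and `check_storyboard` function are hypothetical names, not part of the product.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    prompt: str
    seconds: int


def check_storyboard(shots, min_shots=2, max_shots=6, min_total=3, max_total=15):
    """Return a list of problems; an empty list means the storyboard passes."""
    problems = []
    if not min_shots <= len(shots) <= max_shots:
        problems.append(f"expected {min_shots}-{max_shots} shots, got {len(shots)}")
    # Assumption: the duration range applies to the sum of all shots.
    total = sum(shot.seconds for shot in shots)
    if not min_total <= total <= max_total:
        problems.append(f"total duration {total}s is outside {min_total}-{max_total}s")
    for index, shot in enumerate(shots, start=1):
        if not shot.prompt.strip():
            problems.append(f"shot {index} has an empty prompt")
    return problems


storyboard = [
    Shot("Wide opening shot of a quiet harbor at dawn", 4),
    Shot("Close-up of a fisherman looking out to sea", 3),
    Shot("The boat pulls away from the dock", 5),
]
print(check_storyboard(storyboard))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a storyboard in a single pass.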

🎥 Add Camera Movement

Camera controls transform flat-looking video into cinematic content. A slow push-in builds tension, a pan reveals a scene, an orbit shot adds production value. Pick a camera move that matches the mood of your scene.

⏱️ 10 Seconds Minimum for Social Content

Short 5-second clips feel abrupt on social media. Set duration to 10 or 15 seconds to give the scene room to develop — especially for dialogue content where the character needs time to speak.
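Not every model can produce a 15-second clip, so a requested duration sometimes has to be adjusted. The sketch below picks the closest duration a model allows, using the ranges listed in the model comparison in this guide (3 to 15s for v3 and v3 Omni, 3 to 10s for Video O1, fixed 5s or 10s for the rest). Whole-second steps inside a range and the helper name `closest_duration` are assumptions. The guide also notes that Video O1 text-to-video is limited to 5s and 10s, which this sketch does not model.

```python
# Duration options per model, taken from the comparison cards in this guide.
# Assumption: ranges are selectable in whole seconds.
RANGED_SECONDS = {
    "kling-v3": range(3, 16),
    "kling-v3-omni": range(3, 16),
    "kling-video-o1": range(3, 11),
}
FIXED_SECONDS = (5, 10)  # every other listed model


def closest_duration(model, wanted):
    """Return the allowed duration nearest to `wanted`, preferring the longer on ties."""
    options = RANGED_SECONDS.get(model, FIXED_SECONDS)
    return min(options, key=lambda seconds: (abs(seconds - wanted), -seconds))


print(closest_duration("kling-v3", 15))
print(closest_duration("kling-v2-6", 15))
```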

Kling Video — Available Models

Kling v3

Flagship Default

kling-v3

Top-tier cinematic video with native multilingual audio and lip-sync. Multi-shot storyboards up to 6 scenes with AI Director. Physics-aware motion, 3+ character consistency, flexible 3-15s duration. Best quality available for prompt-driven creative work.

3 aspect ratios

Kling v3 Omni

Recommended

kling-v3-omni

Industrial-grade character and voice consistency using Elements 3.0 references. Native audio with voice binding and cloning, perfect lip-sync across shots. Multi-shot via references. The model you choose when your character must look identical in every frame.

3 aspect ratios

Video O1

Multimodal

kling-video-o1

Advanced multimodal reasoning model with excellent start/end frame transitions and motion transfer. Strong visual consistency in single-shot mode. Precursor to v3 Omni architecture.

3 aspect ratios

Kling v2.6

Voice Control

kling-v2-6

Advanced motion engine with fluid actions and stable camera. First model with native audio support and voice control — characters can speak with assigned voices. Strong temporal coherence for cinematic final clips.

3 aspect ratios

Kling v2.5 Turbo

Fast

kling-v2-5-turbo

Speed-optimized model for rapid iteration. Decent cinematic motion at significantly lower cost and faster generation. Ideal for testing prompt ideas before committing to a higher-tier model.

3 aspect ratios

Kling v2.1 Master

Pro Only

kling-v2-1-master

Master quality tier with improved character motion stability. Professional mode only — designed for polished output rather than quick drafts.

3 aspect ratios

Kling v2 Master

Pro Only

kling-v2-master

Original master quality tier. Professional mode only. Superseded by v2.1 Master with better stability, but still available for existing workflows.

3 aspect ratios

Kling v1.6

Elements

kling-v1-6

Reliable mid-generation model at lower cost. Supports Element references for character consistency and camera controls. Good balance of features and affordability.

3 aspect ratios

Kling v1

Legacy

kling-v1

Original Kling model. Lowest cost for quick experiments and testing basic concepts. Simple text-to-video and image-to-video at minimal credit cost.

3 aspect ratios

Model Comparison — Which Model Should I Use?

Each model shares the same Kling Video engine but unlocks different features. Pick the tier that matches your needs.

Best cinematic quality?

🎬 Kling v3

Voice dialogue?

🎙️ Kling v2.6

Consistent characters on a budget?

🧩 Video O1

Quick draft / test?

⚡ Kling v2.5 Turbo

★ Flagship Maximum quality and features

Kling v3 Flagship Default

⏱ 3–15s

kling-v3 · Standard / Professional

Best for: Cinematic shorts with dialogue, multi-scene storytelling, creative narratives

Limitation: Prompt-driven — less locked consistency without element references. Higher cost per second.

Kling v3 Omni Recommended

⏱ 3–15s

kling-v3-omni · Standard / Professional

Best for: Brand-consistent characters, serialized content, commercial ads, e-commerce videos

Limitation: Optimized for 1-2 element references. Higher cost, especially with HD audio.

◆ Premium High quality with specialized strengths

Video O1 Multimodal

⏱ 3–10s

kling-video-o1 · Standard / Professional

Best for: Precise framing with start/end frames, video editing, consistent single-shot with references

Limitation: T2V limited to 5s and 10s only. No multi-shot, no AI Director, no native audio. No voice control.

Kling v2.6 Voice Control

⏱ 5s / 10s

kling-v2-6 · Standard / Professional

Best for: Talking head videos, cinematic final clips, expressive characters with voice

Limitation: No multi-shot. Voice control requires Pro mode. End frame not available with audio enabled. Fixed 5s or 10s duration.

● Standard Good balance of quality and cost

Kling v2.5 Turbo Fast

⏱ 5s / 10s

kling-v2-5-turbo · Standard / Professional

Best for: Quick drafts, rapid prototyping, high-volume generation at lower cost

Limitation: No native audio or voice control. No multi-shot. Slightly less polish than v2.6+. Fixed 5s or 10s.

Kling v2.1 Master Pro Only

⏱ 5s / 10s

kling-v2-1-master · Master

Best for: High-quality single-shot video with stable character motion

Limitation: Pro mode only (no Standard). No audio, no multi-shot, no camera controls, no end frame, no elements. Fixed 5s or 10s.

○ Budget Lowest cost for testing and experiments

Kling v2 Master Pro Only

⏱ 5s / 10s

kling-v2-master · Master

Best for: Legacy workflows requiring the original master rendering

Limitation: Pro mode only. No audio, no multi-shot, no camera controls, no end frame, no elements. Fixed 5s or 10s.

Kling v1.6 Elements

⏱ 5s / 10s

kling-v1-6 · Standard / Professional

Best for: Budget-friendly generation with element references and camera control

Limitation: No native audio or voice control. No multi-shot. Basic motion physics compared to v2+. Fixed 5s or 10s.

Kling v1 Legacy

⏱ 5s / 10s

kling-v1 · Standard / Professional

Best for: Quick tests, concept validation, lowest-cost experiments

Limitation: Basic motion and physics. No audio, no multi-shot, no camera controls, no elements. Poor character consistency. Fixed 5s or 10s.

Model	🔊 Native Audio	🎬 Multi-Shot	🎙️ Voice Control	🧩 Elements	🎥 Camera Control	🖼️ End Frame	Duration	Modes
Kling v3 Flagship	✅	✅	—	—	✅	✅	3–15s	Standard / Professional
Kling v3 Omni Recommended	✅	✅	—	✅	✅	✅	3–15s	Standard / Professional
Video O1 Multimodal	—	—	—	✅	✅	✅	3–10s	Standard / Professional
Kling v2.6 Voice Control	✅	—	✅	—	✅	✅	5/10s	Standard / Professional
Kling v2.5 Turbo Fast	—	—	—	—	✅	✅	5/10s	Standard / Professional
Kling v2.1 Master Pro Only	—	—	—	—	—	—	5/10s	Master
Kling v2 Master Pro Only	—	—	—	—	—	—	5/10s	Master
Kling v1.6 Elements	—	—	—	—	✅	✅	5/10s	Standard / Professional
Kling v1 Legacy	—	—	—	—	—	✅	5/10s	Standard / Professional
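
The table above can be turned into a small lookup for shortlisting models by required features. This is a sketch that transcribes the table rows as written (1 for ✅, 0 for —); the feature keys and the `models_with` helper are hypothetical names. The table and the card descriptions do not always agree (for example on Elements support for v1.6), so the data reflects the table only.

```python
FEATURES = ("audio", "multi_shot", "voice", "elements", "camera", "end_frame")

# One row per model, columns in FEATURES order, copied from the table above.
TABLE = {
    "kling-v3":          (1, 1, 0, 0, 1, 1),
    "kling-v3-omni":     (1, 1, 0, 1, 1, 1),
    "kling-video-o1":    (0, 0, 0, 1, 1, 1),
    "kling-v2-6":        (1, 0, 1, 0, 1, 1),
    "kling-v2-5-turbo":  (0, 0, 0, 0, 1, 1),
    "kling-v2-1-master": (0, 0, 0, 0, 0, 0),
    "kling-v2-master":   (0, 0, 0, 0, 0, 0),
    "kling-v1-6":        (0, 0, 0, 0, 1, 1),
    "kling-v1":          (0, 0, 0, 0, 0, 1),
}


def models_with(*needed):
    """Return the models whose table row has every feature in `needed`."""
    columns = [FEATURES.index(name) for name in needed]
    return [model for model, row in TABLE.items() if all(row[c] for c in columns)]


print(models_with("audio", "multi_shot"))
print(models_with("camera", "end_frame"))
```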

📥 You Give: Text Prompt, Start Frame Image (optional), End Frame Image (optional), Sound / Voice (optional), Aspect Ratio, Duration

🎬 You Get: Video

Aspect ratios: 16:9, 9:16, 1:1
Duration: up to 15s
Quality modes: Standard, Professional, Master
Generation modes: Text-to-Video, Image-to-Video

🔊 Sound: Audio generation
🎙️ Voice control: Preset voices in prompt
🎬 Multi-shot: 2-6 scene storyboard
📷 Camera control: Preset movements
🖼️ End frame: Start + end transitions
🧩 Elements: Reference in prompts

💰 Kling Video — Pricing

Failed jobs are automatically refunded

Kling V2.6 Model

With the VIDEO 2.6 Model, we are introducing the "Native Audio" feature for the first time: a single generation that simultaneously produces video visuals and complete audio, including voiceovers, sound effects, and ambient sounds. This feature achieves seamless coordination in rhythm, emotion, and narrative expression, delivering a true "see what you hear" audio-visual experience.

This upgrade focuses on:

Audio-Visual Coordination: Voice rhythm, ambient sounds, and visual actions are closely aligned, eliminating the disconnect between "visuals and separate audio."

Audio Quality: Supports various sound types such as voice, sound effects, and ambient sounds, with cleaner sound quality and richer layers, closely mimicking real mixing effects.

Semantic Understanding: Strong semantic comprehension of text descriptions, spoken language, and complex storylines in different contexts, ensuring more accurate interpretation of creator intentions and delivering content that better meets needs.

For the creation process, KLING 2.6 provides two efficient creation paths centered around the core need of "fast audio-video content generation from text/images":

Text-to-Audio-Visual: From a sentence to a complete audio-visual video. Input text to generate a video with voiceovers, sound effects, and ambient sounds.

Image-to-Audio-Visual: Bring static images to life with sound and motion. Upload images/text to instantly create audio-visual content, perfect for expanding existing images into full audio-visual experiences.

Image-to-Video

By inputting an image, the "Kling" large model generates a 5-second or 10-second video that animates the image into moving visuals. With the addition of a text description, the "Kling" large model can produce a video sequence that integrates the text's narrative with the image.

It currently supports two modes of generation:

• Standard Mode for quicker video output
• Professional Mode for enhanced visual quality

Moreover, it accommodates three aspect ratios: 16:9, 9:16, and 1:1, catering to a wider range of video creation requirements.

Why Image-to-Video?

Image-to-Video is currently the feature users rely on most, primarily because it offers greater control over the video creation process. Users can turn pre-generated images into dynamic videos, greatly reducing the cost of, and the barrier to entry for, professional video production.

From a creative perspective, "Kling" offers a new platform for creativity, enabling users to direct the motion of the subjects within images through text. Trends such as "reviving old photos," "embracing your younger self," and the whimsically termed "hallucinogenic mushroom video" where mushrooms appear to turn into penguins, showcase "Kling" as a creative tool. It provides infinite possibilities for users to bring their creative visions to life.

The Prompt Formula

For Image-to-Video generation, controlling the motion of the subject within the image is the core aspect. Here's the formula for Kling prompts:

💡 Prompt Format = Subject (Main Focus) + Movement (Motion Description) + Background (Scene Movement)

Subject: The main focus in the video, serving as an important embodiment of the theme. It can be people, animals, plants, objects, and so on.

Movement: Descriptions of the subject's movement status.

Background: Background of the scene.

Key Principles

The most fundamental elements of the formula are the subject and the movement. In contrast to Text-to-Video, which necessitates scene description, Image-to-Video is already provided with a scene. Thus, it only requires the depiction of the subjects in the image and the intended movement for these subjects.

Should there be several subjects with various movements, list them sequentially. "Kling" will then extrapolate from our expressions and its comprehension of the image to produce a video that aligns with the anticipated outcome.

Example: The Mona Lisa

Suppose you want the Mona Lisa in the painting to wear sunglasses. If you simply input "wear sunglasses", the model may have difficulty understanding the instruction and is more likely to generate a video based on its own judgment.

When "Kling" determines that the image is a painting, it is more likely to generate a video that pans across the painting as if in an exhibition. This is also why photos are prone to producing static videos.

💡 Solution: We need to describe "subject + movement" to help the model understand the instruction:
— Single subject: "Mona Lisa puts on sunglasses with her hand"
— Multiple subjects: "Mona Lisa puts on sunglasses with her hand, and a ray of light appears in the background"

The model will respond more easily to these specific instructions.
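
The "subject + movement" pattern can be assembled mechanically when prompts are generated in bulk. The helper below is a minimal sketch of that idea; the function name and the way clauses are joined are assumptions for illustration, and the model accepts free-form text rather than requiring this exact shape.

```python
def image_to_video_prompt(subject_moves, background=None):
    """Join "subject + movement" clauses, then an optional background clause."""
    clauses = [f"{subject} {movement}" for subject, movement in subject_moves]
    if background:
        clauses.append(background)
    if not clauses:
        raise ValueError("at least one subject and movement is required")
    if len(clauses) == 1:
        return clauses[0]
    return ", ".join(clauses[:-1]) + ", and " + clauses[-1]


print(image_to_video_prompt([("Mona Lisa", "puts on sunglasses with her hand")]))
print(image_to_video_prompt(
    [("Mona Lisa", "puts on sunglasses with her hand")],
    background="a ray of light appears in the background",
))
```

The two calls reproduce the single-subject and multiple-subject prompts from the Mona Lisa example.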

As mentioned before, the purpose of the formula is to help everyone describe the video scenes they envision more effectively. Please feel free to communicate with Kling! Here are some excellent examples shared by creators.

✦ Some high-quality examples - Video examples below are shared by Kling creators

Each example pairs an input image with its output video. The prompts used were:

• Two people hugging each other
• Two boys hugging each other
• The little boy smiles at the camera
• A beautiful Chinese girl looks into the distance and smiles.
• A cat is kneading dough in the kitchen.
• (no prompt)
• The red-crowned crane is looking for food.
• (no prompt)
• The model is smiling with her hair blown by the wind.

Best Practices for Image-to-Video Prompts

• Use simple words and sentence structures, avoiding overly complex language
• Movement should comply with the laws of physics, and it's best to describe movements that are likely to occur in the image
• A description that significantly deviates from the image may cause a camera cut or transition
• At the current stage, it is challenging to generate complex physical movements, such as the bouncing of a ball or the trajectory of a high-altitude throw

💡 Tip: Keep your prompts grounded in realistic motion that naturally fits the scene. Simple, physics-compliant movements yield the best results.

Text-to-Video

By inputting a text passage, the Kling large model generates a 5-second or 10-second video that translates the text into visual imagery. It currently supports two modes of generation: "Standard Mode" for quicker video production and "Professional Mode" for superior image quality. "Kling" also supports three aspect ratios: 16:9, 9:16, and 1:1, to more diversely meet everyone's video creation requirements.

The prompt is the main way you interact with a text-to-video model, and it directly determines the content of the generated video. Learning to write effective prompts is therefore worthwhile for every user. As a new-generation ("2.0") AI video model, "Kling" continues to evolve and improve, so keep experimenting to tap its full potential. The following formula for Kling prompts is offered as a reference:

💡 Prompt Format = Subject (Subject Description) + Subject Movement + Scene (Scene Description) + (Camera Language + Lighting + Atmosphere)
The final parenthesized group (Camera Language, Lighting, Atmosphere) is optional.

Subject: The subject is the main focus in the video, serving as an important embodiment of the theme. It can be people, animals, plants, objects, and so on.

Subject Description: Descriptions of the subject's appearance details and body posture can be listed using multiple short sentences. For example: Athletic performance, Hairstyle and color, Clothing and accessories, Facial features, Body posture and so on.

Subject Movement: Descriptions of the subject's movement status, including stillness and motion, should be straightforward and suitable for a 5-second video.

Scene: The scene represents the environment in which the subject is situated, encompassing the foreground, background, and other elements.

Scene Description: Describe the subject's environment concisely, using a few short sentences to outline the setting without overwhelming the viewer. It should fit what can be shown within a 5-second video. Examples: indoor scene, outdoor setting, natural scene.

Camera Language: The use of lens choices, together with transitions and edits between shots, to convey a narrative or message and to create particular visual effects and emotional tones. Techniques include ultra-wide angle shots, bokeh (background blur), close-ups, telephoto shots, low-angle shots, high-angle shots, aerial views, and depth of field, among others. (Note: This is distinct from camera motion control.)

Lighting: Light and shadow are the vital elements that imbue photographic works with soul. The application of light and shadow can make photos more profound and emotionally resonant, enabling us to create works with a rich sense of depth and expressive power. Techniques include: Ambient lighting, Morning light, Sunset, Interplay of light and shadow, Tyndall effect, Artificial lighting.

Atmosphere: Describing the atmosphere of the anticipated video footage can involve various elements to set the mood and tone.

Key Principles

The most fundamental components of the aforementioned formula are the subject, motion, and setting, which constitute the most straightforward and essential units for depicting a video scene. To provide a more detailed description of the subject and setting, one should enumerate various descriptive short sentences, maintaining the integrity of the elements intended to appear in the Prompt. "Kling" will then extrapolate from our expressions to produce a video that aligns with our vision.

Example: The Giant Panda

Given "A giant panda is reading a book in a café," we can enrich the details of the subject and scene by adding: "A giant panda, wearing black-rimmed glasses, is reading a book in a café, with the book resting on a table where a steaming cup of coffee sits beside it, next to the café's window." This creates a more specific and manageable image.

If you want to add some cinematic language and lighting ambience, we can also try: "Shot in medium range, with a blurred background and atmospheric lighting, a giant panda, adorned with black-rimmed glasses, is seen reading a book in a café. The book lies on a table, accompanied by a steaming cup of coffee, steaming hot, next to the cafe windows, movie-level color palette." The texture of the video generated in this way will be further enhanced, and it is possible to get results beyond expectations.

Panda example

A giant panda is reading a book in a café.

A giant panda wearing black-framed glasses is reading a book in a café, with the book placed on the table. On the table, there is also a cup of coffee emitting steam, and next to it is the café's window.

In the shot, a medium shot with a blurred background and ambient lighting captures a scene where a giant panda, adorned with black-framed glasses, is reading a book in a café. The book rests on the table, accompanied by a cup of coffee that's steaming gently. Beside the cozy setting is the café's window, with a cinematic color grading applied to enhance the visual appeal.
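
The text-to-video formula maps naturally onto a small function with required and optional parts. The sketch below is illustrative only: the function and parameter names are hypothetical, and joining the optional parts with commas is one reasonable choice rather than a required format.

```python
def text_to_video_prompt(subject, movement, scene,
                         camera=None, lighting=None, atmosphere=None):
    """Build a prompt from the required core plus any optional parts supplied."""
    core = f"{subject} {movement} {scene}"
    optional = [part for part in (camera, lighting, atmosphere) if part]
    return ", ".join([core] + optional) + "."


# Core only: subject + movement + scene.
print(text_to_video_prompt("A giant panda", "is reading a book", "in a café"))

# Core plus the optional camera language, lighting, and atmosphere.
print(text_to_video_prompt(
    "A giant panda wearing black-framed glasses",
    "is reading a book",
    "in a café",
    camera="medium shot with a blurred background",
    lighting="ambient lighting",
    atmosphere="cinematic color grading",
))
```

Keeping the optional parts as separate arguments makes it easy to vary only the camera language or lighting while the subject and scene stay fixed.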

The purpose of the formula is to help everyone describe the video scenes they envision more effectively. You can also let your imagination run wild and go beyond the formula, communicating freely and boldly with "Kling," which might yield even more astonishing outcomes! Here are some excellent examples shared by creators.

Best Practices for Text-to-Video Prompts

• Use simple words and sentence structures, avoiding overly complex language
• Keep the visual content as simple as possible, aiming for a completion within 5 to 10 seconds
• Using words like "Oriental mood," "China," and "Asia" can more easily generate a Chinese style and depict Chinese people
• Current large video models are not sensitive to numbers, making it difficult to maintain consistency in counts, such as "10 puppies on the beach"
• For a split-screen scene, you can use a prompt like: "4 camera angles, representing spring, summer, autumn, and winter"
• At the current stage, it is challenging to generate complex physical movements, such as the bouncing of a ball or the trajectory of a high-altitude throw

💡 Tip: Keep prompts simple and focused. Avoid precise counting or complex physics. For cultural specificity, use clear regional descriptors.

Start and End Frames

The Start and End Frames function allows you to upload two images, and the model will use these two images as the starting and ending frames to generate a transition video. Experience it by clicking on the "Add End Frame" option located at the top right corner of the Image to Video function.

Start and end frames give finer control over the video. At this stage they are mainly used when a video must begin and end on specific frames, which helps achieve the intended dynamic transition. Note that the start and end frame images should be as similar as possible; significant differences may cause a cut to a different shot.

Some tips

• Choose two similar images with the same theme for smoother transitions within 5 seconds. Large differences may trigger a shot switch.

Solo Monologue - Product Showcase

Display products and highlight key selling points. Clear speech, natural tone, and a match to the product's atmosphere are key.

In a beauty live-streaming room, warm yellow lighting illuminates the table, with lipstick samples displayed on either side. [Caucasian beauty influencer] raises a matte dusty rose lipstick. [Caucasian beauty influencer, sweet and fresh voice] says: "Perfect for yellow undertones! Brightens the complexion without drying, and the finish looks beautifully soft all day." Background: Soft beauty BGM playing.

Visual: In a fashion live-streaming room, clothes hang on a rack, and a full-length mirror reflects the host's figure. Dialog: [African-American female host] turns to show off the sweatshirt fit. [African-American female host, cheerful voice] says: "360-degree flawless cut, slimming and flattering." Immediately, [African-American female host] moves closer to the camera. [African-American female host, lively voice] says: "Double-sided brushed fleece, 30 dollars off with purchase now."
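
The dialogue prompts in these examples follow a recognizable pattern: an optional bracketed action, then `[character, voice description] says: "line"`, optionally wrapped in Visual / Dialog / Background segments. The helpers below sketch that pattern as seen in the examples; the function names are hypothetical, and nothing here implies the model requires this exact layout.

```python
def spoken_line(character, voice, text, action=None):
    """Format one line of dialogue in the bracketed style used by the examples."""
    parts = []
    if action:
        parts.append(f"[{character}] {action}.")
    parts.append(f'[{character}, {voice}] says: "{text}"')
    return " ".join(parts)


def audio_prompt(visual, lines, background=None):
    """Combine a visual description, dialogue lines, and optional background audio."""
    prompt = f"Visual: {visual} Dialog: {' '.join(lines)}"
    if background:
        prompt += f" Background: {background}"
    return prompt


line = spoken_line(
    "African-American female host",
    "cheerful voice",
    "360-degree flawless cut, slimming and flattering.",
    action="turns to show off the sweatshirt fit",
)
print(audio_prompt(
    "In a fashion live-streaming room, clothes hang on a rack.",
    [line],
    background="Upbeat BGM plays.",
))
```

Repeating the same bracketed character label on every line mirrors how the examples tie each voice to one on-screen character.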

Some high-quality examples

Video examples below are shared by Kling creators

A giant panda is eating hot pot with chopsticks, with the street as the background. (Ratio: 16:9, Mode: Standard, Length: 5s)

A Pikachu is sitting on a chair, drinking coffee and reading a newspaper. (Ratio: 16:9, Mode: Standard, Length: 5s)

A polar bear is playing the violin in the snow. (Ratio: 16:9, Mode: Standard, Length: 5s)

A bee with a puppy's head. (Ratio: 16:9, Mode: Standard, Length: 5s)

Morning mist, sunrise, lens flare, and a cool breeze. A young Chinese woman with exquisite facial features, her long hair blown by the wind, strands of hair scattered across her face, dressed in summer attire, with a seaside beach as the backdrop. (Ratio: 16:9, Mode: Standard, Length: 5s)

Indoor shooting, close-up, a Chinese child is eating dumplings. (Ratio: 16:9, Mode: Standard, Length: 5s)

A beautiful girl with Chinese style. (Ratio: 16:9, Mode: Standard, Length: 10s)

A little Chinese girl is holding a pink balloon and smiling happily in the playground, with a slide in the background. (Ratio: 16:9, Mode: Standard, Length: 10s)

Aerial shot, blue waves pounding against the rocks, a magnificent scene. (Ratio: 16:9, Mode: Standard, Length: 10s)

A medieval sailing ship sailing on the sea, a foggy night, bright moonlight, and an eerie atmosphere. (Ratio: 16:9, Mode: Standard, Length: 5s)

First-person perspective, high-speed flight, symmetrical composition, rotation, countless lightning bolts amidst dark clouds, motion blur. (Ratio: 16:9, Mode: Standard, Length: 5s)

The camera zooms into a beacon tower on the Great Wall, first-person perspective, high-speed flight, symmetrical composition, motion blur, and atmospheric lighting. (Ratio: 16:9, Mode: Standard, Length: 5s)

A space fighter jet speeds through a huge sci-fi internal tunnel, rushes out of the tunnel into space, and a space battle can be seen at the end of the tunnel. (Ratio: 16:9, Mode: Standard, Length: 5s)

A racing car is racing on the surface of the moon against a space backdrop, with tilt-shift zoom effect. (Ratio: 16:9, Mode: Professional, Length: 5s)

Aerial shot of a cyberpunk city. (Ratio: 16:9, Mode: Standard, Length: 10s)

On an alien planet, the streetscape of a cyberpunk city, with futuristic buildings, the camera slowly advances forward, and there are pedestrians on the street. (Ratio: 16:9, Mode: Professional, Length: 5s)

A woman is engaged in a gunfight with someone in an alley, with a Blade Runner-style atmosphere, neon lights, and ambient lighting. (Ratio: 16:9, Mode: Professional, Length: 5s)

First-person perspective, a man driving a car on a night street with fireworks blooming ahead. (Ratio: 16:9, Mode: Standard, Length: 5s)

A circling camera shot captures a handsome young man dressed in ancient clothing, wearing white, seated by the pond with his eyes closed, meditating. (Ratio: 16:9, Mode: Professional, Length: 5s)

The back view of a woman, in a red long gown, standing on the rooftop, with buildings smoking in the distance. (Ratio: 16:9, Mode: Standard, Length: 5s)

Solo Monologue - Lifestyle Vlog

Showcasing easy, natural moments from daily life.

On the beach, the waves crash against the shore. [Young Caucasian male] wearing a backward baseball cap, holding a camera and taking a selfie, with a smile at the corner of his mouth. [Young Caucasian male, sunny voice] says: "The weather is amazing today! All my worries feel totally gone. I've been needing a day like this—sun, breeze, just the sound of the waves." The camera is in vlog close-up style.

In a kitchen, the oven door is half open, revealing a golden chiffon cake resting on the table. [Latina girl] gently breaks the cake with her hands (cake crumbs fall off), her eyes shining with excitement. [Latina girl, proud and sweet voice] says: "My first success. Look at that crumb!" Background: Upbeat BGM plays.

Supported Audio Types

Dialogue: Multi-person voice dialogue

Voice Narration: Character voice narration

Singing/Rap: Characters singing or rapping with lyrics

Ambient Sound Effects: Background sounds like wind, ocean waves, street noise, traffic

Object/Action Sound Effects: Sounds like glass breaking, footsteps, knife slicing, machine rumble

Mixed Sound Effects: A combination of voice, background sounds, and sound effects for an immersive audio-visual experience.

Solo Monologue - News Reporting

Emphasizes professionalism, formality, and stable tone.

Visual: In front of an outdoor shopping mall, a crowd gathers, cheering. Dialog: [African-American male reporter] stands next to the crowd, holding a microphone, his body slightly turned. [African-American male reporter, steady voice] says: "Now we can see the atmosphere here is absolutely electric. Let's go check it out together! There's so much happening all at once." Background: Cheerful crowd noises and event BGM, with occasional close-ups of the event.

In a sports news studio, the screen behind the sports anchor is showing a basketball game replay. [Sports anchor] sits behind the news desk, tapping his fingers lightly on the table. [Sports anchor, clear and strong voice] says: "Look at this clutch play! He stepped up when it mattered most, hitting the shot that decided the championship! This game-winning shot sealed the victory outright." Background: Cheers from the live game, with the camera focusing on the sports anchor's face.

Solo Monologue - Public Speaking

Shows strong, persuasive delivery.

The main venue of an international tech summit, with delegates from various countries filling the seats. [Indian entrepreneur] stands at the center of the stage. [Indian entrepreneur] gazes steadily at the audience, his hands naturally hanging by his sides. [Indian entrepreneur, loud voice] says: "A decade ago, the world saw India through call centers." After a brief pause, he extends his hands upward. [Indian entrepreneur, passionate voice] says: "Now, Indian innovation is reshaping the world with tech!" The camera slowly zooms in on the Indian entrepreneur's face, and as he finishes his speech, he joins his hands together in a prayer gesture. The audience bursts into applause.

Visual: A TED-style circular stage, with the speaker sitting, and the audience hidden in the shadows. Dialog: [Speaker] leans slightly forward, resting his hands lightly on the podium. [Speaker, sincere and gentle voice] says: "Your biggest limitation isn't your ability; it's the story you tell yourself about your ability." Background: Light chuckles from the audience followed by applause, with the camera holding a slow, subtle zoom in on the speaker in mid-close-up.

Narration - Product Explanation

Static visuals + professional narration, ideal for e-commerce videos.

Visual: In a tidy living room, a white robotic vacuum sits in the center, with no clutter around it. Dialog: [Narrator, soft female voice] accompanied by the gentle sound of vacuuming: "Are you still troubled by dust in hard-to-reach corners? This robotic vacuum features edge-to-edge cleaning, leaving no gaps behind—making your life easier and effortless!" The camera closely follows the vacuum's path as it cleans.

Visual: On a bright weekend morning, the living room is filled with light, and a vintage green Bluetooth speaker rests on the coffee table. Dialog: [Young Caucasian man] walks over to the speaker with a coffee cup in hand, gently tapping the switch. [Young Caucasian man, casual voice] says: "Good morning. With 360-degree surround sound, you can enjoy rich, full music from anywhere in the room." After speaking, the young man walks away, and the camera focuses on the speaker.

Narration - Event Commentary

Requires dynamic pacing and event atmosphere.

Visual: At the World Cup final, the lights are dazzling, and the stands are roaring with excitement. Dialog: (No characters, just narration) [Narrator, excited male voice] as the ball hits the net: "The game is over!" Background: Fans erupt in cheers, and the camera captures the moment the ball enters the net from the goalkeeper's perspective.

In front of the main grandstand at an F1 racetrack, the cars zoom by. [Narrator, excited male voice] says: "Final lap! He's on the inside! Oh, what a move! They are side by side to the line! Unbelievable!" Background: The roar of engines and the screech of tires, with the camera following the two cars nearly side by side.

Multi-Character Dialogue - Interview Show

Two people sit down for an interview with natural tone switching.

Visual: A modern industrial-style recording studio with brick walls covered in soundproof panels, equipment neatly arranged. Dialog: [Caucasian male host] sits in front of the microphone, slightly leaning forward. [Caucasian male host, steady voice] says: "Today we're excited to have Dr. Sarah Miller from Stanford AI Lab. Sarah, your research on neural networks is groundbreaking." During this, [African-American female guest] remains silent. Immediately, [African-American female guest] raises her chin slightly, holding the microphone. [African-American female guest, gentle voice] says: "Thank you for having me." During this, [Caucasian male host] remains silent.

A modern podcast studio in Los Angeles, with a warm yellow filter wrapping around a beige fabric sofa. [Caucasian female host] looks at the camera, her fingers gently resting on the armrest of the sofa. [Caucasian female host, sweet voice] says: "The Santorini sunset in Greece is absolutely breathtaking! Highly recommend adding it to your bucket list." During this, [African-American male host] remains silent. Immediately, [African-American male host] nods slightly. [African-American male host, gentle voice] says: "Exactly, that's the perfect spot to unwind and escape the daily grind." During this, [Caucasian female host] remains silent. The camera focuses on the interaction between the Caucasian female host and the African-American male host.

Multi-Character Dialogue - Scripted Performance (Short Play)

For short stories and emotional dialogue.

Visual: A dimly lit casino VIP room with a green-felt poker table at the center, surrounded by swirling smoke. Wall lamps cast warm, silhouetted glows. Dialog: [Man in suit, elbows on the table leaning forward, deep male voice]: "Three rounds to decide. Win, and all the chips are yours. Lose, and tell me the real reason you're getting close to him." [Woman with curly hair, fingers gently tracing the edge of the table, red lips curling into a faint smile, cool and glamorous female voice]: "I don't care about the chips."

A frigid wilderness. Explorers are starting a fire, with firewood crackling. [Explorer A, exhausted yet resolute]: "The fire is lit." [Explorer B, voice brimming with hope, as the speaker switches]: "We're saved!" Sound Effects: Crackling of the burning flame, distant wolf howls, howling cold wind sweeping past.

Multi-Character Dialogue - Daily Conversation

Casual, natural, and conversational.

Visual: In an office area of a New York office building, cool-toned lighting illuminates the workspace, and a printer is running. Dialog: [Foreign male employee] and [Foreign female employee] stand next to the printer, facing each other. [Foreign male employee, calm voice]: "How's the project report coming along? Manager needs it this afternoon." During this, [Foreign female employee] remains silent. Immediately, [Foreign female employee, efficient voice] responds: "Almost done. I'll send it in 10 minutes." During this, [Foreign male employee] remains silent. The camera focuses on their interaction, with the sound of the printer and the office background ambiance.

A kitchen in the morning, sunlight streaming through the window onto the countertop, with a frying pan sizzling. [Boyfriend] places a blackened fried egg on the table, raising an eyebrow proudly. [Boyfriend, cheerful voice]: "Try my breakfast made with love!" During this, [Girlfriend] remains silent. Immediately, [Girlfriend] leans in, takes a light sniff, and raises an eyebrow. [Girlfriend, teasing voice]: "The love is definitely felt, it's just a bit burnt." During this, [Boyfriend] remains silent. Then, the two make eye contact, and together, both smile and say: "It's just a bit burnt." The camera cuts from a close-up of the fried egg to [Boyfriend and Girlfriend] sharing a smile.

Multi-Character Dialogue - Comedy Skit

Fast-paced, with strong contrast.

Visual: On a comedy stage, the spotlight is focused on the center, while the audience remains in the shadows. Dialog: [Stand-up comedian] holds a microphone on stage, slightly swaying his body. [Stand-up comedian, humorous male voice]: "My gym trainer said the first step is the hardest... Lies! The first step is easy. It's the 5,000th step that's trying to murder you!" After finishing, the comedian shrugs and raises his hands. Background: Laughter and applause from the audience, with the camera focused on the comedian's face.

Visual: In a cherry blossom plaza, pink petals fall, and there are faint ruins near the fountain. Dialog: [Pink Mecha Girl] extends her energy wings (with a loud alarm sound), hurriedly looks down at her control screen. [Pink Mecha Girl, panicked voice]: "Oh no, only five percent battery left!" Immediately, the Pink Mecha Girl lands near the fountain, fumbles to plug in the mobile power bank, and glances at the giant monster. [Pink Mecha Girl, embarrassed voice]: "Um, could you please wait while I recharge?" The giant monster tilts its head and makes a confused low growl, retracting its claws and sitting down in the ruins. The camera focuses on the Pink Mecha Girl's frantic movements.

Music Performance - Sing

Visual: A sunlit garden path, with daisies in full bloom and butterflies fluttering gently. Dialog: [Asian woman] walks slowly with loose braids, her floral dress brushing against the daisies. [Asian woman, gentle voice] sings: "In this tranquil morning, I've found my way. With dreams in my heart, there's light in my days." The Asian woman reaches out to brush past the flowers, startling a white butterfly into flight.

Visual: In a livehouse, bathed in blue light, a high barstool is placed in the center, with the audience hidden in the shadows. Dialog: [Short-haired female singer] sits on the high barstool, holding a wooden guitar, her fingers gently strumming the strings. [Short-haired female singer, heartfelt voice] sings: "And I will try to fix you, all night long..." When she reaches the chorus, [Short-haired female singer] looks out toward the audience. Background: The sound of clinking glasses. The camera switches between focusing on the short-haired female singer's fingers on the strings and her facial expression.

Music Performance - Rap

Visual: Brooklyn, New York – in front of a graffiti-covered wall, the street vibe is intense, with breakdancers freestyling nearby. Subject: An African-American rapper wearing a gold chain and an oversized hoodie, grooving to the beat while facing the camera. Audio: [African-American rapper, energetic male voice] Rapping over a drum beat: "Yeah, from the bottom to the top, I’m shining bright like a star. Brooklyn streets raised me tough, fought through the dark. Gold chain swingin’, flow hits hard, grindin’ daily, never bored. Now I’m livin’ in the light, this is my life, raw and hardcore!" Background: Layered with deep bass and turntable scratches. Camera cuts rapidly between close-ups of his facial expressions, hand gestures, and the breakdancers.

On a street stage, the audience stands around. [Young rapper] wears a silver chain and a black hoodie, swaying his body to the beat. [Young rapper, dynamic male voice] raps: "Yo, pavement to stage, flow lit, crowd goin’ wild! Mic in my grip, dreams unchained, let the rhythm ride! Raw vibe, sharp rhymes, keep the energy high—this is how we fly, no need to deny! Grind hard, spit fire, make the moment mine, street-born rhythm, let times shine!" The camera focuses on the young Caucasian rapper's movements.

Music Performance - Group Chorus

In a bright rehearsal room, sunlight streams through the window, and a standing microphone is placed in the center of the room. [Campus band female lead singer] stands in front of the microphone with her eyes closed, while the other members stand around her. [Campus band female lead singer, full voice] leads: "I will try to fix you, with all my heart and soul..." The background is an a cappella harmony, and the camera slowly circles around the band members.

Visual: On a university rooftop at sunset, the golden glow of the setting sun wraps around the ground. Dialog: [Asian male and female students] sit in a circle, playing acoustic guitars, their expressions deeply immersed. [Asian male and female students, youthful chorus] sing: "Starlight all over the sky, please light the way ahead; let our youthful voices sail away with the wind." The camera slowly circles each person's face, and the guitar strings gleam with golden light in the sunset.

Music Performance - Instrumental Performance

In a traditional study room, a scroll hangs on the wall, and a guqin rests on the desk, bathed in soft light. [Scholar] sits calmly at the desk, gently plucking the strings of the guqin with his fingertips, his expression serene. Background: The sound of the scroll turning and the melody of the guqin. The camera focuses on the scholar's fingers as he plucks the strings.

Visual: On a neon-lit rainy street at night, raindrops fall to the ground. Dialog: [Cellist] stands under a streetlight, raindrops on the tips of his hair, playing the cello. Background: A slow, emotional solo cello piece. The camera focuses on the water droplets vibrating on the cello strings and [Cellist]'s closed eyes.

Creative Scene - Visual Effects

Visual: In a cozy living room, the firewood is burning in the fireplace, and the sofa is placed next to a coffee table. Dialog: [Male protagonist] enters the living room and speaks. [Male protagonist, gentle voice]: "Babe, taking a break from work?" During this, [Female protagonist] remains silent and smiles, nodding. Immediately, the male protagonist walks over to the sofa, gently sets down his cup, and reaches out to ruffle the female protagonist's hair. The camera focuses on their interaction.

A scene in Antarctica with towering ice formations, the overall tone being a cold, white, frigid color palette. The glacier cracks with a loud noise, followed by the sound of ice shattering, as the engines of the research team's snowmobiles roar. The camera follows the retreating research team and the collapsing ice towers.

Creative Scene - Life Scene Atmosphere

In an afternoon room, sunlight filters through the blinds, creating striped light spots on the floor. A [ginger cat] lies on the windowsill. The [ginger cat] breathes slowly, with background sounds of distant birds and rustling leaves. The camera focuses on the light spots shifting with the cat's breath.

Visual: In a late-night diner scene, only the counter lights are on, and the TV shows a scene titled "Man Wandering in the Park at Midnight." Dialog: [African-American owner] looks at the TV. [African-American owner, deep voice]: "I wonder who needs help this time?" The African-American owner stares at the TV for a moment before his expression softens. [African-American owner, gentle voice]: "I see. It's a father carrying his daughter in his heart." The camera switches between focusing on the African-American owner's face and the TV screen.

Creative Scene - ASMR

Visual: In the library's restoration room at night, a warm desk lamp illuminates ancient books, and the restorer wears white gloves. Dialog: [Book restorer] gently sweeps a soft brush across the cover of the ancient book (with a subtle brushing sound), bringing the brush closer to the microphone. [Book restorer, whispering voice]: "These pages have been asleep for two hundred years. Today, we wake them gently." Background: The soft rustling of book pages, with the camera focusing on the cleaning motion.

A clean live-streaming desk, with props such as a crystal glass and wooden blocks neatly arranged. The makeup brush lightly sweeps across the crystal glass and wooden blocks, producing a "shh-shh" sound. The camera focuses on the props and the details of the action.

Creative Scene - Creative Ads / Material

Visual: In a product display scene, with a simple, bright background, a [raisin] is placed in the center. Dialog: [Raisin] twists and is hydrated, transforming into a plump green grape. [Off-screen voice, crisp female voice]: "Don't want to end up shriveled like I was? Hydrating face cream quenches your skin's thirst and turns back time." Background: The sound of water splashing, and the camera pulls back to show the face cream.

In a cinematic rainy-day café, rain splashes against the window, with a cool, blue-green tone overall. [Blonde French woman] walks in and sits down, her hair slightly damp, gazing directly at the camera. [Blonde French woman, low voice]: "You don't remember the moment, you just remember the feeling." The camera then focuses on a bottle of golden perfume that appears in the center, zooming in on the blonde French woman's face.

How to Write Effective Prompts

When using the VIDEO 2.6 model, write down the scene you want to see, the action that happens, and the sound you want to hear, and the model generates a high-quality video with matching audio. You can refer to the following formula:

💡 Prompt Format = Scene (Scene Description) + Element (Subject Description) + Movement (Movement Description) + Audio (Dialogue / Singing / Sound Effects / Pure Music) + Other (Style / Emotion / Camera)
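
As a rough illustration of the formula above, the five parts can be kept as separate strings and joined in order. This is only a sketch: `build_prompt` and its parameter names are hypothetical helpers for drafting prompt text, not part of any Kling interface, and the sample wording is adapted from the kitchen example earlier in this guide.

```python
def build_prompt(scene, element, movement, audio, other=""):
    """Join the parts in the order Scene + Element + Movement + Audio + Other.

    Hypothetical helper for illustration; parts left empty are skipped.
    """
    parts = [scene, element, movement, audio, other]
    # Strip stray whitespace and drop parts that were left empty.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return " ".join(cleaned)


prompt = build_prompt(
    scene="In a kitchen, the oven door is half open.",
    element="[Latina girl] stands by a golden chiffon cake resting on the table.",
    movement="She gently breaks the cake with her hands.",
    audio='[Latina girl, proud and sweet voice] says: "My first success. Look at that crumb!"',
    other="Background: Upbeat BGM plays.",
)
print(prompt)
```

Keeping the parts separate makes it easier to change one of them (for example, only the audio line) between generations while the rest of the prompt stays fixed.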

💡 Dialogue: "Sentence" + Emotion + Speech Speed + Tone + Character Label
— Single Character: Specify voice attributes (e.g., [Man speaking], "Sentence" + Deep + Fast)
— Multiple Characters: Use clear labels to distinguish (e.g., [Character A, angrily] says, "Sentence" — [Character B, calmly] replies, "Sentence")
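
A minimal sketch of the dialogue pattern, assuming the bracketed label style used in this guide's example prompts (`[Label, voice] says: "Sentence"`). The function name `dialogue_line` and its parameters are made up for illustration.

```python
def dialogue_line(label, line, voice="", verb="says"):
    """Format one spoken line as [Label, voice] verb: "line".

    Hypothetical helper: the voice description (emotion, speed, tone)
    goes inside the brackets after the character label.
    """
    tag = f"[{label}, {voice}]" if voice else f"[{label}]"
    return f'{tag} {verb}: "{line}"'


print(dialogue_line("Sports anchor", "Look at this clutch play!",
                    voice="clear and strong voice"))
```

Passing a different `verb` (for example `sings` or `raps`) covers the singing and rap patterns with the same label format.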

Singing: "Lyrics" + Singing Style + Accompaniment Description + Emotion
— Style: Pop, Opera, Country, etc.
— Emotion/Techniques: High-pitched, Vibrato, Gentle singing

Rap: "Sentence (Rhyming)" + Rhythm Style + Emotion
— Rhythm Style: Intense Boom Bap, Trap Style Beat, Fast Flow
— Content: "Sentence" should reflect Rhyme and Meter

Sound Effects: Sound Source (Action/Object) + State + Professional Sound Effects
— Structure: [Object: Wooden Door] suddenly [Action: Slams] + [Sound Effect: Bang]
— Material/State: Glass Breaking, Metal Impact, Screeching Brakes

Ambient Sound: Scene + Sound Elements + Spatial Reverb
— Elements: Rain, Insects, Crowd Murmurs, Traffic
— Spatial Feel: Echo in an Open Hall (Reverb), Small Room Acoustics

Pure Music: Instrument Type + Music Genre + Emotion
— Structure: Piano Performance + Jazz + Melancholy
— Genres: Classical, Rock, Electronic

💡 Tip: It is recommended to use quotation marks " " to clarify sound content when writing prompts.
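
One way to apply this tip before submitting is to list what a prompt actually places inside quotation marks. The sketch below is a hypothetical pre-check (the name `quoted_segments` is not from Kling); it handles straight and curly double quotes only.

```python
import re

# Matches "straight" or \u201ccurly\u201d double-quoted text.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|\u201c([^\u201d]+)\u201d')


def quoted_segments(prompt):
    """Return the text found inside double quotes, in order of appearance."""
    return [straight or curly for straight, curly in QUOTE_PATTERN.findall(prompt)]


sample = '[Narrator, excited male voice] says: "Final lap!" Background: The roar of engines.'
print(quoted_segments(sample))
```

An empty list for a prompt that is meant to contain speech or lyrics suggests the spoken content was written without quotation marks.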

Key Tutorial — Multicharacter Dialogue Prompt Examples and Guidelines

Guidelines	Core Principles	Prompt Guidelines and Examples	Incorrect Example (Prone to Model Failure)
P1. Structured Naming	Character labels must be unique and consistent.	[Character A: Black-suited Agent] and [Character B: Female Assistant]. ❌ Avoid using pronouns or synonyms.	[Agent] says… Then, he says…
P2. Visual Anchoring	Bind the dialogue to the character's unique actions.	First describe the action, then follow with the dialogue: The black-suited agent slams his hand on the table. [Black-suited Agent, angrily shouting]: "Where is the truth?"	[Black-suited Agent]: "Where is the truth?" (The model won't know who slammed the table)
P3. Audio Details	Assign unique tone and emotion labels to each character.	[Black-suited Agent, raspy, deep voice]: "Don't move." [Female Assistant, clear, fearful voice]: "I'm scared."	[Man] says… [Woman] says… (The voice characteristics are too vague and can confuse the model)
P4. Temporal Control	Use clear linking words to control the sequence and rhythm of dialogue.	…. [Black-suited Agent]: "Why?" Immediately, [Female Assistant]: "Because it's time." ⚠️ (Optional strong constraint: Insert "this is when the speaker switches" between the two.)	[Black-suited Agent]: "Why?" [Female Assistant]: "Because it's time." (The model may generate a continuous speech from one character)
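
The four guidelines in the table can be sketched as a small builder that assembles speaker turns. This is an illustrative sketch only: `build_dialogue` and the turn fields are hypothetical, and the second speaker's action ("steps back") is invented for the example.

```python
def build_dialogue(turns, link="Immediately,"):
    """Assemble speaker turns into one prompt string.

    Each turn is a dict with 'label', 'action', 'voice', and 'line'.
    Hypothetical helper: it reuses the same label in every tag (P1),
    puts the action before the line (P2), requires a voice descriptor
    on every tag (P3), and inserts a linking word between turns (P4).
    """
    sentences = []
    for i, turn in enumerate(turns):
        if not turn["voice"]:
            raise ValueError(f"missing voice descriptor for {turn['label']}")
        prefix = f"{link} " if i > 0 else ""
        # P2: the action comes first, anchored to the label used in the tag.
        sentences.append(f"{prefix}[{turn['label']}] {turn['action']}.")
        sentences.append(f'[{turn["label"]}, {turn["voice"]}]: "{turn["line"]}"')
    return " ".join(sentences)


turns = [
    {"label": "Black-suited Agent", "action": "slams his hand on the table",
     "voice": "raspy, deep voice", "line": "Where is the truth?"},
    {"label": "Female Assistant", "action": "steps back",
     "voice": "clear, fearful voice", "line": "I'm scared."},
]
print(build_dialogue(turns))
```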

Common Audio Trigger Words

Audio Type	Category	Trigger Words	Examples
Speech	Core Speech	Speaking / Talking	A woman is sitting at a desk, calmly speaking into a microphone.
		Asking / Querying	A curious boy in the garden asking his father a question.
		Telling / Narrating	An old man sitting by the fireplace, slowly telling a story.
		Explaining	A tour guide pointing at a map, clearly explaining the route.
	Volume / Clarity	Whispering	Two friends leaning in close in a crowded room, whispering a secret.
		Softly Speaking	A student in the quiet library is softly speaking on the phone.
		Clearly Speaking / Crisp Voice	A radio announcer with a clear voice is speaking the news.
	Emotion / Tone	Excitedly Speaking	The award winner is holding a trophy, excitedly speaking their acceptance speech.
		Complaining	A customer at the counter complaining about poor service.
		Sighing	A tired worker sitting by a window, letting out a heavy sighing sound.
		Gently Speaking	A mother is rocking a baby, gently speaking a lullaby.
	Vocal Quality	Hoarse Voice	A patient waking up, requesting help with a hoarse voice.
	Vocal Quality	Deep Voice	A middle-aged man telling a scary story in a deep voice.
	Pace / Rhythm	Fast Talking / Rapid Speech	A fast-talking salesperson rapidly describing the product features.
	Pace / Rhythm	Slow Talking	An old professor slow talking while carefully elaborating on a complex theory.
	Performance	Reciting / Reading Aloud	A poet on a stage, reciting a dramatic poem.
	Performance	Monologue	An actor standing alone on stage, performing a sad monologue.
	Performance	Narration / Voiceover	A film scene cuts to a background sound of a deep narration.
	Dialogue Interaction	Answering / Responding	The interviewee is answering the question immediately.
		Arguing / Quarrelling	A couple in the kitchen, arguing loudly.
		Shouting / Yelling	A father standing at the door is shouting / yelling at his children playing outside.
		Discussing	A group of students gathered around a table, discussing a difficult problem.
	Vocal Action	Crying / Sobbing	A little girl sitting on the ground crying / sobbing after falling down.
	Vocal Action	Screaming	A woman seeing a mouse, letting out a sharp screaming sound.
	Vocal Action	Laughing / Chuckling	Three people sharing a joke and laughing / chuckling loudly.
Singing	Core Form	A Cappella	A singer on an empty stage performs the first line a cappella.
		Humming	A chef happily humming a tune while cooking in the kitchen.
		Loud Singing	A rock musician singing loudly from the mountaintop.
	Technique / Style	Bel Canto / Opera	A soprano in a gown performing a bel canto / opera piece.
		Pop Vocals	A young artist in a studio, recording a new track with pop vocals.
		Vibrato	A singer adding a beautiful vibrato to the high note.
		Falsetto	A male vocalist using falsetto to hit a very high note.
		Harmony / Layered Vocals	A quartet performing a section with perfect harmony.
	Rap Terminology	Rapping / Hip-Hop	A street performer rapping / hip-hop under neon lights.
	Rap Terminology	Flow / Rhyme	A rapper performing a verse with a smooth flow and tight rhyme.
	Rap Terminology	Fast Rap / Rapid Delivery	A section of the song is a high-speed, machine-gun like fast rap / rapid delivery.
	Rap Terminology	Strong Rhythm / Heavy Beat	A Hip-Hop track with a strong rhythm / heavy beat.
Sound Effects — SFX	Daily Actions	Tapping / Knocking	A carpenter is tapping / knocking a nail with a hammer.
		Footsteps	Slow and heavy footsteps walking in an empty hallway.
		Chewing / Munching	A person chewing / munching on crunchy chips.
	Material Impact	Glass Shattering	A rock hitting a window, followed by the sound of glass shattering.
		Metal Clanging	Two large iron blocks metal clanging in a factory.
		Friction / Rubbing	Friction sound of two pieces of rough fabric rubbing together.
	Natural Elements	Thunder	A flash of lightning, followed by a low thunder rumble.
		Fire Crackling	A campfire fire crackling and burning brightly.
		Bubbling / Gurgling	Hot soup on the stove, bubbling / gurgling as it heats up.
	Mechanical Noise	Alarm / Siren	A police car driving by at night, its alarm / siren wailing.
		Braking	A car performing an emergency stop, with a screeching braking sound.
		Gears Whirring	The internal workings of an old clock, with subtle gears whirring sound.
Musical Instruments	Instruments	Piano Music	A pianist playing classical piano music in a concert hall.
Musical Instruments	Instruments	Guitar Plucking	A street artist gently plucking a guitar string.
Ambient Soundscapes	Urban	Traffic Noise / Car Flow	Continuous traffic noise / car flow at a busy intersection.
		Crowd Murmur	The background sound of crowd murmur in a museum.
		Subway Noise	Subway noise as a train arrives and departs from the station.
		Construction Noise	Distant, persistent construction noise in the city during the day.
	Nature	Ocean Waves	The soothing sound of ocean waves hitting the beach in the morning.
		Bird Chirping	Various bird chirping sounds in a morning forest.
		Wind Sound (Nature)	Wind sound blowing across an open field.
		Rainforest	A hot and humid rainforest, filled with unique bird calls and dripping water.
	Indoor Space	Library Silence	The deep library silence punctuated by the occasional book drop.
		Café Background Music	A casual café background music with quiet chatter.
		Air Conditioner Hum	The steady, low air conditioner hum in a quiet office.
		Fireplace Burning	The warm, comforting sound of a fireplace burning in a winter cabin.
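
To see which audio types a draft prompt is likely to trigger, the table can be turned into a simple lookup. The sketch below uses only a small subset of the trigger words listed above. The name `audio_types_in` and the plain substring matching are assumptions made for illustration, not a description of how the model detects audio.

```python
# A small subset of the trigger words from the table, keyed by audio type.
TRIGGER_WORDS = {
    "Speech": ["speaking", "whispering", "narrating", "shouting"],
    "Singing": ["humming", "rapping", "falsetto", "vibrato"],
    "Sound Effects": ["footsteps", "glass shattering", "thunder", "siren"],
    "Musical Instruments": ["piano music", "guitar plucking"],
    "Ambient Soundscapes": ["traffic noise", "ocean waves", "crowd murmur"],
}


def audio_types_in(prompt):
    """Return the audio types whose trigger words appear in the prompt.

    Uses case-insensitive substring matching, so a word such as
    "wrapping" would also match "rapping"; good enough for a rough check.
    """
    text = prompt.lower()
    return [audio_type for audio_type, words in TRIGGER_WORDS.items()
            if any(word in text for word in words)]


print(audio_types_in("A police car driving by at night, its siren wailing, "
                     "over continuous traffic noise."))
```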

Kling VIDEO 2.6: Voice Control

Kling VIDEO 2.6: Voice Control for Image to Audio & Video

Have you ever struggled with inconsistent voices or a lack of personalisation when creating content across multiple videos or characters? With Kling VIDEO 2.6, we're introducing the all-new Voice Control feature. Simply select a target voice, and the model will accurately replicate its vocal characteristics to perform your specified content. The workflow is effortless — just provide visual input + voice prompt + target voice, and generate high-quality audio-visual content in seconds.

With Voice Control, you can now achieve:

Stable, High-Fidelity Voice Output: The voice remains consistent throughout the entire video, accurately preserving the target timbre. Ideal for long-term voice consistency across IP characters, brand personas, and recurring roles.

Flexible Style Adaptation: A single voice can be seamlessly applied to multiple scenarios — such as narration, conversation, or speeches — automatically adapting tone, rhythm, and delivery style to match the context.

Natural Cross-Language Performance: No additional configuration required. Voices trained in one language can naturally perform dialogue in another (e.g., Chinese ↔ English), with smooth pronunciation and expressive consistency. Currently supports bidirectional Chinese–English adaptation.

Prompt-Based Voice Binding: With simple prompts like [Character@VoiceName], the model automatically binds voices to specific characters — making multi-character dialogue with distinct voices effortless.

Example use cases:

Signature Voice for Avatars

Product Demos & Explanations

Multi-Character Voice Control

Storytelling & Performance

Voice Control Prompt Guide

Prompt = Scene (scene description) + [Element (element description) @Voice Name] + Motion (motion description) + Audio (dialogue / singing / sound effects / music) + Others (style / emotion / camera)

Recommended Prompt Format:

Type	Prompt Format	Prompt Example
Single person speaking/singing	[Role Name] @Voice: "Dialogue."	Police interrogation scene. In a tense police interrogation, [Detective Li @Mr. Wang's voice] stands and demands, "Where's the evidence?" [Suspect Zhang @Xiaohong's voice] lowers his head, trembling slightly, and replies, "I don't know anything." The overall tone is serious and intense, with close-up shots focusing on their exchange.
Multi-character conversation/singing (recommended for two-person dialogues; performance may degrade in scenes with three or more speakers)	[Role A] @Voice A: "Line A." [Role B] @Voice B: "Line B."	Family farewell scene. In an emotional family farewell, [Father @My Own Voice] lowers his head slightly and says, "The train is about to depart." [Daughter @My Best Friend's Voice] wipes away tears and responds, "Dad, please don't go." The scene carries a sorrowful tone, captured in a medium shot that frames the moment of separation.

Voice Binding Rules:

Specified Voice Binding: To assign a specific voice to a specific character, add @VoiceName immediately after the character name in the prompt, using the format: Element@VoiceName
— Example: [Livestream Host] @Sweet Female Voice: "This top is a trending must-have!"

Multiple Voice Usage: Each @VoiceName binding is independent for its character and does not override the others. Recommended for two-character dialogue scenarios; performance may degrade in scenes with three or more characters.
— Example: [Teacher] @Intellectual Female Voice: "Turn to page 20." [Student] @Teen Voice: "Okay, teacher."

Current Limitations: Voice creation and usage currently support Chinese and English audio/video content only. Voice consistency may be weaker in singing scenarios.
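
The binding rules above can be sketched as a formatter that produces one bound line per speaker. The helper name `bind_voice` is hypothetical, and the voice names are the placeholder names from the examples above rather than real voice IDs.

```python
def bind_voice(character, voice_name, line):
    """Return '[Character] @Voice Name: "line"' for one speaker.

    Hypothetical helper: the voice tag is placed immediately after the
    character label and outside the quoted dialogue.
    """
    if not character.strip():
        raise ValueError("a voice must be bound to a character label")
    return f'[{character}] @{voice_name}: "{line}"'


lines = [
    bind_voice("Teacher", "Intellectual Female Voice", "Turn to page 20."),
    bind_voice("Student", "Teen Voice", "Okay, teacher."),
]
print(" ".join(lines))
```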

Not Recommended Prompt Formats:

Please avoid the following formats — the model may fail to correctly recognise voice bindings, resulting in incorrect or inconsistent voice application.

Error Type	Prompt Format	Example
Missing character entity (using voice as the element)	@Voice: "Dialogue"	Weather forecast scene. [@Female Narrator Voice] calmly says: "Welcome to today's weather forecast."
Multiple characters bound to the same voice	[Character A] @Voice1. [Character B] @Voice1.	Security booth scene. [Guard A@My College Friend's Voice] reports in a low tone: "The gate is locked." [Guard B@My College Friend's Voice] nods: "All clear." Calm, serious tone; close-up shot capturing the dialogue.
Incorrect voice tag placement	[A]: "Line A." [B]: "Line B." @Voice [A]: "Line A." [B]: "Line B." @Voice bound to [B] @Voice. [Character]: "Dialogue"	Home conversation scene. [Man]: "Where did you go?" (Speaker switches) [Woman]: "I went out for a walk." @My Own Voice Security booth duty scene: [Guard1]: "Yes." [Guard2]: "Copy that." @Voice bound to [Guard1] Street rap scene. @Rapper A's voice. [Young Man]: "The streets never sleep, the beat never fades." Cool, urban style with street traffic ambience and light drum beats.
Voice tag embedded inside dialogue text	[Character]: "Dialogue @Voice" [A]: "Line 1 @Voice." Then the speaker switches to [B]: "Line 2."	[Protagonist]: "I've decided @My Voice to leave this place." Firm and restrained tone, close-up shot, with soft breathing ambience in the background. Family farewell scene. [Father]: "Are you really leaving, @My Own Voice?" (Speaker switches) [Daughter]: "Yes, this is my decision." Emotional farewell atmosphere, medium shot, with low indoor ambient sound.
Voice bound to visual actions or non-human audio	@Voice [Visual Action], [Character]: "Dialogue" [Non-human object] @Voice, [Character]: "Dialogue"	Indoor standoff scene: [@Zhang San's voice] slowly walks into the room. [Man]: "Who are you?" Side-angle shot, with faint footsteps in the background. Police chase scene: Sound effect: [police siren @my voice] wailing. [Officer]: "We're almost there." Overall tense, fast-paced tone, handheld tracking shot, with tire screeching layered into the background audio.
Invalid binding (silent character or attribute conflict)	[A]: "Line A." (Bind to [B]) (B does not speak) [Character with conflicting attributes] @Voice Attribute: "Line"	Clothing store scene. [Main character]: "This outfit looks great." [A silent clerk] looks at him. [@My colleague's voice] is bound to the silent clerk. Indoor interrogation scene. [A tall man @a sharp, high-pitched female child voice]: "You've caught me." Overall style is suspenseful with strong contrast, using a close-up shot.
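
Three of the error types in the table can be caught with a rough text check before generating. The sketch below is a hypothetical pre-flight helper, not a Kling feature. It assumes the voice tag is written inside the brackets (as in `[Father @My Own Voice]`), and it does not cover the placement and silent-character cases.

```python
import re

# Assumed tag style: the voice sits inside the brackets, e.g. [Father @My Own Voice].
BOUND = re.compile(r"\[([^\[\]@]+?)\s*@([^\[\]]+)\]")
NO_CHARACTER = re.compile(r"\[\s*@[^\[\]]+\]")
QUOTED = re.compile(r'"([^"]*)"')


def check_voice_bindings(prompt):
    """Return a list of likely binding problems (empty if none were found)."""
    problems = []
    if NO_CHARACTER.search(prompt):
        problems.append("voice tag without a character")
    if any("@" in segment for segment in QUOTED.findall(prompt)):
        problems.append("voice tag inside dialogue text")
    # Map each voice name to the set of characters it is bound to.
    owners = {}
    for character, voice in BOUND.findall(prompt):
        owners.setdefault(voice.strip().lower(), set()).add(character.strip())
    if any(len(characters) > 1 for characters in owners.values()):
        problems.append("one voice bound to several characters")
    return problems


print(check_voice_bindings(
    '[Guard A@My College Friend\'s Voice] reports in a low tone: "The gate is locked." '
    '[Guard B@My College Friend\'s Voice] nods: "All clear."'
))
```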

FAQ

Q: What languages does the current model support for voice output?

The current model only supports voice output in Chinese and English. If you input other languages, we will automatically translate them into English and generate the corresponding audio, which won't affect the overall experience. We are also rapidly expanding support for additional languages, so stay tuned!

Q: Can I generate audio only, without video?

Yes! You can go to the platform's [Sound Effect Generation] module, where you can choose either Text-to-Sound Effects or Video-to-Sound Effects:

— Input text to generate standalone audio.
— Upload a video to extract sound effects.
— This allows you to create pure audio content without needing to generate a video.

Q: How can I improve generation results?

To achieve better generation results, we recommend optimising in the following ways:

Optimise your prompt: Keep the description clear and specific, detailing the scene, sound effect type, style, etc. Avoid overwhelming the prompt with too many complex instructions; it's best to describe each element separately.

Enhance image-text alignment: If you're using reference images, ensure the image content matches the text description. For example, if describing "outdoor camping," avoid using indoor photos as reference images to reduce conflicting information.

Set parameters accurately: Adjust the video duration, resolution, and other settings according to your specific needs. Avoid using default settings if they don't meet your expectations.

Simplify the creation scene: Focus on one core theme in each creation to avoid stacking too many elements (e.g., multiple ambient sounds + complex speech), which helps the model generate more stable and ideal content.

KLING VIDEO O1

Kling AI Video O1 — the world's first unified multimodal video model — is a brand-new creative engine for creators to unlock endless creative possibilities.

Kling O1, based on the Multi-modal Visual Language (MVL) concept, uses natural language to combine videos, images, elements, and other multimodal descriptions to precisely understand your intentions, making the creative process more intuitive and efficient.

1. Input Anything: World's First Unified Multimodal Video Model

The Kling O1 video model marks an industry first by integrating diverse video tasks into a single unified architecture.

Capabilities include Reference-based Generation, Text-to-Video, Keyframe Interpolation (Start/End Frame), Video Inpainting, Transformation, Stylization, and Video Extension. This integration eliminates the need to jump between multiple models or tools, allowing users to execute an end-to-end creative pipeline—from ideation to modification—in one place.

2. Understand Everything: Multimodal Input, All-in-one Creation & Modification

With the model's deep semantic understanding, everything (images, videos, elements, text, and so on) can be included in your input to Kling O1. The model goes beyond the limitations of modality, integrating and understanding different perspectives of an image, video, or character you upload, to return outputs with precision.

Kling Video O1 - Images Reference + Elements (reference image, element, and output with element binding)

@element_1 walking on the streets of New York. She is wearing the exact dress from @image_1 @image_2 @image_3. She is walking towards the camera, then doing a 180-degree turn and showing off her new dress.

At the same time, Kling O1 turns tedious post-production processes into simple conversations. There’s no need for manual masking or keyframing; simply input prompts like “remove bystanders”, “change daytime to dusk”, or “replace the main character’s outfit”, and the model will understand the visual context. From local subject replacement to entire-video restyling, the model will automatically complete pixel-level semantic reconstruction. Your prompts become the most efficient editing tool.

Kling Video O1 - Images Reference + Elements (reference image, element, and output with element binding)

@element_1 wearing the dress from @image_1 . She is sitting on the hood of her car from @image_2 with relaxed @element_2. The scene background is from @image_3. She is looking at the camera.

Kling V3 Omni - Image Reference + Native Audio (reference image, element, and output with element binding)

@element_1 wearing the dress from @image_1 looking at her Ultra-high detail extreme macro close-up of her sophisticated @element_2 articulated metallic joints, layered plating, fine mechanical texture, chrome and matte surfaces. The arm rests elegantly, fingers slightly moving. Her other arm is normal skin resting. Warm golden hour light catches the metallic edges, casting sharp specular highlights across each joint. Camera begins at full macro, slowly zooming out, said: "This is my strength. Every joint, every surface — precision forged. This is what I am." White sand and crystal blue Bora Bora water softly blurred in background. Bokeh. Cinematic. Photorealistic. She is looking at her hand and then straight facing the camera for a direct contact with the viewer.

In addition to the examples above, you can also achieve the following with Kling O1's multimodal prompt input:

• Image/Element Reference — Supports reference images/elements, including characters, items, backgrounds, and more, to generate with more creativity and consistency.

• Input-based Modification — Supports inpainting/outpainting, or changing shot compositions or angles. It also supports localized or full-scale adjustments, such as modifying/swapping subjects, backgrounds, partial areas, styles, colors, weather, and more.

• Video Reference — Supports using reference video content to generate previous or next shots within the same context or set. It can also reference video actions or camera movements for generation.

• And more — Additional capabilities such as Text to Video, Start & End Frames, and more.
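The example prompts above address uploads with tags such as @element_1 and @image_2. As a minimal sketch, assuming that tag format is the one used when uploads are numbered in order, the helper below lists the tags a prompt refers to and reports any that have no matching upload. The function names are illustrative and not part of any Kling interface.

```python
from __future__ import annotations

import re

# Assumption: uploads are addressed as @element_N / @image_N, as in the examples.
TAG = re.compile(r"@(element|image)_(\d+)")


def referenced_tags(prompt: str) -> dict[str, set[int]]:
    """Collect the element and image indices mentioned in a prompt."""
    found: dict[str, set[int]] = {"element": set(), "image": set()}
    for kind, index in TAG.findall(prompt):
        found[kind].add(int(index))
    return found


def missing_uploads(prompt: str, n_elements: int, n_images: int) -> list[str]:
    """Return the tags in the prompt that point past the number of uploads."""
    tags = referenced_tags(prompt)
    missing = []
    for kind, count in (("element", n_elements), ("image", n_images)):
        for index in sorted(tags[kind]):
            if index < 1 or index > count:
                missing.append(f"@{kind}_{index}")
    return missing
```

Running a check like this before submitting a prompt catches the common slip of mentioning @image_3 when only two images were uploaded.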

All-in-One Reference: Video Consistency Now Resolved

Kling O1 comes with enhanced capabilities to understand image and video inputs better, and supports the building of elements with images from multiple angles. With reference images/elements, Kling O1 can remember your characters, props, scenes, etc, just like a human director, to maintain consistency, accuracy, and continuity regardless of how the camera moves or how the scene develops.

Prompt: Close-up of the film effect, fashionable @element_1 neon cyber light and shadow projected onto his sunglasses. Suddenly, his face instantly malfunctioned and transformed into a digital data stream, which then reorganized. High contrast, fast shutter speed, intense visual impact.

Prompt: Medium shot, a rainy night in Tokyo. @element_1 is standing with a windbreaker, facing away from the camera. He slowly turns his head, and raindrops fall in slow motion. The background neon lights blur. A melancholic atmosphere.

Prompt: Close-up: @element_1 is holding a spray paint can. He sprayed at the brick wall, but what came out was not paint, but colorful, glowing flowers that quickly bloomed and grew on the wall. He stepped back to admire his masterpiece, a work of magical realism with vivid colors.

Prompt: Film-style wide-angle shot. @element_1 dressed in a stylized spacesuit, stands at the edge of the cliff on the red planet. He removed his helmet and inhaled the air of the alien planet. The camera slowly and dramatically pulls back, revealing two satellites rising above the horizon. Epic orchestral atmosphere, with exquisite texture details.

Video O1 goes far beyond single characters or objects, featuring powerful multi-subject fusion capabilities. You have the freedom to mix and match multiple subjects or blend them with reference images. Even in complex ensemble scenes or interactions, the model independently locks onto and preserves the unique features of every character and prop. No matter how drastically the environment changes, Video O1 ensures industrial-grade consistency for each of your actors across every shot.

Powerful Combinations: More Creativity Packed in One Generation

The Kling O1 model is not limited to single tasks; it supports a combination of different tasks in one prompt, such as “adding a subject while modifying the background in the video”, or “changing the style while using elements”. This allows you to incorporate multiple creative ideas at once, exploring infinite creative possibilities.

Combine reference images + Restyle

Modify background + Add elements + Restyle

Control the Pace: Supports 3-10s for More Narrative Freedom

Every shot needs its own duration so the story is paced well. Kling O1 supports generations of 3-10s, giving you more control over how your story unfolds. Whether it's a fast-paced, impactful scene or a story with a narrative arc, you decide the pacing of each shot.
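As a minimal sketch of the 3-10s range described above, the check below validates per-shot durations before you submit them and totals a planned sequence. The function and constant names are illustrative assumptions, not part of any Kling interface, and this guide does not say whether fractional seconds are accepted, so only the range is checked.

```python
from __future__ import annotations

MIN_SECONDS = 3   # lower bound stated in this guide
MAX_SECONDS = 10  # upper bound stated in this guide


def check_duration(seconds: float) -> float:
    """Return the duration unchanged if it is within 3-10s, else raise."""
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration {seconds}s is outside the supported "
            f"{MIN_SECONDS}-{MAX_SECONDS}s range"
        )
    return seconds


def plan_shots(durations: list[float]) -> float:
    """Validate every shot and return the total running time in seconds."""
    return sum(check_duration(d) for d in durations)
```

For example, a three-shot plan of 3s, 5s, and 10s passes and totals 18s, while a single 11s shot is rejected.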

Use Cases in different scenarios

With a groundbreaking unified multimodal architecture, Kling O1 integrates generation and modification to empower endless creativity. Whether you’re developing a story from scratch, or deeply reshaping existing content, Kling O1 brings versatility to a variety of creative projects from film production to advertising.

Filmmaking

With Kling O1’s exceptional consistency with references, and powerful features like the Element Library, you can lock in characters and props for each project to generate multiple scenes with consistency and continuity.

Video green screen keying

Images - Elements reference

Advertising

Traditional advertising shoots are costly and time-consuming. In Kling O1, simply upload product, model, and background images along with simple prompts to quickly generate cool shots for product showcases.

image - Elements Reference

Fashion

Shooting models with different looks on different sets can be a lot of work. With Kling O1, you can create a never-ending virtual runway. Upload model photos and clothing images, input prompts, and create lookbook videos with clothing details perfectly retained.

Image - Elements Reference

Film Post-production

Forget about tracking and masking. In Kling O1, post-production is as simple as having a conversation. Input natural language like “remove the bystanders in the background”, or “make the sky blue”, and the model will use deep semantic understanding to automatically complete pixel-level adjustments.

Video Element Modification

Video Content Removal

Image Element Reference

You can upload 1–7 reference images or elements in the input area. You can combine characters, items, outfits, scenes, and other elements, and use text prompts to define their interactions, bringing static elements to life.

Prompt Structure: [Detailed description of Elements] + [Interactions/actions between Elements] + [Environment or background] + [Visual directions like lighting, style, etc]
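The four-part prompt structure above can be sketched as a small helper that joins the parts in order and enforces the 1-7 reference limit. This is an illustration only: the function name and parameters are hypothetical, and the prompt is plain text that you would paste into the input area.

```python
def build_reference_prompt(elements: str, interaction: str,
                           environment: str, visuals: str,
                           reference_count: int) -> str:
    """Join the four parts of the prompt structure, in order."""
    # This guide allows 1-7 reference images or elements per generation.
    if not 1 <= reference_count <= 7:
        raise ValueError("upload between 1 and 7 reference images or elements")
    parts = [elements, interaction, environment, visuals]
    # Skip empty parts so optional sections do not leave stray spaces.
    return " ".join(p.strip() for p in parts if p.strip())


prompt = build_reference_prompt(
    elements="@element_1 wearing the dress from @image_1.",
    interaction="She is sitting on the hood of the car from @image_2.",
    environment="The scene background is from @image_3.",
    visuals="Warm golden hour light, cinematic.",
    reference_count=4,
)
```

Keeping the parts separate makes it easy to swap the environment or lighting while leaving the element description untouched between generations.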

Image - Element reference

Transformation

In Kling O1, you can freely combine multimodal inputs — texts and images/elements — to easily add, modify, or remove subjects and backgrounds from the original video. You can also change the video’s style, environment, colors, shot composition, angles, and more.

Prompt Structure: Add [describe content to add] from [@Image] to [@Video]

Prompt Structure: Add [describe content to add] to [@Video]

Prompt Structure: Add [@Element] to [@Video]

Prompt Structure: Add [@Element] and [describe content to add] from [@Image] to [@Video]

Video Content Removal

Prompt Structure: Remove [describe content to remove] from [@Video]

Changing Angle or Composition

Prompt Structure: Generate [another angle/composition, e.g., close-up, wide shot] in [@Video]

Modify Subject

Prompt Structure: Change [describe specified subject] in [@Video] to [describe target subject].

Prompt Structure: Change [describe specified subject] in [@Video] to [describe target subject] from [@Image]

Prompt Structure: Change [describe specified subject] in [@Video] to [@Element]

Modify Video Background

Prompt Structure: Change the background in [@Video] with [describe specified background]

Prompt Structure: Change the background in [@Video] with [@Image]

Localized Modification

Prompt Structure: Change [describe specified content] in [@Video] to [describe target content]

Prompt Structure: Change [describe specified content] in [@Video] to [describe target content] from [@Image]

Video Restyle

Prompt Structure: Change [@Video] to [Style words: American cartoon, Japanese anime, wool felt, cyberpunk, pixel art, ink wash painting, oil painting, etc] style

Style examples: American Cartoon, Cyberpunk, Pixel Art, Ink Wash Painting, Watercolor, Clay, Figure, Monet-inspired, Wool Felt

Recolor Video Element

Prompt Structure: Change the [item] in the [@Video] to [color]

Prompt Structure: Change the [item] in the [@Video] to [color] from [@Image]

Change Weather/Environment

Prompt Structure: Change [@Video] to [describe weather, like “a rainy day”]

Green Screen Keying in Video

Prompt Structure: Change the background in [@Video] to a green screen, and keep [describe content to keep]
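Several of the Transformation prompt structures listed in this section can be collected into a lookup table, so that an edit is expressed as a task name plus a few fields. The sketch below is one possible arrangement: the task keys, the field names, and the literal "@video" tag are assumptions made for illustration, while the sentence templates are copied from the prompt structures above.

```python
# Templates taken from the "Prompt Structure" lines in this section.
TEMPLATES = {
    "add": "Add {content} to {video}",
    "add_from_image": "Add {content} from {image} to {video}",
    "remove": "Remove {content} from {video}",
    "change_subject": "Change {subject} in {video} to {target}",
    "change_background": "Change the background in {video} with {background}",
    "restyle": "Change {video} to {style} style",
    "recolor": "Change the {item} in the {video} to {color}",
    "weather": "Change {video} to {weather}",
    "green_screen": ("Change the background in {video} to a green screen, "
                     "and keep {keep}"),
}


def transform_prompt(task: str, **fields: str) -> str:
    """Fill the template for one transformation task."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    try:
        return TEMPLATES[task].format(**fields)
    except KeyError as missing:
        raise ValueError(f"task {task!r} needs field {missing}") from None
```

For instance, `transform_prompt("restyle", video="@video", style="pixel art")` yields "Change @video to pixel art style". Combined edits can be written by joining two filled templates in one prompt.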

Video creative effects

You can directly add flames to elements in the video or freeze the environment in the video via text commands. You can also add facial textures or red-eye effects to characters in the video. Additionally, you can reimagine and redraw the image of the main subject in the video, then replace the original subject to achieve more engaging visual effects.

Video Reference

You can upload a 3-10s video as a reference to generate the previous or next shot within the same context. Alternatively, combine the video with text, images, or elements to create a completely new scene that references the actions or camera movements in the video.

Generate Next Shot

Based on [@video], generate the next shot: from the back seat, show a medium shot of a middle-aged man and a young man in front. They angle slightly apart, forming a tense, restrained opposition as they turn to look out their windows. The background is blurred, and soft natural light creates muted olive-green and brown tones with light film grain.

Prompt Structure: Based on [@Video], generate the next shot: [describe shot content]

Generate Previous Shot

Based on [@Video], generate the previous shot: the camera tracks right, following the middle-aged man in a black suit as he walks to the driver’s door, opens it with his left hand, and gets in, causing a slight shake. The young man in the left foreground speaks while looking at him.

Prompt Structure: Based on [@Video], generate the previous shot: [describe shot content]

Reference Video for Camera Movements

Prompt Structure: Take [@Image] as the start frame. Generate a new video following the camera movement of [@Video]

Reference Video for Actions

Prompt Structure: Animate the character in [@Image1] with the same motion as the character in [@Video]
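The four Video Reference prompt structures above differ only in a few words, so they can be sketched as three small functions. The function names and the default "@video" tag are hypothetical and serve only to show how the structures are filled in.

```python
def adjacent_shot_prompt(direction: str, description: str,
                         video_tag: str = "@video") -> str:
    """Prompt for the shot before or after the reference video."""
    if direction not in ("next", "previous"):
        raise ValueError("direction must be 'next' or 'previous'")
    return (f"Based on {video_tag}, generate the {direction} shot: "
            f"{description.strip()}")


def camera_reference_prompt(image_tag: str, video_tag: str = "@video") -> str:
    """Prompt that reuses the camera movement of the reference video."""
    return (f"Take {image_tag} as the start frame. Generate a new video "
            f"following the camera movement of {video_tag}")


def motion_reference_prompt(image_tag: str, video_tag: str = "@video") -> str:
    """Prompt that transfers a character's motion from the reference video."""
    return (f"Animate the character in {image_tag} with the same motion "
            f"as the character in {video_tag}")
```

Restricting the direction to "next" or "previous" mirrors the two shot types this section describes.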

Frames

You can specify the start and end frames, and describe scene transitions, camera movement, or character actions to control the entire video from beginning to end.

Prompt Structure:

• Take [@Image1] as the start frame, [describe changes in subsequent frames]

• Take [@Image1] as the start frame, take [@Image2] as the end frame, [describe the changes between start and end frames]

💡 You can also click the "Start & End Frames" icon to open the upload slots for the start & end images, making the workflow clearer.

Note: Generation with only an end frame is not supported for now.
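The two frame-based prompt structures, together with the rule that an end frame alone is not supported, can be sketched as follows. The function name and the image tags passed to it are illustrative assumptions.

```python
from typing import Optional


def frames_prompt(changes: str, start: Optional[str] = None,
                  end: Optional[str] = None) -> str:
    """Build a start-frame or start-and-end-frame prompt."""
    # Per the note above, generation with only an end frame is not supported.
    if start is None:
        raise ValueError("a start frame is required")
    if end is None:
        return f"Take {start} as the start frame, {changes}"
    return (f"Take {start} as the start frame, take {end} as the end frame, "
            f"{changes}")
```

Calling `frames_prompt("the camera slowly pushes in", start="@image_1")` produces the start-frame form; passing `end="@image_2"` as well produces the start-and-end form.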

Start - End Frames examples: Start Frame only; Start + End Frame

Text-to-Video

Text-to-Video generation can be done by entering text in the input area without uploading any material. For text-to-video, the level of detail in the prompt determines the richness of content in the generated video.

Prompt Structure: Subject (subject description) + Movement + Scene (scene description) + (Cinematic Language + Lighting + Atmosphere)
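The text-to-video prompt structure above has three required parts and three optional ones. A minimal sketch of that shape, assuming the parts are written as comma-separated phrases (the class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class TextToVideoPrompt:
    subject: str                   # required: who or what is in the shot
    movement: str                  # required: what the subject does
    scene: str                     # required: where and when it happens
    cinematic_language: str = ""   # optional: shot type, camera movement
    lighting: str = ""             # optional
    atmosphere: str = ""           # optional

    def render(self) -> str:
        """Join the non-empty parts in the order given by the structure."""
        required = (self.subject, self.movement, self.scene)
        if not all(part.strip() for part in required):
            raise ValueError("subject, movement and scene are required")
        parts = required + (self.cinematic_language, self.lighting,
                            self.atmosphere)
        return ", ".join(p.strip() for p in parts if p.strip())
```

Filling in the optional fields is the simplest way to add the detail that this section says leads to richer output.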

American cartoon style animation. On a sunny summer afternoon, wildflowers bloom on a wide green hillside, sky blue with floating clouds. Two boys aged 8-10, wearing casual T-shirts and shorts, baseball caps, chasing butterflies on the hill. Wide-angle shot first shows them running over the rolling grass, then low-angle close-up captures determined and exaggerated expressions while swinging nets. One boy jumps to catch butterflies, another points excitedly. A car appears on the road in the background. Camera follows the car approaching, boys stop, holding nets, watching curiously. Car stops nearby, kicking up light dust, boys still in curious stance. Bright and colorful lighting, full of summer adventure joy.

Cyberpunk style, thrilling tomb-raiding yellow weasel.

More Skill Combos

Besides the abilities above, you can also combine different types of inputs and fully unleash your imagination to achieve even more surprising results. For example, “image/subject reference + style modification”, “remove subject + add subject”, “background modification + add subject + style modification”, “add subject + style modification”, etc.

Input Media Supported

• Images — You can upload up to 7 images, with minimum resolution 300px, max file size 10MB, in jpg, jpeg or png format.

• Videos — You can upload one video with 3s–10s duration, max file size 200MB, and max resolution 2K.

• Elements — You can upload/generate multiple images from different angles (up to 4 images) to form an element, providing more reference information for the model.

💡 When a video is present, you can upload up to 4 images/elements combined. Without a video, you can upload up to 7 images/elements.
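The limits listed above can be checked locally before uploading. The sketch below is an illustration under stated assumptions: the class and function names are hypothetical, "minimum resolution 300px" is read as applying to the shorter side of the image, and the 2K video resolution limit is left out because this guide does not define how it is measured.

```python
from __future__ import annotations

from dataclasses import dataclass

IMAGE_FORMATS = {"jpg", "jpeg", "png"}


@dataclass
class ImageInfo:
    width: int       # pixels
    height: int      # pixels
    size_mb: float
    fmt: str         # file extension without the dot


@dataclass
class VideoInfo:
    seconds: float
    size_mb: float


def validate_inputs(images: list[ImageInfo],
                    element_image_counts: list[int],
                    video: VideoInfo | None = None) -> list[str]:
    """Return a list of problems; an empty list means the inputs pass."""
    problems: list[str] = []

    # Combined image/element cap: 4 when a video is present, otherwise 7.
    cap = 4 if video is not None else 7
    total = len(images) + len(element_image_counts)
    if total > cap:
        problems.append(f"{total} images/elements exceeds the cap of {cap}")

    for i, img in enumerate(images, start=1):
        # Assumption: the 300px minimum applies to the shorter side.
        if min(img.width, img.height) < 300:
            problems.append(f"image {i}: shorter side is under 300px")
        if img.size_mb > 10:
            problems.append(f"image {i}: larger than 10MB")
        if img.fmt.lower() not in IMAGE_FORMATS:
            problems.append(f"image {i}: format {img.fmt!r} is not jpg, jpeg or png")

    # Each element is built from up to 4 images taken from different angles.
    for i, count in enumerate(element_image_counts, start=1):
        if not 1 <= count <= 4:
            problems.append(f"element {i}: needs 1-4 images, got {count}")

    if video is not None:
        if not 3 <= video.seconds <= 10:
            problems.append("video: duration must be 3-10s")
        if video.size_mb > 200:
            problems.append("video: larger than 200MB")

    return problems
```

Returning a list of problems instead of raising on the first one lets you report every issue with a batch of uploads in a single pass.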