
Google’s Latest Video Generation Model
Google has released Veo 3.1, an updated version of its AI video generator that now includes audio across all features and introduces new editing capabilities. The timing is notable: it arrives as OpenAI’s Sora 2 app climbs the app store charts and sparks conversations about AI-generated content flooding social media platforms.
Veo 3.1 seems positioned as more of a professional alternative to Sora 2’s viral, social media-focused approach. OpenAI launched Sora 2 with a TikTok-style interface that prioritizes sharing and remixing, which helped it reach 1 million downloads within five days and hit the top spot in Apple’s App Store.
New Features and Capabilities
The updated model lets users create videos with synchronized ambient noise, dialogue, and Foley effects. There’s an “Ingredients to Video” tool that combines multiple reference images into a single scene, which sounds promising but has some limitations in practice.
Other features include “Frames to Video” for generating transitions between starting and ending images, and “Extend” which creates clips lasting up to a minute by continuing motion from existing videos. The editing tools allow users to add or remove elements from generated scenes with automatic shadow and lighting adjustments.
The model generates 1080p video in horizontal or vertical aspect ratios and is available through Flow for consumer use, the Gemini API for developers, and Vertex AI for enterprise customers.
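For developers, access through the Gemini API follows the SDK’s long-running video generation flow. Below is a minimal sketch using Google’s google-genai Python SDK; the model ID veo-3.1-generate-preview is an assumption (check Google’s current Veo model list before use), and the prompt and config values are illustrative.

```python
import time

from google import genai
from google.genai import types

# The client reads GEMINI_API_KEY from the environment by default.
client = genai.Client()

# Video generation is a long-running operation in the Gemini API.
# The model ID below is an assumption; verify against the current model list.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=(
        "A street musician plays violin at dusk; passersby chat quietly, "
        "and she says 'thank you' as a coin drops into the case"
    ),
    config=types.GenerateVideosConfig(aspect_ratio="16:9"),  # or "9:16"
)

# Poll until the job finishes, then download and save the clip.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo_clip.mp4")
```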
Performance and Limitations
Testing reveals some interesting patterns. Veo 3.1 is a clear improvement over its predecessor in text-to-video mode, maintaining coherence well and showing a stronger grasp of scene context. It handles a range of styles, from photorealism to stylized content.
There is a tradeoff, though: the model prioritizes coherence over fluidity, which makes fast-paced action difficult to generate. Elements move more slowly but stay consistent throughout a clip. For rapid movement, models like Kling still lead, though they require more attempts to get usable results.
Where things get tricky is image-to-video generation, which was one of Veo’s strengths in previous versions. In this update, when given starting frames in different aspect ratios, the model struggled to maintain the coherence it once had. If the prompt strays too far from what would logically follow the input image, Veo 3.1 tends to generate incoherent scenes or clips that jump between locations.
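To reproduce this kind of test, an image-to-video call through the same SDK looks roughly like the sketch below. Again, the model ID is an assumption, and the Image construction follows the SDK’s documented type fields; this is a sketch, not a definitive implementation.

```python
import time

from google import genai
from google.genai import types

client = genai.Client()

# Load a starting frame as raw bytes; Image takes image_bytes and mime_type.
with open("start_frame.png", "rb") as f:
    start_frame = types.Image(image_bytes=f.read(), mime_type="image/png")

# Keep the prompt close to what the frame already shows -- per the testing
# above, straying too far invites incoherent or location-jumping clips.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed ID; verify before use
    prompt="The camera slowly pushes in as the subject turns toward the window",
    image=start_frame,
)

while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("image_to_video.mp4")
```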
Audio and Dialogue Generation
This might be Google’s strongest selling point. Veo 3.1 handles lip sync better than any other model currently available. In text-to-video mode, it generates coherent ambient sound that matches scene elements, and the dialogue, intonation, voices, and emotions are quite accurate.
However, when you try image-to-video with dialogue, the same issues from standard image-to-video generation appear. The model prioritizes coherence so heavily that it often ignores prompt adherence and reference images. In testing, it generated completely different subjects than the reference images provided, making the results useless for certain applications.
Market Context and Pricing
The AI video generation market has become quite crowded in 2025. You’ve got Runway’s Gen-4 targeting filmmakers, Luma Labs offering fast generation for social media, Adobe integrating Firefly Video into Creative Cloud, and updates from various companies targeting realism, sound generation, and prompt adherence.
Pricing is worth noting: Veo 3.1 is currently among the most expensive video generation models, on par with Sora 2 and behind only Sora 2 Pro. Free users get 100 monthly credits to test the system, which translates to about five videos per month. Through the Gemini API, Veo 3.1 costs approximately $0.40 per second of generated video with audio, while a faster variant called Veo 3.1 Fast costs $0.15 per second.
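At those per-second rates, costs scale linearly with clip length. A quick back-of-the-envelope sketch (rates from above; the 8-second and 60-second clip lengths are illustrative assumptions):

```python
# Per-second Gemini API rates quoted above, in USD.
RATES = {"veo-3.1": 0.40, "veo-3.1-fast": 0.15}

def clip_cost(model: str, seconds: float) -> float:
    """Estimated cost of one generated clip, in USD."""
    return RATES[model] * seconds

# A short 8-second clip versus a 60-second "Extend"-length result.
for model in RATES:
    print(f"{model}: 8s = ${clip_cost(model, 8):.2f}, "
          f"60s = ${clip_cost(model, 60):.2f}")
# veo-3.1: 8s = $3.20, 60s = $24.00
# veo-3.1-fast: 8s = $1.20, 60s = $9.00
```

At the full rate, a one-minute clip runs about $24, which makes the Fast variant the more practical choice for iterating on prompts before a final render.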
For professional use cases where audio quality and lip sync matter most, Veo 3.1 might be worth considering. But for other applications, particularly those requiring strict adherence to reference images, alternative models might serve better.