AI
AI models for audio and video on Android and Windows
1.) Multimodal Generative AI
- Gemini (Google)
- Grok
- ChatGPT
- Qwen 3.7max, with vision capabilities
2.) AI Audio model
2.2) AI music audio generative model
- Stable Audio 3.0 (os): remix it, tune it, own it. Music only.
- Khala (os): AI music generator. Produces both vocals and music.
2.3) Audio dubbing AI
- Just Dub it (os): 2.5 GB VRAM. A joint audio-visual model is all you need for video dubbing. Also syncs lips.
2.4) Video to audio model
- Waveflow from Meta (os): Audio Generation in Waveform Space. High-fidelity audio synthesized directly in raw waveform space: no VAE, no latent compression.
- PrismAudio (video) (os): 6 GB VRAM
2.5) TTS (text-to-audio) AI model
- Scenema Audio: Hindi voices work. It also does voice cloning, and you can supply reference audio. 16 GB VRAM, 32 GB system RAM.
- RESEMBLE.AI DRAMA BOX: clones voices
- LongCat-AudioDiT (video) (os): voice cloning, 6-15 GB VRAM
- omni voice (video) (os): voice cloning, Zero-Shot Text-to-Speech with Diffusion Language Models, 3 GB VRAM
3.) AI Video model
- FashionChameleon from Alibaba: Towards Real-Time and Interactive Human-Garment Video Customization
3.1) AI Audio video Model
- LongCat Video Avatar 1.5 / base model int8 (os): 16-24 GB VRAM GPU
3.2) AI images + videos model
- Lance from ByteDance (os): 40 GB VRAM GPU
- Krea2: 24 GB VRAM
3.3) AI images + audio videos
4.) AI Image model
4.1) AI Panorama image model
- PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis
5.) AI 3D art model
- LiTO from Apple (os): 3D model generator
- Pixal3D :
6.) AI gaming model
7.) AI Transcription model / Speech Recognition model / Automatic Speech Recognition (ASR) model
- MegaASR (os): 5 GB VRAM
- Qwen3-ASR
- Gemini-3-Pro
- Seed-ASR
- Whisper
8.) AI model for scientific work
- Gemini co-scientist: a multi-agent system built with Gemini
9.) AI Video Language Model (VLM)
- Marlin VLM is a 2B video VLM tuned for the two questions developers actually ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to span-grounded (start, end) ranges in the video.
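To make "span-grounded (start, end) ranges" concrete, here is a minimal Python sketch of what timestamped Scene + Event captions might look like as data, and how a query could be mapped to a span. Everything in it is an assumption for illustration: the `Scene` and `Event` classes, their field names, and the `ground_query` helper are hypothetical and are not Marlin's actual output schema or API. The keyword-overlap matching is only a crude stand-in for the grounding the model itself performs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    # Hypothetical fields; the real model's schema may differ.
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    caption: str


@dataclass
class Scene:
    caption: str
    events: List[Event]


def ground_query(scene: Scene, query: str) -> Optional[Tuple[float, float]]:
    """Return the (start, end) span of the event whose caption shares
    the most words with the query, or None if nothing overlaps.

    This keyword overlap is a toy stand-in for model-based grounding.
    """
    query_words = set(query.lower().split())
    best_span = None
    best_score = 0
    for event in scene.events:
        score = len(query_words & set(event.caption.lower().split()))
        if score > best_score:
            best_score = score
            best_span = (event.start, event.end)
    return best_span


# Example data, invented for illustration only.
scene = Scene(
    caption="A person cooks in a kitchen",
    events=[
        Event(0.0, 4.0, "person chops onions"),
        Event(4.0, 9.0, "person stirs a pot"),
    ],
)

span = ground_query(scene, "when does the person stir the pot")
print(span)
```

The useful part is the shape of the result: each answer to "when?" comes back as a pair of second-level timestamps that can be used to cut or seek within the video.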
10.) AI Live Translation Model