AI  

 AI models for audio, video, Android, and Windows

1.) Multimodal Generative AI

  1. Gemini (Google)
  2. Grok
  3. ChatGPT
  4. Qwen 3.7max, with vision capabilities
  5.  

2.) AI Audio model


2.2) AI music audio generative model

  1. Stable Audio 3.0 (os): remix it, tune it, own it. Music only.
  2. Khala (os): AI music generator. Both vocals and music.

2.3) Audio dubbing AI 

  1. Just Dub It (os): 2.5 GB VRAM. "A joint audio-visual model is all you need for video dubbing." Also syncs lips.

2.4) Video to audio model

  1. WaveFlow from Meta (os): audio generation in waveform space. High-fidelity audio synthesized directly in raw waveform space, with no VAE and no latent compression.
  2. PrismAudio (video) (os): 6 GB VRAM

2.5) TTS Text to Audio AI model 

  1. Scenema Audio: Hindi voice works. It also does voice cloning, and you can supply reference audio. 16 GB VRAM, 32 GB system RAM.
  2. Resemble.AI Drama Box: voice cloning
  3. LongCat-AudioDiT (video) (os): voice cloning, 6-15 GB VRAM
  4. Omni Voice (video) (os): voice cloning, zero-shot text-to-speech with diffusion language models, 3 GB VRAM

3.) AI Video model 

  1. FashionChameleon from Alibaba: towards real-time and interactive human-garment video customization

3.1) AI Audio video Model 

  1. LongCat Video Avatar 1.5 / base model int8 (os): 16-24 GB VRAM GPU

3.2) AI images + videos model

  1. Lance from ByteDance (os): 40 GB VRAM GPU
  2. Krea2: 24 GB VRAM

3.3) AI images + audio videos   

 

4.) AI Image model

4.1) AI Panorama image model

  1. PanoWorld: a generative spatial world model for consistent whole-house panorama synthesis

5.) AI 3D art model

  1. LiTO from Apple (os): 3D model generator
  2. Pixal3D :  

6.) AI gaming model

7.) AI Transcription model / Speech Recognition model / Automatic Speech Recognition Model 

  1. MegaASR (os): 5 GB VRAM
  2. Qwen3-ASR 
  3. Gemini-3-Pro
  4. Seed-ASR 
  5. Whisper   

8.) AI model for scientific work

  1. Gemini co-scientist: a multi-agent system built with Gemini

 

9.) AI Video Language Model (VLM)

  1. Marlin VLM is a 2B video VLM tuned for the two questions developers actually ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to span-grounded (start, end) ranges in the video.
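
To make the "span-grounded (start, end) ranges" idea concrete, here is a minimal Python sketch of how timestamped event captions could be stored and filtered by a time window. It is only an illustration: the `Span` class, the `overlapping` and `fmt` helpers, and the sample captions are all hypothetical and do not reflect the model's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical container for one captioned event (not the model's real schema)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds, exclusive
    caption: str


def overlapping(spans, start, end):
    """Return the spans that intersect the half-open window [start, end)."""
    return [s for s in spans if s.start < end and s.end > start]


def fmt(seconds):
    """Format a time in seconds as MM:SS, truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Made-up example data standing in for event captions.
events = [
    Span(0.0, 4.0, "person enters the room"),
    Span(4.0, 9.0, "person opens a laptop"),
    Span(9.0, 15.0, "person starts typing"),
]

for s in overlapping(events, 3.0, 5.0):
    print(f"{fmt(s.start)}-{fmt(s.end)}  {s.caption}")
```

Half-open intervals mean that back-to-back events share a boundary without being reported as overlapping each other.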

10.) AI Live Translation Model

  1. Qwen 3.5 Live Translate
  2.