# **OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**
[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2510.15870) [![Code](https://img.shields.io/badge/GitHub-Link-blue)](https://github.com/NVlabs/OmniVinci) [![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/nvidia/omnivinci) [![Website](https://img.shields.io/badge/Web-Page-orange)](https://nvlabs.github.io/OmniVinci)

## Introduction

OmniVinci is an NVIDIA research project exploring omni-modal LLMs that can not only see and read but also listen, speak, and reason. It ranks among the strongest omni-modal understanding models; see the paper and project page for results on popular omni-modality, audio, and vision benchmarks.

## Quickstart

Below, we provide simple examples showing how to use our model with Transformers.

### Environment Setup

1. Download and navigate to the HuggingFace repository:

   ```
   huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False
   cd ./omnivinci
   ```

2. Install the Python environment (based on the NVILA codebase):

   ```
   bash ./environment_setup.sh omnivinci
   ```

### Transformers Usage

#### Video (with Audio) Inference Example

```python
from transformers import AutoProcessor, AutoModel, AutoConfig
import torch

# Default: load the model on the available device(s)
model_path = "./"
video_path = "xxx.mp4"  # placeholder: path to your video file

generation_kwargs = {"max_new_tokens": 1024, "max_length": 99999999}
load_audio_in_video = True   # also feed the video's audio track to the model
num_video_frames = 128       # number of frames sampled from the video
audio_length = "max_3600"    # audio chunk length setting

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(**generation_kwargs)

# Keep the model and processor configs in sync
model.config.load_audio_in_video = load_audio_in_video
processor.config.load_audio_in_video = load_audio_in_video
if num_video_frames > 0:
    model.config.num_video_frames = num_video_frames
    processor.config.num_video_frames = num_video_frames
if audio_length != -1:
    model.config.audio_chunk_length = audio_length
    processor.config.audio_chunk_length = audio_length

conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text": "Assess the video, followed by a detailed description of its video and audio contents."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

- **For audio and image inference examples, please refer to `example_mini_audio.py` and `example_mini_image.py`** (a hedged sketch of audio-only usage also appears in the appendix below).

## License / Terms of Use

The model is released under the [NVIDIA OneWay Noncommercial License](asset/NVIDIA_OneWay_Noncommercial_License.docx).

## Citation

Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@article{omnivinci2025,
  title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
  author={Hanrong Ye and Chao-Han Huck Yang and Arushi Goel and Wei Huang and Ligeng Zhu and Yuanhang Su and Sean Lin and An-Chieh Cheng and Zhen Wan and Jinchuan Tian and Yuming Lou and Dong Yang and Zhijian Liu and Yukang Chen and Ambrish Dantrey and Ehsan Jahangiri and Sreyan Ghosh and Daguang Xu and Ehsan Hosseini-Asl and Danial Mohseni Taheri and Vidya Murali and Sifei Liu and Jason Lu and Oluwatobi Olabiyi and Frank Wang and Rafael Valle and Bryan Catanzaro and Andrew Tao and Song Han and Jan Kautz and Hongxu Yin and Pavlo Molchanov},
  journal={arXiv preprint arXiv:2510.15870},
  year={2025},
}
```
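## Appendix: Audio Inference Sketch

For reference, below is a minimal sketch of what audio-only inference might look like, modeled directly on the video example above. The `{"type": "audio", "audio": ...}` content entry, the `.wav` placeholder path, and the prompt text are assumptions, not confirmed API; `example_mini_audio.py` in this repository is the authoritative reference.

```python
# Hedged sketch: audio-only inference, assuming the processor accepts an
# "audio" content type symmetric to the "video" one shown above.
from transformers import AutoProcessor, AutoModel, AutoConfig
import torch

model_path = "./"
audio_path = "xxx.wav"  # placeholder: path to your audio file (assumed format)

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(max_new_tokens=1024)

conversation = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": audio_path},  # assumed content key
        {"type": "text", "text": "Describe the audio contents in detail."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

If this sketch diverges from `example_mini_audio.py`, prefer the repository script.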

## Reviews

- **ai4science pro** · 2025-10-23 16:59 · Interesting: 5, Helpfulness: 5, Correctness: 5

  Omni-modal understanding LLMs are a difficult yet interesting topic, and OmniVinci is a good starting point that is definitely worth a try.
