# **OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**
[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2510.15870) [![Code](https://img.shields.io/badge/GitHub-Link-blue)](https://github.com/NVlabs/OmniVinci) [![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/nvidia/omnivinci) [![Website](https://img.shields.io/badge/Web-Page-orange)](https://nvlabs.github.io/OmniVinci)

## Introduction

OmniVinci is an NVIDIA research project exploring omni-modal LLMs that can not only see and read but also listen, speak, and reason. It ranks among the strongest omni-modal understanding models; see the paper and project page for results on popular omni-modality, audio, and vision benchmarks.

## Quickstart

Below, we provide simple examples showing how to use our model with Transformers.

### Environment Setup

1. Download and navigate to the HuggingFace repository:

   ```
   huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False
   cd ./omnivinci
   ```

2. Install the Python environment (based on the NVILA codebase):

   ```
   bash ./environment_setup.sh omnivinci
   ```

### Transformers Usage

#### Video (with Audio) Inference Example

```python
from transformers import AutoProcessor, AutoModel, AutoConfig
import torch

# Default: load the model on the available device(s)
model_path = "./"
video_path = "xxx.mp4"  # placeholder: path to your video file

generation_kwargs = {"max_new_tokens": 1024, "max_length": 99999999}
load_audio_in_video = True   # also feed the video's audio track to the model
num_video_frames = 128       # number of frames sampled from the video
audio_length = "max_3600"    # audio chunk length setting

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(**generation_kwargs)

# Keep the model and processor configs in sync
model.config.load_audio_in_video = load_audio_in_video
processor.config.load_audio_in_video = load_audio_in_video
if num_video_frames > 0:
    model.config.num_video_frames = num_video_frames
    processor.config.num_video_frames = num_video_frames
if audio_length != -1:
    model.config.audio_chunk_length = audio_length
    processor.config.audio_chunk_length = audio_length

conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text": "Assess the video, followed by a detailed description of its video and audio contents."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

- **For audio and image inference examples, please refer to `example_mini_audio.py` and `example_mini_image.py`** (a hedged sketch of audio-only usage also appears in the appendix below).

## License / Terms of Use

The model is released under the [NVIDIA OneWay Noncommercial License](asset/NVIDIA_OneWay_Noncommercial_License.docx).

## Citation

Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@article{omnivinci2025,
  title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
  author={Hanrong Ye and Chao-Han Huck Yang and Arushi Goel and Wei Huang and Ligeng Zhu and Yuanhang Su and Sean Lin and An-Chieh Cheng and Zhen Wan and Jinchuan Tian and Yuming Lou and Dong Yang and Zhijian Liu and Yukang Chen and Ambrish Dantrey and Ehsan Jahangiri and Sreyan Ghosh and Daguang Xu and Ehsan Hosseini-Asl and Danial Mohseni Taheri and Vidya Murali and Sifei Liu and Jason Lu and Oluwatobi Olabiyi and Frank Wang and Rafael Valle and Bryan Catanzaro and Andrew Tao and Song Han and Jan Kautz and Hongxu Yin and Pavlo Molchanov},
  journal={arXiv preprint arXiv:2510.15870},
  year={2025},
}
```
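## Appendix: Audio Inference Sketch

For reference, below is a minimal sketch of what audio-only inference might look like, modeled directly on the video example above. The `{"type": "audio", "audio": ...}` content entry, the `.wav` placeholder path, and the prompt text are assumptions, not confirmed API; `example_mini_audio.py` in this repository is the authoritative reference.

```python
# Hedged sketch: audio-only inference, assuming the processor accepts an
# "audio" content type symmetric to the "video" one shown above.
from transformers import AutoProcessor, AutoModel, AutoConfig
import torch

model_path = "./"
audio_path = "xxx.wav"  # placeholder: path to your audio file (assumed format)

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(max_new_tokens=1024)

conversation = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": audio_path},  # assumed content key
        {"type": "text", "text": "Describe the audio contents in detail."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

If this sketch diverges from `example_mini_audio.py`, prefer the repository script.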

## Reviews

- **ai4science pro** · 2025-10-23 16:59 · Interesting: 5, Helpfulness: 5, Correctness: 5

  Omni-modal understanding LLMs are a difficult yet interesting topic, and OmniVinci is a good starting point that is definitely worth a try.
