X

gigachat3

Information

## [HF link](https://huggingface.co/collections/ai-sage/gigachat3) # GigaChat 3 Ultra & Lightning Next-generation open MoE models: **GigaChat 3 Ultra Preview (702B-A36B)** and **GigaChat 3 Lightning (10B-A1.8B)**. This year we already released the GigaChat 2 lineup, added Reasoning to the Web version (giga.chat), open-sourced GigaChat Lite and Giga-Embeddings, and took first place on the ruMTEB benchmark. Now we are publishing open weights for a new generation of MoE models trained from scratch, without relying on foreign base weights. We are open-sourcing two models under the MIT license with commercial-use permission: * **GigaChat 3 Ultra Preview (702B-A36B)** — flagship instruct model. * **GigaChat 3 Lightning (10B-A1.8B)** — compact MoE model for local and high-load use. --- ## Model and checkpoint list ### GigaChat 3 Ultra Preview * Instruct model: * [\`ai-sage/GigaChat3-702B-A36B-preview\`](https://huggingface.co/ai-sage/GigaChat3-702B-A36B-preview) * Fine-tuning checkpoint (bf16): * [\`ai-sage/GigaChat3-702B-A36B-preview-bf16\`](https://huggingface.co/ai-sage/GigaChat3-702B-A36B-preview-bf16) ### GigaChat 3 Lightning * Base model (pretrain): * [\`ai-sage/GigaChat3-10B-A1.8B-base\`](https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B-base) * Instruct model (for dialogs and instructions): * [\`ai-sage/GigaChat3-10B-A1.8B\`](https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B) --- ## Shared architecture Both models use a custom **Mixture-of-Experts (MoE)** architecture with support for: * **Multi-head Latent Attention (MLA)** * **Multi-Token Prediction (MTP)** ### Multi-head Latent Attention (MLA) Instead of standard Multi-head Attention, the models use **Multi-head Latent Attention**: * The Key-Value (KV) cache is compressed into a latent representation, which: * reduces memory requirements; * decreases KV-cache size; * speeds up long-context inference. This is especially noticeable for very long contexts (tens or hundreds of thousands of tokens) and for cluster inference. ### Multi-Token Prediction (MTP) Both models are trained with the **Multi-Token Prediction (MTP)** objective: * the model learns to predict multiple tokens per forward pass; * this enables speculative / parallel generation and speeds up inference by roughly +40%; * in typical setups (vLLM / SGLang), MTP provides throughput comparable to much smaller dense models. --- ## GigaChat 3 Ultra Preview (702B-A36B) ### Short description \`GigaChat3 Ultra Preview\` is the flagship **instruct model** in the GigaChat family, built on a **Mixture-of-Experts (MoE)** architecture. Key highlights: * ~702B total parameters, with ~36B active per token thanks to sparsity; * combines top-tier quality with practical inference speed; * inspired by DeepSeek V3 (MoE + MLA + MTP), but trained from scratch on our own corpus; * top-1 on MERA; * context length up to 131k tokens; * faster than GigaChat 2 Max. A bf16 checkpoint is available for fine-tuning: [\`ai-sage/GigaChat3-702B-A36B-preview-bf16\`](https://huggingface.co/ai-sage/GigaChat3-702B-A36B-preview-bf16). More details on the architecture and training will be published in a Habr article (to do). ### Benchmarks | Metric | GigaChat 3 Ultra | GigaChat 2 Max | | ------------------------- | ---------------: | -------------: | | MERA text | 0.683 | 0.663 | | MERA industrial | 0.645 / 0.824 | — | | MERA code | 0.338 | — | | AUTOLOGI_EN_ZERO_SHOT | 0.6857 | 0.6489 | | GPQA_COT_ZERO_SHOT | 0.5572 | 0.4714 | | HUMAN_EVAL_PLUS_ZERO_SHOT | 0.8659 | 0.7805 | | LBPP_PYTHON_ZERO_SHOT | 0.5247 | 0.4753 | | MMLU_PRO_EN_FIVE_SHOT | 0.7276 | 0.6655 | | GSM8K_FIVE_SHOT | 0.9598 | 0.9052 | | MATH_500_FOUR_SHOT | 0.7840 | 0.7160 | ### How to verify model metrics \`\`\`shell # lm-eval[api]==0.4.9.1 # sglang[all]==0.5.5 # or # vllm==0.11.2 export HF_ALLOW_CODE_EVAL=1 # sglang server up # 700B python -m sglang.launch_server --model-path --host 127.0.0.1 --port 30000 --nnodes 2 --node-rank <0/1> --tp 16 --ep 16 --dtype auto --mem-fraction-static 0.7 --trust-remote-code --allow-auto-truncate --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --dist-init-addr :50000 # mmlu pro check python -m lm_eval --model sglang-generate --output_path --batch_size 16 --model_args base_url=[http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer=](http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer=) --trust_remote_code --confirm_run_unsafe_code --num_fewshot 5 --tasks mmlu_pro \`\`\` --- ## GigaChat 3 Lightning (10B-A1.8B) ### Short description \`GigaChat3-10B-A1.8B\` is a compact next-generation MoE model. Key highlights: * 10B total parameters, with 1.8B active per token; * reaches the quality level of the leading open-source model in its class, Qwen3-4B; * ~1.5× faster generation speed, comparable to Qwen3-1.7B; * suitable for local use: * offline assistants; * prototyping and RAG; * LLM classifiers and high-load RAG scenarios; * a strong base for a high-performance CPU embedding model (thanks to MoE); * supported context up to 256k tokens. ### Base vs Instruct * **Base (pretrain)** — \`GigaChat3-10B-A1.8B-base\` Intended for fine-tuning and custom tasks. * **Instruct** — \`GigaChat3-10B-A1.8B\` Recommended for dialog scenarios, assistants, and instruction following. ## Base benchmarks Even though the model has 10B parameters, its direct analogs are 3–4B models. However, due to high generation speed, we compare it with smaller models too. ![image](https://cdn-uploads.huggingface.co/production/uploads/66356a710508bbbb61184fd2/3lxW7o5sJtzcIRpVMg15c.png) ### Instruct benchmarks | Metric | GigaChat 3 Lightning | Qwen3-1.7B-Instruct | Qwen3-4B-Instruct-2507 | SmolLM3 | | ------------------------- | -------------------: | ------------------: | ---------------------: | ------: | | MMLU_RU_FIVE_SHOT | 0.6833 | 0.4876 | 0.5972 | 0.4998 | | RUBQ_ZERO_SHOT | 0.6516 | 0.2557 | 0.3170 | 0.6363 | | MMLU_PRO_EN_FIVE_SHOT | 0.6061 | 0.410 | 0.6849 | 0.5013 | | MMLU_EN_FIVE_SHOT | 0.7403 | 0.60 | 0.7080 | 0.5992 | | BBH_THREE_SHOT | 0.4525 | 0.3317 | 0.7165 | 0.4161 | | SuperGPQA | 0.2731 | 0.2092 | 0.3745 | 0.2459 | | MATH_500_FOUR_SHOT | 0.7000 | 0.7520 | 0.8880 | 0.8020 | | GPQA_COT_ZERO_SHOT | 0.3502 | 0.2651 | 0.5370 | 0.3704 | | LiveCodeBench_ZERO_SHOT | 0.2031 | 0.0794 | 0.3046 | 0.1656 | | HUMAN_EVAL_PLUS_ZERO_SHOT | 0.6951 | 0.6280 | 0.8780 | 0.7012 | ### How to verify model metrics \`\`\`shell # lm-eval[api]==0.4.9.1 # sglang[all]==0.5.5 # or # vllm==0.11.2 export HF_ALLOW_CODE_EVAL=1 # sglang server up # 10B python -m sglang.launch_server --model-path --host 127.0.0.1 --port 30000 --tp 1 --dp 8 --dtype bfloat16 --mem-fraction-static 0.7 --trust-remote-code --allow-auto-truncate --speculative-algorithm EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 # mmlu pro check python -m lm_eval --model sglang-generate --output_path --batch_size 16 --model_args base_url=http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer= --trust_remote_code --confirm_run_unsafe_code --num_fewshot 5 --tasks mmlu_pro \`\`\` --- ## Data and training Both models were trained from scratch on a multilingual and diverse corpus. Shared corpus properties: * 10+ languages (including Russian, English, Chinese, Arabic, Uzbek, Kazakh, and others); * sources: * books and non-fiction; * academic data; * code and math datasets; * dialog and instruction datasets; * preprocessing: * deduplication; * language filtering; * automatic quality checks using heuristics and classifiers. Synthetic data made a substantial contribution to quality: * ~5.5T tokens of synthetic data; * the corpus includes: * QA pairs based on texts; * reverse-prompt chains for data structuring; * model comments and notes embedded into texts; * millions of synthetic tasks with solutions for math and competitive programming (with auto-generated tests) based on PromptCot. --- ## Inference and deployment ### Ultra (702B-A36B) \`GigaChat3 Ultra Preview\` is designed for cluster and on-prem scenarios with serious infrastructure: * popular inference engines are supported: * vLLM * SGLang * LMDeploy * TensorRT-LLM * and other frameworks; * BF16 and FP8 modes are supported (FP8 requires a separate build and GPU configuration); * MLA and MTP reduce KV-cache size and speed up generation; * using a proxy/gateway layer is recommended for integration with external services, tools, and agent frameworks. For a reference configuration, see guides for similarly sized models: * DeepSeek-V3 — *How to run locally*: [https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file#6-how-to-run-locally](https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file#6-how-to-run-locally) * Kimi-K2-Instruct — deployment guidance (vLLM / SGLang / LMDeploy): [https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/docs/deploy_guidance.md](https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/docs/deploy_guidance.md) ### Lightning (10B-A1.8B): performance One of the key advantages of \`GigaChat3-10B-A1.8B\` is inference speed. The model (especially in MTP mode) delivers throughput comparable to much smaller dense models while being noticeably stronger in quality. Measurements were done with vLLM v0.11.0, dtype bfloat16, \`batch_size=1\`. Benchmark code: [https://gist.github.com/ajpqs/ce941aa6f0f48ef36a65cb87a2a1d726](https://gist.github.com/ajpqs/ce941aa6f0f48ef36a65cb87a2a1d726). | Model | request_throughput | output_throughput | total_token_throughput | mean_ttft_ms | | ------------------------------ | ------------------ | ----------------- | ---------------------- | ------------ | | \`Qwen3-1.7B\` | 1.689 | 357.308 | 726.093 | 11.824 | | \`mtp-GigaChat3-10B-A1.8B-base\` | 1.533 | 333.620 | 678.894 | 26.345 | | \`GigaChat3-10B-A1.8B-base\` | 1.077 | 234.363 | 476.912 | 31.053 | | \`Qwen3-4B\` | 0.978 | 206.849 | 420.341 | 14.947 | | \`Qwen3-8B\` | 0.664 | 140.432 | 285.375 | 16.663 | | \`YandexGPT-5-Lite-8B-pretrain\` | 0.641 | 147.305 | 300.269 | 16.711 | Despite having 10B parameters, in terms of speed and inference cost the model can be treated as an alternative to 3–4B dense models, and in MTP mode it can be compared even to smaller ones. --- ## Shared model properties Both models: * are not specialized reasoning models (but support a basic level of reasoning); * can predict multiple tokens per step (MTP); * use MLA, reducing KV-cache size and memory requirements; * are trained from scratch without initialization from third-party weights; * are compatible with: * Hugging Face; * vLLM / SGLang / LMDeploy; * standard inference and fine-tuning pipelines; * are released under the MIT license and can be used in commercial products. --- ## Quickstart: GigaChat3-10B-A1.8B-base ### 1. \`transformers\` \`\`\`python import torch from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig model_name = "ai-sage/GigaChat3-10B-A1.8B-base" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype=torch.bfloat16, device_map="auto", ) model.generation_config = GenerationConfig.from_pretrained(model_name) prompt = "Ниже я написал подробное доказательство теоремы о неподвижной точке:" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model.generate( inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=400, ) result = tokenizer.decode( outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False, ) print(result) \`\`\` ### 2. \`vLLM\` Start a server: \`\`\`bash vllm serve ai-sage/GigaChat3-10B-A1.8B-base \ --disable-sliding-window \ --dtype "auto" \ --speculative-config '\{"method": "mtp", "num_speculative_tokens": 1, "disable_padded_drafter_batch": false\}' \`\`\` Example request: \`\`\`bash curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '\{ "model": "ai-sage/GigaChat3-10B-A1.8B-base", "prompt": "Ниже я написал подробное доказательство теоремы о неподвижной точке:", "max_tokens": 400, "temperature": 0 \}' \`\`\` ### 3. \`SGLang\` Start a server: \`\`\`bash python -m sglang.launch_server \ --model-path ai-sage/GigaChat3-10B-A1.8B-base \ --host 0.0.0.0 \ --port 30000 \ --dtype auto \ --mem-fraction-static 0.88 \ --speculative-algorithm NEXTN \ --speculative-num-steps 1 \ --speculative-eagle-topk 1 \`\`\` Example request: \`\`\`bash curl http://localhost:30000/v1/completions \ -H "Content-Type: application/json" \ -d '\{ "model": "ai-sage/GigaChat3-10B-A1.8B-base", "prompt": "Ниже я написал подробное доказательство теоремы о неподвижной точке:", "max_tokens": 400, "temperature": 0 \}' \`\`\` --- ## License The models are released under the **MIT** license. You may use them in research and commercial projects provided that the license text is preserved.

Prompts

Reviews

Tags

Write Your Review

Detailed Ratings

ALL
Correctness
Helpfulness
Interesting
Upload Pictures and Videos

Name
Size
Type
Download
Last Modified
  • Community

Add Discussion

Upload Pictures and Videos