### Environment Configuration

- Recommended Mode: AVD on Mac (arm64), validated in our experiments.
- App Setup: Manual installation and task-specific configuration required.
- Compatibility Note: The original Docker images are not compatible with AVD environments.
### Model Deployment & Inference

**vLLM Integration**:
- Inference scripts are available in the ./vllm_script/ directory
- Optimized for efficient small-model serving

**Model Access**:
- OpenPhone Weights: 3B-parameter model hosted on HuggingFace
- Deployment Process: Download weights → Deploy via vLLM → Configure inference service
- Service Ready: Seamless integration with the evaluation pipeline
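The deployment flow above can be sketched as a few shell commands. This is a configuration sketch only: the HuggingFace repository name and the port are placeholders, and the actual scripts in ./vllm_script/ should be preferred.

```shell
# Download the 3B weights (repository name is a placeholder — check the project page)
huggingface-cli download <org>/OpenPhone-3B --local-dir ./OpenPhone-3B

# Serve the model with vLLM's OpenAI-compatible server (port choice is arbitrary)
vllm serve ./OpenPhone-3B --port 8000

# Point the evaluation pipeline's inference service at http://localhost:8000/v1
```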
### Pre-Testing Configuration

- API Setup Required: Configure cloud model credentials in ./evaluation/evaluation.py (Lines 63, 75, and 81)
- Coming Soon: A streamlined configuration interface is in development
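Until the streamlined interface lands, the credentials at those lines typically look something like the following. The variable names here are illustrative placeholders, not the actual identifiers in evaluation.py:

```python
# Illustrative placeholders only — open ./evaluation/evaluation.py and edit
# the credentials at the lines noted above; the actual names may differ.
CLOUD_API_KEY = "your-api-key"
CLOUD_API_BASE = "https://api.your-provider.com/v1"  # OpenAI-compatible endpoint
CLOUD_MODEL = "your-cloud-model-name"
```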
---

## Key Features of OpenPhone

### Lightweight Agentic Foundation Models

- **Compact Architecture**: Specialized **3B-scale** Vision-Language Models optimized for mobile GUI tasks with a minimal computational footprint.
- **On-Device Deployment**: Truly smartphone-compatible models that maintain competitive performance while running locally, without cloud dependency.

### Device-Cloud Collaboration Framework

- **Dynamic Orchestration**: Real-time task complexity assessment that intelligently switches between device and cloud models based on execution requirements.
- **Cost-Performance Optimization**: Strategic resource allocation that leverages cost-efficient on-device models while compensating for their limitations through selective cloud model usage.

### Comprehensive Mobile Agent Evaluation Playground

- **Extended Benchmark Suite**: Beyond AndroidLab, incorporates 25+ additional tasks across popular mobile applications for real-world validation.
- **Multi-Dimensional Assessment**: Comprehensive evaluation covering performance metrics, computational efficiency, and practical deployment scenarios.

---

## Technical Innovation & Implementation

### Model Training: SFT + RL

- **Synthetic Data Generation**: Leverages advanced MLLMs to create high-quality reasoning-chain training data, addressing the scarcity of manual annotations.
- **Two-Stage Training**: SFT injects foundational GUI knowledge, then GRPO reinforcement learning optimizes task completion accuracy.
- **Small Model Enhancement**: Enables 3B models to achieve performance comparable to 7B-9B models on GUI tasks through structured training.

### Device-Cloud Collaboration Framework

- **Dynamic Task Assessment**: Real-time complexity evaluation determines when and how frequently to monitor device model performance.
- **Intelligent Orchestration**: Seamlessly switches between device and cloud models based on execution progress and failure patterns.
- **Cost-Performance Optimization**: Reduces cloud invocations by ~10% while maintaining high task success rates through strategic resource allocation.

### Efficient Memory Mechanism for Mobile Agents

- **Long-Horizon Reasoning**: Multi-step chain-of-thought reasoning with reflective error correction to enhance decision-making.
- **Text-Based Summarization**: Compresses high-resolution screenshots into compact textual representations for efficient memory management.
- **Structured Context Retention**: Maintains 10-20 steps of historical context in resource-constrained environments through optimized token usage.
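As an illustration of the memory mechanism above (a sketch, not the project's actual implementation), the retained context can be modeled as a fixed-size buffer of textual step summaries, where each screenshot is assumed to have already been compressed into a short text description:

```python
from collections import deque


class StepMemory:
    """Rolling window of textual step summaries (illustrative sketch,
    not OpenPhone's actual implementation)."""

    def __init__(self, max_steps: int = 15):  # 10-20 steps per the text above
        self._steps = deque(maxlen=max_steps)

    def record(self, action: str, screen_summary: str) -> None:
        """Append one step; the oldest step is dropped once full."""
        self._steps.append(f"action={action}; screen={screen_summary}")

    def context(self) -> str:
        """Render the retained history as a compact prompt section."""
        return "\n".join(
            f"step {i}: {s}" for i, s in enumerate(self._steps, start=1)
        )


mem = StepMemory(max_steps=15)
for t in range(20):  # after 20 steps, only the last 15 are retained
    mem.record(f"tap_{t}", f"screen_{t}")
print(len(mem.context().splitlines()))  # 15
```

The fixed `maxlen` keeps token usage bounded regardless of episode length, which is the point of text-based summarization on resource-constrained devices.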
---
## Testing & Evaluation
### Single Task Testing
Test individual tasks using the following command structure:
```bash
python eval.py -n <test_name> -c <path/to/config.yaml> --task_id <task_id>
```
Example Usage:
```bash
python eval.py -n all_cloud_v1_hyper -c ./configs/example_xml_cloud_hyper.yaml --task_id zoom_1
```
### Batch Evaluation Scripts
Convenient batch testing scripts are available in `./test_script`:

- `all_test_cloud_v1_hyper.sh`: Evaluates all 138 AndroidLab benchmark tasks
- `all_test_cloud_v1_hyper_add.sh`: Evaluates tasks for the four additional mobile apps
### Additional App Documentation

For comprehensive details about the four additional app tasks, see the [Additional Apps Documentation](./docs/new_apps.md).

---

## Result Generation

### LLM Evaluator Setup

Required Configuration: Set up LLM service credentials in ./evaluation/tasks/llm_evaluator.py:
- Line 10: API configuration
- Line 12: Service URL

Enhancement: Our implementation replaces AndroidLab's rule-based evaluation with LLM-powered assessment, providing more nuanced and accurate task completion evaluation.

### Generate Evaluation Results

Run result generation with the following command:

```bash
python generate_result.py --input_folder ./logs/evaluation/ --output_folder ./logs/evaluation/ --output_excel ./logs/evaluation/test_name.xlsx
```

### Batch Testing File Management

Important: When using batch scripts from ./test_script/:
- Manual Transfer Required: Move generated evaluation files from the script directory to ./logs/
- Then Execute: Run the result generation command above
- Error Prevention: This step prevents file path conflicts and ensures proper result compilation
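The transfer step can be scripted. This is a sketch only: it assumes the evaluation files sit directly in the batch-script directory and simply leaves the `.sh` scripts themselves in place; adjust the filtering to match what your run actually produced.

```python
import shutil
from pathlib import Path


def collect_evaluation_files(script_dir: str, logs_dir: str = "./logs") -> list[str]:
    """Move generated evaluation files out of the batch-script directory
    into the logs directory, so generate_result.py sees consistent paths.
    Sketch only — adapt the filter to your run's actual output layout."""
    dest = Path(logs_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(Path(script_dir).iterdir()):
        if entry.suffix == ".sh":  # keep the batch scripts where they are
            continue
        shutil.move(str(entry), str(dest / entry.name))
        moved.append(entry.name)
    return moved
```

After moving the files, run the `generate_result.py` command above against the logs directory.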
---

## Key Evaluation Findings for OpenPhone

### Small Model, Big Performance

- **Size vs. Performance**: OpenPhone-3B achieves performance comparable to 9B models while retaining the deployment advantages of a compact architecture.
- **Efficiency Champion**: Establishes itself as a genuine "small powerhouse" that challenges the bigger-is-better assumption in mobile AI.

### Competitive Performance

- **Against Proprietary Models**: OpenPhone-3B shows respectable performance compared to lightweight versions of proprietary models on standard benchmarks.
- **Potential of Small Models**: Demonstrates promising results that validate the viability of compact open-source approaches in mobile agent development.

### Device-Cloud Framework Works

- **Performance with Efficiency**: OpenPhone's hybrid architecture delivers near-optimal performance while dramatically reducing cloud model usage.
- **Intelligent Routing**: Shows that smart task routing yields practical efficiency gains without sacrificing capability.

### Longer Prompts Don't Always Help

- **Context Matters**: Extended prompting strategies only improve performance when paired with sufficiently capable cloud models.
- **Smart Matching**: Highlights the importance of matching reasoning complexity to model capability rather than assuming longer prompts always help.
| Model | GPUs | Size | SR | Time Cost / Step |
| ---------------------- | ----------- | ---- | ---- | ---------------- |
| Qwen2.5-VL-7B-Instruct | Single 3090 | 7B | 10.1 | 6289.15 ms |
| OpenPhone | Single 3090 | 3B | 15.2 | 4170.63 ms |
| GLM-4.1V-9B-Thinking | Two 3090s | 9B | 24.6 | 14584.89 ms |
| Qwen2.5-VL-7B-Instruct | Two 3090s | 7B | 10.1 | 4587.79 ms |
| OpenPhone | Two 3090s | 3B | 15.2 | 3524.25 ms |
### Speed Advantage
- **Clear Winner**: OpenPhone demonstrates significant inference speed advantages thanks to its lightweight 3B architecture
- **Real-World Ready**: Speed benefits become increasingly pronounced under constrained computational resources, matching typical edge deployment scenarios
### Quantified Comparison
- **3.5x Faster**: OpenPhone on single 3090 vs GLM-4.1V-9B-Thinking on dual 3090s.
- **4x Faster**: OpenPhone on dual 3090s vs GLM-4.1V-9B-Thinking on dual 3090s.
- **Deployment Flexibility**: GLM-4.1V-9B-Thinking's inability to run on a single 3090 severely limits its edge deployment options, while OpenPhone's lightweight design keeps single-GPU deployment open.
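The speedup figures follow directly from the per-step latencies in the table above:

```python
# Per-step latencies (ms) from the benchmark table
glm_dual = 14584.89         # GLM-4.1V-9B-Thinking, two 3090s
openphone_single = 4170.63  # OpenPhone, single 3090
openphone_dual = 3524.25    # OpenPhone, two 3090s

print(round(glm_dual / openphone_single, 2))  # 3.5  -> "3.5x faster" on one GPU
print(round(glm_dual / openphone_dual, 2))    # 4.14 -> "~4x faster" on two GPUs
```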
### Practical Implications
The trade-off is clear: while larger models like GLM-4.1V-9B-Thinking achieve higher task performance, OpenPhone's speed advantages make it far more suitable for real-world on-device scenarios where response time and hardware constraints matter.
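This speed advantage is what makes device-first routing attractive. As a minimal sketch of the device-cloud switching policy described earlier (illustrative only: the thresholds and signals are assumptions, not OpenPhone's actual implementation):

```python
def choose_model(step: int, consecutive_failures: int,
                 complexity: float, check_every: int = 3) -> str:
    """Sketch of a device-cloud switching policy (assumed thresholds).

    - Default to the on-device 3B model (cheap, fast).
    - Escalate to the cloud model on repeated device-model failures.
    - Re-assess task complexity only every `check_every` steps to
      limit monitoring cost.
    """
    if consecutive_failures >= 2:  # failure pattern: escalate
        return "cloud"
    if step % check_every == 0 and complexity > 0.8:
        return "cloud"             # periodic complexity check
    return "device"


print(choose_model(step=3, consecutive_failures=0, complexity=0.9))  # cloud
print(choose_model(step=4, consecutive_failures=0, complexity=0.9))  # device
print(choose_model(step=5, consecutive_failures=2, complexity=0.1))  # cloud
```

Checking complexity only periodically, rather than on every step, is one way to keep the orchestration overhead itself from eroding the on-device cost savings.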
---
## Citation
If you find this work helpful to your research, please kindly consider citing our paper.
```bibtex
@article{jiang2025lightagent,
  title={LightAgent: Mobile Agentic Foundation Models},
  author={Jiang, Yangqin and Huang, Chao},
  journal={arXiv preprint arXiv:2510.22009},
  year={2025}
}
```
## Related Projects
OpenPhone builds upon excellent open-source projects. We sincerely thank their authors and contributors:
- [AndroidLab](https://github.com/THUDM/Android-Lab) - The benchmark framework.
- [R1-V](https://github.com/StarsfieldAI/R1-V) - Implementation details for the GRPO training methodology.
- [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) - The unified training framework enabling efficient model fine-tuning.
## License
This project is released under the [MIT License](./LICENSE).
**If this project helps you, please give us a Star!**

**Empower AI Phones with Agents!**

Thanks for visiting OpenPhone!