# LOTUS: LLM-Powered Data Processing Made Fast, Easy, and Robust

[![Colab Demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing)
[![Arxiv](https://img.shields.io/badge/arXiv-2407.11418-B31B1B.svg)][#arxiv-paper-package]
[![Slack](https://img.shields.io/badge/slack-lotus-purple.svg?logo=slack)][#slack]
[![Documentation Status](https://readthedocs.org/projects/lotus-ai/badge/?version=latest)](https://lotus-ai.readthedocs.io/en/latest/?badge=latest)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/lotus-ai)][#pypi-package]
[![PyPI](https://img.shields.io/pypi/v/lotus-ai)][#pypi-package]

[#license-gh-package]: https://lbesson.mit-license.org/
[#arxiv-paper-package]: https://arxiv.org/abs/2407.11418
[#pypi-package]: https://pypi.org/project/lotus-ai/
[#slack]: https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg

LOTUS is a framework that allows you to easily process your datasets, including unstructured and structured data, with LLMs. It provides an **intuitive Pandas-like API**, offers algorithms for **optimizing your programs for up to 1000x speedups**, and makes LLM-based data processing **robust with accuracy guarantees** with respect to high-quality reference algorithms.

LOTUS stands for **L**LMs **O**ver **T**ext, **U**nstructured and **S**tructured Data, and it implements [**semantic operators**](https://arxiv.org/abs/2407.11418), which extend the core philosophy of relational operators (designed for declarative and robust _structured-data_ processing) to _unstructured-data_ processing with AI. Semantic operators are expressive, allowing you to easily capture all of your data-intensive AI programs, from simple RAG to document extraction, image classification, LLM-judge evals, unstructured data analysis, and more.

For troubleshooting or feature requests, please raise an issue and we'll get to it promptly.
To share feedback and applications you're working on, you can send us a message on our [community slack](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg), or send an email (lianapat@stanford.edu).

# Installation
For the latest stable release:
```
conda create -n lotus python=3.10 -y
conda activate lotus
pip install lotus-ai
```

For the latest features, you can alternatively install as follows:
```
conda create -n lotus python=3.10 -y
conda activate lotus
pip install git+https://github.com/lotus-data/lotus.git@main
```

## Running on Mac
If you are running on a Mac, please install Faiss via conda:

### CPU-only version
```
conda install -c pytorch faiss-cpu=1.8.0
```

### GPU(+CPU) version
```
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
```

For more details, see [Installing FAISS via Conda](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md#installing-faiss-via-conda).

# Quickstart
If you're already familiar with Pandas, getting started will be a breeze! Below we provide a simple example program using the semantic join operator. The join, like many semantic operators, is specified by a **langex** (natural language expression), which the programmer uses to specify the operation. Each langex is parameterized by one or more table columns, denoted in brackets. The join's langex serves as a predicate and is parameterized by a right and left join key.

```python
import pandas as pd
import lotus
from lotus.models import LM

# configure the LM, and remember to export your API key
lm = LM(model="gpt-4.1-nano")
lotus.settings.configure(lm=lm)

# create dataframes with course names and skills
courses_data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}
skills_data = {"Skill": ["Math", "Computer Science"]}
courses_df = pd.DataFrame(courses_data)
skills_df = pd.DataFrame(skills_data)

# lotus sem join
res = courses_df.sem_join(skills_df, "Taking {Course Name} will help me learn {Skill}")
print(res)

# Print total LM usage
lm.print_total_usage()
```

### Tutorials
Below are some short tutorials in Google Colab to help you get started. We recommend starting with `[1] Introduction to Semantic Operators and LOTUS`, which gives a broad overview of useful functionality.

| Tutorial | Difficulty | Colab Link |
|----------|------------|------------|
| 1. Introduction to Semantic Operators and LOTUS | ![](https://img.shields.io/badge/Level-Beginner-green.svg) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing) |
| 2. Failure Analysis Over Agent Traces | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EJm9A8r_ShYxR0s218J70XhsopOgeT6k?usp=sharing) |
| 3. System Prompt Analysis with LOTUS | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NSVQYOMp2GCre5ZRgvgs6BPGOa20ySMc?usp=sharing) |
| 4. Processing Multimodal Datasets | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18oaa12T6PrhHIYGw-L01gw1bDmTYaE_e) |

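
To make the langex convention from the Quickstart more concrete, here is a minimal, LLM-free sketch of what a semantic join does conceptually: read the column names written in curly brackets, fill them in for every pair of rows, and keep the pairs for which a predicate accepts the resulting statement. This is not how LOTUS implements `sem_join`; the helpers `langex_columns`, `render`, `naive_sem_join`, and the keyword-based `toy_predicate` are hypothetical names introduced only for illustration, and a real run would ask the configured LM instead of matching keywords.

```python
import re
from itertools import product

# Matches a column reference such as {Course Name} inside a langex.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def langex_columns(langex):
    """Return the column names referenced in the langex, in order."""
    return PLACEHOLDER.findall(langex)


def render(langex, row):
    """Substitute each {Column} placeholder with the row's value."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), langex)


def naive_sem_join(left, right, langex, predicate):
    """Nested-loop join: keep row pairs whose rendered langex passes the predicate."""
    joined = []
    for left_row, right_row in product(left, right):
        row = {**left_row, **right_row}
        if predicate(render(langex, row)):
            joined.append(row)
    return joined


def toy_predicate(statement):
    """Hypothetical stand-in for an LM call: a fixed keyword rule."""
    return "Compilers" in statement and "Computer Science" in statement


courses = [{"Course Name": "Compilers"}, {"Course Name": "Food Science"}]
skills = [{"Skill": "Math"}, {"Skill": "Computer Science"}]
langex = "Taking {Course Name} will help me learn {Skill}"

columns = langex_columns(langex)
matches = naive_sem_join(courses, skills, langex, toy_predicate)
print(columns)
print(matches)
```

The nested loop evaluates the predicate once per row pair, so its cost grows with the product of the two table sizes. That quadratic cost is the kind of baseline the optimizations mentioned in this README aim to improve on.
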
## Key Concept: The Semantic Operator Model
LOTUS introduces the semantic operator programming model. Semantic operators are declarative transformations over one or more datasets, parameterized by a natural language expression, that can be implemented by a variety of AI-based algorithms. Semantic operators seamlessly extend the relational model, operating over tables that may contain traditional structured data as well as unstructured fields, such as free-form text. These modular language-based operators allow you to write AI-based pipelines with high-level logic, leaving optimizations to the query engine. Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans, similar to relational operators. To learn more about the semantic operator model, read the full [research paper](https://arxiv.org/abs/2407.11418).

LOTUS offers a number of semantic operators in a Pandas-like API, some of which are described below. To learn more about the semantic operators provided in LOTUS, check out the full [documentation](https://lotus-ai.readthedocs.io/en/latest/), run the [colab tutorial](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing), or refer to these [examples](https://github.com/TAG-Research/lotus/tree/main/examples/op_examples).

| Operator | Description |
|----------|-------------|
| sem_map | Map each record using a natural language projection |
| sem_filter | Keep records that match the natural language predicate |
| sem_extract | Extract one or more attributes from each row |
| sem_agg | Aggregate across all records (e.g. for summarization) |
| sem_topk | Order the records by some natural language sorting criteria |
| sem_join | Join two datasets based on a natural language predicate |
| sem_sim_join | Join two DataFrames based on semantic similarity |
| sem_search | Perform semantic search over a text column |

# Supported Models
There are 3 main model classes in LOTUS:
- `LM`: The language model class.
    - The `LM` class is built on top of the `LiteLLM` library, and supports any model that is supported by `LiteLLM`. See [this page](CONTRIBUTING.md) for examples of using models on `OpenAI`, `Ollama`, and `vLLM`. Any provider supported by `LiteLLM` should work. Check out [litellm's documentation](https://litellm.vercel.app) for more information.
- `RM`: The retrieval model class.
    - Any model from `SentenceTransformers` can be used with the `SentenceTransformersRM` class, by passing the model name to the `model` parameter (see [an example here](examples/op_examples/dedup.py)). Additionally, `LiteLLMRM` can be used with any model supported by `LiteLLM` (see [an example here](examples/op_examples/sim_join.py)).
- `Reranker`: The reranker model class.
    - Any `CrossEncoder` from `SentenceTransformers` can be used with the `CrossEncoderReranker` class, by passing the model name to the `model` parameter (see [an example here](examples/op_examples/search.py)).

# Feature Requests and Contributing

We welcome contributions from the community! Whether you're reporting bugs, suggesting features, or contributing code, we have comprehensive templates and guidelines to help you get started.

## Getting Started

Before contributing, please:

1. **Read our [Contributing Guide](CONTRIBUTING.md)** - Comprehensive guidelines for contributors
2. **Check existing issues** - Avoid duplicates by searching existing issues and pull requests
3. **Join our community** - Connect with us on [Slack](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg)

## Development Setup

For development setup and detailed contribution guidelines, see our [Contributing Guide](CONTRIBUTING.md).

## Community

- **Slack**: [Join our community](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg)
- **Email**: lianapat@stanford.edu
- **Discussions**: [GitHub Discussions](https://github.com/lotus-data/lotus/discussions)

We're excited to see what you build with LOTUS!

# References
For recent updates related to LOTUS, follow [@lianapatel_](https://x.com/lianapatel_) on X.

If you find LOTUS or semantic operators useful, we'd appreciate it if you could cite this work as follows:
```bibtex
@article{patel2025semanticoptimization,
    title = {Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS},
    author = {Patel, Liana and Jha, Siddharth and Pan, Melissa and Gupta, Harshit and Asawa, Parth and Guestrin, Carlos and Zaharia, Matei},
    year = {2025},
    journal = {Proc. VLDB Endow.},
    url = {https://doi.org/10.14778/3749646.3749685},
}
@article{patel2024semanticoperators,
    title={Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data},
    author={Liana Patel and Siddharth Jha and Parth Asawa and Melissa Pan and Carlos Guestrin and Matei Zaharia},
    year={2024},
    eprint={2407.11418},
    url={https://arxiv.org/abs/2407.11418},
}
```
