Modern AI/LLM deployments often face challenges in managing diverse model backends, making complex routing decisions, and combining outputs from multiple models, all while handling production-grade concerns such as concurrency, streaming, and retries. This project offers a production-ready proof-of-concept for an MCP (Model Compute Paradigm) architecture: a FastAPI-based microserver that orchestrates multiple AI/LLM models behind a unified, scalable interface. It supports dynamic task routing, LLM-based intent parsing, multi-model pipelines, and streaming chat, and is built for asynchronous, Dockerized deployment.
Key Features
- Async FastAPI server for high concurrency
- Intelligent, LLM-powered model routing to specific or sequential tasks
- Streaming responses for chat and other real-time interactions
- Metadata-driven model registry for flexible model configuration
- Dockerized for seamless production deployment
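To illustrate what "metadata-driven model registry" can mean in practice, here is a small sketch in plain Python. The `ModelEntry` fields and the sample entries are assumptions for illustration, not the project's actual schema:

```python
# Sketch of a metadata-driven model registry: models are selected by their
# declared metadata (task, streaming support) rather than hard-coded names.
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    name: str
    task: str                 # e.g. "chat", "summarize"
    endpoint: str             # backend URL or local handle
    streaming: bool = False
    tags: list = field(default_factory=list)

class ModelRegistry:
    def __init__(self) -> None:
        self._models: dict[str, ModelEntry] = {}

    def register(self, entry: ModelEntry) -> None:
        self._models[entry.name] = entry

    def for_task(self, task: str) -> list[ModelEntry]:
        # Routing decision driven purely by metadata.
        return [m for m in self._models.values() if m.task == task]

registry = ModelRegistry()
registry.register(ModelEntry("gpt-chat", "chat", "http://chat:8000", streaming=True))
registry.register(ModelEntry("summarizer", "summarize", "http://sum:8000"))
```

Keeping configuration in data like this lets new backends be added (or loaded from a config file) without changing routing code.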
Use Cases
- Develop AI pipelines (e.g., RAG, RL) by sequencing multiple model executions
- Build your own ChatGPT-style API with streaming capabilities
- Create an intelligent task router that parses user intent for AI services
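The first use case, sequencing model executions, can be sketched as a two-stage async pipeline. The `retrieve` and `generate` stages below are stand-ins; a real RAG pipeline would call a retriever and an LLM backend:

```python
# Sketch of a sequential multi-model pipeline (retrieve -> generate),
# with hypothetical stage functions standing in for real model calls.
import asyncio

async def retrieve(query: str) -> str:
    # Stage 1: a real retriever would return relevant documents.
    return f"context for '{query}'"

async def generate(query: str, context: str) -> str:
    # Stage 2: a real LLM would condition its answer on the context.
    return f"answer to '{query}' using {context}"

async def pipeline(query: str) -> str:
    context = await retrieve(query)
    return await generate(query, context)

result = asyncio.run(pipeline("what is MCP?"))
```

Because each stage is a coroutine, stages that do not depend on each other could also be fanned out concurrently with `asyncio.gather`.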