Empowers large language models with visual and auditory perception, enabling natural human-computer interactions.
AgentRobot is a multi-agent human-computer interaction system designed to enhance large language models by integrating visual recognition, speech recognition, and speech synthesis. The system's architecture comprises specialized agents, including a Brain Agent for decision-making, an Eye Agent for visual input processing, an Ear Agent for audio input processing, and a Mouth Agent for audio output. These agents collaborate to achieve a natural and intuitive human-computer interaction experience, giving large language models the ability to 'see,' 'hear,' and 'speak'.