What is the primary purpose of the AgRAG LLM Judge skill?

The skill is designed to evaluate and benchmark the performance of Agentic GraphRAG systems by running headless prompts and judging responses against a structured rubric.

Can I use this for non-telecommunications RAG systems?

While the repository focuses on telecommunications, the evaluation methodology and scripts are applicable to any GraphRAG system using similar agentic patterns.

Does it support testing multiple prompts at once?

Yes, the skill includes a batch helper script to run multiple prompts in a shared thread, maintaining a consistent context window for the evaluation.
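As a rough illustration of the shared-thread idea, the sketch below runs several prompts under a single thread ID and records each exchange. The `run_batch` function, the `run_prompt` hook, and the transcript shape are hypothetical stand-ins; the skill's actual batch helper and its command-line interface are not reproduced here.

```python
import uuid
from typing import Callable, Dict, List, Optional


def run_batch(
    prompts: List[str],
    run_prompt: Callable[[str, str], str],
    thread_id: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Run every prompt under one thread ID and return a transcript.

    `run_prompt(prompt, thread_id)` is a hypothetical hook that sends one
    headless prompt to the agent and returns its reply.
    """
    # One ID for the whole batch, so the agent sees a single conversation.
    thread_id = thread_id or uuid.uuid4().hex
    transcript = []
    for prompt in prompts:
        reply = run_prompt(prompt, thread_id)
        transcript.append(
            {"thread_id": thread_id, "prompt": prompt, "response": reply}
        )
    return transcript


if __name__ == "__main__":
    # Stand-in agent that only echoes; replace with a real headless call.
    def fake_agent(prompt: str, thread_id: str) -> str:
        return f"[{thread_id[:8]}] echo: {prompt}"

    for turn in run_batch(["Which tests cover handover?", "And for 5G?"], fake_agent):
        print(turn["response"])
```

Generating the ID once, outside the loop, is what keeps every prompt in the same conversation; passing an explicit `thread_id` makes a run repeatable.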

What metrics are used in the LLM-as-judge evaluation?

Responses are scored on a 1-10 scale across five areas: Accuracy, Relevance, Groundedness, Graph Path Validity, and Tool Efficiency.
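One way to hold such a score in code is sketched below. The `JudgeScore` class and its field names are illustrative, not the skill's own data model, and the unweighted mean is an assumption: the listing does not say how the five scores are combined.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class JudgeScore:
    """One judged response: five criteria, each an integer from 1 to 10."""

    accuracy: int
    relevance: int
    groundedness: int
    graph_path_validity: int
    tool_efficiency: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1-10 scale before it reaches a report.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(f"{f.name} must be an integer in 1..10, got {value!r}")

    def mean(self) -> float:
        """Unweighted average; equal weighting is an assumption of this sketch."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```

For example, `JudgeScore(8, 9, 7, 6, 5).mean()` returns `7.0`, while `JudgeScore(11, 1, 1, 1, 1)` raises `ValueError`.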

How does this skill help improve RAG accuracy?

It identifies specific failure modes, such as 'invented relationship labels' or 'wrong entity types', and suggests diagnostic fixes that help keep the agent grounded in the retrieved evidence.
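To make the two named failure modes concrete, the sketch below checks the hops an agent cites against a graph schema. The schema contents, the `Hop` tuple shape, and the `find_failure_modes` function are all hypothetical; the skill's diagnostic script may work quite differently.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: relationship label -> (source entity type, target entity type).
SCHEMA: Dict[str, Tuple[str, str]] = {
    "VERIFIES": ("TestCase", "Requirement"),
    "DEPENDS_ON": ("Function", "Function"),
}

# A cited hop: (source entity type, relationship label, target entity type).
Hop = Tuple[str, str, str]


def find_failure_modes(
    path: List[Hop], schema: Dict[str, Tuple[str, str]] = SCHEMA
) -> List[str]:
    """Return one finding per hop that the schema does not support."""
    findings = []
    for source_type, label, target_type in path:
        if label not in schema:
            # The label does not exist in the graph at all.
            findings.append(f"invented relationship label: {label}")
        elif schema[label] != (source_type, target_type):
            # The label exists but connects different entity types.
            expected = schema[label]
            findings.append(
                f"wrong entity types for {label}: "
                f"got {source_type}->{target_type}, "
                f"expected {expected[0]}->{expected[1]}"
            )
    return findings
```

Under this assumed schema, a hop `("TestCase", "COVERS", "Requirement")` is reported as an invented label, and `("Function", "VERIFIES", "Requirement")` as a type mismatch.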

AgRAG LLM Judge

Name: AgRAG LLM Judge
Author: Berkay2002

Security & Testing

Evaluates Agentic GraphRAG performance using automated benchmarking and LLM-as-judge scoring rubrics.

The AgRAG LLM Judge skill lets developers systematically assess Agentic GraphRAG systems, particularly those built for complex domains such as telecommunications test scope analysis. It runs batch prompts in headless mode under a consistent thread ID, so that retrieval quality, grounding, and tool usage are evaluated against the same conversation context. A structured rubric scores criteria such as graph path validity and tool efficiency, and the skill produces detailed transcripts and diagnostic reports that help developers identify failure modes and apply the recommended fixes.

Key Features

01 Automated tool usage auditing to detect redundant searches and missing results

02 Detailed transcript generation including entity IDs and graph traversal paths

03 Diagnostic script for identifying RAG-specific failure modes and proposing fixes

04 Headless batch processing with consistent thread-id context tracking

05 Structured LLM-as-judge scoring against a five-criterion performance rubric, each criterion scored from 1 to 10

Use Cases

01 Validating graph path integrity for domain-specific knowledge retrieval

02 Benchmarking retrieval accuracy in complex GraphRAG implementations

03 Debugging hallucination and grounding issues in agentic tool-calling workflows

Install with 🐟 Skill.Fish

npx skillfish add berkay2002/agentic-rag-test-scope-analysis agrag-llm-judge

The skill can also be downloaded for use in Claude.ai and ChatGPT.