AI Assessment Catalogue

The AI Assessment Catalogue showcases the evaluation tools, testing frameworks, and assessment solutions available across the Citcom.ai Testing and Experimentation Facility (TEF) network. It is updated regularly as new methodologies and tools become available at each TEF site. To request an assessment or learn more about a tool, please contact the relevant TEF site.

Solution Name	Provider	Licensing Type	Project Phase / TRL	Domain of Application	AI Risk Category	Ethical Dimensions	Data Security & Privacy	Resources	Example Use Case
FAIRGAME	LIST	Open-source	TRL 6–8	LLM bias testing, AI agent behavioural testing, jailbreak testing	General Purpose AI	Fairness, Robustness	Depends on the use case (whether the chatbot/AI agent has access to sensitive data)	GitHub: https://github.com/aira-list/FAIRGAME, Paper: "FAIRGAME: A Framework for AI Agents Bias Recognition Using Game Theory", Frontiers in AI and Applications, Vol. 413: ECAI 2025	A city aims to test its citizen-facing chatbot before launch. FAIRGAME uses LLMs to create simulated users with diverse identities, personalities, and requests, so the chatbot can be evaluated in dynamic, realistic conversations.
MLA-BiTe	LIST	To be open sourced	TRL 6–8	LLM bias testing	General Purpose AI	Fairness, Robustness	No data privacy requirements	—	A city plans to evaluate fairness in its citizen-facing chatbot. MLA-BiTe allows non-technical staff to create local scenario-based prompts to uncover discriminatory behaviour across sensitive categories, supporting multiple languages and augmentations.
Legal KG-RAG	LIST	Proprietary	TRL 5–7	LLM factual accuracy testing	General Purpose AI	Transparency, Explainability, Robustness	Depends on whether the RAG is performed on sensitive data	—	A city using a standard RAG pipeline obtains irrelevant results. Legal KG-RAG rebuilds the legal corpus as a Neo4j knowledge graph, enabling direct comparison between traditional and KG-enhanced retrieval.
MLA-Reject	LIST	To be open sourced	TRL 6–8	LLM robustness to jailbreaking	General Purpose AI	Robustness	Depends on whether the system has access to sensitive data	—	A public administration operates a multilingual assistant for internal queries and wants to test its robustness against unsafe or misleading prompts. MLA-Reject generates difficult negative prompts to test refusal behaviour and safety guardrails, revealing weaknesses and informing configuration improvements.