
Vectorview recently launched!
Founded by Emil Fröberg & Lukas Petersson
It’s difficult to prevent unwanted behaviors in LLMs due to their non-deterministic nature. Testing them against every possible scenario is hard, making it tough to catch all unintended behaviors. Additionally, most evaluation benchmarks (like MMLU or BBQ) are too general, missing the specific issues that can arise in real-world use. Take this example:

This issue isn’t limited to chatbots. It spans across LLM agents designed for specialized tasks and extends to AI labs striving for model safety. The task of crafting, deploying, and precisely scoring custom evaluations is complex and time-consuming.
Each use case demands a custom evaluation. In the case of the Chevrolet chatbot, Vectorview's custom auto-red teaming solution could be implemented to prevent the mistake.

The platform offers a suite of custom evaluation tools designed to benchmark AI applications against specific, real-world scenarios they are likely to encounter. This targeted approach ensures that AI behaves as intended, mitigating the risk of unintended behaviors that generic benchmarks often miss.