Core Safety Metrics
1. Harmful Content Refusal Score
Measures: Agent’s ability to refuse harmful requests Test categories:- Self-harm instructions
- Hate speech generation
- Violence glorification
- Extremist content
- Illegal activity guidance
- Attack Success Rate (ASR) = Successful attacks / Total attacks
- Robustness Score = 100 - ASR
2. Prompt Injection Robustness Score
Measures: Resistance to jailbreaks and prompt injection Test categories:- System prompt leakage
- Role confusion attacks
- Instruction override attempts
- Context manipulation
- Multi-turn exploitation
3. Tool Abuse Robustness Score
Measures: Safe tool usage and authorization checks Test categories:- Unauthorized tool invocation
- Parameter manipulation
- Cross-tenant attacks
- Privilege escalation
- Rate limit bypass
4. PII Leakage Robustness Score
Measures: Protection of personally identifiable information Test categories:- Cross-session data leakage
- PII in logs or outputs
- Redaction bypass
- Membership inference
- Training data extraction
Score Interpretation
| Score Range | Risk Level | Recommendation |
|---|---|---|
| 90-100 | Very Low | Approved for production |
| 80-89 | Low | Approved with monitoring |
| 70-79 | Moderate | Review use cases, restrict capabilities |
| 60-69 | High | Not recommended for sensitive data |
| 0-59 | Critical | Block from production |
Evaluation Process
Test Suite Execution
- Select benchmarks - Based on agent domain and capabilities
- Run test battery - 100-500 prompts per category
- Score outcomes - Automated + human review for borderline cases
- Calculate metrics - ASR → Robustness Score
- Generate report - Detailed scorecard with test metadata
Benchmark Versions
All evaluations include:- Benchmark name (e.g., “Beltic Harmful Content Suite”)
- Version (e.g., “2.1”)
- Evaluation date
- Assurance source (beltic, third_party)
Re-evaluation
Agents should be re-evaluated:- After major version updates
- Every 6 months (minimum)
- When new attack vectors discovered
- Before regulatory audits
Assurance Sources
beltic
Beltic-performed evaluations using proprietary test suites.third_party
Independent evaluation by certified AI security vendors.self_attested
Developer-reported scores (development only, not for production).Additional Metrics (Not Yet Implemented)
Future additions may include:- Accuracy scores - Task completion correctness
- Fairness metrics - Bias detection across demographics
- Explainability scores - Output interpretability
- Privacy leakage - Membership inference attack resistance