Crucible

LLM Red-Team and Adversarial Evaluation Stack

Crucible is iSN.BiZ's private monorepo for automated jailbreak evaluation across large frontier-model fleets, fused with an agent-transcript judge harness and a sales-page-agent eval. The platform consolidates an evaluation engine, an analytics dashboard, an MCP server, and a LibreChat-based operator interface behind a single hardened deployment at crc.isn.biz, with canonical evaluation data routed to a TrueNAS-backed PostgreSQL instance over Tailscale. The framing is defensive research only: the methodology described here is intentionally high-level, payload-free, and aligned with responsible-disclosure norms.

  • 97+ Frontier Models Evaluated
  • 23K+ RAG-Embedded Research Docs
  • 1,100+ Historical Eval Sessions

Signature Key Art

A cinematic visual system rendered around the Crucible adversarial-evaluation theme - dark obsidian figures with violet circuit traces and volumetric energy, generated via Flux Dev with a stacked LoRA pipeline on the TrueNAS GPU host.

Threat Model and Mission

Crucible exists to measure how frontier language models actually fail under adversarial pressure - and to do that measurement under controlled, repeatable, defensively-framed conditions, not on production traffic.

Most published LLM-safety claims rest on benchmarks that were never adversarially probed at scale, on test sets that were partially memorized during training, or on judge-model evaluations whose own biases were never characterized. The result is a security posture that looks acceptable in spec sheets and collapses on contact with a determined operator. Crucible's threat model assumes the operator: a researcher, integrator, or red-team contractor who needs to know - in concrete numerical terms - the success rates of structured attack families against a specific model and a specific system prompt, before that pairing is shipped to customers.

The platform serves three operational purposes that share infrastructure but address distinct stakeholder questions. First, it runs structured red-team and jailbreak research against frontier models using a curated prompt corpus and a closed-loop adaptive arsenal. Second, it executes an eval and judge harness against arbitrary agent transcripts so that conversational and tool-using systems can be scored on the same axes as single-turn chat. Third, it hosts an evaluation pipeline for the Vac sales-page agent and any future iSN.BiZ agent product, ensuring that customer-facing conversational surfaces are stress-tested against the same adversarial corpus as the underlying models. All three flows write into the same canonical PostgreSQL store, so coverage analysis, model-rank movement, and technique-effectiveness shifts become first-class queries rather than ad hoc dashboards.

The defensive purpose is explicit. Crucible publishes no exploit payloads, no weaponized PoCs, and no harm-class attack content. Its outputs are aggregate technique-effectiveness ratings, model-vulnerability rankings, and methodology notes useful to defenders building safer system prompts, alignment evals, and AI-RMF-aligned governance. The corpus folds in the lab's standing LLM-security research line, including the huntr operator-guide corpus, OWASP Top 10 for LLMs context, and NIST AI RMF mapping notes. Any novel finding is reported to the affected model provider under responsible disclosure before any aggregated metric is shared externally.

Platform Capabilities

  • Structured red-team campaigns with an immutable evaluation protocol that bounds prompt sizes, response lengths, and refusal handling
  • RAG-assisted research mode over 23,500+ embedded security documents in a pgvector knowledge base on TrueNAS
  • Agent transcript capture, deduplication, and judge scoring across multi-turn conversations and tool-call chains
  • Cross-model coverage matrix spanning OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Qwen, GLM, MiniMax, and Kimi families
  • Sales-page-agent eval pipeline for stress-testing the Vac agent and downstream iSN.BiZ conversational products
  • Model-family aware delivery shaping that respects empirically-observed system-turn versus user-turn safety-training asymmetries
  • LibreChat-fork operator interface with structured tools, custom OpenRouter endpoint, and full ElevenLabs and Whisper voice integration
  • Responsible-disclosure-first reporting workflow with private findings, pre-publication embargo, and aggregate-only public output

Deep Dive: Jailbreak Research Methodology

Crucible's research methodology is a closed-loop, RAG-informed, adversarial-evaluation pipeline. The framing is defensive: methodology is described in operational terms so defenders can reproduce the measurement, never as exploit recipes.

Cosmology Framework: Structured Adversarial Roles

Internally, Crucible organizes its red-team agent fleet using a deliberately stylized hierarchy that separates white-hat defensive research roles from black-hat offensive research roles. The white-hat tier covers pattern recognition, ethical-boundary defense, hidden-wisdom inspection, and structural-design review. The black-hat tier covers linguistic mutation, narrative deception, system-prompt probing, and chained-attack reasoning, each scoped strictly to red-team evaluation against systems we are explicitly authorized to test. The framework is a product mechanic, not a theological statement: it gives operators a memorable taxonomy for routing requests to the right specialist agent and for composing multi-agent "summoning circles" that exercise full-spectrum attack chains under controlled conditions.

Each agent carries a stable identity stack: a rank, a domain (language, deception, systems, chains, or their defensive duals), a curated knowledge base from primary research literature, and a system prompt assembled deterministically from rank and role. This is what makes evaluation runs reproducible across months: the same agent, with the same loadout, against the same target model, produces the same scoring envelope - or the delta is itself a finding.
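Reproducibility of this kind depends on the system prompt being a pure function of the agent's identity. The sketch below is a minimal illustration, not Crucible's actual assembler: the `AgentIdentity` fields, the prompt layout, and the `loadout_fingerprint` helper are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical identity stack; field names are illustrative only."""
    name: str
    rank: int
    domain: str  # e.g. "language", "deception", "systems", "chains"
    role: str    # "white-hat" or "black-hat"


def assemble_system_prompt(agent: AgentIdentity) -> str:
    """Build the prompt from identity fields only: no clock, no randomness."""
    lines = [
        f"Agent: {agent.name}",
        f"Rank: {agent.rank}",
        f"Role: {agent.role}",
        f"Domain: {agent.domain}",
    ]
    return "\n".join(lines)


def loadout_fingerprint(agent: AgentIdentity) -> str:
    """SHA-256 of the assembled prompt, so loadout drift between runs is detectable."""
    prompt = assemble_system_prompt(agent)
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```

Storing the fingerprint next to each run would let a later comparison tell a changed loadout apart from a changed target model.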

RAG-Powered Research Mode

The research mode pairs the agent fleet with a RAG knowledge base of 23,500+ vector-embedded security research documents, 1,100+ historical evaluation sessions with success and failure data, and 244+ structured attack blueprints organized by model family and technique category. Queries flow through a defined research protocol:

  • Knowledge retrieval across the embedded corpus and the technique-effectiveness API
  • Pattern analysis to identify which technique categories have empirically succeeded against the target model family
  • Attack construction using only empirically-validated component combinations
  • Learning after each attempt to update technique selection

RAG endpoints expose semantic search across all documents, technique effectiveness data, model vulnerability scores, filtered evaluation wins by category, attack blueprints by model family, and tier-based technique and model rankings. All of this is internal-only operator surface - the public website does not expose payloads, scoring keys, or per-prompt data.

Empirical Finding: System-Turn vs. User-Turn Safety Training

A concrete published finding from Crucible's research line, cross-checked against multiple 2026 red-team papers and confirmed in our own LibreChat testing, illustrates how the platform produces actionable defensive intelligence. The finding: across the Chinese-model family - DeepSeek V3 and R1, Qwen, GLM, MiniMax, and Kimi - safety training is weighted more heavily on the system-turn channel than on the user-turn channel. The practical consequence is that adversarial content delivered via the system prompt (the instructions field of an agent definition) triggers higher refusal rates than the same content delivered as a first user-turn message.

For defenders this is load-bearing intelligence: any security review that probes a Chinese-family model only through system-turn injection will under-estimate that model's susceptibility to the same content delivered via a user turn. Crucible's mitigation pattern - splitting structural instructions into a short, neutral system prompt and a configurable conversation-starter on the user channel - is the defensive analog of the same finding. The finding is reported in aggregate, never with payloads, and the responsible-disclosure surface is the affected provider, not the public site.
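The asymmetry described above reduces to a difference between two aggregate refusal rates. The following is a minimal sketch of that arithmetic under assumed names (`ChannelCounts`, `channel_asymmetry`); it works on counts only and involves no prompt content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelCounts:
    """Aggregate counts for one delivery channel (hypothetical record shape)."""
    attempts: int
    refusals: int

    @property
    def refusal_rate(self) -> float:
        if self.attempts <= 0:
            raise ValueError("no attempts recorded for this channel")
        return self.refusals / self.attempts


def channel_asymmetry(system_turn: ChannelCounts, user_turn: ChannelCounts) -> float:
    """Refusal-rate gap; positive means the system-turn channel refuses more often."""
    return system_turn.refusal_rate - user_turn.refusal_rate
```

With invented counts of 80 refusals in 100 system-turn attempts and 50 in 100 user-turn attempts, the gap is roughly 0.30. A review that sampled only the system-turn channel would report the 0.80 refusal rate and miss that gap entirely.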

Immutable Evaluation Protocol

All evaluation runs operate under a protocol that is treated as immutable. System prompts are bounded between 3,333 and 12,000 characters, base prompts between 200 and 800 characters, and enhanced prompts between 400 and 1,500 characters. Phase-B and Phase-C target responses have a floor of 1,332 characters and a ceiling of either 9,999 or 12,999 characters, depending on the phase. Runs use no mock data and no summaries, and refusals are recorded as failures. Duplicated responses with fewer than 333 characters of meaningful difference are also treated as failures and trigger an escalation ladder: temperature is raised, then the technique is rotated, then the tier is rotated, and finally the model-technique pairing is marked incompatible. This protocol is what makes cross-model comparisons honest.
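The prompt bounds and the escalation order above can be expressed as a small table and an enum. This is a sketch under stated assumptions: the names `BOUNDS`, `within_bounds`, `Escalation`, and `next_escalation` are hypothetical, and advancing one rung per consecutive failure is an assumption, since the text gives the order of the steps but not how many failures trigger each one. Response-length bounds and the 333-character difference check are left out because their exact rules are not spelled out here.

```python
from enum import Enum

# Character bounds taken from the protocol text (inclusive on both ends is assumed).
BOUNDS = {
    "system_prompt": (3_333, 12_000),
    "base_prompt": (200, 800),
    "enhanced_prompt": (400, 1_500),
}


def within_bounds(kind: str, text: str) -> bool:
    """Return True when the text length falls inside the bounds for its kind."""
    low, high = BOUNDS[kind]
    return low <= len(text) <= high


class Escalation(Enum):
    RAISE_TEMPERATURE = 1
    ROTATE_TECHNIQUE = 2
    ROTATE_TIER = 3
    MARK_INCOMPATIBLE = 4


def next_escalation(consecutive_failures: int) -> Escalation:
    """Map a failure count (>= 1) to a rung; the ladder stops at MARK_INCOMPATIBLE."""
    if consecutive_failures < 1:
        raise ValueError("escalation only applies after at least one failure")
    return Escalation(min(consecutive_failures, len(Escalation)))
```

Keeping the bounds in one constant table, rather than scattered through the runner, is one way to make the "immutable" property easy to audit.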

Eval Harness Components

Judge Stack

Multi-Tier Adversarial Scoring

Tier-4 and tier-5 parallel evaluators run against every candidate response, combining deterministic code-based graders for bounded checks with model-based rubric judges for nuanced compliance assessment. The master unified orchestrator routes each transcript through the appropriate judge configuration based on technique class and target model family, and persists scoring envelopes alongside raw responses so that every claim is auditable. Judge bias and disagreement are themselves tracked as first-class signals.
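One common way to track judge disagreement as a signal is Cohen's kappa, which corrects raw agreement for the agreement two judges would reach by chance. The source does not say which statistic Crucible uses, so the sketch below is only an illustration of the idea; the function name and the label strings are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two judges over the same transcripts."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must label the same non-empty set of items")
    n = len(judge_a)
    # Observed agreement: fraction of items given the same label by both judges.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement if each judge labelled independently at its own base rates.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 suggests the rubric judge and the code-based grader are measuring the same thing; a value near 0 suggests their agreement is no better than chance and the scoring envelope deserves a closer look.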

Transcript Capture

Agent and Conversation Recording

Full transcript capture runs across the LibreChat operator interface and the FastAPI eval engine, recording every prompt, every model response, every tool call, and every retrieval. Responses are deduplicated by hash so identical refusals across attempts collapse into a single entry with an occurrence count. Long responses remain collapsible in the data browser, and raw jailbreak content sits behind authentication so that only authorized operators ever see it.
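Hash-based deduplication with an occurrence count can be sketched in a few lines. This example assumes exact-match hashing of the raw response text with SHA-256; whether Crucible normalizes whitespace or casing before hashing is not stated, and `dedupe_responses` is a hypothetical name.

```python
import hashlib
from typing import Iterable


def dedupe_responses(responses: Iterable[str]) -> list[dict]:
    """Collapse identical responses into one entry each, keeping first-seen order."""
    seen: dict[str, dict] = {}
    for text in responses:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # setdefault creates the entry on first sight and returns it afterwards.
        entry = seen.setdefault(digest, {"sha256": digest, "text": text, "count": 0})
        entry["count"] += 1
    return list(seen.values())
```

Three attempts that return the same refusal text twice and a different response once would yield two entries, with counts of 2 and 1.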

Scenario Simulator

Multi-Turn Adversarial Scenarios

The scenario simulator drives synthetic-user multi-turn conversations against a target agent using parameterized adversarial scripts. Scenarios cover gradual context shift, persona injection, narrative framing, role exploitation, and reasoning-trap families. Each is scored against pass@k (at least one of k attempts passes) and pass^k (all k attempts pass) metrics so that capability and regression evaluations share a consistent statistical frame. Adaptive arsenal logic chooses the next escalation only from empirically-validated technique branches.
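Both metrics can be estimated from n recorded trials of a scenario, of which c passed. The sketch below uses the standard combinatorial estimators: the chance that a random draw of k trials contains at least one pass, and the chance that it contains only passes. The function names are assumptions, and what counts as a "pass" is whatever the scenario's own scoring defines.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from the n recorded ones passed."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0 as expected.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_power_k(n: int, c: int, k: int) -> float:
    """Probability that all k trials drawn from the n recorded ones passed."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)
```

For k = 1 both estimators reduce to c / n. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why the first suits capability questions and the second suits regression and reliability questions.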

Sales-Page Agent Eval

Vac Pipeline and Customer-Facing Surface

A dedicated eval pipeline targets the Vac sales-page agent - the next conversational product on the iSN.BiZ roadmap - and the broader class of customer-facing agents that follow it. The harness exercises sales-conversation flows under adversarial pressure: prompt-injection attempts hidden in user messages, system-prompt-extraction probes, off-topic steering, and refusal-bypass framing. Findings feed directly back into the agent's system-prompt design, hardening the surface before it ever speaks with a paying customer.

Implementation Details

Crucible runs as a hardened private monorepo: an evaluation engine, an analytics dashboard, an MCP server, and a LibreChat operator interface, deployed across a defined Tailscale-connected machine boundary.

LibreChat-Fork Operator Interface

The operator surface is a hardened LibreChat fork tracked as a git submodule. Custom providers are configured through librechat.yaml with curated model lists and explicit fetch behavior, including a custom OpenRouter endpoint that exposes a vetted slate of evaluation targets. Crucible's red-team and analytics tools - hydra_evaluate, hydra_ip_score, hydra_techniques, and hydra_status - are wired as structured LibreChat tools, not ad hoc prompt snippets, so tool-call traces become structured rows in the same evaluation database that holds prompt and response data.

Evaluation Engine and Dashboard

The evaluation engine is a FastAPI server exposing 45+ endpoints for sessions, models, techniques, blueprints, RAG search, and tier-ranked vulnerability and technique data. The analytics dashboard is a separate FastAPI application serving session, model, technique, and leaderboard views with deduplicated response display, a coverage heatmap, and a blueprint browser. Both services connect to the same canonical PostgreSQL instance, and both honor the same authentication boundary - raw jailbreak content is gated behind operator authentication and never rendered in unauthenticated views.
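The authentication boundary can be pictured as a single rendering function that every view passes through. The sketch below is framework-free and purely illustrative: `ResponseRecord`, `render_for_view`, and the placeholder string are assumptions, and the real services would derive `authenticated` from their own session or token checks.

```python
from dataclasses import dataclass

REDACTED = "[gated: operator authentication required]"


@dataclass(frozen=True)
class ResponseRecord:
    """Hypothetical row shape for one stored model response."""
    session_id: str
    model: str
    outcome: str
    raw_text: str


def render_for_view(record: ResponseRecord, authenticated: bool) -> dict:
    """Return metadata for any caller; include the raw text only for operators."""
    return {
        "session_id": record.session_id,
        "model": record.model,
        "outcome": record.outcome,
        "raw_text": record.raw_text if authenticated else REDACTED,
    }
```

Routing both the engine and the dashboard through one function like this keeps the gate in a single place instead of relying on each view to remember it.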

Deployment Topology

Production hosting lives on a Hetzner-class VPS that fronts crc.isn.biz. Caddy terminates TLS, routes API traffic to the eval engine, dashboard traffic to the analytics service, and chat traffic to the LibreChat container. A separate TrueNAS host runs the canonical PostgreSQL plus pgvector, the Ollama fallback inference, the local image-generation pipeline, and a Telegram operator bot. A burst Xeon host is reserved for heavy parallel evaluation runs and is never used as a permanent service host. All of these machines connect over Tailscale, and the eval engine reaches PostgreSQL by Tailscale IP rather than by exposing the database to the public internet.
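A startup guard can enforce the rule that the database is only ever reached over the tailnet. Tailscale assigns IPv4 addresses from the 100.64.0.0/10 carrier-grade NAT range, so the sketch below refuses any database host outside it. The helper names and the DSN layout are assumptions, only IPv4 is checked, and the password is deliberately left out of the string on the assumption that it is injected separately.

```python
import ipaddress

# Tailscale hands out IPv4 addresses from the CGNAT block.
TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ipv4(host: str) -> bool:
    """True only for a literal IPv4 address inside the Tailscale range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames and malformed strings are rejected
    return addr.version == 4 and addr in TAILSCALE_CGNAT


def build_dsn(host: str, database: str, user: str, port: int = 5432) -> str:
    """Build a PostgreSQL DSN, refusing hosts that are not tailnet addresses."""
    if not is_tailscale_ipv4(host):
        raise ValueError(f"refusing non-Tailscale database host: {host!r}")
    return f"postgresql://{user}@{host}:{port}/{database}"
```

Failing fast at configuration time means a mistyped public address never results in an outbound connection attempt.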

Secrets, Auth, and Data Handling

All credentials route through a 1Password Connect instance. Required secrets - OpenRouter API key, PostgreSQL password, JWT signing keys, credential encryption keys, and a per-agent LibreChat API key - are injected at deploy time, never committed. Operator probes prefer the LibreChat per-agent OpenAI-compatible API key, falling back to admin JWT login if unset. Sensitive corpora and unredacted findings route to a TrueNAS-resident research database, separated from operational chat state and from the local development PostgreSQL that exists only for disposable iteration.
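The credential preference for operator probes is a simple ordered fallback. In the sketch below the environment variable names (`LIBRECHAT_AGENT_API_KEY`, `LIBRECHAT_ADMIN_EMAIL`, `LIBRECHAT_ADMIN_PASSWORD`) are hypothetical stand-ins for whatever the deploy step injects, and the function only decides which path to take; it performs no login itself.

```python
import os
from typing import Mapping, Optional


def choose_probe_auth(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the probe auth mode: per-agent API key first, admin JWT login second."""
    env = os.environ if env is None else env
    # Hypothetical variable names; blank values count as unset.
    if env.get("LIBRECHAT_AGENT_API_KEY", "").strip():
        return "agent_api_key"
    if env.get("LIBRECHAT_ADMIN_EMAIL") and env.get("LIBRECHAT_ADMIN_PASSWORD"):
        return "admin_jwt_login"
    raise RuntimeError("no probe credentials were injected at deploy time")
```

Raising when neither credential is present keeps a misconfigured deployment from silently running probes unauthenticated.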

Technology Stack

Python 3.11, FastAPI, PydanticAI, asyncpg, PostgreSQL 17, pgvector, Redis, MongoDB, LibreChat (Fork), Docker Compose, Caddy, Nginx, OpenRouter, Ollama, MCP Server, Tailscale, 1Password Connect, ElevenLabs TTS, Whisper STT, Voyage Embeddings

Differentiation and Defensive Posture

Eval Harness and Operator Chat in One Stack

Most LLM-eval frameworks ship as a CLI plus a notebook, leaving operators to assemble their own chat interface. Crucible fuses a structured evaluation engine, an analytics dashboard, a transcript-capturing chat UI, and an MCP server into a single deployable monorepo, so the same surface that runs an evaluation can also drive an interactive red-team session against the same target.

RAG-Informed Adaptive Arsenal

The platform does not pick attack techniques from a fixed list. The adaptive arsenal queries an internal RAG corpus and a technique-effectiveness database to choose the empirically best-performing technique class against the specific target model family, then escalates along validated branches. Defensive uplift comes for free: any technique that consistently fails to break a model is itself a defensive datapoint.
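Choosing the best-performing technique class from historical aggregates can be sketched as a ranking over smoothed success rates. The add-one smoothing used here is an assumption introduced for the example, so that a class with one lucky attempt does not outrank a class with a long track record; it is not a description of Crucible's selector. Class names are opaque placeholders and the sketch touches counts only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TechniqueStats:
    """Aggregate history for one technique class against one model family."""
    technique_class: str
    attempts: int
    successes: int


def smoothed_rate(stats: TechniqueStats) -> float:
    """Add-one (Laplace) smoothed success rate: (successes + 1) / (attempts + 2)."""
    return (stats.successes + 1) / (stats.attempts + 2)


def pick_technique_class(history: Iterable[TechniqueStats]) -> str:
    """Return the class with the highest smoothed rate; ties keep the earliest entry."""
    candidates = list(history)
    if not candidates:
        raise ValueError("no technique history available for this model family")
    return max(candidates, key=smoothed_rate).technique_class
```

The same table read from the bottom gives the defensive datapoint mentioned above: classes whose smoothed rate stays near zero over many attempts are the ones the target model resists consistently.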

Model-Family Aware Delivery

The published Chinese-model finding is operationalized in the platform: per-family delivery shaping respects empirically-observed asymmetries between system-turn and user-turn safety training. Defenders evaluating a Chinese-family model through Crucible see the model's full susceptibility surface, not just the half that surfaces on the system channel.

Responsible-Disclosure-First Workflow

Every novel finding is private by default. Aggregated metrics - technique effectiveness, model rankings, judge agreement - are publishable only after the affected provider has been notified through their disclosure surface and given a reasonable embargo window. No exploit payloads, no weaponized PoCs, and no harm-class content are ever published from this platform.

Why It Matters: Industry Context

Crucible exists inside a maturing ecosystem of AI-security frameworks, bug-bounty programs, and government-aligned risk-management standards. The platform reads, contributes to, and is governed by these reference frames.

OWASP Top 10 for LLMs

Application-Layer Risk Mapping

The OWASP Top 10 for Large Language Model Applications is the de facto application-layer taxonomy for LLM risk. In its 2023 edition the list covers prompt injection, insecure output handling, training-data poisoning, model denial of service, supply-chain vulnerabilities, sensitive-information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Crucible's technique families map onto this taxonomy explicitly, so every evaluation run produces a row that can be reported against the same axes a customer's AppSec team is already using.

NIST AI RMF

Governance and Measurement

The NIST AI Risk Management Framework defines a four-function lifecycle - Govern, Map, Measure, Manage - for AI systems, and is increasingly the language enterprise procurement and federal acquisition use to talk about AI risk. Crucible's evaluation outputs are designed to slot into the Measure function: concrete numerical artifacts (technique effectiveness rates, refusal rates, judge-agreement scores) that an organization can attach to its AI RMF profile and re-run on a quarterly cadence to track drift.

huntr Disclosure Ecosystem

AI/ML Bug-Bounty Surface

huntr.com bills itself as the world's first bug-bounty platform dedicated to AI/ML. It launched in 2020, was acquired by Protect AI in 2023, and was folded into Palo Alto Networks' Prisma AIRS through the 2025 Protect AI acquisition. It runs two distinct programs: open-source AI/ML repository disclosures with CVE issuance and a 90-day default embargo, and model-file-vulnerability disclosures targeting load-time and inference-time exploits in formats like .safetensors, .gguf, .keras, and .joblib. Crucible's research line maintains a curated huntr operator-guide corpus, RAG-embedded for tradecraft retrieval, so any novel finding produced by the platform can be routed to the appropriate disclosure surface (huntr, the affected upstream maintainer, or the model provider's security contact) through a defined responsible-disclosure workflow.

Responsible-Disclosure Statement

Crucible publishes no exploit payloads, no weaponized proof-of-concept code, and no harm-class adversarial content. All research output described on this page is aggregate-level: technique-effectiveness percentages, model-vulnerability rankings, judge-agreement statistics, and methodology notes. Novel vulnerabilities are reported privately to the affected model provider or upstream maintainer through their published disclosure channel before any aggregated metric is shared externally. Access to raw transcript content requires operator authentication, all sensitive corpora live behind a private network boundary on Tailscale, and the public surface at crc.isn.biz exposes only the operator interface, not raw jailbreak data. Detailed disclosure procedures and the full data-handling architecture are available to qualified investors, enterprise customers, and security partners under NDA.

Interested in This Solution?

Talk to us about adversarial evaluation, agent-transcript scoring, or pre-launch red-team review for your conversational AI surface. All engagements operate under written authorization, scoped rules of engagement, and a responsible-disclosure agreement.
