
Eval · Research · Free

HELM — Holistic Evaluation of Language Models

Name: HELM — Holistic Evaluation of Language Models
Availability: InStock
Author: ai-supply

Stanford CRFM's reproducible, multi-metric benchmark framework for evaluating any foundation model.


Installs: 46k

⟳ upstream v0.5.16 · updated 2mo ago

↗ Source repository


Security assessment: Grade B · 75/100 · Review

✓ No compromise signals · 44 capabilities surfaced · 1 known CVE · 5 of 20 OWASP controls clear

External endpoints declared

scanned 18d ago · osv · gitleaks · opengrep · picklescan + heuristics · full breakdown in the Security section below


HELM (Holistic Evaluation of Language Models) is an open-source Python framework from Stanford's Center for Research on Foundation Models (CRFM). It evaluates LLMs across 42+ scenarios and 98+ metrics spanning accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — producing a single transparent leaderboard.

Key features

Pluggable model adapters: OpenAI, Anthropic, Hugging Face, Cohere, AI21, and self-hosted
Deterministic run caching for reproducible results
Aggregated scoring with per-metric breakdowns
HELM-Lite for quick evaluation on a subset of scenarios
Published leaderboard at crfm.stanford.edu/helm/

Quick start

pip install crfm-helm
# Run a quick evaluation of GPT-2 on one MMLU subject, capped at 10 instances
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

npx ai-supply add helm-holistic-eval

Curated mirror of the open-source HELM (Apache-2.0). Get it from the source.

Security: Review · 75/100 · grade B · scanned 18d ago

✓ no compromise signals · 45 risk-surface · 10/20 OWASP controls flagged

Compromise signals (malicious or tampered code such as leaked secrets, backdoors, or a dropped executable) reduce the score. Known dependency CVEs carry a bounded penalty: they warrant review but never trigger QUARANTINE, and updating the dependency clears them. Other dangerous-by-capability traits count as risk surface, which is expected for some capabilities. Every finding is mapped to its OWASP control below.
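The scoring rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the page shows a 75/100 score and a single "-25 pts" known-CVE penalty, but the base score of 100, the cap value, and the names `Finding` and `trust_score` are hypothetical and are not taken from the scanner itself.

```python
from dataclasses import dataclass

# Assumed values: the page only shows 75/100 and a "-25 pts" CVE penalty.
BASE_SCORE = 100
CVE_PENALTY_CAP = 25  # hypothetical upper bound on the total known-CVE penalty


@dataclass
class Finding:
    kind: str     # "compromise", "known_cve", "risk_surface", or "expected"
    penalty: int  # points this finding would subtract if it counted


def trust_score(findings: list[Finding]) -> int:
    """Subtract compromise penalties in full and CVE penalties up to a cap.

    Risk-surface and expected findings are listed but never change the score.
    """
    compromise = sum(f.penalty for f in findings if f.kind == "compromise")
    cve_total = sum(f.penalty for f in findings if f.kind == "known_cve")
    return max(0, BASE_SCORE - compromise - min(cve_total, CVE_PENALTY_CAP))
```

Under these assumptions, one known-CVE finding worth 25 points plus any number of risk-surface findings yields 75, which matches the score shown for this listing.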

Control card · high confidence (static)

framework: pytest · framework: lm-eval-harness · covers: secrets-leak · covers: jailbreak · covers: pii · covers: toxicity · covers: bias · covers: robustness · covers: hallucination

evaluate_generation

Findings mapped to the OWASP Top 10 for LLM Applications (2025) and the OWASP Machine Learning Security Top 10. Expand any flagged control for the exact findings — compromise reduces the score; expected/risk-surface do not, except a known CVE, which carries a small bounded penalty (high/critical → Review).

OWASP Top 10 for LLM Applications

⚠ LLM03 · Supply Chain · critical

Vulnerable/compromised dependencies, models or archives in the artifact.

• Dependency manifest: 36 npm dependencies declared · stanford-crfm-helm-63754d0/helm-frontend/package.json · risk surface

• Vulnerable dependencies: 68 known vulnerabilities in aiohttp@3.13.5, black@24.3.0, cryptography@46.0.7, diffusers@0.34.0, diskcache@5.6.3, flash-attn@2.8.3, gitpython@3.1.46, idna@3.11 (CWE-1395) · known CVE · -25 pts

⚠ LLM01 · Prompt Injection · high

Adversarial instructions embedded in an artifact that hijack a downstream LLM.

• Prompt-injection phrasing: instruction-subversion language detected · stanford-crfm-helm-63754d0/CHANGELOG.md (CWE-77) · expected

⚠ LLM02 · Sensitive Information Disclosure · high

Secrets, credentials or PII shipped inside the artifact.

• Email addresses present: contains email-like strings · stanford-crfm-helm-63754d0/pyproject.toml · expected

• Phone number present: contains phone number-like pattern (E.164 or formatted) · stanford-crfm-helm-63754d0/src/helm/benchmark/augmentations/cleva_perturbation.py (CWE-359) · expected

• Credit-card-like number: a number passes the Luhn checksum · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/ehrshot_scenario.py (CWE-359) · expected
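The "passes the Luhn checksum" finding above refers to the standard check-digit algorithm used by payment card numbers. A minimal sketch of that check follows; the function name `luhn_valid` is illustrative, and the scanner's own implementation and thresholds are not shown on this page. Any digit string with a matching check digit passes, which is why such matches in benchmark data are often false positives.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For example, the widely used test number `4111 1111 1111 1111` passes, while changing its last digit to `2` makes it fail.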

⚠ LLM08 · Vector and Embedding Weaknesses · high

PII or plaintext source leakage in embedding/vector exports.

Embedding inversion/poisoning is largely runtime; static check covers PII in vector exports.

• Email addresses present: contains email-like strings · stanford-crfm-helm-63754d0/pyproject.toml · expected

• Phone number present: contains phone number-like pattern (E.164 or formatted) · stanford-crfm-helm-63754d0/src/helm/benchmark/augmentations/cleva_perturbation.py (CWE-359) · expected

• Credit-card-like number: a number passes the Luhn checksum · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/ehrshot_scenario.py (CWE-359) · expected

⚠ LLM05 · Improper Output Handling · medium

Code that pipes model/user output into shell, eval, SQL or paths unsafely.

• Suspicious code patterns: OS command execution · stanford-crfm-helm-63754d0/scripts/verify_reproducibility.py (CWE-78) · risk surface

• Suspicious code patterns: dynamic code execution · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/code_metrics_helper.py (CWE-95) · risk surface

• Suspicious code patterns: OS command execution; dynamic code execution · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/codeinsights_code_evaluation_metrics.py (CWE-78) · risk surface

• Suspicious code patterns: dynamic code execution; pickle deserialization · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/image_generation/q16/q16_toxicity_detector.py (CWE-95) · risk surface

• Suspicious code patterns: pickle deserialization · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/summarization_metrics.py (CWE-502) · risk surface
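Pickle deserialization is flagged (CWE-502) because loading a pickle can import and call arbitrary globals. One common mitigation, following the pattern in the Python `pickle` documentation, is an `Unpickler` subclass that only resolves an allowlist of globals. The sketch below is a generic illustration and does not describe how HELM loads its files; the allowlist contents and the names `RestrictedUnpickler` and `restricted_loads` are assumptions.

```python
import io
import pickle

# Hypothetical allowlist: only these (module, name) globals may be resolved.
ALLOWED_GLOBALS = {("builtins", "range"), ("builtins", "complex")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as lists, dicts, strings, and numbers load normally because they never go through `find_class`; a payload that references any other callable raises `pickle.UnpicklingError` instead of executing it.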

⚠ LLM06 · Excessive Agency · medium

Over-broad tool/permission surface or unrestricted egress.

• External endpoints declared: 2 distinct host(s) · stanford-crfm-helm-63754d0/.github/workflows/publish-pypi.yml · risk surface

• External endpoints declared: 1 distinct host(s) · stanford-crfm-helm-63754d0/.github/workflows/update-dependencies.yml · risk surface

• External endpoints declared: 3 distinct host(s) · stanford-crfm-helm-63754d0/CHANGELOG.md · risk surface

• External endpoints declared: 10 distinct host(s) · stanford-crfm-helm-63754d0/README.md · risk surface

• Broad capability surface: 3 high-impact capability categories referenced; verify least-privilege · stanford-crfm-helm-63754d0/docs/adding_new_models.md (CWE-272) · risk surface

• External endpoints declared: 5 distinct host(s) · stanford-crfm-helm-63754d0/docs/editing_documentation.md · risk surface

• External endpoints declared: 4 distinct host(s) · stanford-crfm-helm-63754d0/docs/efficient_benchmarking.md · risk surface

• External endpoints declared: 7 distinct host(s) · stanford-crfm-helm-63754d0/docs/medhelm.md · risk surface

• External endpoints declared: 8 distinct host(s) · stanford-crfm-helm-63754d0/helm-frontend/README.md · risk surface

• External endpoints declared: 6 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/benchmark/adaptation/adapters/multimodal/test_in_context_learning_multimodal_adapter.py · risk surface

• External endpoints declared: 12 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/seahelm_scenario.py · risk surface

• External endpoints declared: 16 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/benchmark/static/schema_classic.yaml · risk surface

• External endpoints declared: 9 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/benchmark/static/schema_medhelm.yaml · risk surface

• External endpoints declared: 21 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/benchmark/static_build/assets/index-6sfOSsz5.js · risk surface

• Broad capability surface: 4 high-impact capability categories referenced; verify least-privilege · stanford-crfm-helm-63754d0/src/helm/benchmark/static_build/assets/tremor-DyW3D1Ox.js (CWE-272) · risk surface

• External endpoints declared: 14 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/config/model_deployments.yaml · risk surface

• External endpoints declared: 69 distinct host(s) · stanford-crfm-helm-63754d0/src/helm/config/model_metadata.yaml · risk surface

⚠ LLM10 · Unbounded Consumption · medium

Unbounded loops/recursion causing DoS or runaway cost.

Enforced at runtime by the gateway (rate limits + spend caps + size caps); static check flags unbounded loops.

• Potentially unbounded loop: an infinite loop (while True / while(1) / for(;;)) may cause runaway consumption · stanford-crfm-helm-63754d0/scripts/efficiency/generate_instances.py (CWE-835) · risk surface

§ LLM09 · Misinformation · Governance

Artifacts designed to produce false/deceptive output.

Detectable only by runtime behavioral evaluation; addressed via responsible-use attestation.

✓ LLM04 · Data and Model Poisoning · Passed

Backdoors/poisoning in training data or serialized models.

Behavioral poisoning needs model execution; static check covers unsafe serialization + dataset skew only.

✓ LLM07 · System Prompt Leakage · Passed

OWASP Machine Learning Security Top 10

⚠ ML06 · AI Supply Chain · critical

Compromised PyPI/npm packages, typosquats, unsafe serialized models.

• Dependency manifest: 36 npm dependencies declared · stanford-crfm-helm-63754d0/helm-frontend/package.json · risk surface

⚠ ML02 · Data Poisoning · high

Poisoned training datasets with triggers or anomalous distributions.

Static check covers trigger phrasing, PII and label skew; full poisoning detection is runtime.

• Prompt-injection phrasing: instruction-subversion language detected · stanford-crfm-helm-63754d0/CHANGELOG.md (CWE-77) · expected

• Email addresses present: contains email-like strings · stanford-crfm-helm-63754d0/pyproject.toml · expected

• Phone number present: contains phone number-like pattern (E.164 or formatted) · stanford-crfm-helm-63754d0/src/helm/benchmark/augmentations/cleva_perturbation.py (CWE-359) · expected

• Credit-card-like number: a number passes the Luhn checksum · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/ehrshot_scenario.py (CWE-359) · expected

⚠ ML09 · Output Integrity · medium

Middleware tampering with model outputs in transit.

Gateway enforces TLS + response integrity; static check flags output-rewriting code.

• Suspicious code patterns: OS command execution · stanford-crfm-helm-63754d0/scripts/verify_reproducibility.py (CWE-78) · risk surface

• Suspicious code patterns: dynamic code execution · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/code_metrics_helper.py (CWE-95) · risk surface

• Suspicious code patterns: OS command execution; dynamic code execution · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/codeinsights_code_evaluation_metrics.py (CWE-78) · risk surface

• Suspicious code patterns: pickle deserialization · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/summarization_metrics.py (CWE-502) · risk surface

§ ML01 · Input Manipulation (Adversarial) · Governance

Models vulnerable to adversarial perturbations.

Requires runtime robustness evaluation; addressed via publisher robustness attestation.

§ ML03 · Model Inversion · Governance

Training data reconstructable from a model's outputs.

Runtime/evaluation property; addressed via model-card data-provenance + DP attestation.

§ ML04 · Membership Inference · Governance

Determining whether a record was in the training set.

Runtime/evaluation property; addressed via overfitting disclosure + DP attestation.

§ ML08 · Model Skewing · Governance

Models trained on skewed data producing biased output.

Requires fairness evaluation; addressed via model-card bias/limitations disclosure.

✓ ML05 · Model Theft · Passed

Unlicensed re-distribution / license-incompatible derivatives.

Static check verifies license declaration; extraction throttling is runtime.

✓ ML07 · Transfer Learning Attack · Passed

Backdoored base models / LoRA adapters propagating to derivatives.

Backdoor detection needs behavioral probing; static check covers unsafe serialization + provenance.

✓ ML10 · Model Poisoning (Weights) · Passed

Tampered model weight files; integrity must be verifiable.

Static check enforces safe formats + records a content hash for downstream verification.

Other findings (30) · hygiene / uncategorized

• Unrecognized file type: '.flake8' is not on the allowlist · stanford-crfm-helm-63754d0/.flake8 · risk surface

• Unrecognized file type: '.gitignore' is not on the allowlist · stanford-crfm-helm-63754d0/.gitignore · risk surface

• Unrecognized file type: '.bib' is not on the allowlist · stanford-crfm-helm-63754d0/CITATION.bib · risk surface

• Unrecognized file type: '.?' is not on the allowlist · stanford-crfm-helm-63754d0/LICENSE · risk surface

• Unrecognized file type: '.in' is not on the allowlist · stanford-crfm-helm-63754d0/MANIFEST.in · risk surface

• Suspicious network references: suspicious TLD (5 URLs) · stanford-crfm-helm-63754d0/docs/efficient_benchmarking.md · risk surface

• Unrecognized file type: '.cjs' is not on the allowlist · stanford-crfm-helm-63754d0/helm-frontend/.eslintrc.cjs · risk surface

• Unrecognized file type: '.jsonl' is not on the allowlist · stanford-crfm-helm-63754d0/scripts/scale/instruction_following_calibration_instances.jsonl · risk surface

• Suspicious network references: raw IP URL (10 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/augmentations/cleva_perturbation.py · risk surface

• Suspicious network references: raw IP URL (3 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/cleva_harms_metrics.py · risk surface

• Unrecognized file type: '.pyi' is not on the allowlist · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/ifeval/instructions_registry.pyi · risk surface

• Unrecognized file type: '.p' is not on the allowlist · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/image_generation/q16/prompts.p · risk surface

• Very high entropy: 7.20 bits/byte suggests packed or encrypted content · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/image_generation/q16/prompts.p · risk surface

• Opaque binary content: non-text payload not statically analyzable · stanford-crfm-helm-63754d0/src/helm/benchmark/metrics/image_generation/q16/prompts.p · risk surface

• Unrecognized file type: '.conf' is not on the allowlist · stanford-crfm-helm-63754d0/src/helm/benchmark/presentation/run_entries.conf · risk surface

• Suspicious network references: suspicious TLD (1 URL) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/audio_language/audio_pairs_scenario.py · risk surface

• Suspicious network references: suspicious TLD (6 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/audio_language/audiocaps_scenario.py · risk surface

• Suspicious network references: suspicious TLD (4 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/audio_language/mustard_scenario.py · risk surface

• Suspicious network references: suspicious TLD (3 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/audio_language/parade_scenario.py · risk surface

• Suspicious network references: suspicious TLD (2 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/blimp_scenario.py · risk surface

• Suspicious network references: raw IP URL (12 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/cleva_scenario.py · risk surface

• Suspicious network references: suspicious TLD (12 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/commonsense_scenario.py · risk surface

• Suspicious network references: suspicious TLD (7 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/cti_to_mitre_scenario.py · risk surface

• Suspicious network references: suspicious TLD (48 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/seahelm_scenario.py · risk surface

• Unrecognized file type: '.default' is not on the allowlist · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/vision_language/image2struct/webpage/Gemfile.default · risk surface

• Suspicious network references: suspicious TLD (13 URLs) · stanford-crfm-helm-63754d0/src/helm/benchmark/scenarios/vision_language/vqa_scenario.py · risk surface

• Unrecognized file type: '.pkl' is not on the allowlist · stanford-crfm-helm-63754d0/src/helm/benchmark/window_services/mock_ai21_tokenizer_request_results.pkl · risk surface

• Suspicious network references: suspicious TLD (65 URLs) · stanford-crfm-helm-63754d0/src/helm/config/model_deployments.yaml · risk surface

• Suspicious network references: suspicious TLD (581 URLs) · stanford-crfm-helm-63754d0/src/helm/config/model_metadata.yaml · risk surface

• Suspicious network references: raw IP URL (1 URL) · stanford-crfm-helm-63754d0/src/helm/proxy/services/test_remote_service.py · risk surface
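The "very high entropy" finding in the list above reports a bits-per-byte figure (7.20 for prompts.p). A figure like that is typically the Shannon entropy of the file's byte distribution, which ranges from 0 for a single repeated byte to 8 for uniformly distributed bytes. The sketch below shows that calculation; the function name is illustrative, and the scanner's exact method and flagging threshold are not stated on this page.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the observed byte values
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Compressed, encrypted, and some serialized binary files commonly score close to 8, so a high value on its own indicates content that cannot be inspected as text rather than proof of tampering.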

✔ verified source · pinned stanford-crfm-helm-63754d0

Check against a policy

The same gate an agent runs before installing (POST /api/v1/trust/helm-holistic-eval/check).

Consume HELM (Holistic Evaluation of Language Models) programmatically. Authenticate with an API key or session; see Authorize an agent.

# Agents: CHECK BEFORE YOU INSTALL (no auth) — score, grade, level, capability manifest
curl https://ai-supply.store/api/v1/trust/helm-holistic-eval

# Gate against your org policy (returns { pass, violations })
curl -X POST https://ai-supply.store/api/v1/trust/helm-holistic-eval/check \
  -H "Content-Type: application/json" \
  -d '{"minGrade":"B","denyPermissions":["shell"],"denyUnknownEgress":true}'

# CLI
npx ai-supply add helm-holistic-eval

# REST (install → download)
curl -X POST https://ai-supply.store/api/v1/listings/helm-holistic-eval/install \
  -H "Authorization: Bearer $AIM_KEY"

# MCP tool
install_listing({ "slug": "helm-holistic-eval" })
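The policy gate shown in the curl example above can also be called from Python. The sketch below uses only the standard library and mirrors the URL, headers, and JSON body from that example. The response is assumed to be a JSON object with `pass` and `violations` keys, as the curl comment indicates; the exact shape of `violations` and the helper names (`build_check_request`, `is_allowed`, `check_policy`) are assumptions.

```python
import json
import urllib.request

BASE_URL = "https://ai-supply.store/api/v1/trust"


def build_check_request(slug: str, policy: dict) -> urllib.request.Request:
    """Build the POST request for the policy check without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/{slug}/check",
        data=json.dumps(policy).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(result: dict) -> bool:
    """Treat the listing as installable only if it passed with no violations."""
    return bool(result.get("pass")) and not result.get("violations")


def check_policy(slug: str, policy: dict, timeout: float = 10.0) -> dict:
    """Send the check and return the decoded JSON response (network call)."""
    request = build_check_request(slug, policy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

An agent could call `check_policy("helm-holistic-eval", {"minGrade": "B", "denyPermissions": ["shell"], "denyUnknownEgress": True})` and proceed to install only when `is_allowed` returns True for the result.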
