Frontier AI Risk Monitoring Platform

Current version: Risk Index v1.5 (effective from 2026 Q1)

Introduction

The Frontier AI Risk Monitoring Platform is the first third-party platform in China dedicated to evaluating and monitoring catastrophic risks of frontier AI models. It is developed by Concordia AI. Through benchmark testing and data analysis, the platform conducts targeted assessments and regular monitoring of the misuse and loss-of-control risks of frontier models from leading AI model companies worldwide. It provides a dynamic understanding of emerging AI risks, offering a reference for policymakers, model developers, and AI safety researchers.

Background

Key contributions of this platform:

  1. Addressing the Need for AI Catastrophic Risk Management: AI systems are rapidly approaching capabilities that could enable unprecedented harms if misused or if they behave in unintended ways. From enabling sophisticated cyberattacks to potentially assisting with the development of biological weapons, these catastrophic risks demand specialized risk management approaches. Building on work such as the International AI Safety Report and the Frontier AI Risk Management Framework (co-authored by Shanghai AI Lab and Concordia AI), the platform implements comprehensive evaluation across multiple risk domains, including cyberattacks, biological risks, chemical risks, and loss-of-control. For each risk domain, we introduce at least three capability benchmarks and three safety benchmarks to deliver a quantitative and standardized assessment of each model's risk profile.
  2. Strengthening Monitoring and Early Warning of AI Risks: As AI systems become more capable and widely deployed, there is a critical need for systematic, independent monitoring of how frontier AI risks evolve over time. Without continuous tracking, society remains blind to emerging threats. This gap is increasingly recognized internationally. China's revised Cybersecurity Law (October 2025) emphasizes strengthening risk monitoring and assessment for AI, while the US AI Action Plan and EU AI Act require frontier AI developers to conduct evaluations. The platform addresses this need by providing regular, standardized assessments of catastrophic risks, enabling stakeholders to detect dangerous capability trends before they manifest as real-world harms.
  3. Overcoming Limitations of Current Risk Assessment Practices: Existing AI risk assessments suffer from critical shortcomings. Self-assessments by model developers lack standardization and independent verification, while ad-hoc third-party evaluations are often incomplete, covering limited models or risk types, and fail to track changes over time. The platform addresses these issues through: consistent testing parameters and scoring standards across all models, transparent methodology and public disclosure of all results, comprehensive coverage of leading global AI developers, and continuous quarterly monitoring to capture risk evolution rather than static snapshots.
  4. Supporting Evidence-Based AI Safety Decisions: Currently, stakeholders—from researchers to developers to policymakers—lack systematic information about which AI models present the greatest risks, how these risks are evolving, and where safety interventions are most needed. By providing standardized, comparable risk data across models and over time, the platform creates an empirical foundation for prioritizing safety research, guiding responsible development practices, and informing national and international governance approaches. The platform is also committed to supporting emerging global governance mechanisms such as the upcoming UN Independent International Scientific Panel on AI and the annual publication of the International AI Safety Report, providing these initiatives with timely, systematic risk monitoring data.

Platform Functions

The platform includes four functional modules:

  1. Key Monitoring Findings: View the latest and most important monitoring findings on the homepage.
  2. Domain Risk Analysis: On each domain's Risk Analysis page, you can view risk data for specific domains. Interactive charts allow you to compare models' Risk Index, Capability Scores, and Safety Scores, with filtering options by company, model type, and open-weight versus proprietary systems.
  3. Domain Evaluation Details: Dive deeper into our methodology. Each domain page explains the benchmarks we use and displays individual model performance across all tests. Customizable filters help you focus on specific companies or model categories.
  4. Quarterly Monitoring Reports: Access comprehensive analysis through our quarterly reports, which synthesize findings across all domains and track risk evolution over time.

Evaluation Methodology

I. Domain Definition

We reference the AI safety risk classification and identification frameworks from the International AI Safety Report, the AI Safety Governance Framework 2.0, and the Frontier AI Risk Management Framework. Under Risk Index v1.5, we define the following five risk domains:

  1. Cyber Offense: Primarily focuses on risks of AI misuse in the cybersecurity field, such as the use of AI to create malware.
  2. Biological Risks: Primarily focuses on risks of AI misuse in the biological field, such as the use of AI to design or synthesize pathogens of infectious diseases.
  3. Chemical Risks: Primarily focuses on risks of AI misuse in the chemical field, such as the use of AI to design or synthesize lethal toxins.
  4. Harmful Manipulation: Primarily focuses on risks of AI manipulating users in real-world interaction settings, such as inducing users to make payments, express particular views, or change important decisions or political positions.
  5. Loss-of-Control: Primarily focuses on risks of autonomous AI escaping human control with no clear path to regaining it, for example through unsupervised self-improvement, self-replication, and power-seeking.

Some considerations:

  1. We conduct risk assessment for the biological and chemical domains separately instead of lumping them together into the “CBRN” risk category, as the threat realization pathways for biology and chemistry are independent. For instance, the task of “synthesizing lethal chemical toxins” does not use the model's biological capabilities.
  2. We treat loss-of-control as an overall risk domain rather than breaking it down into sub-domains such as self-improvement, self-replication, and strategic deception. This is because the capabilities in these sub-domains are all needed for loss-of-control to happen.

II. Benchmark Selection

We first select benchmarks. For each domain, we have two types of benchmarks:

  1. Capability Benchmarks: Benchmarks used to assess model capabilities, particularly capabilities that could be maliciously used (such as the capability to assist hackers in conducting cyberattacks) or lead to loss-of-control.
  2. Safety Benchmarks: Benchmarks used to assess model safety. For misuse risks (such as misuse in cyber, biology, and chemistry), these mainly evaluate the model’s safeguards against external malicious instructions (such as whether models refuse to respond to malicious requests); for the loss-of-control risk, these mainly evaluate the inherent propensities of the model (such as honesty).

Our standards for selecting benchmarks are as follows:

  • Relevance: Test tasks are highly relevant to models’ capabilities and safety within the risk domains we focus on.
  • Testability: Benchmarks should have accessible datasets and testing code (or, if no code is available, specific prompts and scoring methods).
  • Recency: We select benchmarks from 2024 (inclusive) onwards, as earlier benchmarks could be saturated for current frontier models, lacking discrimination between models.
  • Diversity: Within the same domain, we select multiple benchmarks with different test tasks to ensure diversity.
  • Quality: To ensure the quality of benchmarks, we prioritize benchmarks recommended in existing research. For example:

Weight Setting: We have not yet found an objective method to assign different weights to different benchmarks. Therefore, we assign equal weight to all benchmarks.

The current list of selected benchmarks is as follows:

Cyber Offense
  • Capability Evaluation
    • WMDP-Cyber: Proxy hazardous cybersecurity knowledge
    • CyBench: CTF task capability
    • CyberSecEval2-VulnerabilityExploit: Code vulnerability exploitation capability
    • CyberSecEval3-MultiTurnPhishing: Phishing attack capability
  • Safety Evaluation
    • AirBench-SecurityRisks: Harmful cybersecurity questions
    • CyberSecEval2-InterpreterAbuse: Code interpreter abuse instructions
    • CyberSecEval2-PromptInjection: Prompt injection attacks
    • Fortress-Privacy/Scams: Red-teaming attacks involving privacy and scams
    • ISC-Bench-Cyber: Harmful cyber questions + frontier jailbreak attacks
    • StrongReject: Harmful questions + jailbreak attacks

Biological Risks
  • Capability Evaluation
    • BioLP-Bench: Biological experiment troubleshooting capability
    • LAB-Bench-SeqQA: DNA and protein sequence understanding capability
    • LAB-Bench-FigQA: Biological image understanding capability
    • WMDP-Bio: Proxy hazardous biological knowledge
    • SciKnowEval-ProteoToxicityPrediction: Protein toxicity prediction capability
  • Safety Evaluation
    • SOSBench-Bio: Harmful and illicit biological requests
    • SciKnowEval-BiologicalHarmfulQA: Harmful biological questions
    • Fortress-Biological: Biological red-teaming attacks
    • ISC-Bench-Biological: Harmful biological questions + frontier jailbreak attacks
    • StrongReject: Harmful questions + jailbreak attacks

Chemical Risks
  • Capability Evaluation
    • ChemBench-ToxicityAndSafety: Chemical toxicity and safety knowledge
    • WMDP-Chem: Proxy hazardous chemical knowledge
    • SciKnowEval-MolecularToxicityPrediction: Molecular toxicity prediction capability
  • Safety Evaluation
    • SOSBench-Chem: Harmful and illicit chemical requests
    • SciKnowEval-ChemicalHarmfulQA: Harmful chemical questions
    • Fortress-Chemical: Chemical red-teaming attacks
    • ISC-Bench-Chemical: Harmful chemical questions + frontier jailbreak attacks
    • StrongReject: Harmful questions + jailbreak attacks

Harmful Manipulation
  • Capability Evaluation
    • CyberSecEval3-MultiTurnPhishing: Phishing attack capability
    • MakeMePay: Capability to induce payments
    • MakeMeSay: Capability to induce users to say specified content
    • PMIYC: Capability to change beliefs
  • Safety Evaluation
    • AirBench-Deception: Harmful deception tasks
    • AirBench-Manipulation: Harmful manipulation tasks
    • AirBench-PoliticalPersuasion: Political persuasion tasks
    • APE: Propensity to persuade others

Loss-of-Control
  • Capability Evaluation
    • Self-Proliferation: Self-replication and adaptation capability
    • MLE-Bench: Machine learning engineering capability under constrained resources
    • SciCode: Scientific programming capability
    • GDM-Stealth: Stealth capability
    • SAD-mini: Situational awareness capability
  • Safety Evaluation
    • MASK: Model honesty
    • Agentic-Misalignment: Agentic misalignment propensity
    • Shutdown-Resistance: Shutdown resistance propensity
    • DarkBench: Propensity to influence users covertly

Note:

  1. Compared with v1.0, v1.5 adds the “Harmful Manipulation” domain and introduces advanced red-teaming benchmark series such as Fortress and ISC-Bench in the cyber, biological, and chemical domains to more realistically measure model safety under complex attack scenarios.
  2. The loss-of-control domain has been updated with a more targeted benchmark mix: capability-side benchmarks include Self-Proliferation, MLE-Bench, and GDM-Stealth, while safety-side benchmarks include Agentic-Misalignment, Shutdown-Resistance, and DarkBench, making the evaluation more closely aligned with capabilities and propensities relevant to loss-of-control.
  3. In the biological domain, LAB-Bench-CloningScenarios has been removed due to its relatively small sample size and larger measurement error. In addition, the maximum message limit per CyBench task has been raised from 30 to 120 to better reflect models' upper-bound capabilities.

III. Model Selection

To comprehensively cover frontier AI models within a limited time and budget, we only selected breakthrough models from each frontier model company. Breakthrough models must simultaneously meet the following criteria:

  1. The model's general capability exceeds that of all models previously released by the company. To measure general capability, we refer to existing capability leaderboards, including Artificial Analysis, LMArena, CompassBench, etc.
  2. The model's general capability improvement over the previous generation is significant (such as Artificial Analysis Intelligence Index improvement of 2+ points).
  3. The model's general capability exceeds a certain threshold (such as Artificial Analysis Intelligence Index greater than 35).
  4. If a series of models is released simultaneously, we select the most capable one (e.g., among models with different parameter sizes, we select the largest one; among reasoning/non-reasoning models, we select the reasoning model).
    • Exception: For cost reasons, we have temporarily tested the secondary models of some breakthrough models (for example, o1-mini instead of o1 and Claude 4 Sonnet instead of Claude 4 Opus).
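
The four criteria above can be read as a simple filter over a company's release history. The following is a minimal sketch of one such reading, not the platform's actual selection code. The `Release` type, the function name, and the toy data are hypothetical. The thresholds reuse the example values from criteria 2 and 3. "Previous generation" is assumed to mean the company's best earlier index value, and a company's first release is assumed to pass criteria 1 and 2 automatically. The cost-based exception is not modeled.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Release:
    company: str
    name: str
    released: date
    index: float  # a general-capability index value (hypothetical field)


def select_breakthrough_models(releases, min_gain=2.0, min_index=35.0):
    """Return releases that satisfy criteria 1-4 under the stated assumptions."""
    selected = []
    best_so_far = {}  # company -> highest index among its earlier releases

    def batch_key(r):
        return (r.company, r.released)

    # Group simultaneous releases of one company, in chronological order.
    for (company, _), batch in groupby(sorted(releases, key=batch_key), key=batch_key):
        top = max(batch, key=lambda r: r.index)          # criterion 4
        previous_best = best_so_far.get(company)
        # Criteria 1 and 2: beat every earlier model by at least min_gain.
        gain_ok = previous_best is None or top.index - previous_best >= min_gain
        if gain_ok and top.index > min_index:            # criterion 3
            selected.append(top)
        # Track the best index even when the release is not selected.
        if previous_best is None or top.index > previous_best:
            best_so_far[company] = top.index
    return selected


toy_releases = [
    Release("A", "a1", date(2025, 1, 10), 30.0),
    Release("A", "a2", date(2025, 6, 1), 40.0),
    Release("A", "a2-mini", date(2025, 6, 1), 33.0),
    Release("A", "a3", date(2025, 9, 1), 41.0),
    Release("A", "a4", date(2026, 1, 15), 45.0),
]
print([r.name for r in select_breakthrough_models(toy_releases)])
```

In this toy history, `a1` falls below the threshold, `a2-mini` loses to its simultaneous sibling, and `a3` improves by only one point, so only `a2` and `a4` remain.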

The current list of models we selected can be found on the Risk Analysis page.

IV. Model Testing

We use the open source Inspect framework to implement fully automated evaluation.

  • For benchmarks that natively support the Inspect framework (such as LAB-Bench), we directly use existing code for evaluation.
  • For benchmarks that do not natively support the Inspect framework (such as MASK), we reference their existing open source code and implement the same evaluation logic under the Inspect framework, ensuring consistency with existing open source implementations in dataset processing, prompts, scoring methods, etc.

We use unified testing parameters for all models and all benchmarks:

  • temperature: Set to 0.5
  • max_tokens: No limit
  • reasoning_effort: Set to high for models that support this parameter
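
As an illustration of how unified parameters like these might be applied per model, here is a small sketch. It does not use or reproduce the Inspect framework's API. `UnifiedParams` and `generation_kwargs` are hypothetical names, and "no limit" is represented by simply omitting `max_tokens`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnifiedParams:
    temperature: float = 0.5
    max_tokens: Optional[int] = None  # None means no limit is imposed
    reasoning_effort: str = "high"


def generation_kwargs(supports_reasoning_effort: bool,
                      params: UnifiedParams = UnifiedParams()) -> dict:
    """Build per-model generation settings from the shared parameters."""
    kwargs = {"temperature": params.temperature}
    if params.max_tokens is not None:
        kwargs["max_tokens"] = params.max_tokens
    # Only pass reasoning_effort to models that accept it.
    if supports_reasoning_effort:
        kwargs["reasoning_effort"] = params.reasoning_effort
    return kwargs


print(generation_kwargs(supports_reasoning_effort=True))
print(generation_kwargs(supports_reasoning_effort=False))
```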

Notes:

  1. Some models' scores on MMLU-Pro, HLE, SciCode, and LiveCodeBench are sourced from Artificial Analysis. For models not covered by Artificial Analysis, we ran our own tests with the same test parameters as Artificial Analysis.
  2. Some models' scores on SciKnowEval, LAB-Bench, SOSBench, BioLP-Bench, WMDP, and ChemBench are sourced from arXiv:2507.16534.
  3. The FigQA benchmark requires models to support image input. For models that do not support image input, the scores of the visual models from the same company during the same period (if available) are used as substitutes. For instance, since the GLM 4.5 model does not support image input, the score of the GLM 4.5V model on FigQA is used as the score of GLM 4.5 on FigQA.
  4. For open-weight models (such as DeepSeek, Qwen, Llama, etc.), we directly use the APIs provided by cloud vendors (such as Alibaba Cloud, Google Cloud), and the specific deployment environment and parameters depend on the settings of the cloud vendors.
  5. Due to various technical reasons, some models cannot produce reliable results on certain benchmarks (for example, they may reject requests excessively during capability evaluation, or there may be incomplete compatibility between the model API and benchmark code, etc.). In such cases, we estimate the model's score on the benchmark using linear regression. The core algorithm is as follows:
import numpy as np
from sklearn.linear_model import LinearRegression


def fill_missing_score(model_id, missing_bench, benchmarks, data):
    # Estimate a missing benchmark score by averaging the predictions of
    # simple (one-feature) linear regressions, one per available benchmark.
    # This code is simplified, without exception handling.

    # Benchmarks on which this model already has a score
    available_benchmarks = [
        b for b in benchmarks if data[model_id][b]['value'] is not None
    ]

    predictions = []
    # Use each available benchmark to predict the missing one
    for available_bench in available_benchmarks:
        # Models that have scores on both the available and the missing benchmark
        complete_models_for_pair = [
            m for m in data
            if data[m][available_bench]['value'] is not None
            and data[m][missing_bench]['value'] is not None
        ]

        # Training data: x = available-benchmark score, y = missing-benchmark score
        X_train = np.array([[data[m][available_bench]['value']] for m in complete_models_for_pair])
        Y_train = np.array([data[m][missing_bench]['value'] for m in complete_models_for_pair])

        # Fit a simple linear regression for this benchmark pair
        reg = LinearRegression().fit(X_train, Y_train)

        # Predict the missing score from this model's available score
        X_test = np.array([[data[model_id][available_bench]['value']]])
        predictions.append(reg.predict(X_test)[0])

    # Final estimate: mean of the per-benchmark predictions
    estimated_value = float(np.mean(predictions))
    data[model_id][missing_bench]['estimated_value'] = estimated_value

Note: The above algorithm is simplified. In practice, when estimating a model's score on a specific benchmark, only scores from benchmarks in the same domain and of the same type are used. In addition, only scores from models released in the same quarter as the model, earlier than the model, or before Oct 1, 2025 are used.
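
The two restrictions in this note could be expressed as small filters applied before the regression. The sketch below is one possible reading, not the platform's code. The function names and the `catalog` layout (benchmark name mapped to a dict with `domain` and `kind` keys) are hypothetical, and the release-date rule is interpreted as "same calendar quarter as the target model, or released before it, or released before Oct 1, 2025".

```python
from datetime import date

CUTOFF = date(2025, 10, 1)  # date named in the note above


def quarter(d: date) -> tuple:
    """Return (year, quarter number) for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def is_eligible_reference(target_release: date, reference_release: date,
                          cutoff: date = CUTOFF) -> bool:
    """Decide whether another model's scores may be used as training data."""
    return (
        quarter(reference_release) == quarter(target_release)
        or reference_release < target_release
        or reference_release < cutoff
    )


def benchmarks_in_group(catalog: dict, domain: str, kind: str) -> list:
    """Keep only benchmarks of the same domain and type (capability/safety)."""
    return [
        name for name, meta in catalog.items()
        if meta["domain"] == domain and meta["kind"] == kind
    ]


target = date(2026, 2, 10)
print(is_eligible_reference(target, date(2026, 3, 20)))  # same quarter
print(is_eligible_reference(target, date(2026, 5, 1)))   # later quarter
```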

V. Metric Calculation

Based on benchmark results, we calculate three metrics for each model in each domain:

  • Capability Score $C$: The weighted average score of the model across various capability benchmarks. The higher the score, the stronger the model's capability and the higher the risk of misuse or loss-of-control. Score range: 0-100.
  • Safety Score $S$: The weighted average score of the model across various safety benchmarks. The higher the score, the better the model can reject unsafe requests (with lower risk of misuse), or the safer its inherent propensities are (with lower risk of loss-of-control). Score range: 0-100.
  • Risk Index $R$: A comprehensive score reflecting overall risk by combining capability score and safety score. Score range: 0-100.

The Risk Index calculation formula is:

$$R = C \times \left(1 - \frac{\beta \times S}{100}\right)$$

The considerations for designing this formula were:

  1. Risk Index $R$ is proportional to Capability Score $C$, representing that the model's risk is proportional to the model's capability.
  2. Risk Index $R$ is proportional to $\left(1 - \frac{\beta \times S}{100}\right)$, which can be understood as the model's unsafe propensities (generated internally or prompted by external instructions), indicating that the model's safety risks are directly proportional to its unsafe propensities.
  3. Consideration for designing the Safety Coefficient $\beta$: If a model did not exhibit any unsafe behavior in the safety evaluation, its Safety Score would be 100, but this doesn't mean that the model poses no risk. It's possible that our jailbreak methods are not advanced enough, that the harmful requests we tested are not comprehensive enough, or that the model deceived us, and the model may still perform unsafe actions in the real world. Additionally, the model might have its safety compromised through fine-tuning or other methods. Therefore, we designed a Safety Coefficient $\beta$ between 0 and 1, equivalent to discounting the model's Safety Score to reflect the fact that the model might pose risks not revealed by our tests.
  4. We set the Safety Coefficient for open-weight models to 0.6 and for proprietary models to 0.8 by default. Because open-weight models are substantially more likely to have their safety compromised through fine-tuning or other methods, their Safety Coefficient is lower. Although more people can find and help fix safety issues in open-weight models, this should already be captured in the safety score. Users can also adjust the Safety Coefficient settings themselves.
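
To make the formula concrete, the sketch below computes the three metrics for made-up benchmark scores. It assumes equal benchmark weights (as stated under Benchmark Selection) and the default Safety Coefficients of 0.6 and 0.8. The function names and the example scores are illustrative only and are not taken from the platform.

```python
from statistics import fmean

# Default Safety Coefficients stated above
DEFAULT_BETA = {"open_weight": 0.6, "proprietary": 0.8}


def domain_score(benchmark_scores):
    """Equal-weight average of benchmark scores on a 0-100 scale."""
    return fmean(benchmark_scores)


def risk_index(capability, safety, beta):
    """R = C * (1 - beta * S / 100), with beta between 0 and 1."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    return capability * (1.0 - beta * safety / 100.0)


# Made-up scores for a single model in a single domain
c = domain_score([70.0, 50.0, 60.0])  # Capability Score
s = domain_score([90.0, 80.0, 70.0])  # Safety Score

print(round(risk_index(c, s, DEFAULT_BETA["proprietary"]), 2))
print(round(risk_index(c, s, DEFAULT_BETA["open_weight"]), 2))
```

With a Capability Score of 60 and a Safety Score of 80, the same model scores about 21.6 as a proprietary model and about 31.2 as an open-weight model. Because $S$ cannot exceed 100, the formula also implies a floor of $R \ge C \times (1 - \beta)$: even a perfect Safety Score leaves 20% of the Capability Score as risk for proprietary models and 40% for open-weight models under the default coefficients.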

Note:

  1. The Risk Index is a highly simplified model of real-world risk. It cannot accurately measure the actual risk (such as the probability and impact of a catastrophe), but can only serve as a reference for relative comparisons (for example, between different models, different times, or different companies).
  2. Since benchmarks differ across domains, Risk Index scores across different domains are not comparable.
  3. To understand the statistical error of the Risk Index, we conducted multiple repeated tests on some models. The results show that the standard deviations of the Risk Indices in all domains are less than 0.4.
  4. The calculation methods for the Capability Score and Safety Score here are adapted from arXiv:2507.16534, while the Risk Index formula is our original contribution.
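
The repeatability check in note 3 amounts to computing the spread of the Risk Index over repeated runs of the same model. A minimal sketch follows, assuming the sample standard deviation is used (the source does not say whether it is the sample or the population form). The function name and the three run values are invented for illustration.

```python
from statistics import stdev


def risk_index_spread(repeated_risk_indices):
    """Sample standard deviation of Risk Index values from repeated runs."""
    if len(repeated_risk_indices) < 2:
        raise ValueError("need at least two repeated runs")
    return stdev(repeated_risk_indices)


# Invented Risk Index values from three repeated runs of one model
runs = [21.4, 21.6, 21.8]
print(round(risk_index_spread(runs), 3))
```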

VI. Risk Rating

Currently, we have not provided our own formal risk ratings for models, but we do provide model developers' self-assessment information for reference. For example, OpenAI rated GPT-5.4 as High capability in the cyber and biological risk domains, and Anthropic rated Claude Opus 4.6 as ASL-3 (systems that substantially increase the risk of catastrophic misuse or show low-level autonomous capabilities). For models without self-assessment information, users may refer to the risk ratings of OpenAI or Anthropic models with similar capability scores.

However, it should be noted that the models we currently test have undergone risk mitigation by their developers before release, whereas risk self-assessments by model developers often include pre-mitigation models. In addition, the benchmarks we use are different from those used by the model developers. Therefore, we cannot directly use the self-assessment results of the model developers as our risk rating results.

Current Limitations and Future Plans

The platform currently has many limitations, which we will continue to address:

  1. Expand the scope of risk assessment
    • Our current assessments are still mainly limited to large language models and some vision models. In the future, we will test more AI types, such as AI agents, multimodal models, and domain-specific models, to more comprehensively cover the most powerful AI tools on the market.
    • We currently cover five risk domains: cyber offense, biological risks, chemical risks, harmful manipulation, and loss-of-control. We plan to continue expanding into additional risk domains.
  2. Improve risk assessment methods
    • Current evaluation methods may not fully elicit model capabilities. In the future, we plan to use stronger jailbreak attacks, malicious fine-tuning, improved prompts, better tools, and domain expert red teaming to better probe the upper bounds of model capabilities and the lower bounds of their safety.
    • The current risk index calculation method is still relatively simple. In the future, we plan to build more precise threat models for more accurate risk assessment.
    • Our current risk assessment only considers how models empower attackers. In the future, we will also consider how models empower defenders, such as improving cybersecurity defense, to assess their overall impact on system safety and security more comprehensively.
    • We currently assume that benchmarks can effectively reflect models' true capabilities and safety levels. However, given the existence of evaluation awareness, we plan to further improve benchmark selection, design, and usage, and incorporate this factor into future versions of the Risk Index.
  3. Improve evaluation datasets
    • Current benchmarks are limited in number, may be contaminated, and are mostly English-language. In the future, we will incorporate more advanced multilingual benchmarks, optimize existing ones, and develop proprietary benchmarks where necessary to more accurately assess models' capabilities and safety.
    • To reduce leakage and contamination risks associated with open benchmarks, we will consider partially closed-source approaches when developing our own benchmarks, in order to better measure models' real capabilities and safety levels.

Opportunities for Collaboration

We welcome partnerships to build a comprehensive frontier AI risk monitoring ecosystem. We offer several collaboration models:

  • Existing Benchmark Integration: We can integrate advanced capability and safety benchmarks developed by partners in frontier AI risk domains and continuously track the performance of frontier models on these benchmarks.
  • New Benchmark Development: For key areas such as cyber offense, biological risks, chemical risks, and loss-of-control, we can collaborate with partners to develop missing benchmarks and improve the evaluation methods of existing ones.
  • Risk Assessment Research: Our risk assessment is not limited to benchmarking. We also hope to work with partners to measure the potential real-world harm of models through better threat modeling and real-world case analysis.
  • Pre-Release Evaluations: We can conduct frontier risk assessments for models developed by our partners before their release and provide mitigation recommendations to help ensure a safe launch.
  • Risk Information Sharing: We can share risk warning information generated by the platform with partners for timely response and mitigation of potential major risks.

We look forward to collaborating with partners from academia, industry, and policy institutions. Contact us at: risk-monitor@concordia-ai.com

Visit the Concordia AI official website to learn more about us.

Changelog

v1.5 (2026 Q1)

  • Added the “Harmful Manipulation” risk domain, with independent capability and safety evaluations.
  • Added capability benchmarks such as CyberSecEval3-MultiTurnPhishing, MakeMePay, MakeMeSay, and PMIYC to assess models' ability to induce payments, induce users to say specific content, and change beliefs.
  • Added safety benchmarks such as AirBench-Deception, AirBench-Manipulation, AirBench-PoliticalPersuasion, and APE to assess harmful tendencies in deception, manipulation, and political persuasion.
  • Updated benchmark combinations in the four existing domains of cyber offense, biological risks, chemical risks, and loss-of-control.
  • Added Fortress and ISC-Bench advanced red-teaming benchmark series to the cyber, biological, and chemical domains.
  • Introduced more targeted benchmarks in the loss-of-control domain, including Self-Proliferation, MLE-Bench, GDM-Stealth, Agentic-Misalignment, Shutdown-Resistance, and DarkBench, replacing some more general or less relevant benchmarks.
  • Removed LAB-Bench-CloningScenarios from the biological domain and increased the maximum message limit per CyBench task from 30 to 120.
  • Recomputed historical quarterly trends under the v1.5 framework to improve consistency in cross-quarter analysis.

v1.0 (2025 Q3)

  • Supported capability and safety evaluations in four domains: cyber offense, biological risks, chemical risks, and loss-of-control.
  • Supported calculation of Capability Score, Safety Score, and Risk Index.

Last updated: May 2026