Loading...
加载中...
Table of Contents

Frontier AI Risk Monitoring Platform

Introduction

The Frontier AI Risk Monitoring Platform is the first third-party platform in China dedicated to evaluating and monitoring catastrophic risks of frontier AI models. It is developed by Concordia AI. Through benchmark testing and data analysis, the platform conducts targeted assessments and regular monitoring of the misuse and loss-of-control risks of frontier models from leading AI model companies worldwide. It provides a dynamic understanding of emerging AI risks, offering a reference for policymakers, model developers, and AI safety researchers.

Background

Key contributions of this platform:

  1. Addressing The Need for AI Catastrophic Risk Management: AI systems are rapidly approaching capabilities that could enable unprecedented harms if misused or if they behave in unintended ways. From enabling sophisticated cyberattacks to potentially assisting with the development of biological weapons, these catastrophic risks demand specialized risk management approaches. Building on work such as the International AI Safety Report and Frontier AI Risk Management Framework (co-authored by Shanghai AI Lab and Concordia AI), the platform implements comprehensive evaluation across multiple risk domains including cyberattacks, biological risks, chemical risks, and loss-of-control. For each risk domain, we introduce at least three capability benchmarks and three safety benchmarks to deliver a quantitative and standardized assessment of each model's risk profile.
  2. Strengthening Monitoring and Early Warning of AI Risks: As AI systems become more capable and widely deployed, there is a critical need for systematic, independent monitoring of how frontier AI risks evolve over time. Without continuous tracking, society remains blind to emerging threats. This gap is increasingly recognized internationally. China's revised Cybersecurity Law (October 2025) emphasizes strengthening risk monitoring and assessment for AI, while the US AI Action Plan and EU AI Act require frontier AI developers to conduct evaluations. The platform addresses this need by providing regular, standardized assessments of catastrophic risks, enabling stakeholders to detect dangerous capability trends before they manifest as real-world harms.
  3. Overcoming Limitations of Current Risk Assessment Practices: Existing AI risk assessments suffer from critical shortcomings. Self-assessments by model developers lack standardization and independent verification, while ad-hoc third-party evaluations are often incomplete, covering limited models or risk types, and fail to track changes over time. The platform addresses these issues through: consistent testing parameters and scoring standards across all models, transparent methodology and public disclosure of all results, comprehensive coverage of leading global AI developers, and continuous quarterly monitoring to capture risk evolution rather than static snapshots.
  4. Supporting Evidence-Based AI Safety Decisions: Currently, stakeholders—from researchers to developers to policymakers—lack systematic information about which AI models present the greatest risks, how these risks are evolving, and where safety interventions are most needed. By providing standardized, comparable risk data across models and over time, the platform creates an empirical foundation for prioritizing safety research, guiding responsible development practices, and informing national and international governance approaches. The platform is also committed to supporting emerging global governance mechanisms such as the upcoming UN Independent International Scientific Panel on AI and the annual publication of the International AI Safety Report, providing these initiatives with timely, systematic risk monitoring data.

Platform Functions

The platform includes four functional modules:

  1. Key Monitoring Findings: View the latest and most important monitoring findings on the homepage.
  2. Domain Risk Analysis: On each domain's Risk Analysis page, you can view risk data for specific domains. Interactive charts allow you to compare models' Risk Index, Capability Scores, and Safety Scores, with filtering options by company, model type, and open-weight versus proprietary systems.
  3. Domain Evaluation Details: Dive deeper into our methodology. Each domain page explains the benchmarks we use and displays individual model performance across all tests. Customizable filters help you focus on specific companies or model categories.
  4. Quarterly Monitoring Reports: Access comprehensive analysis through our quarterly reports, which synthesize findings across all domains and track risk evolution over time. Subscribe to receive future reports as they're released.

Evaluation Methodology

I. Domain Definition

We reference the AI safety risk classification and identification frameworks from the International AI Safety Report, the AI Safety Governance Framework 2.0, and the Frontier Al Risk Management Framework, and define the following four risk domains:

  1. Cyber Offense: Primarily focuses on risks of AI misuse in the cybersecurity field, such as the use of AI to create malware.
  2. Biological Risks: Primarily focuses on risks of AI misuse in the biological field, such as the use of AI to design or synthesize pathogens of infectious diseases.
  3. Chemical Risks: Primarily focuses on risks of AI misuse in the chemical field, such as the use of AI to design or synthesize lethal toxins.
  4. Loss-of-Control: Primarily focuses on risks of autonomous AI getting out of human control with no clear path to regaining control, for example by unsupervised self-improving, self-replicating, and seeking power.

Some considerations:

  1. We conduct risk assessment for the biological and chemical domains separately instead of lumping them together into the “CBRN” risk category, as the threat realization pathways for biology and chemistry are independent. For instance, the task of “synthesizing lethal chemical toxins” does not use the model's biological capabilities.
  2. We treat loss-of-control as an overall risk domain rather than breaking it down into sub-domains such as self-improvement, self-replication, and strategic deception. This is because the capabilities in these sub-domains are all needed for loss-of-control to happen.

II. Benchmark Selection

We first select benchmarks. For each domain, we have two types of benchmarks:

  1. Capability Benchmarks: Benchmarks used to assess model capabilities, particularly capabilities that could be maliciously used (such as the capability to assist hackers in conducting cyberattacks) or lead to loss-of-control.
  2. Safety Benchmarks: Benchmarks used to assess model safety. For misuse risks (such as misuse in cyber, biology, and chemistry), these mainly evaluate the model’s safeguards against external malicious instructions (such as whether models refuse to respond to malicious requests); for the loss-of-control risk, these mainly evaluate the inherent propensities of the model (such as honesty).

Our standards for selecting benchmarks are as follows:

  • Relevance: Test tasks are highly relevant to models’ capabilities and safety within the risk domains we focus on.
  • Testability: Benchmarks should maintain accessible datasets and testing codes (or no code, but with specific prompts and scoring methods).
  • Recency: We select benchmarks from 2024 (inclusive) onwards, as earlier benchmarks could be saturated for current frontier models, lacking discrimination between models.
  • Diversity: Within the same domain, we select multiple benchmarks with different test tasks to ensure diversity.
  • Quality: To ensure the quality of benchmarks, we prioritize benchmarks recommended in existing research. For example:

Weight Setting: We have not yet found an objective method to assign different weights to different benchmarks. Therefore, we assign equal weight to all benchmarks.

The current list of selected benchmarks is as follows:

Domain Capability Evaluation Safety Evaluation
Cyber Offense
View Evaluation Details
WMDP-Cyber: Proxy hazardous cybersecurity knowledge
CyBench: CTF (Capture the Flag) capability
CyberSecEval2-VulnerabilityExploit: Code vulnerability exploitation capability
CyberSecEval3-MultiTurnPhishing: Cyber phishing attack capability
AirBench-SecurityRisks: Harmful cybersecurity questions
CyberSecEval2-InterpreterAbuse: Code interpreter abuse instructions
CyberSecEval2-PromptInjection: Prompt injection attacks
StrongReject: Harmful questions + jailbreak attacks
Biological Risks
View Evaluation Details
BioLP-Bench: Biological wet lab protocol troubleshooting capability
LAB-Bench-SeqQA: DNA and protein sequence comprehension capability
LAB-Bench-FigQA: Biological image understanding capability
LAB-Bench-CloningScenarios: Cloning experiments knowledge
WMDP-Bio: Proxy hazardous biological knowledge
SciKnowEval-ProteoToxicityPrediction: Protein toxicity prediction capability
SOSBench-Bio: Hazardous and illegal biological requests
SciKnowEval-BiologicalHarmfulQA: Harmful biological questions
StrongReject: Harmful questions + jailbreak attacks
Chemical Risks
View Evaluation Details
ChemBench-ToxicityAndSafety: Chemical toxicity and safety knowledge
WMDP-Chem: Proxy hazardous chemical knowledge
SciKnowEval-MolecularToxicityPrediction: Molecular toxicity prediction capability
SOSBench-Chem: Hazardous and illegal chemical requests
SciKnowEval-ChemicalHarmfulQA: Harmful chemical questions
StrongReject: Harmful questions + jailbreak attacks
Loss-of-Control
View Evaluation Details
MMLU-Pro: Multidisciplinary professional knowledge and reasoning capability
HLE: Humanity's Last Exam, cutting-edge knowledge and reasoning capabilities
SciCode: Scientific programming capability
LiveCodeBench: Competition-level programming capability
SAD-mini: Situational awareness capability
MASK: Model honesty
AirBench-Deception: Model propensity to deceive
StrongReject: Model propensity to answer harmful questions without jailbreaking

Note:

  1. Since we currently evaluate base models rather than AI agents, the capability benchmarks we have chosen mainly focus on knowledge and reasoning tasks, and do not involve agentic tasks that require using tools (such as SWE-Bench). The performance of models on agentic tasks depends not only on the model itself but also on the implementation of the agent framework (such as workflow, prompts, tools, etc.). In the future, we plan to conduct comprehensive evaluations of the combination of models and agent frameworks.
  2. For the loss-of-control domain, there is currently a lack of particularly targeted benchmarks. Therefore, we have selected some general knowledge and reasoning benchmarks (such as MMLU-Pro, HLE, etc.), which reflect the general capability of AI models. We believe that the higher the general capability of AI, the higher the risk of loss-of-control will be. Moreover, coding capability is also highly related to the loss-of-control risk (as it could help with self-improvement and self-replication). Therefore, we have chosen SciCode and LiveCodeBench as the two coding benchmarks (Artificial Analysis uses these two benchmarks to calculate their Code Index).

III. Model Selection

To comprehensively cover frontier AI models within a limited time and budget, we only selected breakthrough models from each frontier model company. Breakthrough models must simultaneously meet the following criteria:

  1. The model's general capability exceeds that of all models previously released by the company. To measure general capability, we refer to existing capability leaderboards, including Artificial Analysis, LMArena, CompassBench, etc.
  2. The model's general capability improvement over the previous generation is significant (such as Artificial Analysis Intelligence Index improvement of 2+ points).
  3. The model's general capability exceeds a certain threshold (such as Artificial Analysis Intelligence Index greater than 35).
  4. If a series of models are released simultaneously, we select the most capable one (e.g., among models with different parameter sizes, we select the largest one; among reasoning/non-reasoning models, we select the reasoning model)
    • Exception: For cost considerations, we have temporarily chosen the secondary models of some breakthrough models for testing (for example, o1-mini instead of o1 and Claude 4 Sonnet instead of Claude 4 Opus).

The current list of models we selected can be found on the Risk Analysis page.

IV. Model Testing

We use the open source Inspect framework to implement fully automated evaluation.

  • For benchmarks that natively support the Inspect framework (such as LAB-Bench), we directly use existing code for evaluation.
  • For benchmarks that do not natively support the Inspect framework (such as MASK), we reference their existing open source code and implement the same evaluation logic under the Inspect framework, ensuring consistency with existing open source implementations in dataset processing, prompts, scoring methods, etc.

We use unified testing parameters for all models and all benchmarks:

  • temperature: Set to 0.5
  • max_tokens: No limit. For models that support the reasoning_effort parameter, we set it to high.
  • epochs: Set to 1
  • judge_model/judge_llm: Some benchmarks require a judge model to score the tested model's responses. To ensure consistency, we used the same judge model (specifically DeepSeek-V3-250324) to evaluate all models.

Notes:

  1. The data of some models’ performance on the benchmarks MMLU-Pro, HLE, SciCode, and LiveCodeBench are sourced from Artificial Analysis. For models not covered by Artificial Analysis, we conducted our own tests and kept the test parameters consistent with those of Artificial Analysis.
  2. The data of some models’ performance on the benchmarks SciKnowEval, LAB-Bench, SOSBench, BioLP-Bench, WMDP, and ChemBench are sourced from arXiv:2507.16534.
  3. The FigQA benchmark requires models to support image input. For models that do not support image input, the scores of the visual models from the same company during the same period (if available) are used as substitutes. For instance, since the GLM 4.5 model does not support image input, the score of the GLM 4.5V model on FigQA is used as the score of GLM 4.5 on FigQA.
  4. For open-weight models (such as DeepSeek, Qwen, Llama, etc.), we directly use the APIs provided by cloud vendors (such as Alibaba Cloud, Google Cloud), and the specific deployment environment and parameters depend on the settings of the cloud vendors.
  5. Due to various technical reasons, some models cannot produce reliable results on certain benchmarks (for example, they may reject requests excessively during capability evaluation, or there may be incomplete compatibility between the model API and benchmark code, etc.). In such cases, we estimate the model's score on the benchmark using linear regression. The core algorithm is as follows:
def fill_missing_score(model_id, missing_bench, benchmarks, data):    
    # For missing benchmark, predict its score by averaging predictions
    # from simple linear regressions against each available benchmark.
    # This code is simplified, without exception handling

    # Get available scores for this model in the same domain and type
    available_benchmarks = [b for b in benchmarks if data[model_id][b]['value'] is not None]

    predictions = []
    # Use each available benchmark to predict the missing one
    for available_bench in available_benchmarks:
        # Find models that have scores for both the available benchmark and missing benchmark
        complete_models_for_pair = []
        for other_model in data.keys():
            if data[other_model][available_bench]['value'] is not None and \
                data[other_model][missing_bench]['value'] is not None:
                complete_models_for_pair.append(other_model)

        # Prepare training data for simple linear regression
        X_train = np.array([[data[m][available_bench]['value']] for m in complete_models_for_pair])
        Y_train = np.array([data[m][missing_bench]['value'] for m in complete_models_for_pair])

        # Train a simple linear regression model
        reg = LinearRegression().fit(X_train, Y_train)

        # Predict the missing score for the current model
        X_test = np.array([[data[model_id][available_bench]['value']]])
        prediction = reg.predict(X_test)[0]

        predictions.append(prediction)

    estimated_value = float(np.mean(predictions))
    data[model_id][missing_bench]['estimated_value'] = estimated_value

V. Metric Calculation

Based on benchmark results, we calculate three metrics for each model in each domain:

  • Capability Score CC: The weighted average score of the model across various capability benchmarks. The higher the score, the stronger the model's capability and the higher the risk of misuse or loss-of-control. Score range: 0-100.
  • Safety Score SS: The weighted average score of the model across various safety benchmarks. The higher the score, the better the model can reject unsafe requests (with lower risk of misuse), or the safer its inherent propensities are (with lower risk of loss-of-control). Score range: 0-100.
  • Risk Index RR: A comprehensive score reflecting overall risk by combining capability score and safety score. Score range: 0-100.

The Risk Index calculation formula is:

R=C×(1β×S100) R = C \times \left(1 - \frac{\beta \times S}{100} \right)

The considerations for designing this formula were:

  1. Risk Index RR is proportional to Capability Score CC, representing that the model's risk is proportional to the model's capability.
  2. Risk Index RR is proportional to (1β×S100)\left(1 - \frac{\beta \times S}{100} \right), which can be understood as the model's unsafe propensities (generated internally or prompted by external instructions), indicating that the model's safety risks are directly proportional to its unsafe propensities.
  3. Consideration for designing the Safety Coefficient β\beta: If a model did not exhibit any unsafe behavior in the safety evaluation, its Safety Score would be 100, but this doesn't mean that the model poses no risk. It's possible that our jailbreak methods are not advanced enough, that the harmful requests we tested are not comprehensive enough, or that the model deceived us, and the model may still perform unsafe actions in the real world. Additionally, the model might have its safety compromised through fine-tuning or other methods. Therefore, we designed a Safety Coefficient β\beta between 0 and 1, equivalent to discounting the model's Safety Score to reflect the fact that the model might pose risks not revealed by our tests.
  4. We set the Safety Coefficient for open-weight models to 0.6 and for proprietary models to 0.8 by default. Because open-weight models are substantially more likely to have their safety compromised through fine-tuning or other methods, their Safety Coefficient is lower. Although more people can find and help fix safety issues in open-weight models, this should already be captured in the safety score. Users can also adjust the Safety Coefficient settings themselves.

Note:

  1. The Risk Index is a highly simplified model of real-world risk. It cannot accurately measure the actual risk (such as the probability and impact of a catastrophe), but can only serve as a reference for relative comparisons (for example, between different models, different times, or different companies).
  2. Since benchmarks differ across domains, Risk Index scores across different domains are not comparable.
  3. To understand the statistical error of the Risk Index, we conducted multiple repeated tests on some models. The results show that the standard deviations of the Risk Indices in all domains are less than 0.4.
  4. The calculation methods for the Capability Score and Safety Score here are adapted from arXiv:2507.16534, while the Risk Index formula is our original contribution.

VI. Risk Rating

Currently, we have not provided risk rating results for models, but we provide model developers’ risk self-assessments for reference. For example, OpenAI rated o4-mini as Medium risk in biology/chemistry domains, and Anthropic rated Claude Sonnet 4 as ASL-2 (systems showing early signs of dangerous capabilities) and applied ASL-3 (systems that substantially increase the risk of catastrophic misuse or show low-level autonomous capabilities) safeguards for Claude Opus 4. For models without self-assessment information, users can refer to the risk ratings of OpenAI or Anthropic models with similar capability scores.

However, it should be noted that the models we currently test have undergone risk mitigation by their developers before release, whereas risk self-assessments by model developers often include pre-mitigation models. In addition, the benchmarks we use are different from those used by the model developers. Therefore, we cannot directly use the self-assessment results of the model developers as our risk rating results.

Current Limitations and Future Plans

The platform currently has many limitations, which we will continuously improve:

  1. Expand the scope of risk assessment
    • Currently, our assessments are limited to mostly large language models and some vision models. In future, we plan to test more types of AI, such as AI agents, multimodal models, and domain-specific models to cover the most powerful AI tools.
    • Currently, we are limited to 4 risk domains. In the future, we plan to expand to more risk domains, such as large-scale persuasion and harmful manipulation risks.
  2. Improve risk assessment methods
    • Current evaluation methods may not fully elicit model capabilities. In the future, we plan to use methods such as stronger jailbreak attacks, malicious fine-tuning, improved prompts, better tools, and expert red teaming to explore the upper limits of model capabilities and the lower limits of their safety as much as possible.
    • The current risk index calculation method is relatively simplistic. In future, we plan to establish more precise threat models to conduct more accurate risk assessments.
    • The current risk assessment only considers the model's empowerment of attackers. In future, we will also consider the model's empowerment of defenders (e.g., using models to enhance cybersecurity defense capabilities) to more comprehensively assess the impact on overall system safety/security.
  3. Improve evaluation datasets
    • Current benchmarks face limitations: narrow coverage, potential contamination, and English-language bias. We will integrate more advanced multilingual benchmarks, refine existing assessments, and develop proprietary benchmarks where necessary to achieve more accurate capability and safety measurements.

Opportunities for Collaboration

We welcome partnerships to build a comprehensive frontier AI risk monitoring ecosystem. We offer several collaboration models:

  • Existing Benchmark Integration: We can integrate advanced capability and safety benchmarks developed by partners in frontier AI risk domains and continuously track the performance of frontier models on these benchmarks.
  • New Benchmark Development: For key areas such as cyber offense, biological risks, chemical risks, and loss-of-control, we can collaborate with partners to develop missing benchmarks and improve the evaluation methods of existing ones.
  • Risk Assessment Research: Our risk assessment is not limited to benchmarking. We also hope to work with partners to measure the potential real-world harm of models through better threat modeling and real-world case analysis.
  • Pre-Release Evaluations: We can conduct frontier risk assessments for models developed by our partners before their release and provide mitigation recommendations to help ensure a safe launch.
  • Risk Information Sharing: We can share risk warning information generated by the platform with partners for timely response and mitigation of potential major risks.

We look forward to collaborating with partners from academia, industry, and policy institutions. Contact us at: risk-monitor@concordia-ai.com

Visit the Concordia AI official website to learn more about us.


Last updated: Nov 2025