Loading...
加载中...
Table of Contents

Frontier AI Risk Monitoring Platform

Introduction

As AI models’ capabilities rapidly advance and their applications become increasingly widespread, their risks are also drawing more attention. For instance, AI may be misused to generate fake information or carry out cyberattacks, or it may cause harm due to inherent biases and unreliability within AI systems. As AI's capabilities and autonomy continue to improve, this will further amplify its risks, potentially causing large-scale social harm, threatening national security, and even posing a threat to the survival of humanity as a whole. Therefore, regular assessment, monitoring, and timely warning of AI model risks is crucial. Although there are many existing AI risk assessment practices, such as self-assessments by model developers (reported in their system cards) and third-party evaluations, the former often lack unified standards and neutrality, while the latter may be limited by incomplete model coverage, a narrow risk scope, or a lack of continuous monitoring.

The Frontier AI Risk Monitoring Platform is a third-party platform launched by Concordia AI specifically designed to assess and monitor the risks of frontier AI models. Through benchmark testing and data analysis, the platform conducts targeted assessments and regular monitoring of the misuse and loss-of-control risks of frontier models. It contributes to our evidence base and shared understanding of the emerging trends of AI risks, providing a reference for policymakers, model developers, and AI safety researchers to better understand and tackle these risks.

Platform Features

  • Focus on Catastrophic Risks: We focus on risks that could potentially cause catastrophic harm. Currently, we support AI risk assessment in four domains: cyber offense, biological risks, chemical risks, and loss-of-control.
  • Coverage of Frontier Models: We evaluated 52 breakthrough models (specific definition below) from 15 leading frontier model companies in China, the US, and the EU. Currently, we only evaluate large language models (including visual language models), but we will expand our coverage to AI agents, domain-specific models, and other model types in the future.
  • Multi-dimensional Evaluation: For each risk domain, we select at least 3 capability benchmarks and 3 safety benchmarks to conduct multi-dimensional quantitative assessment of each model's risks.
  • Continuous Monitoring: We continuously track newly released models from developers and updates to existing models to track risk trend changes. We plan to publish a risk monitoring report every quarter.
  • Open and Transparent: All specific evaluation methods and risk analysis methods that we use are open and transparent.
  • Independent and Objective: As a third-party institution, we apply consistent testing parameters and scoring standards to all models, ensuring fair and impartial results.

Platform Functions

  1. Key Monitoring Findings: View the latest and most important monitoring findings on the homepage.
  2. Domain Risk Analysis: On each domain's Risk Analysis page, you can view risk data for specific domains, including charts and specific values for each model's Risk Index, Capability Score, Safety Score, etc.
  3. Domain Evaluation Details: On each domain's Evaluation Details page, you can read more about how we evaluate the models, including introductions to each benchmark and each model's scores on each benchmark.
  4. Interpretation of Monitoring Data: A monitoring report is released every quarter. The latest report can be viewed here.

Evaluation Methodology

I. Domain Definition

We reference the AI safety risk classification and identification frameworks from the International AI Safety Report, the AI Safety Governance Framework 2.0, and the Frontier Al Risk Management Framework, and define the following four risk domains:

  1. Cyber Offense: Primarily focuses on risks of AI misuse in the cybersecurity field, such as the use of AI to create malware.
  2. Biological Risks: Primarily focuses on risks of AI misuse in the biological field, such as the use of AI to design or synthesize pathogens of infectious diseases.
  3. Chemical Risks: Primarily focuses on risks of AI misuse in the chemical field, such as the use of AI to design or synthesize lethal toxins.
  4. Loss-of-Control: Primarily focuses on risks of autonomous AI getting out of human control with no clear path to regaining control, for example by unsupervised self-improving, self-replicating, and seeking power.

Some considerations:

  1. We conduct risk assessment for the biological and chemical domains separately instead of lumping them together into the “CBRN” risk category, as the threat realization pathways for biology and chemistry are independent. For instance, the task of “synthesizing lethal chemical toxins” does not use the model's biological capabilities.
  2. We treat loss-of-control as an overall risk domain rather than breaking it down into sub-domains such as self-improvement, self-replication, and strategic deception. This is because the capabilities in these sub-domains are all needed for loss-of-control to happen.

II. Benchmark Selection

We first select benchmarks. For each domain, we have two types of benchmarks:

  1. Capability Benchmarks: Benchmarks used to assess model capabilities, particularly capabilities that could be maliciously used (such as the capability to assist hackers in conducting cyberattacks) or lead to loss-of-control.
  2. Safety Benchmarks: Benchmarks used to assess model safety. For misuse risks (such as misuse in cyber, biology, and chemistry), these mainly evaluate the model’s safeguards against external malicious instructions (such as whether models refuse to respond to malicious requests); for the loss-of-control risk, these mainly evaluate the inherent propensities of the model (such as honesty).

Our standards for selecting benchmarks are as follows:

  • Relevance: Test tasks are highly relevant to models’ capabilities and safety within the risk domains we focus on.
  • Testability: Benchmarks should maintain accessible datasets and testing codes (or no code, but with specific prompts and scoring methods).
  • Recency: We select benchmarks from 2024 (inclusive) onwards, as earlier benchmarks could be saturated for current frontier models, lacking discrimination between models.
  • Diversity: Within the same domain, we select multiple benchmarks with different test tasks to ensure diversity.
  • Quality: To ensure the quality of benchmarks, we prioritize benchmarks recommended in existing research. For example:

Weight Setting: We have not yet found an objective method to assign different weights to different benchmarks. Therefore, we assign equal weight to all benchmarks.

The current list of selected benchmarks is as follows:

Domain Capability Evaluation Safety Evaluation
Cyber Offense
View Evaluation Details
WMDP-Cyber: Proxy hazardous cybersecurity knowledge
CyBench: CTF (Capture the Flag) capability
CyberSecEval2-VulnerabilityExploit: Code vulnerability exploitation capability
CyberSecEval3-MultiTurnPhishing: Cyber phishing attack capability
AirBench-SecurityRisks: Harmful cybersecurity questions
CyberSecEval2-InterpreterAbuse: Code interpreter abuse instructions
CyberSecEval2-PromptInjection: Prompt injection attacks
StrongReject: Harmful questions + jailbreak attacks
Biological Risks
View Evaluation Details
BioLP-Bench: Biological wet lab protocol troubleshooting capability
LAB-Bench-SeqQA: DNA and protein sequence comprehension capability
LAB-Bench-FigQA: Biological image understanding capability
LAB-Bench-CloningScenarios: Cloning experiments knowledge
WMDP-Bio: Proxy hazardous biological knowledge
SciKnowEval-ProteoToxicityPrediction: Protein toxicity prediction capability
SOSBench-Bio: Hazardous and illegal biological requests
SciKnowEval-BiologicalHarmfulQA: Harmful biological questions
StrongReject: Harmful questions + jailbreak attacks
Chemical Risks
View Evaluation Details
ChemBench-ToxicityAndSafety: Chemical toxicity and safety knowledge
WMDP-Chem: Proxy hazardous chemical knowledge
SciKnowEval-MolecularToxicityPrediction: Molecular toxicity prediction capability
SOSBench-Chem: Hazardous and illegal chemical requests
SciKnowEval-ChemicalHarmfulQA: Harmful chemical questions
StrongReject: Harmful questions + jailbreak attacks
Loss-of-Control
View Evaluation Details
MMLU-Pro: Multidisciplinary professional knowledge and reasoning capability
HLE: Humanity's Last Exam, cutting-edge knowledge and reasoning capabilities
SciCode: Scientific programming capability
LiveCodeBench: Competition-level programming capability
SAD-mini: Situational awareness capability
MASK: Model honesty
AirBench-Deception: Model propensity to deceive
StrongReject: Model propensity to answer harmful questions without jailbreaking

Note:

  1. Since we currently evaluate base models rather than AI agents, the capability benchmarks we have chosen mainly focus on knowledge and reasoning tasks, and do not involve agentic tasks that require using tools (such as SWE-Bench). The performance of models on agentic tasks depends not only on the model itself but also on the implementation of the agent framework (such as workflow, prompts, tools, etc.). In the future, we plan to conduct comprehensive evaluations of the combination of models and agent frameworks.
  2. For the loss-of-control domain, there is currently a lack of particularly targeted benchmarks. Therefore, we have selected some general knowledge and reasoning benchmarks (such as MMLU-Pro, HLE, etc.), which reflect the general capability of AI models. We believe that the higher the general capability of AI, the higher the risk of loss-of-control will be. Moreover, coding capability is also highly related to the loss-of-control risk (as it could help with self-improvement and self-replication). Therefore, we have chosen SciCode and LiveCodeBench as the two coding benchmarks (Artificial Analysis uses these two benchmarks to calculate their Code Index).

III. Model Selection

To comprehensively cover frontier AI models within a limited time and budget, we only selected breakthrough models from each frontier model company. Breakthrough models must simultaneously meet the following criteria:

  1. The model's general capability exceeds that of all models previously released by the company. To measure general capability, we refer to existing capability leaderboards, including Artificial Analysis, LMArena, CompassBench, etc.
  2. The model's general capability improvement over the previous generation is significant (such as Artificial Analysis Intelligence Index improvement of 2+ points).
  3. The model's general capability exceeds a certain threshold (such as Artificial Analysis Intelligence Index greater than 35).
  4. If a series of models are released simultaneously, we select the most capable one (e.g., among models with different parameter sizes, we select the largest one; among reasoning/non-reasoning models, we select the reasoning model)
    • Exception: For cost considerations, we have temporarily chosen the secondary models of some breakthrough models for testing (for example, o1-mini instead of o1 and Claude 4 Sonnet instead of Claude 4 Opus).

The current list of models we selected can be found on the Risk Analysis page.

IV. Model Testing

We use the open source Inspect framework to implement fully automated evaluation.

  • For benchmarks that natively support the Inspect framework (such as LAB-Bench), we directly use existing code for evaluation.
  • For benchmarks that do not natively support the Inspect framework (such as MASK), we reference their existing open source code and implement the same evaluation logic under the Inspect framework, ensuring consistency with existing open source implementations in dataset processing, prompts, scoring methods, etc.

We use unified testing parameters for all models and all benchmarks:

  • temperature: Set to 0.5
  • max_tokens: No limit. For models that support the reasoning_effort parameter, we set it to high.
  • epochs: Set to 1
  • judge_model/judge_llm: Some benchmarks require a judge model to score the tested model's responses. To ensure consistency, we used the same judge model (specifically DeepSeek-V3-250324) to evaluate all models.

Notes:

  1. The data of some models’ performance on the benchmarks MMLU-Pro, HLE, SciCode, and LiveCodeBench are sourced from Artificial Analysis. For models not covered by Artificial Analysis, we conducted our own tests and kept the test parameters consistent with those of Artificial Analysis.
  2. The data of some models’ performance on the benchmarks SciKnowEval, LAB-Bench, SOSBench, BioLP-Bench, WMDP, and ChemBench are sourced from arXiv:2507.16534.
  3. The FigQA benchmark requires models to support image input. For models that do not support image input, the scores of the visual models from the same company during the same period (if available) are used as substitutes. For instance, since the GLM 4.5 model does not support image input, the score of the GLM 4.5V model on FigQA is used as the score of GLM 4.5 on FigQA.
  4. For open-weight models (such as DeepSeek, Qwen, Llama, etc.), we directly use the APIs provided by cloud vendors (such as Alibaba Cloud, Google Cloud), and the specific deployment environment and parameters depend on the settings of the cloud vendors.
  5. Due to various technical reasons, some models cannot produce reliable results on certain benchmarks (for example, they may reject requests excessively during capability evaluation, or there may be incomplete compatibility between the model API and benchmark code, etc.). In such cases, we estimate the model's score on the benchmark using linear regression. The core algorithm is as follows:
def fill_missing_score(model_id, missing_bench, benchmarks, data):    
    # For missing benchmark, predict its score by averaging predictions
    # from simple linear regressions against each available benchmark.
    # This code is simplified, without exception handling

    # Get available scores for this model in the same domain and type
    available_benchmarks = [b for b in benchmarks if data[model_id][b]['value'] is not None]

    predictions = []
    # Use each available benchmark to predict the missing one
    for available_bench in available_benchmarks:
        # Find models that have scores for both the available benchmark and missing benchmark
        complete_models_for_pair = []
        for other_model in data.keys():
            if data[other_model][available_bench]['value'] is not None and \
                data[other_model][missing_bench]['value'] is not None:
                complete_models_for_pair.append(other_model)

        # Prepare training data for simple linear regression
        X_train = np.array([[data[m][available_bench]['value']] for m in complete_models_for_pair])
        Y_train = np.array([data[m][missing_bench]['value'] for m in complete_models_for_pair])

        # Train a simple linear regression model
        reg = LinearRegression().fit(X_train, Y_train)

        # Predict the missing score for the current model
        X_test = np.array([[data[model_id][available_bench]['value']]])
        prediction = reg.predict(X_test)[0]

        predictions.append(prediction)

    estimated_value = float(np.mean(predictions))
    data[model_id][missing_bench]['estimated_value'] = estimated_value

V. Metric Calculation

Based on benchmark results, we calculate three metrics for each model in each domain:

  • Capability Score CC: The weighted average score of the model across various capability benchmarks. The higher the score, the stronger the model's capability and the higher the risk of misuse or loss-of-control. Score range: 0-100.
  • Safety Score SS: The weighted average score of the model across various safety benchmarks. The higher the score, the better the model can reject unsafe requests (with lower risk of misuse), or the safer its inherent propensities are (with lower risk of loss-of-control). Score range: 0-100.
  • Risk Index RR: A comprehensive score reflecting overall risk by combining capability score and safety score. Score range: 0-100.

The Risk Index calculation formula is:

R=C×(1β×S100) R = C \times \left(1 - \frac{\beta \times S}{100} \right)

The considerations for designing this formula were:

  1. Risk Index RR is proportional to Capability Score CC, representing that the model's risk is proportional to the model's capability.
  2. Risk Index RR is proportional to (1β×S100)\left(1 - \frac{\beta \times S}{100} \right), which can be understood as the model's unsafe propensities (generated internally or prompted by external instructions), indicating that the model's safety risks are directly proportional to its unsafe propensities.
  3. Consideration for designing the Safety Coefficient β\beta: If a model did not exhibit any unsafe behavior in the safety evaluation, its Safety Score would be 100, but this doesn't mean that the model poses no risk. It's possible that our jailbreak methods are not advanced enough, that the harmful requests we tested are not comprehensive enough, or that the model deceived us, and the model may still perform unsafe actions in the real world. Additionally, the model might have its safety compromised through fine-tuning or other methods. Therefore, we designed a Safety Coefficient β\beta between 0 and 1, equivalent to discounting the model's Safety Score to reflect the fact that the model might pose risks not revealed by our tests.
  4. We set the Safety Coefficient for open-weight models to 0.6 and for proprietary models to 0.8 by default. Because open-weight models are substantially more likely to have their safety compromised through fine-tuning or other methods, their Safety Coefficient is lower. Although more people can find and help fix safety issues in open-weight models, this should already be captured in the safety score. Users can also adjust the Safety Coefficient settings themselves.

Note:

  1. The Risk Index is a highly simplified model of real-world risk. It cannot accurately measure the actual risk (such as the probability and impact of a catastrophe), but can only serve as a reference for relative comparisons (for example, between different models, different times, or different companies).
  2. Since benchmarks differ across domains, Risk Index scores across different domains are not comparable.
  3. To understand the statistical error of the Risk Index, we conducted multiple repeated tests on some models. The results show that the standard deviations of the Risk Indices in all domains are less than 0.4.
  4. The calculation methods for the Capability Score and Safety Score here are adapted from arXiv:2507.16534, while the Risk Index formula is our original contribution.

VI. Risk Rating

Currently, we have not provided risk rating results for models, but we provide model developers’ risk self-assessments for reference. For example, OpenAI rated o4-mini as Medium risk in biology/chemistry domains, and Anthropic rated Claude Sonnet 4 as ASL-2 (systems showing early signs of dangerous capabilities) and applied ASL-3 (systems that substantially increase the risk of catastrophic misuse or show low-level autonomous capabilities) safeguards for Claude Opus 4. For models without self-assessment information, users can refer to the risk ratings of OpenAI or Anthropic models with similar capability scores.

However, it should be noted that the models we currently test have undergone risk mitigation by their developers before release, whereas risk self-assessments by model developers often include pre-mitigation models. In addition, the benchmarks we use are different from those used by the model developers. Therefore, we cannot directly use the self-assessment results of the model developers as our risk rating results.

Current Limitations and Future Plans

The platform currently has many limitations, which we will continuously improve:

  1. Expand the scope of risk assessment
    • Currently, we are limited to large language models (and some vision models). In the future, we will test more types of AI, such as AI agents, multimodal models, and domain-specific models, to comprehensively cover the most powerful AI tools on the market.
    • Currently, we are limited to 4 risk domains. In the future, we will expand to more risk domains, such as large-scale persuasion and harmful manipulation risks.
  2. Improve risk assessment methods
    • Current evaluation methods may not fully elicit model capabilities. In the future, we will use methods such as stronger jailbreak attacks, malicious fine-tuning, improved prompts, better tools, and expert red teaming to explore the upper limits of model capabilities and the lower limits of their safety as much as possible, in order to more accurately measure the risks.
    • The current risk index calculation method is relatively simplistic. In the future, we will establish more precise threat models to conduct more accurate risk assessments.
    • The current risk assessment only considers the model's empowerment of attackers. In the future, we will also consider the model's empowerment of defenders (e.g., using models to enhance cybersecurity defense capabilities) to more comprehensively assess the impact on overall system safety/security.
  3. Improve evaluation datasets
    • Addressing the limitations of the current benchmarks, which are few in number, potentially contaminated, and primarily in English, we will incorporate more advanced, multilingual benchmarks, optimize existing ones, and, if necessary, develop our own benchmarks to more accurately assess model capabilities and safety.

Opportunities for Collaboration

We are open to various forms of collaboration and hope to build a frontier AI risk assessment and monitoring ecosystem with industry peers:

  • Integration of Existing Benchmarks: We can integrate advanced capability and safety benchmarks developed by partners in frontier AI risk domains and continuously track the performance of frontier models on these benchmarks.
  • Collaborative Development of Benchmarks: For key areas such as cyber offense, biological risks, and loss-of-control, we can collaborate with partners to develop missing benchmarks and improve the evaluation methods of existing ones.
  • Collaborative Research on Risk Assessment: Our risk assessment is not limited to benchmarking. We also hope to work with partners to measure the potential real-world harm of models through better threat modeling and real-world case analysis.
  • Pre-release Risk Assessment for Models: We can conduct frontier risk assessments for models developed by our partners before their release and provide mitigation recommendations to help ensure a safe launch.
  • Risk Prevention and Mitigation Practices: We can share risk warning information generated by the platform with partners for timely response and mitigation of potential major risks.

We look forward to collaborating with partners from academia, industry, and policy institutions in these areas. Contact us at: risk-monitor@concordia-ai.com

Visit the Concordia AI official website to learn more about us.


Last updated: Nov 2025