Frontier AI Risk Monitoring Platform

This platform tracks the risk landscape of frontier models [1] from leading AI companies across five domains: cyber offense, biological risks, chemical risks, harmful manipulation, and loss of control. It evaluates model performance across multiple capability benchmarks [2] and safety benchmarks [3], and calculates a Capability Score [4], a Safety Score [5], and a Risk Index [6] for each model in each domain.

Latest Insights

1. Risk trends are diverging structurally, while the loss-of-control Risk Index continues to rise

Under Risk Index v1.5 [7], the cyber offense, biological risks, chemical risks, and harmful manipulation domains have broadly followed the same pattern over the past year: both the Capability Score and the Safety Score rose, while the Risk Index first decreased and then rose slightly.

By contrast, in the loss-of-control domain, the Capability Score has continued to increase without a corresponding improvement in the Safety Score. As a result, the Risk Index has risen for three consecutive quarters, a cumulative increase of 51% that warrants attention.

Note: Because benchmarks differ across domains, Risk Indices are not comparable across different domains.


2. Risk profiles continue to diverge across model families

Overall, risk trajectories are no longer moving together across model families, and instead follow clearly different paths across domains. For example:

The Gemini family shows notably elevated Risk Indices in the loss-of-control domain.

The DeepSeek, GLM, and MiMo families remain in relatively high-risk ranges across most domains.

The Kimi family has seen relatively rapid Risk Index increases in the biological and chemical domains.

The GPT and Claude families remain in relatively low-risk ranges across most domains.


3. Proprietary models dominate the risk frontier in multiple domains

In the cyber offense, biological risks, harmful manipulation, and loss-of-control domains, the models on the high-capability, low-safety frontier are mostly proprietary.

The main gap between open-weight and proprietary models lies in capability rather than safety scores. The only exception is the chemical risk domain, where open-weight models outperform proprietary ones: Kimi K2.5 achieved the highest Capability Score, 4% above the top proprietary models.


4. Frontier models continue to break cyberattack capability records

Claude Opus 4.6 and GPT-5.4 set new highs on benchmarks for vulnerability exploitation, CTF tasks, and cyberattack knowledge, including VulnerabilityExploit, CyBench, and WMDP-Cyber.

Notably, the top CyBench score reached 80 for the first time, up 108% from three quarters ago, indicating substantial progress by frontier models on complex, long-horizon cyberattack tasks.
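
The growth figure above can be turned back into an approximate earlier score. The following is a minimal sketch, assuming that "up 108%" means a relative increase over the top score from three quarters ago and that the current top score is exactly 80 (the published figure may be rounded). The function name `implied_baseline` is illustrative and not part of the platform.

```python
def implied_baseline(current_score: float, pct_increase: float) -> float:
    """Return the earlier score implied by a relative percentage increase.

    Assumes current_score = baseline * (1 + pct_increase / 100).
    """
    if pct_increase <= -100:
        raise ValueError("pct_increase must be greater than -100")
    return current_score / (1 + pct_increase / 100)


# Assumption: top CyBench score of exactly 80 after a 108% relative increase.
baseline = implied_baseline(80, 108)
print(f"Implied top score three quarters ago: {baseline:.1f}")
```

Under these assumptions the earlier top score would have been roughly 38, which means the top score slightly more than doubled over the period.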

Note: The red dashed lines connect models that achieved new high scores. The same applies below.

5. ...but safety guardrails in the cyber offense domain remain fragile

Compared with rapidly improving capabilities, cyber safety guardrails remain insufficient. Although new models perform well on refusal benchmarks such as AirBench-SecurityRisks, with most scoring above 80, most models still score below 20 on advanced red-teaming benchmarks such as ISC-Bench-Cyber.

In addition, some model families, such as Claude and GPT, show substantial declines in safeguards in their latest versions.

Note: Higher scores on safety benchmarks indicate safer models. The same applies below.

6. Biological capabilities keep improving, but safeguards remain insufficient

On the capability side, more than half of the new 2026Q1 models surpassed the human expert baseline on biological experiment troubleshooting in BioLP-Bench, and GPT-5.4 reached human-expert-level performance on biological image understanding in LAB-Bench-FigQA for the first time.

On the safety side, although new models perform well on basic refusal tests such as SOSBench-Bio, with most scoring above 80, most models still score below 80 in advanced red-teaming scenarios such as Fortress-Biological.


7. Chemical capability growth remains modest, while safety guardrails still show major weaknesses

Over the past year, capability gains in the chemical risk domain have remained limited. On capability benchmarks such as WMDP-Chem, score differences across models remain small, and the top score has increased by only 1% over the past three quarters.

As for safety guardrails, although performance on basic refusal tests such as ChemicalHarmfulQA has improved, the weaknesses remain clear, with most models still scoring below 40. Performance on advanced red-teaming tests such as ISC-Bench-Chemical is also generally poor, with most models scoring below 20.


8. Harmful manipulation capabilities continue to improve, but unsafe propensities are clear

Frontier models continue to improve on harmful manipulation benchmarks: Gemini 3.1 Pro Preview holds a clear lead on MakeMePay, while Claude Opus 4.6 set a new record on MultiTurnPhishing.

At the same time, safety evaluations show that models keep getting better at refusing harmful manipulation tasks on AirBench-Manipulation, with most models scoring above 80. However, on APE, which measures the tendency to persuade others, most models score below 60, indicating clear unsafe propensities.


9. Capabilities relevant to loss-of-control continue to strengthen

On Self-Proliferation, scores have continued to rise, and the top score is now 45% higher than it was three quarters ago.

On MLE-Bench, capability has improved substantially, and the top score is now 44% higher than it was three quarters ago. This capability is necessary for model self-improvement.

On GDM-Stealth, most models still score below 20, though they are making progress. On SAD-mini, new models now generally score above 80. These capabilities may be necessary for models to engage in scheming.


10. ...but loss-of-control safety indicators have not improved in step

Compared with the rapid growth in capabilities, safety indicators related to loss-of-control have not improved correspondingly. These indicators measure models' propensities relevant to loss-of-control.

On propensity benchmarks such as MASK and Agentic-Misalignment, scores vary widely across models, ranging from 40 to 100, while overall improvement remains slow.

On DarkBench, which measures the tendency to influence users covertly, model scores are generally below 60, indicating widespread unsafe propensities.


Explanation of Terms

  1. Frontier Models: AI models that were at the industry's frontier when released. To cover as many frontier models as possible within limited time and budget, we select only the breakthrough model from each frontier AI company, meaning the model that was the company's most capable at the time of release. The specific criteria can be found here.

  2. Capability Benchmarks: Benchmarks used to evaluate model capabilities, especially those that could be maliciously misused or could contribute to loss of control. 

  3. Safety Benchmarks: Benchmarks used to evaluate model safety. For misuse risks, they mainly measure how well a model resists malicious external instructions. For loss-of-control and manipulation risks, they more often measure the model's internal propensities, honesty, and ability to suppress inappropriate behavior. 

  4. Capability Score $C$: The weighted average score of a model across capability benchmarks. The higher the score, the stronger the model's risky capabilities for misuse or loss of control.

  5. Safety Score $S$: The weighted average score of a model across safety benchmarks. The higher the score, the better the model is at refusing unsafe requests or exhibiting safer internal propensities.

  6. Risk Index $R$: A score that combines capability and safety to reflect overall risk. It is calculated as $R = C \times \left(1 - \frac{\beta \times S}{100}\right)$. The score ranges from 0 to 100. The Safety Coefficient $\beta$ adjusts the contribution of the Safety Score to the final Risk Index, reflecting possibilities such as safety benchmarks not covering all unsafe behavior, or previously safe models becoming unsafe through jailbreaks or malicious fine-tuning.

  7. Risk Index v1.5: The new risk framework adopted by the platform starting in 2026Q1. Compared with v1.0, it adds the harmful manipulation domain and updates the benchmark combinations in the other four domains, making the Risk Index better reflect the real risk frontier of today's frontier models. 
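
The definitions of the Capability Score, Safety Score, and Risk Index above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the benchmark weights, the example scores, and the value of the Safety Coefficient are all hypothetical, since the page does not publish them. The sketch also assumes that the coefficient lies between 0 and 1, which keeps the Risk Index within the stated 0 to 100 range when both scores are within 0 to 100.

```python
from typing import Mapping


def weighted_average(scores: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Weighted average of benchmark scores (used for both C and S)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def risk_index(capability: float, safety: float, beta: float) -> float:
    """Risk Index R = C * (1 - beta * S / 100)."""
    if not (0 <= capability <= 100 and 0 <= safety <= 100):
        raise ValueError("capability and safety must be in [0, 100]")
    if not 0 <= beta <= 1:
        # Assumption: beta in [0, 1] so that R stays within [0, 100].
        raise ValueError("beta must be in [0, 1]")
    return capability * (1 - beta * safety / 100)


# Hypothetical benchmark scores and weights, for illustration only.
capability = weighted_average({"bench_a": 60, "bench_b": 90},
                              {"bench_a": 1, "bench_b": 2})
safety = weighted_average({"refusal": 80, "red_team": 20},
                          {"refusal": 1, "red_team": 1})
print(risk_index(capability, safety, beta=0.5))
```

With a coefficient of 0 the Safety Score is ignored and the Risk Index equals the Capability Score. With a coefficient of 1, a model with a perfect Safety Score has a Risk Index of 0. Intermediate values discount the Safety Score to reflect guardrails that benchmarks miss or that can be removed.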
