AI Security Institute's evaluation finds GPT-5.5 matches Mythos Preview in cyber capabilities, becoming the second model to complete a multi-step cyberattack simulation.
Anthropic released BioMysteryBench, a benchmark for evaluating Claude's bioinformatics capabilities against human experts, with Mythos solving ~30% of questions that stumped specialists.
DeepSeek, Peking University, and Tsinghua release 'Thinking with Visual Primitives,' a multimodal reasoning framework that elevates spatial tokens—coordinates and bounding boxes—into minimal units of thought, enabling models to 'point' within images during chain-of-thought reasoning.
SenseTime released SenseNova-U1, open multimodal models unifying image understanding and generation using a novel architecture without visual encoders or VAEs, now available on HuggingFace.
Mayo Clinic researchers published results for Redmod, an AI system that detects pancreatic cancer from routine CT scans an average of 475 days before clinical diagnosis, a potential breakthrough for one of oncology's deadliest cancers and among the hardest to detect early.
NeMo-RL implements speculative decoding with a vLLM backend to accelerate rollouts during RL post-training of frontier language models, supporting both synchronous and asynchronous pipelines.
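The rollout-acceleration idea in the NeMo-RL item above can be sketched abstractly. This is a minimal greedy variant of speculative decoding, not NeMo-RL's implementation: a cheap draft model proposes a block of tokens and the target model keeps the longest prefix it agrees with, correcting the first disagreement. (In a real system the target verifies all k tokens in a single batched forward pass; the toy models here are plain callables from a token prefix to the next token.)

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding sketch: the draft model proposes k
    tokens; the target keeps the longest agreeing prefix and emits its
    own token at the first disagreement. Output is identical to greedy
    decoding with the target alone -- the draft only buys speed."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposal: accept until the first mismatch,
        # then substitute the target's own token and restart drafting.
        for t in proposal:
            expected = target_next(out)
            if t == expected:
                out.append(t)
            else:
                out.append(expected)
                break
            if len(out) - len(prompt) >= max_new:
                break
    return out[len(prompt):]
```

Because acceptance is prefix-exact, the output matches what the target model would produce on its own, regardless of draft quality; a better draft only raises the accepted-tokens-per-verification ratio.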
Microsoft released TRELLIS.2, a 4B-parameter open-source image-to-3D model with native 3D VAEs achieving 16× spatial compression to generate high-fidelity PBR-textured assets up to 1536³ resolution.
Researchers proved the longstanding DS-dimension conjecture of Hanneke et al., establishing that maximum hypergraph density upper-bounds multiclass hypothesis class complexity and closing the sqrt(DS) sample-complexity gap.
HyLo converts pretrained Transformer LLMs into hybrid architectures combining MLA with linear blocks (Mamba2/Gated DeltaNet) via staged long-context training and distillation, preserving short-context quality.
CLAS dynamically adapts steering strength per-token based on context, consistently outperforming fixed linear activation steering and matching ReFT/LoRA performance with limited labeled data.
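The per-token adaptation in the CLAS item can be illustrated with one hypothetical gating choice (a sigmoid of the hidden state; CLAS's actual gating function is not specified here): instead of adding a fixed multiple of a steering vector to every token's activation, each token gets its own strength.

```python
import numpy as np

def adaptive_steer(hidden, steer_vec, gate_w, gate_b=0.0):
    """Context-adaptive activation steering sketch.
    hidden:    (seq, d) residual-stream activations
    steer_vec: (d,) steering direction
    gate_w/b:  hypothetical learned gate parameters
    Each token's steering coefficient is a sigmoid of its own hidden
    state, so tokens where the concept is relevant get steered harder."""
    alpha = 1.0 / (1.0 + np.exp(-(hidden @ gate_w + gate_b)))  # (seq,)
    return hidden + alpha[:, None] * steer_vec
```

Fixed-strength steering is the special case where the gate is constant; the claim in the item is that letting alpha vary with context closes most of the gap to parameter-efficient fine-tuning.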
An evaluation framework measuring LLM sycophancy in agentic financial tasks finds that models show only modest performance drops when contradicted, distinguishing financial settings from prior sycophancy findings.
The Ara protocol replaces narrative papers with machine-executable research packages, aiming to eliminate the "storytelling tax" and "engineering tax" paid by AI agents that must understand, reproduce, and extend published work.
An evaluation of four Claude models as AI research agents finds no unprompted sabotage, with refusal rates near zero for frontier models, though partial task completion was observed in continuation trajectories.
Skill Retrieval Augmentation paradigm enables agents to dynamically retrieve relevant skills from large external corpora on demand, with a new large-scale corpus and evaluation benchmarks.
Anthropic demonstrated AI agents autonomously negotiating and executing real commerce transactions as buyers and sellers in a controlled marketplace experiment.
TexOCR is a 2B-parameter model for reconstructing scientific PDFs into compilable LaTeX, trained on a new dataset (TexOCR-Train) via RL with LaTeX unit tests enforcing compilability.
Knowledge Capsules introduce structured nonparametric memory with a Key-Value Injection framework, representing relational knowledge as learned embeddings rather than raw text tokens.
Expert upcycling proposes expanding MoE capacity during continued pretraining by adding new experts to a trained model, reducing the cost of building large sparse models.
A reproducibility study reimplements 11 counterfactual explanation methods for recommender systems within a unified benchmarking framework.
EVPO proposes using explained variance (EV) from a single training batch to decide between critic-based (PPO) and critic-free (GRPO) RL for LLM post-training, proving EV identifies the exact boundary where a learned critic reduces vs. increases advantage variance.
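The EVPO gate above is easy to state concretely. Assuming the standard definition of explained variance over a batch (the threshold of zero is the natural boundary, since EV < 0 means the critic's residuals are noisier than the raw returns), a minimal sketch:

```python
import statistics

def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns).
    EV > 0: the critic's baselines reduce advantage variance.
    EV < 0: the critic adds noise relative to using raw returns."""
    residuals = [r - v for r, v in zip(returns, values)]
    var_ret = statistics.pvariance(returns)
    if var_ret == 0:
        return 0.0
    return 1.0 - statistics.pvariance(residuals) / var_ret

def choose_algorithm(returns, values, threshold=0.0):
    # Gate on a single training batch: a helpful critic -> critic-based
    # PPO; otherwise fall back to critic-free GRPO.
    return "PPO" if explained_variance(returns, values) > threshold else "GRPO"
```

For example, a critic that tracks returns closely (EV near 1) routes to PPO, while an anti-correlated critic (EV < 0) routes to GRPO.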
A study finds no evidence that GPT-5 or DeepSeek-R1 systematically games formalization when generating Lean 4 proofs, despite 87-99% compilation rates, whether using unified generation or a two-stage pipeline.
Unsupervised confidence calibration method for reasoning LLMs derives self-consistency proxy targets from offline sampling and distills them into a lightweight single-generation predictor, outperforming baselines across 9 models and 5 tasks under distribution shift.
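The proxy-target construction in the calibration item can be sketched in a few lines (the distillation step into a single-generation predictor is omitted; this shows only how labels are derived without ground truth):

```python
from collections import Counter

def self_consistency_targets(samples_per_question):
    """Unsupervised confidence labels from offline sampling: for each
    question, sample N answers and use the empirical frequency of the
    modal answer as the calibration target. These targets would then be
    distilled into a lightweight predictor scoring a single generation."""
    targets = []
    for answers in samples_per_question:
        top_count = Counter(answers).most_common(1)[0][1]
        targets.append(top_count / len(answers))
    return targets
```

High agreement across samples yields a target near 1, disagreement a target near 1/N, so the distilled predictor learns to mimic self-consistency at the cost of one generation.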
Large-scale trajectory analysis across 15 LLMs on 8 tasks reveals strong optimizers behave as local refiners producing frequent incremental improvements while localizing search, with zero-shot ability explaining only part of optimization variance.
MM-JudgeBench is the first large-scale multilingual-multimodal benchmark for LVLM judge evaluation, spanning 60K preference instances across 25 typologically diverse languages, revealing significant generalization failures in existing evaluators.
Apollo is a multimodal temporal foundation model trained on 25B records from 7.2M patients across 28 modalities, learning unified medical concept representations over 30 years of longitudinal data.
GSQ uses Gumbel-Softmax sampling to create a scalar quantizer that closes the accuracy gap with complex vector quantization methods for LLMs at low bit-widths.
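The GSQ mechanism can be sketched as follows. This is an illustrative scalar quantizer, not the paper's exact formulation: logits are negative squared distances to each codebook level, Gumbel noise plus a softmax gives a differentiable soft assignment, and the `hard` path snaps to the sampled level (the straight-through forward pass during training).

```python
import math
import random

def gumbel_softmax_quantize(x, levels, tau=0.5, hard=True, rng=random):
    """Quantize scalar x against codebook `levels` via Gumbel-Softmax.
    hard=True returns a discrete level (sampled argmax); hard=False
    returns the soft, differentiable convex combination of levels."""
    logits = [-(x - c) ** 2 for c in levels]
    # Standard Gumbel(0, 1) noise; clamp to avoid log(0).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in levels]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    if hard:
        return levels[max(range(len(levels)), key=probs.__getitem__)]
    return sum(p * c for p, c in zip(probs, levels))
```

Annealing tau toward zero sharpens the soft assignment into the discrete one, which is how such quantizers are typically trained end to end.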
The Adversarial Humanities Benchmark tests safety refusals against humanities-style transformations, finding a 55.75% attack success rate (ASR) across 31 frontier models versus 3.84% for the original attacks.
Research paper shows LLM agents discover task solutions 79-81% of the time but exploit them only 37-50%, revealing a fundamental curiosity gap in current agent systems.
Terminal Wrench releases 331 reward-hackable terminal-agent benchmark environments with 3,632 exploit trajectories across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, exposing systematic vulnerabilities in AI agent verification.
Humanoid robots competed in the Beijing half-marathon and reportedly beat the human world record holder by 10+ minutes, with a robot made by Honor leading the field, demonstrating significant advances in robot endurance and locomotion.
SemanticQA is a new evaluation suite consolidating multiword expression resources to test language models on semantic phrase processing tasks including idioms, noun compounds, and lexical collocations.
ASMR-Bench is a benchmark of 9 ML codebases with sabotaged variants designed to test auditors' ability to detect subtle implementation flaws; frontier LLMs and human auditors both struggled to reliably detect sabotage.
Google DeepMind released Gemini Robotics-ER-1.6, a vision-language model for robotics that shows incremental improvements, particularly with multi-camera setups.
Researchers demonstrate DNL/1P-DNL, a data-free method that disrupts neural networks by flipping just 1-2 sign bits, collapsing ResNet-50 accuracy by 99.8% on ImageNet and affecting detection/segmentation/LLMs.
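The fault model behind the DNL item is a single-bit corruption of a stored weight. A minimal reproduction of the bit-level mechanism (not the paper's data-free target-selection algorithm):

```python
import numpy as np

def flip_sign_bit(weights, index):
    """Flip the IEEE-754 sign bit of one float32 weight in place by
    reinterpreting the buffer as uint32 and XOR-ing bit 31. This is the
    kind of single-bit hardware fault the attack exploits; choosing
    WHICH bit to flip is the hard part the paper addresses."""
    bits = weights.view(np.uint32)  # same memory, integer view
    bits[index] ^= np.uint32(0x80000000)
    return weights
```

A sign flip negates the weight's entire contribution, which is why one or two well-chosen flips can collapse a network while random flips are usually harmless.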
LLMs show strong spatial transfer but fail on length scaling due to recursive instability, revealing systematic generalization limitations in sequential problem solving.
DAMP introduces one-shot weight surgery that removes forget-class directions from pretrained networks without retraining, addressing classifier-head suppression that makes prior class-unlearning methods ineffective.
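One simple instantiation of direction removal, the core operation behind the DAMP item (illustrative; not necessarily the paper's exact surgery): project every row of a weight matrix onto the orthogonal complement of the forget-class direction, in one shot and without retraining.

```python
import numpy as np

def remove_direction(W, v):
    """One-shot weight surgery sketch: W' = W (I - v v^T / ||v||^2).
    After the edit, W' @ v == 0, so the layer produces no response
    along direction v (e.g. a forget-class direction), while components
    orthogonal to v pass through unchanged."""
    v = v / np.linalg.norm(v)
    return W - np.outer(W @ v, v)
```

Simply zeroing the forget class's logit head leaves this directional information intact in earlier layers, which is the classifier-head-suppression failure the item refers to; projecting it out of the weights removes it.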
Simulated cortical neurogenesis from a single stem cell generates 85 neurons forming 200,400 synapses that reach 90%+ MNIST accuracy after one training epoch, revealing the computational potential of minimal circuits.
Physical Intelligence released π0.7, a general-purpose robot brain enabling robots to perform tasks never explicitly taught, marking early progress toward generalist manipulation.
CRAFT builds a Reasoning Knowledge Graph from consensus across LLM traces to synthesize high-quality reasoning, improving label-prediction accuracy by 10%+ on logical and math benchmarks.
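The consensus idea in the CRAFT item can be sketched at the graph level (illustrative of the consensus filter only, not the paper's full construction): treat each LLM trace as a sequence of step labels and keep only the step-to-step edges supported by a majority of traces.

```python
from collections import Counter

def consensus_graph(traces, min_support=0.5):
    """Build a reasoning-graph sketch from multiple LLM traces.
    Each trace is a list of step labels; an edge (s_i, s_{i+1}) joins
    consecutive steps. Edges appearing in at least `min_support` of the
    traces form the consensus graph; idiosyncratic steps are dropped."""
    edge_counts = Counter()
    for trace in traces:
        # Deduplicate within a trace so support counts traces, not repeats.
        edge_counts.update(set(zip(trace, trace[1:])))
    n = len(traces)
    return {edge for edge, c in edge_counts.items() if c / n >= min_support}
```

Steps that appear in only one trace fall below the support threshold, so the surviving graph encodes reasoning that independent traces agree on.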
AAAI-26 ran the first large-scale AI peer review deployment (22,977 papers, sub-1-day turnaround), with survey data comparing AI-generated reviews against human baselines at conference scale.
HiVLA decouples VLM-based semantic planning from diffusion transformer action control for robotic manipulation, preserving base VLM reasoning while enabling precise motor execution.
TREX automates full LLM fine-tuning via multi-agent collaboration (Researcher + Executor) modeled as a search tree, covering literature research through training and evaluation.
Complete symmetry classification for shallow ReLU networks reveals the geometric structure of the neuromanifold, with implications for understanding optimization dynamics in neural networks.
New multi-stakeholder framework grounds algorithmic fairness in welfare economics, modeling utilities of both decision-makers and decision subjects with a social planner objective.
Hierarchical RL framework with deterministic runtime safety shielding decouples high-level control from real-time feasibility enforcement for power grid operation.