WEAVER introduces a robotic manipulation world model designed for high-fidelity, long-horizon, efficient simulation from limited real-world interaction.
Anthropic’s Mythos Preview can reportedly convert disclosed vulnerabilities into working exploits within hours, advancing AI-assisted cyber offense capabilities.
Unified Neural Scaling Laws proposes a single functional form for extrapolating model performance across compute, data, parameters, and inference steps.
LoopMDM improves masked diffusion language models by looping transformer layers, cutting training compute while enabling inference-time compute scaling.
An LLM agent autonomously solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures using Lean formal verification, and has been deployed in real math research.
OpenAI's unreleased reasoning model generated a proof disproving the 1946 Erdős unit distance conjecture in discrete geometry, marking a notable capability demonstration in mathematical research.
OpenAI's AI model disproved a long-standing conjecture in discrete geometry, solving an 80-year-old unit distance problem and marking a milestone for AI-driven mathematical research.
OpenAI reported that one of its general-purpose reasoning models found a construction disproving the conjectured n^{1+O(1/log log n)} upper bound in Erdős's planar unit-distance problem, and published the full proof as a PDF.
Paper proves DPO's theoretical equivalence to RLHF is conditional on the RLHF-optimal policy preferring human-preferred responses—a frequently violated assumption causing pathological convergence.
HRM-Text replaces standard Transformers with a Hierarchical Recurrent Model using slow/fast layers, MagicNorm, and deep credit assignment for efficient pretraining.
SpecBench identifies reward hacking in long-horizon coding agents by decomposing tasks into specs, visible tests, and held-out composition tests that reveal true capability.
Agent JIT compilation turns task descriptions into executable code with embedded LLM and tool calls, reducing latency and errors in computer-use agents via validated multi-plan generation.
SpectralEarth-FM is a new hierarchical transformer foundation model for multisensor earth observation that jointly processes hyperspectral imagery with multispectral and SAR data, enabling unified EO pretraining across heterogeneous spectral dimensionality.
GoLongRL releases 23K RLVR samples and a complete long-context RL training pipeline across 9 task types, with a taxonomy of long-context capabilities guiding data construction.
OpenComputer provides a verifier-grounded framework for computer-use agents with 33 desktop apps and 1,000 machine-checkable tasks, including self-evolving verification and auditable partial-credit rewards.
A unified LLM-based optimization system achieves SOTA across six diverse tasks, discovering agent architectures that triple ARC-AGI accuracy, cutting cloud costs 40%, and generating competitive CUDA kernels.
Toto 2.0 releases five open-weights time series forecasting models (4M–2.5B params) demonstrating scaling laws and setting SOTA on BOOM, GIFT-Eval, and TIME benchmarks under Apache 2.0.
Lance is a native unified multimodal model with dual-stream MoE trained from scratch, supporting joint understanding and generation of images and video.
EnvFactory automates executable environment synthesis and robust RL training for tool-use agents, generating realistic multi-turn interaction data without costly real-world APIs.
GIM benchmark tests grounded integration of cognitive operations across 820 problems, separating reasoning capability from knowledge demands or abstract puzzles.
Mechanistic circuit analysis reveals a three-phase backdoor: trigger composition, orthogonal subspace propagation, and MLP-based language conversion in an 8B model.
CrossView Suite provides 450K cross-view instruction examples, a comprehensive benchmark, and an explicit alignment mechanism for MLLM spatial reasoning across viewpoints.