AI Security Institute's evaluation finds GPT-5.5 matches Mythos Preview in cyber capabilities, becoming the second model to complete a multi-step cyberattack simulation.
Anthropic released BioMysteryBench, a benchmark for evaluating Claude's bioinformatics capabilities against human experts, with Mythos solving ~30% of questions that stumped specialists.
DeepSeek, Peking University, and Tsinghua release 'Thinking with Visual Primitives,' a multimodal reasoning framework that elevates spatial tokens—coordinates and bounding boxes—into minimal units of thought, enabling models to 'point' within images during chain-of-thought reasoning.
SenseTime released SenseNova-U1, open multimodal models unifying image understanding and generation using a novel architecture without visual encoders or VAEs, now available on HuggingFace.
Mayo Clinic researchers published results for Redmod, an AI system that detects pancreatic cancer from routine CT scans an average of 475 days before clinical diagnosis, a potential breakthrough for one of oncology's deadliest cancers and among the hardest to detect early.
NeMo-RL implements speculative decoding with a vLLM backend to accelerate rollouts during RL post-training of frontier language models, supporting both synchronous and asynchronous pipelines.
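The rollout-acceleration idea in the NeMo-RL item above can be sketched abstractly. This is a minimal greedy variant of speculative decoding, not NeMo-RL's implementation: a cheap draft model proposes a block of tokens and the target model keeps the longest prefix it agrees with, correcting the first disagreement. (In a real system the target verifies all k tokens in a single batched forward pass; the toy models here are plain callables from a token prefix to the next token.)

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding sketch: the draft model proposes k
    tokens; the target keeps the longest agreeing prefix and emits its
    own token at the first disagreement. Output is identical to greedy
    decoding with the target alone -- the draft only buys speed."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies the proposal: accept until the first mismatch,
        # then substitute the target's own token and restart drafting.
        for t in proposal:
            expected = target_next(out)
            if t == expected:
                out.append(t)
            else:
                out.append(expected)
                break
            if len(out) - len(prompt) >= max_new:
                break
    return out[len(prompt):]
```

Because acceptance is prefix-exact, the output matches what the target model would produce on its own, regardless of draft quality; a better draft only raises the accepted-tokens-per-verification ratio.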
Microsoft released TRELLIS.2, a 4B-parameter open-source image-to-3D model with native 3D VAEs achieving 16× spatial compression to generate high-fidelity PBR-textured assets up to 1536³ resolution.
Researchers proved the longstanding DS-dimension conjecture of Hanneke et al., establishing that maximum hypergraph density upper-bounds multiclass hypothesis class complexity and closing the sqrt(DS) sample-complexity gap.
HyLo converts pretrained Transformer LLMs into hybrid architectures combining MLA with linear blocks (Mamba2/Gated DeltaNet) via staged long-context training and distillation, preserving short-context quality.
CLAS dynamically adapts steering strength per-token based on context, consistently outperforming fixed linear activation steering and matching ReFT/LoRA performance with limited labeled data.
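The per-token adaptation in the CLAS item can be illustrated with one hypothetical gating choice (a sigmoid of the hidden state; CLAS's actual gating function is not specified here): instead of adding a fixed multiple of a steering vector to every token's activation, each token gets its own strength.

```python
import numpy as np

def adaptive_steer(hidden, steer_vec, gate_w, gate_b=0.0):
    """Context-adaptive activation steering sketch.
    hidden:    (seq, d) residual-stream activations
    steer_vec: (d,) steering direction
    gate_w/b:  hypothetical learned gate parameters
    Each token's steering coefficient is a sigmoid of its own hidden
    state, so tokens where the concept is relevant get steered harder."""
    alpha = 1.0 / (1.0 + np.exp(-(hidden @ gate_w + gate_b)))  # (seq,)
    return hidden + alpha[:, None] * steer_vec
```

Fixed-strength steering is the special case where the gate is constant; the claim in the item is that letting alpha vary with context closes most of the gap to parameter-efficient fine-tuning.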
An evaluation framework measuring LLM sycophancy in agentic financial tasks finds that models show only modest performance drops when contradicted, distinguishing financial settings from prior sycophancy findings.
The Ara protocol replaces narrative papers with machine-executable research packages, aiming to eliminate the "storytelling tax" and "engineering tax" paid by AI agents that must understand, reproduce, and extend published work.
An evaluation of four Claude models as AI research agents finds no unprompted sabotage, with refusal rates near zero for frontier models, though partial task completion was observed in continuation trajectories.
Skill Retrieval Augmentation paradigm enables agents to dynamically retrieve relevant skills from large external corpora on demand, with a new large-scale corpus and evaluation benchmarks.
Anthropic demonstrated AI agents autonomously negotiating and executing real commerce transactions as buyers and sellers in a controlled marketplace experiment.
TexOCR is a 2B-parameter model for reconstructing scientific PDFs into compilable LaTeX, trained on a new dataset (TexOCR-Train) via RL with LaTeX unit tests enforcing compilability.
Knowledge Capsules introduce structured nonparametric memory with a Key-Value Injection framework, representing relational knowledge as learned embeddings rather than raw text tokens.
Expert upcycling proposes expanding MoE capacity during continued pretraining by adding new experts to a trained model, reducing the cost of building large sparse models.
A reproducibility study reimplements 11 counterfactual explanation methods for recommender systems within a unified benchmarking framework.
EVPO proposes using explained variance (EV) from a single training batch to decide between critic-based (PPO) and critic-free (GRPO) RL for LLM post-training, proving EV identifies the exact boundary where a learned critic reduces vs. increases advantage variance.
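The EVPO gate above is easy to state concretely. Assuming the standard definition of explained variance over a batch (the threshold of zero is the natural boundary, since EV < 0 means the critic's residuals are noisier than the raw returns), a minimal sketch:

```python
import statistics

def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns).
    EV > 0: the critic's baselines reduce advantage variance.
    EV < 0: the critic adds noise relative to using raw returns."""
    residuals = [r - v for r, v in zip(returns, values)]
    var_ret = statistics.pvariance(returns)
    if var_ret == 0:
        return 0.0
    return 1.0 - statistics.pvariance(residuals) / var_ret

def choose_algorithm(returns, values, threshold=0.0):
    # Gate on a single training batch: a helpful critic -> critic-based
    # PPO; otherwise fall back to critic-free GRPO.
    return "PPO" if explained_variance(returns, values) > threshold else "GRPO"
```

For example, a critic that tracks returns closely (EV near 1) routes to PPO, while an anti-correlated critic (EV < 0) routes to GRPO.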
A study finds no evidence that GPT-5 or DeepSeek-R1 systematically games formalization when generating Lean 4 proofs, despite 87-99% compilation rates, whether using unified generation or a two-stage pipeline.
Unsupervised confidence calibration method for reasoning LLMs derives self-consistency proxy targets from offline sampling and distills them into a lightweight single-generation predictor, outperforming baselines across 9 models and 5 tasks under distribution shift.
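The proxy-target construction in the calibration item can be sketched in a few lines (the distillation step into a single-generation predictor is omitted; this shows only how labels are derived without ground truth):

```python
from collections import Counter

def self_consistency_targets(samples_per_question):
    """Unsupervised confidence labels from offline sampling: for each
    question, sample N answers and use the empirical frequency of the
    modal answer as the calibration target. These targets would then be
    distilled into a lightweight predictor scoring a single generation."""
    targets = []
    for answers in samples_per_question:
        top_count = Counter(answers).most_common(1)[0][1]
        targets.append(top_count / len(answers))
    return targets
```

High agreement across samples yields a target near 1, disagreement a target near 1/N, so the distilled predictor learns to mimic self-consistency at the cost of one generation.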
Large-scale trajectory analysis across 15 LLMs on 8 tasks reveals strong optimizers behave as local refiners producing frequent incremental improvements while localizing search, with zero-shot ability explaining only part of optimization variance.
MM-JudgeBench is the first large-scale multilingual-multimodal benchmark for LVLM judge evaluation, spanning 60K preference instances across 25 typologically diverse languages, revealing significant generalization failures in existing evaluators.
Apollo is a multimodal temporal foundation model trained on 25B records from 7.2M patients across 28 modalities, learning unified medical concept representations over 30 years of longitudinal data.
GSQ uses Gumbel-Softmax sampling to create a scalar quantizer that closes the accuracy gap with complex vector quantization methods for LLMs at low bit-widths.
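The GSQ mechanism can be sketched as follows. This is an illustrative scalar quantizer, not the paper's exact formulation: logits are negative squared distances to each codebook level, Gumbel noise plus a softmax gives a differentiable soft assignment, and the `hard` path snaps to the sampled level (the straight-through forward pass during training).

```python
import math
import random

def gumbel_softmax_quantize(x, levels, tau=0.5, hard=True, rng=random):
    """Quantize scalar x against codebook `levels` via Gumbel-Softmax.
    hard=True returns a discrete level (sampled argmax); hard=False
    returns the soft, differentiable convex combination of levels."""
    logits = [-(x - c) ** 2 for c in levels]
    # Standard Gumbel(0, 1) noise; clamp to avoid log(0).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in levels]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    if hard:
        return levels[max(range(len(levels)), key=probs.__getitem__)]
    return sum(p * c for p, c in zip(probs, levels))
```

Annealing tau toward zero sharpens the soft assignment into the discrete one, which is how such quantizers are typically trained end to end.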
The Adversarial Humanities Benchmark tests safety refusals against humanities-style transformations, finding a 55.75% attack success rate (ASR) across 31 frontier models versus 3.84% for the original attacks.
Research paper shows LLM agents discover task solutions 79-81% of the time but exploit them only 37-50%, revealing a fundamental curiosity gap in current agent systems.
Terminal Wrench releases 331 reward-hackable terminal-agent benchmark environments with 3,632 exploit trajectories across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, exposing systematic vulnerabilities in AI agent verification.
Humanoid robots competed in the Beijing half-marathon and reportedly beat the human world record holder by 10+ minutes, with a robot made by Honor leading the field, demonstrating significant advances in robot endurance and locomotion.
SemanticQA is a new evaluation suite consolidating multiword expression resources to test language models on semantic phrase processing tasks including idioms, noun compounds, and lexical collocations.
ASMR-Bench is a benchmark of 9 ML codebases with sabotaged variants designed to test auditors' ability to detect subtle implementation flaws; frontier LLMs and human auditors both struggled to reliably detect sabotage.
Google DeepMind released Gemini Robotics-ER-1.6, a vision-language model for robotics that shows incremental improvements, particularly with multi-camera setups.
Researchers demonstrate DNL/1P-DNL, a data-free method that disrupts neural networks by flipping just 1-2 sign bits, collapsing ResNet-50 accuracy by 99.8% on ImageNet and affecting detection/segmentation/LLMs.
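The fault model behind the DNL item is a single-bit corruption of a stored weight. A minimal reproduction of the bit-level mechanism (not the paper's data-free target-selection algorithm):

```python
import numpy as np

def flip_sign_bit(weights, index):
    """Flip the IEEE-754 sign bit of one float32 weight in place by
    reinterpreting the buffer as uint32 and XOR-ing bit 31. This is the
    kind of single-bit hardware fault the attack exploits; choosing
    WHICH bit to flip is the hard part the paper addresses."""
    bits = weights.view(np.uint32)  # same memory, integer view
    bits[index] ^= np.uint32(0x80000000)
    return weights
```

A sign flip negates the weight's entire contribution, which is why one or two well-chosen flips can collapse a network while random flips are usually harmless.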
LLMs show strong spatial transfer but fail on length scaling due to recursive instability, revealing systematic generalization limitations in sequential problem solving.
DAMP introduces one-shot weight surgery that removes forget-class directions from pretrained networks without retraining, addressing classifier-head suppression that makes prior class-unlearning methods ineffective.
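One simple instantiation of direction removal, the core operation behind the DAMP item (illustrative; not necessarily the paper's exact surgery): project every row of a weight matrix onto the orthogonal complement of the forget-class direction, in one shot and without retraining.

```python
import numpy as np

def remove_direction(W, v):
    """One-shot weight surgery sketch: W' = W (I - v v^T / ||v||^2).
    After the edit, W' @ v == 0, so the layer produces no response
    along direction v (e.g. a forget-class direction), while components
    orthogonal to v pass through unchanged."""
    v = v / np.linalg.norm(v)
    return W - np.outer(W @ v, v)
```

Simply zeroing the forget class's logit head leaves this directional information intact in earlier layers, which is the classifier-head-suppression failure the item refers to; projecting it out of the weights removes it.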
Simulated cortical neurogenesis from a single stem cell generates 85 neurons forming 200,400 synapses that reach 90%+ MNIST accuracy after one training epoch, revealing the computational potential of minimal circuits.
Physical Intelligence released π0.7, a general-purpose robot brain enabling robots to perform tasks never explicitly taught, marking early progress toward generalist manipulation.
CRAFT builds a Reasoning Knowledge Graph from consensus across LLM traces to synthesize high-quality reasoning, improving label-prediction accuracy by 10%+ on logical and math benchmarks.
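The consensus idea in the CRAFT item can be sketched at the graph level (illustrative of the consensus filter only, not the paper's full construction): treat each LLM trace as a sequence of step labels and keep only the step-to-step edges supported by a majority of traces.

```python
from collections import Counter

def consensus_graph(traces, min_support=0.5):
    """Build a reasoning-graph sketch from multiple LLM traces.
    Each trace is a list of step labels; an edge (s_i, s_{i+1}) joins
    consecutive steps. Edges appearing in at least `min_support` of the
    traces form the consensus graph; idiosyncratic steps are dropped."""
    edge_counts = Counter()
    for trace in traces:
        # Deduplicate within a trace so support counts traces, not repeats.
        edge_counts.update(set(zip(trace, trace[1:])))
    n = len(traces)
    return {edge for edge, c in edge_counts.items() if c / n >= min_support}
```

Steps that appear in only one trace fall below the support threshold, so the surviving graph encodes reasoning that independent traces agree on.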
AAAI-26 ran the first large-scale AI peer review deployment (22,977 papers, sub-1-day turnaround), with survey data comparing AI-generated reviews against human baselines at conference scale.
HiVLA decouples VLM-based semantic planning from diffusion transformer action control for robotic manipulation, preserving base VLM reasoning while enabling precise motor execution.
TREX automates full LLM fine-tuning via multi-agent collaboration (Researcher + Executor) modeled as a search tree, covering literature research through training and evaluation.
Complete symmetry classification for shallow ReLU networks reveals the geometric structure of the neuromanifold, with implications for understanding optimization dynamics in neural networks.
New multi-stakeholder framework grounds algorithmic fairness in welfare economics, modeling utilities of both decision-makers and decision subjects with a social planner objective.
Hierarchical RL framework with deterministic runtime safety shielding decouples high-level control from real-time feasibility enforcement for power grid operation.