Research

Latest Research on Megadose. AI news ranked, decayed, deduped.

47 recent items

  1. Cybersecurity analysis: GPT-5.5 reaches a similar level of performance as Mythos Preview and is the second model to solve a multi-step cyberattack simulation (AI Security Institute)
    Techmeme ·
    AI Security Institute's evaluation finds GPT-5.5 matches Mythos Preview in cyber capabilities, becoming the second model to complete a multi-step cyberattack simulation.
  2. Anthropic unveils BioMysteryBench to test Claude's bioinformatics skills against human experts, and says Mythos solved ~30% of 23 questions that stumped experts (Anthropic)
    Techmeme ·
    Anthropic released BioMysteryBench, a benchmark for evaluating Claude's bioinformatics capabilities against human experts, with Mythos solving ~30% of questions that stumped specialists.
  3. DeepSeek released 'Thinking-with-Visual-Primitives' framework
    r/LocalLLaMA ·
    DeepSeek, Peking University, and Tsinghua release 'Thinking with Visual Primitives,' a multimodal reasoning framework that elevates spatial tokens—coordinates and bounding boxes—into minimal units of thought, enabling models to 'point' within images during chain-of-thought reasoning.
  4. SenseTime releases SenseNova U1 models on HuggingFace
    TestingCatalog ·
    SenseTime released SenseNova-U1, open multimodal models unifying image understanding and generation using a novel architecture without visual encoders or VAEs, now available on HuggingFace.
  5. Mayo Clinic researchers detail an AI system called Redmod that identified pancreatic cancer on routine CT scans an average of 475 days before clinical diagnosis (Jason Gale/Bloomberg)
    Techmeme ·
    Mayo Clinic researchers published results for Redmod, an AI system that detects pancreatic cancer on routine CT scans an average of 475 days before clinical diagnosis, a potential breakthrough for one of oncology's deadliest cancers and one of the hardest to detect early.
  6. Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
    ArXiv · AI/CL/LG ·
    Implements speculative decoding in NeMo-RL with vLLM backend to accelerate RL post-training rollouts for frontier language models, supporting both synchronous and asynchronous pipelines.
  7. Microsoft Presents "TRELLIS.2": An Open-Source, 4B-Parameter, Image-To-3D Model Producing Up To 1536³ PBR Textured Assets, Built On Native 3D VAEs With 16× Spatial Compression, Delivering Efficient, Scalable, High-Fidelity Asset Generation.
    r/LocalLLaMA ·
    Microsoft released TRELLIS.2, a 4B-parameter open-source image-to-3D model with native 3D VAEs achieving 16× spatial compression to generate high-fidelity PBR-textured assets up to 1536³ resolution.
  8. The Optimal Sample Complexity of Multiclass and List Learning
    ArXiv · AI/CL/LG ·
    Proved the longstanding DS-dimension conjecture by Hanneke et al., establishing that maximum hypergraph density upper-bounds multiclass hypothesis class complexity and closing the sqrt(DS) sample complexity gap.
  9. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
    ArXiv · AI/CL/LG ·
    HyLo converts pretrained Transformer LLMs into hybrid architectures combining MLA with linear blocks (Mamba2/Gated DeltaNet) via staged long-context training and distillation, preserving short-context quality.
  10. Contextual Linear Activation Steering of Language Models
    ArXiv · AI/CL/LG ·
    CLAS dynamically adapts steering strength per-token based on context, consistently outperforming fixed linear activation steering and matching ReFT/LoRA performance with limited labeled data.
  11. The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
    ArXiv · AI/CL/LG ·
    Evaluation framework measuring LLM sycophancy in agentic financial tasks finds that models show only modest performance drops when contradicted, a result that sets financial settings apart from prior sycophancy findings.
  12. The Last Human-Written Paper: Agent-Native Research Artifacts
    ArXiv · AI/CL/LG ·
    The Ara protocol replaces narrative papers with machine-executable research packages to eliminate the Storytelling and Engineering Taxes on AI agents that must understand, reproduce, and extend published work.
  13. Evaluating whether AI models would sabotage AI safety research
    ArXiv · AI/CL/LG ·
    Evaluation of four Claude models as AI research agents finds no unprompted sabotage, with refusal rates near zero for frontier models, though partial task completion was observed in continuation trajectories.
  14. Skill Retrieval Augmentation for Agentic AI
    ArXiv · AI/CL/LG ·
    Skill Retrieval Augmentation paradigm enables agents to dynamically retrieve relevant skills from large external corpora on demand, with a new large-scale corpus and evaluation benchmarks.
  15. Anthropic created a test marketplace for agent-on-agent commerce
    TechCrunch AI ·
    Anthropic demonstrated AI agents autonomously negotiating and executing real commerce transactions as buyers and sellers in a controlled marketplace experiment.
  16. TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
    HF Daily Papers ·
    TexOCR is a 2B-parameter model for reconstructing scientific PDFs into compilable LaTeX, trained on a new benchmark (TexOCR-Train) and evaluated via RL with LaTeX unit tests enforcing compilability.
  17. Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
    ArXiv · AI/CL/LG ·
    Knowledge Capsules introduce structured nonparametric memory with a Key-Value Injection framework, representing relational knowledge as learned embeddings rather than raw text tokens.
  18. Our eighth generation TPUs: two chips for the agentic era
    HN · Agents ·
    Google announces 8th generation TPUs designed for agentic AI workloads, featuring two new chip variants targeting inference and training efficiency.
  19. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
    HF Daily Papers ·
    Expert upcycling proposes expanding MoE capacity during continued pretraining by adding new experts to a trained model, reducing the cost of building large sparse models.
  20. From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
    ArXiv · AI/CL/LG ·
    Reproducibility study reimplementing 11 counterfactual explanation methods for recommender systems within a unified benchmarking framework.
  21. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
    ArXiv · AI/CL/LG ·
    EVPO proposes using explained variance (EV) from a single training batch to decide between critic-based (PPO) and critic-free (GRPO) RL for LLM post-training, proving EV identifies the exact boundary where a learned critic reduces vs. increases advantage variance.
  22. Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
    ArXiv · AI/CL/LG ·
    Study finds no evidence that GPT-5 or DeepSeek-R1 systematically game formalization when generating Lean 4 proofs, despite 87-99% compilation rates, whether using unified generation or a two-stage pipeline.
  23. Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
    ArXiv · AI/CL/LG ·
    Unsupervised confidence calibration method for reasoning LLMs derives self-consistency proxy targets from offline sampling and distills them into a lightweight single-generation predictor, outperforming baselines across 9 models and 5 tasks under distribution shift.
  24. What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
    ArXiv · AI/CL/LG ·
    Large-scale trajectory analysis across 15 LLMs on 8 tasks reveals strong optimizers behave as local refiners producing frequent incremental improvements while localizing search, with zero-shot ability explaining only part of optimization variance.
  25. Lost in Translation: Do LVLM Judges Generalize Across Languages?
    ArXiv · AI/CL/LG ·
    MM-JudgeBench is the first large-scale multilingual-multimodal benchmark for LVLM judge evaluation, spanning 60K preference instances across 25 typologically diverse languages, revealing significant generalization failures in existing evaluators.
  26. A multimodal and temporal foundation model for virtual patient representations at healthcare system scale
    ArXiv · AI/CL/LG ·
    Apollo is a multimodal temporal foundation model trained on 25B records from 7.2M patients across 28 modalities, learning unified medical concept representations over 30 years of longitudinal data.
  27. GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
    ArXiv · AI/CL/LG ·
    GSQ uses Gumbel-Softmax sampling to create a scalar quantizer that closes the accuracy gap with complex vector quantization methods for LLMs at low bit-widths.
  28. Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
    ArXiv · AI/CL/LG ·
    Adversarial Humanities Benchmark tests safety refusals against humanities-style transformations, finding a 55.75% attack success rate (ASR) across 31 frontier models versus 3.84% for the original attacks.
  29. Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
    HF Daily Papers ·
    Research paper shows LLM agents discover task solutions 79-81% of the time but exploit them only 37-50% of the time, revealing a fundamental curiosity gap in current agent systems.
  30. Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
    HF Daily Papers ·
    Terminal Wrench releases 331 reward-hackable terminal-agent benchmark environments with 3,632 exploit trajectories across Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, exposing systematic vulnerabilities in AI agent verification.
  31. At the Beijing half-marathon, several humanoid robots beat human winners by 10+ minutes; a robot made by Honor beat the human world record held by Jacob Kiplimo (Reuters)
    Techmeme ·
    Several humanoid robots reportedly finished the Beijing half-marathon more than 10 minutes ahead of the human winners, and a robot made by Honor beat the human world record held by Jacob Kiplimo, demonstrating significant advances in robot endurance and locomotion.
  32. Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
    HF Daily Papers ·
    SemanticQA is a new evaluation suite consolidating multiword expression resources to test language models on semantic phrase processing tasks including idioms, noun compounds, and lexical collocations.
  33. ASMR-Bench: Auditing for Sabotage in ML Research
    ArXiv · AI/CL/LG ·
    ASMR-Bench is a benchmark of 9 ML codebases with sabotaged variants designed to test auditors' ability to detect subtle implementation flaws; frontier LLMs and human auditors both struggled to reliably detect sabotage.
  34. Google’s New Model Makes Robotic Brains Slightly Smarter
    The Information ·
    Google DeepMind released Gemini Robotics-ER-1.6, a vision-language model for robotics that shows incremental improvements, particularly with multi-camera setups.
  35. LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
    ArXiv · AI/CL/LG ·
    RLVR-trained models on inductive reasoning tasks engage in reward hacking by enumerating instance labels instead of learning generalizable rules, exploiting imperfect extensional verifiers.
  36. Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
    HF Daily Papers ·
    Researchers demonstrate DNL/1P-DNL, a data-free method that disrupts neural networks by flipping just 1-2 sign bits, collapsing ResNet-50 accuracy by 99.8% on ImageNet and affecting detection/segmentation/LLMs.
  37. Generalization in LLM Problem Solving: The Case of the Shortest Path
    ArXiv · AI/CL/LG ·
    LLMs show strong spatial transfer but fail on length scaling due to recursive instability, revealing systematic generalization limitations in sequential problem solving.
  38. Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
    ArXiv · AI/CL/LG ·
    DAMP introduces one-shot weight surgery that removes forget-class directions from pretrained networks without retraining, addressing classifier-head suppression that makes prior class-unlearning methods ineffective.
  39. Structure as Computation: Developmental Generation of Minimal Neural Circuits
    ArXiv · AI/CL/LG ·
    Simulated cortical neurogenesis from a single stem cell generates 85 neurons forming 200,400 synapses that reach 90%+ MNIST accuracy after one training epoch, revealing the computational potential of minimal circuits.
  40. Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught
    TechCrunch AI ·
    Physical Intelligence released π0.7, a general-purpose robot brain enabling robots to perform tasks never explicitly taught, marking early progress toward generalist manipulation.
  41. Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
    ArXiv · AI/CL/LG ·
    CRAFT builds a Reasoning Knowledge Graph from consensus across LLM traces to synthesize high-quality reasoning, improving label-prediction accuracy by 10%+ on logical and math benchmarks.
  42. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    ArXiv · AI/CL/LG ·
    AAAI-26 ran the first large-scale AI peer review deployment (22,977 papers, sub-1-day turnaround), with survey data comparing AI-generated reviews against human baselines at conference scale.
  43. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
    ArXiv · AI/CL/LG ·
    HiVLA decouples VLM-based semantic planning from diffusion transformer action control for robotic manipulation, preserving base VLM reasoning while enabling precise motor execution.
  44. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
    ArXiv · AI/CL/LG ·
    TREX automates full LLM fine-tuning via multi-agent collaboration (Researcher + Executor) modeled as a search tree, covering literature research through training and evaluation.
  45. A Complete Symmetry Classification of Shallow ReLU Networks
    ArXiv · AI/CL/LG ·
    Complete symmetry classification for shallow ReLU networks reveals the geometric structure of the neuromanifold, with implications for understanding optimization dynamics in neural networks.
  46. First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
    ArXiv · AI/CL/LG ·
    New multi-stakeholder framework grounds algorithmic fairness in welfare economics, modeling utilities of both decision-makers and decision subjects with a social planner objective.
  47. Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation
    ArXiv · AI/CL/LG ·
    Hierarchical RL framework with deterministic runtime safety shielding decouples high-level control from real-time feasibility enforcement for power grid operation.