Weekly Intelligence

AI Quick Bites

March 09, 2026 · 361 items from 12 sources

Last refreshed: March 09, 2026 at 10:14 UTC

Highlights

The five most consequential developments in AI this week — selected from 361 items across 12 sources. These are the things an AI engineer, researcher, or founder needs to know.

02
NOBLE's nonlinear low-rank branches deliver up to 1.47x pretraining step speedup with only 4% extra parameters and 7% step overhead — a drop-in architectural improvement validated across LLMs, ViT, and VQGAN.
arxiv 2026-03-09 18 min
03
COLD-Steer enables inference-time LLM behavior steering with 50x fewer labeled examples by approximating fine-tuning dynamics without any parameter updates — directly useful for alignment and personalization applications.
arxiv 2026-03-09 18 min
04
Paired 'data analogies' across embodiments beat large-scale unpaired datasets by 22.5% for cross-robot transfer — a concrete data curation insight for anyone building generalist robot policies.
arxiv 2026-03-09 20 min
05
Backdoor Modality Collapse reveals that multi-modal backdoor attacks in diffusion models are fundamentally weaker than assumed, with critical implications for how we evaluate and defend against multimodal model attacks.
arxiv 2026-03-09 18 min

What Changed This Week

Week-over-week diff showing new arrivals, items gaining momentum, and topics that dropped off the radar.

AI Security

Novel attack vectors, jailbreak research, red-teaming findings, and defensive tools across the AI security landscape. Only items with genuine technical substance make it here.

KeygraphHQ/shannon
8/10
Shannon Lite is a fully autonomous AI pentester for web apps and APIs, achieving 96.15% (100/104 exploits) on a hint-free variant of the XBOW benchmark — the strongest published result on this benchmark. Represents a significant capability milestone for autonomous offensive AI security agents.
github 2026-03-09 5 min
Hardening Firefox with Anthropic's Red Team
8/10
Anthropic's red team collaborated with Mozilla to discover and patch real Firefox security vulnerabilities, with bugs attributed to Claude in official Mozilla security advisories. First high-profile example of an AI red team finding confirmed CVEs in major production browser software.
hackernews 2026-03-09 8 min
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
7.5/10
ICLR 2026 paper introduces honesty fine-tuning techniques to make LLMs self-report hidden objectives, addressing a core alignment auditing challenge. Directly relevant to detecting deceptive or misaligned agentic systems.
conferences 2026-03-09 20 min
A tool that removes censorship from open-weight LLMs
7/10
Open-source tool for removing safety fine-tuning from open-weight LLMs, sparking substantial HN discussion (83 comments) about model alignment and the fragility of RLHF-based safety measures. Directly relevant to AI safety research and the open-weight model risk surface.
hackernews 2026-03-09 5 min
Show HN: Golf Scanner – OSS tool to find and audit every MCP server
7/10
Open-source Go binary that discovers all MCP servers configured across IDEs and runs security audits — addressing a real emerging risk as engineers routinely connect AI agents to production systems without vetting MCP server permissions.
hackernews 2026-03-09 5 min
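The core idea behind such a scanner can be sketched in a few lines: walk known client config files and list every configured MCP server before an agent is allowed to use them. The config paths below are illustrative assumptions (real clients and locations vary by OS and IDE); only the `mcpServers` key follows the common MCP config convention — this is not Golf Scanner's actual implementation.

```python
import json
from pathlib import Path

# Illustrative config locations only -- real MCP clients and paths vary.
CANDIDATE_CONFIGS = [
    Path.home() / ".cursor" / "mcp.json",
    Path.home() / "Library" / "Application Support" / "Claude"
    / "claude_desktop_config.json",
]

def discover_mcp_servers(paths=CANDIDATE_CONFIGS):
    """Return (config_file, server_name, command) for every configured server."""
    found = []
    for path in paths:
        if not path.is_file():
            continue
        try:
            config = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed config: skip, don't crash
        for name, spec in config.get("mcpServers", {}).items():
            found.append((str(path), name, spec.get("command", "?")))
    return found
```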
OBLITERATUS
7/10
OBLITERATUS by 'pliny-the-prompter' is a one-click model jailbreak/liberation tool and chat playground — high trending score (153) and AGPL license. Directly relevant to LLM safety red-teaming research.
huggingface_spaces 2026-03-09 3 min
THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics
7/10
THEMIS is a new multimodal benchmark for evaluating MLLMs on detecting scientific paper fraud (image manipulation, data fabrication) in real-world academic scenarios. Fills a genuine gap in evaluation tooling for integrity detection.
conferences 2026-03-09 20 min
Heads up: prompt injection payload targeting OpenClaw agents circulating in the wild
7/10
Real-world prompt injection payload targeting OpenClaw agents circulating in the wild, disguised as a post-context-compaction audit message to trick agents into reading attacker-controlled files. Concrete in-the-wild example of indirect prompt injection exploiting agent memory/tool-use patterns.
reddit 2026-03-09 3 min
The L in "LLM" Stands for Lying
6/10
High-engagement blog post (472 HN comments) arguing that LLM hallucination is structural rather than a fixable bug — the model generates plausible text rather than grounded truth. Useful framing for practitioners setting user expectations, though not novel research.
hackernews 2026-03-09 8 min
Anthropic Cowork feature creates 10GB VM bundle on macOS without warning
6/10
Claude Code's new Cowork feature silently downloads and installs a ~10GB VM bundle on macOS without user consent, raising significant concerns about transparency and permission models in agentic dev tools. 186 HN comments signal strong community concern.
hackernews 2026-03-09 5 min
steerling-8b
6/10
Steerling-8B is a novel causal diffusion LM with interpretability-first design, featuring concept-steering and masked diffusion architecture — interesting for alignment and interpretability research, though low downloads suggest early-stage.
huggingface_models 2026-03-09 4 min
Meta’s AI smart glasses and data privacy concerns
6/10
Investigation reveals Meta's Ray-Ban smart glasses workers have broad access to user video/audio data used for AI training, raising serious surveillance and data privacy concerns with real-world AI deployment implications.
hackernews 2026-03-09 6 min
This GitHub repo can permanently remove LLM censorship in 45 minutes. It's called Heretic.
6/10
Tweet about 'Heretic,' an open-source tool claiming to permanently remove refusal behaviors from local LLMs (Llama, Qwen, Gemma) via fine-tuning in 45 minutes — noteworthy as an alignment/safety concern and jailbreak-adjacent technique that bypasses prompt-level defenses entirely.
twitter 2026-03-09 2 min
BREAKING: researchers planted a single bad actor inside a group of LLM agents.
6/10
Research finding that a single malicious LLM agent embedded in a multi-agent network can prevent consensus — important adversarial robustness result for multi-agent system designers, though the tweet is thin on methodology.
twitter 2026-03-09 1 min
When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models
5/10
Discovers 'Backdoor Modality Collapse' in multimodal diffusion models — multi-modal attacks degenerate to single-modality dominance with negligible cross-modal interaction, revealing that high attack success rates mask fundamental reliance on a subset of modalities. Introduces TMA and CTI metrics to quantify this behavior.
arxiv 2026-03-09 18 min

Top Contributors

Authors and organizations making the biggest impact this week, ranked by cumulative relevance score across all sources.

Top Authors
#1
prithivMLmods
2 items · avg score 112.5
225.0
#2
FrameAI4687
1 item · avg score 182.0
182.0
#3
pliny-the-prompter
1 item · avg score 153.0
153.0
#4
r3gm
1 item · avg score 138.0
138.0
#5
multimodalart
2 items · avg score 66.5
133.0
#6
99.0
#7
mrfakename
1 item · avg score 88.0
88.0
#8
HuggingFaceM4
1 item · avg score 86.0
86.0
#9
microsoft
1 item · avg score 59.0
59.0
#10
selfit-camera
1 item · avg score 55.0
55.0
Top Organizations
#1
openclaw
1 item · avg score 358025.0
358025.0
#2
anthropics
3 items · avg score 70655.4
211966.3
#3
f
1 item · avg score 195658.0
195658.0
#4
shadcn-ui
1 item · avg score 141185.0
141185.0
#5
microsoft
1 item · avg score 117539.3
117539.3
#6
openai
2 items · avg score 47749.2
95498.4
#7
toeverything
1 item · avg score 85119.9
85119.9
#8
affaan-m
1 item · avg score 83857.0
83857.0
#9
ruvnet
4 items · avg score 17993.1
71972.2
#10
karpathy
2 items · avg score 34715.0
69430.0

Build Ideas

Actionable product ideas distilled from this week's highest-scoring research and discussions. Each includes specific use cases and the source material that inspired it.

LLM Acceptance Criteria Coach
A developer tool that prompts users to define explicit acceptance criteria, test cases, and edge cases before submitting any code-generation request to an LLM. The tool enforces a TDD-style discipline by blocking prompt submission until minimal criteria are set, then automatically validates LLM output against those criteria and flags hallucinated or untestable claims. This directly addresses the structural reliability gap that makes LLM-generated code unreliable in production.
IDE plugins for VS Code and JetBrains · CI/CD pipeline LLM code review gates · Enterprise AI coding assistants with audit trails · LLM-assisted test suite generation
https://blog.katanaquant.com/p/your-llm-... https://acko.net/blog/the-l-in-llm-stand...
Real-Time Streaming Voice
A production-ready streaming TTS middleware layer that wraps any LLM-based text-to-speech pipeline with prosodic boundary detection, enabling low-latency, natural-sounding synthesis from streaming text input. Using the boundary-aware early stopping technique from recent research, it prevents mid-word cuts and unnatural pauses that plague current real-time voice AI systems. This is a drop-in SDK for developers building voice agents, meeting bots, or read-aloud features.
Voice AI agents and call center bots · Real-time document and article read-aloud · Live captioning and audio narration tools · Multilingual voice interfaces for LLM apps
https://arxiv.org/abs/2603.06444v1
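The buffering idea can be sketched simply, using punctuation as a crude stand-in for learned prosodic boundaries (the cited research detects boundaries from model state, not regexes; this is an assumption-laden illustration, not the paper's method):

```python
import re

# Crude proxy for a prosodic boundary: clause-ending punctuation + whitespace.
BOUNDARY = re.compile(r"[.!?;,]\s")

def flush_on_boundaries(stream):
    """Yield synthesis-ready chunks only at likely prosodic boundaries,
    so the TTS engine never starts speaking mid-word or mid-clause."""
    buf = ""
    for token in stream:
        buf += token
        while (m := BOUNDARY.search(buf)):
            cut = m.end()
            yield buf[:cut].strip()
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

In a real pipeline each yielded chunk would be handed to the TTS engine while the LLM keeps streaming, bounding latency to roughly one clause.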
Robot Demo Pairing Studio
A data management and annotation platform for robotics teams that structures paired cross-embodiment demonstrations, making it easy to record, align, and export 'data analogies' between different robot morphologies. The 22.5% transfer improvement from paired data over large unpaired datasets means most teams are leaving performance on the table due to poor data organization. This tool closes the gap between raw demonstration recordings and policy-ready training sets.
Cross-embodiment policy transfer for warehouse robots · Generalist robot policy training pipelines · Academic robotics lab data management · Robot simulation-to-real transfer workflows
https://arxiv.org/abs/2603.06450v1
Inference Behavior Steering API
A lightweight inference-time activation steering API built on the COLD-Steer technique that lets developers steer LLM behavior — tone, persona, factual focus, safety constraints — using only a handful of in-context examples rather than fine-tuning. With 95% steering effectiveness at 50x lower sample cost than baselines, this makes behavior customization accessible without model ownership or GPU budgets. Offered as a middleware layer that wraps any open or proprietary LLM endpoint.
Enterprise LLM persona and brand voice control · Safety constraint enforcement at inference time · Domain-specific assistant behavior tuning · A/B testing LLM output styles in production
https://arxiv.org/abs/2603.06495v1
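COLD-Steer's exact mechanism is in the paper; as a rough illustration of the activation-steering family it belongs to, here is a minimal difference-of-means sketch. All names and the steering rule are generic assumptions, not the paper's algorithm:

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means direction computed from a handful of labeled
    example activations (shape: [n_examples, hidden_dim])."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vec, alpha=1.0):
    """Nudge a layer's hidden state toward the target behavior at
    inference time, with no parameter updates."""
    return hidden + alpha * vec
```

The appeal is that `vec` is computed once from a few examples and applied per-request, which is what makes sample costs so much lower than fine-tuning.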
Multimodal Backdoor Auditor
A security auditing tool for teams deploying multimodal diffusion models that automatically detects 'Backdoor Modality Collapse' — where adversarial attacks cause the model to silently rely on a single modality while appearing to use all inputs. Using the TMA and CTI diagnostic metrics introduced in recent research, the tool surfaces hidden modality dependencies and attack vulnerabilities before production deployment. This fills a critical gap in MLSecOps tooling for image-text and video-text pipelines.
Pre-deployment red-teaming for multimodal AI products · Compliance audits for AI systems in regulated industries · Research reproducibility checks on multimodal model claims · Continuous monitoring of fine-tuned diffusion models
https://arxiv.org/abs/2603.06508v1

Trending Repos

Repositories gaining serious momentum this week — sourced from GitHub Trending and TrendShift, enriched with commit velocity and contributor activity.

1
GH Trending
KeygraphHQ/shannon
TypeScript · 32,802 stars · 3,267 forks · 6,900 stars this week
Shannon Lite is a fully autonomous AI pentester for web apps and APIs, achieving 96.15% (100/104 exploits) on a hint-free variant of the XBOW benchmark — the strongest published result on this benchmark. Represents a significant capability milestone for autonomous offensive AI security agents.
Build idea
A continuous web application security service that automatically runs autonomous penetration tests on staging environments before each deployment, delivering detailed exploit reports and remediation guidance without requiring human security researchers.
2
TrendShift
karpathy/autoresearch
Python · 8,700 stars · 1,200 forks
Karpathy's AI agent system that autonomously runs ML research experiments on single-GPU nanochat training setups. Represents a concrete step toward self-directed AI research loops — high signal given the author's track record.
Build idea
A cloud platform where ML teams submit research hypotheses and receive fully automated experiment results, ablation studies, and model comparisons — turning a single GPU overnight into a junior researcher's week of work.
3
GH Trending
openai/codex
Rust · 64,048 stars · 8,538 forks · 1,437 stars this week
OpenAI's official lightweight terminal-based coding agent written in Rust, accumulating 64K stars rapidly. Signals OpenAI's move to productize agentic coding workflows natively in the terminal, competing directly with Claude Code.
Build idea
A managed enterprise terminal coding agent service that integrates with corporate codebases via SSO and VPN, providing audit logs, usage controls, and compliance guardrails on top of agentic coding workflows for regulated industries.
4
GH Trending
LMCache/LMCache
Python · 7,586 stars · 985 forks · 632 stars this week
High-performance KV cache layer for LLM inference that decouples cache from compute, enabling cross-instance cache sharing and significant latency/cost reduction. Gaining solid momentum (632 stars/week) as a production-grade inference optimization tool.
Build idea
A drop-in LLM inference optimization layer sold to AI-native companies that dramatically cuts their GPU costs by intelligently sharing and persisting KV caches across inference instances, offered as a managed service with a cost-savings guarantee.
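The core trick, reusing attention state across requests that share a token prefix, can be illustrated with a toy lookup. This is a conceptual sketch only, not LMCache's API: real systems store per-layer KV tensors, handle eviction, and share state across machines.

```python
class PrefixKVCache:
    """Toy cross-request KV cache keyed by token prefix."""

    def __init__(self):
        self._store = {}  # prefix tuple -> opaque KV state

    def put(self, tokens, kv_state):
        """Remember the KV state produced after prefilling `tokens`."""
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_length, kv_state) for the longest cached prefix
        of `tokens`, so prefill can resume from there instead of token 0."""
        for end in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None
```

With a shared system prompt of a few thousand tokens, skipping its prefill on every request is where the latency and GPU savings come from.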
5
GH Trending
QwenLM/Qwen-Agent
Python · 15,240 stars · 1,462 forks · 1,735 stars this week
Official agent framework from Alibaba's Qwen team supporting Qwen 3.0+, featuring MCP protocol integration, function calling, code interpreter, and RAG. Strong weekly growth (1,735 stars) signals growing adoption of Qwen as an agent backbone.
Build idea
A no-code enterprise AI agent builder that lets operations teams deploy Qwen-backed agents with pre-built connectors to internal tools like Salesforce, Jira, and Confluence, without writing a single line of code.
6
GH Trending
alibaba/OpenSandbox
Python · 7,058 stars · 520 forks · 3,959 stars this week
Alibaba's general-purpose sandboxed execution platform for AI agents, supporting multi-language SDKs, Docker/Kubernetes runtimes, and scenarios including coding agents, GUI agents, and RL training. Notable breakout week (3,959 stars) for a production-grade agent execution infrastructure.
Build idea
A sandboxed AI agent execution cloud that lets developers deploy untrusted or experimental agents in fully isolated, billable runtime environments with usage metering, making it safe and economical to run third-party AI agents in production.
7
TrendShift
anthropics/claude-code
Shell · 74,900 stars · 6,000 forks
Anthropic's official agentic coding CLI dominating GitHub trends with 74,900 stars and a surrounding ecosystem explosion this week. The anchor project driving most of the claude-skills/OpenClaw activity in this digest.
Build idea
A SaaS layer on top of Claude Code that provides team-wide session management, shared context, cost allocation by developer or project, and compliance logging for enterprises adopting agentic coding at scale.
8
GH Trending
anthropics/skills
Python · 87,793 stars · 9,307 forks · 7,152 stars this week
Anthropic's official public repository for Agent Skills — the canonical source for the claude-skills/OpenClaw ecosystem that has spawned dozens of derivative repos this week. 87,793 stars and 7,152 new this week make it the most-watched AI dev tools repo in this cycle.
Build idea
A marketplace where developers publish, monetize, and discover verified Claude Agent Skills — earning revenue-share each time their skill is invoked by other users' Claude Code workflows.
9
GH Trending
bytedance/deer-flow
Python · 26,391 stars · 3,117 forks · 3,150 stars this week
ByteDance's open-source SuperAgent framework with 26k+ stars that orchestrates research, coding, and content creation via sandboxed subagents with memory and tool use. Gaining serious traction as a multi-agent harness for complex long-horizon tasks.
Build idea
A research and competitive intelligence SaaS that deploys DeerFlow-based multi-agent pipelines to continuously monitor industries, synthesize findings from the web, and deliver structured briefings to executive teams on a scheduled basis.
10
GH Trending
inclusionAI/AReaL
Python · 4,570 stars · 381 forks · 969 stars this week
Fast reinforcement learning framework for LLM reasoning and agent training, emphasizing simplicity and flexibility. With nearly 1k stars gained this week, it's gaining traction as a practical alternative to complex RL pipelines for post-training LLMs.
Build idea
A fine-tuning platform for AI teams that uses AReaL to apply reinforcement learning post-training to custom LLMs, improving domain-specific reasoning (legal, medical, finance) without requiring deep RL expertise in-house.

Trending Developers

Developers gaining traction on GitHub this week — shipping open-source AI tools, models, and frameworks worth following.

1
Robert Allen
@zircote
zircote/rlm-rs
Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code, enabling processing of documents 100x larger than context windows via recursive summarization — practically useful for large-codebase agentic workflows.
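The recursive-summarization pattern behind RLM can be sketched in a few lines, assuming a fixed-window summarizer callable (`summarize` here is a placeholder for an actual LLM call, and must shrink its input for the recursion to terminate; this is not rlm-rs's implementation):

```python
def recursive_summarize(text, summarize, max_chars=4000):
    """Summarize text of any length with a fixed-window summarizer:
    split into window-sized chunks, summarize each, then recursively
    summarize the concatenated summaries until one window suffices."""
    if len(text) <= max_chars:
        return summarize(text)
    parts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    merged = "\n".join(summarize(p) for p in parts)
    return recursive_summarize(merged, summarize, max_chars)
```

A real implementation would chunk on token counts and semantic boundaries rather than raw character offsets.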
2
Benson Wong
@mostlygeek
mostlygeek/llama-swap
llama-swap enables reliable model swapping across local OpenAI/Anthropic-compatible servers including llama.cpp and vLLM. Practical tooling for local inference setups managing multiple models.
3
zhayujie
@zhayujie
zhayujie/chatgpt-on-wechat
CowAgent is a multi-platform LLM-powered super assistant supporting autonomous task planning, OS/web access, skill creation, and long-term memory, with integrations for WeChat, DingTalk, and Lark across OpenAI/Claude/Gemini/DeepSeek/Qwen backends.
4
Brady Gaster
@bradygaster
bradygaster/squad
Squad is a GitHub project for building AI agent teams. Minimal context provided to evaluate technical novelty.
5
David East
@davideast
davideast/stitch-mcp
CLI tool for bridging Google's Stitch AI UI design platform into developer workflows via MCP. Narrow use case with limited broader impact.
6
Nathan Brake
@njbrake
njbrake/agent-of-empires
Developer profile page listing AI coding agent tools (Claude Code, Codex CLI, etc.) — no substantive technical content.
7
Gunnar Morling
@gunnarmorling
gunnarmorling/1brc
Java performance challenge for aggregating 1B rows — not AI-related.
8
Karl Seguin
@karlseguin
karlseguin/http.zig
HTTP server for Zig — not AI-related.
9
Kim Morrison
@kim-em
kim-em/lean-zip
Lean theorem proving tooling — not directly AI/ML research.
10
Krille-chan
@krille-chan
krille-chan/fluffychat
Matrix messaging client — not AI-related.
11
mxsm
@mxsm
mxsm/rocketmq-rust
Apache RocketMQ reimplemented in Rust — not AI-related.
12
qixing-jk
@qixing-jk
qixing-jk/all-api-hub
API relay manager for LLM API keys — utility tool with minimal technical novelty.
13
Saúl Ibarra Corretgé
@saghul
saghul/txiki.js
Developer profile for txiki.js (tiny JS runtime) — not AI-related.
14
Stephen Berry
@stephenberry
stephenberry/glaze
C++ JSON/reflection library developer profile — not AI-related.
15
YuTengjing
@tjx666
tjx666/awesome-chrome-extension-boilerplate
Chrome extension boilerplate developer profile — not AI-related.
16
Yair Morgenstern
@yairm210
yairm210/Unciv
Open-source Civ V remake for Android/Desktop — not AI-related.
17
Austin Griffith
@austintgriffith
austintgriffith/ethskills
The missing knowledge between AI agents and production Ethereum.

Models & Benchmarks

New model releases, arena rankings, and benchmark results across frontier and open-source AI models this week.

Arena Leaderboard — Top 15
# · Model · Org · Type · Elo · Votes
1 claude-opus-4-6 Anthropic Closed 1504 9,170
2 claude-opus-4-6-thinking Anthropic Closed 1502 8,313
3 gemini-3.1-pro-preview Google Closed 1500 4,041
4 grok-4.20-beta1 xAI Closed 1491 5,280
5 gemini-3-pro Google Closed 1485 39,923
6 gpt-5.4-high OpenAI Closed 1479 3,503
7 gpt-5.2-chat-latest-20260210 OpenAI Closed 1479 5,786
8 gemini-3-flash Google Closed 1473 30,600
9 grok-4.1-thinking xAI Closed 1473 39,309
10 claude-opus-4-5-20251101-thinking-32k Anthropic Closed 1470 32,516
11 claude-opus-4-5-20251101 Anthropic Closed 1467 37,462
12 dola-seed-2.0-preview Bytedance Closed 1465 6,712
13 grok-4.1 xAI Closed 1462 43,536
14 gemini-3-flash (thinking-minimal) Google Closed 1462 22,846
15 gpt-5.4 OpenAI Closed 1457 3,417
New & Trending Models
sarvamai/sarvam-105b
1,389 downloads 178 likes 178 trending
Open Source 2026-03-03
Sarvam-105B is a new large multilingual model covering 22+ Indian languages with a custom MLA architecture, under Apache 2.0 — a landmark open model for Indic language AI with the highest trending score this week.
zai-org/GLM-5
228,106 downloads 1,750 likes 81 trending
Open Source 2026-02-11
GLM-5 from Zhipu AI (ZAI) is a MoE+DSA architecture model with 1750 likes, 228k downloads, MIT license, and strong benchmark results — major open-weight model release competing at the frontier.
LiquidAI/LFM2-24B-A2B
17,414 downloads 275 likes 50 trending
Custom License 2026-02-24
LiquidAI's LFM2-24B-A2B is a novel MoE architecture model with only 2B active parameters from 24B total, supporting 10 languages and targeting edge deployment; strong downloads (17k) and likes (275) suggest real community interest in the architecture.
MiniMaxAI/MiniMax-M2.5
435,012 downloads 1,134 likes 79 trending
Custom License 2026-02-12
MiniMax-M2.5 is a major open model release with 435k downloads and 1134 likes, using a custom MoE-style architecture with FP8 support and Azure deployment — one of the highest-traction new models this week.
Qwen/Qwen3-Coder-Next
1,176,160 downloads 1,092 likes 53 trending
Open Source 2026-01-30
Qwen3-Coder-Next from Alibaba's Qwen team is a next-generation code model with 1.17M downloads, signaling broad adoption and likely represents the upcoming iteration of the Qwen coding model line.
sarvamai/sarvam-30b
2,657 downloads 126 likes 126 trending
Open Source 2026-03-03
Sarvam-30B is the MoE sibling to Sarvam-105B, also covering 22+ Indian languages under Apache 2.0 — together these models represent a major milestone for open Indic-language foundation models.
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
15,720 downloads 280 likes 252 trending
Open Source 2026-02-27
Knowledge distillation of Claude 4.6 Opus reasoning traces into Qwen3.5-27B, producing a strong open-weight reasoning model with chain-of-thought; trending heavily with 280 likes and 15k+ downloads across the model family.
Nanbeige/Nanbeige4.1-3B
542,492 downloads 981 likes 66 trending
Open Source 2026-02-10
Nanbeige4.1-3B is a compact bilingual (EN/ZH) LLaMA-based model with an accompanying arXiv paper, 542k downloads and 981 likes — notable traction for a 3B model with academic backing.
allenai/Olmo-Hybrid-7B
16,395 downloads 40 likes 40 trending
Open Source 2026-01-28
AllenAI's OLMo-Hybrid-7B is a fully open (Apache-2.0) 7B model using a hybrid architecture, continuing the OLMo lineage of transparent, reproducible language model research.
guidelabs/steerling-8b
1,321 downloads 106 likes 22 trending
Open Source 2026-02-22
Steerling-8B is a novel causal diffusion LM with interpretability-first design, featuring concept-steering and masked diffusion architecture — interesting for alignment and interpretability research, though low downloads suggest early-stage.
openai/gpt-oss-20b
7,324,821 downloads 4,439 likes 30 trending
Open Source 2025-08-04
OpenAI's open-source 20B model (gpt-oss-20b) with 7.3M downloads and an associated arXiv paper represents a notable open weight release from OpenAI supporting vLLM, FP8, and MXFP4 quantization.
stepfun-ai/Step-3.5-Flash
254,792 downloads 697 likes 26 trending
Open Source 2026-02-01
StepFun's Step-3.5-Flash is a fast, Apache-2.0-licensed LLM with 254k downloads and accompanying arXiv papers. Competitive open-weight model from a Chinese lab worth benchmarking.
stepfun-ai/Step-3.5-Flash-Base
513 downloads 73 likes 73 trending
Open Source 2026-03-02
Base (pre-RLHF) version of Step-3.5-Flash, freshly released March 2026 with strong trending score. Useful for fine-tuning researchers wanting access to the raw pretrained weights.
tencent/Penguin-VL-8B
189 downloads 30 likes 30 trending
Open Source 2026-03-05
Tencent's Penguin-VL-8B is a vision-language model built on Qwen3-8B with a custom vision encoder, Apache-2.0 licensed. Competitive multimodal model from a major lab with an accompanying arXiv paper.
Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF
58,196 downloads 132 likes 123 trending
Open Source 2026-02-27
GGUF quantization of the Claude 4.6 Opus reasoning-distilled Qwen3.5-27B, enabling local inference of the distilled reasoning model with 58k downloads.

Trending Spaces

The hottest interactive demos and apps on HuggingFace Spaces this week — try them live.

UGI Leaderboard
DontPlanToEnd
docker · 1,561 likes · 34 trending
apache-2.0
The Uncensored General Intelligence Leaderboard tracks model behavior when safety guardrails are removed; useful reference for alignment researchers studying refusal training robustness, though the leaderboard itself is not new.
All Bench Leaderboard
FINAL-Bench
static · 37 likes · 37 trending
apache-2.0
A new aggregated benchmark leaderboard covering 90+ generative AI models across multiple evaluation metrics — useful for quick model comparisons but not technically novel.
Omni Video Factory
FrameAI4687
gradio · 444 likes · 182 trending
mit
Gradio space combining text-to-video, image-to-video, and video extension capabilities in one interface. High trending score but limited technical novelty — aggregates existing capabilities.
The Synthetic Data Playbook: Generating Trillions of the Finest Tokens
HuggingFaceFW
docker · 96 likes · 96 trending
HuggingFace FineWeb team's interactive playbook on generating synthetic training data at scale (trillions of tokens). Practical resource for practitioners building large-scale pretraining pipelines.
faster-qwen3-tts
HuggingFaceM4
docker · 134 likes · 86 trending
Optimized demo of Qwen3-TTS with faster inference, showing HuggingFace M4 team's work on accelerating the model. Useful for those evaluating open TTS options.
LFM2.5 1.2B Thinking WebGPU
LiquidAI
static · 87 likes · 40 trending
Liquid AI's LFM2.5-1.2B reasoning model running entirely in-browser via WebGPU — demonstrates non-transformer architecture viability for edge/client-side inference.
Qwen3-TTS Demo
Qwen
gradio · 1,645 likes · 51 trending
apache-2.0
Official demo for Qwen3-TTS, Alibaba's text-to-speech model with 1645 likes indicating strong community interest. Apache-2.0 licensed and competitive with commercial TTS offerings.
Wan2.2 Animate
Wan-AI
gradio · 4,896 likes · 43 trending
apache-2.0
Wan2.2 animation demo from Wan-AI with nearly 5k likes, one of the most popular open video generation spaces. Apache-2.0 licensed with strong community adoption.
FLUX.2 [Klein] 9B
black-forest-labs
gradio · 632 likes · 38 trending
Black Forest Labs' FLUX.2 Klein 9B image generation model demo — their latest iteration with 632 likes. Notable as an official release from the leading open image generation lab.
Free Unlimited Google Veo 3
deddytoyota
static · 48 likes · 31 trending
Unofficial space claiming free unlimited access to Google Veo 3 with NSFW content — likely a wrapper or spam. Not technically substantive.
Flux2 Klein Face Swap
linoyts
gradio · 89 likes · 33 trending
Face swap application built on FLUX.2 Klein 9B using LoRA fine-tuning. Demonstrates downstream application of FLUX.2 but limited technical novelty.
TRELLIS.2
microsoft
gradio · 1,206 likes · 59 trending
mit
Microsoft's TRELLIS.2 generates high-fidelity 3D assets from single images, with 1206 likes and MIT license. Second generation of a strong open-source 3D generation system.
Z Image Turbo
mrfakename
gradio · 2,497 likes · 85 trending
High-traction image generation space (2497 likes, trending breakout) suggesting a fast turbo-mode image model. Community adoption signals competitive quality/speed tradeoff.
Nano Banana PRO
multimodalart
gradio · 576 likes · 31 trending
mit
Nano Banana image generation demo exclusive to HuggingFace PRO users — 576 likes but limited public accessibility reduces broad relevance.
Qwen Image Multiple Angles 3D Camera
multimodalart
gradio · 1,862 likes · 96 trending
Uses Qwen vision model to generate consistent multi-angle views of objects with simulated 3D camera control — 1862 likes signals strong practitioner interest in novel controllable generation.

Conference Papers

Accepted papers from top AI conferences via OpenReview.

Showing accepted papers from active venues. Next deadlines: ICML 2026 (submissions open), NeurIPS 2026 (coming soon).

ICLR 2026 Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett et al. 2026-03-09
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
ICLR 2026 paper presents Common Corpus, claimed to be the largest openly licensed dataset for LLM pre-training, directly addressing copyright concerns in foundation model training. Critical resource for researchers needing legally clean training data at scale.
dataset · pre-training · large language models · open data · open science
ICLR 2026 Mouath Abu Daoud, Leen Kharouf, Omar El Hajj et al. 2026-03-09
MedAraBench: Large-scale Arabic Medical Question Answering Dataset and Benchmark
MedAraBench provides a large-scale Arabic medical QA benchmark to address a significant NLP resource gap. Useful for multilingual medical AI researchers but narrow scope.
Dataset · Benchmark · Large Language Models · Arabic Natural Language Processing · Medical Question Answering
ICLR 2026 Zhiheng Chen, Ruofan Wu, Guanhua Fang et al. 2026-03-09
Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures
Theoretical ICLR 2026 paper formalizing transformers as unsupervised learning algorithms through the lens of Gaussian Mixture Models, offering new in-context learning theory. Niche theoretical interest.
In-context learning · Gaussian Mixture Models · Theory
ICLR 2026 Ron Vainshtein, Zohar Rimon, Shie Mannor et al. 2026-03-09
Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models
Task Tokens proposes a lightweight adapter method for behavior foundation models to support new tasks via token conditioning, enabling flexible humanoid control without full retraining. Incremental but practical contribution for robotics foundation models.
Reinforcement Learning · Hierarchical Reinforcement Learning · Behavior Foundation Models · Humanoid Control
ICLR 2026 Kaien Sho, Shinji Ito 2026-03-09
Submodular Function Minimization with Dueling Oracle
Theoretical work on submodular function minimization using pairwise comparison oracles, applicable to preference-based optimization. Highly specialized theoretical contribution with tangential ML relevance.
submodular minimization · dueling oracle · preference-based optimization
ICLR 2026 Rongjin Li, Zichen Tang, Xianghe Wang et al. 2026-03-09
Not Search, But Scan: Benchmarking MLLMs on Scan-Oriented Academic Paper Reasoning
Introduces a benchmark testing MLLMs on scan-oriented reasoning over academic papers — a harder task than search — exposing significant capability gaps in autonomous research assistance. Useful for evaluating document-understanding models.
Multimodal Large Language Models · Academic Paper Reasoning · Scan-Oriented Reasoning
ICLR 2026 Peng Sun, Tao Lin 2026-03-09
Any-step Generation via N-th Order Recursive Consistent Velocity Field Estimation
Proposes N-th order recursive consistent velocity field estimation to enable flexible step-count generation in diffusion/flow models without multi-component losses. Incremental improvement over consistency models.
Generative Models
ICLR 2026 Zeyu Feng, Haiyan Yin, Yew-Soon Ong et al. 2026-03-09
Masked Skill Token Training for Hierarchical Off-Dynamics Transfer
MSTT introduces masked skill token training for hierarchical RL to enable policy transfer across environments with different dynamics — fully offline, without fine-tuning. Relevant for real-world robotics deployment where sim-to-real gaps are common.
Transfer Learning Skills Hierarchical RL Embodied AI
ICLR 2026 Shaojie Li, Pengwei Tang, Bowei Zhu et al. 2026-03-09
High Probability Bounds for Non-Convex Stochastic Optimization with Momentum
Provides the first high-probability convergence and generalization bounds for SGDM in non-convex settings, filling a theoretical gap. Relevant for researchers working on optimizer theory but limited immediate practical impact.
Momentum nonconvex learning generalization
ICLR 2026 Artyom Sorokin, Nazar Buzun, Aleksandr Anokhin et al. 2026-03-09
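The object of these bounds is the standard SGD-with-momentum (heavy-ball) update. As an illustrative sketch only (a toy scalar non-convex objective, not the paper's setting or assumptions):

```python
import numpy as np

def sgdm_step(w, velocity, grad, lr=0.01, beta=0.9):
    """One heavy-ball SGDM step:
    v_{t+1} = beta * v_t + g_t,   w_{t+1} = w_t - lr * v_{t+1}."""
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy non-convex objective f(w) = w^4 - w^2, with two symmetric minima
# at w = +/- 1/sqrt(2) and a local maximum at w = 0.
f_grad = lambda w: 4 * w**3 - 2 * w

w, v = 1.5, 0.0
for _ in range(500):
    w, v = sgdm_step(w, v, f_grad(w))
# Momentum carries the iterate through the flat region and it settles
# into one of the two minima (a stationary point with small gradient).
```

The `beta`/`lr` values here are generic defaults for illustration; the non-convex case is exactly where only in-expectation guarantees existed before, which is the gap the high-probability bounds close.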
Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training
Q-RAG introduces value-based (RL-trained) embedders for multi-step retrieval in long-context settings, addressing the single-step retrieval limitation in standard RAG. Novel training signal for embedders with practical multi-hop QA gains.
Reinforcement Learning RL QA Long-context RAG
ICLR 2026 Seongtae Hong, Youngjoon Jang, Jungseob Lee et al. 2026-03-09
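The multi-step retrieval loop that Q-RAG's value-trained embedders are meant to serve can be sketched at inference time. This is a toy illustration of why single-step retrieval fails on multi-hop questions, with a fixed dot-product scorer standing in for the learned embedder; the RL training signal itself is not shown:

```python
import numpy as np

def multi_step_retrieve(query_vec, doc_vecs, steps=2):
    """Toy multi-hop retrieval: at each step pick the highest-scoring
    unretrieved document, then fold its embedding into the query state
    so the next hop can reach documents the raw query scores poorly."""
    state = query_vec.copy()
    retrieved = []
    for _ in range(steps):
        scores = doc_vecs @ state
        scores[retrieved] = -np.inf      # never pick the same doc twice
        best = int(np.argmax(scores))
        retrieved.append(best)
        state = state + doc_vecs[best]   # simple additive state update
        state /= np.linalg.norm(state)
    return retrieved

query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.6, 0.85, 0.0],   # bridge doc: relevant to the query AND to doc 1
    [0.0, 1.00, 0.0],   # answer doc: near-zero score against the raw query
    [0.5, 0.00, 0.5],   # distractor: decent raw-query score, a dead end
])
# Hop 1 fetches the bridge doc; hop 2 then reaches the answer doc,
# whereas a single-step retriever would rank the distractor second.
```

Q-RAG's contribution is training the embedder so these hop-by-hop scores behave like an RL value function; the additive state update above is a placeholder for that learned behavior.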
Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment
Cross-lingual alignment technique for improving semantic proximity in multilingual information retrieval embeddings. Solid but incremental work in a well-studied area.
Cross-Lingual Alignment Information Retrieval Multilingual Embedding Cross-Lingual Information Retrieval
ICLR 2026 Rahul Ramachandran, Ali Garjani, Roman Bachmann et al. 2026-03-09
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Systematic ICLR 2026 benchmark of GPT-4o, o4-mini, Gemini 1.5 Pro/Flash on standard computer vision tasks reveals where multimodal foundation models still fall short versus task-specific models. Useful calibration for practitioners deciding when to use VLMs versus specialized CV models.
vision benchmark multimodal foundation models vision language models standard computer vision tasks
ICLR 2026 Tin Hadži Veljković, Erik J Bekkers, Michael Tiemann et al. 2026-03-09
CORDS - Continuous Representations of Discrete Structures
CORDS proposes continuous representations for variable-cardinality discrete structure prediction (object detection, molecular modeling) via neural fields. Interesting theoretical angle but limited demonstrated impact.
Continuous set representations Neural fields Variable-cardinality prediction Invertible encoding/decoding Diffusion and flow matching
ICLR 2026 Christopher Mitcheltree, Vincent Lostanlen, Emmanouil Benetos et al. 2026-03-09
SCRAPL: Scattering Transform with Random Paths for Machine Learning
SCRAPL makes scattering transform-based perceptual losses computationally tractable via random path sampling, enabling use in audio and vision deep inverse problems. Niche signal processing / generative audio contribution.
scattering transform wavelets stochastic optimization ddsp perceptual quality assessment
ICLR 2026 Antanas Žilinskas, Robert Noel Shorten, Jakub Marecek et al. 2026-03-09
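The tractability trick is the familiar one of replacing an expensive sum over all scattering paths with an unbiased per-step sample. A minimal sketch of that estimator, with simple weighted distances standing in for real per-path scattering losses:

```python
import numpy as np

def full_loss(x, y, path_losses):
    """Expensive loss: a sum of per-path terms over all scattering paths."""
    return sum(pl(x, y) for pl in path_losses)

def sampled_loss(x, y, path_losses, rng):
    """Unbiased one-sample estimate: draw a random path uniformly and
    rescale by the path count, so E[sampled_loss] == full_loss."""
    pl = path_losses[rng.integers(len(path_losses))]
    return len(path_losses) * pl(x, y)

# Stand-in "paths": weighted absolute distances (illustrative only).
path_losses = [
    (lambda x, y, w=w: w * abs(x - y)) for w in (0.5, 1.0, 2.0, 4.0)
]
# Averaged over many optimization steps, the sampled loss matches the
# full sum while paying for only one path evaluation per step.
```

Each gradient step sees one random path instead of all of them, which is what makes scattering-based perceptual losses affordable inside a training loop.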
EVEREST: A Transformer for Probabilistic Rare-Event Anomaly Detection with Evidential and Tail-Aware Uncertainty
EVEREST is a transformer architecture for rare-event forecasting in multivariate time series, combining evidential deep learning and extreme value theory to handle severe class imbalance. Solid niche contribution for anomaly detection use cases.
Transformer models Uncertainty quantification Evidential deep learning Extreme value theory Imbalanced classification
ICLR 2026 Harris Abdul Majid, Pietro Sittoni, Francesco Tudisco et al. 2026-03-09
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth
Introduces recurrent-depth neural simulators that allow test-time accuracy-cost trade-offs analogous to classical numerical methods, enabling adaptive compute for scientific simulation. Relevant for AI4Science practitioners.
Neural Simulator Recurrent Depth AI4Simulation
ICLR 2026 Kun XIE, Peng Zhou, Xingyi Zhang et al. 2026-03-09
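The accuracy-cost dial can be illustrated with a toy fixed-point iteration that reuses one shared block: more test-time unrolls cost more compute but land closer to the converged solution, mirroring how classical iterative solvers trade steps for accuracy. This is a sketch of the idea, not the paper's simulator architecture:

```python
import numpy as np

def recurrent_depth_forward(x, W, b, steps):
    """Apply one weight-shared block `steps` times; the step count is
    chosen at test time, trading compute for solution accuracy."""
    h = np.zeros_like(b)
    for _ in range(steps):
        h = np.tanh(W @ h + x + b)   # same weights reused at every depth
    return h

# Small-norm W makes the block a contraction, so iterates converge.
W = np.full((4, 4), 0.05)
b = np.array([0.5, -0.3, 0.2, 0.1])
x = np.array([1.0, -1.0, 0.5, 0.0])

ref = recurrent_depth_forward(x, W, b, steps=100)   # "expensive" answer
cheap = recurrent_depth_forward(x, W, b, steps=2)   # fast, rougher answer
# Extra test-time iterations shrink the gap to the converged output.
```

The point is that the same trained weights serve every budget: a deployment can dial `steps` up for accuracy-critical runs and down for cheap previews.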
PoinnCARE: Hyperbolic Multi-Modal Learning for Enzyme Classification
PoinnCARE applies hyperbolic space multi-modal learning to enzyme function classification, capturing hierarchical EC number relationships better than Euclidean methods. Domain-specific bioinformatics contribution.
EC number prediction enzyme function hyperbolic space learning multi-modal learning enzyme structure
ICLR 2026 Tianqiao Liu, Xueyi Li, Hao Wang et al. 2026-03-09
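The reason hyperbolic space suits hierarchical labels like EC numbers is that distances near the Poincaré-ball boundary grow exponentially, so trees embed with low distortion. A sketch of the standard Poincaré-ball distance that such methods build on (this is the textbook formula, not PoinnCARE's specific model):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball:
    d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    arg = 1.0 + 2.0 * duv / max((1.0 - uu) * (1.0 - vv), eps)
    return np.arccosh(arg)

# A point at Euclidean radius 0.95 is far more than 9.5x as distant from
# the origin as one at radius 0.1 -- room to spread out deep hierarchy levels.
origin = np.zeros(2)
shallow = np.array([0.1, 0.0])
deep = np.array([0.95, 0.0])
```

Coarse EC classes can sit near the origin and fine-grained subclasses near the boundary, with distances respecting the hierarchy.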
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
ICLR 2026 paper argues that audio-language models for speech-to-speech conversation require non-autoregressive joint training to overcome latency and quality limitations of current autoregressive approaches. Relevant to the emerging voice AI space.
Large Multimodal Models Multi-token Prediction Non-Autoregressive Learning
ICLR 2026 Qinglong Yang, Haoming Li, Haotian Zhao et al. 2026-03-09
FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents
FingerTip 20K is an ICLR 2026 benchmark for proactive, personalized mobile GUI agents that act without explicit user instructions by inferring context. Advances the frontier of autonomous mobile agents beyond reactive instruction-following.
Mobile Agent LLM Agent GUI Proactive Agent Personalization
ICLR 2026 Tianxiang Dai, Jonathan Fan 2026-03-09
Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings
Analyzes the spatial kernel of multi-resolution hash encodings (as used in NeRF/Instant-NGP) from a physical systems perspective to enable principled hyperparameter selection. Niche but useful for neural fields researchers.
multi-resolution hash encoding implicit neural representations neural fields point spread function spatial kernel analysis
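The object of this analysis is the Instant-NGP-style encoding: per resolution level, hash the grid vertices around a query point, interpolate their learned features, and concatenate across levels. A simplified 1-D sketch (assuming inputs in [0, 1); the hash constant is the usual multiplicative one, and table sizes/feature dims here are arbitrary):

```python
import numpy as np

def hash_encode_1d(x, tables, resolutions):
    """Simplified 1-D multi-resolution hash encoding: at each level,
    linearly interpolate the two hashed grid-vertex features that
    bracket x, then concatenate the per-level features."""
    feats = []
    for table, res in zip(tables, resolutions):
        pos = x * res                 # position in this level's grid
        i0 = int(np.floor(pos))
        w = pos - i0                  # interpolation weight in [0, 1)
        size = table.shape[0]
        f0 = table[(i0 * 2654435761) % size]        # hashed vertex lookups
        f1 = table[((i0 + 1) * 2654435761) % size]
        feats.append((1 - w) * f0 + w * f1)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
resolutions = [16, 64, 256]                         # coarse -> fine levels
tables = [rng.standard_normal((128, 2)) for _ in resolutions]
enc = hash_encode_1d(0.37, tables, resolutions)     # 3 levels x 2 features
```

The interpolation kernel and the level spacing are exactly the knobs whose effective spatial kernel the paper characterizes for principled hyperparameter choices.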

Deep Dive

All 361 items scored and categorized. Relevance scores reflect novelty, technical depth, and practical impact — 7+ items are the ones worth your time.
