1. Executive Technical Signal
- Benchmark shift: Terminal-Bench/SWE-style evaluation is moving from public leaderboards to internal harnesses; evidence S01, S02, S03 (3 independent sources) → Action: build a 50-task NEXA eval set.
- The OSS coding-agent runtime space is fragmenting: opencode has reached 165,780 stars but carries 6,129 issues (S04) → strong adoption, high operational risk → Action: trial inside a sandbox.
- Sandboxing/security is becoming a mandatory gate: microsandbox at 6,317 stars (S05) plus FlowLink MCP destructive-command control (S09) → Action: SYNCA policy gate for agent commands.
- Cost governance is emerging: the token-budget discussion drew 27 pts/32 comments (S10) → AI coding ROI needs telemetry, not just seat licenses → Action: measure cost/PR.
- The context layer is still open: gptme and ccpocket total 5,098 stars (S06, S08), and Repowise is a fresh HN signal with no star count (S13) → FARE has an opening in codebase intelligence.
- Multi-agent workflows are still non-standard: Claude workflow composer + HN signal (S11) → Action: trial only under a runbook; do not platformize yet.
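The cost/PR action above can be sketched as a small aggregation over agent-run records. This is a minimal sketch under stated assumptions: the `AgentRun` record, its field names, and the per-token prices are all hypothetical placeholders, not values from any vendor or from the sources cited here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent invocation. Every field name here is a hypothetical schema."""
    pr_id: str
    task_id: str
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per 1M tokens; replace with real contract rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00


def run_cost(run: AgentRun) -> float:
    """USD cost of a single run under the placeholder prices."""
    return (run.input_tokens * PRICE_PER_M_INPUT
            + run.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def cost_per_pr(runs):
    """Sum run costs per pull request."""
    totals = defaultdict(float)
    for run in runs:
        totals[run.pr_id] += run_cost(run)
    return dict(totals)


def tokens_per_task(runs):
    """Sum input and output tokens per task."""
    totals = defaultdict(int)
    for run in runs:
        totals[run.task_id] += run.input_tokens + run.output_tokens
    return dict(totals)


# Illustrative records only; not measured data.
runs = [
    AgentRun("PR-1", "T-1", 200_000, 50_000),
    AgentRun("PR-1", "T-2", 100_000, 25_000),
    AgentRun("PR-2", "T-3", 400_000, 100_000),
]

for pr_id, usd in sorted(cost_per_pr(runs).items()):
    print(f"{pr_id}: ${usd:.2f}")
```

Keeping the price constants separate from the aggregation makes it easy to re-cost historical runs when rates change.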
2. Trend Clusters
- Hot: Harness/eval (3 benchmark signals, confidence 76%).
- Hot: Sandbox/governance (2 direct signals, confidence 72%).
- Emerging: Codebase context (4 repo/HN signals, confidence 68%).
- Watch: Multi-agent UI (1 fresh repo signal, confidence 55%).
- Noise: Vibe-code anecdotes (low reproducibility).
3. Must-read Sources
| Type | Link | Priority | Why read / Key takeaway / Follow-up |
|---|---|---|---|
| Benchmark | DeepSWE | P0 | Contamination-free long-horizon eval → use as the template for NEXA task hygiene. |
| Repo | opencode | P0 | 165,780 stars; validate CLI/runtime UX, issue-risk. |
| Repo | microsandbox | P0 | Sandbox primitive for untrusted agent execution. |
| HN/GitHub | Dirac | P1 | 393 pts/148 comments; inspect Terminal-Bench method. |
| Governance | FlowLink MCP proxy | P1 | Policy-control pattern for destructive MCP commands. |
| Cost | Uber token cost | P1 | Move from adoption to unit economics. |
4. Fabbi Impact Map
| Trend | Evidence | Impact | Move | Owner | Urgency |
|---|---|---|---|---|---|
| Harness/eval | S01/S02/S03 | NEXA quality moat | Build 50-task eval | AI Eng Lead | 0-2w |
| Context engineering | S06/S08/S13 | FARE codebase map | Index 3 pilot repos | Solution Architect | 0-2w |
| Governance | S05/S09 | SYNCA risk control | Policy-as-code gate | Security Lead | 0-2w |
| Enterprise ops | S10/S04 | AIOS telemetry | Cost/PR dashboard | Platform PO | 1-2m |
| Japan/Vietnam/Global | 139 candidates | Presales narrative | Offer eval-first SDLC package | Presales Lead | 1-2m |
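For the context-engineering row, a pilot index needs a retrieval metric to be judged against. Precision@k is one common choice, and the action plan names precision@10. The sketch below assumes a hypothetical setup in which the index returns ranked file paths per query and a reviewer has labeled which files are relevant. The paths and labels are illustrative only.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved items that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for path in top_k if path in relevant)
    return hits / k


# Hypothetical query result: ten ranked file paths from a pilot index.
retrieved = [f"src/module_{i}.py" for i in range(10)]
# Hypothetical human labels: files a reviewer judged relevant to the query.
relevant = {"src/module_0.py", "src/module_2.py", "src/module_5.py", "src/other.py"}

score = precision_at_k(retrieved, relevant, k=10)
print(f"precision@10 = {score:.2f}")  # 3 of the top 10 are relevant -> 0.30
```

Dividing by k rather than by the number of results returned penalizes an index that returns fewer than k candidates, which is usually the desired behavior for a fixed-size context window.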
5. Action Plan
DO THIS WEEK
- NEXA eval harness, 50 tasks; ROI/time-saving 18-25%; risk 2/5; owner AI Eng Lead; TTV 7 days; validate pass@1 + rollback rate.
- SYNCA command policy for rm/write/network; ROI 10-15%; risk 2/5; owner Security Lead; TTV 5 days; validate blocked destructive command rate.
- FARE codebase context pilot on 3 repos; ROI 12-20%; risk 3/5; owner Solution Architect; TTV 10 days; validate retrieval precision@10.
- AIOS cost telemetry (cost/PR + token/task); ROI 8-12%; risk 1/5; owner Platform PO; TTV 5 days; validate weekly spend variance.
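The SYNCA item above (a command policy for rm/write/network) could start as a coarse pre-execution check. The sketch below is an assumption-heavy illustration, not FlowLink's or any MCP proxy's actual mechanism: the rule sets, the `evaluate` function, and the three verdicts are hypothetical, and token matching with `shlex.split` is not a real shell parser, so it will both miss obfuscated commands and flag harmless ones such as `git rm`.

```python
import os
import shlex
from typing import List, Tuple

# Hypothetical rule sets; a real policy would be reviewed by the security owner.
DESTRUCTIVE = {"rm", "dd", "mkfs", "shred"}
NETWORK = {"curl", "wget", "ssh", "scp", "nc"}


def evaluate(command: str) -> Tuple[str, str]:
    """Return (verdict, reason) where verdict is 'deny', 'review', or 'allow'."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: fail closed.
        return "deny", "unparseable command"
    if not tokens:
        return "deny", "empty command"
    # Compare on the basename so /bin/rm matches rm; strip a trailing ';'.
    names = [os.path.basename(tok.rstrip(";")) for tok in tokens]
    for name in names:
        if name in DESTRUCTIVE:
            return "deny", f"destructive binary: {name}"
    for name in names:
        if name in NETWORK:
            return "review", f"network binary: {name}"
    if any(tok.startswith(">") for tok in tokens):
        return "review", "shell redirection writes a file"
    return "allow", "no rule matched"


def blocked_rate(commands: List[str]) -> float:
    """Share of commands that the policy denies outright."""
    if not commands:
        return 0.0
    denied = sum(1 for cmd in commands if evaluate(cmd)[0] == "deny")
    return denied / len(commands)


# Illustrative command log only.
sample = [
    "rm -rf /tmp/build",
    "curl https://example.com",
    "echo hi > out.txt",
    "git status",
]
for cmd in sample:
    print(cmd, "->", evaluate(cmd))
print("blocked rate:", blocked_rate(sample))
```

Failing closed on unparseable input keeps the gate conservative. The `review` verdict leaves room for a human or a stricter sandbox to decide on write and network actions.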
WATCH NEXT 2-4 WEEKS
Dirac/ForgeCode leaderboard stability; opencode issue burn-down; MCP proxy patterns.
IGNORE / LOW SIGNAL
Fundraising-only news, vibe-code anecdotes without metrics, consumer chatbot news.
6. CTO Evaluation Matrix
| Signal | Thesis | Counter | Decision | Next validation |
|---|---|---|---|---|
| Benchmarks | Eval-first beats demo-first | Public benchmark contamination | trial 76% | 50 internal tasks |
| opencode | OSS runtime demand proven | 6,129 issues | watch/trial 65% | 2-week POC |
| Sandbox | Enterprise blocker solved by policy | Integration overhead | adopt 72% | MCP denylist test |
| Cost | Token spend becomes CFO topic | External story, limited details | adopt 70% | cost/PR baseline |
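The Benchmarks row points to 50 internal tasks as the next validation, scored by pass@1 and rollback rate. The sketch below shows one way those two numbers might be computed when each task gets a single attempt. The `TaskResult` schema and the definition of rollback rate (share of passing changes later reverted) are assumptions for illustration; the sources do not define them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one eval task; field names are a hypothetical schema."""
    task_id: str
    first_attempt_passed: bool
    rolled_back: bool  # assumed meaning: the accepted change was later reverted


def pass_at_1(results: List[TaskResult]) -> float:
    """Share of tasks solved on the first (and only) attempt."""
    if not results:
        return 0.0
    return sum(r.first_attempt_passed for r in results) / len(results)


def rollback_rate(results: List[TaskResult]) -> float:
    """Share of passing tasks whose change was later rolled back."""
    passed = [r for r in results if r.first_attempt_passed]
    if not passed:
        return 0.0
    return sum(r.rolled_back for r in passed) / len(passed)


# Illustrative results only; a real run would load all 50 tasks.
results = [
    TaskResult("T-01", True, False),
    TaskResult("T-02", True, True),
    TaskResult("T-03", False, False),
    TaskResult("T-04", True, False),
]

print(f"pass@1 = {pass_at_1(results):.2f}")
print(f"rollback rate = {rollback_rate(results):.2f}")
```

Reporting rollback rate next to pass@1 guards against a harness that rewards changes which pass tests but do not survive review.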
7. Detailed Source Appendix
| ID | Platform | Source | Metric | Notes |
|---|---|---|---|---|
| S01 | HN | DeepSWE: contamination-free benchmark | 33 pts/9 comments | Agentic SDLC / harness |
| S02 | HN | Dirac topped Terminal-Bench on Gemini-3 flash preview | 393 pts/148 comments | Agentic SDLC / harness |
| S03 | HN | ForgeCode: open-source coding agent in Terminal-Bench 2.0 | 4 pts/0 comments | Agentic SDLC / harness |
| S04 | GitHub | anomalyco/opencode | 165,780 stars/19,696 forks/6,129 issues | Agentic SDLC / harness |
| S05 | GitHub | superradcompany/microsandbox | 6,317 stars/307 forks/50 issues | Agentic SDLC / harness |
| S06 | GitHub | gptme/gptme | 4,309 stars/390 forks/14 issues | Agentic SDLC / harness |
| S07 | GitHub | stablyai/orca | 3,474 stars/228 forks/201 issues | Agentic SDLC / harness |
| S08 | GitHub | K9i-0/ccpocket | 789 stars/63 forks/9 issues | Agentic SDLC / harness |
| S09 | HN | FlowLink MCP proxy blocking destructive commands | 1 pt/0 comments | Agentic SDLC / harness |
| S10 | HN | Uber AI budget/token-cost discussion | 27 pts/32 comments | Agentic SDLC / harness |
| S11 | HN | Visual composer for Claude Code multi-agent workflows | 2 pts/0 comments | Agentic SDLC / harness |
| S12 | HN | Functional programming accelerates agentic feature development | 59 pts/31 comments | Agentic SDLC / harness |
| S13 | HN | Repowise codebase intelligence for AI coding agents | 1 pt/0 comments | Agentic SDLC / harness |
| S14 | HN | Tracecore deterministic coding-agent benchmark | 1 pt/0 comments | Agentic SDLC / harness |
| S15 | GitHub | vercel-labs/zerolang | 4,566 stars/291 forks/116 issues | Agentic SDLC / harness |
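The GitHub rows above pair star counts with issue counts, which is the basis for reading opencode as high adoption with high operational risk. One rough way to compare repos on that axis is issues per 1,000 stars, sketched below with the figures from the table. The metric is an illustrative heuristic rather than a method used by the sources, and it assumes the issue figures are open-issue counts.

```python
# (stars, issues) copied from the GitHub rows of the appendix table.
repos = {
    "anomalyco/opencode": (165_780, 6_129),
    "superradcompany/microsandbox": (6_317, 50),
    "gptme/gptme": (4_309, 14),
    "stablyai/orca": (3_474, 201),
    "K9i-0/ccpocket": (789, 9),
    "vercel-labs/zerolang": (4_566, 116),
}


def issues_per_1k_stars(stars: int, issues: int) -> float:
    """Crude issue-load proxy: issues normalized by popularity."""
    if stars <= 0:
        raise ValueError("stars must be positive")
    return issues / stars * 1000


ranked = sorted(
    ((name, issues_per_1k_stars(stars, issues)) for name, (stars, issues) in repos.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, ratio in ranked:
    print(f"{name:32s} {ratio:6.1f} issues per 1k stars")
```

A high ratio can also reflect an active user base that files many reports, so the number is better used to prioritize a closer look than as a verdict.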
8. Data Quality / Scan Health Appendix
QUALITY_GATE_PARTIAL: 139 candidates scanned; dev_web/HN 30, GitHub 64, Reddit 25, YouTube 20, X 0, Facebook public 0, papers_product 0. arXiv returned HTTP 429 (rate limited) after bounded retries; the unauthenticated public fallback for X/Facebook produced no usable links. Confidence is reduced, but 30+ cited/summarized technical signals are available.
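The per-channel counts above can be checked mechanically. The sketch below confirms that they sum to the reported 139 candidates and lists the channels that returned nothing. The gate rule (partial when any channel is empty) and the `QUALITY_GATE_PASS` label are assumptions for illustration; the scan's actual gate logic is not described in this report.

```python
# Per-channel candidate counts as reported in this appendix.
channel_counts = {
    "dev_web/HN": 30,
    "GitHub": 64,
    "Reddit": 25,
    "YouTube": 20,
    "X": 0,
    "Facebook public": 0,
    "papers_product": 0,
}

REPORTED_TOTAL = 139

total = sum(channel_counts.values())
empty_channels = [name for name, count in channel_counts.items() if count == 0]

# Hypothetical gate rule: any empty channel downgrades the scan to PARTIAL.
status = "QUALITY_GATE_PASS" if not empty_channels else "QUALITY_GATE_PARTIAL"

print(f"total candidates: {total} (reported: {REPORTED_TOTAL})")
print(f"empty channels: {', '.join(empty_channels)}")
print(f"status: {status}")
```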