AI Coding Workflow

This guide summarizes a practical workflow for xLLM developers who need to optimize NPU serving, debug regressions, or prepare pull requests with reproducible evidence.

The workflow is maintained as a separate knowledge base at xllm-workflow. Use that repository when you want copy-ready agent skills, prompt templates, artifact schemas, and reusable model optimization history.

Use this workflow when a change needs evidence beyond a normal code review:

  • optimizing TTFT, TPOT, TPS, memory usage, or serving concurrency;
  • comparing xLLM with vLLM-Ascend, SGLang NPU, or another serving framework;
  • debugging garbled output, dataset score drops, GPU/NPU mismatches, OOM, graph replay failures, HCCL issues, or runtime crashes;
  • validating an NPU-related PR before it is merged;
  • deciding whether an operator migration or kernel-level experiment is needed.
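
As a point of reference for the latency metrics named in the first bullet, the sketch below derives TTFT, TPOT, and TPS from per-token arrival timestamps. The `RequestTrace` record and the exact formulas are assumptions for illustration; benchmark tools such as evalscope may define or aggregate these metrics differently, so compare numbers only within one tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def tps(traces: List[RequestTrace]) -> float:
    """Aggregate output tokens per second over the whole measurement window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)
```

Keeping the raw timestamps alongside the derived values lets a reviewer recompute the metrics if the definitions are questioned.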

Do not treat one smoke run as a formal conclusion. Formal performance and accuracy claims should include the exact command, environment, workload, raw artifacts, and normalized summaries.

For performance-sensitive and correctness-sensitive work, follow this loop:

target -> baseline -> profiling -> patch -> accuracy -> performance -> record

The key rule is to keep benchmark, profiling, and accuracy evidence separate. Profiling explains bottlenecks, but it does not replace warmed-up before/after performance measurements.
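
A warmed-up before/after comparison can be sketched as follows. This is a minimal illustration, not part of the workflow repository: the function names, the default of three warmup iterations, and the use of the median are all assumptions chosen for the example.

```python
from statistics import median
from typing import Dict, List


def steady_state(samples: List[float], warmup: int) -> List[float]:
    """Drop the first `warmup` samples so cold-start effects are excluded."""
    if warmup >= len(samples):
        raise ValueError("need more samples than warmup iterations")
    return samples[warmup:]


def compare(baseline: List[float], patched: List[float],
            warmup: int = 3) -> Dict[str, float]:
    """Compare median latency of two runs after discarding warmup samples."""
    b_med = median(steady_state(baseline, warmup))
    p_med = median(steady_state(patched, warmup))
    return {
        "baseline_median": b_med,
        "patched_median": p_med,
        # Negative means the patched run is faster for a latency metric.
        "relative_change": (p_med - b_med) / b_med,
    }
```

Both runs should use the same workload and warmup count; profiling traces collected separately can then explain the difference rather than stand in for it.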

Use these phases for a complete optimization task:

| Phase | Purpose | Output |
| --- | --- | --- |
| Target and environment | Define the goal, model, framework commit, hardware, CANN/runtime versions, workload, and SLA. | Run manifest |
| Historical knowledge | Check prior model PRs, failed attempts, and known risky paths. | History notes |
| Fair baseline | Run warmed-up baseline tests before code or parameter changes. | Raw metrics and summary |
| Evidence collection | Collect profiling, capacity, pipeline, compute, or accuracy evidence based on the symptom. | Diagnostic report |
| Patch | Make one meaningful, reviewable change per round whenever practical. | Code diff |
| Validate | Re-run accuracy, performance, build, and UT checks appropriate to the change. | Validation table |
| Record | Preserve commands, metrics, failed attempts, risk notes, and follow-up work. | Reusable lesson |
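
The run manifest produced in the first phase can be drafted programmatically. The sketch below is hypothetical: the `RunManifest` class and its field names simply mirror the items listed for the "Target and environment" phase and are not taken from references/run-manifest-template.md, which remains the authoritative layout.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RunManifest:
    """Hypothetical manifest fields mirroring the 'Target and environment' phase."""
    goal: str = ""
    model: str = ""
    framework_commit: str = ""
    hardware: str = ""
    cann_version: str = ""
    workload: str = ""
    sla: str = ""


def missing_fields(manifest: RunManifest) -> List[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(manifest)
            if not getattr(manifest, f.name).strip()]


def to_markdown(manifest: RunManifest) -> str:
    """Render the manifest as a simple Markdown list, marking gaps as TODO."""
    lines = ["# Run manifest", ""]
    for f in fields(manifest):
        lines.append(f"- {f.name}: {getattr(manifest, f.name) or 'TODO'}")
    return "\n".join(lines) + "\n"
```

Checking `missing_fields` before the baseline run makes it harder to start measuring against an environment that was never written down.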

Before opening or updating an NPU optimization PR, make sure the PR description can answer these questions:

  • What model, tokenizer, dtype, device type, device count, and framework commit were used?
  • What exact startup command and benchmark command were used?
  • Was the baseline warmed up and run on clean devices?
  • Are profiling results used only as diagnostic evidence?
  • Which accuracy level was run, and where are failed cases stored if any?
  • What changed in the patch, and what risk remains?
  • Which artifacts can another developer use to replay the conclusion?

For formal results, keep these artifacts together under the same run root:

  • manifest.md or manifest.yaml with environment and command details;
  • raw evalscope or benchmark output;
  • normalized metrics.json or summary.md;
  • profiling report and timeline notes when profiling is used;
  • failed_cases.jsonl or equivalent bad-case records for accuracy work;
  • PR notes that summarize what changed, why it is safe, and how it was validated.
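
A quick completeness check over a run root might look like the sketch below. It assumes the artifacts sit directly under the run root with the file names mentioned above; the `REQUIRED_ANY` table and `missing_artifacts` helper are illustrative, and raw benchmark output, profiling reports, and PR notes are left out because their file names are not fixed here.

```python
from pathlib import Path
from typing import Dict, List

# Each entry is satisfied when at least one of the listed files exists.
# The flat layout under the run root is an assumption for this sketch.
REQUIRED_ANY: Dict[str, List[str]] = {
    "manifest": ["manifest.md", "manifest.yaml"],
    "normalized summary": ["metrics.json", "summary.md"],
}


def missing_artifacts(run_root: str, accuracy_work: bool = False) -> List[str]:
    """Return labels of required artifact groups that are absent from run_root."""
    required = dict(REQUIRED_ANY)
    if accuracy_work:
        # Accuracy runs should also keep bad-case records.
        required["failed cases"] = ["failed_cases.jsonl"]
    root = Path(run_root)
    return [label for label, names in required.items()
            if not any((root / name).is_file() for name in names)]
```

An empty result only shows that the files exist; whether their contents support the claim still needs review.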

The workflow repository provides shared schemas for these artifacts:

  • references/run-manifest-template.md
  • references/perf-artifact-schema.md
  • references/profiling-artifact-schema.md
  • references/accuracy-artifact-schema.md

The workflow repository includes task-oriented skills that can be loaded by Codex, Claude Code, opencode, or another local agent runtime:

| Task | Skill |
| --- | --- |
| End-to-end optimization | xllm-npu-sota-loop |
| Service startup and evalscope collection | xllm-npu-eval-runner |
| Fair framework comparison | xllm-npu-benchmark |
| msprof / MindStudio analysis | xllm-npu-profiler |
| Decode bubble and rank-skew analysis | xllm-npu-pipeline-analysis |
| HBM, KV cache, and OOM analysis | xllm-npu-capacity-planner |
| FLOPs, MFU, and lower-bound estimates | xllm-npu-compute-simulation |
| Accuracy regression debugging | xllm-npu-accuracy-debug |
| Crash or runtime incident triage | xllm-npu-incident-triage |
| NPU code review | xllm-npu-code-review |
| Operator migration | xllm-npu-op-migration |

These skills are aids for disciplined engineering. The final judgment still depends on reproducible xLLM artifacts and reviewable code changes.