mech.app

MUSE-Autoskill: How Agents Build, Version, and Deprecate Their Own Tool Libraries

A skill lifecycle framework that treats agent capabilities as versioned, testable assets with memory-backed evolution and automated quality gates.

Source: arxiv.org
Most agent frameworks treat skills like static functions. You write a tool, register it, and hope the agent calls it correctly. MUSE-Autoskill flips this: agents create their own skills, version them, run unit tests, and deprecate underperformers. The system tracks success rates, execution history, and cross-task reuse patterns in a skill-level memory layer.

This is infrastructure, not prompt engineering. The paper from Huawei and collaborators addresses the operational problem of maintaining a growing tool library when agents generate hundreds of candidate skills across thousands of tasks.

The Core Problem

Existing agent systems have no lifecycle management for skills. When an agent solves a task, it might generate a useful code snippet, but that snippet lives in a single conversation context. Next time a similar task appears, the agent starts from scratch or relies on a human to extract and formalize the skill.

Three failure modes emerge:

  • No reuse mechanism: Skills die with the task that created them
  • No quality gates: Bad skills pollute the library as fast as good ones
  • No evolution path: Skills can’t improve based on runtime feedback

MUSE-Autoskill introduces a five-stage lifecycle: creation, memory storage, management (organization and retrieval), evaluation (unit tests and runtime metrics), and refinement (versioning and deprecation).

Architecture: Skill as Versioned Asset

The system treats each skill as a database record with metadata, code, test suite, and performance history.

Skill record structure:

{
  "skill_id": "uuid",
  "name": "fetch_stock_price",
  "version": 3,
  "code": "def fetch_stock_price(ticker: str) -> float: ...",
  "unit_tests": ["test_valid_ticker", "test_invalid_ticker"],
  "created_at": "2026-05-20T10:00:00Z",
  "last_used": "2026-05-26T14:32:11Z",
  "success_count": 47,
  "failure_count": 2,
  "avg_execution_time_ms": 120,
  "parent_skill_id": "uuid_v2",
  "deprecation_reason": null,
  "tags": ["finance", "api", "yahoo"]
}

When an agent encounters a task, it queries the skill library by semantic similarity and tag match. If no suitable skill exists, the agent generates one, writes unit tests, and stores it. If a skill exists but fails at runtime, the agent forks it, modifies the code, and increments the version.
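The fork-and-increment step can be sketched as follows. This is a minimal illustration that assumes a subset of the record fields listed above; the `SkillRecord` class and the `fork_skill` helper are hypothetical names, not the paper's API, and resetting the counters on a fork is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional
import uuid


@dataclass(frozen=True)
class SkillRecord:
    """Hypothetical in-memory mirror of a subset of the skill record fields."""
    skill_id: str
    name: str
    version: int
    code: str
    parent_skill_id: Optional[str] = None
    success_count: int = 0
    failure_count: int = 0


def fork_skill(parent: SkillRecord, new_code: str) -> SkillRecord:
    """Create the next version of a skill and link it to its parent.

    Assumption: counters restart at zero because the new code has no
    track record yet. The parent record is left untouched for rollback.
    """
    return replace(
        parent,
        skill_id=str(uuid.uuid4()),
        version=parent.version + 1,
        code=new_code,
        parent_skill_id=parent.skill_id,
        success_count=0,
        failure_count=0,
    )


v3 = SkillRecord("id-v3", "fetch_stock_price", 3, "def fetch_stock_price(ticker): ...")
v4 = fork_skill(v3, "def fetch_stock_price(ticker: str) -> float: ...")
```

Keeping the record immutable means the old version survives every fork, which matches the rollback behavior the article describes.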

Creation: On-Demand Skill Generation

The agent uses a two-phase prompt:

  1. Task decomposition: Break the high-level goal into sub-tasks
  2. Skill synthesis: For each sub-task, check the library. If missing, generate a Python function with type hints, docstring, and test cases.

The system enforces a schema:

  • Function signature must include type annotations
  • Docstring must describe inputs, outputs, and failure modes
  • At least two unit tests (happy path and error case)

This isn’t freeform code generation. The agent outputs structured JSON that the framework validates before execution.
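A validation pass of this kind could look like the sketch below. It assumes the agent emits a JSON object with `code` and `unit_tests` keys and exactly one top-level function per skill; those field names, the one-function rule, and the `validate_candidate` helper are illustrative assumptions. It checks only that a docstring exists, not that it covers inputs, outputs, and failure modes.

```python
import ast
import json


def validate_candidate(payload: str) -> list[str]:
    """Return a list of schema violations; an empty list means the candidate passes."""
    try:
        candidate = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(candidate, dict):
        return ["payload must be a JSON object"]

    code = candidate.get("code")
    if not isinstance(code, str):
        return ["missing code field"]
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["code does not parse"]

    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(funcs) != 1:
        return ["code must define exactly one top-level function"]
    func = funcs[0]

    problems = []
    # Check positional and keyword-only parameters plus the return annotation.
    params = func.args.posonlyargs + func.args.args + func.args.kwonlyargs
    if any(p.annotation is None for p in params) or func.returns is None:
        problems.append("missing type annotations")
    if not ast.get_docstring(func):
        problems.append("missing docstring")
    tests = candidate.get("unit_tests")
    if not isinstance(tests, list) or len(tests) < 2:
        problems.append("need at least two unit tests")
    return problems


good = json.dumps({
    "name": "fetch_stock_price",
    "code": 'def fetch_stock_price(ticker: str) -> float:\n'
            '    """Return the latest price; raises ValueError on unknown ticker."""\n'
            '    ...',
    "unit_tests": ["test_valid_ticker", "test_invalid_ticker"],
})
bad = json.dumps({"name": "f", "code": "def f(x):\n    return x", "unit_tests": []})

good_problems = validate_candidate(good)
bad_problems = validate_candidate(bad)
```

Parsing with `ast` inspects the code without executing it, so validation can run before the sandbox is involved.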

Memory: Skill-Level Experience Tracking

Each skill accumulates a performance history across tasks. The memory layer stores:

  • Execution traces: Input parameters, output, execution time, success/failure
  • Context embeddings: Vector representation of the task context where the skill was used
  • Refinement log: Version history with diffs and reasons for changes

When the agent selects a skill, it retrieves the top-k most similar past executions. This helps with parameter tuning. For example, if fetch_stock_price historically fails with single-letter tickers, the agent can add input validation before calling it.
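Retrieving the top-k most similar past executions might be sketched like this. The `ExecutionTrace` fields, the toy two-dimensional embeddings, and the use of plain cosine similarity are assumptions for illustration; the article does not say which embedding model or index is used.

```python
import math
from dataclasses import dataclass


@dataclass
class ExecutionTrace:
    """Hypothetical trace record: call parameters, outcome, and task-context embedding."""
    params: dict
    success: bool
    context_embedding: list


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(traces, query_embedding, k=3):
    """Return the k traces whose task context is closest to the query."""
    ranked = sorted(
        traces,
        key=lambda t: cosine(t.context_embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


traces = [
    ExecutionTrace({"ticker": "F"}, False, [1.0, 0.0]),
    ExecutionTrace({"ticker": "AAPL"}, True, [0.0, 1.0]),
    ExecutionTrace({"ticker": "T"}, False, [0.9, 0.1]),
]
nearest = top_k_similar(traces, [1.0, 0.0], k=2)
```

In this toy data the two nearest traces are both failures on single-letter tickers, which is the kind of pattern that would prompt the agent to validate input before calling the skill.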

Management: Retrieval and Organization

The library grows quickly. After 1,000 tasks, you might have 300 skills. The management layer handles:

  • Semantic search: Embed the current task and retrieve skills by cosine similarity
  • Tag-based filtering: Skills are tagged during creation (e.g., “data-processing”, “api-call”, “file-io”)
  • Dependency tracking: Skills can call other skills. The system maintains a DAG to prevent circular dependencies.
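The circular-dependency check can be illustrated with a small reachability test. This sketch assumes dependencies are stored as a mapping from each skill name to the set of skills it calls; the `would_create_cycle` helper and the skill names are hypothetical.

```python
def would_create_cycle(deps: dict, caller: str, callee: str) -> bool:
    """Return True if adding the edge caller -> callee would close a cycle.

    A cycle appears exactly when `caller` is already reachable from `callee`,
    so a depth-first walk starting at `callee` is enough.
    """
    stack, seen = [callee], set()
    while stack:
        node = stack.pop()
        if node == caller:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False


# build_report calls fetch_stock_price, which calls http_get.
deps = {
    "build_report": {"fetch_stock_price"},
    "fetch_stock_price": {"http_get"},
}
rejected = would_create_cycle(deps, "http_get", "build_report")
allowed = would_create_cycle(deps, "build_report", "http_get")
```

Running the check before an edge is recorded keeps the graph acyclic by construction, so no separate cleanup pass is needed.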

The retrieval pipeline ranks candidates by:

  1. Semantic similarity to current task (0.4 weight)
  2. Historical success rate (0.3 weight)
  3. Recency of last use (0.2 weight)
  4. Execution speed (0.1 weight)

Because the score blends four signals, the agent does not default to the newest skill or the most-used one. The weights trade off exploiting proven skills against exploring less-tested ones.
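As a rough illustration, the ranking reduces to a weighted sum. The sketch assumes each signal has already been normalized to the range 0 to 1 (the article does not say how recency or speed are scaled); the feature names and candidate skills are made up.

```python
# Weights from the ranking list above.
WEIGHTS = {"similarity": 0.4, "success_rate": 0.3, "recency": 0.2, "speed": 0.1}


def rank_score(features: dict) -> float:
    """Weighted sum of normalized signals; higher is better."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


candidates = {
    # Slightly weaker match, but a strong track record.
    "fetch_stock_price_v3": {"similarity": 0.90, "success_rate": 0.96, "recency": 0.5, "speed": 0.8},
    # Closer match and more recently used, but unreliable.
    "fetch_quote_v1": {"similarity": 0.95, "success_rate": 0.40, "recency": 0.9, "speed": 0.9},
}
ranked = sorted(candidates, key=lambda name: rank_score(candidates[name]), reverse=True)
```

Here `fetch_stock_price_v3` scores about 0.828 against 0.77 for `fetch_quote_v1`, so the more reliable skill wins even though the other is the closer and more recent match.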

Evaluation: Unit Tests and Runtime Feedback

Before a skill enters the library, it must pass its own unit tests. The framework runs them in a sandboxed environment with:

  • 5-second timeout per test
  • No network access
  • Limited filesystem access (temp directory only)

If tests pass, the skill is marked “candidate”. After three successful runtime uses, it becomes “stable”. After ten uses with >90% success rate, it becomes “trusted”.
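The promotion rules can be written as a small function. This sketch assumes "three successful runtime uses" means three successes in total rather than three in a row, and that a skill only reaches this function after its unit tests have passed; `skill_status` is a hypothetical helper.

```python
def skill_status(successes: int, failures: int) -> str:
    """Map runtime counters to a trust tier for a skill that already passed its tests."""
    uses = successes + failures
    # Trusted: at least ten uses and a success rate strictly above 90%.
    if uses >= 10 and successes / uses > 0.90:
        return "trusted"
    # Stable: three successful runtime uses (assumed cumulative, not consecutive).
    if successes >= 3:
        return "stable"
    return "candidate"


fresh = skill_status(0, 0)
proven = skill_status(3, 1)
borderline = skill_status(9, 1)    # exactly 90%, so not above the threshold
veteran = skill_status(47, 2)      # the counters from the sample record
```

With the counters from the sample record (47 successes, 2 failures) the skill lands in the trusted tier, while a skill at exactly 90% stays stable.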

Runtime evaluation feeds back into the memory layer. If a skill fails, the agent logs:

  • Input parameters
  • Error message
  • Stack trace
  • Task context

This data drives refinement. The agent can analyze failure patterns and decide whether to:

  • Patch the current version
  • Fork a new version for a specific use case
  • Deprecate the skill entirely

Versioning and Deprecation

Skills evolve. When an agent modifies a skill, it creates a new version and links it to the parent. The system keeps old versions for rollback.

Deprecation triggers:

  • Success rate drops below 50% over the last 20 executions
  • Execution time exceeds 10x the median for similar skills
  • A newer version exists with >20% better success rate
  • Manual deprecation by a human operator

Deprecated skills remain in the library but are excluded from retrieval. Nothing is lost outright (a deprecated skill can still be inspected or restored), and the active set stays clean.
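One way to encode the triggers is a function that returns the first matching reason. This sketch makes two assumptions the article leaves open: it waits for a full window of 20 executions before applying the 50% rule, and it reads ">20% better" as more than 20 percentage points. The function and parameter names are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def deprecation_reason(
    recent_outcomes: Sequence[bool],
    success_rate: float,
    exec_time_ms: float,
    peer_exec_times_ms: Sequence[float],
    newer_version_success_rate: Optional[float] = None,
    manually_deprecated: bool = False,
) -> Optional[str]:
    """Return the first matching trigger, or None if the skill stays active."""
    if manually_deprecated:
        return "manual deprecation"
    window = list(recent_outcomes)[-20:]
    # Assumption: judge the success rate only once 20 executions exist.
    if len(window) == 20 and sum(window) / 20 < 0.5:
        return "success rate below 50% over last 20 executions"
    if peer_exec_times_ms and exec_time_ms > 10 * median(peer_exec_times_ms):
        return "execution time above 10x peer median"
    # Assumption: ">20% better" means more than 20 percentage points.
    if (newer_version_success_rate is not None
            and newer_version_success_rate - success_rate > 0.20):
        return "superseded by a better-performing version"
    return None


peers = [100, 130, 150]
flaky = deprecation_reason([True] * 8 + [False] * 12, 0.40, 120, peers)
slow = deprecation_reason([True] * 20, 1.0, 5000, peers)
healthy = deprecation_reason([True] * 19 + [False], 0.95, 120, peers)
```

Returning a reason string rather than a boolean gives a value that can be written straight into the record's `deprecation_reason` field.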

Trade-offs and Failure Modes

Dimension | Benefit | Risk
Automated skill creation | Scales without human intervention | Generates low-quality skills that pollute the library
Skill-level memory | Improves reuse and parameter tuning | Memory grows unbounded; requires pruning strategy
Unit test requirement | Filters out broken skills early | Agents may write tests that always pass (overfitting)
Versioning | Allows safe experimentation | Version sprawl; hard to track which version is canonical
Deprecation | Keeps library clean | May prematurely remove skills with intermittent failures

The biggest operational risk is the cold-start problem. With an empty library, the agent generates many redundant skills before consolidation kicks in. The paper suggests seeding the library with 10-20 foundational skills (file I/O, HTTP requests, JSON parsing) to bootstrap the process.

Concurrency and Conflict Resolution

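When several agents share one library, two of them can try to modify the same skill at the same time. The paper doesn’t specify a locking mechanism, but the architecture implies a few options: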

  • Optimistic locking: Each skill has a version counter. Updates fail if the counter has changed since read.
  • Skill ownership: Only the agent that created a skill can modify it. Other agents must fork.
  • Append-only log: All skill modifications are logged as events. A background process reconciles conflicts and merges versions.

For multi-agent systems, the third option is safest. It avoids blocking and allows post-hoc conflict resolution.
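For comparison, the first option (optimistic locking) fits in a few lines. This is an in-memory sketch only: the `SkillStore` class and `VersionConflict` exception are hypothetical, and a real deployment would do the compare-and-set in its database rather than behind a Python lock.

```python
import threading


class VersionConflict(Exception):
    """Raised when a skill changed between the caller's read and its write."""


class SkillStore:
    def __init__(self):
        # The lock only makes each compare-and-set atomic; it is not held
        # between a read and the later update, which is what makes this optimistic.
        self._lock = threading.Lock()
        self._skills = {}  # name -> (version, code)

    def put_new(self, name, code):
        with self._lock:
            self._skills[name] = (1, code)

    def read(self, name):
        with self._lock:
            return self._skills[name]

    def update(self, name, expected_version, new_code):
        """Write new_code only if the stored version still matches expected_version."""
        with self._lock:
            current_version, _ = self._skills[name]
            if current_version != expected_version:
                raise VersionConflict(
                    f"{name}: expected v{expected_version}, found v{current_version}"
                )
            self._skills[name] = (current_version + 1, new_code)
            return current_version + 1


store = SkillStore()
store.put_new("fetch_stock_price", "original code")
seen_by_a, _ = store.read("fetch_stock_price")
seen_by_b, _ = store.read("fetch_stock_price")
store.update("fetch_stock_price", seen_by_a, "patched by A")
try:
    store.update("fetch_stock_price", seen_by_b, "patched by B")
    conflict = False
except VersionConflict:
    conflict = True  # B must re-read and retry, or fork instead
```

The losing agent gets an explicit conflict rather than silently overwriting the other agent's patch; the append-only log trades that immediate feedback for non-blocking writes and later reconciliation.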

Observability Gaps

The framework tracks success rates and execution times, but it doesn’t expose:

  • Skill dependency graphs: Which skills call which other skills?
  • Task-to-skill mappings: Which tasks triggered which skill creations?
  • Failure clustering: Are failures concentrated in specific skill types or task categories?

Adding these would help operators debug library bloat and identify underperforming skill families.
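Failure clustering, for instance, needs little more than a grouped count over the failure log. The log entries and their field names below are invented for illustration; the framework's actual log format is not specified.

```python
from collections import Counter

# Hypothetical failure log entries.
failure_log = [
    {"skill": "fetch_stock_price", "tag": "api", "error": "TimeoutError"},
    {"skill": "fetch_fx_rate", "tag": "api", "error": "TimeoutError"},
    {"skill": "parse_csv", "tag": "file-io", "error": "UnicodeDecodeError"},
    {"skill": "fetch_stock_price", "tag": "api", "error": "KeyError"},
]


def cluster_failures(log, keys=("tag", "error")):
    """Count failures grouped by the chosen fields."""
    return Counter(tuple(entry[k] for k in keys) for entry in log)


clusters = cluster_failures(failure_log)
worst, count = clusters.most_common(1)[0]
per_skill = cluster_failures(failure_log, keys=("skill",))
```

Grouping by tag and error type shows whether failures come from one skill family, while grouping by skill name points at individual offenders.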

Technical Verdict

Use MUSE-Autoskill when:

  • You’re running agents across hundreds or thousands of tasks
  • Task types are diverse enough that a fixed tool set won’t scale
  • You can tolerate initial library churn while the system learns
  • You have observability infrastructure to monitor skill performance

Avoid it when:

  • Your agent handles fewer than 50 distinct task types (manual tool curation is simpler)
  • Tasks require deterministic, auditable behavior (auto-generated skills are harder to verify)
  • You lack sandboxing infrastructure for safe skill execution
  • Your team can’t commit to monitoring and pruning the skill library

The framework works best in exploratory domains where task diversity is high and the cost of manual skill authoring exceeds the cost of library maintenance. It’s overkill for narrow, production-critical workflows where every tool must be human-reviewed.
