Plexe: Agentic ML Pipelines That Turn Prompts into Trained Models

Plexe is an open-source agent that accepts a natural language task description and returns a trained machine learning model. You do no manual feature engineering, hyperparameter search, or architecture selection. The agent handles data preprocessing, feature selection, model architecture search, hyperparameter tuning, and training in a single autonomous loop.

This is not a wrapper around AutoML libraries like TPOT, AutoGluon, or H2O AutoML. It is an orchestration layer that uses an LLM to reason about data characteristics and make sequential decisions about transformations, model families, and training strategies. The plumbing sits between prompt-based planning and traditional ML tooling (scikit-learn, PyTorch, XGBoost).

How the Orchestration Works

Plexe runs a multi-stage pipeline where each stage produces artifacts that feed the next:

Data profiling: The agent inspects the dataset schema, checks for missing values, identifies categorical vs. numerical columns, and computes basic statistics.
Feature engineering: Based on profiling results, the LLM generates a transformation plan (one-hot encoding, scaling, polynomial features, etc.) and executes it using pandas and scikit-learn.
Model architecture search: The agent selects candidate model families (linear models, tree ensembles, neural networks) based on dataset size, feature count, and task type (classification, regression).
Hyperparameter tuning: For each candidate, the agent runs a search (grid, random, or Bayesian) and tracks validation metrics.
Training and evaluation: The best-performing model is retrained on the full training split and evaluated on a holdout set that was kept out of the search.

Each stage writes intermediate artifacts to disk: preprocessed CSVs, serialized transformers, model checkpoints, and training logs. The agent can resume from any stage if a failure occurs.
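The resume behavior can be sketched in a few lines. This is an illustration, not Plexe's actual code: the stage names, the `run_pipeline` function, and the one-JSON-file-per-stage layout are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical stage names mirroring the five stages described above.
STAGES = ["profiling", "feature_engineering", "model_search", "tuning", "training"]


def run_pipeline(run_dir: Path, stage_fns: dict[str, Callable[[dict], dict]]) -> dict:
    """Run stages in order, skipping any whose artifact already exists on disk."""
    run_dir.mkdir(parents=True, exist_ok=True)
    artifacts: dict = {}
    for name in STAGES:
        marker = run_dir / f"{name}.json"
        if marker.exists():
            # Stage finished in an earlier attempt: reload its output instead of recomputing.
            artifacts[name] = json.loads(marker.read_text())
            continue
        # Each stage receives the outputs of every stage before it.
        output = stage_fns[name](artifacts)
        # Write to a temp file, then rename, so a crash never leaves a half-written artifact.
        tmp = marker.with_suffix(".tmp")
        tmp.write_text(json.dumps(output))
        tmp.replace(marker)
        artifacts[name] = output
    return artifacts
```

Calling `run_pipeline` a second time with the same `run_dir` after a crash picks up at the first stage that has no artifact on disk.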

Tool Boundaries and State Management

The LLM does not execute Python code directly. Instead, it generates structured JSON plans that a Python runtime interprets. For example, a feature engineering plan might look like this:

{
  "transformations": [
    {"column": "age", "operation": "standard_scale"},
    {"column": "category", "operation": "one_hot_encode"},
    {"columns": ["feature_a", "feature_b"], "operation": "polynomial_features", "degree": 2}
  ]
}

The runtime validates the plan against the current dataset schema, applies transformations using scikit-learn pipelines, and persists the fitted transformers. If a transformation fails (e.g., trying to scale a non-numeric column), the runtime returns an error message to the LLM, which generates a corrected plan.
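A minimal sketch of the validation step, using only the standard library. The operation names come from the example plan above. The `validate_plan` function, the `NUMERIC_ONLY` table, and the schema format (column name mapped to "numeric" or "categorical") are assumptions for illustration and may not match what Plexe does internally.

```python
# Operations this hypothetical runtime accepts, and whether each needs numeric input.
NUMERIC_ONLY = {"standard_scale": True, "one_hot_encode": False, "polynomial_features": True}


def validate_plan(plan: dict, schema: dict[str, str]) -> list[str]:
    """Return human-readable errors; an empty list means the plan is safe to run.

    `schema` maps column name to "numeric" or "categorical" (an assumed format).
    """
    errors = []
    for i, step in enumerate(plan.get("transformations", [])):
        op = step.get("operation")
        if op not in NUMERIC_ONLY:
            errors.append(f"step {i}: unknown operation {op!r}")
            continue
        # A step names either one "column" or a list of "columns".
        columns = step.get("columns") or ([step["column"]] if "column" in step else [])
        if not columns:
            errors.append(f"step {i}: no column specified")
        for col in columns:
            if col not in schema:
                errors.append(f"step {i}: column {col!r} not in dataset")
            elif NUMERIC_ONLY[op] and schema[col] != "numeric":
                errors.append(f"step {i}: {op} needs a numeric column, {col!r} is {schema[col]}")
    return errors
```

The returned strings are what would go back to the LLM as the error message, so they name the step, the column, and the reason for rejection.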

State is managed through a versioned artifact store. Each pipeline run gets a unique ID, and all intermediate outputs are tagged with that ID. This allows the agent to:

Resume from the last successful stage if training crashes.
Compare multiple runs to see which feature engineering strategy worked best.
Roll back to a previous state if a transformation degrades model performance.
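One way to picture a run-scoped artifact store is a directory per run ID with one file per stage. The `ArtifactStore` class, its method names, and the `val_auc` metric used in the usage note are hypothetical; the article does not describe the real storage layout.

```python
import json
import uuid
from pathlib import Path


class ArtifactStore:
    """Minimal file-backed store: one directory per run, one JSON file per stage."""

    def __init__(self, root: Path):
        self.root = root

    def new_run(self) -> str:
        # A short random ID tags everything this run writes.
        run_id = uuid.uuid4().hex[:8]
        (self.root / run_id).mkdir(parents=True)
        return run_id

    def save(self, run_id: str, stage: str, payload: dict) -> None:
        (self.root / run_id / f"{stage}.json").write_text(json.dumps(payload))

    def load(self, run_id: str, stage: str) -> dict:
        return json.loads((self.root / run_id / f"{stage}.json").read_text())

    def best_run(self, stage: str, metric: str) -> str:
        """Compare runs by a metric recorded in the given stage (higher is better)."""
        scores = {
            p.parent.name: json.loads(p.read_text())[metric]
            for p in self.root.glob(f"*/{stage}.json")
        }
        return max(scores, key=scores.get)
```

With this layout, comparing feature engineering strategies is a call like `store.best_run("evaluation", "val_auc")`, and rolling back means loading an earlier run's artifacts by ID.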

Termination Conditions and Failure Modes

The agent stops when one of three conditions is met:

Convergence: Validation metrics stop improving after N consecutive model candidates.
Budget exhaustion: A maximum number of training runs or wall-clock time is reached.
Manual override: A human reviews intermediate results and decides to stop.
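The first two conditions are easy to express as a small tracker. The sketch below is illustrative: `StopCriteria`, its default thresholds, and the assumption that a higher metric is better are not taken from Plexe. Manual override is left out because it depends on how a human is brought into the loop.

```python
import time
from typing import Optional


class StopCriteria:
    """Tracks the convergence and budget conditions; all thresholds are illustrative."""

    def __init__(self, patience: int = 3, max_runs: int = 20, max_seconds: float = 3600.0):
        self.patience = patience        # the "N" in the convergence rule
        self.max_runs = max_runs
        self.max_seconds = max_seconds
        self.best = float("-inf")
        self.stale = 0                  # consecutive candidates without improvement
        self.runs = 0
        self.started = time.monotonic()

    def record(self, val_metric: float) -> None:
        """Register one finished candidate (higher metric is better)."""
        self.runs += 1
        if val_metric > self.best:
            self.best, self.stale = val_metric, 0
        else:
            self.stale += 1

    def should_stop(self) -> Optional[str]:
        """Return the reason to stop, or None to keep searching."""
        if self.stale >= self.patience:
            return "convergence"
        if self.runs >= self.max_runs:
            return "budget: runs"
        if time.monotonic() - self.started >= self.max_seconds:
            return "budget: time"
        return None
```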

The table below shows how the agent handles different failure scenarios and what recovery strategies it employs:

Failure Type	Agent Behavior	Recovery Strategy
Data transformation error	LLM receives error message, generates new plan	Retry with corrected transformation
Training crash (OOM, NaN loss)	Agent logs failure, tries next model candidate	Skip to next architecture or reduce batch size
LLM generates invalid JSON	Runtime rejects plan, requests re-generation	Retry with schema validation prompt

The agent does not retry indefinitely. After three consecutive failures in the same stage, it surfaces the error to a human and halts.
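The retry-then-halt rule could look roughly like the loop below. The `ask_llm` and `execute` callables, the `StageHalted` exception, and the wording of the retry prompt are placeholders invented for this sketch; only the limit of three consecutive failures comes from the text.

```python
import json

MAX_CONSECUTIVE_FAILURES = 3  # matches the limit stated above


class StageHalted(Exception):
    """Raised when a stage fails too many times in a row and needs a human."""


def run_stage_with_retries(ask_llm, execute, first_prompt: str) -> dict:
    """Ask for a plan, execute it, and feed any error back into the next prompt.

    `ask_llm(prompt) -> str` and `execute(plan) -> dict` are hypothetical callables.
    """
    prompt = first_prompt
    last_error = ""
    for _ in range(MAX_CONSECUTIVE_FAILURES):
        raw = ask_llm(prompt)
        try:
            plan = json.loads(raw)   # invalid JSON raises json.JSONDecodeError
            return execute(plan)     # transformation errors surface as exceptions
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
            prompt = (
                f"{first_prompt}\nPrevious attempt failed with: {last_error}\n"
                "Return a corrected JSON plan."
            )
    raise StageHalted(
        f"halting after {MAX_CONSECUTIVE_FAILURES} failures; last error: {last_error}"
    )
```

Catching every exception keeps the sketch short. A real runtime would likely treat out-of-memory crashes differently from plan errors, as the table above suggests.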

Observability and Traceability

Every decision the agent makes is logged with a reference to the prompt that triggered it. Each log entry is a JSON object:

[
  {
    "timestamp": "2025-05-06T08:15:32Z",
    "stage": "feature_engineering",
    "prompt": "Dataset has 50,000 rows, 12 columns (8 numeric, 4 categorical). Suggest transformations.",
    "plan": {"transformations": [{"column": "age", "operation": "standard_scale"}]},
    "result": "Transformed dataset saved to artifacts/run_abc123/transformed.csv"
  }
]

This trace lets you answer questions like:

Which prompt led to the decision to use polynomial features?
Why did the agent choose XGBoost over a neural network?
What hyperparameters were tried before the final model?

The logs are structured JSON, so you can ingest them into a time-series database or observability platform (Langfuse, Weights & Biases, etc.).
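Because the trace is plain JSON, the first question above reduces to a short filter. The entry below is copied from the example log. The `prompts_for_operation` helper is a hypothetical name, and the sketch assumes every entry with a feature engineering plan follows the same shape.

```python
import json

# One entry copied from the example log above; a real trace would hold many.
TRACE = json.loads("""
[
  {
    "timestamp": "2025-05-06T08:15:32Z",
    "stage": "feature_engineering",
    "prompt": "Dataset has 50,000 rows, 12 columns (8 numeric, 4 categorical). Suggest transformations.",
    "plan": {"transformations": [{"column": "age", "operation": "standard_scale"}]},
    "result": "Transformed dataset saved to artifacts/run_abc123/transformed.csv"
  }
]
""")


def prompts_for_operation(trace: list[dict], operation: str) -> list[str]:
    """Return the prompt of every entry whose plan includes the given operation."""
    hits = []
    for entry in trace:
        steps = entry.get("plan", {}).get("transformations", [])
        if any(step.get("operation") == operation for step in steps):
            hits.append(entry["prompt"])
    return hits
```

Calling `prompts_for_operation(TRACE, "polynomial_features")` on a full trace would return the prompts that led to polynomial features. On this single-entry sample it returns an empty list.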

Deployment Shape

Plexe runs as a single Python process. You give it:

A dataset (CSV, Parquet, or pandas DataFrame)
A task description (“Predict customer churn based on usage patterns”)
Optional constraints (max training time, preferred model families)

It returns:

A trained model (serialized with joblib or PyTorch)
A preprocessing pipeline (scikit-learn Pipeline object)
A report (Markdown file with metrics, feature importance, and training history)

For production use, you would wrap this in a job queue (Celery, Temporal) and run it on a machine with enough RAM and CPU for training. The agent does not handle distributed training or GPU orchestration out of the box. Model serialization uses joblib by default, which maintains compatibility with scikit-learn pipelines but may require conversion to ONNX or TorchScript for cross-platform deployment.

When to Use Plexe vs. Traditional AutoML

This table compares Plexe’s strengths and weaknesses against traditional AutoML libraries across common deployment scenarios:

Scenario	Plexe	Traditional AutoML
You have a well-defined task and clean data	Overkill	Better fit (TPOT, AutoGluon)
You need to explain why a feature was engineered	Strong (full trace logs)	Weak (black-box search)
You want to inject domain knowledge via prompts	Strong (natural language input)	Weak (requires code changes)
You need production-grade reliability	Weak (early-stage, failure modes not fully hardened)	Strong (battle-tested libraries)
You want to iterate on feature ideas quickly	Strong (re-prompt, re-run)	Weak (requires re-coding pipelines)

Technical Verdict

Use Plexe if your dataset is under 100,000 rows, you have fewer than five iteration cycles per week, and you need full traceability from prompt to model decision. It works best when you can tolerate 30-60 second LLM call overhead per stage and your training budget allows for exploratory runs that may fail. The agent shines in research contexts, financial modeling experiments, and hackathons where understanding the decision path matters more than squeezing out the last 0.5% of accuracy.

That traceability is the main observability benefit: you can trace which prompt led to which feature transformation or model choice. This makes Plexe useful for audit trails and for explaining model decisions to non-technical stakeholders.

Avoid it if your SLA requires sub-two-second model inference latency, your data contains sensitive PII that needs audit trails beyond structured logs, or you need production-grade failure recovery with automatic rollback. The agent can fail in non-obvious ways (e.g., the LLM suggests a transformation that works on a sample but breaks on the full dataset), and LLM calls add latency to every pipeline stage. For production pipelines serving real-time predictions, use a traditional AutoML library (AutoGluon, TPOT) and invest in manual feature engineering with version-controlled pipelines.

The real value is not replacing data scientists. It is giving them a tool to prototype faster, document their reasoning in natural language, and generate audit trails that non-experts can review without reading code.