Hiring Agent: Resume Parsing, GitHub Enrichment, and Fair Scoring Plumbing

How a multi-stage agent pipeline extracts structured data from PDFs, enriches with GitHub signals, and enforces deterministic scoring constraints.

Source: github.com

HackerRank’s Hiring Agent hit GitHub trending at #10 for Python because it solves a real B2B problem: turning unstructured resume PDFs into explainable, fair candidate scores. The repository exposes a five-stage pipeline that decouples PDF extraction, LLM-based parsing, GitHub API enrichment, schema normalization, and scoring under deterministic constraints. Each boundary teaches a lesson about when to trust an LLM and when to enforce hard constraints.

Why This Pipeline Matters

Most hiring automation demos stop at “LLM reads resume, outputs score.” Production systems need:

  • Error isolation: If PDF extraction fails, the LLM never sees garbage.
  • Auditability: Scores must cite evidence, not hallucinate qualifications.
  • Fairness constraints: Category weights and deduction rules must be deterministic.
  • External enrichment: GitHub signals add objective data the resume omits.

Hiring Agent implements all four. The architecture shows how to compose stochastic (LLM) and deterministic (API, scoring rules) components without letting one contaminate the other.

Pipeline Architecture

The orchestrator lives in score.py. It calls five modules in sequence:

| Stage | Module | Input | Output | Failure Mode |
|---|---|---|---|---|
| 1. PDF to Markdown | pymupdf_rag.py | PDF bytes | Markdown text per page | Corrupted PDF, scanned image |
| 2. Section extraction | pdf.py | Markdown + Jinja templates | Pydantic-validated JSON per section | LLM refuses, malformed JSON |
| 3. GitHub enrichment | github.py | GitHub username | Profile + top 7 repos | Rate limit, private profile |
| 4. Normalization | transform.py | Loose JSON | JSON Resume schema | Missing required fields |
| 5. Scoring | evaluator.py | Normalized JSON + job description | Category scores + evidence | LLM ignores constraints |

Each stage writes intermediate artifacts when DEVELOPMENT_MODE=true. The CSV output includes raw LLM responses, GitHub API payloads, and final scores for audit trails.
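
The sequencing in the table can be sketched as a small runner that stops at the first failing stage and records which one failed. This is only an illustration: `run_pipeline`, `StageError`, and the stage tuples are hypothetical names, and the real score.py may be organized differently.

```python
from typing import Any, Callable

# A stage is a (name, function) pair; each function takes the previous output.
Stage = tuple[str, Callable[[Any], Any]]


class StageError(RuntimeError):
    """Raised when a stage fails; carries the stage name for logging."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage
        self.cause = cause


def run_pipeline(stages: list[Stage], payload: Any) -> tuple[Any, dict[str, Any]]:
    """Run stages in order, keeping each intermediate output as an artifact."""
    artifacts: dict[str, Any] = {}
    for name, fn in stages:
        try:
            payload = fn(payload)
        except Exception as exc:
            # Later stages never run, so they never see a bad input.
            raise StageError(name, exc) from exc
        artifacts[name] = payload
    return payload, artifacts
```

Keeping the per-stage outputs in a dictionary mirrors the intermediate artifacts the article describes, and the stage name on the exception tells you which boundary to inspect.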

PDF Extraction Boundary

pymupdf_rag.py uses PyMuPDF to convert each page into Markdown-like text. It handles:

  • Multi-column layouts (common in academic CVs)
  • Embedded images (extracts alt text when present)
  • Tables (converts to pipe-delimited Markdown)

The output is plain text. No LLM involvement. This boundary matters because:

  • Error handling is deterministic: If PyMuPDF raises an exception, you know the PDF is corrupt. You do not waste tokens on a retry loop.
  • Observability is simple: Log the Markdown. Diff it against the PDF. No prompt archaeology.
  • Cost control: You pay for LLM tokens only after extraction succeeds.

The alternative (feeding raw PDF bytes to a vision model) burns tokens on layout parsing and introduces a second source of hallucination.
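
A minimal sketch of this boundary, assuming plain-text extraction with PyMuPDF's `page.get_text()` rather than the repository's Markdown converter. `ExtractionError`, `extract_pages`, `check_extractable`, and the 50-character threshold are hypothetical choices for illustration.

```python
class ExtractionError(RuntimeError):
    """Raised before any LLM call when the PDF cannot yield usable text."""


def extract_pages(path: str) -> list[str]:
    """Return plain text per page using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so check_extractable has no dependency

    try:
        doc = fitz.open(path)
    except Exception as exc:  # PyMuPDF raises on corrupt or unreadable files
        raise ExtractionError(f"cannot open {path}: {exc}") from exc
    with doc:
        if doc.needs_pass:
            raise ExtractionError("PDF is password-protected")
        return [page.get_text() for page in doc]


def check_extractable(pages: list[str], min_chars: int = 50) -> list[str]:
    """Reject output with almost no text, which usually means a scanned image."""
    if sum(len(p.strip()) for p in pages) < min_chars:
        raise ExtractionError("no text layer found; the PDF may be a scanned image")
    return pages
```

Calling `check_extractable(extract_pages("resume.pdf"))` before any prompt is rendered means a bad file fails with a typed error and costs no tokens.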

LLM Section Parsing

pdf.py splits the Markdown into sections (contact, experience, education, skills) using Jinja templates under prompts/templates. Each template asks the LLM to extract structured fields:

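```python
# Simplified from pdf.py
import json
from jinja2 import Environment, FileSystemLoader

jinja_env = Environment(loader=FileSystemLoader("prompts/templates"))

def extract_section(markdown: str, section: str, provider: "LLMProvider") -> dict:
    # Render the per-section prompt, call the LLM, and parse its JSON reply.
    template = jinja_env.get_template(f"{section}.jinja")
    prompt = template.render(resume_text=markdown)
    response = provider.generate(prompt)
    return json.loads(response)  # raises json.JSONDecodeError on malformed output
```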

Each response is validated against the Pydantic schemas defined in models.py. For example, WorkExperience requires:

  • company (string)
  • position (string)
  • start_date and end_date (ISO 8601)
  • responsibilities (list of strings)

If the LLM returns malformed JSON, pdf.py retries once with the error message appended to the prompt. After two failures, it logs the raw response and moves on. The pipeline does not block.
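
The retry-with-feedback behavior can be sketched as follows. This is an assumption-laden illustration, not the code in pdf.py: `parse_with_retry` and the wording of the correction message are hypothetical, and `generate` stands in for any callable that sends a prompt to an LLM.

```python
import json
from typing import Callable, Optional


def parse_with_retry(generate: Callable[[str], str], prompt: str,
                     max_attempts: int = 2) -> Optional[dict]:
    """Call the LLM; on malformed JSON, retry with the parser error appended."""
    current = prompt
    for attempt in range(1, max_attempts + 1):
        raw = generate(current)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Log the raw reply so the failure can be inspected later.
            print(f"attempt {attempt} returned malformed JSON: {raw!r}")
            current = (f"{prompt}\n\nYour previous reply was not valid JSON "
                       f"({exc}). Return only valid JSON.")
    return None  # the caller records the section as missing and moves on
```

Returning `None` instead of raising is what keeps one bad section from blocking the rest of the pipeline.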

Why Per-Section Calls?

Sending the entire resume in one prompt risks:

  • Context window overflow: A 10-page CV can exceed 8k tokens.
  • Attention dilution: The LLM might skip minor sections.
  • Retry cost: If one section fails, you re-process everything.

Per-section calls cost more in latency (serial requests) but isolate failures. You can parallelize them with asyncio if the provider supports concurrent requests.
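
One way to parallelize the per-section calls, sketched with `asyncio.gather`. `extract_all` and `fake_extract` are hypothetical; a real version would await the provider's async client in place of the stand-in.

```python
import asyncio
from typing import Awaitable, Callable

SECTIONS = ["contact", "experience", "education", "skills"]


async def extract_all(markdown: str,
                      extract: Callable[[str, str], Awaitable[dict]]) -> dict:
    """Run one extraction per section concurrently; failures stay isolated."""
    results = await asyncio.gather(
        *(extract(markdown, s) for s in SECTIONS), return_exceptions=True
    )
    # A failed section maps to None instead of aborting the other sections.
    return {s: (None if isinstance(r, Exception) else r)
            for s, r in zip(SECTIONS, results)}


async def fake_extract(markdown: str, section: str) -> dict:
    # Stand-in for an async LLM call; "skills" simulates a parsing failure.
    await asyncio.sleep(0)
    if section == "skills":
        raise ValueError("malformed JSON")
    return {"section": section, "chars": len(markdown)}


if __name__ == "__main__":
    print(asyncio.run(extract_all("# Jane Doe", fake_extract)))
```

With `return_exceptions=True`, one failing section does not cancel the others, which preserves the isolation that motivated per-section calls in the first place.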

GitHub Enrichment

github.py fetches the candidate’s profile and repositories via the REST API. It classifies each repo by language and topic, then asks the LLM to select the top 7 based on:

  • Stars and forks (popularity signal)
  • Commit frequency (activity signal)
  • README quality (documentation signal)

The LLM prompt includes the full repo list as JSON. The response is a ranked list of repo names. github.py validates that all names exist in the input, then fetches detailed stats for the top 7.
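
The name check described above might look like this. `validate_ranking` is a hypothetical helper, and the assumption that each repo is a dict with a `name` key is mine, not the repository's.

```python
def validate_ranking(ranked: list[str], repos: list[dict], limit: int = 7) -> list[str]:
    """Reject names the LLM invented, then keep the first `limit` unique names."""
    known = {r["name"] for r in repos}
    unknown = [n for n in ranked if n not in known]
    if unknown:
        raise ValueError(f"LLM returned repos not in the input: {unknown}")
    # dict.fromkeys removes duplicates while preserving the ranked order.
    return list(dict.fromkeys(ranked))[:limit]
```

Only after this check passes would the module spend API requests on detailed stats, so a hallucinated repo name never turns into a failed or misleading lookup.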

Tool Boundary Lesson

The agent does not call the GitHub API directly. github.py is a Python module that wraps requests and handles:

  • Rate limiting (respects X-RateLimit-Remaining)
  • Authentication (accepts GITHUB_TOKEN env var)
  • Pagination (fetches all repos, not just the first page)

The LLM sees only the final JSON. This separation prevents:

  • Credential leakage: The LLM never touches the API token.
  • Retry loops: If the API signals a rate limit (GitHub uses 403 or 429), the module sleeps and retries. The LLM does not see the error.
  • Hallucinated API calls: The LLM cannot invent endpoints or parameters.

If you let the LLM generate API calls (function calling), you must sandbox execution and validate every parameter. Wrapping the API in a module is simpler.
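
A sketch of the sleep-and-retry behavior, under stated assumptions. The transport is injected so the example runs without network access; a real module would wrap `requests.get` there. `fetch_with_backoff` and `Transport` are hypothetical names, and the headers are read from a plain dict, so the keys are case-sensitive here (unlike the headers object in requests).

```python
import time
from typing import Callable

# A transport returns (status_code, headers, parsed_body).
Transport = Callable[[str], tuple[int, dict[str, str], object]]


def fetch_with_backoff(url: str, transport: Transport, max_retries: int = 3,
                       sleep: Callable[[float], None] = time.sleep,
                       now: Callable[[], float] = time.time) -> object:
    """Retry rate-limited responses; the caller only ever sees the final body."""
    for _ in range(max_retries + 1):
        status, headers, body = transport(url)
        limited = status == 429 or (
            status == 403 and headers.get("X-RateLimit-Remaining") == "0")
        if not limited:
            if status >= 400:
                raise RuntimeError(f"GitHub API error {status} for {url}")
            return body
        # Prefer Retry-After; otherwise wait until the X-RateLimit-Reset epoch.
        if "Retry-After" in headers:
            delay = float(headers["Retry-After"])
        else:
            delay = max(0.0, float(headers.get("X-RateLimit-Reset", now())) - now())
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```

Because the retry loop lives entirely in this function, the prompt that eventually reaches the LLM contains either complete data or nothing, never an error payload.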

Scoring and Fairness Constraints

evaluator.py runs the final evaluation. It takes:

  • Normalized JSON Resume
  • Job description
  • Scoring rubric (category weights, bonus rules, deduction rules)

The rubric is a YAML file. Example:

```yaml
categories:
  - name: Technical Skills
    weight: 0.3
    max_score: 10
  - name: Experience
    weight: 0.4
    max_score: 10
  - name: Education
    weight: 0.2
    max_score: 10
  - name: Projects
    weight: 0.1
    max_score: 10

bonuses:
  - condition: "GitHub stars > 100"
    points: 2
  - condition: "Open source contributions > 10"
    points: 1

deductions:
  - condition: "Employment gap > 1 year"
    points: -1
```

The LLM receives the rubric and the resume. It must return:

  • A score (0-10) for each category
  • Evidence (quoted text from the resume)
  • Applied bonuses and deductions

evaluator.py validates:

  • Scores are within bounds
  • Evidence exists in the input
  • Bonuses and deductions match the rubric

If validation fails, the score is rejected. The pipeline does not trust the LLM to do arithmetic or follow rules. It treats the LLM as a text-to-structured-data translator, then enforces constraints in Python.
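
The three checks can be sketched as one function that returns a list of problems. The response shape used here (`categories`, `score`, `evidence`, `adjustments`) is a hypothetical schema chosen for the example, and the evidence check is a plain substring test; the actual evaluator.py may differ on both counts.

```python
def validate_evaluation(result: dict, resume_text: str, rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the response is accepted."""
    problems: list[str] = []
    limits = {c["name"]: c["max_score"] for c in rubric["categories"]}
    items = result.get("categories", [])

    # Every rubric category must be scored.
    seen = {item.get("name") for item in items}
    for name in sorted(limits.keys() - seen):
        problems.append(f"missing category: {name}")

    for item in items:
        name, score = item.get("name"), item.get("score")
        if name not in limits:
            problems.append(f"unknown category: {name}")
        elif not isinstance(score, (int, float)) or not 0 <= score <= limits[name]:
            problems.append(f"score out of bounds for {name}: {score}")
        # Evidence must be a verbatim quote from the resume.
        for quote in item.get("evidence", []):
            if quote not in resume_text:
                problems.append(f"evidence not found in resume: {quote!r}")

    # Bonuses and deductions must match a rubric rule exactly.
    allowed = {(r["condition"], r["points"])
               for r in rubric.get("bonuses", []) + rubric.get("deductions", [])}
    for adj in result.get("adjustments", []):
        if (adj.get("condition"), adj.get("points")) not in allowed:
            problems.append(f"adjustment not in rubric: {adj}")
    return problems
```

Collecting every problem rather than stopping at the first gives a more useful log entry when a response is rejected.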

Deterministic vs. Stochastic Split

| Component | Type | Why |
|---|---|---|
| Category score (0-10) | Stochastic | Requires judgment (e.g., “Is 5 years of Python experience worth 8/10?”) |
| Evidence extraction | Stochastic | Requires reading comprehension |
| Bonus/deduction application | Deterministic | Rules are boolean (stars > 100 is true or false) |
| Final score calculation | Deterministic | Weighted sum of category scores + bonuses + deductions |

The LLM proposes category scores and evidence. Python enforces the rest. This split prevents the LLM from inventing bonus points or ignoring deductions.
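
A worked example of the deterministic half, using the rubric values shown earlier. The `PREDICATES` table that maps condition strings to boolean checks, the `facts` keys, and the choice to leave the total unclamped are all assumptions for illustration; the source does not say how conditions are evaluated or whether the total is capped.

```python
from typing import Callable

# Hypothetical mapping from rubric condition strings to boolean checks.
PREDICATES: dict[str, Callable[[dict], bool]] = {
    "GitHub stars > 100": lambda f: f["github_stars"] > 100,
    "Open source contributions > 10": lambda f: f["oss_contributions"] > 10,
    "Employment gap > 1 year": lambda f: f["max_gap_years"] > 1,
}

RUBRIC = {
    "categories": [
        {"name": "Technical Skills", "weight": 0.3},
        {"name": "Experience", "weight": 0.4},
        {"name": "Education", "weight": 0.2},
        {"name": "Projects", "weight": 0.1},
    ],
    "bonuses": [{"condition": "GitHub stars > 100", "points": 2},
                {"condition": "Open source contributions > 10", "points": 1}],
    "deductions": [{"condition": "Employment gap > 1 year", "points": -1}],
}


def final_score(scores: dict[str, float], rubric: dict, facts: dict) -> float:
    """Weighted sum of LLM-proposed category scores plus rule-based adjustments."""
    weighted = sum(c["weight"] * scores[c["name"]] for c in rubric["categories"])
    rules = rubric.get("bonuses", []) + rubric.get("deductions", [])
    adjustment = sum(r["points"] for r in rules if PREDICATES[r["condition"]](facts))
    return round(weighted + adjustment, 2)


scores = {"Technical Skills": 8, "Experience": 7, "Education": 6, "Projects": 9}
facts = {"github_stars": 250, "oss_contributions": 4, "max_gap_years": 0.5}
total = final_score(scores, RUBRIC, facts)
# Weighted part: 0.3*8 + 0.4*7 + 0.2*6 + 0.1*9 = 7.3; one bonus (+2) applies.
```

The LLM contributes only the four numbers in `scores`. Whether a bonus applies is decided from measured facts, so the model cannot award itself extra points.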

Provider Flexibility

llm_utils.py abstracts LLM providers. The repository supports:

  • Ollama (local): Runs llama3.2, qwen2.5, or any model you pull.
  • Google Gemini (hosted): Uses gemini-1.5-flash or gemini-1.5-pro.

You set LLM_PROVIDER=ollama or LLM_PROVIDER=gemini in .env. The provider interface is:

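```python
from typing import Protocol

class LLMProvider(Protocol):
    def generate(self, prompt: str, schema: dict | None = None) -> str:
        """Return the model's reply; use structured output when schema is given."""
        ...
```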

If schema is provided, the provider uses structured output mode (Gemini’s response_schema or Ollama’s JSON mode). If not, it returns raw text and llm_utils.py cleans it (strips Markdown fences, fixes trailing commas).
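
The cleanup step for raw text could be sketched like this. `clean_json_text` is a hypothetical helper, not the function in llm_utils.py, and the trailing-comma regex is deliberately naive: it would also alter a comma followed by a closing bracket inside a string value.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def clean_json_text(raw: str) -> str:
    """Strip a surrounding Markdown fence and remove trailing commas."""
    text = raw.strip()
    fenced = re.match(rf"^{FENCE}[\w-]*\s*\n(.*?)\n?{FENCE}$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Drop commas that directly precede a closing brace or bracket.
    return re.sub(r",\s*([}\]])", r"\1", text)


raw_reply = FENCE + 'json\n{"skills": ["Python", "SQL",],}\n' + FENCE
parsed = json.loads(clean_json_text(raw_reply))
```

Structured output mode makes this step unnecessary when the provider supports it, which is why the cleanup only applies to the raw-text path.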

Why Two Providers?

  • Ollama: Zero API cost, full control, slower inference.
  • Gemini: Fast, cheap ($0.075 per 1M input tokens for Flash), no local GPU required.

For development, Ollama lets you iterate on prompts without burning credits. For production, Gemini handles scale. The abstraction makes switching trivial.

Observability and Audit Trails

When DEVELOPMENT_MODE=true, score.py writes:

  • resume_markdown.txt: Raw PDF extraction output
  • sections.json: Per-section LLM responses
  • github_data.json: API payloads
  • evaluation.json: Final scores and evidence
  • results.csv: One row per candidate with all fields

The CSV is the audit trail. You can:

  • Spot-check evidence against the resume
  • Compare GitHub signals to claimed experience
  • Identify scoring drift (if category scores creep up over time)

The repository does not include observability hooks (OpenTelemetry, Datadog). You would add spans around each stage and log:

  • Latency per stage
  • Token counts per LLM call
  • GitHub API rate limit headroom
  • Validation failure rates
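
As a dependency-free stand-in for tracing spans, a context manager can record latency and outcome per stage. `stage_span` and the in-memory `metrics` list are hypothetical; in production you would likely replace them with your tracing library's span API.

```python
import time
from contextlib import contextmanager

metrics: list[dict] = []  # in-memory sink; swap for a real exporter


@contextmanager
def stage_span(name: str, **attrs):
    """Record latency and outcome for one stage; the body can add more fields."""
    record = {"stage": name, "ok": True, **attrs}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["ok"] = False
        raise
    finally:
        # Runs on success and failure, so failed stages are measured too.
        record["latency_s"] = time.perf_counter() - start
        metrics.append(record)


with stage_span("scoring", candidate="c1") as span:
    span["tokens"] = 1234  # e.g. token count reported by the provider
```

The failure rate per stage then falls out of the recorded `ok` flags, and rate limit headroom can be attached as an attribute on the GitHub stage.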

Deployment Shape

The repository is a CLI tool. You run:

```shell
python -m hiring_agent.score --resume resume.pdf --job-description jd.txt
```

It prints the final score and writes artifacts to output/. For production, you would wrap it in:

  • API server: FastAPI endpoint that accepts PDF uploads and returns JSON.
  • Queue worker: Celery task that processes resumes asynchronously.
  • Batch job: Kubernetes CronJob that scores a directory of resumes nightly.

The pipeline is stateless. Each invocation is independent. You can scale horizontally by running multiple workers. The only shared state is the GitHub API rate limit (5,000 requests/hour for authenticated users).

Likely Failure Modes

| Failure | Cause | Mitigation |
|---|---|---|
| PDF extraction fails | Scanned image, password-protected | Pre-flight check with PyPDF2.PdfReader.is_encrypted |
| LLM returns malformed JSON | Prompt drift, model update | Retry with error message, log raw response |
| GitHub API rate limit | Too many candidates | Cache profile data, use conditional requests (If-None-Match) |
| Scoring hallucination | LLM ignores rubric | Validate all scores in Python, reject invalid responses |
| Evidence fabrication | LLM quotes text not in resume | Fuzzy match evidence against input, flag low-confidence matches |

The pipeline handles the first four. Evidence fabrication is harder. You need a second LLM call to verify quotes or a vector search over the resume text.
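
A cheaper first pass than a second LLM call is a fuzzy match with the standard library. The sketch below slides a word window over the resume and scores the best match with `difflib.SequenceMatcher`. `evidence_confidence` is a hypothetical helper, and the threshold for a low-confidence match is left to the caller.

```python
import re
from difflib import SequenceMatcher


def evidence_confidence(quote: str, resume_text: str) -> float:
    """Best word-level similarity (0.0 to 1.0) between the quote and any
    resume window of the same length in words."""
    q = re.findall(r"\w+", quote.lower())
    r = re.findall(r"\w+", resume_text.lower())
    if not q or not r:
        return 0.0
    n = len(q)
    best = 0.0
    for i in range(max(1, len(r) - n + 1)):
        window = r[i:i + n]
        ratio = SequenceMatcher(None, q, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


resume = ("Built payment APIs in Python at Acme for 5 years. "
          "Mentored two junior engineers.")
exact = evidence_confidence("payment APIs in Python", resume)
invented = evidence_confidence("Led a team of 40 engineers", resume)
```

Here `exact` scores 1.0 and `invented` scores well below 0.5, so a cutoff somewhere between separates them. Near-verbatim quotes with a changed word land in between and are the ones worth routing to a human or a verification call.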

Technical Verdict

Use Hiring Agent when:

  • You need explainable scores with cited evidence.
  • You want to enrich resumes with GitHub data.
  • You can tolerate 10-30 seconds of latency per resume.
  • You need both local (Ollama) and hosted (Gemini) LLM options.

Avoid it when:

  • You need sub-second scoring (the pipeline is serial, not optimized for speed).
  • You cannot validate LLM outputs (the scoring logic assumes you enforce constraints in code).
  • You need multi-modal inputs (the pipeline does not handle video interviews or coding tests).

The repository is a teaching tool. It shows how to build a multi-stage agent pipeline with clear boundaries between stochastic and deterministic components. The code is readable, the prompts are versioned, and the audit trail is built in. If you are building hiring automation, this is a better starting point than a single LLM call.