Kita's VLM Credit Review: How Vision Models Parse Unstructured Financial Documents When APIs Don't Exist

When open banking APIs don’t exist and credit bureaus return garbage, lenders fall back to the oldest form of underwriting: looking at documents. In emerging markets like the Philippines, Mexico, and Indonesia, loan applicants submit photos of pay stubs, screenshots of bank statements, and scanned utility bills. No standard templates. No machine-readable formats. Just images.

Kita (YC W26) built a VLM-based pipeline to turn those images into structured credit decisions. The problem is not OCR. It’s extracting financial signals, detecting fraud, and cross-referencing data across inconsistent document types when you have no ground-truth API to validate against.

The Infrastructure Gap

In developed markets, credit underwriting runs on structured data pipes:

Plaid or Tink for bank account access
Credit bureaus for payment history
Payroll APIs for income verification

In the Philippines, Mexico, South Africa, and similar markets, those pipes don’t exist or return unreliable data. Borrowers upload whatever they have. Lenders hire credit analysts to manually review each file.

The scale problem is real. Kita cites $13.3 trillion in global lending volume in 2025, with 90% involving document review. Even in the US, non-prime lending and small business credit still rely on manual document checks.

The VLM Pipeline

Kita’s system processes 50+ document types across PDFs, scans, photos, and screenshots. The pipeline has four stages:

1. Document Enhancement

Low-quality phone photos get preprocessed: deskewing, contrast adjustment, noise reduction. This happens before the VLM sees the image. If you feed a blurry, rotated bank statement directly into a vision model, extraction accuracy drops by 30-40%.
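As a rough illustration of the contrast adjustment step, here is a minimal pure-Python sketch of a linear contrast stretch on a grayscale image. It is not Kita’s implementation: the `stretch_contrast` name and the list-of-rows image representation are assumptions made for the example, and a production pipeline would more likely use an image library and also handle deskewing and denoising.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, pixel values 0-255 (assumed representation)


def stretch_contrast(image: Image) -> Image:
    """Linearly rescale pixels so the darkest maps to 0 and the brightest to 255."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)  # assumes a non-empty image
    if hi == lo:
        # Flat image: nothing to stretch, so return an unchanged copy.
        return [row[:] for row in image]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]
```

A washed-out photo whose pixel values sit in a narrow band (say 100 to 150) comes out using the full 0-255 range, which tends to make printed text easier to separate from the background.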

2. Entity Extraction

The VLM identifies financial entities: account balances, transaction dates, employer names, income amounts. This is not simple OCR. A pay stub might show gross pay, net pay, deductions, and year-to-date totals in different layouts. The model needs to understand which number matters for underwriting.
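One way to make extraction output usable downstream is to force it into a typed schema before anything else touches it. The sketch below assumes the model replies with JSON mapping each field name to a value and a confidence score. The field names, the `ExtractedField` type, and `parse_pay_stub` are all hypothetical, not part of any documented Kita interface.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # model-reported score in [0, 1]


# Hypothetical required fields for a pay stub; a real schema varies by document type.
REQUIRED_PAY_STUB_FIELDS = ("employer_name", "gross_pay", "net_pay", "pay_period_end")


def parse_pay_stub(raw_json: str) -> Dict[str, ExtractedField]:
    """Turn a model's JSON reply into typed fields, rejecting incomplete output."""
    payload = json.loads(raw_json)
    fields = {}
    for name, item in payload.items():
        confidence = float(item["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {name}: {confidence}")
        fields[name] = ExtractedField(name, str(item["value"]), confidence)
    missing = [f for f in REQUIRED_PAY_STUB_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return fields
```

Keeping gross pay and net pay as separate named fields leaves the choice of which number matters for underwriting to later, explicit code rather than to the model.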

3. Cross-Document Verification

A single document is not enough. Kita cross-references data across multiple files:

Does the employer name on the pay stub match the bank statement deposits?
Do the transaction dates align with stated pay periods?
Are the account balances consistent across different statement pages?

This is where fraud detection happens. If an applicant submits a Photoshopped pay stub, the stated income usually won’t reconcile with the deposits on the bank statement.
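The first and last questions in the list above reduce to two small comparisons. The following is a minimal sketch using only the Python standard library. The function names, the 0.8 similarity threshold, and the 5% tolerance are illustrative assumptions, not values Kita has published.

```python
from difflib import SequenceMatcher
from typing import List


def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy-compare two employer names after normalizing case and whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio() >= threshold


def income_reconciles(net_pay: float, deposits: List[float], tolerance: float = 0.05) -> bool:
    """True if at least one deposit is within a relative tolerance of the stated net pay."""
    return any(abs(d - net_pay) <= tolerance * net_pay for d in deposits)
```

Real deposit descriptions often bury the employer name inside other text (reference codes, "PAYROLL" prefixes), so a production matcher would need more normalization than this.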

4. Risk Signal Aggregation

The final output is not raw text. It’s structured underwriting data: debt-to-income ratio, payment consistency, cash flow volatility. These signals feed into the lender’s credit model.
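Two of these signals have simple textbook definitions, sketched below. Debt-to-income is monthly debt payments divided by monthly income. For cash flow volatility this example uses the coefficient of variation of monthly net flows, which is one common choice and an assumption here, since the exact formula Kita uses is not stated.

```python
from statistics import mean, pstdev
from typing import List


def debt_to_income(monthly_debt_payments: float, monthly_income: float) -> float:
    """Ratio of recurring debt payments to income for the same period."""
    if monthly_income <= 0:
        raise ValueError("monthly income must be positive")
    return monthly_debt_payments / monthly_income


def cash_flow_volatility(monthly_net_flows: List[float]) -> float:
    """Coefficient of variation: population std dev divided by the absolute mean."""
    avg = mean(monthly_net_flows)
    if avg == 0:
        raise ValueError("mean net cash flow is zero; volatility undefined")
    return pstdev(monthly_net_flows) / abs(avg)
```

A borrower paying 6,000 a month on debts against 20,000 of income has a ratio of 0.3, and perfectly steady monthly flows give a volatility of 0.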

Handling Confidence Thresholds

VLMs are probabilistic. When the model can’t parse a document with sufficient confidence, the system needs a fallback path.

Kita uses confidence scoring at the field level. If the VLM extracts a bank balance with 95% confidence but an employer name with 60% confidence, the low-confidence field gets flagged for human review.

The human-in-the-loop trigger is not binary. It’s a routing decision based on:

Field importance (income amount vs. document date)
Document type (pay stubs are higher risk than utility bills)
Applicant risk tier (high-value loans get more scrutiny)

This keeps the pipeline fast for clear-cut cases while escalating edge cases to analysts.
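The routing logic can be sketched as a per-field threshold that is raised for riskier document types and loan tiers. Every number and key below is a made-up placeholder chosen to illustrate the idea. The source only says that field importance, document type, and risk tier feed the decision.

```python
from dataclasses import dataclass

# Hypothetical thresholds and adjustments, for illustration only.
FIELD_THRESHOLD = {
    "income_amount": 0.95,
    "account_balance": 0.90,
    "employer_name": 0.85,
    "document_date": 0.70,
}
DOC_TYPE_BUMP = {"pay_stub": 0.03, "utility_bill": 0.0}
RISK_TIER_BUMP = {"standard": 0.0, "high_value": 0.02}


@dataclass(frozen=True)
class FieldResult:
    field: str
    confidence: float


def needs_review(result: FieldResult, doc_type: str, risk_tier: str) -> bool:
    """Flag a field for an analyst when its confidence is below the adjusted threshold."""
    base = FIELD_THRESHOLD.get(result.field, 0.90)
    bump = DOC_TYPE_BUMP.get(doc_type, 0.0) + RISK_TIER_BUMP.get(risk_tier, 0.0)
    required = min(base + bump, 0.99)  # cap so the bar is never unreachable
    return result.confidence < required
```

With these placeholder values, the earlier example plays out as described: an account balance at 95% confidence passes, while an employer name at 60% is routed to a human.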

Latency and Cost Trade-offs

Credit decisions need to be fast. A borrower applying for a $500 loan won’t wait three days for approval.

Component	Latency	Cost per Document	Failure Mode
Document enhancement	200-500ms	$0.001	Blurry input remains blurry
VLM inference	2-5 seconds	$0.02-0.05	Low confidence on messy layouts
Cross-document checks	500ms-1s	$0.005	Missing reference documents
Human review (fallback)	5-15 minutes	$2-5	Analyst fatigue, inconsistency

Going by the table, VLM inference costs roughly 40-250x less per document than human review ($0.02-0.05 versus $2-5), but the savings hold only if the confidence threshold is tuned correctly. Set it too high and every document gets escalated. Set it too low and fraud slips through.
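To see how much the escalation rate dominates the economics, it helps to compute a blended cost per document: the automated cost is always paid, and the human cost is paid only for escalated documents. The sketch below plugs in midpoints of the ranges from the table ($0.041 for the three automated stages combined, $3.50 for human review). Those midpoints are an assumption made for the arithmetic, not figures reported by Kita.

```python
def expected_cost_per_document(
    escalation_rate: float,
    automated_cost: float,
    human_cost: float,
) -> float:
    """Blended cost: every document pays automated_cost, escalated ones also pay human_cost."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return automated_cost + escalation_rate * human_cost


# Assumed midpoints from the table: 0.001 + 0.035 + 0.005 automated, 3.50 human.
AUTOMATED_COST = 0.041
HUMAN_COST = 3.50

for rate in (0.05, 0.30, 1.00):
    cost = expected_cost_per_document(rate, AUTOMATED_COST, HUMAN_COST)
    print(f"escalation {rate:.0%}: ${cost:.3f} per document")
```

At a 30% escalation rate with those midpoints, the blended cost comes to about $1.09 per document, nearly all of it analyst time. That is why the threshold, rather than the inference price, is the lever worth tuning.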

Version Drift and Model Retraining

Document formats change. Banks redesign statements. Employers switch payroll providers. A VLM trained on 2024 Philippine bank statements might fail on 2026 formats.

Kita’s approach to version drift:

Continuous monitoring: Track extraction confidence over time. A sudden drop in confidence for a specific document type signals a format change.
Feedback loops: When a human analyst corrects a VLM extraction, that correction becomes training data.
Regional fine-tuning: Different markets need different models. A pay stub in Mexico looks nothing like one in Indonesia.
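The continuous monitoring item can be approximated with a rolling average per document type. This is a minimal sketch, assuming a known baseline confidence and an alert when the recent average falls more than a fixed amount below it. The class name, window size, and drop threshold are hypothetical.

```python
from collections import deque
from statistics import mean


class ConfidenceDriftMonitor:
    """Tracks recent extraction confidence for one document type."""

    def __init__(self, baseline: float, window: int = 50, max_drop: float = 0.10):
        self.baseline = baseline
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)

    def record(self, confidence: float) -> bool:
        """Add one score; return True once a full window averages well below baseline."""
        self.recent.append(confidence)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to judge
        return self.baseline - mean(self.recent) > self.max_drop
```

Waiting for a full window before alerting trades detection speed for fewer false alarms from a handful of bad photos.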

The challenge is labeled data. In developed markets, you can validate VLM output against API responses. In emerging markets, there is no API. Ground truth comes from human review, which is expensive and slow.

Fraud Detection Without Ground Truth

Traditional fraud detection relies on negative databases: known bad actors, stolen identities, blacklisted accounts. Those databases don’t exist in many emerging markets.

Kita’s fraud checks are document-based:

Internal consistency: Do the numbers add up? Does the transaction history match the ending balance?
Cross-document alignment: Does the pay stub income match bank deposits?
Visual anomalies: Are there signs of image manipulation (mismatched fonts, pixel artifacts, inconsistent shadows)?

The VLM can spot visual fraud that OCR misses. A Photoshopped bank statement might yield perfect text extraction yet show inconsistent typography. The vision model sees the layout as a whole, not just the characters.
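The internal consistency check is the easiest of the three to express in code: the opening balance plus the signed transactions should equal the stated closing balance. The sketch below assumes the extraction step has already produced an opening balance and signed amounts. The function name and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal
from typing import Iterable


def balance_reconciles(
    opening: Decimal,
    transactions: Iterable[Decimal],
    closing: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Check that opening balance plus signed transactions matches the closing balance."""
    computed = opening + sum(transactions, Decimal("0"))
    return abs(computed - closing) <= tolerance
```

`Decimal` is used instead of floats so that currency amounts add up exactly. A statement whose ending balance was edited without touching the transaction rows fails this check regardless of how clean the image looks.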

When This Breaks

VLM-based underwriting fails in predictable ways:

Handwritten documents: If a borrower submits a handwritten ledger instead of a printed bank statement, extraction accuracy drops. Fine-tuning on handwritten text helps, but that effectively means training and maintaining a separate model.

Multi-page reconciliation: If a bank statement spans 10 pages and the account balance is on page 1 but the transaction details are on pages 2-9, the model needs to maintain state across pages. This is a context window problem.
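One possible workaround, offered as an assumption rather than as Kita’s method, is to extract each page independently and keep the cross-page state in ordinary code instead of in the model’s context. The `PageExtraction` type and `running_balances` function below are hypothetical names for that idea.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class PageExtraction:
    page_number: int
    transactions: List[Decimal]  # signed amounts in the order they appear


def running_balances(opening: Decimal, pages: List[PageExtraction]) -> List[Decimal]:
    """Carry the balance forward page by page, in page order."""
    balance = opening
    result = []
    for page in sorted(pages, key=lambda p: p.page_number):
        balance += sum(page.transactions, Decimal("0"))
        result.append(balance)
    return result
```

The last value can then be compared against the closing balance read from the summary page, and the intermediate values help locate which page introduced a mismatch.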

Adversarial inputs: If borrowers learn that the system flags certain patterns, they’ll adapt. A sophisticated fraudster might create a fake document that passes all the visual checks but has fabricated numbers.

Regulatory compliance: Some jurisdictions require explainability for credit decisions. “The VLM said so” is not an acceptable explanation. You need to log which fields were extracted, which checks were run, and which thresholds were crossed.
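A minimal sketch of such a log entry follows, assuming one JSON record per application that captures the extracted fields, each check that ran, and the threshold it was compared against. The record layout and names are illustrative, and actual regulatory requirements vary by jurisdiction.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CheckResult:
    name: str
    passed: bool
    threshold: float  # the limit the check was evaluated against
    observed: float   # the value actually measured


@dataclass
class DecisionRecord:
    application_id: str
    extracted_fields: dict
    checks: List[CheckResult] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record, nested checks included, for an audit trail."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the threshold next to the observed value means a later reviewer can see why a check passed or failed even after the thresholds have been retuned.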

Technical Verdict

Use Kita’s approach when:

You operate in markets where open banking and credit bureau APIs are unreliable or nonexistent.
Your borrowers submit financial documents in inconsistent formats (photos, screenshots, scans).
Manual document review is your bottleneck, and you need to scale underwriting without hiring more analysts.
You can tolerate a human-in-the-loop fallback for edge cases and use that feedback to improve the model.

Avoid this approach when:

You have access to structured data APIs and can pull verified financial data directly from banks.
Your regulatory environment requires full explainability and you can’t rely on probabilistic model outputs.
Your document volume is too low to justify the infrastructure investment (if you’re reviewing 100 loans per month, hire an analyst).
You need zero-latency decisions and can’t afford the 2-5 second VLM inference time.

The real value is not replacing human underwriters. It’s triaging the easy cases so analysts spend time on the hard ones. If 70% of loan applications can be auto-approved or auto-rejected based on VLM extraction, your analysts focus on the ambiguous 30%. That’s where the economics work.