A developer built a public tracker to answer one question: is the Strait of Hormuz open for shipping? The project landed 484 points on Hacker News, but the builder’s candid admission in the comments reveals a common agent plumbing problem. Live ship-tracking APIs cost too much for a side project, so they manually copied JSON snapshots from an unnamed source instead.
This is not a story about geopolitics. It is a story about the infrastructure gap between real-time data feeds and what you can actually afford to poll when building an agent that monitors real-world state changes.
The Data Ingestion Problem
Commercial maritime tracking services such as MarineTraffic and VesselFinder can charge hundreds to thousands of dollars per month for live AIS (Automatic Identification System) data. For a production logistics agent or a hedge fund monitoring oil shipments, that cost is justified. For a developer trying to track whether a single chokepoint is blocked, it is not.
The builder’s workaround:
- Find a source that exposes ship position data as JSON
- Manually copy the JSON payload
- Parse it locally
- Display a binary state (open or closed) based on ship density
This works until the upstream source changes its schema, rate-limits your IP, or goes offline. You have no SLA, no versioning, and no way to detect drift except when your parser breaks.
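You can catch schema drift before the parser breaks by validating each snapshot on arrival. The following is a minimal sketch, not the builder's code. It assumes the snapshot looks like `{"ships": [{"lat": ..., "lon": ...}]}`, the same shape the implementation sketch later in this article uses; the real source's schema is unknown, and `validate_snapshot` and `REQUIRED_SHIP_KEYS` are hypothetical names.

```python
from typing import Any

# Assumed snapshot shape (not confirmed by the source):
# {"ships": [{"lat": <number>, "lon": <number>}, ...]}
REQUIRED_SHIP_KEYS = ("lat", "lon")


def validate_snapshot(payload: Any) -> list[str]:
    """Return a list of human-readable schema problems (empty means valid)."""
    if not isinstance(payload, dict):
        return ["top-level value is not a JSON object"]
    ships = payload.get("ships")
    if not isinstance(ships, list):
        return ["missing or non-list 'ships' key"]

    problems: list[str] = []
    for i, ship in enumerate(ships):
        if not isinstance(ship, dict):
            problems.append(f"ships[{i}] is not an object")
            continue
        for key in REQUIRED_SHIP_KEYS:
            value = ship.get(key)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"ships[{i}].{key} is missing or not a number")
    return problems
```

A non-empty result is your drift signal: log it, alert on it, and keep serving the last known state rather than parsing a payload you no longer understand.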
Polling Architecture for Manual Snapshots
When you depend on manual data collection, your agent’s polling loop has to degrade gracefully. Here is the failure surface:
| Failure Mode | Detection Method | Mitigation |
|---|---|---|
| Schema drift | JSON parse error or missing expected keys | Schema validation layer with alerting |
| Stale data | Timestamp comparison (if available) | Display last-updated timestamp; flag staleness |
| Source offline | HTTP timeout, connection error, or 4xx/5xx status | Fallback to cached state; surface error to user |
| Rate limiting | 429 response or IP block | Exponential backoff; consider proxy rotation |
| Manual update lag | Human forgets to refresh | Automate the copy step with a scraper (legal risk) |
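The backoff row in the table above can be sketched as a retry wrapper. This is an illustration under assumptions: `fetch` is a hypothetical callable that returns parsed JSON on success and `None` on any failure, and the delay schedule (doubling from one second, with full jitter) is a common default, not something the tracker is known to use. The `sleep` parameter exists only so the function can be exercised without waiting.

```python
import random
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call `fetch` until it returns data or the attempts run out."""
    for attempt in range(max_attempts):
        data = fetch()
        if data is not None:
            return data
        if attempt < max_attempts - 1:
            # Upper bound doubles each attempt (1s, 2s, 4s, ...), capped at
            # max_delay. Full jitter: sleep a random amount up to that bound
            # so retries do not arrive in lockstep.
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))
    return None  # caller should fall back to its cached state
```

When this returns `None`, the caller serves the cached state and marks it stale instead of retrying forever against a source that may be blocking it.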
The project tracks a binary state, but the underlying data is continuous ship positions. That means you need a threshold function: how many ships, in what positions, constitute “open”? If you hardcode the logic, you cannot adapt when blockade tactics change. If you make it configurable, you need a UI or config file that non-technical users can adjust.
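One way to keep the threshold adjustable is to load it from a small JSON config. The sketch below is an assumption-laden example: the field names are hypothetical, and the defaults (five ships, a 25-27°N by 56-57°E box) are simply the values used in the implementation sketch later in this article, not calibrated numbers.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    # Hypothetical fields; defaults mirror the article's simplified sketch
    min_ships: int = 5
    lat_min: float = 25.0
    lat_max: float = 27.0
    lon_min: float = 56.0
    lon_max: float = 57.0


def load_config(text: str) -> ThresholdConfig:
    """Build a config from JSON text; unspecified keys keep their defaults."""
    return ThresholdConfig(**json.loads(text))


def is_open(ships: list[dict], cfg: ThresholdConfig) -> bool:
    """Apply the threshold: enough ships inside the box counts as open."""
    inside = sum(
        1
        for s in ships
        if cfg.lat_min <= s["lat"] <= cfg.lat_max
        and cfg.lon_min <= s["lon"] <= cfg.lon_max
    )
    return inside >= cfg.min_ships
```

For example, `load_config('{"min_ships": 2}')` lowers the bar to two ships while leaving the bounding box alone. An unknown key raises `TypeError`, which is a cheap guard against typos in a hand-edited file.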
Cost-Benefit Threshold for Paid APIs
When does it make sense to pay for a real API? Here is a rough checklist:
Stick with manual scraping if:
- Update frequency is low (hourly or daily)
- You are tracking a single binary state
- Downtime is acceptable
- You have no compliance or audit requirements
Switch to a paid API when:
- You need sub-minute latency
- You are making decisions that cost money (trading, logistics routing)
- You need historical data or replay capability
- You are serving external users who expect uptime
For this Hormuz tracker, the builder is in the first category. But if you were building an agent that reroutes tanker shipments or triggers commodity trades, you would be in the second.
Implementation Sketch
Here is what a minimal polling agent looks like when you are working with manual JSON snapshots:
```python
from datetime import datetime, timedelta

import requests


class HormuzMonitor:
    def __init__(self, json_url, threshold_ships=5):
        self.json_url = json_url
        self.threshold_ships = threshold_ships
        self.last_update = None   # time of the last usable snapshot
        self.cached_state = None  # True = open, False = closed, None = unknown

    def fetch_ship_data(self):
        try:
            response = requests.get(self.json_url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a response body that is not valid JSON
            print(f"Fetch failed: {e}")
            return None

    def is_open(self, ship_data):
        if not ship_data or 'ships' not in ship_data:
            return None  # Unknown state

        # Count ships in the strait's bounding box
        ships_in_strait = [
            s for s in ship_data['ships']
            if self._in_strait(s['lat'], s['lon'])
        ]
        return len(ships_in_strait) >= self.threshold_ships

    def _in_strait(self, lat, lon):
        # Simplified bounding box for Strait of Hormuz
        return (25.0 <= lat <= 27.0) and (56.0 <= lon <= 57.0)

    def poll(self):
        data = self.fetch_ship_data()
        state = self.is_open(data)
        if state is not None:
            # Only refresh the cache when the snapshot was usable
            self.last_update = datetime.now()
            self.cached_state = state

        # Before the first successful fetch there is no timestamp to compare,
        # so report the result as stale rather than subtracting from None
        stale = (
            self.last_update is None
            or datetime.now() - self.last_update > timedelta(hours=1)
        )
        # Return cached state if fetch failed
        return {
            'open': self.cached_state,
            'last_update': self.last_update,
            'stale': stale,
        }
```
This pattern applies to any agent that monitors external state: election results, weather events, supply chain disruptions. The key is separating fetch logic from state interpretation and surfacing staleness to the user.
Observability Gaps
When you manually copy data, you lose observability. You cannot answer:
- How often does the upstream source update?
- What is the latency between a real-world event and the data reflecting it?
- How many ships are in the dataset versus how many are actually transmitting AIS?
If you were paying for an API, you would get logs, metrics, and support. With manual scraping, you get none of that. For a side project, this is fine. For an agent making decisions, it is a blind spot.
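You can recover part of the first answer yourself by recording when the snapshot content actually changes. The sketch below hashes each payload and keeps the times at which the hash differed from the previous one. `UpdateTracker` is a hypothetical helper, and its estimate is only as fine-grained as your own polling interval.

```python
import hashlib
import json
from datetime import datetime, timedelta
from typing import Optional


class UpdateTracker:
    """Record when snapshot content changes, to estimate upstream cadence."""

    def __init__(self) -> None:
        self.last_hash: Optional[str] = None
        self.change_times: list[datetime] = []

    def observe(self, payload: dict, seen_at: datetime) -> bool:
        """Return True if this payload differs from the previous one."""
        # sort_keys makes the hash independent of object key order
        canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        if digest == self.last_hash:
            return False
        self.last_hash = digest
        self.change_times.append(seen_at)
        return True

    def mean_interval(self) -> Optional[timedelta]:
        """Average gap between observed changes, or None with fewer than two."""
        if len(self.change_times) < 2:
            return None
        span = self.change_times[-1] - self.change_times[0]
        return span / (len(self.change_times) - 1)
```

This says nothing about the lag between a real-world event and the data, or about AIS coverage. Those two questions stay unanswered without a second, independent source.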
The Real Lesson
This project is a microcosm of a larger problem in agentic AI: the gap between the data you need and the data you can afford. LLMs are cheap. Structured real-time data is not. If you are building an agent that monitors the world, you will hit this wall.
The workaround is to accept higher latency, lower reliability, and manual intervention. That is a valid trade-off for many use cases. But you need to design your system to degrade gracefully when the data pipeline breaks, because it will.
Technical Verdict
Use manual JSON scraping when:
- You are prototyping or building a side project
- Update frequency is measured in hours, not seconds
- You can tolerate downtime and stale data
- You are tracking a single event or binary state
Avoid it when:
- You are making financial or operational decisions
- You need audit trails or compliance documentation
- You are serving external users with uptime expectations
- You need historical data or the ability to replay events
If you start with manual scraping, build in the hooks to swap in a paid API later. Separate your fetch layer from your business logic, surface staleness to users, and log every failure. That way, when your side project turns into a real product, you can upgrade your data pipeline without rewriting your agent.
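One possible shape for that hook is a small interface that every data source implements. In this sketch `ShipDataSource`, `SnapshotFileSource`, and `ships_reported` are hypothetical names, and the `{"ships": [...]}` payload shape is the same assumption used in the implementation sketch above. A paid-API client would be a second class with the same `fetch` method.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class ShipDataSource(Protocol):
    """Anything that can return a parsed snapshot, or None on failure."""

    def fetch(self) -> Optional[dict]: ...


class SnapshotFileSource:
    """Reads a manually copied JSON snapshot from disk."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def fetch(self) -> Optional[dict]:
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as e:
            # Log every failure: missing file, unreadable file, invalid JSON
            print(f"Snapshot read failed: {e}")
            return None
        return data if isinstance(data, dict) else None


def ships_reported(source: ShipDataSource) -> Optional[int]:
    """Business logic sees only the protocol, not where the data came from."""
    data = source.fetch()
    if data is None or not isinstance(data.get("ships"), list):
        return None
    return len(data["ships"])
```

Because `ships_reported` depends only on `fetch`, swapping the file source for a live feed is a one-line change at the call site.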