Building a serverless analytics engine that scales to zero
Most analytics products solve the dashboard-latency problem with an always-on OLAP cluster. ClickHouse, Druid, Snowflake. The cluster is fast. It’s also always running, always paying. When nobody’s looking at dashboards at 3 AM, you’re still paying for the cluster.
We solved the same problem differently. Kanject Insights ships as a Docker container with bundled CDK provisioning. It deploys Lambda + DynamoDB + S3 Express + EventBridge into your AWS account. When idle, it pays $0.
This is the architecture brief — what we kept, what we rejected, and the principles that fall out when you commit to “scales to zero” as a non-negotiable.
The hard constraint that shaped everything
We started with one rule: no always-on infrastructure. No 24/7 cluster. No long-running database process. No standby capacity that bills while no one is querying.
That rule kills the obvious answer (ClickHouse / Druid / etc.) and forces a different shape:
- Pre-aggregate everything important on a schedule, then serve dashboards from the pre-aggregates.
- Compute pipeline must be triggered, not running. Schedules wake it up; idleness puts it back to sleep.
- Storage must be paid by the byte and request, not by capacity reservation.
The shape that falls out is: ingestion → event store → scheduled snapshot computation → pre-aggregated results → instant dashboard reads. Lambda + DynamoDB + S3 Express + EventBridge maps onto every layer.
Divide and conquer
The computation pipeline runs as two separate Lambda functions wired through SQS:
EventBridge schedule
│
▼
┌─────────────────────────────────────────────┐
│ DISPATCHER (1024 MB / 30s) │
│ │
│ 1. Load insight configuration │
│ 2. Derive computation strategy │
│ 3. Calculate event-time window │
│ 4. Compute config hash │
│ 5. Build ComputeSnapshotRequest │
│ 6. Enqueue to Engine via SQS │
│ │
│ Lightweight — no event data touched │
└──────────────────────┬──────────────────────┘
│ SQS
▼
┌─────────────────────────────────────────────┐
│ ENGINE (2048 MB / 120s) │
│ │
│ 1. Reload insight config │
│ 2. Validate config hash │
│ 3. Stream events from Event Store │
│ 4. Apply filters, aggregate metrics │
│ 5. Write snapshot (DynamoDB + S3 Express) │
│ 6. Evaluate alert rules │
│ │
│ Heavyweight — reads and processes events │
└─────────────────────────────────────────────┘
The Dispatcher decides what to compute. The Engine executes the computation. Two functions, two Lambda concurrency budgets, two failure modes. The Dispatcher fails fast and stays small. The Engine takes its time and processes events page by page in a streaming fashion — it never loads the full dataset into memory.
The key win: SQS fans the Engine out — one concurrent Lambda invocation per snapshot. One slow snapshot doesn’t block the others. One failed snapshot doesn’t block the others. The Dispatcher gets to be a single 30-second invocation that emits N work items and walks away.
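The Dispatcher’s six steps reduce to a small pure function. A minimal sketch — field names like `insightId` and `configHash` are illustrative, not the product’s actual message schema:

```python
import hashlib
import json
from datetime import datetime

def config_hash(config: dict) -> str:
    """Hash the canonical JSON form of an insight config so the Engine
    can detect a definition change between dispatch and execution."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def build_snapshot_request(config: dict,
                           last_snapshot_end: datetime,
                           now: datetime) -> dict:
    """Dispatcher step: derive the event-time window and wrap it in a
    work item to be enqueued to SQS. No event data is touched here."""
    return {
        "insightId": config["id"],
        "configHash": config_hash(config),
        "windowStart": last_snapshot_end.isoformat(),
        "windowEnd": now.isoformat(),
    }
```

The resulting dict is what a dispatcher would serialise into the SQS message body; the Engine re-derives the hash from its freshly loaded config and drops the message on mismatch.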
Strategy is derived, never configured
This is the part most analytics platforms hand to the user. We don’t.
Insights whose metrics are all Count or Sum use an incremental strategy: each scheduled run processes only the delta window since the last snapshot. Running totals accumulate across runs. The daily rollup re-scans raw events to produce an authoritative DailyAggregate, but intra-day reads are served from running totals — sub-100ms regardless of how big the day’s event volume gets.
Insights with any DistinctCount, Average, or Percentage metric use full recompute: each run scans all events from the start of the local day to now. These types aren’t additively composable — a value counted as distinct in window A may reappear in window B; merging two distinct counts double-counts; avg(A ∪ B) ≠ avg(avg(A), avg(B)) unless sample sizes are equal.
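The non-composability claim is easy to verify numerically; a minimal check:

```python
# Distinct counts: user "u1" appears in both windows, so merging
# per-window distinct counts double-counts it.
window_a = {"u1", "u2"}
window_b = {"u1", "u3"}
merged_naive = len(window_a) + len(window_b)   # 4
merged_true = len(window_a | window_b)         # 3
assert merged_naive != merged_true

# Averages: the avg of per-window averages is wrong when sizes differ.
a, b = [10], [1, 2, 3]
avg = lambda xs: sum(xs) / len(xs)
naive = (avg(a) + avg(b)) / 2                  # 6.0
true = avg(a + b)                              # 4.0
assert naive != true
```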
The strategy is derived automatically from the metric types. It’s never exposed in the API. The product gets to pick the right shape, and the user never has to learn the words “incremental” or “full recompute.”
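The derivation itself is a one-liner over the metric-type set. A sketch, assuming the strategy names used in this post rather than any real API enum:

```python
# Only additively composable metric types are safe to accumulate
# across delta windows.
INCREMENTAL_SAFE = {"Count", "Sum"}

def derive_strategy(metric_types: set[str]) -> str:
    """Strategy is a pure function of the metric types — never a user
    setting. Any non-additive metric forces a full recompute of the day."""
    if metric_types <= INCREMENTAL_SAFE:
        return "Incremental"
    return "FullRecompute"
```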
Two-tier snapshot storage
Each computed snapshot is split across DynamoDB and S3 Express:
- Tier 1 — DynamoDB holds the snapshot header and a single digest row. The digest carries headline aggregates (metric counts, totals, conversion rates) for sub-100ms dashboard, alert, and sparkline reads. All items in a snapshot share the same partition key, so a single Query returns the header + digest in one shot.
- Tier 2 — S3 Express One Zone holds the full child collection (per-metric entities, funnel steps, cohort cells, correlation summaries, event sequence steps, comparative breakdowns) serialised as JSON.
Why split? Because most of the time, a dashboard reads the digest — counts, conversion rates, headlines. Drill-downs (per-group breakdowns, per-step funnel rates, full cohort matrices) are the minority of reads. Putting 4 KB of grouped values in DynamoDB when you only render 200 bytes most of the time is a tax on every read. S3 Express gets you sub-10ms reads for the drill-down case at a fraction of the DynamoDB cost.
The trade-off: a drill-down is two reads (DynamoDB + S3 Express) instead of one. We’re fine with that — drill-downs are rarer than headlines, and S3 Express is fast enough that the second hop disappears in practice.
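The split at write time is mechanical once you decide which keys are headline aggregates. A sketch — the digest key names here are hypothetical, chosen only to illustrate the tiering:

```python
import json

# Headline aggregates kept in the DynamoDB digest row (hypothetical keys).
DIGEST_KEYS = {"counts", "totals", "conversionRate"}

def split_snapshot(snapshot: dict) -> tuple[dict, bytes]:
    """Split a computed snapshot into the Tier-1 digest (DynamoDB row,
    ~200 bytes of headlines) and the Tier-2 child collection
    (S3 Express object, serialised JSON)."""
    digest = {k: v for k, v in snapshot.items() if k in DIGEST_KEYS}
    children = {k: v for k, v in snapshot.items() if k not in DIGEST_KEYS}
    return digest, json.dumps(children).encode()
```

A headline read fetches only the digest; a drill-down adds the second hop to the S3 Express object.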
Two-layer rollup pipeline
Layer 1 is the intra-day computation, fired by EventBridge Scheduler at the rate the insight requests. The available rates are:
| DataRefreshFrequency | EventBridge expression |
|---|---|
| EveryMinute | rate(1 minute) |
| Every5Minutes | rate(5 minutes) |
| Every15Minutes | rate(15 minutes) |
| Every30Minutes | rate(30 minutes) |
| Hourly | rate(1 hour) |
| Every6Hours | rate(6 hours) |
| Every12Hours | rate(12 hours) |
Standard, Funnel, and EventSequence insights pick whichever fits their use case. Cohort, Correlation, and Comparative are hard-coded to daily — those analytical types don’t benefit from sub-daily recomputation, and pinning them eliminates ~96% of unnecessary Lambda invocations vs allowing even hourly refresh (23 of every 24 runs).
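The mapping plus the pinning rule can be sketched as a small lookup — the type and frequency names come from the tables above; the pinned cron is the daily rollup schedule:

```python
RATE_EXPRESSIONS = {
    "EveryMinute": "rate(1 minute)",
    "Every5Minutes": "rate(5 minutes)",
    "Every15Minutes": "rate(15 minutes)",
    "Every30Minutes": "rate(30 minutes)",
    "Hourly": "rate(1 hour)",
    "Every6Hours": "rate(6 hours)",
    "Every12Hours": "rate(12 hours)",
}

# Analytical types hard-coded to the daily rollup, regardless of request.
DAILY_ONLY_TYPES = {"Cohort", "Correlation", "Comparative"}

def schedule_expression(insight_type: str, requested: str) -> str:
    """Resolve an insight's EventBridge schedule: daily-only types ignore
    the requested refresh rate; everything else gets its rate()."""
    if insight_type in DAILY_ONLY_TYPES:
        return "cron(0 1 * * ? *)"  # the 01:00 UTC daily rollup
    return RATE_EXPRESSIONS[requested]
```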
Layer 2 is the periodic rollup pipeline, fired on fixed cron schedules:
| Rollup | EventBridge cron | Produces |
|---|---|---|
| Daily | cron(0 1 * * ? *) | DailyAggregate for the completed day |
| Weekly | cron(30 1 ? * SUN *) | WeeklyAggregate for the completed week |
| Monthly | cron(0 2 1 * ? *) | MonthlyAggregate for the completed month |
| Quarterly | cron(0 3 1 1,4,7,10 ? *) | QuarterlyAggregate for the completed quarter |
| Yearly | cron(0 4 1 1 ? *) | YearlyAggregate for the completed year |
Schedules are staggered to prevent thundering-herd contention against DynamoDB. The dailies write at 01:00 UTC; weeklies at 01:30; monthlies at 02:00. By the time the larger rollups fire, the smaller ones they depend on are already in place.
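Each rollup covers the period that just closed when its cron fires. A sketch of the window math for three of the levels — the week convention (seven days ending Saturday, matching the Sunday trigger) is an assumption, not stated in the brief:

```python
from datetime import date, timedelta

def completed_day(run_date: date) -> tuple[date, date]:
    """01:00 UTC daily rollup: aggregates the day that just ended."""
    end = run_date - timedelta(days=1)
    return end, end

def completed_week(run_date: date) -> tuple[date, date]:
    """Sunday 01:30 UTC weekly rollup: the seven days ending Saturday
    (week boundary convention assumed here)."""
    end = run_date - timedelta(days=1)
    return end - timedelta(days=6), end

def completed_month(run_date: date) -> tuple[date, date]:
    """1st-of-month 02:00 UTC monthly rollup: the month that just ended."""
    end = run_date - timedelta(days=1)
    return end.replace(day=1), end
```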
Pre-flight completeness, before it bites
This is one of the corners we got bitten by twice before getting right. Periodic rollups depend on daily aggregates being present. If any are missing — because a daily Lambda timed out, or a transient DynamoDB throttle hit at the wrong moment — naïve aggregation produces silently wrong results.
So before computing any periodic rollup (weekly, monthly, quarterly, yearly), the engine performs a pre-flight check:
periodic rollup triggered (e.g. Weekly at Sunday 01:30 UTC)
│
├─ count daily headers WHERE WindowStart BETWEEN [periodStart, periodEnd]
│
├─ all dailies present → proceed with rollup
│
├─ some missing, retry budget remaining → dispatch repair tasks for the
│ missing days, schedule a delayed retry of this rollup
│
└─ retry budget exhausted → write rollup with IsPartial = true,
log a warning for admin investigation
The repair path uses the same compute pipeline as a normal scheduled snapshot — no special branch in the engine. Self-healing as a property of the architecture, not a feature you have to remember to call.
The IsPartial flag on the snapshot header means downstream queries can choose how to present the data: render with a “partial” badge, gate alerts off it, or flat-out reject reads from partial snapshots if your domain demands it.
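The decision tree above reduces to three outcomes. A minimal sketch, with action names chosen for illustration:

```python
def preflight(expected_days: int, present_days: int,
              retries_left: int) -> str:
    """Pre-flight completeness check run before any periodic rollup:
    proceed only if every daily aggregate in the period is present."""
    if present_days >= expected_days:
        return "proceed"
    if retries_left > 0:
        # Dispatch repair snapshots for the missing days through the
        # normal compute pipeline, then retry this rollup later.
        return "dispatch_repairs_and_retry"
    # Budget exhausted: write the rollup anyway, flagged for inspection.
    return "write_partial"  # snapshot header gets IsPartial = true
```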
What we kept off this brief
There’s a lot more in the engine — the way we handle exact cardinality for distinct-count metrics across multi-day rollups (it’s not as simple as “store an HLL sketch”), the phased waterfall backfill orchestrator that lets you recompute up to 48 months of history without exceeding any Lambda concurrency budget, the request-level config-hash validation that catches stale snapshots after a definition change, the staggered EventBridge schedules that prevent any single DynamoDB partition from getting hammered.
Those are deeper dives for another post.
Cost shape that falls out
When idle: zero. No always-on cluster, no reserved capacity, no minimum spend. Lambda doesn’t run, DynamoDB doesn’t read, S3 doesn’t request.
When active, the cost scales with use:
| Volume | Approx monthly infra |
|---|---|
| 1M events / month | ~$7.50 |
| 10M events / month | ~$90 |
| 100M events / month | ~$1,000 |
For comparison: at 10M events/month, a typical SaaS analytics provider charges $800–$1,500/month. Self-hosted PostHog runs ~$150/month — but that’s $150/month even when nobody’s querying, whereas ours drops to single-digit dollars at idle.
The numbers aren’t the point. The shape is. Pay-per-query scales differently from pay-per-cluster. For workloads where the cluster sits idle most of the time, the difference compounds.
What this is and isn’t
It is: pre-aggregated metrics, real-time alerts, sub-100ms dashboards, your data in your AWS account, no third party in the data path.
It isn’t: an OLAP engine for ad-hoc exploration. If your primary workflow is marketing teams slicing billions of raw events for questions you haven’t predefined, ClickHouse-backed products are still the right tool. We answer surprise questions via backfill — define the metric, recompute the history, results are permanently fast thereafter. First answer takes minutes; every subsequent answer is instant.
If your team’s analytics workflow is “metrics as assets — defined once, kept forever, accessed in sub-100ms from dashboards and application logic” — this is the shape we built for.
The Docker image is at kanject/insights:latest. The architecture above runs in your account by the time the container finishes its first boot. We don’t see your data. We don’t run anything for you. We just ship the engine that you run.