Workstation AI - AI Workstations, GPU Infrastructure & Agent Solutions

This is the long-form deep-dive. For a quicker, skim-readable version, see the companion blog post.

Polyglot Benchmarks cover — right tool, right job

1. Introduction: habit stacks vs real workloads

Technology choices in mature organisations rarely fail because nobody read a benchmark blog post. They fail because the process for choosing is weak: a loud senior engineer’s preference, a hiring pipeline full of one language, or a single impressive demo at a conference. Meanwhile the product mixes latency-sensitive APIs, batch reconciliation, edge authentication in NGINX, internal CLIs, and the occasional serverless spike. One runtime cannot be optimal for all of those shapes.

This article explains Polyglot Benchmarks — a live comparison dashboard and the open workflow-examples/benchmarks harness behind it — as a decision framework, not a fanboy leaderboard. The goal is evidence you can attach to an architecture decision record (ADR): same endpoints, same load generator, same container layout, multiple languages.

I will not invent throughput numbers in prose. When you run the harness, the dashboard shows measured req/s and latency percentiles for that hardware and Docker setup. Treat any static ranking here as qualitative patterns aligned with the harness’s test families and the dashboard’s built-in verdict copy.

2. What is Polyglot Benchmarks?

Live benchmark results

Charts and summary from the live dashboard (Rust 5, Lua 1, Go 1 tests won on the reference run; wrk 10s / 4 threads / 100 connections).

Polyglot Benchmarks dashboard: RPS, average latency, P99, and TTFB across njs, Lua, Python, Go, Rust, and Bun

Summary table: REQ/S and P99 per test with winner column from polyglot-benchmarks.fictionally.org

Polyglot Benchmarks is hosted at polyglot-benchmarks.fictionally.org (Fictionally-branded infrastructure in the workflow-examples ecosystem). The landing headline compares:

NGINX njs — JavaScript modules inside stock NGINX
OpenResty Lua — Lua at the edge on OpenResty
Python FastAPI — Uvicorn ASGI app
Go net/http — standard library server
Rust Actix-web — async HTTP on Actix
Bun — JavaScript runtime with native HTTP server

The subtitle on the site frames the intent: performance for long polling / chat / queue / API workloads — connection-heavy, JSON-heavy, and routing-heavy paths rather than a single “hello world” chariot race.

2.1 What the dashboard shows

While a run progresses, the UI polls /data/results.json every two seconds and renders:

Requests/sec (throughput) — aggregated from wrk per language per test
Average latency and P99 tail latency — in milliseconds from wrk’s latency model
TTFB (time to first byte) — from curl timing JSON alongside wrk
Summary table — per-test winners with ★ markers on best cells
Detailed test cards — bar charts and metric tables (P50, P90, P99.9, errors)
Latency percentile distribution — averaged across tests
Verdict — best for your use case — narrative once status is complete

That structure is deliberately ARB-friendly: charts for slides, tables for spreadsheets, narrative for the risk section of an ADR.

3. The workflow-examples harness

Everything reproducible lives under benchmarks/ in the workflow-examples repository. Top-level layout:

benchmarks/
  bench.sh              # orchestrates tests, writes results.json
  wrk_json.lua          # wrk done() → JSON summary (RPS, percentiles)
  docker-compose.yml    # six app services + dashboard + bench runner
  njs/                  # NGINX + njs module
  lua/                  # OpenResty nginx.conf
  python/               # FastAPI + Dockerfile
  golang/               # net/http main.go
  rust/                 # Actix-web + Cargo
  bun/                  # Bun server.ts
  dashboard/            # static index.html + nginx.conf

3.1 Docker Compose topology

docker-compose.yml defines isolated services on a shared network:

nginx-njs — port 8081, configs from ./njs
openresty-lua — port 8082
python-fastapi — port 8084, built from ./python
go-server — port 8085
rust-actix — port 8086
bun-server — port 8087
dashboard — port 8083, serves HTML and mounts the results volume at /data
bench — Alpine container that installs curl, wrk, jq, waits for dependencies, runs bench.sh

The bench runner shares the results volume with the dashboard so results appear live without manual copy.

3.2 bench.sh methodology

The shell driver encodes fairness rules explicitly:

Duration: 10s per endpoint per language
Threads: 4
Connections: 100 concurrent
Warmup: 50 rounds hitting /hello on every server before timed tests
Per server: curl timing JSON (dns, connect, ttfb, total) plus wrk with wrk_json.lua

Seven tests are pipe-delimited in the script:

ID	Name	Intent	Endpoint
1	Plain Text	Baseline — minimal response, framework overhead	`/hello`
2	JSON Serialization	Build and serialize 100-item JSON array	`/json`
3	CPU-Bound (fib 30)	Recursive fibonacci(30) — pure CPU	`/cpu?n=30`
4	String Manipulation	Build, split, uppercase, rejoin 1000 segments	`/string`
5	Request Inspection	Headers, method, args → JSON	`/request_info?foo=bar&baz=123`
6	Subrequest + Transform	Internal subrequest, parse JSON, transform	`/subrequest`
7	Routing Logic	Conditional routing on query params	`/route?action=greet&name=bench`

Each language implements the same route surface (see golang/main.go, python/main.py, NGINX configs, etc.) so differences reflect runtime and framework, not mismatched specs.

4. Why “polyglot” matters (without fanaticism)

Polyglot architecture does not mean every engineer learns six languages. It means bounded contexts get a runtime that fits:

Edge plane — auth, rate limits, routing: Lua or njs inside NGINX you already operate
Core API plane — business logic with SLOs: Go or Rust
Data plane — ETL, notebooks, ML glue: Python
Real-time JS plane — WebSockets, shared types with frontend: Bun or Node where velocity wins

Uniform stacks optimise HR and procurement. Polyglot benchmarks optimise fit per workload with measurable trade-offs on the table.

5. Dimensions that matter

Raw req/s is the metric that travels fastest on Slack. It is also the easiest to misread. Use a multi-column rubric:

Criterion	Dominates when…	Harness signal
p50 / p99 latency	User-facing APIs, chat, long poll	wrk latency percentiles; dashboard P99 chart
Throughput (RPS)	High fan-out gateways, batch aggregators	wrk RPS; summary winner column
Memory per connection	Thousands of idle WebSockets	Qualitative + production profiling (not fully in harness)
TTFB	CDN miss, cold paths, mobile networks	curl `time_starttransfer` in results JSON
LOC / complexity	Startups, regulated change control	Compare implementations across folders
Build & CI time	Frequent deploys, many services	Docker build profiles per language
Ops burden	Small platform team	Existing NGINX skills → Lua/njs; else containers
Hosting cost	Always-on vs scale-to-zero	Derived from memory + CPU under load (production)

6. Decision matrix

The diagram below is the qualitative map we use in reviews — illustrative leaders, not a substitute for running the harness on your kit.

Workload vs criteria decision matrix — winner varies by row

Read it row-first: pick your workload shape, then weight columns. If two criteria conflict (e.g. best RPS vs lowest LOC), that tension is the ADR discussion worth having.

7. Case patterns (example labelling)

The following maps harness test families to architectural choices. Leaders on your machine appear on the dashboard after a run — do not treat this section as fixed scores.

7.1 API latency-sensitive (Tests 1–2, 5–7)

JSON APIs and routing-heavy handlers mirror Tests 2, 5, and 7. Compiled async runtimes (Rust Actix, Go) typically lead on throughput and tail latency in this class of synthetic HTTP benchmarks; Bun often sits close on JSON-heavy paths with JS ergonomics. Python FastAPI trades raw RPS for development speed — the dashboard verdict calls this out explicitly.

7.2 CPU-bound micro-work (Test 3)

Fibonacci(30) isolates CPU without I/O. Expect Rust and Go to shine; interpreted stacks pay overhead. If your real service is I/O bound, do not over-weight this row — it is a stress test for compute-heavy middleware, not typical CRUD.

7.3 Edge transform (Tests 6–7, NGINX variants)

Subrequest and routing tests are where OpenResty Lua and NGINX njs belong: zero extra hop, shared worker memory, ops teams already manage NGINX. The harness’s verdict recommends Go/Rust for core chat/queue backends and Lua/njs at the edge — a sensible split many enterprises already run informally.

7.4 Batch ETL and data pipeline

The HTTP harness does not emulate Spark or warehouse loads. For batch, teams usually prioritise Python ecosystem and orchestration (Airflow, Dagster) or Go workers for throughput. Use Polyglot Benchmarks for the API surfaces and control planes those jobs expose, not for the batch engine itself.

7.5 Quick internal tool

Internal admin APIs with low QPS and high change rate: Python or Bun/Node often win on velocity and hiring. Run Test 2 (JSON) and Test 5 (inspection) to ensure overhead is acceptable; if p99 spikes under 100 connections, reconsider before promoting to customer-facing tiers.

8. Workflow diagram

Polyglot benchmarks workflow swim lanes

The reproducible loop: specify the workload → run the Compose harness → publish results to the Fictionally dashboard (or your internal clone) → record an ADR with config metadata → re-run when runtimes or hardware change.

9. How to run your own comparison

Clone: git clone https://github.com/bwalia/workflow-examples.git && cd workflow-examples/benchmarks
Start stack: docker compose up --build — builds all language images, starts dashboard on port 8083, runs bench automatically.
Watch: open http://localhost:8083 (or the public site when a hosted run is active).
Extend: add a folder myruntime/, mirror endpoint paths, register a service in docker-compose.yml, append the server to SERVERS in bench.sh.
Customise load: edit DURATION, THREADS, CONNECTIONS at the top of bench.sh — document changes in your ADR.
Add production-shaped tests: e.g. auth middleware, ORM query, 50KB payload — keep parity across languages.

For CI, run Compose in a nightly workflow, archive results.json as an artifact, and fail only on regression thresholds you define (e.g. p99 +20% week over week on Test 2).

10. Anti-patterns

Picking the winner of Test 1 only — plain text favours minimal stacks; it ignores JSON, CPU, and routing pain.
Ignoring team skills — Rust in production without mentors is expensive in incident time.
Ignoring hosting cost — 2× RPS is useless if memory per pod doubles your bill.
Mandating one language org-wide — destroys legitimate edge vs core splits.
Replacing production profiling — synthetic HTTP ≠ your ORM, cache, or regional latency.
Comparing unlike hardware — laptop Docker ≠ bare metal ≠ K8s limits; note the class in the ADR.

11. Benefits for engineering leadership

ARB packs — export dashboard screenshots + results.json + Compose hash.
Vendor-neutral evidence — no sales deck from a single cloud or framework vendor.
Onboarding clarity — new hires see *why* service A is Go and service B is Python.
Risk reduction — pilot the runner-up in a shadow service before rewrites.
FinOps conversations — link tail latency to autoscaling and memory headroom.
Platform roadmap — justify NGINX skill investment vs another K8s microservice.

12. Limitations

Synthetic workloads — seven HTTP tests cannot represent Kafka consumers, GPU jobs, or complex ORM graphs.
Docker networking — adds overhead; absolute numbers differ from bare metal.
Single machine class — hosted results reflect whatever hardware Fictionally uses; your numbers will differ.
No persistence layer — database and cache effects are absent by design.
Short run duration — 10s per test may not expose GC pauses or warmup cliffs; extend for stress campaigns.

Keep continuous profiling (eBPF, APM, load tests against staging) as the source of truth for production. Polyglot Benchmarks narrows the language and framework debate with reproducible baselines.

13. Conclusion

Default-stack bias is comfortable; it is also how teams ship the wrong runtime for the job. Polyglot Benchmarks and the workflow-examples harness turn “which language wins?” into “which language wins for this workload row?” — with charts your ARB can cite.

Fork the repo, add the endpoints that look like your hot paths, run Compose, and attach the results to the next architecture decision. The companion blog post is the five-minute version for sharing; this article is the reference.

Published by Workstation (workstation.co.uk). Benchmark dashboard hosted at polyglot-benchmarks.fictionally.org in the Fictionally examples ecosystem.

SEO snapshot for this article

SEO title: Polyglot Benchmarks: Right Tool for the Job
Meta description: Compare njs, Lua, Python, Go, Rust, and Bun on the same HTTP workloads. Reproducible harness, live dashboard, decision matrix for architects.
Primary keywords: polyglot benchmarks, language comparison, architecture decisions, workflow examples, performance benchmark, Rust vs Go, FastAPI benchmark
Twitter / X (under 280 chars): Polyglot Benchmarks: same 7 HTTP tests across njs, Lua, Python, Go, Rust, Bun — live dashboard + reproducible workflow-examples harness. Not one winner; a decision framework for architects. #polyglot #performance #architecture
LinkedIn post: We published a deep dive on Polyglot Benchmarks — how to stop picking one org-wide stack from habit and start matching runtimes to workloads. Six languages, seven comparable tests, Docker Compose + wrk, live results at polyglot-benchmarks.fictionally.org, source at github.com/bwalia/workflow-examples/tree/main/benchmarks. Includes decision matrix, anti-patterns, and ADR guidance. From Workstation engineering.