
1. Physical Infrastructure: Cables, Data Centres & Radio

1.1 The Internet’s Physical Layer

The Internet is a physical object. As of the mid-2020s, well over 1 million kilometres of submarine fibre-optic cables carry the overwhelming majority of intercontinental Internet traffic. Each cable uses dense wavelength-division multiplexing (DWDM) to carry dozens of wavelength channels per fibre, each at 100–400 Gbps, with erbium-doped fibre amplifiers (EDFAs) in repeaters spaced roughly every 80 km. The 2Africa cable (at ~45,000 km, the longest submarine cable ever built) connects 33 countries across Africa, Europe, and Asia, with a design capacity of ~180 Tbps across 16 fibre pairs; its core system entered service in late 2025.

1.2 Data Centres

A hyperscale data centre is a thermodynamic and logistical challenge as much as a computing one. Google, Microsoft, Meta, and Amazon each operate many dozens of facilities worldwide, consuming tens of gigawatts of power collectively. Power Usage Effectiveness (PUE) — the ratio of total facility power to IT equipment power — has fallen dramatically from industry averages well above 2.0 in the late 2000s to roughly 1.1 at leading hyperscale sites today (Google reported a global fleet-wide PUE of ~1.09 in 2024), driven by liquid cooling, custom hardware, and workload scheduling.
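The PUE ratio is simple arithmetic, and a short sketch makes the size of the improvement concrete. The function names and meter readings below are illustrative assumptions, not figures from any operator's report.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_fraction(pue_value: float) -> float:
    """Share of total facility power spent on non-IT overhead (cooling, power conversion)."""
    return 1.0 - 1.0 / pue_value


# Hypothetical readings: two sites, each running 10 MW of IT load.
legacy = pue(20_000, 10_000)   # late-2000s style facility
modern = pue(10_900, 10_000)   # efficient hyperscale facility
print(f"legacy: PUE={legacy:.2f}, overhead={overhead_fraction(legacy):.0%}")
print(f"modern: PUE={modern:.2f}, overhead={overhead_fraction(modern):.0%}")
```

At a PUE of 2.0 half of all power goes to overhead; at 1.09 the overhead share is under a tenth.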

1.3 Wireless: From 4G to 5G to Wi-Fi 7

Mobile broadband (LTE/4G) uses OFDMA on the downlink and SC-FDMA on the uplink over licensed spectrum, typically delivering ~50–150 Mbps. 5G New Radio (3GPP Release 15, 2018) adds millimetre-wave bands (FR2, 24.25–52.6 GHz in Release 15) for multi-gigabit throughput and targets air-interface latency of around 1 ms for ultra-reliable low-latency traffic, which matters for autonomous vehicles and real-time industrial IoT. Wi-Fi 7 (IEEE 802.11be, 2024) brings comparable peak rates to unlicensed spectrum, reaching the commonly quoted theoretical peak of 46 Gbps with 320 MHz channels and 4096-QAM modulation (a figure that assumes 16 spatial streams).
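The 46 Gbps headline figure can be reproduced from OFDM parameters. The sketch below assumes values that are not stated in the text above: 3,920 data subcarriers in a 320 MHz channel, rate-5/6 coding, 16 spatial streams, and a 12.8 µs symbol with a 0.8 µs guard interval. Treat it as a back-of-the-envelope check rather than a statement of what shipping hardware achieves.

```python
def ofdm_phy_rate_bps(data_subcarriers, bits_per_symbol, coding_rate,
                      spatial_streams, symbol_s, guard_interval_s):
    """Peak PHY rate: coded data bits per OFDM symbol divided by symbol duration."""
    bits_per_ofdm_symbol = (data_subcarriers * bits_per_symbol
                            * coding_rate * spatial_streams)
    return bits_per_ofdm_symbol / (symbol_s + guard_interval_s)


wifi7_peak = ofdm_phy_rate_bps(
    data_subcarriers=3920,     # assumed for a 320 MHz channel (4 x 980 data tones)
    bits_per_symbol=12,        # 4096-QAM: 2**12 constellation points
    coding_rate=5 / 6,         # assumed highest coding rate
    spatial_streams=16,        # assumed; fewer streams scale the result down linearly
    symbol_s=12.8e-6,          # assumed OFDM symbol duration
    guard_interval_s=0.8e-6,   # assumed shortest guard interval
)
print(f"{wifi7_peak / 1e9:.1f} Gbps")  # about 46.1 Gbps
```

Because the rate is linear in the stream count, the same parameters with 8 spatial streams give roughly half the figure.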

2. Cloud Computing & Virtualisation

Cloud computing formalised the utility model of computing: pay for what you use, provision in seconds, release immediately. NIST’s 2011 definition[29] characterises cloud by five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

2.1 Virtualisation

IBM’s VM/370 (announced 1972, building on the earlier CP/CMS research system) brought commercial virtual machine technology to the mainstream. Modern hypervisors are either Type 1 (bare metal: VMware ESXi, KVM, Hyper-V) or Type 2 (hosted: VirtualBox, Parallels). The Intel VT-x (2005) and AMD-V (2006) hardware virtualisation extensions enabled near-native VM performance by moving privilege transitions into hardware.

Containers (popularised by Docker, 2013) virtualise the OS rather than the hardware: they share the host kernel, with Linux namespaces isolating filesystems, network stacks, and process trees, and cgroups limiting CPU, memory, and I/O. Container images (stacks of layered filesystem archives in OCI format, typically mounted through a union filesystem such as OverlayFS) are the dominant packaging unit for cloud-native software.

2.2 Kubernetes & Orchestration

Kubernetes (open-sourced by Google in 2014 and informed by its internal Borg system[30]) automates container scheduling, scaling, service discovery, and rolling updates across clusters of thousands of nodes. The control plane (etcd, API server, scheduler, controller manager) and the data plane (kubelet and kube-proxy on each node) together form the backbone of cloud-native application deployment globally.

2.3 Serverless & FaaS

Function-as-a-Service (FaaS), offered by AWS Lambda (2014), Google Cloud Functions, and Cloudflare Workers, abstracts away server management entirely. A function is uploaded as a package, invoked per HTTP request or event, and billed by execution time in fine-grained increments (AWS Lambda originally rounded up to 100 ms and has billed per millisecond since December 2020). Cloudflare Workers runs code in V8 isolates (V8 being the JavaScript engine from Chrome) rather than containers, which keeps cold starts in the single-digit millisecond range.
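To see why billing granularity matters for short functions, here is a minimal sketch of duration-based billing. The function names and the price are hypothetical; real providers publish their own rates and may add per-request charges.

```python
import math


def billed_ms(duration_ms: float, granularity_ms: int) -> int:
    """Round an invocation's duration up to the next billing increment."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms


def invocation_cost(duration_ms: float, memory_gb: float,
                    price_per_gb_second: float, granularity_ms: int) -> float:
    """Compute charge for one invocation under a GB-second pricing model."""
    seconds = billed_ms(duration_ms, granularity_ms) / 1000
    return seconds * memory_gb * price_per_gb_second


PRICE = 0.00002  # hypothetical price per GB-second, for illustration only

# A 12 ms function with 0.5 GB of memory under two billing granularities.
coarse = invocation_cost(12, 0.5, PRICE, granularity_ms=100)
fine = invocation_cost(12, 0.5, PRICE, granularity_ms=1)
print(f"100 ms increments: {coarse:.2e}  |  1 ms increments: {fine:.2e}")
```

With coarse increments a 12 ms invocation pays for 100 ms, so very short functions benefit most from finer-grained billing.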

3. Edge Computing & IoT

The Internet of Things connects an estimated 18–19 billion devices globally by the end of 2024 (IoT Analytics, State of IoT 2024) — sensors, cameras, industrial controllers, vehicles — generating data faster than it can be economically sent to centralised clouds. Edge computing moves compute to the data source: a factory floor server, a cell tower, a vehicle’s onboard computer.

The MQTT protocol (ISO/IEC 20922, 2016) — a lightweight publish/subscribe protocol over TCP — is the dominant IoT messaging standard, designed for constrained devices with limited bandwidth. Edge inference (running ML models locally) is driven by frameworks like TensorFlow Lite and ONNX Runtime, which quantise float32 models to int8 for deployment on ARM Cortex-M or RISC-V microcontrollers.
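The float32-to-int8 step can be illustrated with plain affine quantisation. This is a minimal sketch under simple assumptions (one scale and zero point for the whole tensor, range taken from the data itself); it is not the calibration procedure of TensorFlow Lite or ONNX Runtime, and the function names are made up for the example.

```python
def quantise_int8(values):
    """Map floats to int8 with one scale and zero point for the whole list."""
    lo = min(min(values), 0.0)           # the range must contain 0 so that
    hi = max(max(values), 0.0)           # zero is exactly representable
    scale = (hi - lo) / 255.0 or 1.0     # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantise(q, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zp = quantise_int8(weights)
restored = dequantise(q, scale, zp)
print(q, [round(r, 3) for r in restored])
```

Each weight now occupies one byte instead of four, at the cost of a reconstruction error on the order of one quantisation step.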

4. Machine Learning: Historical Arc

Machine learning is the study of algorithms that improve with experience (Mitchell, 1997). Its history is one of alternating booms and winters, driven by the interplay of algorithms, data, and compute.

| Era | Key Development | Citation |
|---|---|---|
| 1943 | McCulloch & Pitts artificial neuron: a logical model of biological neurons | [31] |
| 1958 | Rosenblatt’s perceptron: an early learning algorithm for artificial neurons | [32] |
| 1969 | Minsky & Papert show that single-layer perceptrons cannot represent XOR, contributing to the first AI winter | |
| 1986 | Rumelhart, Hinton & Williams: backpropagation popularised | [33] |
| 1995 | Cortes & Vapnik’s support-vector networks (SVMs): kernel methods dominate the 2000s | |
| 1998 | LeCun’s LeNet: convolutional networks for digit recognition | [34] |
| 2012 | AlexNet wins ImageNet: GPU-accelerated deep CNNs revive the field | [35] |
| 2017 | Transformer architecture (Vaswani et al.): foundation of modern NLP | [36] |
| 2020 | GPT-3 (Brown et al.): few-shot learning emerges in large models | [37] |
| 2022–24 | ChatGPT, Gemini, Claude, LLaMA: LLMs enter mainstream deployment | |

5. Deep Learning Revolution

LeCun, Bengio, and Hinton’s 2015 Nature review[38] marked the field’s mainstream arrival. Deep learning is characterised by end-to-end learning of representations directly from raw data, in contrast to earlier ML workflows that required hand-crafted features.

5.1 Convolutional Neural Networks (CNNs)

CNNs exploit the spatial structure of images via learnable filters applied across the input (convolutional layers), followed by subsampling (pooling) and fully connected layers. The AlexNet result (Krizhevsky, Sutskever, Hinton, 2012)[35] — a 15.3% top-5 error rate vs the 26.2% runner-up — was the inflection point that triggered the current era of deep learning.
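The convolution and pooling steps can be written out in a few lines of plain Python. This is a minimal single-channel sketch with no padding and stride 1; real layers add channels, biases, learned kernels, and a non-linearity, and the edge-detecting kernel here is hand-picked for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d(image, vertical_edge)
print(feature_map)              # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(max_pool2d(feature_map))  # [[2]]
```

The same small kernel is reused at every position, which is why a convolutional layer needs far fewer parameters than a fully connected layer over the same input.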

5.2 Recurrent Networks & LSTM

Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that is updated at each step. The Long Short-Term Memory (LSTM) cell (Hochreiter & Schmidhuber, 1997) mitigates the vanishing gradient problem of vanilla RNNs with a gated memory cell, enabling modelling of long-range dependencies in text and speech.
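The gating can be written compactly. The equations below follow the formulation commonly used today, which includes a forget gate that was a later addition to the 1997 design; the weight names W, U, and b are notation assumed for this sketch.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```

The cell state update is additive, so gradients can flow through c_t across many steps without being repeatedly squashed, which is what eases the vanishing gradient problem.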

5.3 The Hardware Substrate: GPUs & TPUs

Deep learning training is dominated by matrix multiplication, the core operation of both fully connected layers and attention mechanisms (dot-product attention). NVIDIA’s CUDA platform (2007) made it practical to express these computations on massively parallel GPU hardware. Google’s first-generation Tensor Processing Unit (TPU, announced 2016)[39] is a custom ASIC optimised for the 8-bit integer matrix multiplications used in inference; Google reported roughly 30–80× better performance per watt than contemporary CPUs and GPUs on its production inference workloads. Later TPU generations added floating-point support for training.

6. Transformers & Large Language Models

6.1 The Transformer Architecture

Vaswani et al.’s 2017 paper, “Attention Is All You Need”[36], replaced recurrence with a pure attention mechanism. Scaled dot-product attention computes, for query, key, and value matrices Q, K, and V, with dₖ the dimension of each key vector:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

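Multi-head attention runs h attention heads in parallel and concatenates their outputs, letting each head learn different relationships between positions. Self-attention lets every token attend to every other token, at a cost of O(n²) time and memory in the sequence length n. This is the key limitation that efficient attention variants work to address: FlashAttention computes exact attention without materialising the full n × n score matrix in GPU memory, while linear attention approximates it at sub-quadratic cost. Positional encoding schemes such as RoPE and ALiBi address a separate problem, namely how well models handle long or previously unseen sequence lengths.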
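The attention formula translates almost line for line into code. The sketch below is a single-head, unbatched version in plain Python, written for readability rather than speed; production implementations use fused GPU kernels.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key: this inner loop is the source of the n x n cost.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # two identical keys, so weights are 0.5 each
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))      # [[2.0, 3.0]]
```

With n queries and n keys the score computation runs n × n times, which is the quadratic cost discussed above.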

6.2 Scaling Laws

Kaplan et al. (2020)[40] empirically established that transformer language model loss follows power-law scaling with respect to model parameters (N), training tokens (D), and compute (C), with trends that hold across more than seven orders of magnitude. Chinchilla scaling (Hoffmann et al., 2022)[41] refined the optimal compute allocation, finding that parameters and training tokens should grow in roughly equal proportion and that prior large models were significantly undertrained relative to their parameter count.
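A rough feel for compute-optimal allocation comes from two widely used rules of thumb, both assumptions in this sketch rather than exact results: training compute C ≈ 6·N·D FLOPs, and a Chinchilla-style ratio of about 20 training tokens per parameter.

```python
import math

TOKENS_PER_PARAM = 20        # assumed heuristic ratio D / N
FLOPS_PER_PARAM_TOKEN = 6    # assumed approximation C = 6 * N * D


def compute_optimal(compute_flops: float):
    """Split a FLOP budget into (parameters, tokens) under the two heuristics."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Roughly the budget of a 70B-parameter model trained on 1.4T tokens.
n, d = compute_optimal(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count.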

6.3 RLHF & Instruction Tuning

Pre-trained language models are raw next-token predictors. Ouyang et al. (2022, InstructGPT)[42] popularised Reinforcement Learning from Human Feedback (RLHF) for language models as a three-stage recipe: first fine-tune the model on demonstrations of desired behaviour (supervised fine-tuning, SFT), then train a reward model on human preference comparisons, then optimise the language model against the reward model using Proximal Policy Optimisation (PPO). This process converts a capable but hard-to-steer language model into a usable AI assistant.
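The reward-model stage is trained on pairs of responses where a human preferred one over the other. A minimal sketch of the pairwise loss typically used for this, -log σ(r_chosen - r_rejected), is shown below; the function name and the scalar rewards are illustrative, and a real reward model would produce the rewards from a neural network.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-diff)), arranged to avoid overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(preference_loss(2.0, 0.0))   # small: the ranking agrees with the human label
print(preference_loss(0.0, 0.0))   # log(2): the model cannot tell them apart
print(preference_loss(0.0, 2.0))   # large: the ranking contradicts the label
```

Minimising this loss pushes the reward of the preferred response above that of the rejected one; only the difference between the two rewards matters.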

6.4 Current Frontier Models

| Model | Organisation | Key Contribution |
|---|---|---|
| GPT-4 (2023) | OpenAI | Multimodal, RLHF at scale, strong reasoning |
| Gemini 1.5 (2024) | Google DeepMind | 1M+ token context window, MoE architecture |
| Claude 3.x (2024) | Anthropic | Constitutional AI, long-context, safety focus |
| LLaMA 3 (2024) | Meta AI | Open weights, competitive at 70B+ parameter scale |
| DeepSeek-V3 (2024) | DeepSeek | Mixture-of-experts (MoE) efficiency at low training cost |

7. MLOps & the AI Production Stack

Taking an ML model from research notebook to production system requires a discipline sometimes called MLOps — the application of DevOps principles to machine learning workflows. Sculley et al. (2015)[43] described the “hidden technical debt in machine learning systems”: ML code is typically a small fraction of a production ML system; the majority is data pipelines, feature engineering, configuration management, and monitoring.

7.1 The ML Pipeline

  1. Data ingestion & versioning — DVC, LakeFS, Delta Lake
  2. Feature engineering — Feast (feature store), Tecton
  3. Training orchestration — Kubeflow, MLflow, Weights & Biases
  4. Model registry — MLflow Models, Hugging Face Hub
  5. Serving & inference — TorchServe, Triton Inference Server, vLLM
  6. Monitoring — Evidently AI, Arize, Grafana + custom metrics

7.2 LLM Inference Infrastructure

Serving large transformer models at scale requires techniques beyond standard ML serving. Continuous batching (Orca, 2022) schedules work at the level of individual decoding iterations, so finished requests leave the batch and new ones join without waiting for the longest sequence; by some estimates this raises GPU utilisation from ~30% to ~70%. PagedAttention (Kwon et al., 2023, vLLM) adapts OS virtual-memory paging to the KV cache, storing it in fixed-size blocks that need not be contiguous. This reduces memory waste, and the vLLM authors report up to 24× higher throughput than Hugging Face Transformers.
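The paging idea can be shown with a toy allocator. This is a sketch of the bookkeeping only, under the assumption of a fixed pool of equally sized blocks; the class and method names are invented for the example and do not mirror vLLM's actual code.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps token slots to fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:   # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]   # needs two blocks
slot_b = cache.append_token("b")                        # needs one block
print(slots_a, slot_b, len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of a buffer sized for the maximum possible length.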

8. Security in the AI Era

AI systems introduce novel attack surfaces alongside classical vulnerabilities:

  • Prompt injection — adversarial inputs causing LLMs to deviate from intended behaviour (Greshake et al., 2023)
  • Training data poisoning — inserting malicious examples to corrupt model behaviour at specific inputs (backdoor attacks)
  • Model extraction — querying a model to reconstruct its weights or decision boundary
  • Adversarial examples — imperceptible input perturbations that cause misclassification (Goodfellow et al., 2014)[44]
  • Membership inference — determining whether a specific data point was in the training set (privacy leak)
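Of the attacks listed above, adversarial examples are the easiest to demonstrate in a few lines. The sketch below applies the fast gradient sign method to a tiny logistic-regression classifier; the weights, input, and the deliberately large epsilon are toy values chosen so that the effect is visible, not a realistic attack setting.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x) -> float:
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, epsilon):
    """Move each input feature by epsilon in the direction that increases the loss."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]   # gradient of cross-entropy with respect to x
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0     # toy model parameters
x, y = [0.5, 0.2], 1        # an input the model classifies correctly

x_adv = fgsm(w, b, x, y, epsilon=0.5)
print(f"clean p(class 1) = {predict(w, b, x):.2f}")
print(f"adversarial p(class 1) = {predict(w, b, x_adv):.2f}")
```

The perturbation is bounded by epsilon per feature yet is aligned with the loss gradient, which is why a small, structured change can flip the prediction.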

The field of AI safety studies both near-term (misuse, misalignment with user intent) and long-term (value alignment, superintelligence) risks. Anthropic’s Constitutional AI approach[45] attempts to encode safety principles directly into model training via a process of self-critique and revision.

9. How It All Connects: The Full Stack

The complete path from a user’s query to an AI-generated response illustrates how every layer described in this reference site participates:

  1. DNS resolution → the hostname resolves to an anycast address, so packets are routed to the nearest edge PoP (CS: networking, distributed systems)
  2. User device → HTTP/3 request over QUIC/UDP, encrypted with TLS 1.3 (CS: networking, cryptography)
  3. Edge proxy → Cloudflare Pingora (written in Rust), terminates TLS, rate-limits, routes (Rust)
  4. Load balancer → distributes to a Kubernetes pod in the nearest region (CS: distributed systems, cloud)
  5. Inference server → Python FastAPI application calls vLLM (AI: LLM serving)
  6. vLLM → schedules the request onto GPU batch, manages KV cache with PagedAttention (AI: inference infrastructure)
  7. GPU kernel → FlashAttention CUDA kernel performs the transformer forward pass (CS: computer architecture, parallel computing)
  8. Token streaming → Server-Sent Events sent back over the open HTTP connection, emitted token by token from async Python (Python: asyncio)
  9. Client → JavaScript renders the token stream in the browser
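The token-streaming step in the list above can be sketched with the standard library alone. The generator below stands in for an inference engine and simply splits the prompt into words; the function names and the end-of-stream sentinel are assumptions for illustration, and a real service would send these frames through a web framework's streaming response.

```python
import asyncio


async def generate_tokens(prompt: str):
    """Stand-in for an inference engine: yields one token at a time."""
    for token in prompt.split():
        await asyncio.sleep(0)   # yield control, as a real engine would while waiting
        yield token


async def sse_stream(prompt: str):
    """Wrap each token in a Server-Sent Events frame: a data line plus a blank line."""
    async for token in generate_tokens(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"     # sentinel assumed by this sketch, not part of SSE


async def collect(prompt: str):
    return [frame async for frame in sse_stream(prompt)]


frames = asyncio.run(collect("hello from the edge"))
print(frames)
```

Each frame is flushed as soon as its token exists, so the browser can render text while later tokens are still being generated.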

Every discipline in this reference site — algorithms, operating systems, networking, Python, Rust, distributed systems, and machine learning — touches this single user interaction. This is why a broad understanding of computer science is not merely academic: the performance, security, cost, and reliability of AI products depend on expertise at every layer of this stack.