The history of AI is a story about math that already existed, people stubborn enough to believe in it, and a few decades of everyone else telling them they were wrong. Most of the math powering today's large language models comes from the 17th and 18th centuries: calculus, linear algebra, probability theory. The machines caught up to the math, not the other way around.

I picked up [Why Machines Learn][why-machines-learn] by Anil Ananthaswamy because I wanted to understand what actually happened, not the hype version. The book traces the mathematical lineage from early pattern recognition through deep learning with the kind of rigor and storytelling that made me rethink how I understood the whole field. Here's that history, condensed.

## The dream: machines that learn from data (1940s-1960s)

It starts in 1943. Warren McCulloch and Walter Pitts published a mathematical model of the neuron: a simple on-or-off switch mimicking how biological neurons fire. A model, not a machine, but it planted the idea: what if you could wire enough of these together to do something useful?

In 1956, a group of researchers gathered at Dartmouth College for a summer workshop. John McCarthy coined the term "artificial intelligence." The ambition was enormous. They believed every aspect of learning could be described precisely enough for a machine to simulate it. They gave themselves one summer. The problem turned out to be harder than that.

The first real learning algorithm came from Frank Rosenblatt at Cornell in 1957. The **perceptron** could look at data and adjust its own weights to classify patterns. Nobody hand-coded the rules. The machine learned them. Rosenblatt demonstrated it could find a line separating two classes of data, and he proved mathematically that if such a line exists, the algorithm will find it in finite time.

A perceptron is surprisingly simple. Each input gets a weight, the weighted sum passes through a threshold, and the output is a classification.
When the prediction is wrong, adjust the weights toward the correct answer:

```pseudocode
PERCEPTRON(inputs, weights, bias, expected):
    total = 0
    for each inputs[i]:
        total = total + inputs[i] * weights[i]
    total = total + bias

    if total > 0:
        prediction = 1
    else:
        prediction = 0

    error = expected - prediction
    for each weights[i]:
        weights[i] = weights[i] + error * inputs[i]

    return prediction
```

Around the same time, Bernard Widrow and Ted Hoff at Stanford developed the **ADALINE** (Adaptive Linear Neuron) and the Least Mean Squares algorithm, introducing **gradient descent**: slide down a slope to find the minimum error. Picture a marble rolling to the bottom of a bowl. The marble is the algorithm. The bowl is the error landscape. The bottom is where your model stops being wrong.

```pseudocode
GRADIENT_DESCENT(data, learning_rate, epochs):
    weight = random small value
    bias = random small value

    repeat for each epoch:
        for each (input, expected) in data:
            prediction = weight * input + bias
            error = prediction - expected
            weight = weight - learning_rate * error * input
            bias = bias - learning_rate * error

    return weight, bias
```

The learning rate controls how big each step is. Too large and you overshoot the bottom of the bowl. Too small and you crawl there over geological time.

This era had an infectious optimism. Machines learning from data felt like magic, and the math backing it up was real.

## The fall: one book nearly killed the field (1969-1980s)

In 1969, Marvin Minsky and Seymour Papert published *Perceptrons*. The book proved mathematically that single-layer perceptrons could never solve certain problems, notably XOR (exclusive or). A single-layer network can only draw straight lines through data. Some problems need curves.

XOR returns 1 when exactly one input is 1, and 0 otherwise.
Try drawing a single straight line that separates the 1s from the 0s:

```pseudocode
XOR truth table:
    input A = 0, input B = 0  →  output = 0
    input A = 0, input B = 1  →  output = 1
    input A = 1, input B = 0  →  output = 1
    input A = 1, input B = 1  →  output = 0

A single perceptron computes:
    output = step(weight_a * A + weight_b * B + bias)
```

No values of weight_a, weight_b, and bias can produce the correct output for all four cases. The 1s sit on opposite corners. No single line divides them.

That was a legitimate finding. But Minsky and Papert went further, implying (without proving) that multi-layer networks would likely fail too. The academic community took the hint. Funding dried up. Researchers moved on. This period became known as the **AI winter**.

For roughly fifteen years, neural network research was a backwater. A few stubborn researchers kept working, notably Geoffrey Hinton, but the mainstream had moved on. One well-placed critique (partially correct, partially speculative) set back an entire field by a generation. The math that would eventually prove Minsky wrong already existed. Nobody had assembled it yet.

_Remember kids: just because [someone tells everyone they're right][technology-know-it-alls] doesn't mean they are._

## Meanwhile, other math was brewing (18th century to 1990s)

While neural networks froze, other mathematical traditions developed their own approaches to machine learning.

**Bayes's Theorem**, published posthumously in 1763, gave a framework for updating beliefs based on evidence. Thomas Bayes was an English statistician whose work sat largely unnoticed for decades. By the 20th century, Bayesian methods became the backbone of spam filters, medical diagnostics, and authorship attribution. Mosteller and Wallace famously used Bayesian analysis to determine who wrote the disputed Federalist Papers.
```pseudocode
NAIVE_BAYES_SPAM_FILTER(email):
    for each class in [spam, not_spam]:
        score[class] = log(prior_probability(class))
        for each word in email:
            score[class] = score[class] + log(probability(word | class))

    if score[spam] > score[not_spam]:
        return "spam"
    else:
        return "not spam"
```

Every word nudges the score. "Viagra" pushes toward spam. "Meeting" pushes toward not spam. Bayes gives you a principled way to combine all those nudges into a single decision.

**K-Nearest Neighbors (KNN)** took a different approach entirely: classify something by looking at what's nearby. [John Snow's 1855 cholera map][cholera-map], where he traced an outbreak to a contaminated water pump by mapping cases geographically, is an early example of spatial reasoning applied to data.

```pseudocode
KNN_CLASSIFY(new_point, all_points, k):
    distances = []
    for each point in all_points:
        d = distance(new_point, point)
        distances.append((d, point.label))

    sort distances by d, ascending
    neighbors = first k entries of distances
    return most_common_label(neighbors)
```

No training phase. No weights. Just look at what's nearby and go with the majority.

**Principal Component Analysis (PCA)** solved a growing problem: as datasets got more dimensions, distance-based methods stopped working well. Karl Pearson introduced PCA in 1901, and Harold Hotelling expanded it in 1933. It uses eigenvectors and eigenvalues, ideas from the 1700s.

**Support Vector Machines (SVMs)** and the **kernel trick** represented the peak of pre-deep-learning machine learning. Vladimir Vapnik's insight: project data that isn't linearly separable into a higher-dimensional space where it becomes separable, and do it efficiently without computing the full transformation. SVMs dominated machine learning benchmarks through the 1990s and 2000s.

John Hopfield brought physics into the picture in 1982 with **Hopfield Networks**, treating neural networks as energy systems. Stored memories become energy minima.
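That idea fits in a few lines of runnable Python. This is a toy sketch, not Hopfield's original formulation: a single six-unit pattern stored with the Hebbian rule, all values invented for illustration.

```python
# Toy Hopfield network: store one +1/-1 pattern via the Hebbian rule,
# then recover it from a corrupted copy. The pattern is made up.

def train(patterns):
    """Build the weight matrix: w[i][j] = sum over patterns of p[i]*p[j]."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, steps=5):
    """Repeatedly flip each unit toward lower energy."""
    state = list(state)
    n = len(state)
    for _ in range(steps):
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            state[i] = 1 if total >= 0 else -1
    return state

stored = [1, 1, -1, -1, 1, -1]        # the "memory"
w = train([stored])
corrupted = [1, -1, -1, -1, 1, -1]    # one bit flipped
print(recall(w, corrupted))           # → [1, 1, -1, -1, 1, -1]
```

With one bit flipped, the update rule pulls the state back to the stored memory within a single sweep.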
Corrupted input "falls" to the nearest minimum, retrieving the closest match. Statistical mechanics meets pattern recognition.

Each tradition contributed concepts that would later fuse with neural networks to create modern AI.

## The resurrection: backpropagation changes everything (1986)

The breakthrough that ended the AI winter came from David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986. They published the **backpropagation algorithm**: train multi-layer neural networks using the [chain rule][chain-rule] from calculus.

After the network makes a prediction, measure how wrong it was. Propagate that error backwards through every layer, adjusting each weight proportionally to its contribution to the mistake. Repeat. The network learns.

```pseudocode
BACKPROPAGATION(network, input, expected, learning_rate):
    // Forward pass: compute output layer by layer
    activations[0] = input
    for each layer L from 1 to last:
        z[L] = weights[L] * activations[L-1] + biases[L]
        activations[L] = activation_function(z[L])

    // Compute error at the output
    error = activations[last] - expected

    // Backward pass: propagate error through layers
    for each layer L from last down to 1:
        gradient[L] = error * derivative_of_activation(z[L])
        // pass the error back using the weights as they were
        // during the forward pass, before updating them
        error = weights[L] transposed * gradient[L]
        weights[L] = weights[L] - learning_rate * gradient[L] * activations[L-1]
        biases[L]  = biases[L]  - learning_rate * gradient[L]
```

The chain rule from calculus does the heavy lifting. Each layer asks: "How much did I contribute to the total error?" and adjusts accordingly. Millions of small corrections, repeated millions of times.

Backpropagation proved Minsky and Papert's implication wrong. Multi-layer networks could be trained, and they could solve problems that single-layer networks never could. In 1989, George Cybenko proved the **Universal Approximation Theorem**: a neural network with one hidden layer, given enough neurons, can approximate any continuous function.
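To see the flavor of this, take the XOR problem from earlier. A small Python sketch, with hand-picked weights rather than learned ones, shows a single hidden layer doing what no single perceptron can:

```python
# Hand-wired two-layer network for XOR (illustrative weights, not learned).
# Hidden unit h_or fires when either input is on; h_and when both are.
# The output fires for "or but not and": exactly the XOR truth table.

def step(x):
    return 1 if x > 0 else 0

def xor_net(a, b):
    h_or  = step(a + b - 0.5)        # OR gate
    h_and = step(a + b - 1.5)        # AND gate
    return step(h_or - h_and - 0.5)  # OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "→", xor_net(a, b))  # → 0, 1, 1, 0
```

Backpropagation's contribution was finding weights like these automatically instead of by hand.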
Cybenko felt Minsky and Papert "had got it wrong." The math agreed.

Yann LeCun put this into practice in the early 1990s with **LeNet**, a deep convolutional neural network (CNN) that recognized handwritten digits for the U.S. Postal Service. LeCun's architecture drew inspiration from David Hubel and Torsten Wiesel's 1960s research on the feline visual cortex, which revealed that biological vision works hierarchically: simple cells detect edges, complex cells combine edges into features, deeper layers recognize increasingly abstract patterns.

A CNN scans an image the way your visual cortex does, small patches first, then combining what it finds:

```pseudocode
CNN_FORWARD(image):
    // Convolutional layer: slide a small filter across the image
    for each position (x, y) in image:
        patch = image[x:x+filter_size, y:y+filter_size]
        feature_map[x, y] = sum(patch * filter_weights)

    // Activation: keep only strong signals
    feature_map = relu(feature_map)   // if value < 0, set to 0

    // Pooling: shrink the map, keep the strongest features
    for each 2x2 block in feature_map:
        pooled[block] = max(values in block)

    // Stack more convolutional layers for higher-level features
    // edges → textures → shapes → objects

    // Final layer: flatten and classify
    output = fully_connected(flatten(pooled))
    return output
```

LeNet worked. But the hardware of the 1990s couldn't scale it. The ideas were sound, the compute wasn't there yet. The field waited.

## The explosion: compute catches up (2010s)

Two things changed in the 2010s: massive datasets became available, and GPUs (Graphics Processing Units) proved ideal for the parallel matrix operations neural networks require.

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered their deep CNN, **AlexNet**, into the ImageNet competition. It demolished the competition, cutting the error rate nearly in half compared to the next best entry. Deep learning went from academic curiosity to industry obsession overnight.
The techniques that won were decades old. Backpropagation (1986). Convolutional networks (1990s). Gradient descent (1959). The math had been waiting for the hardware to catch up.

From there, progress accelerated. Deep learning began dominating benchmarks in image recognition, speech recognition, natural language processing, and game playing. DeepMind's AlphaGo defeated the world champion Go player in 2016, a milestone many researchers had predicted was decades away.

## The transformer and the current frontier (2017-present)

In 2017, a team at Google published [Attention Is All You Need][attention-paper], introducing the **Transformer architecture**. The key innovation was the **attention mechanism**, which lets a model dynamically weigh different parts of its input. Instead of processing sequences one step at a time (as recurrent networks do), transformers process entire sequences in parallel, attending to relevant context regardless of distance.

```pseudocode
SELF_ATTENTION(tokens):
    // Each token generates three vectors: query, key, value
    for each token[i]:
        query[i] = token[i] * W_query
        key[i]   = token[i] * W_key
        value[i] = token[i] * W_value

    // Every token asks: "How relevant is every other token to me?"
    for each token[i]:
        for each token[j]:
            score[i][j] = dot_product(query[i], key[j])
            // (real transformers also divide scores by the square
            //  root of the key dimension to keep them stable)

        // Normalize scores so they sum to 1
        attention_weights[i] = softmax(score[i])

        // Weighted mix: tokens that matter more contribute more
        output[i] = sum(attention_weights[i][j] * value[j] for all j)

    return output
```

In the sentence "The cat sat on the mat because it was tired," attention lets the model figure out that "it" refers to "the cat" by giving "cat" a high attention score when processing "it." Every token can attend to every other token in parallel, which is why transformers scale so well on GPUs.

Transformers enabled **large language models (LLMs)**: GPT, BERT, and their successors.
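The attention computation itself runs fine in plain Python. This sketch is deliberately simplified: identity matrices stand in for the learned W_query/W_key/W_value projections, there's no score scaling, and the 2-dimensional token vectors are invented.

```python
# Toy self-attention: three tokens, 2-dimensional vectors, identity
# projections. All numbers are made up for illustration.
import math

def softmax(xs):
    m = max(xs)                                 # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(tokens):
    # With identity projections, each token is its own query, key,
    # and value; real models learn these matrices.
    outputs = []
    for q in tokens:                            # each token asks...
        scores = [dot(q, k) for k in tokens]    # ...how relevant is every token?
        weights = softmax(scores)               # normalize to sum to 1
        out = [sum(w * v[d] for w, v in zip(weights, tokens))
               for d in range(len(q))]          # weighted mix of values
        outputs.append(out)
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print([round(x, 3) for x in row])
```

Each output row is a convex mix of the input vectors, weighted by how strongly that token "attends" to the others.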
These models train on vast text corpora through self-supervised learning: predict the next word, or fill in the blank. Scale the model and the data, and capabilities emerge that nobody programmed.

This is where the story gets strange. Ananthaswamy's book calls it **"Terra Incognita"**, and that's accurate. Modern deep networks routinely violate classical statistical theory:

* **Overparameterization**: These models have far more parameters than training examples. Classical theory says they should memorize training data and fail on new inputs. They don't.
* **Double descent**: As models grow, test error falls, then rises near the point where the model can exactly fit its training data (just as classical theory expects), then falls again as models grow larger still. Nobody predicted that second descent.
* **Grokking**: Models that appear to have memorized training data suddenly discover generalizable solutions after extended training, long after learning should have stopped.

The systems work astonishingly well, and nobody fully understands why. The 18th-century math that got the field here is insufficient to explain what's happening now.

## What might come next

A few directions look promising.

**Mechanistic interpretability** aims to reverse-engineer what happens inside neural networks, not measuring inputs and outputs, but understanding the internal representations. Crack this, and we move from "it works and we don't know why" to actual understanding.

**Multimodal models** that process text, images, audio, and video together blur the boundaries between specialized AI systems. The trend points toward general-purpose systems handling multiple domains at once.

**Efficiency research** pushes back against the "just make it bigger" approach. Mixture-of-experts, quantization, and distillation aim to match performance with far smaller models, essential for deploying AI outside data centers.

Work on **reasoning and planning** moves beyond pattern matching toward systems that decompose problems, test hypotheses, and chain logical steps.

Nobody knows where this goes.
The field has been here before, full of certainty that turned out wrong in both directions. What's different now: the systems work well enough to affect billions of people daily, which makes understanding them urgent.

## The math was always there

The through-line is simple: the math came first. Calculus from Newton and Leibniz. Linear algebra from Hamilton. Probability from Bayes. Gradient descent from Widrow and Hoff. The chain rule that powers backpropagation is a standard calculus result.

What changed wasn't the math. What changed was compute, data, and a handful of researchers who refused to give up when the field told them neural networks were a dead end. Hinton kept working through the AI winter. LeCun kept building CNNs when SVMs dominated. Rosenblatt believed in the perceptron when it was just a curiosity.

The history of AI is not a history of artificial intelligence. It is a history of human stubbornness in the face of elegant math the world wasn't ready for.

## References

* [Why Machines Learn][why-machines-learn], by Anil Ananthaswamy. The book that guided this article's narrative arc, tracing the mathematical foundations from perceptrons to modern deep learning.
* [Attention Is All You Need][attention-paper], by Vaswani et al. The 2017 paper introducing the Transformer architecture.
* [Chain rule][chain-rule], a plain reference for the exact calculus concept used by backpropagation.

[why-machines-learn]: https://www.penguinrandomhouse.com/books/677608/why-machines-learn-by-anil-ananthaswamy/
[attention-paper]: https://arxiv.org/abs/1706.03762
[chain-rule]: https://en.wikipedia.org/wiki/Chain_rule
[cholera-map]: https://pmc.ncbi.nlm.nih.gov/articles/PMC7150208/
[technology-know-it-alls]: https://jeffbailey.us/death-by-1000-cuts-5-technology-know-it-alls/