Fundamentals of Graph Databases

Introduction

Why do some queries that feel simple in your head turn into multi-table joins, subqueries, and endless recursion in SQL?

Years ago, I modeled a permission system where “User A has role X in project Y” became a maze of join tables. I pictured a graph in my head, but the implementation fought the relational database at every step. A graph database would have fit my thinking better.

When your data is about relationships (who knows whom, what depends on what, how things connect), a relational database makes you work against its grain. Graph databases emerged in the 2000s precisely because of this friction, as social networks, recommendation engines, and fraud detection systems pushed relationship queries far beyond what SQL was designed for. Graph databases flip that friction around. They make relationships first-class: stored explicitly, queried directly.

This article explains why graph databases exist, how the core model works, and when they help (and when they don’t). It builds on concepts from Fundamentals of Databases.

Cover: diagram showing nodes connected by edges representing the graph database model

Type: Explanation (understanding-oriented). Primary audience: beginner to intermediate developers and architects evaluating data storage options.

The Graph Model: Nodes, Edges, and Properties

A graph database stores data as a graph: nodes (entities) connected by edges (relationships). Both nodes and edges can carry properties (key-value attributes).

Think of a whiteboard diagram. You draw boxes for people, projects, and events. You draw lines between them: “works on,” “knows,” “depends on.” A graph database stores that whiteboard directly, rather than flattening it into tables and foreign keys.

Nodes

Nodes represent entities: users, products, orders, concepts, or anything you model.

In a property graph, each node has:

A label (or type), such as Person, Product, Project.
Properties, such as name, email, created_at.

(:Person {name: "Alice", email: "alice@example.com"})
(:Product {sku: "WIDGET-1", price: 29.99})

Labels group similar nodes. Properties attach data to each node.

Edges (Relationships)

Edges connect nodes and represent relationships. They have:

A type, such as KNOWS, PURCHASED, DEPENDS_ON.
Direction (from one node to another).
Optional properties (e.g., since, quantity).

Unlike relational foreign keys, edges are first-class. You can query “all paths between A and B” or “friends of friends” without building join chains.

(Alice)-[:KNOWS {since: 2020}]->(Bob)
(Alice)-[:PURCHASED {quantity: 2}]->(Widget)
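
To make the model concrete, here is a minimal in-memory sketch of the same nodes and edges in Python. The Node and Edge classes are hypothetical names chosen for this example and do not correspond to any database's API; Bob's node carries only a name because the text gives no other properties for him.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """An entity with labels and key-value properties."""
    labels: set[str]
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed relationship that can carry its own properties."""
    type: str
    source: Node
    target: Node
    properties: dict[str, Any] = field(default_factory=dict)


alice = Node({"Person"}, {"name": "Alice", "email": "alice@example.com"})
bob = Node({"Person"}, {"name": "Bob"})
widget = Node({"Product"}, {"sku": "WIDGET-1", "price": 29.99})

edges = [
    Edge("KNOWS", alice, bob, {"since": 2020}),
    Edge("PURCHASED", alice, widget, {"quantity": 2}),
]

# Relationships are stored objects you can inspect, not something inferred from keys.
purchases = [e for e in edges if e.type == "PURCHASED" and e.source is alice]
```

Each edge holds direct references to its two endpoints and its own properties, which is the sense in which relationships are first-class.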

Why Edges Beat Foreign Keys

In a relational database, relationships are implicit: foreign keys pointing between tables. To traverse “friends of friends,” you join, filter, and join again. Each hop adds another join and more complexity.

In a graph database, relationships are explicit. Traversing from a node to its neighbors is a core operation, not something you reconstruct. The storage engine and query model both favor “follow the edges” over “build the joins.”

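This difference compounds. One hop is manageable in SQL. Two hops are awkward. Five hops are miserable. In a graph database, depth barely changes how you write the query: you adjust a hop range instead of adding joins. The work still grows with the number of paths explored, but each hop follows stored edges rather than rebuilding them.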
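
The contrast can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any particular engine stores data: friendship_rows stands in for a relational table, adjacency stands in for a graph store's per-node neighbor lists, and all names are invented.

```python
from collections import defaultdict

# Relational shape: one row per friendship, found by scanning or joining.
friendship_rows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("dave", "erin"),
]

# Graph shape: each node keeps direct references to its neighbors.
adjacency = defaultdict(list)
for user, friend in friendship_rows:
    adjacency[user].append(friend)


def neighbors_via_rows(user):
    """Rebuild the relationship by scanning every row (a join without an index)."""
    return [friend for u, friend in friendship_rows if u == user]


def reachable_in(start, hops):
    """Follow edges for exactly `hops` steps; depth is a parameter, not extra query text."""
    frontier = {start}
    for _ in range(hops):
        frontier = {n for node in frontier for n in adjacency.get(node, [])}
    return frontier
```

Going from one hop to four changes only the hops argument, while the row-based version would need another scan or join per level.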

Why Graphs Help: When Relationships Dominate

Graph databases shine when your questions are about connections, not just attributes.

The Relational Struggle

In a relational database, “friends of friends” might look like:

-- f1 finds my friends; f2 finds their friends. Both ? placeholders take my user id.
SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
WHERE f1.user_id = ? AND f2.friend_id != ?

For a fixed two hops, that’s manageable. For “friends of friends of friends,” you add another self-join; for variable depth, you need recursive common table expressions (CTEs). Each extra level multiplies the intermediate rows the database must join, and performance can degrade quickly.
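
A variable-depth version can be sketched with SQLite from Python. This is a minimal sketch, assuming a friendships(user_id, friend_id) table with one directed row per friendship; the sample rows, the reachable CTE name, and the friends_within helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER)")
conn.executemany(
    "INSERT INTO friendships VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (4, 5)],
)

VARIABLE_DEPTH_SQL = """
WITH RECURSIVE reachable(person_id, depth) AS (
    -- Base case: direct friends are at depth 1.
    SELECT friend_id, 1 FROM friendships WHERE user_id = :start
    UNION
    -- Recursive case: one more join per level, up to max_depth.
    SELECT f.friend_id, r.depth + 1
    FROM reachable r
    JOIN friendships f ON f.user_id = r.person_id
    WHERE r.depth < :max_depth
)
SELECT DISTINCT person_id
FROM reachable
WHERE depth >= 2 AND person_id != :start
ORDER BY person_id
"""


def friends_within(start, max_depth):
    """People reachable in 2..max_depth hops from `start`."""
    rows = conn.execute(VARIABLE_DEPTH_SQL, {"start": start, "max_depth": max_depth})
    return [person_id for (person_id,) in rows]
```

It works, but the traversal logic is spelled out by hand: a base case, a recursive join, and a depth counter that the query author has to manage.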

The Graph Approach

In a graph database, a similar query, extended to cover two or three hops, becomes:

MATCH (me:Person {name: "Alice"})-[:KNOWS*2..3]-(friend)
RETURN DISTINCT friend

The *2..3 means “2 to 3 hops.” The database is built for this kind of traversal. You describe the pattern; the engine follows the edges.
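
To show roughly what the engine does for a variable-length pattern, here is a small Python sketch. It assumes undirected KNOWS edges and the rule that a single path never reuses the same edge, which is my understanding of how Cypher matches patterns; the sample data and the variable_length_matches helper are invented, and a real engine plans and optimizes this very differently.

```python
# Undirected KNOWS edges between people (sample data made up for illustration).
knows = [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave"), ("Dave", "Erin")]


def variable_length_matches(start, min_hops, max_hops):
    """End nodes of paths with min_hops..max_hops edges, never reusing an edge in a path."""
    found = set()

    def walk(node, used, hops):
        if hops >= min_hops:
            found.add(node)
        if hops == max_hops:
            return
        for index, (a, b) in enumerate(knows):
            if index in used or node not in (a, b):
                continue
            # The pattern is undirected, so an edge can be crossed from either end.
            walk(b if node == a else a, used | {index}, hops + 1)

    walk(start, frozenset(), 0)
    return found


friends = variable_length_matches("Alice", 2, 3)
```

With this sample data, Bob is one hop away and Erin is four, so neither falls inside the 2..3 range.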

Where This Matters Most

Social graphs: friends, followers, influence.
Recommendations: “users who bought X also bought Y.”
Dependency graphs: packages, services, data pipelines.
Fraud detection: suspicious connection patterns across shared devices, addresses, or payment methods.
Knowledge graphs: entities and how they relate.

If your most important queries are about “who is connected to whom” or “how do these things relate,” a graph database removes the friction that a relational database creates.
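
As one example of such a question, “users who bought X also bought Y” is a two-hop walk: from a product to its buyers, then to their other purchases. The sketch below uses plain Python and invented sample data; the purchased list and also_bought helper are hypothetical.

```python
from collections import Counter

# (customer, product) PURCHASED edges; all names are invented sample data.
purchased = [
    ("alice", "widget"), ("alice", "gadget"),
    ("bob", "widget"), ("bob", "gadget"), ("bob", "gizmo"),
    ("carol", "widget"), ("carol", "gizmo"),
    ("dave", "doohickey"),
]


def also_bought(product):
    """Hop product -> buyers -> their other products, ranked by buyer overlap."""
    buyers = {c for c, p in purchased if p == product}
    counts = Counter(p for c, p in purchased if c in buyers and p != product)
    return counts.most_common()
```

Calling also_bought("widget") returns each co-purchased product with the number of shared buyers, which is the raw signal a recommendation feature would start from.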

Property Graphs: The Dominant Model

Two main graph paradigms exist: property graphs and RDF (Resource Description Framework) triple stores. In application development, property graphs dominate.

How Property Graphs Work

In a property graph:

Nodes have labels and properties.
Edges have types, direction, and optional properties.
No global schema: different nodes can have different properties.

This flexibility suits application data where structure evolves. You can add a new property to a node without migrating a schema. You can introduce a new edge type without creating a join table.

RDF Versus Property Graphs

RDF models data as subject-predicate-object triples. It’s well standardized and strong for semantic data and interoperability, but tends toward rigidity in application code. SPARQL is its query language.
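
A rough Python sketch of the triple model, for comparison. The ex: prefixes are shortened placeholder identifiers, and the match helper is a hypothetical stand-in for a triple-pattern query, not a SPARQL implementation.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard, like a query variable."""
    return sorted(
        t for t in triples
        if subject in (None, t[0]) and predicate in (None, t[1]) and obj in (None, t[2])
    )


alice_knows = match("ex:alice", "ex:knows")
```

Note that attributes such as a name become triples of their own, and a plain triple has no slot for a property on the relationship itself (such as since), which is one practical difference from property graphs.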

Property graphs optimize for traversal and flexibility. Neo4j, JanusGraph, and Amazon Neptune (which also supports RDF) use this model. This article focuses on property graphs because they’re what most application developers encounter first.

Visualizing a Property Graph

Nodes are entities. Edges are relationships. Properties live on both. This diagram could be a whiteboard sketch, and that’s the point: the database stores what you’d naturally draw.

Query Languages: Cypher and Gremlin

SQL can’t naturally express “from here, follow these edges, optionally N hops.” Graph databases offer declarative or traversal-oriented query languages that can, and the database optimizes traversal rather than emulating it with joins.

Cypher (Declarative)

Cypher (used by Neo4j and others) is a pattern-based language. You sketch the shape of the subgraph you want; the engine finds matches.

MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name

Read it as: find Person nodes named Alice, follow KNOWS edges, and return the connected friends’ names.

For variable-depth paths:

MATCH (p:Person {name: "Alice"})-[:KNOWS*1..4]-(distant)
RETURN DISTINCT distant.name

*1..4 means 1 to 4 hops. The engine handles the traversal.

Gremlin (Traversal)

Gremlin (the traversal language of Apache TinkerPop, supported by databases like JanusGraph) describes step-by-step traversals.

g.V()
  .has('Person', 'name', 'Alice')
  .out('KNOWS')
  .values('name')

Translation: start at vertices, filter to Person with name Alice, go out along KNOWS edges, and return name values.
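
The step-by-step style can be imitated with a toy Python class in which each step takes the current set of vertices and returns the next. This Traversal class is a hypothetical teaching aid, not the TinkerPop API, and the graph contents are invented.

```python
class Traversal:
    """A toy step-by-step traversal; a stand-in for illustration only."""

    def __init__(self, graph, current):
        self.graph = graph
        self.current = current  # vertex ids the traverser is currently on

    def has(self, label, key, value):
        """Keep only vertices with the given label and property value."""
        vertices = self.graph["vertices"]
        keep = [v for v in self.current
                if vertices[v]["label"] == label and vertices[v].get(key) == value]
        return Traversal(self.graph, keep)

    def out(self, edge_type):
        """Move along outgoing edges of the given type."""
        nxt = [dst for v in self.current
               for src, etype, dst in self.graph["edges"]
               if src == v and etype == edge_type]
        return Traversal(self.graph, nxt)

    def values(self, key):
        """Finish by reading one property from each current vertex."""
        return [self.graph["vertices"][v][key] for v in self.current]


graph = {
    "vertices": {
        1: {"label": "Person", "name": "Alice"},
        2: {"label": "Person", "name": "Bob"},
        3: {"label": "Person", "name": "Carol"},
    },
    "edges": [(1, "KNOWS", 2), (1, "KNOWS", 3), (2, "KNOWS", 3)],
}

start = Traversal(graph, list(graph["vertices"]))
names = start.has("Person", "name", "Alice").out("KNOWS").values("name")
```

Each method call narrows or moves the current position, which mirrors how a Gremlin traversal reads left to right as a pipeline of steps.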

Cypher favors pattern matching. Gremlin favors explicit traversal steps. Both work; I’ve found that the choice usually comes down to ecosystem and team preference rather than any fundamental capability difference.

When to Use Graph Databases

Graph databases are well-suited to problems where relationships and connectivity are central. But they aren’t a default choice. They solve specific problems well and come with real trade-offs.

Where Graphs Fit

Social and recommendation systems: “People you may know” and “users who liked X also liked Y” are graph traversals. Doing this with joins across many tables is possible, but brittle and slow to evolve.
Fraud detection: Fraud rings show up as unusual connection patterns. Graphs make it natural to find clusters and suspicious paths.
Knowledge graphs and master data: When you have many entity types and relationship types, a graph keeps the model understandable. Adding a new relationship type doesn’t require new join tables.
Dependency and impact analysis: Package dependencies, microservice dependencies, and data lineage are all graphs. “What depends on this?” becomes a straightforward traversal.
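
The “What depends on this?” question from the last item can be sketched as a breadth-first walk over reversed dependency edges. The service names and the impacted_by helper are hypothetical; a graph database would express this as a variable-length traversal instead.

```python
from collections import deque

# service -> services it depends on (hypothetical names).
depends_on = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "reports": ["db"],
    "db": [],
}


def impacted_by(target):
    """Return everything that depends on `target`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    seen, queue = set(), deque([target])
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

The seen set keeps the walk from revisiting a service that is reachable along more than one path, such as web through both api and auth.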

Where Graphs Don’t Fit

Tabular, aggregate-heavy workloads: Reports, dashboards, and analytics over large fact tables are simpler and faster in a relational or columnar database. SQL and Online Analytical Processing (OLAP) tools are built for this.
Simple CRUD with few relationships: If your data is mostly independent records with few relationship queries, a relational database is simpler and cheaper to operate. Don’t pay the learning curve for a problem that doesn’t need it.
Team readiness: Graph databases require different mental models, query languages, and operational practices. If the team is small or new to graphs, the learning curve can outweigh the benefits.

The Hybrid Approach

Many systems use both relational and graph stores: relational for transactional and aggregate workloads, graph for relationship-heavy analytics or features. You sync relevant data and keep each system focused. This is common and often the right answer.

Common Misconceptions

I’ve seen a few beliefs that trip teams up when evaluating graph databases.

“Graph databases can’t scale.” Many support clustering and partitioning. Scalability depends on the product and the shape of your data, not on the graph model itself.
“You have to choose graph or relational.” Hybrid architectures are common and often the right answer. I’d argue they’re the norm for any system that genuinely needs both patterns.
“Graphs are only for social networks.” Any domain where relationships and connectivity matter is a candidate: fraud, recommendations, dependencies, knowledge graphs.
“Cypher and Gremlin are the only options.” Other languages exist (SPARQL for RDF, for instance). The GQL (Graph Query Language) standard, published by ISO in 2024 as ISO/IEC 39075, aims to be the SQL of graphs, potentially reducing vendor lock-in as databases adopt it.
“Graphs are slow for simple lookups.” With proper indexes, key-based lookups are fast. Traversal is where graphs pull ahead.

Modeling Pitfalls

A few modeling mistakes recur when teams adopt graph databases for the first time.

Over-modeling edge types. Creating a new edge type for every nuance bloats the model. KNOWS_CASUALLY, KNOWS_PROFESSIONALLY, and KNOWS_FAMILY could be one KNOWS edge with a property (say, kind) that records the nuance. Add new edge types only when the relationship meaning actually changes.
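
A small Python sketch of the single-edge-type approach. The kind property name, the sample edges, and the contacts helper are all chosen for this example rather than taken from any database.

```python
# One KNOWS edge type; the nuance lives in a property on the edge.
knows_edges = [
    {"from": "Alice", "to": "Bob", "type": "KNOWS", "kind": "professional"},
    {"from": "Alice", "to": "Carol", "type": "KNOWS", "kind": "family"},
    {"from": "Alice", "to": "Dave", "type": "KNOWS", "kind": "casual"},
]


def contacts(person, kind=None):
    """All KNOWS neighbors, optionally narrowed by the kind of relationship."""
    return [e["to"] for e in knows_edges
            if e["from"] == person and (kind is None or e["kind"] == kind)]
```

Queries that don't care about the nuance stay simple, and those that do add one filter instead of enumerating several edge types.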

Choosing a graph for the wrong workload. Graph hype is real. Storing high-volume event logs in a graph “because we might want to analyze relationships later” adds complexity without payoff. Use a relational or time-series store for logs. Use a graph only when relationship traversal is a primary workload.

Ignoring indexes. Graph databases need indexes on the properties you filter by, just like relational databases do. Without indexes on the labels and properties used in MATCH or WHERE clauses, lookups that should be fast become full scans. I’ve seen teams blame the graph engine for poor performance when the real problem was missing indexes.
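
The cost difference is easy to see in a toy Python model. Here a list of dictionaries stands in for stored nodes and a plain dict stands in for whatever index structure a real database maintains; the data and helper names are invented.

```python
people = [{"id": i, "name": f"user{i}"} for i in range(10_000)]


def find_by_scan(name):
    """Without an index: inspect nodes one by one until a match turns up."""
    checked = 0
    for person in people:
        checked += 1
        if person["name"] == name:
            return person, checked
    return None, checked


# With an index: a one-time build, then lookups go straight to the node.
name_index = {person["name"]: person for person in people}


def find_by_index(name):
    """Look up the starting node directly instead of scanning."""
    return name_index.get(name)
```

Finding the starting node is only the first step of a traversal, but if that step scans every node, the rest of the query inherits the delay.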

Wrapping Up

Graph databases exist because some data is fundamentally about connections. When your mental model is a graph (nodes and edges, not rows and columns), a graph database lets you store and query that model directly.

The core idea is simple: make relationships first-class. Store them explicitly. Query them by following edges instead of reconstructing paths through joins. That shift makes certain classes of queries (traversals, paths, pattern matching) dramatically simpler.

The decision comes down to workload. If your most important queries are about how things relate, a graph database removes friction. If your workload is tabular reporting and aggregates, stick with relational. If you need both, use both.

If you want to go deeper, start with the Neo4j Cypher Manual for hands-on query practice, or read Graph Databases (O’Reilly) for a thorough treatment of graph modeling.

Related Articles

Fundamentals of Databases explains database types and when to choose each. Fundamentals of Backend Engineering covers data storage in backend systems. Fundamentals of Data Engineering shows how graph data fits into data pipelines.

Glossary

Node: An entity in a graph (e.g., a person, product, or concept). Has a label and optional properties.

Edge (Relationship): A connection between two nodes. Has a type, direction, and optional properties.

Property graph: A graph model where nodes and edges can have arbitrary key-value properties.

Traversal: Following edges from one node to others, possibly for multiple hops.

Cypher: A declarative query language for property graphs, used by Neo4j and others.

Gremlin: A traversal-oriented query language for property graphs, used by Apache TinkerPop-compatible databases.

RDF: Resource Description Framework. A triple-based model (subject-predicate-object) used for semantic data.

SPARQL: Query language for RDF data.

References

Industry Sources

Neo4j Graph Database Platform, property graph database and Cypher documentation.
Apache TinkerPop, a graph computing framework and Gremlin language.
Amazon Neptune, managed graph database service.
Graph Databases (O’Reilly) by Robinson, Webber, and Eifrem, a foundational book on graph database concepts.

Standards and Specifications

GQL Standard (ISO/IEC 39075), the ISO standard graph query language, first published in 2024.
W3C RDF, RDF specification and semantic web standards.

Learning Resources

Neo4j Cypher Manual, official Cypher reference and tutorials.
Gremlin Documentation, Gremlin traversal language guide.
