What is Neo4j architecture?

1 Answer

The easiest way to sum up a native graph storage engine is that its storage structures are purpose-built for graph-optimized interactions. In Neo4j, this native graph storage shows up in the fixed-size record structure of nodes, relationships, and so on, as well as in the store files that contain those records, as explained below. As with any database, Neo4j leverages two principal kinds of storage: physical storage and memory (each described in more detail further below).

PHYSICAL STORAGE

In Neo4j, data on physical disk is laid out according to the index-free adjacency principle. The graph is persisted in a set of files known as store files, generally broken down by record type (each containing the fixed-size records described above), as shown in the table below:

(Table: Neo4j performance - storage engine file structure, listing the store files by record type.)

TIP 3: For Neo4j performance tuning, use EXT4 or XFS, and avoid NFS or NAS file systems.

TIP 4: Another Neo4j performance tuning option is to store data and transaction logs on separate drives to avoid contention and allow I/O to occur at the same time for both.
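As a minimal sketch, that split can be expressed in neo4j.conf roughly as follows (the mount paths are hypothetical examples, and the dbms.directories.* setting names are those used in Neo4j 4.x):

# Keep the graph store and the transaction logs on separate physical drives
dbms.directories.data=/mnt/disk1/neo4j/data
dbms.directories.transaction.logs.root=/mnt/disk2/neo4j/transactions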

MEMORY

Neo4j uses a disk-based approach to data storage but relies heavily on memory, caching disk contents for performance (and using memory as temporary working storage). As expected, Neo4j performs best when it can work from the in-memory cache as much as possible, since that significantly reduces the cost of disk I/O.

TIP 5: Because Neo4j is Java-based, an additional option is to configure the JVM memory directly via the neo4j.conf file. Developers can then tune the memory configuration for maximum performance by balancing the JVM heap size against garbage-collector behavior, among other tuning options.
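For illustration, a minimal neo4j.conf sketch of this kind of JVM tuning might look like the lines below (the values are assumptions, not recommendations; the setting names are those used in Neo4j 4.x):

# Fix the JVM heap size; equal initial and max values avoid resizing pauses
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
# Extra JVM flags, such as garbage collector selection, can be passed through as well
dbms.jvm.additional=-XX:+UseG1GC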

In practice, particularly as the cost of memory has come down so significantly in the last decade, it is now quite normal to hold one’s entire graph in memory for production use.

TIP 6: To size Neo4j memory settings adequately, here is a rule of thumb: Total Physical Memory = Heap + Page Cache + OS Memory. Because all records are fixed size, the space required for the core data can also be calculated precisely.
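As a worked example of that rule of thumb (the figures are purely illustrative): on a 32 GB server, 8 GB of heap plus 16 GB of page cache leaves roughly 8 GB for the operating system and other processes. The page cache, which holds the store files, is sized separately in neo4j.conf:

# Page cache caches the store files; size it to fit the graph data where possible
dbms.memory.pagecache.size=16g

Recent Neo4j versions also ship a neo4j-admin memrec command that prints suggested values for the heap and page-cache settings based on the machine's available memory.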

To become more familiar with Neo4j memory use, the figure below puts it in context:

(Figure: Neo4j memory.)

Neo4j Performance through Hardware Scaling (Neo4j distributed)

Massive scalability is still a challenge for all graph databases and has been cited as one of the limitations of Neo4j, but as the market and technology mature, Neo4j is solving the problem creatively and well. The core issue is maintaining state across many servers, keeping them synchronized while scaling, given the unique nature of graph data, as discussed below.

Until recently, Neo4j could only scale horizontally through replication (copying the full database contents onto multiple servers), combined with a workaround technique called cache sharding, which routes each request to the instance that already holds the desired portion of the graph in memory. By contrast, the standard approach to sharding distributes the data itself across different cluster nodes, each of which performs its own reads and writes, often with different users accessing separate (and sometimes duplicated) sets of data.

So why not just use the standard approach to sharding? The reason is that the property graph model relies on nodes and the physical connections between them as the basis for queries. If a query must traverse sub-graphs across network boundaries while still maintaining ACID compliance across all the relationships, data consistency, and integrity, it becomes a uniquely challenging problem. In fact, optimally partitioning a graph in this way is known to be an NP-hard problem.

Are there Neo4j Scalability Limits? Making Neo4j Sharding Work

So does this represent a hard Neo4j scalability limit? Not at all. With the release of version 4.0, Neo4j is now clearly well on the path to offering a truly distributed database, which also brings it closer to standard sharding practices.

So how has the Neo4j approach to sharding evolved? As of version 4.0, Neo4j shards by spreading subgraphs across separate Neo4j database instances, potentially on different servers and even different networks. To preserve a relationship between entities that end up on different shards, the node carrying that relationship is duplicated on both shards. In practice this means a developer can now shard the dataset, but not the relationships themselves, since it is not yet possible to traverse across shards.

To manage the challenge of state across servers and networks, Neo4j now also provides Neo4j Fabric to manage the querying process. It can be thought of as a proxy that manages requests and connection information across all the shards; it keeps a master representation of the database entities and their locations, and is aware of all shards at all times. Neo4j Fabric itself does not hold any data.
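To make the querying side concrete, here is a minimal sketch of a Fabric query in Neo4j 4.x Cypher, run against the Fabric database; the database name (fabric) and graph name (shard1) are hypothetical and would come from the local Fabric configuration:

// Route this part of the query to the graph held in shard1
USE fabric.shard1
MATCH (p:Person)
RETURN p.name AS name

Queries that need results from several shards typically wrap a USE clause inside a CALL subquery per shard and combine the returned rows afterwards.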

One challenge that still remains with sharding in general is the need to plan access patterns ahead of time, since the fallacies of distributed computing apply in this context as well. To counter those risks, Neo4j offers causal clustering to drive consistency.

A causal cluster is designed to let a single unit of servers (the cluster) maintain its state across replicas, even when communication is interrupted. It provides a way to synchronize state across many servers while scaling, by applying causal consistency and a consensus algorithm (Raft).
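As an illustrative sketch, the membership of a small causal cluster is declared in neo4j.conf on each core server roughly like this (hostnames and ports are assumptions; the causal_clustering.* settings are those used in Neo4j 4.x):

dbms.mode=CORE
# At least three core members are needed to form a fault-tolerant cluster
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.initial_discovery_members=core1:5000,core2:5000,core3:5000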
...