
A guest post from Nigel Small, whose current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

Data duplication isn’t fun, and many of us have spent hours trying to track a duplication problem back to its source in an effort to plug the leaks in some legacy system. It’s obviously better to take some time to prevent the problem in the first place, and since almost every non-trivial piece of data-driven software will have some requirement to manage uniqueness, it’s worth knowing how to keep duplicates out.

In the context of Neo4j, it may be necessary to ensure that individual nodes are unique within a particular context, or there may be more complex uniqueness requirements across paths or subgraphs. Read Using Neo4j from Python for more on Neo4j and the Py2neo library.

Neo4j provides two basic ways to ensure uniqueness: unique index entries and the Cypher CREATE UNIQUE clause. The first way allows nodes (or, less commonly, relationships) to be tagged by a key-value pair. If used correctly, the key-value pair will identify only a single entry. CREATE UNIQUE is used to ensure that paths across two or more nodes are unique within the context of known reference points.

Unique Index Entries

It’s important to remember that as far as Neo4j itself is concerned, a standard index and a unique index are fundamentally the same thing. The difference exists only in the methods used to work with the index. Broken application code can still permit multiple entries and, for that reason, applications should gracefully handle the condition of multiple index entries when only one is expected.
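One way to handle that condition gracefully is sketched below. The names single_entry and DuplicateEntryError are hypothetical, and the input is assumed to be a plain list of hits returned by an index lookup; nothing here is part of Neo4j or Py2neo.

```python
class DuplicateEntryError(Exception):
    """Raised when a lookup that should be unique returns several hits."""


def single_entry(entries, strict=False):
    """Return the one expected entry from a list of index hits.

    entries -- list of hits for a key-value pair (hypothetical input)
    strict  -- raise on duplicates instead of falling back to the first hit
    """
    if not entries:
        return None
    if len(entries) > 1 and strict:
        raise DuplicateEntryError(
            "expected 1 entry, found %d" % len(entries))
    # Lenient mode: carry on with the first hit. A real application would
    # probably also log the duplicates so they can be cleaned up later.
    return entries[0]
```

Whether to fail loudly or to fall back to the first hit depends on the application; the strict flag simply makes that choice explicit at the call site.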

Py2neo, a Python library that provides access to Neo4j via its RESTful web service interface, exposes unique index management through the Index class (unsurprisingly) as well as through the WriteBatch class and the high-level get_or_create_indexed_node function. This function is a convenient wrapper around functionality from the Index class, and it is often used to create fixed reference points within the graph.

The Index class provides three atomic methods for working with unique nodes (or relationships):

  • get_or_create(self, key, value, abstract) – if a node exists under the given key-value, return it; otherwise, create and return a new node using the abstract provided.
  • create_if_none(self, key, value, abstract) – operates like get_or_create, but returns None if a node already existed under the given key-value, which is useful for identifying when a node is newly created.
  • add_if_none(self, key, value, entity) – similar to create_if_none, but adds an existing node to the index instead of creating a new one, and returns None if nothing was added.

The code below illustrates the utility of these methods. It shows several ways to approach the problem of fetching a node from an index that uniquely represents a person:
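As a rough illustration of these semantics, here is a minimal in-memory sketch. ToyUniqueIndex, the "email" key and the sample addresses are hypothetical stand-ins, not the Py2neo Index class, and a node is modelled as a plain dict. The sketch assumes that create_if_none and add_if_none return the entity when they insert something and None otherwise. The real methods are atomic on the server, whereas this toy is only safe in a single thread.

```python
class ToyUniqueIndex:
    """In-memory stand-in for a unique index; NOT the Py2neo Index class."""

    def __init__(self):
        self._entries = {}  # (key, value) -> entity

    def get_or_create(self, key, value, abstract):
        # Return the existing entity, or create one from the property dict.
        if (key, value) not in self._entries:
            self._entries[(key, value)] = dict(abstract)
        return self._entries[(key, value)]

    def create_if_none(self, key, value, abstract):
        # Create and return a new entity, or None if one already exists.
        if (key, value) in self._entries:
            return None
        entity = dict(abstract)
        self._entries[(key, value)] = entity
        return entity

    def add_if_none(self, key, value, entity):
        # Index an existing entity, or return None if the slot is taken.
        if (key, value) in self._entries:
            return None
        self._entries[(key, value)] = entity
        return entity


people = ToyUniqueIndex()

# Approach 1: always get a node back, whether or not it was just created.
alice = people.get_or_create("email", "alice@example.com", {"name": "Alice"})
again = people.get_or_create("email", "alice@example.com", {"name": "Alice"})

# Approach 2: learn whether creation happened (None means "already there").
created = people.create_if_none("email", "alice@example.com", {"name": "Alice"})

# Approach 3: index a node that already exists elsewhere.
bob = {"name": "Bob"}
added = people.add_if_none("email", "bob@example.com", bob)
```

In this toy run, again is the same object as alice, created is None because the entry was already present, and added is the bob dict itself.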

Similar functionality is provided by the WriteBatch class with the get_or_create_indexed_node and get_or_add_indexed_node methods for conditional insertion of new and existing nodes respectively. Version 1.9 of the Neo4j server introduced extra functionality, allowing the introduction of two further batch methods: create_indexed_node_or_fail and add_indexed_node_or_fail. As their names imply, these fail the entire batch if nodes already exist at the specified entry point. Equivalent methods exist for relationship indexes.
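The all-or-nothing behaviour of the "or fail" variants can be pictured with a small in-memory sketch. The function submit_create_or_fail, the BatchFailed exception and the dict-based index are all hypothetical; this models the described semantics only and is not how WriteBatch or the Neo4j server is implemented.

```python
class BatchFailed(Exception):
    """Raised when any job in the batch hits an existing index entry."""


def submit_create_or_fail(index, jobs):
    """Apply every (key, value, properties) job to `index`, or none of them.

    index -- dict mapping (key, value) to a property dict (toy index)
    jobs  -- list of (key, value, properties) tuples
    """
    staged = dict(index)  # work on a copy so a failure leaves `index` untouched
    for key, value, properties in jobs:
        if (key, value) in staged:
            raise BatchFailed("entry already exists at %s=%r" % (key, value))
        staged[(key, value)] = dict(properties)
    index.update(staged)  # commit only after every job has succeeded
    return [index[(key, value)] for key, value, _ in jobs]


index = {}
submit_create_or_fail(index, [
    ("email", "alice@example.com", {"name": "Alice"}),
    ("email", "bob@example.com", {"name": "Bob"}),
])

try:
    # The second job collides with Bob, so Carol must not be created either.
    submit_create_or_fail(index, [
        ("email", "carol@example.com", {"name": "Carol"}),
        ("email", "bob@example.com", {"name": "Bob"}),
    ])
    batch_failed = False
except BatchFailed:
    batch_failed = True
```

Staging the changes on a copy is what keeps the first job of the failed batch from leaking into the index.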

CREATE UNIQUE

Version 1.8 of the Neo4j server introduced mutating Cypher, and with it came the CREATE UNIQUE clause. CREATE UNIQUE is defined by the manual as “in the middle of MATCH and CREATE - it will match what it can, and create what is missing.” It’s used to build relationships or chains of relationships by doing just that.

Obviously the CREATE UNIQUE clause can be used within a direct Cypher query. The recommended way to use it from Py2neo, though, is by using the Path class. The following code shows how to define a path starting from a known node:

At this point, the path is defined, but nothing has actually been created in the database. To do so, use either the create method (which uses Cypher CREATE) or the get_or_create method (which uses CREATE UNIQUE):

As might be expected, the get_or_create method can be safely executed any number of times and will always return the same chain of nodes and relationships. This even works to match partial paths, creating only the bits that are missing.
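The difference between the two methods can be sketched with a toy in-memory graph. ToyGraph, create_path and get_or_create_path are hypothetical names for this illustration, not the Py2neo Path API, and the matching rule (same start node, same relationship type, identical end-node properties) is a simplification of what CREATE UNIQUE actually does.

```python
import itertools


class ToyGraph:
    """Tiny in-memory graph; a stand-in for the database, not Py2neo."""

    def __init__(self):
        self._ids = itertools.count()
        self.nodes = {}  # node id -> property dict
        self.rels = []   # (start id, type, end id) triples

    def add_node(self, properties):
        node_id = next(self._ids)
        self.nodes[node_id] = dict(properties)
        return node_id

    def _find_end(self, start, rel_type, properties):
        # Look for an existing step that matches the requested one.
        for s, t, e in self.rels:
            if s == start and t == rel_type and self.nodes[e] == properties:
                return e
        return None

    def create_path(self, start, *steps):
        """Always create new nodes and relationships (like plain CREATE)."""
        path, current = [start], start
        for rel_type, properties in steps:
            end = self.add_node(properties)
            self.rels.append((current, rel_type, end))
            path.append(end)
            current = end
        return path

    def get_or_create_path(self, start, *steps):
        """Reuse matching steps and create only the missing remainder."""
        path, current = [start], start
        for rel_type, properties in steps:
            end = self._find_end(current, rel_type, properties)
            if end is None:
                end = self.add_node(properties)
                self.rels.append((current, rel_type, end))
            path.append(end)
            current = end
        return path


graph = ToyGraph()
alice = graph.add_node({"name": "Alice"})
steps = [("KNOWS", {"name": "Bob"}), ("KNOWS", {"name": "Carol"})]

first = graph.get_or_create_path(alice, *steps)
second = graph.get_or_create_path(alice, *steps)  # nothing new is created

# Partial match: the Alice-to-Bob step is reused, only Dave is added.
partial = graph.get_or_create_path(
    alice, ("KNOWS", {"name": "Bob"}), ("KNOWS", {"name": "Dave"}))
```

Calling create_path with the same steps would instead add a fresh Bob and Carol every time, which is exactly the duplication that the unique variant avoids.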

Paths can also be built using methods attached to the first node in the path:

Finally, the WriteBatch class offers a similar helper method for building a single unique relationship:

Conclusion

Uniqueness management is an essential part of most data-driven applications. So, it pays to get familiar with the key Neo4j mechanisms, unique indexes and CREATE UNIQUE, as well as the wrapper methods provided by Py2neo. This way, your data duplication headaches will hopefully become a thing of the past!


About the author

Nigel began programming at an early age and has worked professionally in a variety of computing roles for over 15 years. His current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

