
Managing Neo4j Uniqueness with Py2neo

A guest post from Nigel Small, whose current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

Data duplication isn’t fun, and many of us have spent hours tracking a duplication problem back to its source in an effort to plug the leaks in some legacy system. It’s better to take some time to prevent the problem in the first place and, since almost every non-trivial piece of data-driven software has some requirement to manage uniqueness, it’s worth knowing how to do so.

In Neo4j, it may be necessary to ensure that individual nodes are unique within a particular context, or there may be more complex uniqueness requirements across paths or subgraphs. Read Using Neo4j from Python for more on Neo4j and the Py2neo library. Neo4j provides two basic ways to ensure uniqueness: unique index entries and the Cypher CREATE UNIQUE clause. The first allows nodes (or, less commonly, relationships) to be tagged with a key-value pair which, if used correctly, identifies only a single entry. The second ensures that paths across two or more nodes are unique relative to known reference points.

Unique Index Entries

It’s important to remember that as far as Neo4j itself is concerned, a standard index and a unique index are fundamentally the same thing. The difference exists only in the methods used to work with the index. Broken application code can still permit multiple entries and, for that reason, applications should gracefully handle the condition of multiple index entries when only one is expected.
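
To make the distinction concrete, here is a minimal in-memory sketch. The ToyIndex class and its methods are hypothetical stand-ins for a Neo4j index and are not part of Py2neo. It only illustrates that the same underlying structure stays unique as long as every writer goes through the conditional method, and stops being unique as soon as one code path does not.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical stand-in for an index: maps (key, value) to a list of entries."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, key, value, entity):
        # Unconditional add: nothing stops a second entry under the same pair.
        self._entries[(key, value)].append(entity)
        return entity

    def add_if_none(self, key, value, entity):
        # Conditional add: only succeeds if the entry point is still empty.
        bucket = self._entries[(key, value)]
        if bucket:
            return None
        bucket.append(entity)
        return entity

    def get(self, key, value):
        return list(self._entries.get((key, value), []))


toy_people = ToyIndex()
toy_people.add_if_none("name", "Alice", {"name": "Alice"})
toy_people.add_if_none("name", "Alice", {"name": "Alice"})  # ignored, entry exists
unique_count = len(toy_people.get("name", "Alice"))

toy_people.add("name", "Alice", {"name": "Alice"})          # the "broken" code path
duplicate_count = len(toy_people.get("name", "Alice"))
```

After the two conditional calls the entry holds a single node; the one unconditional call is enough to leave two, which is exactly the situation a reader of the index has to be prepared for.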

Py2neo, a Python library that provides access to Neo4j via its RESTful web service interface, exposes unique index management through the Index class (unsurprisingly) as well as through the WriteBatch class and the high-level get_or_create_indexed_node function. This function is a convenient wrapper around functionality from the Index class, and it is often used to create fixed reference points within the graph.

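The Index class provides three atomic methods for working with unique nodes (or relationships): get_or_create, which returns the existing entry or creates a new one; create_if_none, which creates a new entity only when the entry point is empty; and add_if_none, which adds an existing entity only when the entry point is empty.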

The code below illustrates the utility of these methods. It shows several ways to approach the problem of fetching a node from an index that uniquely represents a person:

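from py2neo import neo4j

# Connect to the default local server and fetch (or create) the node index
graph_db = neo4j.GraphDatabaseService()
people = graph_db.get_or_create_index(neo4j.Node, "people")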

def get_person(name):
    """ A simple, naïve function to fetch the first person indexed with a
    particular name.
    """
    person_nodes = people.get("name", name)
    return person_nodes[0]

def get_person_if_exists(name):
    """ An extension of the function above, dealing with the case where no-one
    has that name.
    """
    person_nodes = people.get("name", name)
    if person_nodes:
        return person_nodes[0]
    else:
        return None

def get_person_safely(name):
    """ A further extension, recognising the case where multiple nodes are
    indexed under the same name.
    """
    person_nodes = people.get("name", name)
    if len(person_nodes) == 1:
        return person_nodes[0]
    elif person_nodes:
        raise LookupError("Multiple people found")
    else:
        return None

def get_or_create_person(name):
    """ Now, if the person doesn't exist, create them and return the new node
    instead.
    """
    person_nodes = people.get("name", name)
    if len(person_nodes) == 1:
        return person_nodes[0]
    elif person_nodes:
        raise LookupError("Multiple people found")
    else:
        return people.create("name", name, {"name": name})

def get_or_create_person_safely(name):
    """ And finally, avoid race conditions by using an atomic get_or_create
    method.
    """
    try:
        person_node = people.get_or_create("name", name, {"name": name})
    except Exception:
        raise LookupError("Multiple people found")
    else:
        return person_node
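
The race that the last function avoids is a check-then-act gap: two clients both see an empty index entry and both create a node. As a rough illustration only, the sketch below models atomicity with a lock around an in-memory dictionary. The names AtomicRegistry and worker are hypothetical, and in Neo4j the guarantee comes from the server, not from client-side locking.

```python
import threading


class AtomicRegistry:
    """Hypothetical in-memory model of an atomic get-or-create."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}
        self.created = 0

    def get_or_create(self, key, factory):
        # The lookup and the creation happen under one lock, so no other
        # thread can slip in between the check and the insert.
        with self._lock:
            if key not in self._items:
                self._items[key] = factory()
                self.created += 1
            return self._items[key]


registry = AtomicRegistry()
results = []
results_lock = threading.Lock()


def worker():
    node = registry.get_or_create("Alice", lambda: {"name": "Alice"})
    with results_lock:
        results.append(node)


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread received the very same object, and it was created only once.
all_same = all(node is results[0] for node in results)
```

If the check and the insert were performed as two separate steps without the lock, as in get_or_create_person above, two threads could both find the key missing and both create a node.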

Similar functionality is provided by the WriteBatch class with the get_or_create_indexed_node and get_or_add_indexed_node methods for conditional insertion of new and existing nodes respectively. Version 1.9 of the Neo4j server introduced extra functionality, allowing the introduction of two further batch methods: create_indexed_node_or_fail and add_indexed_node_or_fail. As their names imply, these fail the entire batch if nodes already exist at the specified entry point. Equivalent methods exist for relationship indexes.
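
The "or fail" behaviour can be pictured as an all-or-nothing batch. The following is a toy model and not the Py2neo API: ToyBatch, create_or_fail and BatchError are invented for illustration, and the index is a plain dictionary.

```python
class BatchError(Exception):
    """Raised when any job in the batch hits an existing index entry."""


class ToyBatch:
    """Hypothetical batch that applies all of its jobs or none of them."""

    def __init__(self, index):
        self._index = index      # dict mapping (key, value) -> node
        self._jobs = []

    def create_or_fail(self, key, value, properties):
        self._jobs.append(((key, value), properties))

    def submit(self):
        # Work on a copy so that a failure leaves the real index untouched.
        staged = dict(self._index)
        for entry_point, properties in self._jobs:
            if entry_point in staged:
                raise BatchError("entry already exists: %r" % (entry_point,))
            staged[entry_point] = dict(properties)
        self._index.clear()
        self._index.update(staged)
        self._jobs = []


index = {("name", "Alice"): {"name": "Alice"}}

batch = ToyBatch(index)
batch.create_or_fail("name", "Bob", {"name": "Bob"})
batch.create_or_fail("name", "Alice", {"name": "Alice"})  # already indexed

try:
    batch.submit()
    failed = False
except BatchError:
    failed = True

# Bob was not created either, because the whole batch was rejected.
bob_created = ("name", "Bob") in index
```

The point of the design is that a caller never has to clean up a half-applied batch: either every node in it is new, or nothing changes.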

CREATE UNIQUE

Version 1.8 of the Neo4j server introduced mutating Cypher, and with it came the CREATE UNIQUE clause. CREATE UNIQUE is defined by the manual as “in the middle of MATCH and CREATE - it will match what it can, and create what is missing.” It’s used to build relationships or chains of relationships by doing just that.

Obviously the CREATE UNIQUE clause can be used within a direct Cypher query. The recommended way to use it from Py2neo, though, is by using the Path class. The following code shows how to define a path starting from a known node:

from py2neo import neo4j
graph_db = neo4j.GraphDatabaseService()
alice, = graph_db.create({"name": "Alice"})
chain_of_friends = neo4j.Path(alice, "KNOWS", {"name": "Bob"},
                              "KNOWS", {"name": "Carol"})

At this point, the path is defined, but nothing has actually been created in the database. To do so, use either the create method (which uses Cypher CREATE) or the get_or_create method (which uses CREATE UNIQUE):

# create a brand new path
path = chain_of_friends.create(graph_db)

# create a new path if one doesn't already exist
path = chain_of_friends.get_or_create(graph_db)

As might be expected, the get_or_create method can be safely executed any number of times and will always return the same chain of nodes and relationships. This even works to match partial paths, creating only the bits that are missing.
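
To see why repeated calls are harmless, it can help to model the "match what it can, and create what is missing" rule in a few lines. This is a simplified in-memory sketch under stated assumptions: a step matches when an outgoing relationship of the given type leads to a node with exactly the given properties. ToyGraph and its methods are hypothetical and do not reflect how Neo4j or Py2neo implement CREATE UNIQUE.

```python
class ToyGraph:
    """Hypothetical graph: nodes are dicts, relationships are (start, type, end) triples."""

    def __init__(self):
        self.nodes = []          # a node's id is its position in this list
        self.rels = set()

    def create_node(self, properties):
        self.nodes.append(dict(properties))
        return len(self.nodes) - 1

    def get_or_create_path(self, start, *steps):
        # steps alternate: rel_type, properties, rel_type, properties, ...
        path = [start]
        current = start
        for rel_type, properties in zip(steps[0::2], steps[1::2]):
            match = None
            for (s, t, e) in self.rels:
                if s == current and t == rel_type and self.nodes[e] == properties:
                    match = e
                    break
            if match is None:
                # Nothing matched from here, so create the missing piece.
                match = self.create_node(properties)
                self.rels.add((current, rel_type, match))
            path.append(match)
            current = match
        return path


graph = ToyGraph()
alice_id = graph.create_node({"name": "Alice"})

# A partial path exists first: Alice KNOWS Bob.
partial = graph.get_or_create_path(alice_id, "KNOWS", {"name": "Bob"})

# Extending it reuses Bob and creates only Carol.
first = graph.get_or_create_path(alice_id, "KNOWS", {"name": "Bob"},
                                 "KNOWS", {"name": "Carol"})
second = graph.get_or_create_path(alice_id, "KNOWS", {"name": "Bob"},
                                  "KNOWS", {"name": "Carol"})
```

Here the second and third calls return the same node ids, and the toy graph ends up with three nodes and two relationships no matter how many times the last call is repeated.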

Paths can also be built using methods attached to the first node in the path:

# create a brand new path
path = alice.create_path("KNOWS", {"name": "Bob"}, "KNOWS", {"name": "Carol"})

# create a new path if one doesn't already exist
path = alice.get_or_create_path("KNOWS", {"name": "Bob"}, "KNOWS", {"name": "Carol"})

Finally, the WriteBatch class offers a similar helper method for building a single unique relationship:

batch = neo4j.WriteBatch(graph_db)
batch.get_or_create_relationship(alice, "KNOWS", {"name": "Bob"})
batch.submit()

Conclusion

Uniqueness management is an essential part of most data-driven applications. So, it pays to get familiar with the key Neo4j mechanisms, unique indexes and CREATE UNIQUE, as well as the wrapper methods provided by Py2neo. This way, your data duplication headaches will hopefully become a thing of the past!


About the author

Nigel began programming at an early age and has worked professionally in a variety of computing roles for over 15 years. His current areas of interest include Python, JavaScript, PostgreSQL, Neo4j and Linux. He has also founded a number of open source projects, most significantly py2neo, and is an active blogger, speaker and Neo4j community member who can be reached at @technige.

