Neo4j Graph Algorithms: (3) Similarity Algorithms

Published by Cristian Scutaru on

Similarity algorithms compute the similarity of pairs of nodes using different vector-based metrics. This visual presentation of the Neo4j graph algorithms is focused on quick understanding and less implementation details. With small reusable samples, for less time-consuming labs.

Table of Contents

Introduction

This article presents quickly – in a graphical and descriptive manner, skipping many implementation details – most of the Similarity algorithms implemented by Neo4j in their Graph Data Science (GDS) library. We focus on one or two small samples reused by most of the algorithms, to keep it simple and allow for less time-consuming labs.

Run the following CREATE query in a new empty project. All these Cypher queries can be run quickly in a new Blank Project on the free online Neo4j Sandbox. Or, if you have your own environment, on your Neo4j Desktop.

CREATE (john:Person {name: 'John', age: 24}),
   (mary:Person {name: 'Mary', age: 24}),
   (mike:Person {name: 'Mike', age: 73}),
   (arya:Person {name: 'Arya', age: 48}),
   (french:Cuisine {name:'French'}),
   (italian:Cuisine {name:'Italian'}),
   (indian:Cuisine {name:'Indian'}),
   (john)-[:LIKES {score: 9}]->(indian),
   (mary)-[:LIKES {score: 5}]->(french),
   (mary)-[:LIKES {score: 7}]->(italian),
   (mike)-[:LIKES {score: 3}]->(french),
   (mike)-[:LIKES {score: 9}]->(italian),
   (mike)-[:LIKES {score: 1}]->(indian),
   (arya)-[:LIKES {score: 2}]->(french),
   (arya)-[:LIKES {score: 6}]->(indian);

This creates the following bipartite graph in your database, with two sets of nodes labeled Person and Cuisine. Relationships are in the form one Person LIKES one or more types of Cuisine, with a 1 to 10 score relationship factor telling how much:

Call anytime MATCH (n) RETURN n; to get a simplified visual representation of the whole graph.

1. Jaccard Similarity Function

Jaccard Similarity is a coefficient that measures similarities between sets (lists interpreted as multisets). It is defined as the size of the intersection divided by the size of the union of two sets.

MATCH (p1:Person {name: 'John'})-[:LIKES]->(cuisine1)
WITH p1, collect(id(cuisine1)) AS p1Cuisine
MATCH (p2:Person)-[:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, p1Cuisine, p2, collect(id(cuisine2)) AS p2Cuisine
RETURN p1.name AS from, p2.name AS to,
   gds.alpha.similarity.jaccard(p1Cuisine, p2Cuisine) AS jaccard;

GDS implements a few simple procedures that calculate a similarity coefficient between two different nodes from the same set. In our samples, we’ll look in particular how similar the John node is to the other Person nodes (Mary, Mike and Arya), based on different criteria (the left side of the next image).

The previous query returns the Jaccard coefficient – that you can check on the next image – when comparing John with Mary, Mike and Arya based on the type of Cuisine they each like (how much was not important, as this was considered and unweighted graph):

You can see John is most similar to Arya, because they both like Indian cuisine. They are only 0.5 similar, because Arya also likes French cuisine.

2. Node Similarity Algorithm

Node Similarity is an algorithm that compares a set of nodes based on the nodes they are connected to. Two nodes are considered similar if they share many of the same neighbors. Node Similarity computes pair-wise similarities based on the Jaccard Similarity Score.

Calling the following two queries will first prepare an in-memory graph, then run the Node Similarity algorithm on it:

CALL gds.graph.create('myGraph', ['Person', 'Cuisine'], {
   LIKES: {
      type: 'LIKES',
      properties: { score: { property: 'score', defaultValue: 1.0 } }
   }
});

CALL gds.nodeSimilarity.stream('myGraph')
YIELD node1, node2, similarity
WITH gds.util.asNode(node1).name AS from,
   gds.util.asNode(node2).name AS to,
   similarity
WHERE from = 'John'
RETURN from, to, similarity;

Returned values here – for John to the other three persons – should be very similar to the ones returned for the Jaccard coefficient.

Discard the native graph from memory now – with CALL gds.graph.drop(‘myGraph’); – as we’ll no longer need it.

3. Cosine Similarity Function

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors’ lengths (or magnitudes).

MATCH (p1:Person {name: 'John'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person)-[likes2:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, collect(likes1.score) AS p1L, p2, collect(likes2.score) AS p2L
RETURN p1.name AS from, p2.name AS to,
   gds.alpha.similarity.cosine(p1L, p2L) AS cosine;

Look at the returned cosine values and compare them with those from our last image.

4. Overlap Similarity Algorithm

Overlap similarity measures overlap between two sets. It is defined as the size of the intersection of two sets, divided by the size of the smaller of the two sets.

MATCH (p1:Person)-[:LIKES]->(cuisine1)
WITH { item: id(p1), categories: collect(id(cuisine1)) } AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.overlap.stream({data: data})
YIELD item1, item2, similarity
RETURN gds.util.asNode(item1).name AS from,
   gds.util.asNode(item2).name AS to,
   similarity AS overlap;

You can see John and Mary have a 0 overlap, because they like different cuisines. But John has a 1 overlap with either Mike or Arya, because they all like Indian cuisine.

5. Pearson Similarity Function

Pearson similarity is the covariance of the two n-dimensional vectors divided by the product of their standard deviations.

MATCH (p1:Person {name: 'John'})-[likes1:LIKES]->(cuisine1)
WITH p1, gds.alpha.similarity.asVector(cuisine1, likes1.score) AS p1L
MATCH (p2:Person)-[likes2:LIKES]->(cuisine2) WHERE p1 <> p2
WITH p1, p2, p1L, 
   gds.alpha.similarity.asVector(cuisine2, likes2.score) AS p2L
RETURN p1.name AS from, p2.name AS to,
   gds.alpha.similarity.pearson(p1L, p2L, {vectorType:"maps"}) AS pearson;

The Pearson coefficient for John was always 0 here.

6. Euclidean Distance Function

Euclidean distance measures the straight line distance between two points in n-dimensional space.

MATCH (p1:Person {name: 'John'})-[likes1:LIKES]->(cuisine)
MATCH (p2:Person)-[likes2:LIKES]->(cuisine) WHERE p2 <> p1
WITH p1, p2, collect(likes1.score) as p1L, collect(likes2.score) as p2L
RETURN p1.name AS from, p2.name AS to,
   gds.alpha.similarity.euclideanDistance(p1L, p2L) AS euclidean;

The largest calculated euclidean distance here is between John and Mike (8).

7. K-Nearest Neighbors (KNN) Algorithm

Unlike all previous Similarity algorithms, KNN computes a distance value for all node pairs in the graph and creates new relationships between each node and its k nearest neighbors. The distance is calculated based on node properties.

The following sample creates an in-memory projection of the graph, based on the age property of a Person (the Cuisine nodes do not matter at all here). Then it creates new SIMILAR relationships between the Person nodes, and assigns score relationship property values based on the age difference:

CALL gds.graph.create('myGraph', {
   Person: { label: 'Person', properties: 'age' }
}, '*');

CALL gds.beta.knn.write('myGraph', {
   writeRelationshipType: 'SIMILAR',
   writeProperty: 'score',
   topK: 1,
   randomSeed: 42,
   nodeWeightProperty: 'age'
})
YIELD nodesCompared, relationshipsWritten;

Call MATCH (n) RETURN n; for a visual representation of the graph. On the more elaborated image below, look at the newly created SIMILAR relationships between the Person nodes, in pink, with a score property:

Based on the age of each Person, KNN determined John and Mary have the same age (24, with a score of 1). It also created new SIMILAR relationships between Mike and Arya, and Arya and Mary, and assigned a score value proportional with the difference in age between them.

8. Approximate KNN (ANN) Algorithm

ANN constructs a KNN graph for a set of objects based on a provided similarity algorithm. The similarity of items is computed based on Jaccard Similarity, Cosine Similarity, Euclidean Distance, or Pearson Similarity.

The following example generates such SIMILAR relationships with a score between all Cuisine nodes, based on the number of Person nodes that like each type of cuisine (the score of each LIKES doesn’t matter):

MATCH (p:Person)-[:LIKES]->(cuisine)
WITH { item: id(cuisine), categories: collect(id(p))} AS userData
WITH collect(userData) AS data
CALL gds.alpha.ml.ann.write({ data: data, algorithm: 'jaccard' })
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs;

Look at the Cuisine nodes and the new pink SIMILAR relationships between them. There is a score of 0.5 between French and Indian cuisines, because they are each likes by 3 persons:

Conclusions

Read all articles from the same Neo4j Graph Algorithms series:

  1. Path Finding Algorithms
  2. Centrality Algorithms
  3. Similarity Algorithms
  4. Community Detection Algorithms
  5. Link Prediction Algorithms
Need a NoSQL Expert?

Certified Solutions Architect in Azure and AWS
Certified Professional in Cassandra, Couchbase, Redis, Neo4j
Experienced in DynamoDB, Cosmos DB, MongoDB

Categories: Graphs

Cristian Scutaru

I designed and implemented the Data Xtractor suite, with Model Xtractor, Query Xtractor, and Visual Xtractor as separate modules. I am a software architect and developer with over 30 years professional experience. I’ve been working with relational databases for almost three decades and I was constantly unhappy with the relative limitation of those tools used to connect directly to a platform, and instantly extract and display data in flexible ways.