Neo4j vs Relational: Refactoring – Extracting node/table
In my previous blog post I showed how to add a new property/field to a node with a label/record in a table for a football transfers dataset that I’ve been playing with.
After introducing this ‘nationality’ property I realised that I now had some duplication in the model:
players.nationality and clubs.country are referring to the same countries but they’ve both got them stored as strings so we can’t ensure the integrity of our countries and ensure that we’re referring to the same country.
We have the same issue in the graph model as well:
This time Player.nationality and Club.country refer to the same countries.
We can solve our problem by introducing a countries table in the relational model and a set of nodes with a ‘Country’ label in the graph model. Let’s start with relational.
This is the model we’re driving towards:
The first thing we need to do is create a countries table and populate it:
CREATE TABLE countries ( "code" CHARACTER VARYING(3) NOT NULL PRIMARY KEY, "name" CHARACTER VARYING(50) NOT NULL );
INSERT INTO countries VALUES('MNE', 'Montenegro'); INSERT INTO countries VALUES('SWZ', 'Swaziland', 'caf'); ...
Next let’s update the clubs table to reference the countries table:
ALTER TABLE clubs ADD COLUMN country_id CHARACTER VARYING(3) REFERENCES countries(code);
And let’s run a query to populate that column:
UPDATE clubs AS cl SET country_id = c.code FROM clubs INNER JOIN countries AS c ON c.name = clubs.country WHERE cl.id = clubs.id;
This query iterates over all the clubs, queries the country table to find the country id for that row and then stores it in the ‘country_id’ field. Finally we can remove the ‘country’ field:
ALTER TABLE clubs DROP COLUMN country;
Now we do the same drill for the players table:
ALTER TABLE players ADD COLUMN country_id CHARACTER VARYING(3) REFERENCES countries(code);
UPDATE players AS p SET country_id = c.code FROM players INNER JOIN countries AS c ON c.name = players.nationality WHERE p.id = players.id;
ALTER TABLE players DROP COLUMN nationality;
Now it’s time for the graph. This is the model we want to get to:
First we’ll create the countries:
CREATE CONSTRAINT ON (c:Country) ASSERT c.id IS UNIQUE
LOAD CSV WITH HEADERS FROM "file:///countries.csv" AS row MERGE (country:Country {id: row.countryCode}) ON CREATE SET country.name = row.country
And now let’s get clubs and players to point at those countries nodes and get rid of their respective nationality/country properties:
MATCH (club:Club) MATCH (country:Country {name: club.country}) MERGE (club)-[:PART_OF]->(country) REMOVE club.country
MATCH (player:Player) MATCH (country:Country {name: player.nationality}) MERGE (player)-[:PLAYS_FOR]->(country) REMOVE player.nationality
And that’s it, we can now write queries against our new model.
Reference: | Neo4j vs Relational: Refactoring – Extracting node/table from our JCG partner Mark Needham at the Mark Needham Blog blog. |