Neo4j: LOAD CSV – Processing hidden arrays in your CSV documents
I was recently asked how to process an ‘array’ of values inside a column in a CSV file using Neo4j’s LOAD CSV tool and although I initially thought this wouldn’t be possible as every cell is treated as a String, Michael showed me a way of working around this which I thought was pretty neat.
Let’s say we have a CSV file representing people and their friends. It might look like this:
name,friends "Mark","Michael,Peter" "Michael","Peter,Kenny" "Kenny","Anders,Michael"
And what we want is to have the following nodes:
- Mark
- Michael
- Peter
- Kenny
- Anders
And the following friends relationships:
- Mark -> Michael
- Mark -> Peter
- Michael -> Peter
- Michael -> Kenny
- Kenny -> Anders
- Kenny -> Michael
We’ll start by loading the CSV file and returning each row:
$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row; +------------------------------------------------+ | row | +------------------------------------------------+ | {name -> "Mark", friends -> "Michael,Peter"} | | {name -> "Michael", friends -> "Peter,Kenny"} | | {name -> "Kenny", friends -> "Anders,Michael"} | +------------------------------------------------+ 3 rows
As expected the ‘friends’ column is being treated as a String which means we can use the split function to get an array of people that we want to be friends with:
$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row, split(row.friends, ",") AS friends; +-----------------------------------------------------------------------+ | row | friends | +-----------------------------------------------------------------------+ | {name -> "Mark", friends -> "Michael,Peter"} | ["Michael","Peter"] | | {name -> "Michael", friends -> "Peter,Kenny"} | ["Peter","Kenny"] | | {name -> "Kenny", friends -> "Anders,Michael"} | ["Anders","Michael"] | +-----------------------------------------------------------------------+ 3 rows
Now that we’ve got them as an array we can use UNWIND to get pairs of friends that we want to create:
$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row WITH row, split(row.friends, ",") AS friends UNWIND friends AS friend RETURN row.name, friend; +-----------------------+ | row.name | friend | +-----------------------+ | "Mark" | "Michael" | | "Mark" | "Peter" | | "Michael" | "Peter" | | "Michael" | "Kenny" | | "Kenny" | "Anders" | | "Kenny" | "Michael" | +-----------------------+ 6 rows
And now we’ll introduce some MERGE statements to create the appropriate nodes and relationships:
$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row WITH row, split(row.friends, ",") AS friends UNWIND friends AS friend MERGE (p1:Person {name: row.name}) MERGE (p2:Person {name: friend}) MERGE (p1)-[:FRIENDS_WITH]->(p2); +-------------------+ | No data returned. | +-------------------+ Nodes created: 5 Relationships created: 6 Properties set: 5 Labels added: 5 373 ms
And now if we query the database to get back all the nodes + relationships…
$ match (p1:Person)-[r]->(p2) RETURN p1,r, p2; +------------------------------------------------------------------------+ | p1 | r | p2 | +------------------------------------------------------------------------+ | Node[0]{name:"Mark"} | :FRIENDS_WITH[0]{} | Node[1]{name:"Michael"} | | Node[0]{name:"Mark"} | :FRIENDS_WITH[1]{} | Node[2]{name:"Peter"} | | Node[1]{name:"Michael"} | :FRIENDS_WITH[2]{} | Node[2]{name:"Peter"} | | Node[1]{name:"Michael"} | :FRIENDS_WITH[3]{} | Node[3]{name:"Kenny"} | | Node[3]{name:"Kenny"} | :FRIENDS_WITH[4]{} | Node[4]{name:"Anders"} | | Node[3]{name:"Kenny"} | :FRIENDS_WITH[5]{} | Node[1]{name:"Michael"} | +------------------------------------------------------------------------+ 6 rows
…you’ll see that we have everything. If instead of a comma separated list of people we have a literal array in the cell…
name,friends "Mark", "[Michael,Peter]" "Michael", "[Peter,Kenny]" "Kenny", "[Anders,Michael]"
…we’d need to tweak the part of the query which extracts our friends to strip off the first and last characters:
$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row RETURN row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends; +-------------------------------------------------------------------------+ | row | friends | +-------------------------------------------------------------------------+ | {name -> "Mark", friends -> "[Michael,Peter]"} | ["Michael","Peter"] | | {name -> "Michael", friends -> "[Peter,Kenny]"} | ["Peter","Kenny"] | | {name -> "Kenny", friends -> "[Anders,Michael]"} | ["Anders","Michael"] | +-------------------------------------------------------------------------+ 3 rows
And then if we put the whole query together we end up with this:
$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row WITH row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends UNWIND friends AS friend MERGE (p1:Person {name: row.name}) MERGE (p2:Person {name: friend}) MERGE (p1)-[:FRIENDS_WITH]->(p2);; +-------------------+ | No data returned. | +-------------------+ Nodes created: 5 Relationships created: 6 Properties set: 5 Labels added: 5
Reference: | Neo4j: LOAD CSV – Processing hidden arrays in your CSV documents from our JCG partner Mark Needham at the Mark Needham Blog blog. |