Neo4j: TF/IDF (and variants) with cypher
A few weeks ago I wrote a blog post on running TF/IDF over HIMYM transcripts using scikit-learn to find the most important phrases by episode, and afterwards I was curious how difficult it’d be to do the same in Neo4j.
I started by translating one of Wikipedia’s TF/IDF examples to Cypher to see what the algorithm would look like:
WITH 3 AS termFrequency, 2 AS numberOfDocuments, 1 AS numberOfDocumentsWithTerm
WITH termFrequency, log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency
RETURN termFrequency * inverseDocumentFrequency

0.9030899869919435
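Plugging the numbers in: the term appears 3 times in the document, and 1 of the 2 documents contains it, so the score is 3 * log10(2 / 1) = 3 * 0.30103 ≈ 0.903, which is exactly the value the query returns.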
Next I needed to go over the HIMYM episode transcripts and extract phrases and their corresponding counts in each episode. I used scikit-learn’s CountVectorizer to do this and wrote the results into a CSV file.
Here’s a preview of that file:
$ head -n 10 data/import/words_scikit.csv
EpisodeId,Phrase,Count
1,2005,1
1,2005 seven,1
1,2005 seven just,1
1,2030,3
1,2030 kids,1
1,2030 kids intently,1
1,2030 narrator,1
1,2030 narrator kids,1
1,2030 son,1
Now let’s import that into Neo4j using the LOAD CSV tool:
// phrases
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv"
AS row
MERGE (phrase:Phrase {value: row.Phrase});
// episode -> phrase
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS
FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/words_scikit.csv"
AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (episode)-[:CONTAINED_PHRASE {times: TOINT(row.Count)}]->(phrase);
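The second query assumes the :Episode nodes are already in the graph (they come from the earlier transcript work). One thing not shown in the original post, but worth doing if you’re following along, is creating an index on :Phrase(value) before loading the CSV so the MERGE and MATCH lookups don’t have to scan every phrase node (this uses the older CREATE INDEX ON syntax from the Neo4j versions of the time):

CREATE INDEX ON :Phrase(value);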
Now that all the data’s in, we can translate the TF/IDF query to make use of our graph. We’ll start with episode 1:
MATCH (e:Episode)
WITH COUNT(e) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode {id: 1})
WITH numberOfDocuments, p, r.times AS termFrequency
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)
WITH p, COUNT(otherEpisode) AS numberOfDocumentsWithTerm, numberOfDocuments, termFrequency
WITH p, numberOfDocumentsWithTerm,
     log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency,
     termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency,
       termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +-----------------------------------------------------------------------------------+
==> | p.value | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score |
==> +-----------------------------------------------------------------------------------+
==> | "olives" | 18 | 2 | 2.0170333392987803 | 36.306600107378046 |
==> | "yasmine" | 13 | 1 | 2.3180633349627615 | 30.1348233545159 |
==> | "signal" | 11 | 5 | 1.6127838567197355 | 17.740622423917088 |
==> | "goanna" | 10 | 4 | 1.7160033436347992 | 17.16003343634799 |
==> | "flashback date" | 6 | 1 | 2.3180633349627615 | 13.908380009776568 |
==> | "scene" | 17 | 37 | 0.6989700043360189 | 11.88249007371232 |
==> | "flashback date robin" | 5 | 1 | 2.3180633349627615 | 11.590316674813808 |
==> | "ted yasmine" | 5 | 1 | 2.3180633349627615 | 11.590316674813808 |
==> | "smurf pen1s" | 5 | 2 | 2.0170333392987803 | 10.085166696493902 |
==> | "eye patch" | 5 | 2 | 2.0170333392987803 | 10.085166696493902 |
==> +-----------------------------------------------------------------------------------+
==> 10 rows
The score we’ve calculated is different to scikit-learn’s, but the relative order seems fine, so that’s good. (The absolute numbers were never going to match: this query uses log10 and Cypher’s integer division, whereas scikit-learn’s TF/IDF uses natural logs with smoothing and normalisation by default.) The neat thing about calculating this in Neo4j is that we can now vary the ‘inverse document’ part of the equation, e.g. to find out the most important phrases in a season rather than an episode:
MATCH (:Season)
WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,
     log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency,
     termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency,
       termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +-----------------------------------------------------------------------------------+
==> | p.value | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score |
==> +-----------------------------------------------------------------------------------+
==> | "moby" | 46 | 1 | 0.9542425094393249 | 43.895155434208945 |
==> | "int" | 71 | 3 | 0.47712125471966244 | 33.87560908509603 |
==> | "ellen" | 53 | 2 | 0.6020599913279624 | 31.909179540382006 |
==> | "claudia" | 104 | 4 | 0.3010299956639812 | 31.307119549054043 |
==> | "ericksen" | 59 | 3 | 0.47712125471966244 | 28.150154028460083 |
==> | "party number" | 29 | 1 | 0.9542425094393249 | 27.67303277374042 |
==> | "subtitle" | 27 | 1 | 0.9542425094393249 | 25.76454775486177 |
==> | "vo" | 47 | 3 | 0.47712125471966244 | 22.424698971824135 |
==> | "ted vo" | 47 | 3 | 0.47712125471966244 | 22.424698971824135 |
==> | "future ted vo" | 45 | 3 | 0.47712125471966244 | 21.47045646238481 |
==> +-----------------------------------------------------------------------------------+
==> 10 rows
From this query we learn that ‘Moby’ only shows up in one season of the whole series, and in fact all of those mentions come from the same episode. The occurrence of ‘int’ seems to be more of a data issue – in some episodes the transcript describes the location but in many it doesn’t:
$ ack -iw "int" data/import/sentences.csv 2361,8,1,8,"INT. LIVING ROOM, YEAR 2030" 2377,8,1,8,INT. CHINESE RESTAURANT 2395,8,1,8,INT. APARTMENT 2412,8,1,8,INT. APARTMENT 2419,8,1,8,INT. BAR 2472,8,1,8,INT. APARTMENT 2489,8,1,8,INT. BAR 2495,8,1,8,INT. APARTMENT 2506,8,1,8,INT. BAR 2584,8,1,8,INT. APARTMENT 2629,8,1,8,INT. RESTAURANT 2654,8,1,8,INT. APARTMENT 2682,8,1,8,INT. RESTAURANT 2689,8,1,8,(Robin gets up and leaves restaurant) INT. HOSPITAL WAITING AREA
‘vo’ stands for voice over, which should probably be added to the stop words as it doesn’t add much value. It shows up here because the transcripts aren’t consistent in the way they represent Future Ted speaking.
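Until the stop word list is fixed at extraction time, one possible stopgap (just a sketch, not something from the original post) is to filter the known offenders out in the query itself, e.g. for season 1:

MATCH (:Season) WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "1"})
WHERE NOT p.value IN ["int", "vo", "ted vo", "future ted vo"]
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(other:Season)
WITH p, termFrequency, COUNT(DISTINCT other) AS numberOfDocumentsWithTerm, numberOfDocuments
RETURN p.value, termFrequency * log10(numberOfDocuments / numberOfDocumentsWithTerm) AS score
ORDER BY score DESC
LIMIT 10;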
Let’s have a look at the final season to see how that fares:
MATCH (:Season)
WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(s:Season {number: "9"})
WITH p, SUM(r.times) AS termFrequency, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(otherEpisode:Episode)-[:IN_SEASON]->(s:Season)
WITH p, COUNT(DISTINCT s) AS numberOfDocumentsWithTerm, termFrequency, numberOfDocuments
WITH p, numberOfDocumentsWithTerm,
     log10(numberOfDocuments / numberOfDocumentsWithTerm) AS inverseDocumentFrequency,
     termFrequency, numberOfDocuments
RETURN p.value, termFrequency, numberOfDocumentsWithTerm, inverseDocumentFrequency,
       termFrequency * inverseDocumentFrequency AS score
ORDER BY score DESC
LIMIT 10

==> +-----------------------------------------------------------------------------------+
==> | p.value | termFrequency | numberOfDocumentsWithTerm | inverseDocumentFrequency | score |
==> +-----------------------------------------------------------------------------------+
==> | "ring bear" | 28 | 1 | 0.9542425094393249 | 26.718790264301095 |
==> | "click options" | 26 | 1 | 0.9542425094393249 | 24.810305245422448 |
==> | "thank linus" | 26 | 1 | 0.9542425094393249 | 24.810305245422448 |
==> | "vow" | 39 | 2 | 0.6020599913279624 | 23.480339661790534 |
==> | "just click" | 24 | 1 | 0.9542425094393249 | 22.901820226543798 |
==> | "rehearsal dinner" | 23 | 1 | 0.9542425094393249 | 21.947577717104473 |
==> | "linus" | 36 | 2 | 0.6020599913279624 | 21.674159687806647 |
==> | "just click options" | 22 | 1 | 0.9542425094393249 | 20.993335207665147 |
==> | "locket" | 32 | 2 | 0.6020599913279624 | 19.265919722494797 |
==> | "cassie" | 19 | 1 | 0.9542425094393249 | 18.13060767934717 |
==> +-----------------------------------------------------------------------------------+
There are several phrases which are specific to Barney & Robin’s wedding (‘vow’, ‘ring bear’, ‘rehearsal dinner’), so it makes sense that those come out top. The ‘linus’ here mostly refers to the server at the bar who interacts with Lily, although a quick search over the transcripts reveals that she also had an Uncle Linus!
$ ack -iw "linus" data/import/sentences.csv | head -n 5 18649,61,3,17,Lily: Why don't we just call Duluth Mental Hospital and say my Uncle Linus can live with us? 59822,185,9,1,Linus. 59826,185,9,1,"Are you my guy, Linus?" 59832,185,9,1,Thank you Linus. 59985,185,9,1,"Thank you, Linus." ...
From doing this exercise I think TF/IDF is an interesting way of exploring unstructured data, but for a phrase to be really interesting to us it should appear across multiple episodes/seasons.

One way to achieve that would be to weight those features more, so I’ll try that out next.
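For example (a rough sketch of the idea rather than a finished scoring scheme), we could multiply the score by the number of episodes within the season that contain the phrase, so that phrases confined to a single episode, like ‘moby’, get pushed down the list:

MATCH (:Season) WITH COUNT(*) AS numberOfDocuments
MATCH (p:Phrase)<-[r:CONTAINED_PHRASE]-(e:Episode)-[:IN_SEASON]->(:Season {number: "1"})
WITH p, SUM(r.times) AS termFrequency, COUNT(DISTINCT e) AS episodesWithTerm, numberOfDocuments
MATCH (p)<-[:CONTAINED_PHRASE]-(:Episode)-[:IN_SEASON]->(other:Season)
WITH p, termFrequency, episodesWithTerm, COUNT(DISTINCT other) AS numberOfDocumentsWithTerm, numberOfDocuments
RETURN p.value, episodesWithTerm,
       episodesWithTerm * termFrequency * log10(numberOfDocuments / numberOfDocumentsWithTerm) AS weightedScore
ORDER BY weightedScore DESC
LIMIT 10;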
- All the code in this post is on GitHub if you want to take a look and improve it.