Neo4j: The BBC Champions League graph
A couple of weekends ago I started scraping the BBC live text feed of the Bayern Munich/Barcelona match, initially starting out with just the fouls and building the foul graph.
I’ve spent a bit more time on it since then and have managed to model several other events as well including attempts, goals, cards and free kicks.
I started doing this just for the Bayern Munich/Barcelona match but realised it wasn’t particularly difficult to extend this out and graph the events for every match in the Champions League 2014/2015.
To do this we first need to download the pages for each of the matches. I downloaded this page and wrote a simple Python script to get the appropriate URIs:
from bs4 import BeautifulSoup from soupselect import select import bs4 soup = BeautifulSoup(open("data/results", "r")) matches = select(soup, "a.report") for match in matches: print "http://www.bbc.co.uk/%s" %(match.get("href"))
I then piped the output of running this script into wget:
find_all_matches.py | xargs wget -O data/raw
It was relatively simple to update the scraping and import code to handle multiple matches. The whole process from end to end looks like this:
Most of the code is in the ‘scraping magic’ phase where I go through all the events and pull out the appropriate elements that we can link together in the graph.
e.g. a freekick and foul event are typically adjacent so we’d look to pull out the two players involved, the type of card issues, the time of the incident and the match the event occurred in.
I used Python’s Beautiful Soup library for this task but there’s no reason you couldn’t use another set of tools.
The README page shows how to create your own version of the graph but here’s an overview of what the graph looks like using Rik’s meta graph query:
Here’s a few of my favourite queries so far:
Which player with more than 10 shots has the best conversion rate?
match (a:Attempt)<-[:HAD_ATTEMPT]-(app)<-[:MADE_APPEARANCE]-(player), (app)-[:FOR_TEAM]-(team) WITH player, COUNT(*) as times, COLLECT(a) AS attempts, team WITH player, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team WHERE times > 10 RETURN player.name, team.name, goals, times, (goals * 1.0 / times) AS conversionRate ORDER BY conversionRate DESC LIMIT 10 ==> +------------------------------------------------------------------------------------+ ==> | player.name | team.name | goals | times | conversionRate | ==> +------------------------------------------------------------------------------------+ ==> | "Luiz Adriano" | "Shakhtar Donetsk" | 9 | 14 | 0.6428571428571429 | ==> | "Yacine Brahimi" | "FC Porto" | 5 | 13 | 0.38461538461538464 | ==> | "Mario Mandzukic" | "Atlético de Madrid" | 5 | 14 | 0.35714285714285715 | ==> | "Sergio Agüero" | "Manchester City" | 6 | 18 | 0.3333333333333333 | ==> | "Karim Benzema" | "Real Madrid" | 6 | 19 | 0.3157894736842105 | ==> | "Klaas-Jan Huntelaar" | "FC Schalke 04" | 5 | 16 | 0.3125 | ==> | "Neymar" | "Barcelona" | 9 | 29 | 0.3103448275862069 | ==> | "Thomas Müller" | "FC Bayern München" | 7 | 24 | 0.2916666666666667 | ==> | "Jackson Martínez" | "FC Porto" | 7 | 24 | 0.2916666666666667 | ==> | "Callum McGregor" | "Celtic" | 3 | 11 | 0.2727272727272727 | ==> +------------------------------------------------------------------------------------+
Which players gained immediate revenge for a foul?
match (firstFoul:Foul)-[:COMMITTED_AGAINST]->(app1)<-[:MADE_APPEARANCE]-(revengeFouler), (app1)-[:IN_MATCH]->(match), (firstFoulerApp)-[:COMMITTED_FOUL]->(firstFoul), (app1)-[:COMMITTED_FOUL]->(revengeFoul)-[:COMMITTED_AGAINST]->(firstFoulerApp), (firstFouler)-[:MADE_APPEARANCE]->(firstFoulerApp) WHERE (firstFoul)-[:NEXT]->(revengeFoul) RETURN firstFouler.name AS firstFouler, revengeFouler.name AS revengeFouler, firstFoul.time, revengeFoul.time, match.home + " vs " + match.away ==> +---------------------------------------------------------------------------------------------------------------------------------+ ==> | firstFouler | revengeFouler | firstFoul.time | revengeFoul.time | match.home + " vs " + match.away | ==> +---------------------------------------------------------------------------------------------------------------------------------+ ==> | "Derk Boerrigter" | "Jean Philippe Mendy" | "88:48" | "89:42" | "Celtic vs NK Maribor" | ==> | "Mario Suárez" | "Pajtim Kasami" | "27:17" | "32:38" | "Olympiakos vs Atlético de Madrid" | ==> | "Aleksandr Volodko" | "Casemiro" | "39:27" | "44:32" | "FC Porto vs BATE Borisov" | ==> | "Thomas Müller" | "Mario Fernandes" | "87:22" | "88:31" | "CSKA Moscow vs FC Bayern München" | ==> | "Vinicius" | "Marco Verratti" | "56:36" | "58:00" | "APOEL Nicosia vs Paris Saint Germain" | ==> | "Lasse Schöne" | "Dani Alves" | "84:08" | "86:18" | "Barcelona vs Ajax" | ==> | "Nick Viergever" | "Dani Alves" | "57:22" | "60:37" | "Barcelona vs Ajax" | ==> | "Nani" | "Atsuto Uchida" | "6:10" | "8:40" | "FC Schalke 04 vs Sporting Lisbon" | ==> | "Andreas Samaris" | "Yannick Ferreira-Carrasco" | "89:21" | "90:00 +4:21" | "Monaco vs Benfica" | ==> | "Simon Kroon" | "Guillherme Siqueira" | "84:05" | "90:00 +0:29" | "Atlético de Madrid vs Malmö FF" | ==> | "Mario Suárez" | "Isaac Thelin" | "32:02" | "38:47" | "Atlético de Madrid vs Malmö FF" | ==> | "Hakan Balta" | "Henrikh Mkhitaryan" | "62:09" | "64:14" | "Borussia Dortmund vs Galatasaray" | ==> | "Marco Reus" | "Selcuk Inan" | "36:17" | "44:03" | "Borussia Dortmund vs Galatasaray" | ==> | "Hakan Balta" | "Sven Bender" | "10:57" | "12:51" | "Borussia Dortmund vs Galatasaray" | ==> | "Vinicius" | "Edinson Cavani" | "87:56" | "90:00 +1:25" | "Paris Saint Germain vs APOEL Nicosia" | ==> | "Jackson Martínez" | "Carlos Gurpegi" | "64:55" | "66:17" | "Athletic Club vs FC Porto" | ==> | "Nani" | "Chinedu Obasi" | "1:30" | "4:47" | "Sporting Lisbon vs FC Schalke 04" | ==> | "Vitali Rodionov" | "Bruno Martins Indi" | "52:16" | "60:08" | "BATE Borisov vs FC Porto" | ==> | "Raheem Sterling" | "Behrang Safari" | "29:00" | "33:27" | "Liverpool vs FC Basel" | ==> | "Derlis González" | "Fábio Coentrão" | "52:55" | "57:59" | "FC Basel vs Real Madrid" | ==> | "Josip Drmic" | "Lisandro López" | "15:04" | "17:35" | "Benfica vs Bayer 04 Leverkusen" | ==> | "Fred" | "Bastian Schweinsteiger" | "6:04" | "9:28" | "Shakhtar Donetsk vs FC Bayern München" | ==> | "Alex Sandro" | "Derlis González" | "4:07" | "7:28" | "FC Basel vs FC Porto" | ==> | "Luca Zuffi" | "Ruben Neves" | "73:49" | "84:44" | "FC Porto vs FC Basel" | ==> | "Marco Verratti" | "Oscar" | "28:49" | "34:04" | "Chelsea vs Paris Saint Germain" | ==> | "Cristiano Ronaldo" | "Jesús Gámez" | "20:59" | "25:37" | "Real Madrid vs Atlético de Madrid" | ==> | "Bernardo Silva" | "Álvaro Morata" | "49:20" | "62:31" | "Monaco vs Juventus" | ==> | "Arturo Vidal" | "Fabinho" | "38:19" | "45:00" | "Monaco vs Juventus" | ==> +---------------------------------------------------------------------------------------------------------------------------------+
Which players took the longest to gain revenge for a foul?
match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1), (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2), (foul1)-[:COMMITTED_IN_MATCH]->(match:Match)<-[:COMMITTED_IN_MATCH]-(foul2) WHERE (foul1)-[:NEXT*]->(foul2) WITH match, foul1, player1, player2, foul2 ORDER BY foul1.sortableTime, foul2.sortableTime WITH match, foul1, player1, player2, COLLECT(foul2) AS revenge WITH match, foul1, player1,player2, revenge[0] AS revengeFoul RETURN player1.name, player2.name, foul1.time, revengeFoul.time, revengeFoul.sortableTime - foul1.sortableTime AS secondsWaited, match.home + " vs " + match.away AS match ORDER BY secondsWaited DESC LIMIT 5 ==> +---------------------------------------------------------------------------------------------------------------------------+ ==> | player1.name | player2.name | foul1.time | revengeFoul.time | secondsWaited | match | ==> +---------------------------------------------------------------------------------------------------------------------------+ ==> | "Stefan Johansen" | "Ondrej Duda" | "1:30" | "82:11" | 4841 | "Legia Warsaw vs Celtic" | ==> | "Neymar" | "Vinicius" | "2:35" | "80:08" | 4653 | "Barcelona vs APOEL Nicosia" | ==> | "Jérémy Toulalan" | "Stefan Kießling" | "9:19" | "86:37" | 4638 | "Monaco vs Bayer 04 Leverkusen" | ==> | "Nabil Dirar" | "Domenico Criscito" | "6:32" | "82:39" | 4567 | "Zenit St Petersburg vs Monaco" | ==> | "Nabil Dirar" | "Eliseu" | "7:20" | "81:30" | 4450 | "Monaco vs Benfica" | ==> +---------------------------------------------------------------------------------------------------------------------------+
Who’s had the most shots?
match (team)<-[:FOR_TEAM]-(app)<-[appRel:MADE_APPEARANCE]-(player:Player) optional match (a:Attempt)<-[att:HAD_ATTEMPT]-(app) WITH player, COUNT( DISTINCT appRel) AS apps, COUNT(att) as times, COLLECT(a) AS attempts, team WITH player,apps, times, LENGTH([a in attempts WHERE a:Goal]) AS goals, team WHERE times > 10 RETURN player.name, team.name, apps, goals, times, (goals * 1.0 / times) AS conversionRate ORDER BY times DESC LIMIT 10 ==> +-------------------------------------------------------------------------------------------+ ==> | player.name | team.name | apps | goals | times | conversionRate | ==> +-------------------------------------------------------------------------------------------+ ==> | "Cristiano Ronaldo" | "Real Madrid" | 12 | 10 | 69 | 0.14492753623188406 | ==> | "Lionel Messi" | "Barcelona" | 12 | 10 | 51 | 0.19607843137254902 | ==> | "Robert Lewandowski" | "FC Bayern München" | 12 | 6 | 43 | 0.13953488372093023 | ==> | "Carlos Tévez" | "Juventus" | 12 | 7 | 34 | 0.20588235294117646 | ==> | "Gareth Bale" | "Real Madrid" | 10 | 2 | 32 | 0.0625 | ==> | "Luis Suárez" | "Barcelona" | 9 | 6 | 30 | 0.2 | ==> | "Neymar" | "Barcelona" | 11 | 9 | 29 | 0.3103448275862069 | ==> | "Hakan Calhanoglu" | "Bayer 04 Leverkusen" | 8 | 2 | 29 | 0.06896551724137931 | ==> | "Edinson Cavani" | "Paris Saint Germain" | 8 | 6 | 27 | 0.2222222222222222 | ==> | "Alexis Sánchez" | "Arsenal" | 9 | 4 | 25 | 0.16 | ==> +-------------------------------------------------------------------------------------------+
Maybe you can think of some cooler ones? I’d love to see them. Grab the code from github and give it a try.
Reference: | Neo4j: The BBC Champions League graph from our JCG partner Mark Needham at the Mark Needham Blog blog. |