Mini Search Engine – Just the basics, using Neo4j, Crawler4j, GraphStream and Encog
Continuing on to Chapter 4 of Programming Collective Intelligence (PCI), which is about implementing a search engine.
I may have bitten off a little more than I should have in one exercise. Instead of using the normal relational database constructs used in the book, I figured that since I had always wanted to have a look at Neo4j, now was the time. Just to say, this isn't necessarily the ideal use case for a graph DB, but how hard could it be to kill three birds with one stone?
Working through the tutorials and trying to reset my SQL Server / Oracle mindset took a little longer than expected, but thankfully there are some great resources around for Neo4j.
Just a couple:
Since I just wanted to run this as a little exercise, I decided to go for an in-memory (embedded) implementation rather than running it as a service on my machine. In hindsight this was probably a mistake, as the tools and web interface would have helped me visualise my data graph quicker in the beginning.
As you can only have one writable instance of the embedded database at a time, I made a little double-checked-locking singleton factory to create and clear the DB.
package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    private static GraphDatabaseService graphDb = null;
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();
                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }

    public static void clearDb() {
        try {
            if (graphDb != null) {
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
}
Then, using Crawler4j, I created a graph of all the URLs, starting with my blog: their relationships to other URLs, and all the words (and the indexes of those words) that each URL contains.
package net.briandupreez.pci.chapter4;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {

    private final GraphDatabaseService graphDb;

    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }

    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {
        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            Transaction tx = graphDb.beginTx();
            try {
                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship =
                            pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship =
                            pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }
        }
    }

    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase()
                    .replaceAll("\\p{Punct}", "")
                    .replaceAll("\\p{Digit}", "")
                    .split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }
}
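To actually kick the crawl off, Crawler4j needs a CrawlController. Below is a minimal sketch of how that could look; the storage folder, seed URL, crawl depth and number of crawlers are placeholders for illustration, not the exact values from my repo.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerBootstrap {

    public static void main(final String[] args) throws Exception {
        final CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("resources/crawl/root"); // placeholder storage folder
        config.setMaxDepthOfCrawling(2); // placeholder depth

        final PageFetcher pageFetcher = new PageFetcher(config);
        final RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        final CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://blog.example.com/"); // placeholder seed: the blog's URL

        // Blocks until the crawl completes; each crawler thread gets its own Neo4JWebCrawler instance
        controller.start(Neo4JWebCrawler.class, 2);
    }
}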
After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as it was another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same kind of task.
final ExecutorService executorService = Executors.newFixedThreadPool(4);
final String[] searchTerms = {"java", "spring"};

List<Callable<TaskResponse>> tasks = new ArrayList<>();
tasks.add(new WordFrequencyTask(searchTerms));
tasks.add(new DocumentLocationTask(searchTerms));
tasks.add(new PageRankTask(searchTerms));
tasks.add(new NeuralNetworkTask(searchTerms));

final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);
I then went about creating a task for each of the following: word frequency, document location, PageRank, and a neural network (with fake input / training data), to rank the pages returned based on the search criteria. All the code is in my public GitHub blog repo.
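To turn the individual task results into one ranking, the normalised resultMap from each TaskResponse can be combined with a weight per task and then sorted. A rough sketch along those lines follows; the weights here are illustrative assumptions, not the values from the repo.

// Illustrative sketch: combine the normalised score maps from each task into one ranking.
// The per-task weights are assumptions for this example only.
final Map<Class<?>, Double> weights = new HashMap<>();
weights.put(WordFrequencyTask.class, 1.0);
weights.put(DocumentLocationTask.class, 1.0);
weights.put(PageRankTask.class, 1.5);
weights.put(NeuralNetworkTask.class, 0.5);

final Map<String, Double> combinedScores = new HashMap<>();
for (final Future<TaskResponse> future : results) {
    final TaskResponse taskResponse = future.get();
    final double weight = weights.get(taskResponse.taskClazz);
    for (final Map.Entry<String, Double> entry : taskResponse.resultMap.entrySet()) {
        final Double current = combinedScores.get(entry.getKey());
        combinedScores.put(entry.getKey(),
                (current == null ? 0.0 : current) + weight * entry.getValue());
    }
}

// Highest scoring URLs first, using the MapUtil helper shown further down
final List<Map.Entry<String, Double>> ranked = MapUtil.entriesSortedByValues(combinedScores, false);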
Disclaimer: the neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful. I'll return to it once I have completed the journey through the whole PCI book.
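For context, a bare-bones Encog 3 training loop looks roughly like the sketch below. The layer sizes, the fake input / ideal data and the stopping criteria are made-up illustrations, not the actual task code from the repo.

import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public class NeuralNetworkSketch {

    public static void main(final String[] args) {
        // Fake training data: three input features per page (e.g. word frequency, location, link score)
        // and a single "relevance" output. Purely illustrative values.
        final double[][] input = {
                {0.9, 0.8, 0.7},
                {0.2, 0.1, 0.3},
                {0.6, 0.5, 0.4}
        };
        final double[][] ideal = {{1.0}, {0.0}, {0.5}};

        // Simple feed-forward network: 3 inputs, one hidden layer of 4, 1 output
        final BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 3));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 4));
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
        network.getStructure().finalizeStructure();
        network.reset();

        final MLDataSet trainingSet = new BasicMLDataSet(input, ideal);
        final ResilientPropagation train = new ResilientPropagation(network, trainingSet);

        // Train until the error is small enough or we give up
        int epoch = 0;
        do {
            train.iteration();
            epoch++;
        } while (train.getError() > 0.01 && epoch < 1000);
        train.finishTraining();
    }
}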
The one task worth sharing is the PageRank one. I quickly read some of the theory behind it, decided I am not that clever, and went searching for a library that had it implemented. I discovered GraphStream, a wonderful open-source project that does a whole lot more than just PageRank; check out their video.
From there it was simple to implement the PageRank task for this exercise.
package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");
        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();
        final Graph g = new SingleGraph("rank", false, true);

        final Iterator<Node> pageIterator = result.columnAs("related");
        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {
                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(node.getId()), source, destination, true);
            }
        }

        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();

        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }
}
In between all of this I found a great implementation of sorting a map by its values on Stack Overflow.
package net.briandupreez.pci.chapter4;

import java.util.*;

public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(
            final Map<K, V> map, final boolean ascending) {

        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries,
                new Comparator<Map.Entry<K, V>>() {
                    @Override
                    public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                        if (ascending) {
                            return e1.getValue().compareTo(e2.getValue());
                        } else {
                            return e2.getValue().compareTo(e1.getValue());
                        }
                    }
                }
        );
        return sortedEntries;
    }
}
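Used against one of the task responses it looks something like this, descending so the best match comes first (pageRankResponse here is just an assumed variable holding the TaskResponse returned by the PageRank task):

// Rank the PageRank task's normalised scores, highest first
final List<Map.Entry<String, Double>> ranked =
        MapUtil.entriesSortedByValues(pageRankResponse.resultMap, false);
for (final Map.Entry<String, Double> entry : ranked) {
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}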
The Maven dependencies used to implement all of this:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
</dependency>
<dependency>
    <groupId>org.encog</groupId>
    <artifactId>encog-core</artifactId>
    <version>3.2.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>3.5</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>1.9</version>
</dependency>
<dependency>
    <groupId>org.graphstream</groupId>
    <artifactId>gs-algo</artifactId>
    <version>1.1.2</version>
</dependency>
Now on to Chapter 5 of PCI… Optimisation.