Resolve coreference using Stanford CoreNLP
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. The Stanford CoreNLP coreference resolution system is a state-of-the-art system for resolving coreference in text. To use it, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition, and parsing. Sometimes, however, we use other tools for preprocessing, particularly when working in a specific domain. In those cases, we need a stand-alone coreference resolution system. This post demonstrates how to create such a system using Stanford CoreNLP.
Load properties
In general, we can just create an empty Properties object, because the Stanford CoreNLP tool automatically loads the default one from the model jar file, under edu.stanford.nlp.pipeline.
In other cases, we may want to use specific properties. The following code shows one way to load a property file from the classpath.
```java
private static final String PROPS_SUFFIX = ".properties";

private Properties loadProperties(String name) {
  return loadProperties(name, Thread.currentThread().getContextClassLoader());
}

private Properties loadProperties(String name, ClassLoader loader) {
  if (name.endsWith(PROPS_SUFFIX))
    name = name.substring(0, name.length() - PROPS_SUFFIX.length());
  name = name.replace('.', '/');
  name += PROPS_SUFFIX;
  Properties result = null;

  // Returns null on lookup failures
  System.err.println("Searching for resource: " + name);
  InputStream in = loader.getResourceAsStream(name);
  try {
    if (in != null) {
      InputStreamReader reader = new InputStreamReader(in, "utf-8");
      result = new Properties();
      result.load(reader); // Can throw IOException
    }
  } catch (IOException e) {
    result = null;
  } finally {
    IOUtils.closeIgnoringExceptions(in);
  }
  return result;
}
```
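To make the name-normalization rule concrete, here is a small self-contained sketch (the property name `dcoref.config` and the `annotators` value are made up for illustration). It replicates the suffix handling above and shows that `Properties.load` works the same way from any InputStream, which is why the classpath lookup above is all we need:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropsDemo {
    static final String PROPS_SUFFIX = ".properties";

    // Same normalization as loadProperties above: "dcoref.config" and
    // "dcoref.config.properties" both map to "dcoref/config.properties"
    static String normalize(String name) {
        if (name.endsWith(PROPS_SUFFIX))
            name = name.substring(0, name.length() - PROPS_SUFFIX.length());
        name = name.replace('.', '/');
        return name + PROPS_SUFFIX;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(normalize("dcoref.config"));            // dcoref/config.properties
        System.out.println(normalize("dcoref.config.properties")); // dcoref/config.properties

        // Loading from an in-memory stream works exactly like loading
        // from the stream returned by getResourceAsStream
        byte[] data = "annotators=tokenize,ssplit\n".getBytes(StandardCharsets.UTF_8);
        Properties props = new Properties();
        props.load(new InputStreamReader(new ByteArrayInputStream(data),
                                         StandardCharsets.UTF_8));
        System.out.println(props.getProperty("annotators"));       // tokenize,ssplit
    }
}
```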
Initialize the system
After getting the properties, we can initialize the coreference resolving system. For example:
```java
try {
  corefSystem = new SieveCoreferenceSystem(new Properties());
  mentionExtractor = new MentionExtractor(corefSystem.dictionaries(),
      corefSystem.semantics());
} catch (Exception e) {
  System.err.println("ERROR: cannot create DeterministicCorefAnnotator!");
  e.printStackTrace();
  throw new RuntimeException(e);
}
```
Annotation
To feed the resolving system, we first need to understand the structure of Annotation, which represents a span of text in a document. This is the trickiest part of this post, because to my knowledge there is no documentation that explains it in detail. The Annotation class itself is just an implementation of Map.
Basically, an annotation contains a sequence of sentences (each of which is another map). For each sentence, we need to provide the sequence of tokens (a List of CoreLabel), the parse tree (Tree), and the dependency graph (SemanticGraph).
```
Annotation
  CoreAnnotations.SentencesAnnotation -> sentences
    CoreAnnotations.TokensAnnotation -> tokens
    TreeCoreAnnotations.TreeAnnotation -> Tree
    SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation -> SemanticGraph
```
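The reason Annotation can hold all of these under one roof is that it is a type-safe heterogeneous map: the keys are Class objects, and the key determines the type of the stored value. The following is not CoreNLP code, just a minimal JDK-only sketch of that pattern, to show why `ann.set(TokensAnnotation.class, tokens)` and `ann.get(SentencesAnnotation.class)` can coexist in one map:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A stripped-down sketch of the CoreMap idea: the key is a Class token,
// and the value stored under it has that class's type.
public class TypedMap {
    private final Map<Class<?>, Object> map = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        map.put(key, value);
    }

    public <T> T get(Class<T> key) {
        // cast() makes the lookup type-safe without unchecked warnings
        return key.cast(map.get(key));
    }

    public boolean containsKey(Class<?> key) {
        return map.containsKey(key);
    }

    public static void main(String[] args) {
        TypedMap sentence = new TypedMap();
        sentence.set(String.class, "John loves Mary.");
        sentence.set(List.class, List.of("John", "loves", "Mary", "."));
        System.out.println(sentence.get(String.class)); // John loves Mary.
    }
}
```

In CoreNLP the keys are dedicated marker classes such as `CoreAnnotations.TokensAnnotation` rather than `String.class`, but the mechanism is the same.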
Tokens
The sequence of tokens represents the text of one sentence. Each token is an instance of CoreLabel, which stores the word, tag (part of speech), lemma, named entity, normalized named entity, etc.
```java
List<CoreLabel> tokens = new ArrayList<>();
for (int i = 0; i < n; i++) {
  // create a token
  CoreLabel token = new CoreLabel();
  token.setWord(word);
  token.setTag(tag);
  token.setNer(ner);
  ...
  tokens.add(token);
}
ann.set(TokensAnnotation.class, tokens);
```
Parse tree
A parse tree is an instance of Tree. If you use the Penn Treebank style, the Stanford CoreNLP tool provides an easy way to parse the format:
```java
Tree tree = Tree.valueOf(getText());
ann.set(TreeAnnotation.class, tree);
```
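Tree.valueOf expects Penn Treebank bracketing such as `(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))`. To make the format concrete, here is a minimal JDK-only bracket parser (an illustration of the notation, not a replacement for Tree.valueOf; it assumes well-formed input with single-token labels):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal Penn-treebank-style bracket parser: each "(LABEL child...)" group
// becomes a node; bare tokens become leaves.
public class BracketTree {
    final String label;
    final List<BracketTree> children = new ArrayList<>();

    BracketTree(String label) { this.label = label; }

    static BracketTree parse(String s) {
        // pad parentheses so a whitespace split tokenizes the bracketing
        String[] toks = s.replace("(", " ( ").replace(")", " ) ")
                         .trim().split("\\s+");
        return parse(toks, new int[]{0});
    }

    private static BracketTree parse(String[] toks, int[] pos) {
        pos[0]++;                                           // consume "("
        BracketTree node = new BracketTree(toks[pos[0]++]); // node label
        while (!toks[pos[0]].equals(")")) {
            if (toks[pos[0]].equals("(")) {
                node.children.add(parse(toks, pos));        // nested phrase
            } else {
                node.children.add(new BracketTree(toks[pos[0]++])); // leaf word
            }
        }
        pos[0]++;                                           // consume ")"
        return node;
    }

    public static void main(String[] args) {
        BracketTree t =
            parse("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))");
        System.out.println(t.label);                 // S
        System.out.println(t.children.size());       // 2
        System.out.println(t.children.get(1).label); // VP
    }
}
```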
Semantic graph
The semantic graph can be created from the parse tree via rule-based conversion to typed dependencies. However, the code is not that straightforward:
```java
GrammaticalStructureFactory grammaticalStructureFactory =
    new EnglishGrammaticalStructureFactory();
GrammaticalStructure gs = grammaticalStructureFactory
    .newGrammaticalStructure(tree);
SemanticGraph semanticGraph =
    new SemanticGraph(gs.typedDependenciesCollapsed());
```
Please note that Stanford CoreNLP provides different types of dependencies. Among others, the coreference system needs the "collapsed" dependencies, so to set the annotation you may write:
```java
ann.set(
    CollapsedDependenciesAnnotation.class,
    new SemanticGraph(gs.typedDependenciesCollapsed()));
```
Resolve coreference
Finally, you can feed the annotation to the system. The following code is one example. It is a bit long but easy to follow.
```java
private void annotate(Annotation annotation) {
  try {
    List<Tree> trees = new ArrayList<Tree>();
    List<List<CoreLabel>> sentences = new ArrayList<List<CoreLabel>>();

    // extract trees and sentence words
    // we are only supporting the new annotation standard for this Annotator!
    if (annotation.containsKey(CoreAnnotations.SentencesAnnotation.class)) {
      // int sentNum = 0;
      for (CoreMap sentence : annotation
          .get(CoreAnnotations.SentencesAnnotation.class)) {
        List<CoreLabel> tokens = sentence
            .get(CoreAnnotations.TokensAnnotation.class);
        sentences.add(tokens);
        Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
        trees.add(tree);
        MentionExtractor.mergeLabels(tree, tokens);
        MentionExtractor.initializeUtterance(tokens);
      }
    } else {
      System.err.println(
          "ERROR: this coreference resolution system requires SentencesAnnotation!");
      return;
    }

    // extract all possible mentions
    // this is created for each new annotation because it is not threadsafe
    RuleBasedCorefMentionFinder finder = new RuleBasedCorefMentionFinder();
    List<List<Mention>> allUnprocessedMentions = finder
        .extractPredictedMentions(annotation, 0, corefSystem.dictionaries());

    // add the relevant info to mentions and order them for coref
    Document document = mentionExtractor.arrange(
        annotation, sentences, trees, allUnprocessedMentions);
    List<List<Mention>> orderedMentions = document.getOrderedMentions();
    if (VERBOSE) {
      for (int i = 0; i < orderedMentions.size(); i++) {
        System.err.printf("Mentions in sentence #%d:\n", i);
        for (int j = 0; j < orderedMentions.get(i).size(); j++) {
          System.err.println("\tMention #" + j + ": "
              + orderedMentions.get(i).get(j).spanToString());
        }
      }
    }

    Map<Integer, CorefChain> result = corefSystem.coref(document);
    annotation.set(CorefCoreAnnotations.CorefChainAnnotation.class, result);

    // for backward compatibility
    if (OLD_FORMAT) {
      List<Pair<IntTuple, IntTuple>> links = SieveCoreferenceSystem
          .getLinks(result);

      if (VERBOSE) {
        System.err.printf("Found %d coreference links:\n", links.size());
        for (Pair<IntTuple, IntTuple> link : links) {
          System.err.printf("LINK (%d, %d) -> (%d, %d)\n",
              link.first.get(0), link.first.get(1),
              link.second.get(0), link.second.get(1));
        }
      }

      // // save the coref output as CorefGraphAnnotation
      // // cdm 2013: this block didn't seem to be doing anything needed....
      // List<List<CoreLabel>> sents = new ArrayList<List<CoreLabel>>();
      // for (CoreMap sentence:
      //     annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      //   List<CoreLabel> tokens =
      //       sentence.get(CoreAnnotations.TokensAnnotation.class);
      //   sents.add(tokens);
      // }

      // this graph is stored in CorefGraphAnnotation -- the raw links found
      // by the coref system
      List<Pair<IntTuple, IntTuple>> graph =
          new ArrayList<Pair<IntTuple, IntTuple>>();

      for (Pair<IntTuple, IntTuple> link : links) {
        //
        // Note: all offsets in the graph start at 1 (not at 0!)
        // we do this for consistency reasons, as indices for syntactic
        // dependencies start at 1
        //
        int srcSent = link.first.get(0);
        int srcTok = orderedMentions.get(srcSent - 1).get(
            link.first.get(1) - 1).headIndex + 1;
        int dstSent = link.second.get(0);
        int dstTok = orderedMentions.get(dstSent - 1).get(
            link.second.get(1) - 1).headIndex + 1;
        IntTuple dst = new IntTuple(2);
        dst.set(0, dstSent);
        dst.set(1, dstTok);
        IntTuple src = new IntTuple(2);
        src.set(0, srcSent);
        src.set(1, srcTok);
        graph.add(new Pair<IntTuple, IntTuple>(src, dst));
      }
      annotation.set(CorefCoreAnnotations.CorefGraphAnnotation.class, graph);

      for (CorefChain corefChain : result.values()) {
        if (corefChain.getMentionsInTextualOrder().size() < 2)
          continue;
        Set<CoreLabel> coreferentTokens = Generics.newHashSet();
        for (CorefMention mention : corefChain.getMentionsInTextualOrder()) {
          CoreMap sentence = annotation.get(
              CoreAnnotations.SentencesAnnotation.class).get(mention.sentNum - 1);
          CoreLabel token = sentence.get(
              CoreAnnotations.TokensAnnotation.class).get(mention.headIndex - 1);
          coreferentTokens.add(token);
        }
        for (CoreLabel token : coreferentTokens) {
          token.set(CorefCoreAnnotations.CorefClusterAnnotation.class,
              coreferentTokens);
        }
      }
    }
  } catch (RuntimeException e) {
    throw e;
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
```
Reference: Resolve coreference using Stanford CoreNLP from our JCG partner Yifan Peng at the PGuru blog.