Software Development

Resolve coreference using Stanford CoreNLP

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. Stanford CoreNLP coreference resolution system is the state-of-the-art system to resolve coreference in the text. To use the system, we usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmarization, named entity recoginition, and parsing. However sometimes, we use others tools for preprocessing, particulaly when we are working on a specific domain. In these cases, we need a stand-alone coreference resolution system. This post demenstrates how to create such a system using Stanford CoreNLP.
 
 

Load properties

In general, we can just create an empty Properties, because the Stanford CoreNLP tool can automatically load the default one in the model jar file, which is under edu.stanford.nlp.pipeline.

In other cases, we would like to use specific properties. The following code shows one example of loading the property file from the working directory.

01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
private static final String PROPS_SUFFIX = ".properties";
 
  private Properties loadProperties(String name) {
    return loadProperties(name,
       Thread.currentThread().getContextClassLoader());
  }
   
  private Properties loadProperties(String name, ClassLoader loader) {
    if (name.endsWith(PROPS_SUFFIX))
      name = name.substring(0, name.length() - PROPS_SUFFIX.length());
    name = name.replace('.', '/');
    name += PROPS_SUFFIX;
    Properties result = null;
 
    // Returns null on lookup failures
    System.err.println("Searching for resource: " + name);
    InputStream in = loader.getResourceAsStream(name);
    try {
      if (in != null) {
        InputStreamReader reader = new InputStreamReader(in, "utf-8");
        result = new Properties();
        result.load(reader); // Can throw IOException
      }
    } catch (IOException e) {
      result = null;
    } finally {
      IOUtils.closeIgnoringExceptions(in);
    }
 
    return result;
  }

Initialize the system

After getting the properties, we can initialize the coreference resovling system. For example,

01
02
03
04
05
06
07
08
09
10
try {
      corefSystem = new SieveCoreferenceSystem(new Properties());
      mentionExtractor = new MentionExtractor(corefSystem.dictionaries(),
          corefSystem.semantics());
       
    } catch (Exception e) {
      System.err.println("ERROR: cannot create DeterministicCorefAnnotator!");
      e.printStackTrace();
      throw new RuntimeException(e);
    }

Annotation

To feed the resolving system, we first need to understand the structure of annotation, which represents a span of text in a document. This is the most tricky part in this post, because to my knowledge there is no document to explain it in details. The Annotation class itself is just an implementation of Map.

Basically, an annotation contains a sequence of sentences (which is another map). For each sentence, we need to provide a seqence of tokens (a list of CoreLabel), the parsing tree (Tree), and the dependency graph (SemanticGraph).

1
2
3
4
5
Annotation
  CoreAnnotations.SentencesAnnotation -> sentences
    CoreAnnotations.TokensAnnotation -> tokens
    TreeCoreAnnotations.TreeAnnotation -> Tree
    SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation -> SemanticGraph

Tokens

The sequence of tokens represents the text of one sentence. Each token is an instance of CoreLabel, which stores word, tag (part-of-speech), lemma, named entity, normailzied named entity, etc.

01
02
03
04
05
06
07
08
09
10
11
List<CoreLabel> tokens = new ArrayList<>();
for(int i=0; i<n; i++) {
  // create a token
  CoreLabel token = new CoreLabel();
  token.setWord(word);
  token.setTag(tag);
  token.setNer(ner);
  ...
  tokens.add(token);
}
ann.set(TokensAnnotation.class, tokens);

Parse tree

A parse tree is an instance of Tree. If you use the Penn treebank style, Stanford corenlp tool provide an easy to parse the format.

1
2
Tree tree = Tree.valueOf(getText());
ann.set(TreeAnnotation.class, tree);

Semantic graph

Semantic graph can be created using typed dependencis from the tree by rules. However the code is not that straightforward.

1
2
3
4
5
6
GrammaticalStructureFactory grammaticalStructureFactory =
    new EnglishGrammaticalStructureFactory();
GrammaticalStructure gs = grammaticalStructureFactory
    .newGrammaticalStructure(tree);
SemanticGraph semanticGraph =
    new SemanticGraph(gs.typedDependenciesCollapsed());

Please note that Stanford Corenlp provide different types of dependencies. Among others, coreference system needs “collapsed-dependencies”, so to set the annotation, you may write

1
2
3
ann.set(
          CollapsedDependenciesAnnotation.class,
          new SemanticGraph(gs.typedDependenciesCollapsed()));

Resolve coreference

At last, you can feed the system with the annotation. The following code is one example. It is a bit long but easy to understand.

001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
019
020
021
022
023
024
025
026
027
028
029
030
031
032
033
034
035
036
037
038
039
040
041
042
043
044
045
046
047
048
049
050
051
052
053
054
055
056
057
058
059
060
061
062
063
064
065
066
067
068
069
070
071
072
073
074
075
076
077
078
079
080
081
082
083
084
085
086
087
088
089
090
091
092
093
094
095
096
097
098
099
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
private void annotate(Annotation annotation) {
    try {
      List<Tree> trees = new ArrayList<Tree>();
      List<List<CoreLabel>> sentences = new ArrayList<List<CoreLabel>>();
 
      // extract trees and sentence words
      // we are only supporting the new annotation standard for this Annotator!
      if (annotation.containsKey(CoreAnnotations.SentencesAnnotation.class)) {
        // int sentNum = 0;
        for (CoreMap sentence : annotation
            .get(CoreAnnotations.SentencesAnnotation.class)) {
          List<CoreLabel> tokens = sentence
              .get(CoreAnnotations.TokensAnnotation.class);
          sentences.add(tokens);
          Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
          trees.add(tree);
          MentionExtractor.mergeLabels(tree, tokens);
          MentionExtractor.initializeUtterance(tokens);
        }
      } else {
        System.err
            .println("ERROR: this coreference resolution system requires SentencesAnnotation!");
        return;
      }
 
      // extract all possible mentions
      // this is created for each new annotation because it is not threadsafe
      RuleBasedCorefMentionFinder finder = new RuleBasedCorefMentionFinder();
      List<List<Mention>> allUnprocessedMentions = finder
          .extractPredictedMentions(annotation, 0, corefSystem.dictionaries());
 
      // add the relevant info to mentions and order them for coref
      Document document = mentionExtractor.arrange(
          annotation,
          sentences,
          trees,
          allUnprocessedMentions);
      List<List<Mention>> orderedMentions = document.getOrderedMentions();
      if (VERBOSE) {
        for (int i = 0; i < orderedMentions.size(); i++) {
          System.err.printf("Mentions in sentence #%d:\n", i);
          for (int j = 0; j < orderedMentions.get(i).size(); j++) {
            System.err.println("\tMention #"
                + j
                + ": "
                + orderedMentions.get(i).get(j).spanToString());
          }
        }
      }
 
      Map<Integer, CorefChain> result = corefSystem.coref(document);
      annotation.set(CorefCoreAnnotations.CorefChainAnnotation.class, result);
 
      // for backward compatibility
      if (OLD_FORMAT) {
        List<Pair<IntTuple, IntTuple>> links = SieveCoreferenceSystem
            .getLinks(result);
 
        if (VERBOSE) {
          System.err.printf("Found %d coreference links:\n", links.size());
          for (Pair<IntTuple, IntTuple> link : links) {
            System.err.printf(
                "LINK (%d, %d) -> (%d, %d)\n",
                link.first.get(0),
                link.first.get(1),
                link.second.get(0),
                link.second.get(1));
          }
        }
 
        //
        // save the coref output as CorefGraphAnnotation
        //
 
        // cdm 2013: this block didn't seem to be doing anything needed....
        // List<List<CoreLabel>> sents = new ArrayList<List<CoreLabel>>();
        // for (CoreMap sentence:
        // annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
        // List<CoreLabel> tokens =
        // sentence.get(CoreAnnotations.TokensAnnotation.class);
        // sents.add(tokens);
        // }
 
        // this graph is stored in CorefGraphAnnotation -- the raw links found
        // by the coref system
        List<Pair<IntTuple, IntTuple>> graph = new ArrayList<Pair<IntTuple, IntTuple>>();
 
        for (Pair<IntTuple, IntTuple> link : links) {
          //
          // Note: all offsets in the graph start at 1 (not at 0!)
          // we do this for consistency reasons, as indices for syntactic
          // dependencies start at 1
          //
          int srcSent = link.first.get(0);
          int srcTok = orderedMentions.get(srcSent - 1).get(
              link.first.get(1) - 1).headIndex + 1;
          int dstSent = link.second.get(0);
          int dstTok = orderedMentions.get(dstSent - 1).get(
              link.second.get(1) - 1).headIndex + 1;
          IntTuple dst = new IntTuple(2);
          dst.set(0, dstSent);
          dst.set(1, dstTok);
          IntTuple src = new IntTuple(2);
          src.set(0, srcSent);
          src.set(1, srcTok);
          graph.add(new Pair<IntTuple, IntTuple>(src, dst));
        }
        annotation.set(CorefCoreAnnotations.CorefGraphAnnotation.class, graph);
 
        for (CorefChain corefChain : result.values()) {
          if (corefChain.getMentionsInTextualOrder().size() < 2)
            continue;
          Set<CoreLabel> coreferentTokens = Generics.newHashSet();
          for (CorefMention mention : corefChain.getMentionsInTextualOrder()) {
            CoreMap sentence = annotation.get(
                CoreAnnotations.SentencesAnnotation.class).get(
                mention.sentNum - 1);
            CoreLabel token = sentence.get(
                CoreAnnotations.TokensAnnotation.class).get(
                mention.headIndex - 1);
            coreferentTokens.add(token);
          }
          for (CoreLabel token : coreferentTokens) {
            token.set(
                CorefCoreAnnotations.CorefClusterAnnotation.class,
                coreferentTokens);
          }
        }
      }
    } catch (RuntimeException e) {
      throw e;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
Reference: Resolve coreference using Stanford CoreNLP from our JCG partner Yifan Peng at the PGuru blog.
Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
I agree to the Terms and Privacy Policy

Yifan Peng

Yifan is a senior CIS PhD student in University of Delaware. His main researches include relation extraction, sentence simplification, text mining and natural language processing. He does not consider himself of a Java geek, but all his projects are written in Java.
Subscribe
Notify of
guest


This site uses Akismet to reduce spam. Learn how your comment data is processed.

1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Simon Schala
Simon Schala
9 years ago

Doesn’t work

Back to top button