Building models of Java code from source and JAR files
Recently I spent some time working on effectivejava, which is on its way to reaching 300 stars on GitHub (feel free to help reach the target :D).
Effectivejava is a tool to run queries on your Java code. It is based on another project I contribute to, javaparser. Javaparser takes Java source code as input and produces an Abstract Syntax Tree (AST). We can perform simple analyses directly on the AST. For example, we can find out which methods take more than 5 parameters (you may want to refactor them…). However, more sophisticated analyses require resolving symbols.
In this post I describe how I am implementing symbol resolution considering both source code and JAR files. In this first post we will build a homogeneous view of both source code and JAR files; in the next post we will resolve symbols by exploring these models.
Code is available on GitHub, on the branch symbolsolver of effectivejava.
Resolving symbols
Why do we need to resolve symbols?
Given this code:
foo.method(a, b, c);
we need to figure out what foo, method, a, b and c are. Are they references to local variables? To arguments of the current method? To fields declared in the class? To fields inherited from a superclass? What types do they have? To answer these questions we need to be able to resolve symbols.
To resolve symbols we can navigate the AST and apply scoping rules. For example, we may check whether a certain symbol corresponds to a local variable. If not, we can look among the parameters of the enclosing method. If we still cannot find a match, we need to look among the fields declared by the class, and if we still have no luck we may have to look among the fields inherited by this class.
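The lookup order described above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: the class name ScopeChainSketch, the resolve method and the scope labels are hypothetical and not part of effectivejava, and each scope is reduced to a set of names instead of real AST nodes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: the class, method and scope labels are made up for illustration.
final class ScopeChainSketch {

    /**
     * Looks for a name in each scope, innermost first, and returns the label
     * of the first scope that declares it (or empty if none does).
     * The map must iterate in lookup order, hence LinkedHashMap.
     */
    static Optional<String> resolve(String name, LinkedHashMap<String, Set<String>> scopes) {
        for (Map.Entry<String, Set<String>> scope : scopes.entrySet()) {
            if (scope.getValue().contains(name)) {
                return Optional.of(scope.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Set<String>> scopes = new LinkedHashMap<>();
        scopes.put("local variable", Set.of("a"));
        scopes.put("method parameter", Set.of("b"));
        scopes.put("declared field", Set.of("foo", "a")); // this "a" is shadowed by the local
        scopes.put("inherited field", Set.of("c"));

        for (String name : new String[] {"foo", "a", "b", "c", "unknown"}) {
            System.out.println(name + " -> " + resolve(name, scopes).orElse("unresolved"));
        }
    }
}
```

The point of the ordered map is shadowing: a local variable named a wins over a field with the same name simply because its scope is visited first.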
Now, scoping rules are much more complex than the handful of steps I just described. Resolving methods is especially complex because of overloading. However, one key point is that to resolve symbols we need to look among imported classes, extended classes and external classes in general, which may be part of the project or be imported as dependencies.
So to resolve symbols we need to look for the corresponding declarations:
- on the ASTs of the classes of the project we are examining
- among the classes contained in the JAR files used as dependencies
Javaparser gives us the ASTs we need for the first point. For the second one we are going to build a model of the classes in JAR files using Javassist.
Build a model of classes contained in JAR files
Our symbol solver should look through a list of entries (our classpath entries) in order, and see if a certain class can be found there. To do so, we need to open the JAR files and look among their contents. For performance reasons we may want to build a cache of the elements contained in a given JAR.
(ns app.jarloading
  (:use [app.javaparser])
  (:use [app.operations])
  (:use [app.utils])
  ; clojure.string is used by pathToTypeName below
  (:require [clojure.string])
  (:import [app.operations Operation]))

(import java.net.URLDecoder)
(import java.util.jar.JarEntry)
(import java.util.jar.JarFile)
(import javassist.ClassPool)
(import javassist.CtClass)

; An element on the classpath (a single class, interface, enum or resource file)
(defrecord ClasspathElement [resource path contentAsStreamThunk])

(defn- jarEntryToClasspathElement [jarFile jarEntry]
  (let [name (.getName jarEntry)
        content (fn [] (.getInputStream jarFile jarEntry))]
    (ClasspathElement. jarFile name content)))

(defn getElementsEntriesInJar
  "Return a set of ClasspathElements"
  [pathToJarFile]
  (let [url (URLDecoder/decode pathToJarFile "UTF-8")
        jarfile (new JarFile url)
        entries (enumeration-seq (.entries jarfile))
        entries' (filter (fn [e] (not (.isDirectory e))) entries)]
    (map (partial jarEntryToClasspathElement jarfile) entries')))

(defn getClassesEntriesInJar
  "Return a set of ClasspathElements"
  [pathToJarFile]
  (filter (fn [e] (.endsWith (.path e) ".class"))
          (getElementsEntriesInJar pathToJarFile)))

(defn pathToTypeName [path]
  (if (.endsWith path ".class")
    (let [path' (.substring path 0 (- (.length path) 6))
          path'' (clojure.string/replace path' #"/" ".")
          path''' (clojure.string/replace path'' "$" ".")]
      path''')
    (throw (IllegalArgumentException. "Path not ending with .class"))))

(defn findEntry
  "return the ClasspathElement corresponding to the given name, or nil"
  [typeName classEntries]
  (first (filter (fn [e] (= typeName (pathToTypeName (.path e)))) classEntries)))

(defn findType
  "return the CtClass corresponding to the given name, or nil"
  [typeName classEntries]
  (let [entry (findEntry typeName classEntries)
        classPool (ClassPool/getDefault)]
    (if entry
      (.makeClass classPool ((.contentAsStreamThunk entry)))
      nil)))
How do we start? First of all we read the entries listed in the JAR (getElementsEntriesInJar). In this way we get a list of ClasspathElements. Then we focus only on the .class files (getClassesEntriesInJar). This function should be invoked once per JAR and the result should be cached. Given a list of ClasspathElements we can then search for the element corresponding to a given name (e.g., com.github.javaparser.ASTParser). To do that we can use the function findEntry. We can also load that class using Javassist: this is what the function findType does, returning an instance of CtClass.
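For readers more at home in Java, the listing and caching steps could look roughly like the sketch below. It uses only the JDK (java.util.jar), so it stops before the Javassist step, and the names JarClassIndexSketch and typeNamesInJar are hypothetical. It mirrors pathToTypeName from the Clojure code, including the choice to turn $ into a dot for nested classes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical sketch: not part of effectivejava, JDK classes only.
final class JarClassIndexSketch {

    // Cache: JAR path -> type names found in it, so each JAR is scanned only once.
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();

    /** Converts "com/example/Outer$Inner.class" to "com.example.Outer.Inner". */
    static String pathToTypeName(String path) {
        if (!path.endsWith(".class")) {
            throw new IllegalArgumentException("Path not ending with .class: " + path);
        }
        String withoutExtension = path.substring(0, path.length() - ".class".length());
        return withoutExtension.replace('/', '.').replace('$', '.');
    }

    /** Returns the type names of all .class entries in the JAR, scanning it at most once. */
    static List<String> typeNamesInJar(String pathToJarFile) throws IOException {
        List<String> cached = CACHE.get(pathToJarFile);
        if (cached != null) {
            return cached;
        }
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(pathToJarFile)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                // Skip directories and resources, keep only compiled classes.
                if (!entry.isDirectory() && entry.getName().endsWith(".class")) {
                    names.add(pathToTypeName(entry.getName()));
                }
            }
        }
        CACHE.put(pathToJarFile, names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java JarClassIndexSketch <path-to-jar>");
            return;
        }
        typeNamesInJar(args[0]).forEach(System.out::println);
    }
}
```

A real index would probably need to treat entries such as module-info.class specially; the sketch keeps every .class entry it finds.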
Why not just use reflection?
Someone could think that it would be easier to just add the dependencies to the classpath of effectivejava and then use the normal classloader and reflection to obtain the needed information. While it would be easier, there are some drawbacks:
- when a class is loaded and initialized, its static initializers are executed, which may not be what we want
- it could conflict with the real dependencies of effectivejava
- finally, not all the information available in the bytecode is easily retrievable through the reflection API
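The first drawback is easy to observe with a small experiment. In this sketch (all names are hypothetical) a nested class stands in for a class coming from a dependency. The usual one-argument Class.forName loads and initializes the class, so its static block runs. The three-argument overload can postpone initialization, but the static block will still run as soon as something instantiates the class or touches its static state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Dependency stands in for a class found in a third-party JAR.
final class StaticInitSketch {

    // Records side effects so we can observe when the static initializer runs.
    static final List<String> LOG = new ArrayList<>();

    static class Dependency {
        static {
            LOG.add("Dependency initialized");
        }
        int someField;
    }

    // Binary name of the nested class, built from the outer class name plus "$Dependency".
    static final String DEPENDENCY_NAME = StaticInitSketch.class.getName() + "$Dependency";

    /** Loads the class but asks the JVM not to initialize it yet. */
    static Class<?> loadWithoutInitializing() throws ClassNotFoundException {
        return Class.forName(DEPENDENCY_NAME, false, StaticInitSketch.class.getClassLoader());
    }

    /** The usual one-argument form: loads and initializes the class. */
    static Class<?> loadAndInitialize() throws ClassNotFoundException {
        return Class.forName(DEPENDENCY_NAME);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> loaded = loadWithoutInitializing();
        System.out.println(loaded.getDeclaredFields().length + " declared field(s), log = " + LOG);
        loadAndInitialize();
        System.out.println("log = " + LOG);
    }
}
```

Reading the bytecode with Javassist avoids the question entirely, because the class is never handed to the JVM for execution.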
Solve symbols: combining heterogeneous models
Ok, now, to solve symbols we will have to implement the scoping rules and navigate both the ASTs obtained from Javaparser and the CtClasses obtained from Javassist. We will see the details in a future blog post, but we need to consider one other aspect first. Consider this code:
package me.tomassetti;

import com.github.someproject.ClassInJar;

public class MyClass extends ClassInJar {
    private int myDeclaredField;

    public int foo(){
        return myDeclaredField + myInheritedField;
    }
}
In this case we assume we have a JAR containing the class com.github.someproject.ClassInJar, which declares the field myInheritedField. When we resolve symbols we will have these mappings:
- myDeclaredField will be resolved to an instance of com.github.javaparser.ast.body.VariableDeclarator (in Javaparser we have nodes of type FieldDeclaration, which map to constructs such as private int a, b, c;. VariableDeclarators instead point to the single fields such as a, b or c)
- myInheritedField will be resolved to an instance of javassist.CtField
The problem is that we want to be able to treat them in a homogeneous way: we should be able to treat each field using the same functions, irrespective of its origin (a JAR file or a Java source file). To do so we are going to build common views using Clojure protocols. I tend to view Clojure’s protocols as the equivalent of Java’s interfaces.
(defprotocol FieldDecl
  (fieldName [this]))

(extend-protocol FieldDecl
  com.github.javaparser.ast.body.VariableDeclarator
  (fieldName [this]
    (.getName (.getId this))))

(extend-protocol FieldDecl
  javassist.CtField
  (fieldName [this]
    (.getName this)))
While in Java we would have to build adapters, implementing the new interface (FieldDecl) and wrapping the existing classes (VariableDeclarator, CtField), in Clojure we can just say that those classes extend the protocol and we are done.
Now we are able to treat each field as a FieldDecl and we can invoke fieldName on each field. We still need to figure out how to resolve the type of the field. To do that we need to look into symbol resolution and in particular into type resolution, which is our next step.
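To make the comparison concrete, here is roughly what the adapter approach could look like in Java. To keep the sketch self-contained, SourceField and BytecodeField are hypothetical stand-ins for VariableDeclarator and CtField, with simplified accessors. The real classes come from Javaparser and Javassist and cannot be modified to implement our interface, which is exactly why wrappers are needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the adapter approach; every name here is illustrative.
final class FieldAdapterSketch {

    // Stand-in for a source-level node (plays the role of VariableDeclarator).
    static final class SourceField {
        private final String id;
        SourceField(String id) { this.id = id; }
        String getId() { return id; }
    }

    // Stand-in for a bytecode-level field (plays the role of CtField).
    static final class BytecodeField {
        private final String name;
        BytecodeField(String name) { this.name = name; }
        String getName() { return name; }
    }

    /** The common view, equivalent to the FieldDecl protocol. */
    interface FieldDecl {
        String fieldName();
    }

    /** Adapter wrapping a source-level field. */
    static final class SourceFieldAdapter implements FieldDecl {
        private final SourceField wrapped;
        SourceFieldAdapter(SourceField wrapped) { this.wrapped = wrapped; }
        @Override public String fieldName() { return wrapped.getId(); }
    }

    /** Adapter wrapping a bytecode-level field. */
    static final class BytecodeFieldAdapter implements FieldDecl {
        private final BytecodeField wrapped;
        BytecodeFieldAdapter(BytecodeField wrapped) { this.wrapped = wrapped; }
        @Override public String fieldName() { return wrapped.getName(); }
    }

    /** Client code sees only FieldDecl, whatever the origin of the field. */
    static List<String> fieldNames(List<FieldDecl> fields) {
        List<String> names = new ArrayList<>();
        for (FieldDecl field : fields) {
            names.add(field.fieldName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<FieldDecl> fields = List.of(
            new SourceFieldAdapter(new SourceField("myDeclaredField")),
            new BytecodeFieldAdapter(new BytecodeField("myInheritedField")));
        System.out.println(fieldNames(fields));
    }
}
```

Every object has to be wrapped before it can be used through the common interface, whereas extend-protocol attaches the behaviour to the existing classes directly.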
Conclusions
Building models of Java code is something that has fascinated me for a while. As part of my master's thesis I wrote a DSL which interacted with existing Java code (I also had editors, written as Eclipse plugins, and code generators: it was kind of cool). In the DSL it was possible to specify references to Java classes, using both source code and JAR files. I was using EMF and I probably adopted JaMoPP and Javassist for that project.
Later I built CodeModels, a library to analyze ASTs of several languages (Java, JavaScript, Ruby, HTML, etc.).
I think that building tools to manipulate code is a very interesting form of metaprogramming, and it should be in every developer's toolbox. I plan to spend some more time playing with effectivejava. Fun times are coming.
Feel free to share comments and suggestions!
Reference: Building models of Java code from source and JAR files from our JCG partner Federico Tomassetti at the Federico Tomassetti blog.
The link http://www.csg.ci.i.u-tokyo.ac.jp/~chiba/javassist/ is dead.
Thanks Patrick. I can fix it in my original article, but I do not have permission to fix it here. Btw the correct link should be https://github.com/jboss-javassist/javassist