Code Generation with Roslyn: a Skeleton Class from UML
We have already seen some examples of transforming and analyzing C# code with Roslyn. Now we are going to build a more complex example that combines code generation with Roslyn and parsing with Sprache. We are going to create a skeleton class from a PlantUML file. In short, we are doing the inverse of what we did before. The first step, of course, is to parse the input.
@startuml
class ClassDiagramGenerator {
    - writer : TextWriter
    - indent : string
    - nestingDepth : int
    + ClassDiagramGenerator(writer:TextWriter, indent:string)
    + {{static}} VisitInterfaceDeclaration(node:Node) : void
    + {{static}} VisitStructDeclaration(node:Node) : void
    + VisitEnumDeclaration(node:Node) : void
    - WriteLine(line:string) : void
    - GetTypeModifiersText(modifiers:SyntaxTokenList) : string
    - GetMemberModifiersText(modifiers:SyntaxTokenList) : string
}
@enduml
As you can see, there are four kinds of entities in this example: the PlantUML start and end tags, and the class, field and method declarations.
Parsing all the things
We are going to parse the file line by line instead of doing it in one fell swoop. This is partly because of the limitations of Sprache, but also because it is easier to parse one thing at a time correctly than to get everything right in one go.

public static Parser<string> UmlTags =
    Parse.Char('@').Once().Then(_ => Parse.CharExcept('\n').Many()).Text().Token();

public static Parser<string> Identifier =
    Parse.CharExcept(" ,):").Many().Text().Token();
With CharExcept we are parsing all characters except for the one(s) indicated, which is a handy but imprecise way to collect all the text of an identifier. The roughness of this process is obvious, because we are forced to exclude every character that can come after an identifier. If you look at the .plantuml file at the beginning of the article, you see that there is a space after the field names, a ‘}’ after the modifier static, a ‘:’ after the argument name, to separate the identifier from its type, and finally the closing parenthesis after the type. You might say that we should simply have checked for “Letters”, which would work in this specific case, but would exclude legal C# identifier names.
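To make the trade-off concrete, here is a minimal sketch that compares the Identifier parser with a letters-only alternative. The LettersOnly parser and the IdentifierDemo class are hypothetical names introduced only for this comparison; they are not part of the original project.

```csharp
using System;
using Sprache;

public static class IdentifierDemo
{
    // Same definition as in the article: stop at a space, a comma, ')' or ':'.
    public static readonly Parser<string> Identifier =
        Parse.CharExcept(" ,):").Many().Text().Token();

    // Hypothetical stricter variant, shown only for comparison.
    public static readonly Parser<string> LettersOnly =
        Parse.Letter.AtLeastOnce().Text().Token();

    public static void Main()
    {
        // Both parsers stop before the ':' that separates name and type.
        Console.WriteLine(Identifier.Parse("writer : TextWriter"));
        Console.WriteLine(LettersOnly.Parse("writer : TextWriter"));

        // '_' and digits are legal in C# names: Identifier keeps them,
        // while LettersOnly fails on the leading underscore.
        Console.WriteLine(Identifier.Parse("_indent2 : string"));
        Console.WriteLine(LettersOnly.TryParse("_indent2 : string").WasSuccessful);
    }
}
```

The price of the permissive version is that it also accepts text that is not a valid identifier at all, so it only works because we control what comes before and after it on each line.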
public static Parser<string> Modifier =
    Parse.Char('+').Once().Return("public")
    .Or(Parse.Char('-').Once().Return("private"))
    .Or(Parse.Char('#').Once().Return("protected"))
    .Or(from start in StartBracket.Then(_ => StartBracket).Once()
        from modifier in Parse.CharExcept('}').Many().Text().Token()
        from end in EndBracket.Then(_ => EndBracket).Once()
        select modifier
    )
    .Or(from start in LessThen.Then(_ => LessThen).Once()
        from modifier in Parse.CharExcept('>').Many().Text().Token()
        from end in GreaterThen.Then(_ => GreaterThen).Once()
        select modifier
    )
    .Text().Token();

public static Parser<Field> Field =
    from modifiers in Parse.Ref(() => Modifier).DelimitedBy(Parse.Char(' ').Many().Token()).Optional()
    from name in Identifier
    from delimeter in Parse.Char(':')
    from type in Identifier
    select new Field(name, type, modifiers.IsDefined ? modifiers.Get() : null);

public static Parser<Method> Method =
    from modifiers in Parse.Ref(() => Modifier).DelimitedBy(Parse.Char(' ').Many().Token()).Optional()
    from name in Parse.CharExcept('(').Many().Text().Token()
    from startArg in Parse.Char('(')
    from arguments in Parse.Ref(() => Field).DelimitedBy(Parse.Char(',').Many().Token()).Optional()
    from endArg in Parse.Char(')')
    from delimeter in Parse.String(" : ").Optional()
    from returnType in Identifier.Optional()
    select new Method(modifiers.IsDefined ? modifiers.Get() : null,
                      name,
                      arguments.IsDefined ? arguments.Get() : null,
                      returnType.IsDefined ? returnType.Get() : null);
The Modifier parser is quite uninteresting, except for the two CharExcept calls, where we meet the same problem just mentioned: we find the name by excluding the character that follows it. The last case, the one delimited by << and >>, refers to something that doesn’t happen in this example, but could happen in other UML diagrams: override modifiers. The real deal is in the Field parser. Its first line uses the Ref parser, which serves, as the documentation says, to: “Refer to another parser indirectly. This allows circular compile-time dependency between parsers”. DelimitedBy is used to select many of the same items separated by the specified rule, and finally Optional marks a rule that might appear, but isn’t required for the parse to succeed. Since the rule is optional, its value could be undefined, so it must be accessed through IsDefined and Get(), as shown in the select clause. The Method rule is slightly more complicated, but it uses the same methods. In case you are wondering, methods without a return type are constructors.
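The interplay between DelimitedBy and Optional is easier to see on a smaller grammar. What follows is a minimal sketch, not taken from the project: the Word and Arguments parsers and the OptionalDemo class are hypothetical names, and the grammar is just a parenthesized list of words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Sprache;

public static class OptionalDemo
{
    // A word made of letters, with any surrounding whitespace skipped.
    static readonly Parser<string> Word = Parse.Letter.AtLeastOnce().Text().Token();

    // Comma-separated words between parentheses.
    // DelimitedBy needs at least one item, so Optional() covers the empty list "()".
    public static readonly Parser<List<string>> Arguments =
        from open in Parse.Char('(')
        from items in Word.DelimitedBy(Parse.Char(',').Token()).Optional()
        from close in Parse.Char(')')
        // An optional result must be checked with IsDefined before calling Get().
        select items.IsDefined ? items.Get().ToList() : new List<string>();

    public static void Main()
    {
        Console.WriteLine(Arguments.Parse("(a, b, c)").Count); // three words
        Console.WriteLine(Arguments.Parse("()").Count);        // empty list
    }
}
```

Here the missing list is turned into an empty one, while the parsers in the article return null instead; either choice works as long as the consumer knows which one to expect.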
Parsing line by line
foreach (var line in lines)
{
    var attemptedClass = UmlParser.Class.TryParse(line);
    if (attemptedClass.WasSuccessful)
    {
        currentClass.Name = attemptedClass.Value;
    }

    var attemptedMethod = UmlParser.Method.TryParse(line);
    if (attemptedMethod.WasSuccessful)
    {
        currentClass.Declarations.Add(attemptedMethod.Value);
        continue;
    }

    var attemptedField = UmlParser.Field.TryParse(line);
    if (attemptedField.WasSuccessful)
    {
        currentClass.Declarations.Add(attemptedField.Value);
    }

    var attempted = UmlParser.EndBracket.TryParse(line);
    if (attempted.WasSuccessful)
    {
        currentClass.Generate();
        currentClass = new UmlClass(writer, (new DirectoryInfo(outputDir)).Name, Path.GetFileNameWithoutExtension(file));
    }
}
We can see our parsers at work in the main method, where we try to parse every line with every parser and, if one succeeds, we add the value to a custom type that we are going to see later. We need a custom type because code generation requires all the elements to be in place: we can’t generate line by line, at least not if we want to use the Roslyn formatter. We could just take the information and print it ourselves, which is good enough for a small project, but gets complicated for a larger one. We would also miss all the nice automatic formatting options. The continue statement after a successful method parse skips the rest of the iteration, because a method could also be parsed, improperly, as a field; to avoid the risk we jump to the next line.
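The custom declaration types are not listed in this article, so here is a hypothetical sketch of what they could look like, inferred only from how the parsers and the generator use them (the constructor argument order, and the Name, Type, Modifiers, Arguments and DeclarationType members). The Declaration base class is an assumption of this sketch; the real definitions are in the source code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: member names follow the usages in the article,
// everything else is an assumption.
public abstract class Declaration
{
    public string Name { get; protected set; }
    public string Type { get; protected set; }
    public List<string> Modifiers { get; protected set; }

    // "field" or "method": the generator switches on this string.
    public abstract string DeclarationType { get; }
}

public class Field : Declaration
{
    public Field(string name, string type, IEnumerable<string> modifiers)
    {
        Name = name;
        Type = type;
        // The parsers pass null when an optional part is missing.
        Modifiers = modifiers?.ToList() ?? new List<string>();
    }

    public override string DeclarationType => "field";
}

public class Method : Declaration
{
    public List<Field> Arguments { get; }

    public Method(IEnumerable<string> modifiers, string name,
                  IEnumerable<Field> arguments, string returnType)
    {
        Modifiers = modifiers?.ToList() ?? new List<string>();
        Name = name;
        Arguments = arguments?.ToList() ?? new List<Field>();
        Type = returnType; // null for constructors, which have no return type
    }

    public override string DeclarationType => "method";
}
```

Replacing the null values with empty lists in the constructors means that the generator can loop over Modifiers and Arguments without checking for null first.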
Code Generation
public void Generate()
{
    CompilationUnitSyntax cu = SyntaxFactory.CompilationUnit()
        .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.IdentifierName("System")))
        .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.IdentifierName("System.Collections.Generic")))
        .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.IdentifierName("System.Linq")))
        .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.IdentifierName("System.Text")))
        .AddUsings(SyntaxFactory.UsingDirective(SyntaxFactory.IdentifierName("System.Threading.Tasks")));

    NamespaceDeclarationSyntax localNamespace = SyntaxFactory.NamespaceDeclaration(SyntaxFactory.IdentifierName(directoryName));

    ClassDeclarationSyntax localClass = SyntaxFactory.ClassDeclaration(Name);
If you remember the first lessons about Roslyn, you know that it is quite verbose, which is the price of its power. You also have to remember that we can’t modify nodes, not even the ones we create ourselves rather than, say, parse from a file: every method that adds something returns a new node. Once you get used to going through SyntaxFactory for everything, it’s all quite straightforward; you just have to find the correct methods. The using directives are simply the ones usually inserted by default by Visual Studio.
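The fact that nodes cannot be modified is the reason why the code keeps reassigning its variables. A minimal sketch of what this means in practice follows; the ImmutabilityDemo class and the AddDepthField helper are hypothetical names used only for this illustration.

```csharp
using System;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class ImmutabilityDemo
{
    // Hypothetical helper: returns a copy of the class with one extra int field.
    public static ClassDeclarationSyntax AddDepthField(ClassDeclarationSyntax target)
    {
        return target.AddMembers(
            SyntaxFactory.FieldDeclaration(
                SyntaxFactory.VariableDeclaration(SyntaxFactory.ParseTypeName("int"))
                    .AddVariables(SyntaxFactory.VariableDeclarator("nestingDepth"))));
    }

    public static void Main()
    {
        ClassDeclarationSyntax empty = SyntaxFactory.ClassDeclaration("Skeleton");
        ClassDeclarationSyntax withField = AddDepthField(empty);

        // The original node is untouched; only the returned node has the member.
        Console.WriteLine(empty.Members.Count);     // 0
        Console.WriteLine(withField.Members.Count); // 1
    }
}
```

Forgetting to store the returned node is an easy mistake to make, and the compiler will not warn about it.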
Generation of methods
foreach (var member in Declarations)
{
    switch (member.DeclarationType)
    {
        case "method":
            var currentMethod = member as Method;

            MethodDeclarationSyntax method = SyntaxFactory.MethodDeclaration(
                SyntaxFactory.IdentifierName(SyntaxFactory.Identifier(currentMethod.Type)),
                currentMethod.Name);

            List<SyntaxToken> mods = new List<SyntaxToken>();
            foreach (var modifier in currentMethod.Modifiers)
                mods.Add(SyntaxFactory.ParseToken(modifier));
            method = method.AddModifiers(mods.ToArray());

            SeparatedSyntaxList<ParameterSyntax> ssl = SyntaxFactory.SeparatedList<ParameterSyntax>();
            foreach (var param in currentMethod.Arguments)
            {
                ParameterSyntax ps = SyntaxFactory.Parameter(
                    new SyntaxList<AttributeListSyntax>(),
                    new SyntaxTokenList(),
                    SyntaxFactory.IdentifierName(SyntaxFactory.Identifier(param.Type)),
                    SyntaxFactory.Identifier(param.Name),
                    null);
                ssl = ssl.Add(ps);
            }
            method = method.AddParameterListParameters(ssl.ToArray());

            ThrowStatementSyntax notReady = SyntaxFactory.ThrowStatement(
                SyntaxFactory.ObjectCreationExpression(
                    SyntaxFactory.IdentifierName("NotImplementedException"),
                    SyntaxFactory.ArgumentList(),
                    null));
            method = method.AddBodyStatements(notReady);

            localClass = localClass.AddMembers(method);
            break;
Let’s start by saying that Declarations and DeclarationType are members of our custom class, which is not shown here, but you can look at it in the source code. Then we proceed to generate the methods of our skeleton C# class. MethodDeclaration allows us to choose the name and the return type of the method itself; mods holds the modifiers, which obviously could be more than one, and so they are in a list. Then we create the parameters, which in our case need only a name and a type.
We choose to throw an exception, since we obviously cannot determine the body of the methods from the UML diagram alone. So we create a throw statement and a new object of the type NotImplementedException. This also allows us to add a meaningful body to the method. You should add a body in any case if you use the formatter, because otherwise it will not create a correct method: there won’t be a body or even the curly braces.
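The parser treats a method without a return type as a constructor, while the generation code shown here always builds a MethodDeclaration. One possible way to emit a real constructor, given only as a hedged sketch, is to use SyntaxFactory.ConstructorDeclaration. The BuildConstructor helper and its single-parameter signature are hypothetical simplifications, not the approach taken in the project.

```csharp
using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class ConstructorDemo
{
    // Hypothetical helper: a public constructor with one parameter
    // and a body that throws NotImplementedException.
    public static ConstructorDeclarationSyntax BuildConstructor(
        string className, string paramType, string paramName)
    {
        ParameterSyntax parameter = SyntaxFactory
            .Parameter(SyntaxFactory.Identifier(paramName))
            .WithType(SyntaxFactory.ParseTypeName(paramType));

        // Same throw statement used for the methods in the article.
        ThrowStatementSyntax notReady = SyntaxFactory.ThrowStatement(
            SyntaxFactory.ObjectCreationExpression(
                SyntaxFactory.IdentifierName("NotImplementedException"),
                SyntaxFactory.ArgumentList(),
                null));

        return SyntaxFactory.ConstructorDeclaration(className)
            .AddModifiers(SyntaxFactory.Token(SyntaxKind.PublicKeyword))
            .AddParameterListParameters(parameter)
            .AddBodyStatements(notReady);
    }

    public static void Main()
    {
        var ctor = BuildConstructor("ClassDiagramGenerator", "TextWriter", "writer");
        // NormalizeWhitespace adds the spaces and line breaks the factory methods omit.
        Console.WriteLine(ctor.NormalizeWhitespace().ToFullString());
    }
}
```

In the generator, a check such as a null return type could select between this helper and the MethodDeclaration path, but where exactly to put that check depends on the custom class, which is not shown.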
Generation of fields
svd = svd.Add(SyntaxFactory.VariableDeclarator(currentField.Name));
The “field” case is easier than the “method” one, and the only really new thing is the use of a method that parses the type from a string filled in by our parser.
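Since only one line of the “field” case appears above, here is a hedged sketch of how a field can be built from strings. It assumes that the method that parses the type is SyntaxFactory.ParseTypeName, which is a guess based on the description; the BuildField helper and the FieldDemo class are hypothetical names.

```csharp
using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class FieldDemo
{
    // Hypothetical helper: turns (modifier, type, name) strings into a field declaration.
    public static FieldDeclarationSyntax BuildField(string modifier, string type, string name)
    {
        // Assumption: ParseTypeName is the call the article refers to.
        // It also understands generic and array types such as "List<string>" or "int[]".
        TypeSyntax typeSyntax = SyntaxFactory.ParseTypeName(type);

        VariableDeclarationSyntax declaration = SyntaxFactory
            .VariableDeclaration(typeSyntax)
            .AddVariables(SyntaxFactory.VariableDeclarator(name));

        return SyntaxFactory.FieldDeclaration(declaration)
            .AddModifiers(SyntaxFactory.ParseToken(modifier));
    }

    public static void Main()
    {
        var field = BuildField("private", "TextWriter", "writer");
        // NormalizeWhitespace adds the spaces the factory methods omit.
        Console.WriteLine(field.NormalizeWhitespace().ToFullString());
    }
}
```

Parsing the type, instead of wrapping the string in an IdentifierName as done for the method parameters, gives a proper type node even when the UML diagram contains something more complex than a simple name.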
    localNamespace = localNamespace.AddMembers(localClass);
    cu = cu.AddMembers(localNamespace);

    AdhocWorkspace cw = new AdhocWorkspace();
    // OptionSet is immutable: WithChangedOption returns a new set, so we keep the result.
    OptionSet options = cw.Options.WithChangedOption(CSharpFormattingOptions.IndentBraces, true);
    SyntaxNode formattedNode = Formatter.Format(cu, cw, options);

    formattedNode.WriteTo(writer);
}
The end of the Generate method is where we add the class built by the foreach loop to the namespace, and use the Formatter. Notice that cu is the CompilationUnitSyntax that we created at the beginning of this method.
Limitations of this example
The unit tests are not shown because they don’t contain anything worth noting, although I have to say that Sprache is really easy to test, which is a great thing. If you run the program you will find that the generated code is correct, but it’s still missing something. It lacks some of the necessary using directives, because we can’t detect them starting from the UML diagram alone. In a real-life scenario, with many files and classes and without the original source code, you might identify the assemblies beforehand and then use reflection to find the namespace(s) of the types. Also, we obviously don’t implement many things that PlantUML has, such as the relationships between classes, so keep that in mind.
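The reflection idea can be sketched in a few lines. This is only an illustration under the assumption that we already know which assemblies to inspect; the FindNamespaces helper and the NamespaceFinder class are hypothetical and not part of the project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class NamespaceFinder
{
    // Hypothetical helper: namespaces that declare a public type with the given simple name.
    public static IEnumerable<string> FindNamespaces(
        string typeName, IEnumerable<Assembly> assemblies)
    {
        return assemblies
            .SelectMany(a => a.GetExportedTypes())
            .Where(t => t.Name == typeName && t.Namespace != null)
            .Select(t => t.Namespace)
            .Distinct();
    }

    public static void Main()
    {
        // Assumption: the assembly that contains the type is known beforehand.
        var assemblies = new[] { typeof(System.IO.TextWriter).Assembly };

        // Each result is a candidate for a using directive in the generated file.
        foreach (var ns in FindNamespaces("TextWriter", assemblies))
            Console.WriteLine(ns);
    }
}
```

When a simple name turns up in more than one namespace, the tool cannot decide by itself which using directive is the right one, so some human input would still be needed.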
Conclusions
Code generation with Roslyn is not hard, but it requires you to know exactly what you are doing. It’s better to have an idea of the code you are generating beforehand, or you will have to take into account every possible case, which would make every little step hard to accomplish. I think it works best for specific scenarios and short pieces of code, for which it can become very useful. In such cases, you can create tools that are useful and productive for your project, or yourself, in a very short period of time and benefit from them, as long as you don’t change tools or work habits. For instance, if you are a professor, you could create an automatic code generator to translate your pseudo-code of a short algorithm into real C#. If you think about it, this complexity is a good thing: if anybody could generate whole programs from scratch, we programmers would lose our jobs.
You might think that using Sprache for such a project was a bad idea, but it’s actually a good tool for parsing single lines. And while there are limitations, this approach makes it much easier to get something working in little time, instead of waiting to create a complete grammar for a “real” parser. For the cases in which code generation is most useful, specific scenarios and such, this is actually the best approach, in my opinion, since it allows you to easily pick and choose which parts to use and just skip the rest.
Reference: Code Generation with Roslyn: a Skeleton Class from UML, from our JCG partner Federico Tomassetti at the Federico Tomassetti blog.