SrcML.NET Roadmap

At ABB Corporate Research, we use SrcML.NET to perform lightweight code analysis for a variety of tasks ranging from experimentation to powering tools. It hasn’t seen many updates recently. In this post, I’m going to lay out what we’d like to do with SrcML.NET. Each task will get a GitHub issue. The tasks are broken into three categories: organization, code modernization, and new features.

Organization

The SrcML.NET repository is currently organized as a single solution with a number of sub-projects. The projects range from core libraries (ABB.SrcML and ABB.SrcML.Data), to tools (Src2SrcMLPreview), to Visual Studio integration (the program transformation VS add-in and the SrcML Service VSIX package). The problem with this layout is that it makes it very difficult to on-board new project members. In particular, including the Visual Studio libraries means that we need to have VS 2010, 2012, and 2013 installed along with their respective VS SDKs. This is incredibly frustrating for someone who just wants to work on one of the core libraries or standalone tools. The solution is to split the monolithic solution into several smaller solutions, spread across a few different repositories:

  • The main ABB.SrcML repository will have the core library (ABB.SrcML) and associated test libraries. It will also include ABB.SrcML.Data and the command line runner for generating srcML archives. This limited set of core projects means that we can start thinking about making SrcML.NET platform independent (Linux/Mac support via Mono).
  • The Src2SrcMLPreview tool is a small GUI tool that allows users to visualize how srcML for a source code snippet is structured. While I would like to package this with the “core” repository, I believe platform independence is a more important goal, so this tool will be kept out of the core repository.
  • The Visual Studio projects will each get their own solution: the SrcML service (and associated test projects) and the program transformation add-in.

Additionally, the srcML executables that we depend on are included as part of the ABB.SrcML project file. Other projects (such as the SrcML Service) must have a manual link to those files in order to include them. Instead, we would like to pull those libraries and executables out of ABB.SrcML and package them into their own NuGet package. This way, projects that need them can explicitly declare that dependency. We can also let individual packages depend on different versions of the Kent State executables. For instance, ABB.SrcML doesn’t care about changes in how srcML parses source code; it only cares if the command line tools or library APIs change. However, ABB.SrcML.Data is very dependent on how source code is parsed. Packaging the srcML binaries separately will allow us to manage these relationships more effectively.

Right now, the project has a coding standard defined in the wiki. However, not many people know about it. This has led to inconsistency in the codebase. I want to look at EditorConfig and/or StyleCop in order to automatically enforce these guidelines. Each solution will include these files.

Code Modernization

Tasks in this category aren’t really new features. However, they should allow the codebase to be more easily understandable.

Right now, there’s a lot of code in ABB.SrcML, ABB.SrcML.Data, and the SrcML Service devoted to monitoring something and then routing those monitoring events to different archives. There are monitors for file systems, Visual Studio, and archives. While the code itself isn’t terribly complicated, the interplay of the different sources and their monitors can be hard to understand. It would be nice to explore existing, well-maintained libraries for managing these types of relationships. That way, we can focus SrcML.NET on creating and managing srcML instead of event routing. One possible library is the Reactive Extensions Library.

We also want to make ABB.SrcML and ABB.SrcML.Data platform agnostic: the core srcML functionality should run equally well on Windows, Linux, and Mac OS X. This issue should also modify the NuGet package created in #67 so that it works on these different platforms.

New Features

Tasks in this category are new features that we can work on once tasks in the previous two sections have either been completed or have seen significant progress.

Item 1 is to improve the public-facing APIs for ABB.SrcML.Data. It is currently very difficult to manage the lifecycle of objects returned from data queries. One avenue to explore is using an HTTP-based front end for submitting and answering queries. For example, OmniSharp used NancyFx to provide an HTTP front end.

Item 2 is to improve the call graph query code. The call graph currently works by creating a large structure in memory on which method calls and object references can be built. The code that keeps this structure up to date (in response to file changes, for instance) is very complicated. We should look at doing name resolution on individual data files through something like a reverse index.
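To make the reverse index idea concrete, here is a minimal sketch of one possible shape for it. This is not how ABB.SrcML.Data works today; the NameIndex class and its method names are hypothetical, and it assumes each data file can report the names it declares. The point is that a file change only touches that file's own entries instead of a large shared structure.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of a reverse index: name -> files that declare it.
public sealed class NameIndex
{
    private readonly Dictionary<string, HashSet<string>> filesByName =
        new Dictionary<string, HashSet<string>>();

    // Forward map (file -> names) so a changed file can be removed cheaply.
    private readonly Dictionary<string, HashSet<string>> namesByFile =
        new Dictionary<string, HashSet<string>>();

    // Replace everything previously recorded for this file.
    public void Update(string file, IEnumerable<string> declaredNames)
    {
        Remove(file);
        var names = new HashSet<string>(declaredNames);
        namesByFile[file] = names;
        foreach (var name in names)
        {
            if (!filesByName.TryGetValue(name, out var files))
            {
                files = new HashSet<string>();
                filesByName[name] = files;
            }
            files.Add(file);
        }
    }

    // Drop a file's entries without touching any other file's data.
    public void Remove(string file)
    {
        if (!namesByFile.TryGetValue(file, out var names)) return;
        foreach (var name in names)
        {
            var files = filesByName[name];
            files.Remove(file);
            if (files.Count == 0) filesByName.Remove(name);
        }
        namesByFile.Remove(file);
    }

    // Candidate files to inspect when resolving a name.
    public IReadOnlyCollection<string> FilesDeclaring(string name)
    {
        return filesByName.TryGetValue(name, out var files)
            ? files.OrderBy(f => f).ToList()
            : new List<string>();
    }
}
```

Resolving a call would then mean asking the index which files declare the name and loading only those files' data, rather than walking one in-memory graph.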

Item 3 is to implement more accurate expression parsing. Currently, ABB.SrcML.Data’s handling of expressions is very basic and largely mirrors how srcML stores expressions. This issue should look at making our expressions reflect an actual expression tree. This should improve the accuracy of name resolution and the call graph.

Conclusions

These changes should improve the accuracy and performance of SrcML.NET while making the codebase easier to maintain. If you’re interested, comment on one of the issues to get started!

SrcML.NET: Speedy, Good Enough, Multi-Language Program Analysis for Software Tools

Phil: Claire, it’s time we upgraded to the latest version of Microsoft’s compiler. It looks like the new operator throws exceptions when it fails now. How long will it take us to adapt our code?

Claire: I don’t know — let me see how we’re doing it now.

Back at her desk, Claire tries to see how new, an oft-used operator in C++, is used in their large codebase. Claire might do the following to answer Phil’s question:

  1. Use “find-in-files” to find all uses of the new operator
  2. Click on the first few results
  3. Realize they’re comments
  4. Click on a few more results
  5. Report to Phil that there are at least 500 uses of new, but some of them are just comments, or the word “new” appearing in a variable or function name.

SrcML.NET is a C# framework that I’ve been developing at ABB Corporate Research that can help answer questions like Phil’s. It is a program analysis framework that focuses on:

  1. Speed: we want code operations to be fast so you can do them all the time (for instance: every time you save a file)
  2. Ease of use: it should be easy to develop new queries and run them over your source code.
  3. Good enough: Analysis should be mostly correct considering we’re not actually compiling the code.
  4. Multi-language: it should support multiple languages out of the box and provide common analysis tools where appropriate.

The first tool we built on top of SrcML.NET is a Visual Studio plugin for lightweight program transformations.

srcML is an XML representation for source code developed at Kent State University. It wraps all of the program elements in your source code with XML elements. For instance, all if statements are wrapped with <if> and all functions are wrapped with <function>...</function>. The utilities provided by srcML annotate all code constructs for the supported languages without making any changes to the source code. All of the original structure in the code (preprocessor macros, whitespace, comments, and names) is preserved. By not relying on compilation, srcML lets us understand and modify source code very quickly by processing the resulting XML.
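To illustrate the “nothing is lost” property, here is a small sketch using only System.Xml.Linq. The markup in Sample is hand-written and simplified for this post, not real srcML output, and the namespace URI is the one I recall from srcML releases of this period, so treat both as assumptions. The SrcMLRoundTrip class is a made-up name for the example.

```csharp
using System.Xml.Linq;

public static class SrcMLRoundTrip
{
    // Hand-written, simplified markup (an assumption); real srcML output
    // contains more attributes and finer-grained elements.
    public const string Sample =
        "<unit xmlns=\"http://www.sdml.info/srcML/src\" language=\"C++\">" +
        "<comment type=\"line\">// make a new widget</comment>\n" +
        "<if>if (ready) <block>{ w = new Widget(); }</block></if>\n" +
        "</unit>";

    // Concatenating every text node in document order drops the tags and
    // leaves the source text, whitespace and comments included.
    public static string RecoverSource(string srcml)
    {
        return XElement.Parse(srcml, LoadOptions.PreserveWhitespace).Value;
    }
}
```

Calling SrcMLRoundTrip.RecoverSource(SrcMLRoundTrip.Sample) gives back the two original lines of C++, comment included. LoadOptions.PreserveWhitespace matters here: without it, whitespace-only text between elements (such as the newline after the comment) would be discarded by the parser.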

LINQ is a language enhancement that was added to C# as part of .NET 3.5. It provides a SQL-like syntax with full IntelliSense support for querying a variety of data sources. What we’re most interested in is that LINQ can be used to query XML documents.

I know what you’re thinking:

I get to combine the conciseness of XML with the easy learning curve of SQL.

When I started exploring this space, I had roughly the same reaction. I changed my tune after using them together for two key scenarios.

Example: Lightweight Program Transformation

We first started using srcML as the platform for a lightweight program transformation tool. There are a lot of program transformation tools available. For our tool, we wanted something that:

  1. Allows developers to experiment with transformations.
  2. Lets developers modify specific parts of their source code without touching anything else.
  3. Allows developers to implement transformations in a natural, easy to understand way.
  4. Supports C & C++ code. Other languages are a bonus. Good C & C++ parsing is a must.

srcML supports C, C++, Objective-C, and Java out of the box. In addition, the srcML developers are actively working on other languages. LINQ lets us write natural looking queries that let developers find code patterns they’re interested in. By providing a Visual Studio Add-In that lets us easily test transformations, we can enable the following workflow:

  1. Define a query: the query finds instances that need changing. Refine by hitting the “test” button.
  2. Define a transform: The transform modifies each query result. Refine by hitting the “test” button.
  3. Execute the transformation: run the complete transformation on a source code tree.

As an example, here’s a snippet of LINQ that finds all uses of the new operator:

var newUses = from unit in project.FileUnits             // iterate over all files
              from use in unit.Descendants(OP.Operator)  // find all operators
              where use.Value == "new"                   // where the value is "new"
              select use;
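If you want to try the same idea without SrcML.NET installed, the sketch below runs an equivalent query over an in-memory XML fragment with plain System.Xml.Linq. The fragment is hand-written and simplified, the namespace URIs are the ones I recall from srcML releases of this period, and NewOperatorQuery is a made-up class name, so treat all three as assumptions rather than as the SrcML.NET API.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NewOperatorQuery
{
    // Assumed namespace for operator markup; check it against your srcML version.
    private static readonly XNamespace OP = "http://www.sdml.info/srcML/operator";

    // Hand-written sample: "new" appears in a comment, inside a variable
    // name, and once as an actual operator.
    public const string Sample =
        @"<unit xmlns=""http://www.sdml.info/srcML/src"" xmlns:op=""http://www.sdml.info/srcML/operator"">
<comment type=""line"">// allocate a new widget</comment>
<expr_stmt><expr><name>newest</name> <op:operator>=</op:operator> <op:operator>new</op:operator> <name>Widget</name></expr>;</expr_stmt>
</unit>";

    // Only elements marked up as operators are considered, so comments and
    // identifiers containing "new" are never matched.
    public static List<XElement> FindNewUses(XElement unit)
    {
        var uses = from use in unit.Descendants(OP + "operator")
                   where use.Value == "new"
                   select use;
        return uses.ToList();
    }
}
```

Running FindNewUses over XElement.Parse(NewOperatorQuery.Sample) returns a single element, even though a plain text search for “new” would report three hits. That difference is exactly what tripped up Claire's find-in-files approach.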

Example: Code Analysis

The queries implemented as part of program transformations led us to the realization that srcML and SrcML.NET were well suited to writing useful queries over source code.

These queries can be used to answer two types of questions. The first type covers navigation aids that are typically provided by IDEs (such as Visual Studio’s IntelliSense). Examples of this include:

  • What are all of the variable names in my program?
  • What is the type of variable A?
  • Where is variable B used?
  • What are the callers and callees of function C?

The second type of question is more along the lines of “metrics.” Examples of this are:

  • How often is language feature X used?
  • Is method Y always called before method Z?
  • What global variables are used across multiple namespaces?
  • How many functions are updated per changeset?

While there are other tools that can answer these questions, the combination of LINQ and srcML means that we can do it quickly, without compilation, and with a high degree of accuracy. This means we can make this information available in developer tools or as part of an automated build.
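As a rough sketch of the “how often is language feature X used?” kind of metric, the snippet below tallies element names in a srcML-style document using plain LINQ to XML. The sample markup is hand-written and much simpler than real srcML output, the namespace URI is an assumption, and FeatureUsage is a hypothetical name, not part of SrcML.NET.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeatureUsage
{
    // Hand-written, simplified srcML-style markup (an assumption for this sketch).
    public const string Sample =
        @"<unit xmlns=""http://www.sdml.info/srcML/src"">
<if>if (a) <block>{ <while>while (b) <block>{ }</block></while> }</block></if>
<if>if (c) <block>{ }</block></if>
</unit>";

    // Tally how many times each element (language construct) appears.
    public static Dictionary<string, int> CountConstructs(XElement unit)
    {
        return unit.Descendants()
                   .GroupBy(element => element.Name.LocalName)
                   .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

For the sample above, the resulting dictionary maps “if” to 2, “while” to 1, and “block” to 3. Because this is a single pass over XML that already exists, it is cheap enough to run on every save or as a build step.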

Now Open Source!

Come join us! Some of these ideas are untested or only exist as a prototype! If you’re interested in developing a fast, multi-language tool for program analysis, download the code and see what it does! Did you see my tool demonstration at FSE 2012? You can get the code I used in the demonstration here.