SrcML.NET: Speedy, Good Enough, Multi-Language Program Analysis for Software Tools

Phil: Claire, it’s time we upgraded to the latest version of Microsoft’s compiler. It looks like the new operator throws exceptions when it fails now. How long will it take us to adapt our code?

Claire: I don’t know — let me see how we’re doing it now.

Back at her desk, Claire tries to see how new, an oft-used operator in C++, is used in their large codebase. Claire might do the following to answer Phil’s question:

  1. Use “find-in-files” to find all uses of the new operator
  2. Click on the first few results
  3. Realize they’re comments
  4. Click on a few more results
  5. Report to Phil that there are at least 500 uses of new, but some of them are just comments, or the word “new” appearing in a variable or function name.

SrcML.NET is a C# framework that I’ve been developing at ABB Corporate Research that can help answer questions like Phil’s. It is a program analysis framework that focuses on:

  1. Speed: code operations should be fast so you can run them all the time (for instance, every time you save a file).
  2. Ease of use: it should be easy to develop new queries and run them over your source code.
  3. Good enough: analysis should be mostly correct, considering we’re not actually compiling the code.
  4. Multi-language: it should support multiple languages out of the box and provide common analysis tools where appropriate.

The first tool we built on top of SrcML.NET is a Visual Studio plugin for lightweight program transformations.

SrcML is an XML representation for source code developed at Kent State University. It wraps all of the program elements in your source code with XML elements. For instance, all if statements are wrapped with <if> elements and all functions are wrapped with <function> elements. The utilities provided by srcML annotate all code constructs for the supported languages without making any changes to the source code. All of the original structure in the code (preprocessor macros, whitespace, comments, and names) is preserved. By not relying on compilation, srcML lets us understand and modify source code very quickly by processing the resulting XML.

LINQ is a language enhancement that was added to C# as part of .NET 3.5. It provides a SQL-like syntax with full IntelliSense support for querying a variety of data sources. What we’re most interested in is that LINQ can be used to query XML documents.

I know what you’re thinking:

I get to combine the conciseness of XML with the easy learning curve of SQL.

When I started exploring this space, I had roughly the same reaction. I changed my tune after using them together for two key scenarios.
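To make the pairing concrete, here is a minimal sketch that uses plain LINQ to XML (System.Xml.Linq) rather than the SrcML.NET API. The markup is hand-written and simplified, with no namespaces, so it only approximates what the srcML tools emit. The class name SrcMLSketch and the sample function are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SrcMLSketch
{
    // Hand-written approximation of srcML-style markup for:
    //   int max(int a, int b) { if (a > b) return a; return b; }
    public const string Markup =
        "<unit><function><type><name>int</name></type> <name>max</name>" +
        "(int a, int b) { <if>if (a &gt; b) return a;</if> return b; }" +
        "</function></unit>";

    // PreserveWhitespace keeps whitespace-only text nodes, so the
    // original source text survives the round trip.
    public static XElement Load() =>
        XElement.Parse(Markup, LoadOptions.PreserveWhitespace);

    // A function's name is the <name> that is a direct child of <function>.
    public static string[] FunctionNames(XElement unit) =>
        (from fn in unit.Descendants("function")
         select fn.Element("name").Value).ToArray();

    // Count the if statements by counting <if> elements.
    public static int CountIfs(XElement unit) =>
        unit.Descendants("if").Count();

    public static void Main()
    {
        var unit = Load();
        Console.WriteLine(string.Join(", ", FunctionNames(unit)));
        Console.WriteLine(CountIfs(unit));
        // The concatenated text content is the original source code.
        Console.WriteLine(unit.Value);
    }
}
```

The point of the sketch is the last line of Main: because the markup only wraps the code, reading the text content back gives you the source unchanged, and the queries are ordinary LINQ.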

Example: Lightweight Program Transformation

We first started using srcML as the platform for a lightweight program transformation tool. There are a lot of program transformation tools available. For our tool, we wanted something that:

  1. Allows developers to experiment with transformations.
  2. Lets developers modify specific parts of their source code without touching anything else.
  3. Allows developers to implement transformations in a natural, easy to understand way.
  4. Supports C & C++ code. Other languages are a bonus. Good C & C++ parsing is a must.

SrcML supports C, C++, Objective-C, and Java out of the box, and the srcML developers are actively working on other languages. LINQ lets us write natural-looking queries that find the code patterns developers are interested in. By providing a Visual Studio Add-In that lets us easily test transformations, we can enable the following workflow:

  1. Define a query: the query finds instances that need changing. Refine by hitting the “test” button.
  2. Define a transform: The transform modifies each query result. Refine by hitting the “test” button.
  3. Execute the transformation: run the complete transformation on a source code tree.

As an example, here’s a snippet of LINQ that finds all uses of the new operator:

var newUses = from unit in project.FileUnits             // iterate over all files
              from use in unit.Descendants(OP.Operator)  // find all operators
              where use.Value == "new"                   // where the value is "new"
              select use;
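The query above relies on SrcML.NET types (project.FileUnits, OP.Operator). As a rough, self-contained sketch of the same query-then-transform workflow, the version below uses only System.Xml.Linq on hand-written markup. The namespace URI, the class and method names, and the std::nothrow rewrite are all assumptions chosen for illustration; they are not taken from SrcML.NET or from the tool described here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NewOperatorSketch
{
    // Placeholder namespace for this sketch; real srcML output
    // declares its own namespaces.
    static readonly XNamespace Op = "urn:example:operator";

    // Hand-written, simplified markup for:
    //   // new buffer
    //   int* newData = new int[10];
    public const string Markup =
        "<unit xmlns:op='urn:example:operator'>" +
        "<comment>// new buffer</comment>\n" +
        "<decl_stmt><type><name>int</name>*</type> <name>newData</name> = " +
        "<op:operator>new</op:operator> <name>int</name>[10];</decl_stmt></unit>";

    public static XElement Load() =>
        XElement.Parse(Markup, LoadOptions.PreserveWhitespace);

    // Step 1, the query: only operator elements are inspected, so the
    // word "new" in the comment and in the name newData is ignored.
    public static List<XElement> FindNewUses(XElement unit) =>
        (from use in unit.Descendants(Op + "operator")
         where use.Value == "new"
         select use).ToList();

    // Step 2, the transform (hypothetical): rewrite each match in place.
    // The list was materialized first, so editing the tree is safe.
    public static void UseNothrow(IEnumerable<XElement> uses)
    {
        foreach (var use in uses)
            use.Value = "new (std::nothrow)";
    }

    public static void Main()
    {
        var unit = Load();
        var uses = FindNewUses(unit);
        Console.WriteLine(uses.Count);   // prints 1
        UseNothrow(uses);
        Console.WriteLine(unit.Value);   // the modified source text
    }
}
```

Compared with find-in-files, the query returns one result instead of three, and the transform touches only that operator while the comment, the variable name, and the whitespace stay exactly as they were.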

Example: Code Analysis

The queries implemented as part of program transformations led us to the realization that srcML and SrcML.NET are well suited to writing useful queries over source code.

These queries can answer two types of questions. The first type covers the navigation aids that IDEs typically provide (such as Visual Studio’s IntelliSense). Examples include:

  • What are all of the variable names in my program?
  • What is the type of variable A?
  • Where is variable B used?
  • What are the callers and callees of function C?

The second type of question is more along the lines of “metrics.” Examples include:

  • How often is language feature X used?
  • Is method Y always called before method Z?
  • What global variables are used across multiple namespaces?
  • How many functions are updated per changeset?

While there are other tools that can answer these questions, the combination of LINQ and srcML means that we can do it quickly, without compilation, and with a high degree of accuracy. This means we can make this information available in developer tools or as part of an automated build.
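As a small illustration of the metrics side, the sketch below counts how often one language feature is used in each file. It again uses plain LINQ to XML on hand-written, namespace-free markup that only approximates srcML output; the FeatureMetrics class and the file names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeatureMetrics
{
    // Two hand-written units standing in for a marked-up source tree.
    public const string Markup =
        "<project>" +
        "<unit filename='a.cpp'><while>while (x) { <if>if (y) f();</if> }</while>" +
        "<if>if (z) g();</if></unit>" +
        "<unit filename='b.cpp'><for>for (;;) { }</for></unit>" +
        "</project>";

    // For each unit, count the elements named `feature` (for example "if").
    public static Dictionary<string, int> UsesPerFile(XElement project, string feature) =>
        (from unit in project.Elements("unit")
         select new
         {
             File = (string)unit.Attribute("filename"),
             Count = unit.Descendants(feature).Count()
         }).ToDictionary(x => x.File, x => x.Count);

    public static void Main()
    {
        var project = XElement.Parse(Markup, LoadOptions.PreserveWhitespace);
        foreach (var entry in UsesPerFile(project, "if"))
            Console.WriteLine(entry.Key + ": " + entry.Value);
    }
}
```

Because nothing is compiled, a count like this is cheap enough to run on every save or as a step in an automated build.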

Now Open Source!

Come join us! Some of these ideas are untested or only exist as a prototype! If you’re interested in developing a fast, multi-language tool for program analysis, download the code and see what it does! Did you see my tool demonstration at FSE 2012? You can get the code I used in the demonstration here.