At ABB Corporate Research, we use SrcML.NET to perform lightweight code analysis for a variety of tasks, ranging from experimentation to powering tools. The project hasn’t seen many updates recently. In this post, I’m going to lay out what we’d like to do with SrcML.NET. Each task will get a GitHub issue, and the tasks are broken into three categories: organization, code modernization, and new features.

Organization

The SrcML.NET repository is currently organized as a single solution with a number of different sub-projects. The projects range from core libraries (ABB.SrcML and ABB.SrcML.Data), to tools (Src2SrcMLPreview), to Visual Studio integration (the program transformation VS add-in and the SrcML Service VSIX package). The problem with this layout is that it makes it very difficult to onboard new project members. In particular, including the Visual Studio libraries means that we need to have VS 2010, 2012, & 2013 installed along with their respective VS SDKs. This is incredibly frustrating for someone who just wants to work on one of the core libraries or standalone tools. The fix is to split the monolithic solution into several smaller ones. There will be a few different repositories:

  • The main ABB.SrcML repository will have the core library (ABB.SrcML) and associated test libraries. It will also include ABB.SrcML.Data and the command line runner for generating srcML archives. This limited set of core projects means that we can start thinking about making SrcML.NET platform independent (Linux/Mac support via Mono).
  • The Src2SrcMLPreview tool is a small GUI tool that allows users to visualize how the srcML for a source code snippet is structured. While I would like to package this with the “core” repository, I believe platform independence is a more important goal, so this tool will get its own repository.
  • The Visual Studio projects will each get their own solution: the SrcML service (and associated test projects) and the program transformation add-in.

Additionally, the srcML executables that we depend on are included as part of the ABB.SrcML project file. Other projects (such as the SrcML Service) must manually link to those files in order to include them. Instead, we would like to pull those libraries and executables out of ABB.SrcML and package them in their own NuGet package. This way, projects that need them can declare the dependency explicitly, and individual packages can depend on different versions of the Kent State executables. For instance, ABB.SrcML doesn’t care about changes in how srcML parses source code; it only cares whether the command line tools or library APIs change. ABB.SrcML.Data, however, is very dependent on how source code is parsed. Packaging the srcML binaries separately will let us manage these relationships more effectively.

Right now, the project has a coding standard defined in the wiki. However, not many people know about it, which has led to inconsistency in the codebase. I want to look at EditorConfig and/or StyleCop to enforce these guidelines automatically. Each solution will include these files.
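As a rough illustration, a minimal .editorconfig might look like the following. The specific settings here are placeholders; the real values would come from the coding standard on the wiki.

root = true

[*.cs]
indent_style = space
indent_size = 4
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true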

GitHub Issues

Code Modernization

Tasks in this category aren’t really new features. However, they should make the codebase easier to understand.

Right now, there’s a lot of code in ABB.SrcML, ABB.SrcML.Data, and the SrcML Service devoted to monitoring something and then routing those monitoring events to different archives. There are monitors for file systems, Visual Studio, and archives. While the code itself isn’t terribly complicated, the interplay of the different sources and their monitors can be hard to understand. It would be nice to adopt an existing, well-maintained library for managing these kinds of event relationships, so that SrcML.NET can focus on creating and managing srcML instead of event routing. One possible library is Reactive Extensions (Rx).
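As a sketch of what this could look like with Rx.NET: UpdateArchive below is a hypothetical stand-in for whatever regenerates the srcML for a changed file, not part of the current API.

using System;
using System.IO;
using System.Reactive.Linq;

class MonitorSketch
{
    static void UpdateArchive(string path)
    {
        // Hypothetical: regenerate the srcML for the changed file here.
        Console.WriteLine("updating srcML for " + path);
    }

    static void Main()
    {
        var watcher = new FileSystemWatcher(@"C:\source\project") { EnableRaisingEvents = true };

        // Turn file-change events into an observable stream; Throttle coalesces
        // bursts of rapid saves (a real version would group by file first).
        var changes = Observable
            .FromEventPattern<FileSystemEventArgs>(watcher, "Changed")
            .Select(e => e.EventArgs.FullPath)
            .Throttle(TimeSpan.FromSeconds(1));

        using (changes.Subscribe(UpdateArchive))
        {
            Console.ReadLine();    // keep monitoring until enter is pressed
        }
    }
}

The appeal is that the routing logic becomes a declarative pipeline instead of hand-written event plumbing spread across several monitor classes.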

Make ABB.SrcML and ABB.SrcML.Data platform agnostic. The core srcML functionality should run equally well on Windows, Linux, and Mac OS X. This issue should also modify the NuGet package created in #67 so that it works on these different platforms.

GitHub Issues

New Features

Tasks in this category are new features that we can work on once tasks in the previous two sections have either been completed or have seen significant progress.

Item 1 is to improve the public-facing APIs for ABB.SrcML.Data. It is currently very difficult to manage the lifecycle of the objects returned from data queries. One avenue to explore is an HTTP-based front end for submitting and answering queries; for example, OmniSharp used NancyFx to provide an HTTP front end.
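A NancyFx module wrapping a query might look roughly like this. The route and the FindCallers helper are hypothetical; they are not part of the current ABB.SrcML.Data API.

using Nancy;

public class CallGraphModule : NancyModule
{
    public CallGraphModule()
    {
        // GET /callers/Foo returns the callers of method Foo as JSON.
        Get["/callers/{method}"] = parameters =>
        {
            string methodName = parameters.method;
            return Response.AsJson(FindCallers(methodName));
        };
    }

    private static string[] FindCallers(string methodName)
    {
        // Stand-in: a real implementation would query ABB.SrcML.Data.
        return new string[0];
    }
}

A front end like this keeps the long-lived query objects inside one server process, so client tools never have to manage their lifecycles directly.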

Item 2 is to improve the call graph query code. The call graph currently works by building a large in-memory structure against which method calls and object references are resolved. The code that keeps this structure up to date (in response to file changes, for instance) is very complicated. We should look at doing name resolution on individual data files through something like a reverse index.
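Here is a minimal sketch of the reverse-index idea, with all names hypothetical: map each method name to the locations that declare it, so resolving a call becomes a dictionary lookup, and a file change only touches that file’s entries.

using System.Collections.Generic;

public class DeclarationIndex
{
    private readonly Dictionary<string, List<string>> index =
        new Dictionary<string, List<string>>();

    // Called while processing a single data file.
    public void AddDeclaration(string methodName, string location)
    {
        List<string> locations;
        if (!index.TryGetValue(methodName, out locations))
        {
            index[methodName] = locations = new List<string>();
        }
        locations.Add(location);
    }

    // Resolving a call site is a lookup instead of a traversal of a
    // large in-memory graph.
    public IEnumerable<string> ResolveCall(string methodName)
    {
        List<string> locations;
        return index.TryGetValue(methodName, out locations)
            ? (IEnumerable<string>)locations
            : new string[0];
    }
}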

Item 3 is to implement more accurate expression parsing. Currently, the SrcML.Data handling of expressions is fairly basic and mostly mirrors how srcML stores expressions. This issue should look at making our expressions reflect an actual expression tree, which should improve the accuracy of name resolution and the call graph.
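To make that concrete, here is a rough sketch of the kind of expression-tree model this implies. The class names are illustrative, not the current ABB.SrcML.Data types.

using System.Collections.Generic;

public abstract class Expression { }

public class NameUse : Expression
{
    public string Name { get; set; }
}

public class BinaryOperation : Expression
{
    public string Operator { get; set; }    // e.g. "+", "=", "->"
    public Expression Left { get; set; }
    public Expression Right { get; set; }
}

public class MethodCall : Expression
{
    public Expression Target { get; set; }          // the receiver, if any
    public string MethodName { get; set; }
    public List<Expression> Arguments { get; set; }
}

With a tree like this, an expression such as a.b().c() resolves naturally: the target of the c() call is the result of the b() call rather than a flat run of tokens, which is exactly what name resolution and the call graph need.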

GitHub Issues

Conclusions

These changes should improve the accuracy and performance of SrcML.NET while making the codebase easier to maintain. If you’re interested, comment on one of the issues to get started!

Open source is where society innovates — Jono Bacon

This is the first post in my All Things Open 2015 Series.

What makes a healthy community? What are examples of healthy communities? What features can we lift from successful communities for our own teams and projects?

Goals and principles

The defining feature of open source projects and communities is directly in the name: “openness”.

There was a great keynote on the first day by Red Hat’s CEO, Jim Whitehurst. He defined an “open organization” to be

An organization that engages participative communities both inside and out

Organizations must welcome all kinds of input from all kinds of people to attract and retain talent.

Nuts and bolts

There were some great tips that people felt contributed to the above goals. Two of the key talks in this area were Brandon Keepers’ Open Source Principles for Better Engineering Teams and Kaitlin Devine’s Power to the People: Transforming Government with Open Source. Both had great things to say about how their respective organizations (GitHub and 18F) work in a distributed fashion and interact with the community. The core points were to:

  • Communicate via permanent, asynchronous mediums
  • Have a strong code of conduct
  • Learn by lurking / teach by doing
  • Automate grunt work

One fact that Brandon highlighted was that, historically, open source projects never had the luxury of face-to-face meetings. Instead, they used issue tracking systems, mailing lists, and version control systems to communicate. The permanent record provided by these tools helps newcomers learn the ropes, and their asynchronous nature means you can e-mail a mailing list and keep working until someone responds.

Brandon noted that at GitHub, they respond to questions with a URL (to documentation, a pull request, or a blog post). He stated that informal knowledge stops a conversation:

Alice: Why don’t we do it this way?
Bob: We tried that a few years ago and it didn’t work

A durable, searchable knowledge base lets the conversation continue:

Alice: Why don’t we do it this way?
Bob: We tried that a few years ago and it didn’t work. See http://link-to-project
Alice (some time later): I see — since then, there have been these new advances. It might be worth trying some of them to see if this would work now.

Kaitlin talked about the importance of having a strong code of conduct. A code of conduct protects community members by making it clear what kind of conduct is appropriate and what the procedure is when someone violates it. An easy way to make your code of conduct prominent on a project-by-project basis is to link to it from both the README.md and CONTRIBUTING.md documents.

The conference itself was a good example of a community with a strong code of conduct: every communication from the organizing committee mentioned it.

The way we start working with communities (open source or otherwise) is important. Brandon stated that new OSS project members learn by lurking and veteran members teach by doing.

In the course of their daily work, veteran community members work primarily in durable mediums. By responding to questions and requests with links (references to the durable record), they teach new community members to work the same way. Similarly, by responding to code of conduct violations in a public fashion, they show new members what to expect and how violations are handled. This both promotes a safe space and reinforces community norms.

The final thought that resonated with me was the automation of grunt work. GitHub employees go out of their way to automate grunt work so that people can focus on the hard parts of their jobs. Some of the things they automate include:

  1. Coding standards: automated checking of coding standards and other basic warnings means that code reviews end up being about the substance of the code rather than important-but-nitpicky details.
  2. Writing style: the GitHub blog uses automated checks to ensure quality. In particular, a Jenkins job flags common writing flaws by running candidate posts through write-good. This promotes a consistent style across the blog and, once again, eliminates simple errors so that human reviewers can focus on the substance of the post.
  3. Blog post calendar: they considered appointing someone to organize when entries would be posted so that there was a regular flow of posts. Instead, they added another Jenkins job that throws an error if a day has too many entries and suggests scheduling your post for a different day.

What I love about these suggestions is that they take the common machinery required for day-to-day work and focus it on providing thoughtful feedback. Computers handle the minutiae that they are best at.

We are more receptive to feedback from pedantic robots than pedantic people ...and robots are more reliable... — Brandon Keepers

There was a lot of good “how to manage a community” advice sprinkled throughout ATO this year. I plan on taking these ideas and trying to drive them in my own teams and projects.

Ready for ATO2015

This is the third year that I’ve had the pleasure of attending All Things Open in Raleigh, NC.

The conference has gotten bigger every year. This year there were 1,700 attendees across thirteen tracks. The tracks covered a variety of technical topics, ranging from JavaScript and web development to big data to community management.

I haven’t previously blogged about the talks I go to. This year (I tell myself) will be different.

This year is big enough that I’m going to highlight several different themes and projects that really caught my attention over the two days. I’ll write a post on each of the following topics:

  1. Community: community is a big part of All Things Open, and many of the talks are either specifically about open source communities, or peripherally mention them as a key to technical success. I think there were a lot of good takeaways on how to engineer communities to be inclusive, productive, and self-sustaining.
  2. Development & Deployment Environments: While I’d previously been aware of tools like Docker and Vagrant, hearing experts talk about them and give live demonstrations really cemented what these tools are for and how they can simplify my life as a developer, researcher, and systems administrator.
  3. Big Data: I went to several of the big data talks; this year there were two great talks on using Apache NiFi and Apache Spark to ingest and explore data.
  4. Graph Databases: Graph databases (like Neo4J) are an interesting take on databases. Again, it was instructive to see an expert talk about graph databases and give a live demonstration.
  5. Rust: Rust is an interesting new programming language that’s seen a considerable amount of development over the previous year.
  6. SrcML.NET: There was no talk on SrcML.NET this year — however, as its primary maintainer, I hope to apply many of the social and technical ideas I learned this year to the project.

As I write each post, I’ll update this post with a link.


Want to develop a better work routine? Discover how some of the world’s greatest minds organized their days.
(Interactive version available via Podio.)

This is a great visualization of various routines. I love that Haruki Murakami works from 4AM-noon and then spends time with his family in the afternoon. That sounds like a great schedule!

Like the rest of the Internet, I’ve been thoroughly engrossed by Serial. Since the podcast ended, a flurry of activity has launched a whole new round of speculation.

It’s been fascinating to hear the different viewpoints and stories that emerge from a single murder trial. The more I read about it, the more it feels like a real-life episode of Law & Order. Is this a good thing? Part of me thinks it is: getting people interested in the criminal justice process is a good thing, and if someone’s guilt or innocence is settled in the process, that’s an added bonus. Another part of me thinks this is all really just rubbernecking. Surely there must be a less callous way of serving justice?

In the meantime, I’ll continue scarfing down all the Serial tidbits I can find. Hopefully the third part of Jay’s interview with The Intercept posts tomorrow.

At work, we’re primarily a Windows shop. Everything is based on Active Directory. I manage lab resources for my group and provide some file sharing and web services independently of our global IS group. The easiest way to do this is to let people authenticate with the credentials they already have (their domain logins).

This is pretty easy for web services: most web frameworks support connecting to LDAP and authenticating a user. It’s also easy on a Linux machine (via nss-pam-ldapd). It’s relatively difficult, however, to have a non-domain Windows computer authenticate a user against Active Directory.

This turns out to be mostly a problem of finding the right search terms: pGina is an open source, pluggable authentication provider for Windows. If you come from a Linux background, the easiest way to think about it is as PAM for Windows.

With that (brief?) introduction, I’ll spend the rest of this post laying out how to use pGina with an Active Directory service.

Step 1: Installation

First, download pGina from the downloads page. When installation is finished, you’ll have the option to launch the pGina configuration tool.

pGina Configuration Tool

Step 2: Configuration

Enabling Plugins in pGina

Click on the “Plugin Selection” tab in the configuration tool and check the “Authentication” and “Authorization” checkboxes. Then, make sure that the LDAP row is selected and press “Configure…”.

pGina LDAP plugin configuration

This is where the meat of the configuration is done. You’ll need to fill in the following fields:

  1. LDAP Host: your Active Directory server.
  2. LDAP Port: use 636 and select “Use SSL” to encrypt the connection. If your domain is part of a forest, you may need a different port (the global catalog listens on 3268, or 3269 over SSL).
  3. Search DN: the distinguished name of a login that will be used to search Active Directory. I use a service account whose password doesn’t change.
  4. Search Password: the password for the “Search DN” user.
  5. Check the “Search for DN” checkbox.
  6. Set the Search Filter to (sAMAccountName=%u). sAMAccountName is typically an easy-to-remember user name; %u is whatever the user types into the login field.
  7. Click “Save”.

If you need help finding the distinguished name for an account, I recommend using AD Explorer from SysInternals.
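Under the hood, this configuration amounts to the standard two-step LDAP dance: bind as the search account, look up the user’s distinguished name, and then bind as that DN to verify the password. Here is a rough C# equivalent of what the plugin does; the host name, DNs, and passwords are made up.

using System.DirectoryServices.Protocols;
using System.Net;

// Bind as the search account (the "Search DN" and "Search Password" fields).
var connection = new LdapConnection(new LdapDirectoryIdentifier("ad.example.com", 636));
connection.SessionOptions.SecureSocketLayer = true;    // the "Use SSL" checkbox
connection.AuthType = AuthType.Basic;
connection.Bind(new NetworkCredential("CN=svc-search,OU=Services,DC=example,DC=com", "searchPassword"));

// Find the DN for the name typed at the login prompt (the %u in the search filter).
var search = new SearchRequest("DC=example,DC=com", "(sAMAccountName=jdoe)", SearchScope.Subtree, "distinguishedName");
var result = (SearchResponse)connection.SendRequest(search);
string userDn = result.Entries[0].DistinguishedName;

// Re-bind as the user; a failed bind throws, which means the password was wrong.
connection.Bind(new NetworkCredential(userDn, "userPassword"));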

Step 3: Testing

This is the moment of truth! Select the “Simulation” tab, put your domain username and password into the appropriate text boxes, and then press the “go” button (the one with the green triangle).

pGina LDAP simulation test

The first time you do this, the first stage (authentication via the Local Machine plugin) will evaluate to False because the user doesn’t exist on your local machine yet. pGina then adds them as a local user, so subsequent runs will authenticate against the local account as well.

For a second test, log out from your machine and select “Switch User.” Your login screen should now look like this:

pGina Windows login screen

The pGina login item should say “Service Status: Connected”. If it does, click it and log in!

Final Notes

I primarily use this to give domain users access to SMB shares and Remote Desktop on non-domain Windows machines, so there are some caveats:

  1. This computer doesn’t belong to the domain. Windows seems to match the sAMAccountName on your domain PC to the matching login on your non-domain PC, so it should authenticate properly.
  2. Users have to log in on the console before you can grant them SMB access or let them authenticate via Remote Desktop. If you want to authenticate new users over RDP, you can do so by following the advice in this thread.
  3. I haven’t tested password changes. I’m pretty sure this will break if your domain password changes (most likely, the old password will still work locally).

There’s a fork of pGina with some additional features. You may want to look at it to see if there’s something useful for your setup.

Phil: Claire, it’s time we upgraded to the latest version of Microsoft’s compiler. It looks like the new operator throws exceptions when it fails now. How long will it take us to adapt our code?

Claire: I don’t know — let me see how we’re doing it now.

Back at her desk, Claire tries to see how new, an oft-used operator in C++, is used in their large codebase. She might do the following to answer Phil’s question:

  1. Use “find-in-files” to find all uses of the new operator
  2. Click on the first few results
  3. Realize they’re comments
  4. Click on a few more results
  5. Report to Phil that there are at least 500 uses of new, but that some of them are just comments or occurrences of the word “new” in a variable or function name.

SrcML.NET is a C# framework that I’ve been developing at ABB Corporate Research that can help answer questions like Phil’s. It is a program analysis framework that focuses on:

  1. Speed: we want code operations to be fast so you can do them all the time (for instance, every time you save a file).
  2. Ease of use: it should be easy to develop new queries and run them over your source code.
  3. Good enough: analysis should be mostly correct, considering we’re not actually compiling the code.
  4. Multi-language: it should support multiple languages out of the box and provide common analysis tools where appropriate.

The first tool we built on top of SrcML.NET is a Visual Studio plugin for lightweight program transformations.

SrcML is an XML representation for source code developed at Kent State University. It wraps all of the program elements in your source code with XML elements. For instance, all if statements are wrapped with <if> elements and all functions are wrapped with <function> elements. The utilities provided by srcML annotate all code constructs for the supported languages without making any changes to the source code. All of the original structure in the code (preprocessor macros, whitespace, comments, and names) is preserved. By not relying on compilation, srcML lets us understand and modify source code very quickly by processing the resulting XML.
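To give a flavor of the markup, here is a hand-written sketch of roughly what srcML produces for if (x > 0) y = 1; (exact element names and nesting vary with the srcML version and the options used):

<if>if <condition>(<expr><name>x</name> &gt; 0</expr>)</condition><then> <expr_stmt><expr><name>y</name> = 1</expr>;</expr_stmt></then></if>

Notice that the original text, whitespace and punctuation included, survives between the tags; stripping the tags recovers the source exactly.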

LINQ is a language enhancement that was added to C# as part of .NET 3.5. It provides a SQL-like syntax with full IntelliSense support for querying a variety of data sources. What we’re most interested in is that LINQ can be used to query XML documents.

I know what you’re thinking:

I get to combine the conciseness of XML with the easy learning curve of SQL.

When I started exploring this space, I had roughly the same reaction. I changed my tune after using them together for two key scenarios.

Example: Lightweight Program Transformation

We first started using srcML as the platform for a lightweight program transformation tool. There are a lot of program transformation tools available. For our tool, we wanted something that:

  1. Allows developers to experiment with transformations.
  2. Lets developers modify specific parts of their source code without touching anything else.
  3. Allows developers to implement transformations in a natural, easy to understand way.
  4. Supports C & C++ code. Other languages are a bonus. Good C & C++ parsing is a must.

SrcML supports C, C++, Objective-C, and Java out of the box, and the Kent State team is actively working on other languages. LINQ lets us write natural-looking queries so developers can find the code patterns they’re interested in. By providing a Visual Studio add-in that lets us easily test transformations, we can enable the following workflow:

  1. Define a query: the query finds instances that need changing. Refine by hitting the “test” button.
  2. Define a transform: The transform modifies each query result. Refine by hitting the “test” button.
  3. Execute the transformation: run the complete transformation on a source code tree.

As an example, here’s a snippet of LINQ that finds all uses of the new operator:

var newUses = from unit in project.FileUnits              // iterate over all files
              from use in unit.Descendants(OP.Operator)   // find all operators
              where use.Value == "new"                    // where the value is "new"
              select use;

Example: Code Analysis

The queries implemented as part of program transformations led us to the realization that srcML and SrcML.NET were perfectly suited to writing useful queries over source code.

These queries can be used to answer two types of questions. The first are navigation questions whose answers are typically provided by IDEs (such as Visual Studio’s IntelliSense). Examples of this include:

  • What are all of the variable names in my program?
  • What is the type of variable A?
  • Where is variable B used?
  • What are the callers and callees of function C?

The second type of question is more along the lines of “metrics.” Examples of this are:

  • How often is language feature X used?
  • Is method Y always called before method Z?
  • What global variables are used across multiple namespaces?
  • How many functions are updated per changeset?

While there are other tools that can answer these questions, the combination of LINQ and srcML means that we can do it quickly, without compilation, and with a high degree of accuracy. This means we can make this information available in developer tools or as part of an automated build.
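As a rough illustration, here is a metrics query in the same style as the snippet above, counting uses of the new operator per file. It reuses the project and OP names from the transformation example; the filename attribute is how srcML records which file a unit came from.

var newUsesPerFile = from unit in project.FileUnits
                     let count = unit.Descendants(OP.Operator)
                                     .Count(op => op.Value == "new")
                     where count > 0
                     orderby count descending
                     select new { File = (string)unit.Attribute("filename"), Count = count };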

Now Open Source!

Come join us! Some of these ideas are untested or only exist as a prototype! If you’re interested in developing a fast, multi-language tool for program analysis, download the code and see what it does! Did you see my tool demonstration at FSE 2012? You can get the code I used in the demonstration here.

So, I clearly haven’t blogged in forever. What better way to start than by complaining about multinational corporations?

FedEx Tracking Map

I mailed my passport in to get a business visa via FedEx Priority Overnight. Apparently, “overnight” means “two days.” They’re at a loss to explain why my passport had to visit Memphis twice, and why it ended up in Orlando at all.

Coming up next: rants about politics, parenting, and revision control systems (though, not all at once, unless I’m feeling especially saucy).

This is a list of some of the main tutorials I use.

Useful tutorials / books:

Packages I use regularly. There are others, but these are the things I always install:

  • IPython: a much-improved shell. Gives you tab completion and lets you record sessions.
  • numpy: fast numerical arrays. You might want to get used to doing things with plain lists first; after that, numpy all the way!
  • SciPy: scientific computing in Python.
  • matplotlib: a plotting tool that integrates nicely with the three tools above. It also does 3D plots. I don’t use this as much anymore; I mostly do my plotting with the ggplot2 R package.