• How to build your own Static Code Analyser for C Sharp

    by  • June 10, 2012 • .Net, Journal • 0 Comments

    Static Code Analysis is the process of scanning the source code (or bytecode) of an application without actually running it. SCA can be used to determine a whole host of programmatic errors or just not conforming with a coding guideline.

    My first look into SCA was at university; during the ‘Theoretical Computation’ and ‘Mathematics of Computer Science’ lectures we looked into the ‘Halting Problem‘ i.e. Can you prove a program will or will not ever exit?

    From then on, I investigated SCA further while I was writing the OPO runtime for Windows Mobile. The OPO runtime interpreted and ran Psion QCode that was generated when the OPL programming language was compiled. I briefly investigated SCA because I wanted to use it in part to develop a set of analyse tools for OPL: Dead Paths, unused functions etc. (This was back in the day when space was extremely limited!)

    I came back to it a year or two later when I ported the runtime to Silverlight and then to Javascript as I investigated the concept of how to go about creating a JIT’er, as SCA would be used to determine which parts would benefit most (eg. Hot Paths) however my knowledge of dynamically emitting MSIL/CIL was limited at the time.

    Since then I haven’t thought much about it till now, and with my knowledge a bit further along in terms of how to analyse and parse .NET bytecode I thought I’d give it another shot and see how hard is it to analyse a simple program. – Though of course nothing on the scale of commercial or even free apps, just a starting point to which people can add their own checkers.

    How do they work?

    Static Code Analysers work by traversing your code, working out which routes are logically possible and then performing analysis on those routes. This is done using the source code itself, or an intermediate form, either extracted from the compiler, or done by examining the interpreted bytecode.

    What do I mean by logically possible routes, well imagine the following code.

    int x = 0;
    return;
    x++;
    return;
    

    You may notice code is placed after the first return statement, as there is no conditional, any code after it will never be run, this code will always be unused (dead code).

    In the following example, there is a conditional, as we are not dynamically running the application we can logically analyse both possible routes through the code.

    int x = 0;
    if (x == 0)
    {
        x = 1;
    }
    else
    {
        x = 0;
    }
    x++;
    

    Storing the data on these paths, and the analyse is done via trees, where each branch is the output of a conditional statement.

    Each node on the tree would store information about the code up until that point e.g. function calls, variable states, initialisations, chunk of MSIL bytecode etc. Each node would have a unique identifier based on variable states, code position etc. Therefore if a method is called from multiple places, the same place on the tree could be linked.

    Note: A static analyser can use these graphs to determine so called ‘hot paths’ where the same piece of code is repeatedly called, this would be found by having nodes where their are multiple parents.

    During the analysis phase, each route on the tree is traversed; either by a breadth or depth first search depending on the analysis requirements or if there are loops in the tree.

    Note: These trees form the basis of working out the halting problem, however the trees themselves could grow to infinite size if the application contains a non-exiting loop, thus why it is impossible to prove the problem.

    The analysis itself occurs by the bytecode / source in a node being parsed and the opcodes given to a checker, it is the duty of the checker(stateless or stated) to determine whether an event has occurred (eg. a variable has been used before initialised). In the case of my checkers, they will be designed to look at chunks of MSIL bytecode and whether they happen in the correct ordering.

    Stay tuned for part two…

    About

    Software engineer. Tea drinker

    http://MrPfister.com

    Leave a Reply

    Your email address will not be published. Required fields are marked *