Autoencoders are neural networks that learn meaningful representations of data in an unsupervised way. Diverse variants of autoencoders exist; in practice, however, the purpose of all variants is to learn to reconstruct a representative copy of the given input. The underlying principle is that the encoded representation captures the salient features of the input, which are reflected in the reconstructed output, and discards the less important ones, thus providing dimensionality-reduction and de-noising capabilities.
A typical autoencoder model has essentially two sets of layers, encoding and decoding layers, built symmetrically around the compression bottleneck.
The model produces an approximate, compressed representation of the input, from which it then attempts to reconstruct the input with some loss L. In its simplest architecture, an autoencoder consists of dense layers where the input is compressed by limiting the number of units in the intermediate hidden layers.
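As a minimal illustration of this symmetric encode/decode structure, here is a tiny dense autoencoder written from scratch in NumPy on the classic 8-3-8 identity task. The layer sizes, learning rate, and epoch count are hypothetical choices for the sketch, not values used later in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Classic 8-3-8 task: reproduce one-hot inputs through a
# 3-unit bottleneck (the "compressed" representation).
X = np.eye(8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder (8 -> 3) and decoder (3 -> 8) weights, mirroring the
# symmetric encode/decode structure described above.
W1 = rng.normal(0, 0.5, (8, 3)); b1 = np.zeros(3)
W2 = rng.normal(0, 0.5, (3, 8)); b2 = np.zeros(8)

def forward(X):
    h = sigmoid(X @ W1 + b1)  # encoded (compressed) representation
    y = sigmoid(h @ W2 + b2)  # reconstruction
    return h, y

lr = 2.0
losses = []
for epoch in range(3000):
    h, y = forward(X)
    err = y - X                      # d(MSE)/dy up to a constant
    losses.append((err ** 2).mean()) # reconstruction loss
    # Backpropagate through the sigmoid decoder and encoder.
    dy = err * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ dy / len(X); b2 -= lr * dy.mean(0)
    W1 -= lr * X.T @ dh / len(X); b1 -= lr * dh.mean(0)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The reconstruction loss drops steadily as the bottleneck learns a compact code for the eight inputs.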
In this post, I will implement a simple autoencoder to identify code smells. The metaphor of code smells is used to indicate the presence of quality issues in source code. A large number of smells in a software system is associated with a high level of technical debt, hampering the system's evolution. I will take you on a step-by-step journey (refer to the figure below) through all the involved tasks: downloading source code repositories, using an existing tool to detect smells (to prepare the training and testing datasets), splitting each project into smaller fragments (i.e., methods), segregating them into positive and negative samples, training an autoencoder model, and finally detecting the smell Complex Method (which occurs when a method has high cyclomatic complexity).
If you want the TL;DR version of the post, you can jump directly to step 6, where I discuss the implementation of the autoencoder; you can clone this repository to obtain the outcomes of the intermediate steps needed to execute step 6. On the other hand, if you love details, stay on.
Step 0: Set parameters
I am going to use many path variables throughout the post, so let's initialize them first.
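A sketch of such a setup is below; the variable names and folder layout here are hypothetical, and the accompanying repository may use different ones:

```python
import os

# Hypothetical path layout for the experiment's artifacts.
BASE_PATH = "data"
REPO_ROOT = os.path.join(BASE_PATH, "repositories")        # cloned source code
SMELLS_RESULTS = os.path.join(BASE_PATH, "designite_out")  # DesigniteJava output
TOKENIZER_OUT = os.path.join(BASE_PATH, "tokenized")       # tokenized methods
TRAINING_DATA = os.path.join(BASE_PATH, "training_data")   # positive/negative samples

# Create the folders up front so later steps can write into them.
for path in (REPO_ROOT, SMELLS_RESULTS, TOKENIZER_OUT,
             os.path.join(TRAINING_DATA, "positive"),
             os.path.join(TRAINING_DATA, "negative")):
    os.makedirs(path, exist_ok=True)
```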
Step 1: Download repositories
The first step of our pursuit is to download the subject systems. You can use any open-source GitHub repositories; the accompanying repository includes a CSV file containing a list of selected open-source Java repositories. I am downloading a fraction of them (10) for this experiment.
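The download step can be sketched as follows. The inline CSV here is a hypothetical stand-in for the repository list shipped with the accompanying repo, and the clone command itself is left commented out:

```python
import csv
import io
import subprocess

# Hypothetical two-column CSV (user, repo) standing in for the
# repository list shipped with the accompanying repository.
csv_text = "user,repo\napache,commons-lang\ngoogle,gson\n"

MAX_REPOS = 10  # only a fraction of the list is downloaded
commands = []
for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    if i >= MAX_REPOS:
        break
    url = f"https://github.com/{row['user']}/{row['repo']}.git"
    cmd = ["git", "clone", "--depth", "1", url,
           f"repositories/{row['repo']}"]
    commands.append(cmd)
    # subprocess.run(cmd, check=True)  # uncomment to actually clone

print(commands[0])
```

A shallow clone (`--depth 1`) is enough here, since only the latest snapshot of each system is analyzed.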
Step 2: Analyze repositories using DesigniteJava
We need to analyze the downloaded subject systems using DesigniteJava to identify code smells in each repository. Using this information, we will segregate each method into a positive or negative sample in later steps.
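Invoking the tool per repository can be sketched as below. The `-i`/`-o` flags are my assumption about DesigniteJava's console interface; check the tool's documentation for the exact options, and note that the actual call is left commented out:

```python
import os
import subprocess

REPO_ROOT = "repositories"
OUT_ROOT = "designite_out"

def analyze(repo_name, jar_path="DesigniteJava.jar"):
    # Assumed CLI shape: input source folder (-i), output folder (-o).
    out_dir = os.path.join(OUT_ROOT, repo_name)
    cmd = ["java", "-jar", jar_path,
           "-i", os.path.join(REPO_ROOT, repo_name),
           "-o", out_dir]
    # subprocess.run(cmd, check=True)  # uncomment when the jar is available
    return cmd

print(analyze("gson"))
```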
Step 3: Split the repositories into methods
Our training unit is a method, and hence we need to extract each individual method from all the subject systems. I achieve this by using the utility CodeSplitJava.
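To convey what the splitting step produces, here is a deliberately naive regex-based sketch of the idea. CodeSplitJava itself uses a proper Java parser and is far more robust; this toy version only handles simple cases:

```python
import re

java_src = """
class Calc {
    int add(int a, int b) { return a + b; }
    int sub(int a, int b) { return a - b; }
}
"""

def split_methods(src):
    # Locate method headers and capture their brace-balanced bodies.
    # A naive illustration only; real Java needs a real parser.
    methods = {}
    for m in re.finditer(r"\b(\w+)\s+(\w+)\s*\([^)]*\)\s*\{", src):
        name, i = m.group(2), m.end() - 1
        depth = 0
        while i < len(src):
            if src[i] == "{":
                depth += 1
            elif src[i] == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        methods[name] = src[m.start():i + 1]
    return methods

methods = split_methods(java_src)
print(sorted(methods))
```

Each extracted method would then be written to its own file, ready for labeling in the next step.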
Step 4: Segregate methods into positive/negative samples
Using the smells identified in each repository, I segregate each method sample into either the positive bucket (where the smell Complex Method is present) or the negative bucket. After this step, each method will be placed in either the 'positive' or 'negative' folder inside the training_data folder.
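The labeling logic can be sketched as a lookup against the tool's smells CSV. The column names below are my guess at DesigniteJava's implementation-smells output and may differ between tool versions:

```python
import csv
import io

# Hypothetical slice of DesigniteJava's implementation-smells CSV.
smells_csv = """Type Name,Method Name,Implementation Smell
Calc,add,Complex Method
Calc,mul,Long Method
"""

# Collect (type, method) pairs flagged with the smell under study.
smelly = {(r["Type Name"], r["Method Name"])
          for r in csv.DictReader(io.StringIO(smells_csv))
          if r["Implementation Smell"] == "Complex Method"}

def bucket(type_name, method_name):
    # Decide which training_data subfolder a method file belongs to.
    return "positive" if (type_name, method_name) in smelly else "negative"

print(bucket("Calc", "add"), bucket("Calc", "sub"))
```

Note that a method flagged with a different smell (e.g., Long Method) still lands in the negative bucket, since we train a detector for Complex Method only.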
Step 5: Tokenize source code
Source code cannot be fed directly to an autoencoder model. Often, source code is processed first to generate a set of features (e.g., metrics), and those features are then fed to a machine/deep learning model. In our case, we feed the source code in a simple tokenized form generated by a tokenizer.
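A minimal stand-in for the tokenizer is shown below: split the source into word and operator tokens, then map each distinct token to an integer id. The actual tokenizer used in the repository may differ in its token rules and vocabulary handling:

```python
import re

def tokenize(code):
    # Word tokens (identifiers, keywords, literals) and
    # single-character punctuation/operator tokens.
    return re.findall(r"\w+|[^\w\s]", code)

def build_vocab(samples):
    # Assign each distinct token an integer id; 0 is reserved
    # for padding shorter sequences.
    vocab = {}
    for s in samples:
        for tok in tokenize(s):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

samples = ["int add(int a, int b) { return a + b; }"]
vocab = build_vocab(samples)
encoded = [vocab[t] for t in tokenize(samples[0])]
print(encoded)
```

The resulting integer sequences (padded to a fixed length) are what the autoencoder consumes.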
Step 6: Train Autoencoder
The following implementation uses an autoencoder as an anomaly classifier. I train the model to learn the patterns of non-smelly samples by using
only negative (i.e., non-smelly) examples. I then test the trained model on data that include both positive and negative samples, using the reconstruction loss as a proxy for classifying an instance as smelly. If the model's output for some instance shows a high loss, I conclude that the example does not follow the pattern learnt by the model, which in turn implies a positive instance of the smell under investigation.
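The decision logic, i.e., thresholding the reconstruction loss and scoring the result, can be sketched as below. The reconstruction errors here are simulated stand-ins (an autoencoder trained only on non-smelly code reconstructs negatives well and positives poorly), and the 95th-percentile threshold is an assumed hyper-parameter:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated reconstruction errors: low for negative (non-smelly)
# test samples, high for positive (smelly) ones.
neg_err = rng.normal(0.05, 0.02, 200)
pos_err = rng.normal(0.20, 0.05, 50)

errors = np.concatenate([neg_err, pos_err])
labels = np.concatenate([np.zeros(200), np.ones(50)])

# Classify as smelly when the loss exceeds a threshold drawn from
# the distribution of losses on non-smelly data.
threshold = np.percentile(neg_err, 95)
pred = (errors > threshold).astype(int)

tp = ((pred == 1) & (labels == 1)).sum()
fp = ((pred == 1) & (labels == 0)).sum()
fn = ((pred == 0) & (labels == 1)).sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision: {precision:.2f}, recall: {recall:.2f}, f1: {f1:.2f}")
```

The choice of threshold trades precision against recall: raising it reduces false positives at the cost of missing borderline smelly methods.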
The result of the above script is given below.
precision: 0.7483870967741936, recall: 0.3118279569892473, f1: 0.4402277039848197
The autoencoder implementation produces F1 = 0.44, which can be considered mediocre accuracy. However, keep in mind that we did not generate any code features and relied on the autoencoder's capability to learn from raw (tokenized) code. The result can be improved further with a larger dataset (we used only 10 repositories) and by fine-tuning hyper-parameters (such as the threshold, encoding dimensions, and number of epochs).