DFP - DockerFile Patcher

This artifact aims to improve the quality of Dockerfiles by analyzing the file using the linter Hadolint 1.23.0, retrieving possible patches from a database and applying them in order of their ranking.

The patching script will suggest patches for various lines in a given dockerfile, but won’t change the original file.
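
The analysis step described above, collecting the linter's findings before any patch is looked up, can be sketched in Python. This is a minimal sketch and not DFP's actual code: it assumes Hadolint is asked for JSON output (`hadolint --format json`), the sample records are hand-written for illustration rather than captured from a real run, and the function name `parse_violations` is hypothetical.

```python
import json

# Hand-written stand-in for the output of `hadolint --format json Dockerfile`.
# Field names follow Hadolint's JSON formatter; the records are illustrative only.
SAMPLE_OUTPUT = """
[
  {"line": 4, "code": "DL3008", "column": 1, "level": "warning",
   "file": "Dockerfile", "message": "Pin versions in apt get install"},
  {"line": 3, "code": "DL3009", "column": 1, "level": "info",
   "file": "Dockerfile", "message": "Delete the apt-get lists after installing something"},
  {"line": 4, "code": "DL3015", "column": 1, "level": "info",
   "file": "Dockerfile", "message": "Avoid additional packages by specifying --no-install-recommends"}
]
"""


def parse_violations(raw):
    """Return (line, rule code) pairs sorted by line number, then rule code."""
    records = json.loads(raw)
    return sorted((record["line"], record["code"]) for record in records)


violations = parse_violations(SAMPLE_OUTPUT)
print(f"Number of violations: {len(violations)}")
for line, code in violations:
    print(f"line {line}: {code}")
```

Each (line, rule code) pair is the kind of key a patch lookup could start from; how DFP actually queries and ranks patches is not shown here.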

This repository contains scripts to

  1. Generate patches based on Hadolint’s violations and a large collection of Dockerfile changes in open-source projects (the MSR18 database; an extended dataset can be found on Zenodo), and
  2. Retrieve and apply these patches to any given Dockerfile.

Getting started

There are several possibilities to get the artifact up and running:

  1. Docker image (recommended)
  2. Docker build
  3. Local

Docker image (recommended)

Pre-requisites:

Download the image and start the container:

docker run --rm --name dfp -d mando9/dfp

You now have a running container with the name dfp. The container will remove itself once it is stopped, due to the option --rm.
To access the container using bash the following command can be used:

docker exec -it dfp /bin/bash

Docker build

Pre-requisites:

Build the docker image using

docker build -t dfp .

This will create a docker image on your local machine with the tag dfp.
Then create a container and run it detached:

docker run --rm --name dfp -d dfp

You now have a running container from your local image with the container name dfp. The container will remove itself once it is stopped, due to the option --rm.
To access the container using bash the following command can be used:

docker exec -it dfp /bin/bash

Local

Pre-requisites:

Windows 10 was used for the local setup; if you use another OS, your results may vary. The following commands use the default user postgres with password postgres (this can vary between installation methods). To use a different user, change the option -U <user> and update the login information in config.ini accordingly.
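
Before editing the login information, it can help to see which sections and keys config.ini actually defines. The following is a minimal sketch using Python's standard `configparser`; the section name `postgres` and the keys `user` and `password` in the stand-in text are hypothetical placeholders, not the real layout of DFP's config.ini.

```python
import configparser

# Hypothetical stand-in; the real config.ini may use other sections and keys.
STAND_IN = """
[postgres]
user = postgres
password = postgres
"""


def list_settings(text):
    """Return {section: {key: value}} for every section in INI-formatted text."""
    # interpolation=None keeps values containing '%' (e.g. passwords) untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser.items(section)) for section in parser.sections()}


settings = list_settings(STAND_IN)
for section, values in settings.items():
    print(section, sorted(values))
```

To inspect the real file, pass its contents instead of the stand-in, for example `list_settings(open("config.ini").read())`.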

The patch database can be restored by running

psql -U postgres -e < patch_database.sql

in a terminal (on Windows, use the command prompt; PowerShell does not support the < input redirection). This will create a database dfp with all patches.
Alternatively, you can create the database yourself and restore the data using

pg_restore -U postgres --dbname dfp patch_database

When you run the main script on the Dockerfile of this artifact,

# The Python 3.9 executable is either "python" or "python3", depending on your installation
python .\dfp_main.py .\Dockerfile

you should get an output like:

Number of violations: 3
Searching for patches for line (DL3009): RUN apt-get update
Trying patches for violation 0: : 82it [00:09,  8.59it/s]

You can then abort the execution using Ctrl+C.

File structure

This repository contains scripts for creating patches, running dfp to apply patches and evaluating it with a test set.

Running DFP

Main script

The main script analyzes a Dockerfile, queries the patch database and applies patches to find fixes.
Execution can take several minutes, depending on the number of violations in the Dockerfile.
Usage of the main script is as follows:

python dfp_main.py [OPTIONS] DOCKERFILE

with options:

All files with suffix *_dockerfile in /testSet are Dockerfiles to patch.
An example execution would be

python dfp_main.py ./testSet/pID201_dID3718_sID7015_dockerfile
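
To run the main script on more than one file, the `_dockerfile` naming convention makes the candidates easy to select. The sketch below is an illustration under assumptions: the helper names `find_test_dockerfiles` and `build_command` are hypothetical, no options are passed to dfp_main.py, and the demonstration uses a throwaway directory instead of the real test set.

```python
import tempfile
from pathlib import Path


def find_test_dockerfiles(test_dir):
    """Return all regular files in test_dir whose name ends with _dockerfile."""
    return sorted(p for p in Path(test_dir).glob("*_dockerfile") if p.is_file())


def build_command(dockerfile, python="python"):
    """Build the argument list for one dfp_main.py run (no options passed)."""
    return [python, "dfp_main.py", str(dockerfile)]


# Demonstration on a temporary directory instead of the real ./testSet folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pID201_dID3718_sID7015_dockerfile").write_text("FROM ubuntu\n")
    (Path(tmp) / "notes.txt").write_text("not a Dockerfile\n")
    found = find_test_dockerfiles(tmp)
    found_names = [p.name for p in found]
    commands = [build_command(p) for p in found]

print(found_names)
```

Each command list can be handed to `subprocess.run(command, check=True)` from the repository root; since a single run can take minutes, a small selection is more practical than the full folder.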

Processing the test set

This process can take a long time (several hours), since many Dockerfiles are analyzed.
Therefore, pre-computed results are provided in folder /results.
All files in the test set can be processed using

python evalTestSet.py

The script prints some statistics about the evaluation and saves the data to a file called evalStats_<current_time>.pkl in the project repository.
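
The saved .pkl file is a Python pickle, so it can also be opened outside of the plotting script. The sketch below only shows the loading step; the internal layout of the evaluation data is not documented here, so the example works on stand-in data, and the names `load_stats` and `summarize` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_stats(path):
    """Unpickle an evaluation result file. Only load pickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Describe the top-level type and, for sized containers, the length."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


# Stand-in data: the real layout of evalStats_<current_time>.pkl is an unknown here.
stand_in = {"example_key": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    stats_path = Path(tmp) / "evalStats_example.pkl"
    stats_path.write_bytes(pickle.dumps(stand_in))
    loaded = load_stats(stats_path)

print(summarize(loaded))
```

If the real file contains instances of project classes, run the loader from the project repository so that those classes can be imported during unpickling.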

Evaluate results

To display the results visually, use the script

python plotResults.py RESULT_FILE

This shows several plots of the result data and prints statistical information and LaTeX tables to the console. The plots are:

  1. Violation distribution (Figure 9)
    How often each rule violation is found.
  2. Execution times (Figure 10)
    How long the execution takes for one Dockerfile and for one violation.
  3. Fix rate (Figure 11)
    Found violations versus fixed violations.
  4. Impact of patch limit on fixes
    How limiting the patch query affects the number of fixes found (see Table 19).

The plots are then stored in the same directory as the result file, with file names prefixed by the result file name; for the pre-computed results, that is the folder /results.

To view pre-computed results containing generated and custom patches, use

python plotResults.py ./results/resultsWithAllPatches.pkl

To view pre-computed results containing only generated patches, use

python plotResults.py ./results/resultsWithOnlyGeneratedPatches.pkl

To copy the plots from the docker container use the following on the host machine (example files for resultsWithAllPatches.pkl):

docker cp dfp:/dfp/results/resultsWithAllPatches_ExecutionTimes.png .
docker cp dfp:/dfp/results/resultsWithAllPatches_FixRate.png .
docker cp dfp:/dfp/results/resultsWithAllPatches_RuleDistribution.png .
docker cp dfp:/dfp/results/resultsWithAllPatches_PatchLimitImpact.png .
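
The file names in the copy commands above follow a simple pattern: the result file name without its extension, an underscore, and a plot name. The sketch below derives the same commands for an arbitrary result file. The four plot suffixes are taken from the example above; that every result file produces exactly these four plots is an assumption, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes taken from the example file names in the copy commands above.
PLOT_SUFFIXES = ["ExecutionTimes", "FixRate", "RuleDistribution", "PatchLimitImpact"]


def expected_plot_files(result_file):
    """Return the plot paths stored next to result_file, one per plot suffix."""
    result = PurePosixPath(result_file)
    return [result.with_name(f"{result.stem}_{suffix}.png") for suffix in PLOT_SUFFIXES]


def copy_commands(result_file, container="dfp", destination="."):
    """Build one `docker cp` command per expected plot file."""
    return [
        f"docker cp {container}:{path} {destination}"
        for path in expected_plot_files(result_file)
    ]


generated = copy_commands("/dfp/results/resultsWithAllPatches.pkl")
for command in generated:
    print(command)
```

Calling `copy_commands("/dfp/results/resultsWithOnlyGeneratedPatches.pkl")` gives the corresponding commands for the second pre-computed result file, under the same assumption about the suffixes.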

Dataset

The dataset used to mine the patches extends the dataset of Structured Information on State and Evolution of Dockerfiles.
A description of the original data schema can be found on the linked GitHub repository.
The extended dataset can be downloaded from Zenodo.
Similar to the patch database, the dataset is also a compressed PostgreSQL dump and can be imported with:

pg_restore -U postgres --dbname msr18_extended msr18_extended

The command will restore the database dump as the user postgres into a database with the name msr18_extended.

Important tables of the dataset include (more detailed information on the original schema can be found here):

Extensions include:

An SQL script to create the DB schema and a complete entity-relationship diagram can be found in /dataset.