13 May 2021, 11:19

How to write configuration files in your machine learning project

When working on a machine learning project, flexibility and reusability are very important to make your life easier while developing the solution. Let's learn more about how to manage your parameters and initial settings more effectively with configuration files.

Finding the best way to structure your project files can be difficult when you are a beginner, or when the project becomes bigger than expected. Sometimes you may end up duplicating or rewriting some part of your project, which is not good professional practice as a data scientist or machine learning engineer.

A common example is running different ML experiments to find the best model for the problem you are trying to solve. Most of the time, people change the values of the parameters directly in the source code and run the experiment again and again until they get the best results. This is not a good approach, because you can easily lose track of the experiments you ran previously and the settings that produced each result.

Using a configuration file can help you to solve this problem and can add value to your machine learning project in other ways.

After reading this article, you will know:

  • The importance of using a configuration file.
  • An introduction to YAML files.
  • The basic syntax of a YAML file.
  • Rules for creating a YAML file.
  • How to write your first YAML file.
  • How to load a YAML file in Python.
  • How to use a YAML file (as a configuration file) in your next ML project.

Let’s get started.

So what is a configuration file?

From Wikipedia: “In computing, configuration files (or config files) are files used to configure the parameters and initial settings for some computer programs. They are used for user applications, server processes, and operating system settings.”

The Wikipedia definition highlights two important elements of a configuration file: PARAMETERS and INITIAL SETTINGS. These are specific values that are applied to your system when it is running. For example, in machine learning you can set batch_size, optimizer, learning rate, test_size, and evaluation metric as part of the configuration file.

Put simply, a configuration file, often shortened to config file, defines the parameters, options, settings, and preferences applied to systems, infrastructure devices, and applications. The principal use of a configuration file is to set how your application should run.

This means you can and should use a configuration file in your ML projects. Doing so helps you run your project with flexibility and manage your source code easily, e.g. when running different ML experiments.

There are different file types you can use for your configuration files, such as YAML, JSON, XML, INI, and Python files. In this article, you will learn more about one of the most popular configuration file formats, YAML, and how to use it in your machine learning project.

YAML Configuration File

From Wikipedia: “YAML (YAML Ain’t Markup Language) is a human-readable data serialization language. It is commonly used for configuration files but could be used in many applications where data is being stored.”

YAML file formats have become a crowd favorite for configurations, presumably for their ease of readability. YAML is relatively easy to write. Within simple YAML files, there are no data formatting items, such as braces and square brackets; most of the relations between items are defined using indentation.

The YAML acronym originally stood for Yet Another Markup Language, but the maintainers renamed it to YAML Ain’t Markup Language to place more emphasis on its data-oriented features.

Basic Syntax of YAML file

A YAML file has a very simple syntax that is easy for anyone to learn, which is my main reason for choosing YAML over other types of configuration files. The following basic syntax can help you start using YAML for your configuration files:

(a) Comments

In a YAML file, comments begin with a pound sign (#).

Example:

# my first comment

(b) Key-Value Pair

Data in YAML is written as key-value pairs, similar to dictionaries and hashes in programming languages such as Python, Perl, and JavaScript.

The key is usually a string, and the value can be any datatype.

Example:

learning_rate: 0.1
evaluation_metric: rmse
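
Because relations between items are expressed through indentation, a key can also hold a nested group of key-value pairs. The following is a minimal sketch, assuming the PyYAML package is installed; the model block and its keys are hypothetical and only illustrate how such text reads into a Python dictionary:

```python
import yaml  # provided by the PyYAML package

# Illustrative YAML text; the "model" block and its keys are made up for this example.
text = """
learning_rate: 0.1
evaluation_metric: rmse
model:
  name: knn
  n_neighbors: 5
"""

settings = yaml.safe_load(text)

# Top-level pairs become dictionary entries;
# the indented block becomes a nested dictionary.
print(settings["learning_rate"])
print(settings["model"]["n_neighbors"])
```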

(c) Numerical Data

YAML recognizes and supports different numerical data types, such as integers, decimals, hexadecimal, and octal numbers.

Example:

test_size: 0.2
epochs: 50
scientific_notation: 1.0e+12  # PyYAML needs the decimal point to read this as a float
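
It can be worth checking which Python type each number ends up as after loading. The sketch below assumes PyYAML, which follows the YAML 1.1 rules; the key names are illustrative. One detail to watch: PyYAML only treats scientific notation as a float when the number contains a decimal point, so a value written as 1e+12 comes back as a string.

```python
import yaml  # provided by the PyYAML package

# Hypothetical snippet; the key names are illustrative only.
text = """
test_size: 0.2
epochs: 50
hex_value: 0x1F
with_dot: 1.0e+12
without_dot: 1e+12
"""

values = yaml.safe_load(text)

# Print each value together with the Python type PyYAML assigned to it.
for key, value in values.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```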

(d) String

Writing strings in YAML is very simple: you don’t have to enclose them in quotes, although you can.

Example:

experiment_title: find the best model by using f1 score

(e) Boolean

YAML indicates Boolean values with the keywords True, On, and Yes for true, and False, Off, or No for false.

Example:

cross_validation: True
save_model: False
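
A side effect of these keywords is that an unquoted yes or off is read as a Boolean, not as text. Here is a small sketch of that behaviour, assuming PyYAML; the use_gpu, verbose, and answer keys are hypothetical:

```python
import yaml  # provided by the PyYAML package

# Illustrative settings; only cross_validation and save_model come from the example above.
text = """
cross_validation: True
save_model: False
use_gpu: yes
verbose: off
answer: "yes"
"""

flags = yaml.safe_load(text)

# Unquoted keywords become Python booleans; the quoted value stays a string.
for key, value in flags.items():
    print(f"{key}: {value!r}")
```

If a setting is meant to hold the literal word, quoting it as in the last line keeps it a string.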

(f) Array

YAML supports the creation of arrays or lists on a single line.

Example:

ages: [24,76,45,21,45]
labels: ["class_one", "class_two", "class_three"]
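
Lists can also be written with one item per line, each introduced by a dash. The sketch below, assuming PyYAML, loads both styles and shows that they produce ordinary Python lists:

```python
import yaml  # provided by the PyYAML package

# The same kind of data written in single-line style and in one-item-per-line style.
text = """
ages: [24, 76, 45, 21, 45]
labels:
  - class_one
  - class_two
  - class_three
"""

data = yaml.safe_load(text)

print(data["ages"])
print(data["labels"])
```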

Rules for Creating YAML file

When it comes to creating a YAML file, you have to follow some very important basic rules.

  • The file should have .yaml (or .yml) as its extension.
  • YAML is case sensitive.
  • Do not use tabs for indentation in YAML files; use spaces instead.
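
Two of these rules are easy to check from Python. The following is a rough sketch, assuming PyYAML; the keys are hypothetical. It shows that keys differing only in case are kept as separate entries, and that a tab used for indentation makes the parser raise an error.

```python
import yaml  # provided by the PyYAML package

# Case sensitivity: "Epochs" and "epochs" are two different keys.
config = yaml.safe_load("Epochs: 10\nepochs: 50\n")
print(config)

# Tabs: the second line below is indented with a tab character.
bad_text = "model:\n\tname: knn\n"
tab_rejected = False
try:
    yaml.safe_load(bad_text)
except yaml.YAMLError as error:
    tab_rejected = True
    print("Invalid YAML:", type(error).__name__)
```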

Write your first YAML file

To create a YAML file, open your favorite text editor, such as Sublime Text, VS Code, or Vim. Then create a new file and save it with a name of your choice, for example my_configuration, and add the .yaml extension at the end. Now you have your first YAML file.

You can start writing different parameters and initial setting values in your my_configuration.yaml file.

Here is a simple example to show you what it can look like.
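
As a stand-in for such a file, the string in the sketch below holds contents that my_configuration.yaml might have. Every key and value is illustrative, reusing the syntax examples from earlier sections. The snippet assumes PyYAML and parses the text to confirm that it is valid YAML:

```python
import yaml  # provided by the PyYAML package

# Hypothetical contents for my_configuration.yaml; every key and value is illustrative.
my_configuration = """
# settings for one experiment
experiment_title: find the best model by using f1 score
test_size: 0.2
epochs: 50
learning_rate: 0.1
cross_validation: True
labels: [class_one, class_two, class_three]
"""

config = yaml.safe_load(my_configuration)

# Show each setting as Python sees it after loading.
for key, value in config.items():
    print(f"{key} -> {value!r}")
```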

How to use a YAML file in an ML project

Now that you know the basic syntax of YAML files and how to write them, let’s see how you can use a YAML file as a configuration file in a machine learning project.

Dataset

For this simple machine learning project, I will use the Breast Cancer Wisconsin (Diagnostic) Data Set. The objective of this ML project is to predict whether a person has a benign or malignant tumor.

More information about the dataset can be found here: Breast Cancer Dataset.

From the above source code, you can see how to run this simple machine learning project: loading the dataset, handling missing values, dropping columns, training and testing the model, and finally saving the model. But we didn’t set up or use any configuration file to run this project.

You can see a lot of parameters and initial settings that are available in the source code and we can put all of them into a single configuration file.

So what parameters and initial settings can we add to the configuration file?

  • Data directory
  • Data name
  • Column(s) to drop
  • Target variable name
  • Test size ratio
  • Parameters of the classifier (KNN)
  • Model name
  • Models directory

Now that we have identified the parameters and initial settings, we can write our configuration file and name it my_config.yaml.

The my_config.yaml file contains all the important initial settings and parameters for the K-Nearest Neighbors algorithm to run in our ML project.
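
To make the list of settings concrete, here is one possible layout for such a file, held in a Python string so it can be checked with PyYAML. This is a sketch under assumptions, not the original file: the key names, file names, paths, and the dropped column are hypothetical, and the KNN values are simply scikit-learn's defaults for KNeighborsClassifier.

```python
import yaml  # provided by the PyYAML package

# Hypothetical my_config.yaml contents: key names, file names, and values
# are assumptions made for illustration.
MY_CONFIG = """
# initial settings
data_directory: ./data/
data_name: breast-cancer-wisconsin.csv
drop_columns: ["id"]
target_name: diagnosis
test_size: 0.2
model_directory: ./models/
model_name: KNN_classifier.pkl

# K-Nearest Neighbors parameters
n_neighbors: 5
weights: uniform
algorithm: auto
leaf_size: 30
p: 2
metric: minkowski
"""

# The settings identified in the list above, under the assumed key names.
REQUIRED_KEYS = {
    "data_directory", "data_name", "drop_columns", "target_name",
    "test_size", "model_directory", "model_name", "n_neighbors",
}

config = yaml.safe_load(MY_CONFIG)

# Report any expected setting that is absent from the file.
missing = REQUIRED_KEYS - set(config)
print("Missing settings:", sorted(missing))
```

Checking for required keys right after loading is optional, but it catches a misspelled setting before training starts.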

How to load the YAML file in Python

In order to load a YAML file in Python, you need to install and use the PyYAML package, a YAML parser and emitter for Python. The installation process is fairly straightforward; the easiest way to install PyYAML is via the pip package manager. If you have pip installed on your system, run the following command to download and install PyYAML:

pip install pyyaml

To read the YAML file in Python, first import the package with import yaml, then open your YAML file my_config.yaml and load its contents with the safe_load() function from the yaml module.

Now that you know how to load the YAML file in Python, let’s add the configurations we have identified and put them in our ML project.
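
The loading steps can be wrapped in a small helper. The sketch below assumes a function named load_config with a config_dir argument; that signature is an assumption made for illustration. To stay runnable on its own, it first writes a throwaway my_config.yaml with two illustrative settings into a temporary directory.

```python
import os
import tempfile

import yaml  # provided by the PyYAML package


def load_config(config_name, config_dir="."):
    """Read a YAML file and return its contents as Python objects."""
    config_path = os.path.join(config_dir, config_name)
    with open(config_path, encoding="utf-8") as file:
        return yaml.safe_load(file)


# Demonstration with a throwaway file so the snippet runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my_config.yaml")
    with open(demo_path, "w", encoding="utf-8") as file:
        file.write("test_size: 0.2\nn_neighbors: 5\n")
    config = load_config("my_config.yaml", config_dir=tmp_dir)

print(config)
```

safe_load() is preferred over the more permissive loaders because it only builds plain Python objects such as dictionaries, lists, strings, and numbers.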

Our project source code now looks cleaner and more readable. We don't need to change parameters or initial settings directly in the source code, because we have a configuration file for that. We started by importing the important Python packages, including the yaml package, then loaded the configuration file by using the load_config() function and added the initial settings and parameters to our project.
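
A rough sketch of a configuration-driven training script follows. It is not the project's original code. To stay self-contained it assumes scikit-learn is installed and uses its bundled copy of the Breast Cancer Wisconsin (Diagnostic) data instead of a CSV file, keeps the YAML text in a string, and saves the model to a temporary directory. All key names and values are hypothetical.

```python
import os
import tempfile

import joblib
import yaml  # provided by the PyYAML package
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical settings; in a real project these would live in my_config.yaml.
CONFIG_TEXT = """
test_size: 0.2
random_state: 42
model_name: knn_classifier.pkl
knn:
  n_neighbors: 5
  weights: uniform
  p: 2
"""

config = yaml.safe_load(CONFIG_TEXT)

# Load the features and labels bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# The split is controlled entirely by the configuration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=config["test_size"], random_state=config["random_state"]
)

# The nested "knn" block is passed straight through as keyword arguments.
classifier = KNeighborsClassifier(**config["knn"])
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Save the trained model under the configured file name.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, config["model_name"])
joblib.dump(classifier, model_path)
```

Grouping the classifier's parameters under one key means a new parameter can be added in the YAML file without touching the Python code.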

If you want to change the dataset name, columns to drop, test size ratio, or classifier’s parameters, you can do that in the configuration file. You can also create a new configuration file with the same initial setting and parameter names but different values, and run your ML experiments with it.
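
One way to switch between such files without editing the script is to pass the file name on the command line. This is a minimal sketch using Python's standard argparse module; the --config option and the experiment_2.yaml file name are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Read the configuration file name from the command line."""
    parser = argparse.ArgumentParser(description="Run one ML experiment.")
    parser.add_argument(
        "--config",
        default="my_config.yaml",
        help="name of the YAML configuration file to use",
    )
    return parser.parse_args(argv)


# Simulate running: python train.py --config experiment_2.yaml
args = parse_args(["--config", "experiment_2.yaml"])
print(f"Using settings from {args.config}")
```

Each experiment then has its own configuration file, which also serves as a record of the settings that produced a given result.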

Wrap up

Now you understand the importance of using a configuration file in your ML projects. In this article, you learned what a configuration file is, how to create a YAML file, and how to load and use it in your ML project. You can start using a configuration file in your next machine learning project.

The dataset and source code for this article are available on GitHub.

If you are interested in learning more about YAML files, I recommend you read the online materials from Tutorials Point.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Feel free to leave a comment too. Till then, see you in the next post! I can also be reached on Twitter @Davis_McDavid
