Managing Deep Learning Models Easily With TOML Configurations
You may never need those long CLI args for your train.py
Managing deep learning models can be difficult due to the huge number of parameters and settings that are needed across modules. The training module might need parameters like `batch_size` or `num_epochs`, or parameters for the learning rate scheduler. Similarly, the data preprocessing module might need `train_test_split` or parameters for image augmentation.
A naive approach to introducing these parameters into the pipeline is to pass them as CLI arguments while running the scripts. But command-line arguments are tedious to enter, and they scatter the configuration across shell commands instead of keeping all parameters in a single place. TOML files provide a cleaner way to manage configurations, and scripts can load the necessary parts of the configuration as a Python `dict` without boilerplate code to read/parse command-line arguments.
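To make the pain point concrete, here is a sketch of the `argparse` boilerplate such a training script typically accumulates; the parameter names mirror the ones used later in this post, and every new hyperparameter means yet another `add_argument` call plus a longer command line:

```python
import argparse

# Typical CLI boilerplate a TOML config replaces. Each run needs a
# long invocation such as:
#   python train.py --batch_size 64 --num_epochs 20 --learning_rate 3e-4
parser = argparse.ArgumentParser(description="Train a model")
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--num_epochs", type=int, default=10)
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--test_split", type=float, default=0.3)

# Simulate `python train.py --batch_size 64` by passing the args as a list
args = parser.parse_args(["--batch_size", "64"])
print(args.batch_size, args.num_epochs)  # → 64 10
```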
In this blog, we’ll explore the use of TOML in configuration files and how we can efficiently use them across training/deployment scripts.
What are TOML files?
TOML, short for Tom’s Obvious, Minimal Language, is a file format designed specifically for configuration files. The concept of a TOML file is quite similar to YAML/YML files, which also store key-value pairs in a tree-like hierarchy. An advantage of TOML over YAML is its readability, which becomes important when there are multiple nested levels.
Personally, apart from the enhanced readability, I find no practical reason to prefer TOML over YAML. Using YAML is absolutely fine; Python packages such as PyYAML exist for parsing it.
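To make the readability point concrete, here is the same (hypothetical) nested configuration in both formats. In TOML, the full path of each group is visible in the table header, while in YAML the grouping is carried only by indentation:

```toml
[model.encoder]
num_layers = 6
hidden_dim = 512

[model.decoder]
num_layers = 6
hidden_dim = 512
```

```yaml
model:
  encoder:
    num_layers: 6
    hidden_dim: 512
  decoder:
    num_layers: 6
    hidden_dim: 512
```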
Why do we need configurations in TOML?
There are two advantages of using TOML to store the model/data/deployment configuration of ML models:

Managing all configurations in a single file: With TOML files, we can create multiple groups of settings required by different modules. For instance, in figure 1, the settings related to the model’s training procedure are nested under the `[train]` table; similarly, the `port` and `host` required for deploying the model are stored under `[deploy]`. We need not jump between `train.py` and `deploy.py` to change their parameters; instead, we can globalize all settings in a single TOML configuration file.

Easy editing without an IDE: This is super helpful if we’re training the model on a virtual machine, where code editors or IDEs are not available for editing files. A single config file is easy to edit with `vim` or `nano`, which are available on most VMs.
How do we read configurations from TOML?
To read the configuration from a TOML file, we can use two Python packages: `toml` and `munch`. `toml` reads the TOML file and returns its contents as a Python `dict`, while `munch` converts that `dict` to enable attribute-style access of its elements. For instance, instead of writing `config["training"]["num_epochs"]`, we can just write `config.training.num_epochs`, which enhances readability.
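The idea behind munch can be approximated with the standard library. The following is an illustrative sketch only, not munch’s actual implementation (which additionally preserves full `dict` behaviour), but it shows where attribute-style access comes from:

```python
from types import SimpleNamespace

def attrify(obj):
    # Recursively wrap nested dicts in SimpleNamespace so that
    # config["training"]["num_epochs"] becomes config.training.num_epochs.
    if isinstance(obj, dict):
        return SimpleNamespace(**{key: attrify(value) for key, value in obj.items()})
    return obj

config = attrify({"training": {"num_epochs": 10, "batch_size": 32}})
print(config.training.num_epochs)  # → 10
```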
Consider the following file structure:
- config.py
- train.py
- project_config.toml
`project_config.toml` contains the configuration for our ML project:
[data]
vocab_size = 5589
seq_length = 10
test_split = 0.3
data_path = "dataset/"
data_tensors_path = "data_tensors/"
[model]
embedding_dim = 256
num_blocks = 5
num_heads_in_block = 3
[train]
num_epochs = 10
batch_size = 32
learning_rate = 0.001
checkpoint_path = "auto"
In `config.py`, we create a function that returns the munchified version of this configuration, using `toml` and `munch`:

$> pip install toml munch

import toml
import munch

def load_global_config(filepath: str = "project_config.toml"):
    return munch.munchify(toml.load(filepath))

def save_global_config(new_config, filepath: str = "project_config.toml"):
    with open(filepath, "w") as file:
        toml.dump(new_config, file)
Now, in any of our project files, like `train.py` or `predict.py`, we can load this configuration:

from config import load_global_config

config = load_global_config()

batch_size = config.train.batch_size
lr = config.train.learning_rate

if config.train.checkpoint_path == "auto":
    # Make a directory with the current timestamp as its name
    pass
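The `"auto"` branch above could be filled in along these lines. This is a minimal sketch; the `checkpoints` base directory is an assumed name, not something the configuration above defines:

```python
from datetime import datetime
from pathlib import Path

def make_checkpoint_dir(base: str = "checkpoints") -> Path:
    # Create a fresh run directory named after the current timestamp,
    # e.g. checkpoints/2024-05-01_13-45-10, creating parents if missing.
    run_dir = Path(base) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

checkpoint_path = make_checkpoint_dir()
```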
The output of print(toml.load(filepath)) is:
{'data': {'data_path': 'dataset/',
'data_tensors_path': 'data_tensors/',
'seq_length': 10,
'test_split': 0.3,
'vocab_size': 5589},
'model': {'embedding_dim': 256, 'num_blocks': 5, 'num_heads_in_block': 3},
'train': {'batch_size': 32,
'checkpoint_path': 'auto',
'learning_rate': 0.001,
'num_epochs': 10}}
If you’re using MLOps tools like W&B Tracking or MLflow, maintaining the configuration as a `dict` can be helpful, as we can pass it directly as an argument.
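One wrinkle: some trackers display flat key/value pairs more cleanly than nested dicts, so it can help to flatten the configuration before logging it. A small stdlib-only sketch; the dotted-key convention here is my own choice, not mandated by any particular tool:

```python
def flatten_config(cfg: dict, parent: str = "", sep: str = ".") -> dict:
    # Turn {'train': {'batch_size': 32}} into {'train.batch_size': 32},
    # recursing through nested dicts and joining keys with `sep`.
    flat = {}
    for key, value in cfg.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten_config(value, name, sep))
        else:
            flat[name] = value
    return flat

config = {"train": {"batch_size": 32, "learning_rate": 0.001},
          "model": {"num_blocks": 5}}
print(flatten_config(config))
# → {'train.batch_size': 32, 'train.learning_rate': 0.001, 'model.num_blocks': 5}
```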
The End
Hope you will consider using TOML configurations in your next ML project! It’s a clean way of managing settings that are global or local to your training, deployment, or inference scripts.
Instead of writing long CLI arguments, the scripts can load the configuration directly from the TOML file. If we wish to train two versions of a model with different hyperparameters, we just need to change the TOML file passed to `load_global_config` in `config.py`. I have started using TOML files in my recent projects, and experimentation has become faster. MLOps tools can also manage model versions along with their configurations, but the simplicity of the approach discussed above is unique and requires minimal changes to existing projects.
Hope you’ve enjoyed reading. Have a nice day ahead!