DVC (Data Version Control) is a framework designed to help ML engineers manage their data and model pipelines more efficiently. It addresses the challenges that arise when working with large datasets, complex model training processes, and collaboration.
In this topic, we will look at the core components of DVC.
The setup
Managing data and tracking experiments can be daunting. Datasets can be huge, and small changes in data can lead to significant differences in model performance. Model training often involves multiple steps, each with its own set of dependencies and configurations. DVC addresses these problems by letting you version-control the data, models, and pipelines alongside the code. This ensures that changes to data and models are tracked and that experiments can be easily reproduced.
Git alone can’t easily handle ML versioning, because versioning just the training code is not enough, and Git copes poorly with large binary files such as datasets and models. DVC can be thought of as ‘Git for data’, and it has a very Git-like syntax.
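To make the analogy concrete, here is a minimal sketch of the Git-like workflow (the file name is a placeholder):

```bash
git init                       # start a Git repository
dvc init                       # set up DVC inside it

dvc add data/raw.csv           # track a large data file with DVC instead of Git

# DVC creates a small pointer file and gitignores the data itself;
# only the pointer goes into Git history
git add data/raw.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"
```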
Basic commands
Let's take a quick look at some of the most important DVC commands:
| Command | Description |
|---|---|
| `dvc init` | Initializes a new DVC repository in your current directory. It creates a `.dvc/` directory that stores DVC's internal files, such as the cache and configuration. |
| `dvc add` | Starts tracking data files or directories with DVC. When you run `dvc add`, DVC moves the data into its cache and generates a small `.dvc` pointer file that you commit to Git in place of the data itself. |
| `dvc run` | Used to define and run pipeline stages (in DVC 2.0+, `dvc stage add` defines a stage without running it). It takes a command or script as input and tracks the dependencies (input files) and outputs of the stage. DVC automatically captures the intermediate outputs and caches them, making it easy to reproduce the pipeline later. |
| `dvc repro` | Reproduces a data pipeline or a specific stage of it. DVC checks whether any dependencies have changed and recomputes only the stages affected by those changes. This keeps your pipeline up to date and reproducible. |
| `dvc push` / `dvc pull` | Push and pull data or model files to and from a remote storage location (e.g., AWS S3, Google Cloud Storage, or any other supported remote). This is particularly useful for large datasets that cannot be stored in a Git repository. |
| `dvc metrics` | Used to track and visualize model performance metrics. You can define metrics files that store your model's evaluation metrics, and DVC will track their changes over time, allowing you to compare model performance across different experiments. |
| `dvc dag` | Displays the directed acyclic graph (DAG) of your data pipeline, showing the dependencies between different stages. |
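To sketch how these commands work together in practice (the remote name, bucket, and repository URL below are placeholders):

```bash
# Configure a default remote and upload the cached data
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

# On another machine: clone the code, then fetch the data
git clone https://github.com/user/project.git
cd project
dvc pull

# Re-run any pipeline stages whose dependencies changed
dvc repro

# Inspect the pipeline structure
dvc dag
```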
A few words on the structure
In a DVC project, the primary configuration file is dvc.yaml, which serves as the central hub for managing data pipelines, dependencies, and other project-specific settings. This file is typically located in the root directory of the project.
The dvc.yaml file contains a structured representation of the data pipeline stages, their dependencies, and outputs. Each stage is defined by a command or script that performs a specific task, such as data preprocessing, feature extraction, or model training. The dependencies and outputs of each stage are explicitly specified, allowing DVC to track and manage them efficiently.
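As an illustrative sketch, a dvc.yaml for a simple two-stage pipeline could look like this (the scripts and file names are assumptions, and train.py is assumed to write metrics.json):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py data/raw.csv data/clean.csv
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python src/train.py data/clean.csv models/model.pkl
    deps:
      - src/train.py
      - data/clean.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false    # keep the small metrics file in Git, not in the DVC cache
```

Because the train stage lists data/clean.csv (an output of preprocess) among its dependencies, DVC can infer the order of the stages automatically.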
To handle large files that cannot be stored in Git, DVC employs a cache mechanism. When a large file (e.g., a dataset or a model) is added to DVC using the `dvc add` command, DVC computes a hash of the file and stores the file itself in the .dvc/cache directory under that hash. This cache directory is managed by DVC and is not tracked by Git. Instead, a small .dvc file is created, which acts as a pointer to the cached data.
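For example, after running `dvc add data/raw.csv`, the generated pointer file data/raw.csv.dvc looks roughly like this (the hash and size are made up, and the exact fields vary between DVC versions):

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: raw.csv
```

This small text file is what gets committed to Git, while the actual data lives in the cache and can be restored from it (or from a remote) at any time.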
DVC handles data dependencies in pipelines by representing them as a directed acyclic graph (DAG). Each stage in the pipeline is a node in the DAG, and the dependencies between stages are represented as edges. This structure allows DVC to efficiently determine which stages need to be recomputed when a dependency changes, ensuring that the pipeline remains up-to-date and reproducible.
When changes are made to data or model files that are part of a pipeline, DVC detects them and knows which pipeline stages are affected. It accomplishes this by hashing the input files and comparing the hashes to the recorded dependencies. If a dependency has changed, DVC considers the corresponding stages stale and recomputes them on the next `dvc repro`.
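In practice, the cycle looks like this (a sketch; the file name is a placeholder):

```bash
# Edit an input file that some stage depends on
echo "new,row,of,data" >> data/raw.csv

# Ask DVC which stages are now out of date
dvc status

# Recompute only the affected stages
dvc repro
```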
In addition to the dvc.yaml file and the cache directory, DVC creates other files to manage the project's state and metadata. For example, the .dvc/config file stores project-specific configuration settings (such as remotes), while the .dvc/state database caches file hashes and timestamps so that unchanged files don't have to be re-hashed. The exact versions (hashes) of each stage's dependencies and outputs are recorded in the dvc.lock file next to dvc.yaml.
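For example, after configuring a remote, the .dvc/config file might contain something like this (the remote name and bucket are placeholders):

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-bucket/dvc-store
```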
Overall, DVC's project structure is designed to provide an organized way to manage data, models, and pipelines, while ensuring reproducibility and efficient handling of large files.
Conclusion
You should now be familiar with the core functionality of DVC, its basic commands, and the general structure of a DVC project.