Data Version Control

What is it?

Data Version Control (DVC) is a tool for version controlling large files that would be impractical or impossible to version control directly with Git due to their size. By providing a way to version data alongside code in repositories, DVC allows models and codebases with data dependencies to be fully reproducible.

How does it work?

Similar to Git, a DVC repository can be initialised in an existing Git repository. When a file is added to DVC for tracking, DVC creates a small pointer file containing a hash of the file's contents and its path. Files version controlled with DVC can then be pushed to a remote location (e.g. an AWS S3 bucket). The pointer files and the .dvc directory (which holds the DVC configuration for the repo; the cache itself is gitignored) are committed to GitHub, so there is a record of both the code and, through DVC, the required large files. The large files can be retrieved with dvc pull. For an example of how this looks within a repository, see one of the examples below.
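
As a rough sketch of how the pointer files look (the file name, hash, and size here are hypothetical), running dvc add data.csv would produce a small data.csv.dvc file along these lines:

    outs:
    - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
      size: 1048576
      path: data.csv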

Where are some examples of DVC being used in TAU?

The following are examples of data / components of repositories version controlled with DVC:

  • model inputs for the transport emissions model

  • demographics and networks inputs for the SEQSTM

  • trained machine learning model for the bike_lane_from_aerial package of the at_network_tools repo

  • inputs and outputs of API request for dspark requests repo (e.g. see request number 47)

General DVC Workflow

To begin working with DVC in an existing repository:

  1. Install the DVC command line tool. Although there is a Python package available as well, it can be a little less reliable than the command line tool.
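
     For example, via pip (one install option among several; the s3 extra pulls in the dependencies DVC needs for S3 remotes):

    pip install "dvc[s3]"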

  2. Initialise DVC within the root folder of the Git repository (where the .git directory lives). On the command line:

    dvc init
    
  3. Add files or directories for DVC to track. If different files within the same directory are likely to change on different branches, it is probably best to version control the individual files or sub-directories with DVC to avoid merge conflicts.

    dvc add your-file-or-directory
    
  4. Push DVC-tracked files to a configured remote repository (see below for more details).

    dvc push
    
  5. Commit and push the .dvc pointer files, the .dvc directory, and any updated .gitignore files to Git / GitHub.
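
     A minimal sketch, assuming the tracked path from step 3 (adjust the file names to your repo):

    git add your-file-or-directory.dvc .gitignore .dvc
    git commit -m "Track data with DVC"
    git push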

  6. Test that the changes have been successful and that you have not made errors by cloning the repo afresh with Git and pulling the data with DVC.
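
     For example (the repository URL is a placeholder):

    git clone <your-repo-url> fresh-clone
    cd fresh-clone
    dvc pull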

Configuring a Remote Repository

There are various options for configuring a remote repository, but the most common within TAU currently is an S3 bucket.

A workflow might look like:

  1. Create a new bucket for storing files tracked by DVC (usually one per repo / project).
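
     For example, with the AWS CLI (the bucket name is a placeholder):

    aws s3 mb s3://<YOUR_BUCKET_HERE>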

  2. Attach a policy to the S3 bucket to prevent potential data loss due to accidental deletion. Example below.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket",
                    "s3:DeleteBucketPolicy",
                    "s3:PutBucketPolicy",
                    "s3:PutLifecycleConfiguration"
                ],
                "Resource": [
                    "arn:aws:s3:::<YOUR_BUCKET_HERE>",
                    "arn:aws:s3:::<YOUR_BUCKET_HERE>/*"
                ],
                "Condition": {
                    "StringNotLike": {
                        "aws:userId": [
                            "<USER_ID_HERE>:*",
                            "<ACCOUNT_ID_HERE>"
                        ]
                    }
                }
            },
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteBucket",
                "Resource": "arn:aws:s3:::<YOUR_BUCKET_HERE>"
            }
        ]
    }
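
     The policy can then be applied with the AWS CLI (assuming it is saved locally as policy.json):

    aws s3api put-bucket-policy --bucket <YOUR_BUCKET_HERE> --policy file://policy.json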
    
  3. Point DVC at the remote storage location by adding it as the default remote. If done this way, the remote will be stored in the .dvc/config file, so new users of your repo will not need to configure the remote manually. On the command line:

    dvc remote add -d myremote s3://<bucket>/<key>
    
  4. Point DVC to your local AWS credentials by setting the path to the AWS config file on your machine. On Windows, the path will be similar to C:/Users/USERNAME/.aws/config; on Unix-based systems, ~/.aws/config.

    dvc remote modify --local myremote configpath path/to/aws/config
    
  5. Now you have a remote repository configured and can dvc push to this location.
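
     As a quick sanity check, dvc status -c compares the local cache against the remote:

    dvc push
    dvc status -c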

Nb. If working with externals, it is often best to create a role with permissions to access only the required S3 resources. A SCO request will need to be lodged to create a new role. A configuration profile can then be created using this role to sign in and authenticate.
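
A sketch of what such a profile might look like in the AWS config file (the profile name, role ARN, and source profile are hypothetical):

    [profile dvc-external]
    role_arn = arn:aws:iam::<ACCOUNT_ID_HERE>:role/<ROLE_NAME_HERE>
    source_profile = default

DVC can then be pointed at this profile with dvc remote modify --local myremote profile dvc-external.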