What is DVC?
Git for Data (What is DVC?)
Data Version Control (DVC) is an open-source tool for data science and machine learning teams to manage datasets, ML models, and experiments in Git. Key parts include:
- “Git for Data and Models” - DVC extends Git versioning to large files like datasets and ML models for rigorous project management. Use your regular Git workflow for ML projects, and share project materials with a Git repository URL.
- “Makefiles for data and ML projects” - DVC pipelines work like makefiles for ML, with optimizations and human-readable formatting suited to organizing ML projects. Pipelines connect scripts to their dependencies and outputs, such as datasets and models, for reproducibility and efficiency.
- “Experiment tracking via Git” - Compare model metrics, hyperparameters, and plots across commits, branches, and releases.
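To make the “Git for Data” idea more concrete, here is a minimal Python sketch of the general pattern: the large file goes into a content-addressed cache, and only a small pointer file would be committed to Git. This is an illustration under stated assumptions, not DVC's implementation. The helper names `track` and `restore`, the `.pointer.json` suffix, the cache layout, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache and write a small pointer file next to it.

    Hypothetical helper: the pointer file is what you would commit to Git
    instead of the large file itself.
    """
    checksum = file_sha256(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": checksum, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file described by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    checksum = meta["sha256"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dataset = root / "train.csv"
        dataset.write_bytes(b"a,b\n1,2\n")
        pointer_file = track(dataset, root / "cache")
        dataset.unlink()  # simulate a fresh clone that has only the pointer
        restored = restore(pointer_file, root / "cache")
        print(pointer_file.read_text())
        print(restored.read_bytes())
```

Because the pointer file is tiny and changes whenever the data changes, an ordinary Git history of pointer files is enough to identify which version of a dataset or model belongs to each commit.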
DVC was created in 2017 to address gaps in ML tooling, and has evolved into a successful open-source project with 150+ contributors and thousands of users.
Some interesting highlights from the community:
If you want to learn more about DVC and its journey, check out the interview with DVC creator Dmitry Petrov on Podcast.__init__. You can listen to the episode or read the transcript.
Recently, version 1.0 was released. DVC 1.0 is inspired by discussions and contributions from our community of data scientists, ML engineers, developers and software engineers. Read up on new features, like data visualization and data transfer optimizations, in our release blog post.
Today, the project remains under active development. Now that the data management layer has reached a stable form, the DVC team is focusing on the data scientist’s experience. Our goal is to become Git for ML: a holistic tool that captures the ML experiment lifecycle while following a Git-like philosophy. That means no complicated infrastructure, no databases, and no dependencies on third-party APIs.
DVC Contributor’s Guide
We welcome contributors from different backgrounds and levels of experience! We’ll be happy to guide and help with your contribution, whether it is to the core project, documentation, tutorials, or blog. We’ve participated in programs like Google Season of Docs (similar to Google Summer of Code) and have substantial experience guiding and mentoring people making their first contributions. We also have an established community of experienced contributors.
DVC is written in Python. It’s a command-line tool that deals with large files and Git internals (among other things), but it is built with ML engineers and data scientists in mind as end users. You may be a great match for us if you want to see and learn software engineering best practices in a mature project, and if you’re curious how ML teams operationalize their workflows.
The best way to start is to check the project’s issue tracker (or this one for the website and docs). There is usually plenty of low-hanging fruit, tagged with the “good first issue” and “hacktoberfest” labels. We recommend getting in touch with us about the issue (to confirm the details, get help, etc.) in one of the support channels or in the #tool_dvc channel in ODS.