Validation
Validating an environment consists of two elements:
- Confidently recreating the same environment
- Trusting what is in the environment
The first concern, reproducing environments, is covered at length by the different strategies for environment management. The validated strategy is particularly useful for creating sets of approved packages, though other strategies can be used depending on the context.
The second concern forces us to answer the question: “Can we trust our environment?” To trust an environment, we must have confidence that its packages do what they claim to do. Unfortunately, with approximately 18,000 R packages on CRAN, and more added each day, it is impossible to provide a single list of trusted packages. Every organization or industry will need to apply its own judgment in determining whether to approve a package. This page presents a set of metrics to help organizations make these determinations.
Quick Links
Not what you were expecting? Before continuing, here are some quick links to other resources specific to validation in the clinical pharma space:
Package Characteristics
The following heuristics can help you judge whether a package is stable and useful. As a rule of thumb, you can use these characteristics as a checklist when evaluating a package. As with any heuristic, there are exceptions: not every stable and useful package will have every characteristic.
CRAN Releases
The first question to ask when evaluating a package is: “Is the package on CRAN?” Before accepting a package, CRAN runs a thorough set of tests to check that the package will work with other packages on CRAN. Passing these checks suggests the package is reasonably stable, and also indicates the package author is serious and motivated. While not every package on CRAN is perfect, a package on CRAN indicates a minimal level of effort and stability. More information on CRAN tests can be reviewed here.
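One way to answer the CRAN question from an R session is to look the package up in the repository index. The sketch below is illustrative only: the helper name is_on_cran is hypothetical, and the default repository URL assumes the CRAN cloud mirror.

```r
# Check whether packages appear in a repository index.
# `db` defaults to the live CRAN index (needs network access); pass your own
# matrix, for example from an internal repository, to check against that.
is_on_cran <- function(pkgs,
                       db = utils::available.packages(
                         repos = "https://cloud.r-project.org"
                       )) {
  # available.packages() returns a matrix with one row per package,
  # using the package names as row names.
  stats::setNames(pkgs %in% rownames(db), pkgs)
}
```

Calling is_on_cran(c("ggplot2", "somepkg")) returns a named logical vector with one entry per package. Because the index is an argument, the same helper can be pointed at any repository that available.packages() can read.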
Tests
Another critical indicator that a package is ready for prime time is whether it has tests. Package authors normally include tests in a tests directory alongside their package code. Tests help authors check their code for accuracy and prevent them from accidentally breaking existing behavior when they make changes.
Many packages will go a step further and report test coverage. This metric indicates how much of the package code is currently tested. Often package authors will automatically run tests using a continuous integration service and report test status and code coverage through public badges.
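As a rough first pass, you can check whether a package's source tree ships any test scripts at all. The following is a minimal sketch, assuming you have an unpacked copy of the package source; the helper name has_tests is hypothetical.

```r
# Report whether a package source directory ships tests.
# `pkg_dir` is the path to an unpacked package source tree.
has_tests <- function(pkg_dir) {
  tests_dir <- file.path(pkg_dir, "tests")
  if (!dir.exists(tests_dir)) {
    return(FALSE)
  }
  # Count R scripts anywhere under tests/ (testthat keeps them in
  # tests/testthat/).
  test_files <- list.files(tests_dir, pattern = "\\.[Rr]$", recursive = TRUE)
  length(test_files) > 0
}
```

This only confirms that tests exist, not that they are meaningful. If the covr package is installed, running covr::package_coverage() on the same directory reports how much of the code those tests exercise.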
Documentation
A critical indicator of a package’s health and usefulness is the level of documentation. R packages provide documentation in a number of formats, including function-level help pages, vignettes, and package websites.
Downloads
The number of times a package is downloaded can help you determine how frequently a package is used. Often packages with many downloads are more stable than packages with fewer downloads. However, take care when using this metric: occasionally a package with fewer downloads is a newer alternative to a package that has many downloads but is nearing end of life.
Posit provides download logs for the popular CRAN mirror, https://cran.rstudio.com. The easiest way to access these logs is through the cranlogs R package and API, or by visiting this Shiny app.
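To compare candidates side by side, you can pull the daily counts with cranlogs and total them per package. This sketch assumes the cranlogs package is installed and that network access is available when fetch_downloads() is called; both helper names are hypothetical.

```r
# Fetch daily download counts for the last month from the cranlogs API.
# Requires the cranlogs package and network access, so it is only defined
# here and not called.
fetch_downloads <- function(pkgs) {
  cranlogs::cran_downloads(packages = pkgs, when = "last-month")
}

# Sum daily counts per package. `dl` is a data frame with `package` and
# `count` columns, the shape cran_downloads() returns.
total_downloads <- function(dl) {
  totals <- vapply(split(dl$count, dl$package), sum, numeric(1))
  sort(totals, decreasing = TRUE)
}
```

For example, total_downloads(fetch_downloads(c("ggplot2", "lattice"))) gives one total per package, largest first. Treat the totals as a relative signal, since they cover a single mirror.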
Dependencies
When you consider bringing a package into your environment, it is important to evaluate the package’s dependencies. Evaluating the risk of package dependencies is a complex process. A great place to start is reviewing this talk and the related itdepends tool. A few quick tips:
- Package dependencies can be viewed in the package’s DESCRIPTION file and come in a few flavors: Suggests, Depends, Imports, and LinkingTo.
- Package dependencies describe what a package relies on. For example, ggplot2 imports rlang, which means ggplot2 requires rlang in order to work. Reverse dependencies indicate the opposite, so ggplot2 is a reverse dependency of rlang.
- You should understand how package inter-dependencies impact reproducibility.
- In addition to depending on other R packages, a package can have system requirements. For example, the rJava package requires a Java installation. You can view system dependencies for a package in the SystemRequirements field of its DESCRIPTION file, though a more complete listing is available here or in Package Manager.
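The dependency fields mentioned in the tips above can be read directly from a package's DESCRIPTION file with base R. The following is a minimal sketch; the helper name read_dependencies is hypothetical, and it lists only the dependencies the package declares, not the dependencies of those dependencies.

```r
# List a package's declared dependencies, grouped by DESCRIPTION field.
# `desc_path` is the path to a package's DESCRIPTION file.
read_dependencies <- function(desc_path) {
  pkg_fields <- c("Depends", "Imports", "LinkingTo", "Suggests")
  desc <- read.dcf(desc_path, fields = c(pkg_fields, "SystemRequirements"))

  parse_field <- function(field) {
    value <- desc[1, field]
    if (is.na(value)) {
      return(character(0))
    }
    # Entries are comma-separated; drop version constraints like "(>= 1.0)".
    entries <- strsplit(value, ",")[[1]]
    entries <- trimws(gsub("\\([^)]*\\)", "", entries))
    entries[nzchar(entries)]
  }

  deps <- lapply(stats::setNames(pkg_fields, pkg_fields), parse_field)
  # System requirements are free text, so keep them verbatim.
  deps$SystemRequirements <- unname(desc[1, "SystemRequirements"])
  deps
}
```

For an installed package, system.file("DESCRIPTION", package = "ggplot2") gives a path you can pass in. For recursive or reverse dependencies across a whole repository, tools::package_dependencies() is likely a better starting point than parsing files by hand.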
News, Releases, and Life Cycle
Another indicator of a package’s stability is the package’s release history. For packages on GitHub, this release history is often visible directly. You can also look for the package’s NEWS file.
Unfortunately, just looking at the number of releases or the date of the last release does not paint the whole picture. Some packages will have lots of recent releases because they are rapidly changing. Other packages might not have had a release for quite some time. Is this because the package has been abandoned, or because it is really stable? Considering where the package is in its life cycle can help answer these questions.
License Restrictions
Finally, when picking a package, you should consider whether your organization has any licensing restrictions. Licenses for R packages can be found in their DESCRIPTION file, and many R packages include an additional license file. Organizations with strict licensing requirements might consider an internal repository to track and audit license usage.
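A simple audit can compare the License field of each package against a list your organization accepts. The sketch below is an assumption-laden example: the helper name flag_unapproved_licenses and the patterns in approved are hypothetical and stand in for whatever your own policy specifies.

```r
# Flag packages whose license does not match an approved pattern.
# `pkgs` is a character matrix with "Package" and "License" columns, the
# shape installed.packages() returns. `approved` holds regular expressions.
flag_unapproved_licenses <- function(pkgs, approved) {
  licenses <- unname(pkgs[, "License"])
  ok <- !is.na(licenses) & grepl(paste(approved, collapse = "|"), licenses)
  data.frame(
    package = unname(pkgs[!ok, "Package"]),
    license = licenses[!ok],
    stringsAsFactors = FALSE
  )
}

# Hypothetical allowlist; replace with your organization's policy.
approved <- c("^MIT", "^GPL", "^Apache License")
```

Running flag_unapproved_licenses(installed.packages(), approved) lists the installed packages that need a closer look. Packages that declare a custom license through a license file will be flagged too, which is usually what you want for manual review.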
Organizing Selected Packages
If you work in an organization, you may want an easy way to harness tribal knowledge about packages that meet your team’s requirements, or packages that have proven useful time and time again. An easy way to share useful sets of packages is through an internal repository, which can be created using Package Manager. Internal repositories also provide an easy way to track package downloads, making it possible to see what packages are actually used by your team!
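Once an internal repository exists, team members can point R at it through the repos option. This is a minimal sketch: the URL is a placeholder, not a real endpoint, and the repository name Approved is arbitrary.

```r
# Hypothetical internal repository URL; substitute the one your
# Package Manager administrator provides.
internal_repo <- "https://packages.example.com/approved/latest"

# Point this R session at the internal repository so install.packages()
# resolves packages from the approved set only.
options(repos = c(Approved = internal_repo))
```

Placing the options() call in a project-level .Rprofile, or in the site-wide Rprofile.site, applies the setting to every session without each user having to remember it.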