4. Data Access

In this section, you will learn:

what the admin needs to provide for connecting Workbench users to data sources
how to mount shared file storage
how to install and configure database drivers
troubleshooting methods for database connections

Your team’s data scientists will need access to their data, which could be stored in various locations. Administrator action is required to configure data access from the following locations:

Directory on the Workbench server (e.g., developer’s project directory or home directory)
Mounted file share
External database

Connections to data sources are handled in the developer’s script in R or Python, not by Workbench or Connect itself. If one can connect to a data source from R or Python running directly on the Linux server, one can also make that connection from within Workbench or Connect.

Directory on Workbench

Developers can store data in their code project directory or their home directory. You must use Linux file permissions to set access to files. For the RStudio Pro IDE, you can also restrict which files a user can view inside the IDE. See the Directory Management section of the Workbench Admin Guide.

An admin can create an additional shared directory that allows a team to upload/download/share data (e.g. /projects/project_id). After provisioning the directory, set access controls so that the appropriate team has read/write access.

Recommendations:

Provision access to the directory using groups in order to have auto-provisioning when new users are added.
Add appropriate conditions on the shared directory so that newly created content inherits the permissions of its parent folder.
Consider setting multiple levels of access so that there are different groups of users with read/write access and read-only access.

Databases

Databases may use many different methodologies for data storage, such as:

File-based (e.g., SQLite, DuckDB)
Relational Database Management Systems (e.g., PostgreSQL, MariaDB, Microsoft SQL)
NoSQL and Document Stores (e.g., MongoDB, Cassandra)

These databases use different wire protocols for connections, however standardization across database vendors permits a smoother experience for administrators and users. Key standards to be aware of include:

ODBC, a cross-language standard protocol defining how external tools and systems can submit queries to databases. ODBC database drivers use the ODBC protocol for database calls. The drivers translate the ODBC interface to the database-specific functions.
Similar to the ODBC protocol, JDBC is a standard protocol for database queries, however this is built around the Java ecosystem. We do not recommend JDBC database connections due to challenges in managing Java.
DBI, an R package and front-end for making database connections from the R language. This provides a consistent interface independent of the database. DBI is coupled with a back-end DBI-compliant R package such as odbc (note: lower-case. This is an R package) to make a database connection from R. Some database types have specific DBI-compliant packages available for making this front end connection, such as RPostgres, RMariaDB, bigrquery, or RSQLite.
PEP-249 is a proposed standard in Python for making database connectors. This standard is used by packages such as pyodbc for making ODBC connections to databases.

As an administrator, you will need to install (preferably) ODBC drivers for relevant databases on Workbench and Connect. Users can then use the appropriate R or Python packages to interact with these drivers to make their database connections.

Professional Drivers installation

Posit provides professionally-supported ODBC drivers for a number of databases for your use with Posit professional tooling. Installation is documented here. To configure the Posit Professional Drivers, or a driver supplied from another source, specify the driver location and attributes in /etc/odbcinst.ini.

Example /etc/odbcinst.ini file

[PostgreSQL]
Driver=/opt/rstudio-drivers/postgresql/bin/lib/libpostgresqlodbc_sb64.so
RStudioVersion=2021.10.0
Version=1.4.41.1001
Installer=RStudio Pro Drivers
UsageCount=1

Data Source Name (DSN)

DSNs provide a symbolic name that combines all the parameters required to make a database connection. This makes connections simpler for users to manage. Database credentials (typically a service account) can be stored in the DSN definition. If the database credentials are omitted here, users will need to include this in their database connection script. DSNs are configured in /etc/odbc.ini or ~/.odbc.ini.

Example odbc.ini file

[MyPostgresDB]
Driver   = "PostgreSQL"
Server   = "[your server's path]"
Database = "[your database's name]"
UID      = "[user]"
PWD      = "[password]"
Port     = 5432

Troubleshooting database connections

Always test and confirm database connections at the Linux command line using isql first before troubleshooting database connections from within Workbench or Connect.

Ensure there is a DSN for the database defined in odbc.ini
Use isql -v <DSN> to test the connection
Test connection from R or Python only after isql can successfully connect

Typical problems with database connections are:

Dynamic libraries not found
MicrosoftSQL & Kerberos: Service Principal Name (SPN) does not match

See additional troubleshooting information in the Professional Drivers documentation here.

Data Sources Lab

🚀 Launch the exercise environment!

In the exercise environment you will get experience:

Installing Posit Professional Database Drivers
Configuring a DSN
Connecting to a database from R and Python in Workbench

Go to: 5. Monitoring

Directory on Workbench

Mounted file share

NFS

SMB/CIFS

Databases

Professional Drivers installation

Data Source Name (DSN)

Troubleshooting database connections