4. Data Access
In this section, you will learn:
- what the admin needs to provide for connecting Workbench users to data sources
- how to mount shared file storage
- how to install and configure database drivers
- troubleshooting methods for database connections
Your team’s data scientists will need access to their data, which could be stored in various locations. Administrator action is required to configure data access from the following locations:
- Directory on the Workbench server (e.g., developer’s project directory or home directory)
- Mounted file share
- External database
Connections to data sources are handled in the developer’s script in R or Python, not by Workbench or Connect itself. If one can connect to a data source from R or Python running directly on the Linux server, one can also make that connection from within Workbench or Connect.
Directory on Workbench
Developers can store data in their code project directory or their home directory. You must use Linux file permissions to set access to files. For the RStudio Pro IDE, you can also restrict which files a user can view inside the IDE. See the Directory Management section of the Workbench Admin Guide.
An admin can create an additional shared directory that allows a team to upload/download/share data (e.g. /projects/project_id
). After provisioning the directory, set access controls so that the appropriate team has read/write access.
Recommendations:
- Provision access to the directory using groups in order to have auto-provisioning when new users are added.
- Add appropriate conditions on the shared directory so that newly created content inherits the permissions of its parent folder.
- Consider setting multiple levels of access so that there are different groups of users with read/write access and read-only access.
Databases
Databases may use many different methodologies for data storage, such as:
- File-based (e.g., SQLite, DuckDB)
- Relational Database Management Systems (e.g., PostgreSQL, MariaDB, Microsoft SQL)
- NoSQL and Document Stores (e.g., MongoDB, Cassandra)
These databases use different wire protocols for connections, however standardization across database vendors permits a smoother experience for administrators and users. Key standards to be aware of include:
- ODBC, a cross-language standard protocol defining how external tools and systems can submit queries to databases. ODBC database drivers use the ODBC protocol for database calls. The drivers translate the ODBC interface to the database-specific functions.
- Similar to the ODBC protocol, JDBC is a standard protocol for database queries, however this is built around the Java ecosystem. We do not recommend JDBC database connections due to challenges in managing Java.
DBI
, an R package and front-end for making database connections from the R language. This provides a consistent interface independent of the database.DBI
is coupled with a back-end DBI-compliant R package such asodbc
(note: lower-case. This is an R package) to make a database connection from R. Some database types have specific DBI-compliant packages available for making this front end connection, such asRPostgres
,RMariaDB
,bigrquery
, orRSQLite
.PEP-249
is a proposed standard in Python for making database connectors. This standard is used by packages such aspyodbc
for making ODBC connections to databases.
As an administrator, you will need to install (preferably) ODBC drivers for relevant databases on Workbench and Connect. Users can then use the appropriate R or Python packages to interact with these drivers to make their database connections.
Professional Drivers installation
Posit provides professionally-supported ODBC drivers for a number of databases for your use with Posit professional tooling. Installation is documented here. To configure the Posit Professional Drivers, or a driver supplied from another source, specify the driver location and attributes in /etc/odbcinst.ini
.
/etc/odbcinst.ini
file
[PostgreSQL]
Driver=/opt/rstudio-drivers/postgresql/bin/lib/libpostgresqlodbc_sb64.so
RStudioVersion=2021.10.0
Version=1.4.41.1001
Installer=RStudio Pro Drivers
UsageCount=1
Data Source Name (DSN)
DSNs provide a symbolic name that combines all the parameters required to make a database connection. This makes connections simpler for users to manage. Database credentials (typically a service account) can be stored in the DSN definition. If the database credentials are omitted here, users will need to include this in their database connection script. DSNs are configured in /etc/odbc.ini
or ~/.odbc.ini
.
odbc.ini
file
[MyPostgresDB]
Driver = "PostgreSQL"
Server = "[your server's path]"
Database = "[your database's name]"
UID = "[user]"
PWD = "[password]"
Port = 5432
Troubleshooting database connections
Always test and confirm database connections at the Linux command line using isql
first before troubleshooting database connections from within Workbench or Connect.
- Ensure there is a DSN for the database defined in
odbc.ini
- Use
isql -v <DSN>
to test the connection - Test connection from R or Python only after
isql
can successfully connect
Typical problems with database connections are:
- Dynamic libraries not found
- MicrosoftSQL & Kerberos: Service Principal Name (SPN) does not match
See additional troubleshooting information in the Professional Drivers documentation here.
🚀 Launch the exercise environment!
In the exercise environment you will get experience:
Installing Posit Professional Database Drivers
Configuring a DSN
Connecting to a database from R and Python in Workbench
Go to: 5. Monitoring