6. Data Access
In this section, you will learn:
- How to enable content to connect to data sources
- Different data access strategies
- How to create service accounts
- Methods for securing credentials
- Troubleshooting data source connection errors
Data access from Connect
Content on Connect usually relies on one or more data sources. That data can exist inside or outside a content bundle.
Data inside the content bundle
When deploying content, developers can include data inside a content bundle. The main limitation here is that your developers must deploy everything (data and code) at the same time for it to be updated. If they need the data to be refreshed more often than the code is deployed, then a different data access method should be explored. Bundling data also increases the amount of server disk space the bundle files take up.
Data outside the content bundle
Data outside the content bundle can be stored in a database, a flat file on the Connect server filesystem, a flat file on a mounted file share, a pin on the server with the pins
package, blog storage and other types of interfaces.
Any system requirements for accessing that data will need to be satisfied. For example, database drivers would need to be installed and configured and any file shares would need to be mounted.
Databases
Databases may use many different methodologies for data storage, such as:
- File-based (e.g., SQLite, DuckDB)
- Relational Database Management Systems (e.g., PostgresSQL, MariaDB, MicrosoftSQL)
- NoSQL and Document Stores (e.g., MongoDB, Cassadra)
These databases use different wire protocols for connections, however, standardization across database vendors permits a smoother experience for administrators and users. Key standards to be aware of include:
- ODBC, a cross-language standard protocol defining how external tools and systems can submit queries to databases. ODBC database drivers use the ODBC protocol for database calls. The drivers translate the ODBC interface to the database-specific functions.
- Similar to the ODBC protocol, JDBC is a standard protocol for database queries, however, this is built around the Java ecosystem. We do not recommend JDBC database connections due to challenges in managing Java.
DBI
, an R package and front-end for making database connections from the R language. This provides a consistent interface independent of the database.DBI
is coupled with a back-end DBI-compliant R package such asodbc
(note: lower-case. This is an R package) to make a database connection from R. Some database types have specific DBI-compliant packages available for making this front-end connection, such asRPostgres
,RMariaDB
,bigrquery
, orRSQLite
.PEP-249
is a proposed standard in Python for making database connectors. This standard is used by packages such aspyodbc
for making ODBC connections to databases.
As an administrator, you will need to install (preferably) ODBC drivers for relevant databases on Connect. Users can then use the appropriate R or Python packages to interact with these drivers to make their database connections.
Posit provides professionally-supported ODBC drivers for several databases for your use with Posit professional tooling. Installation is documented here. To configure the Posit Professional Drivers, or a driver supplied from another source, specify the driver location and attributes in /etc/odbcinst.ini
.
/etc/odbcinst.ini
file
[PostgreSQL]
Driver=/opt/rstudio-drivers/postgresql/bin/lib/libpostgresqlodbc_sb64.so
RStudioVersion=2021.10.0
Version=1.4.41.1001
Installer=RStudio Pro Drivers
UsageCount=1
Persistent Storage
One potential difficulty in using data from a mounted file share is that the path could change between the content running on Posit Workbench and Posit Connect. If the absolute path and files located there are identical on the Workbench and Connect servers, then the code will work on both servers (for example a /data
directory). If there are differences in path, or if it is common for developers to be using a local IDE rather than Workbench, using a tool such as the config package can help with path management.
Data access strategies
Content on Connect runs differently than how scripts are executed on Workbench. By default, content is run by a service account that has minimal permissions and access. What this means is that access to data will need to be set up intentionally by the developer.
External data connections are handled by the R or Python code, never by Workbench or Connect itself. Developers will be able to connect to any external data source from Connect which they can connect to from R or Python (running on Linux).
There are different strategies for managing the authentication portions of the data connections, whether at the developer or server level, as well as for managing credentials:
- Developer handled
- Data Source Name (DSN)
- Environment variables server-wide
- Service accounts
- Current user execution
An important note on security is that as Connect, by default, shares connections across an R or Python process data can be exposed to multiple connections. Often this is intended behavior for improving both app and server performance. However, in cases where there are security concerns around data access then forcing each connection to a single process by changing the runtime setting Max connections per process to 1 should be considered.
Developer handled
In this scenario, the developer is completely responsible for managing connections to the relevant data sources. Any server dependencies will need to be installed and configured on their behalf. For example database drivers, mounted file shares, and permissions for any designated directories inside Connect itself.
Consider recommending leveraging environment variables for storing any username and password information to avoid exposing secrets in the code. These variables are encrypted on disk for security.
Data Source Name (DSN)
DSNs provide a symbolic name that combines all the parameters required to make a database connection. This makes connections simpler for users to manage. Database credentials (typically a service account) can be stored in the DSN definition. If the database credentials are omitted here, users will need to include this in their database connection script. DSNs are configured in /etc/odbc.ini
or ~/.odbc.ini
.
For more details refer to the Installing Posit Professional Drivers for Posit Connect section.
odbc.ini
file
[MyPostgresDB]
Driver = "PostgreSQL"
Server = "[your server's path]"
Database = "[your database's name]"
UID = "[user]"
PWD = "[password]"
Port = 5432
Environment variables server-wide
Environment variables can be added server-wide for all pieces of content using a supervisor script, as detailed in the Process Management Section of the Connect Admin Guide. These environment variables can be connection strings, secrets, a DSN (see above), etc.
Service accounts
By default content on Connect runs as the default service account user rstudio-connect
. Other service account users can be set to run pieces of content. A Linux user must be created and be a member of the primary group of the default run-as user (by default rstudio-connect
and rstudio-connect
).
After a service account user is created and configured, then credentials can be attached. One recommended way to do this is to create a user-level DSN with the database connections and service account credentials that are read-only for the service account to access. As the service account users are provisioned with home directories the DSN can be stored there.
For more information refer to the User Account for Processes section of the Admin Guide.
Current user execution
Content can be set to run as the currently logged-in user accessing an application, instead of the default rstudio-connect
or other service accounts by running content using local Unix accounts for each user. This works for Shiny apps, Shiny documents, Dash, Streamlit and Bokeh applications and requires using PAM user authentication.
For more details on the configuration options available refer to the Current user execution section of the Admin Guide.
Credential management
As with every programming language, it is important to avoid publishing code with credentials in plain text. There are several options to protect credentials from being exposed:
- DSNs defined in
odbc.ini
- keyring package
- Content environment variables
- config package
- Prompt for credentials
For more information, refer to the Solutions article on managing credentials.
Connect API
The Connect API can do several content-level environment variable tasks including listing, setting and updating variables. Refer to the Content sections of the API documentation for more details on the endpoints available.
Troubleshooting
When in doubt, troubleshoot data access first from R or Python directly on the Connect Server. Instructions on how to utilize ODBC drivers in R and Python can be found in the Posit Pro Drivers Usage section of the documentation. This can help to distinguish between issues (driver, application code, and authentication issues). For troubleshooting, databases refer to the troubleshooting databases article.
The installation directories for R and Python can be searched to find all installed versions. Following Posit’s instructions for installing R and installing Python, each version will be listed in /opt/R
and /opt/python
. If R or Python has been installed in a different location then adjust the path accordingly for the examples that follow.
Executing R code
Check available language versions by running:
$ ls -1d /opt/R/*
sudo /opt/R/4.2.0
sudo /opt/R/4.1.3
Using the path found above, an interactive R session can be started with:
sudo /opt/R/4.2.0/bin/R
Commands can also be run from a single line. In this example the png package is being installed:
sudo /opt/R/4.2.0/bin/Rscript -e 'install.packages("png", repos = "http://cran.us.r-project.org")'
Executing Python code
Check available language versions by running:
$ ls -1d /opt/python/*
/opt/python/3.9.14
/opt/python/3.10.8
Using the path found above, an interactive Python session can be started with:
sudo /opt/python/3.10.8/bin/python3
Commands can also be run from a single line. In this example the pandas package is being installed:
sudo /opt/python/3.10.8/bin/python -m pip install pandas
Exercise
🚀 Launch the exercise environment!
In the exercise environment you will get experience:
- Installing Posit Pro Drivers
- Configuring DSNs and testing connections
- Testing database connections using R and Python
- Setting environment variables for content
- Creating and running content from a service account.
Go to: 7. Managing Resources