Setting up Connections Between R and Databases

hacking skills
Author

zenggyu

Published

2018-11-18

Abstract
Guidance on how connecting R to databases.

This post provides instructions on how to set up connections between R and various databases (e.g., PostgreSQL, Oracle) on Ubuntu 18.04.

Installing ODBC library and drivers

Run the following command to install the unixODBC library:

apt-get install unixodbc unixodbc-dev --install-suggests

With the unixODBC library installed, you then install the ODBC driver for the target database. For example, the following command installs the ODBC driver for PostgreSQL (see the appendix for instructions on how to install the driver for Oracle):

apt-get install odbc-postgresql

After installation, check that the driver (/usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so) exists.

Configuring ini files

There are two ini files to configure before you can access a database using ODBC:

  • ~/.odbcinst.ini, which contains driver information, particularly the name of the driver library. Multiple drivers can be specified in the same file.
  • ~/.odbc.ini, which contains connection (data source) information, particularly the username, password, database and host information. Multiple data sources can be specified in the same file.

Here is a template for the ~/.odbcinst.ini file:

[PostgreSQL Driver]
Driver = /usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so

Here is a template for the ~/.odbc.ini file (note that the Driver line corresponds to the driver defined in odbcinst.ini):

[data_source_name]
Driver              = PostgreSQL Driver
Database            = database_name
Servername          = localhost
UserName            = user_name
Password            = password
Port                = 5432

Making connections and queries

With the above configurations, you can now connect to a database through the DBI and odbc packages in R (note that data_source_name (DNS) corresponds to the name defined in ~/.odbc.ini):

con <- DBI::dbConnect(odbc::odbc(), "data_source_name")

There are three ways to query data:

  • You can query your data with DBI by using the DBI::dbGetQuery() function. Simply paste your SQL code into the R function as a quoted string. Care should be used to escape your quotes as needed.
  • You can write your code in dplyr syntax, and dplyr will translate your code into SQL. This can be done by passing the target connection and table to the dplyr::tbl() function.
  • You can also make SQL queries in rmarkdown with a {sql} code chunk, and specify your connection with the connection = con code chunk option. If you want to send the query output to an R dataframe, use output.var = "mydataframe" in the code chunk options.

The followings are the examples to demonstrate the above descriptions, and they all return the number of rows in the table (i.e., relation) tb from connection con:

# The DBI way
DBI::dbGetQuery(con, "select count(1) from tb;")
# The dplyr way
tbl(con, "tb") %>%
  count()
# The rmarkdown way
\```{sql, connection = con, output.var = "mydataframe"} 
select count(1) from tb;
\```
# You can then access the variable `mydataframe` in following chunks.

Appendix: connecting R to Oracle

The setup for Oracle databases is more complicated1, and here is how.

1 At the time of writing, there seems to be some issues that prevents the use of Oracle data source names defined in the ~/.odbcinst.ini file. When doing this, the R session would just abort.

After the installation of the unixODBC library, go to https://www.oracle.com/technetwork/database/database-technologies/instant-client/downloads/index.html and download the following Oracle Instant Client zip packages: Basic, SDK, ODBC. Extract the contents in the zip files to the directory /opt/oracle/instantclient_<version>.

Change the working directory to /opt/oracle/instantclient_<version> and execute the following commands:

ln -s libclntsh.so.<version> libclntsh.so # This step is necessary for Oracle databases prior to version 18.3
ln -s libocci.so.<version> libocci.so # This step is necessary for Oracle databases prior to version 18.3
sudo apt install libaio1
./odbc_update_ini.sh /

To use the local naming method, put the tnsnames.ora file under /opt/oracle/instantclient_<version>/network/admin/. Note that the linebreak character of the file should be LF instead of CRLF, otherwise there may be an error when trying to create the connection.

Edit ~/.profile and add the following lines:

export TNS_ADMIN=/opt/oracle/instantclient_<version>/network/admin/
export OCI_LIB=/scratch/instantclient_<version>/
export LD_LIBRARY_PATH=/scratch/instantclient_<version>:$LD_LIBRARY_PATH

After making sure the above environment variables are set, install the ROracle package in R:

install.packages("ROracle")

To create a connection to an Oracle database, use the following command (note that dbname can be a connect descriptor defined in the tnsnames.ora file or a connect string):

library(ROracle)
con <- DBI::dbConnect(DBI::dbDriver("Oracle"), username = "", password = "", dbname="")
Sys.setenv(NLS_LANG = "SIMPLIFIED CHINESE_CHINA.AL32UTF8", TZ = "UTC", ORA_SDTZ = "UTC")

AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate characters encoded using UTF-8 (or six bytes per character)

Appendix: Learning more

See: https://db.rstudio.com/