Using Miniconda to Set up an Environment for Data Science

hacking skills
Author

zenggyu

Published

2018-11-08

Abstract
Guidance on how to install and set up a miniconda environment, with a special focus on data science.

Introduction

This is one of a series of posts where I document software configurations for personal reference. This post documents how to set up a data science environment with Miniconda. The instructions below are based on Ubuntu 18.04.

Download and install Miniconda

The installer can be downloaded from the official site: https://conda.io/miniconda.html. Enter ‘yes’ when the installer asks if you want to prepend the miniconda3 install location to PATH in ~/.bashrc.

Configure Miniconda

Run the following commands to add some channels provided by TUNA (the Tsinghua University mirror site) to speed up package downloads. Note the difference between the --prepend option and the --append option: the former adds a channel to the top of the list (with higher priority) while the latter adds a channel to the bottom of the list (with lower priority).

conda config --set show_channel_urls yes
conda config --prepend channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --prepend channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ # high priority
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/ # low priority
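To make the resulting priority order concrete, here is a minimal Python sketch that models the channel list as a plain list and replays the three channel commands above. It does not call conda; the function name `apply_channel_commands` is hypothetical, and the starting list `["defaults"]` is an assumption about a fresh installation.

```python
# A toy model of how --prepend and --append order the channel list.
# It does not invoke conda; it only mirrors the ordering described above.

TUNA = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"

# The three channel commands from this post, in the order they are run.
COMMANDS = [
    ("prepend", TUNA + "/pkgs/free/"),
    ("prepend", TUNA + "/pkgs/main/"),
    ("append", TUNA + "/cloud/pytorch/"),
]


def apply_channel_commands(commands, channels=("defaults",)):
    """Return the channel list after applying (option, url) pairs in order.

    The default starting list ("defaults",) is an assumption about a
    fresh Miniconda installation.
    """
    result = list(channels)
    for option, url in commands:
        if option == "prepend":
            result.insert(0, url)  # top of the list: higher priority
        elif option == "append":
            result.append(url)  # bottom of the list: lower priority
        else:
            raise ValueError(f"unknown option: {option}")
    return result


if __name__ == "__main__":
    # Print channels from highest to lowest priority.
    for rank, channel in enumerate(apply_channel_commands(COMMANDS), start=1):
        print(rank, channel)
```

Under these assumptions, pkgs/main ends up first because it was prepended last, and the pytorch channel ends up last. The actual list can be checked with `conda config --show channels`.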

Install Nvidia graphics card driver

The proprietary Nvidia graphics card driver is required by the GPU version of some packages (e.g., tensorflow), and it can be installed from Ubuntu’s Software & Updates app.

Install Python packages

The following commands install some frequently used packages.

conda install numpy scipy pandas matplotlib hdf5 pillow scikit-learn jupyterlab
conda install -c conda-forge feather-format
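After installation, a quick way to confirm that the packages are visible to the environment's Python interpreter is to look up their import names. The sketch below is one way to do this, not part of the original setup; the helper name `find_missing` is hypothetical, and the mapping from package names to import names (e.g., pillow imports as `PIL`, scikit-learn as `sklearn`) is filled in by me. hdf5 is left out because it is a C library rather than a Python module.

```python
import importlib.util

# conda package name -> Python import name (several differ from the package name).
IMPORT_NAMES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
    "jupyterlab": "jupyterlab",
    "feather-format": "feather",
}


def find_missing(import_names):
    """Return the package names whose import name cannot be found."""
    missing = []
    for package, module in import_names.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing(IMPORT_NAMES)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All packages found.")
```

Run it with the Python interpreter of the conda environment; running it with the system interpreter would check the wrong installation.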

Additional packages (if not available from conda) can be installed with:

pip3 install dbt-core dbt-postgres dbt-mysql dbt-hive
pip3 install sqlfluff sqlfluff-templater-dbt
pip3 install sqlglot

Install R packages

conda install -c r r-essentials

Note: The -c r option tells conda to look for packages in the R channel; the r-essentials package contains many frequently used packages including tidyverse and IRkernel. To make the R kernel visible to Jupyter (see below for instructions on how to configure Jupyter), run the following command in an R session:

IRkernel::installspec() # set `user = FALSE` to install the spec system-wide

See my other post for instructions on configuring R.

Additional packages: install.packages(c("reticulate", "feather")).

Configure JupyterLab on a server

If jupyter_notebook_config.py (usually located at ~/.jupyter/) does not exist, generate a new one with:

jupyter notebook --generate-config # optional

To enable HTTPS, generate a self-signed certificate and its private key with the following command; then optionally set a password:

openssl req -x509 -nodes -days 9999 -newkey rsa:2048 -keyout mykey.key -out mycert.pem # optional
jupyter notebook password # optional
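Before pointing Jupyter at the generated files, it can help to check that the certificate and key actually load as a pair. The following is a minimal sketch using Python's standard `ssl` module; the function name `cert_pair_is_loadable` is hypothetical, and the file names are the ones produced by the openssl command above, assumed to be in the current directory.

```python
import ssl


def cert_pair_is_loadable(certfile, keyfile):
    """Return True if the certificate and private key load together.

    Loading fails if either file is missing, is not valid PEM,
    or if the key does not match the certificate.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    except OSError:
        # ssl.SSLError and FileNotFoundError are both subclasses of OSError.
        return False
    return True


if __name__ == "__main__":
    # File names assumed from the openssl command above.
    print(cert_pair_is_loadable("mycert.pem", "mykey.key"))
```

This only verifies that the files are readable and consistent with each other; a self-signed certificate will still trigger a warning in the browser.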

Open jupyter_notebook_config.py and set the following parameters:

# Set options for certfile, keyfile, and ip, and toggle off browser auto-opening
c.NotebookApp.certfile = '/absolute/path/to/your/certificate/mycert.pem' # optional
c.NotebookApp.keyfile = '/absolute/path/to/your/certificate/mykey.key' # optional
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False # optional

After the above setup, start the server with:

jupyter lab # Specify the `--allow-root` option if the server needs to be run as root.

By default, the server listens on port 8888. If HTTPS is enabled, the address to access the server is https://<server_address>:8888; otherwise it is http://<server_address>:8888.
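The address rule above can be written as a small helper, for example when generating links to several servers. This is an illustrative sketch; the function name `server_url` and the host name `example.org` are hypothetical.

```python
def server_url(server_address, port=8888, https=False):
    """Build the address used to reach the Jupyter server.

    The scheme depends on whether a certificate was configured;
    8888 is the default port mentioned above.
    """
    scheme = "https" if https else "http"
    return f"{scheme}://{server_address}:{port}"


if __name__ == "__main__":
    print(server_url("example.org", https=True))
    print(server_url("example.org"))
```

If the port is changed in jupyter_notebook_config.py, pass the new value as `port`.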