Using Miniconda to Set up an Environment for Data Science
Introduction
This is one of a series of posts where I document software configurations for personal reference. This post documents how to set up a data science environment with Miniconda. The instructions below are based on Ubuntu 18.04.
Download and install Miniconda
The installer can be downloaded from the official site: https://conda.io/miniconda.html. Enter ‘yes’ when the installer asks if you want to prepend the miniconda3 install location to PATH in ~/.bashrc.
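After opening a new shell (so that the updated ~/.bashrc takes effect), one way to confirm that the install worked is to check that conda can be found on PATH. The following is a minimal Python sketch of that check; the helper name conda_version is made up for illustration.

```python
import shutil
import subprocess


def conda_version():
    """Return the output of `conda --version`, or None if conda is not on PATH."""
    conda = shutil.which("conda")
    if conda is None:
        return None
    result = subprocess.run(
        [conda, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it if stdout is empty.
    return (result.stdout or result.stderr).strip() or None


print(conda_version() or "conda not found; check the PATH entry in ~/.bashrc")
```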
Configure Miniconda
Run the following commands to add some channels provided by TUNA to speed up package downloads. Note the difference between the --prepend option and the --append option: the former adds a channel to the top of the list (with higher priority) while the latter adds a channel to the bottom of the list (with lower priority).
conda config --set show_channel_urls yes
conda config --prepend channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --prepend channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ # high priority
conda config --append channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/ # low priority
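To make the effect of --prepend and --append concrete, the sketch below models the channel list as a plain Python list and replays the three channel commands above. It is only an illustration of the ordering, not conda's own code; the prepend and append helpers are hypothetical, and it assumes no channels were configured beforehand, so the list starts from "defaults".

```python
def prepend(channels, url):
    """Mimic `conda config --prepend channels URL`: URL moves to the top."""
    return [url] + [c for c in channels if c != url]


def append(channels, url):
    """Mimic `conda config --append channels URL`: URL moves to the bottom."""
    return [c for c in channels if c != url] + [url]


TUNA = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"

# Assumption: nothing was configured before, so the list starts as ["defaults"].
channels = ["defaults"]
channels = prepend(channels, f"{TUNA}/pkgs/free/")
channels = prepend(channels, f"{TUNA}/pkgs/main/")
channels = append(channels, f"{TUNA}/cloud/pytorch/")

# Print the channels from highest to lowest priority.
for rank, channel in enumerate(channels, start=1):
    print(rank, channel)
```

Under that assumption the order is pkgs/main, pkgs/free, defaults, cloud/pytorch. The actual list on a given machine can be inspected with conda config --show channels.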
Install Nvidia graphics card driver
The proprietary Nvidia graphics card driver is required by the GPU version of some packages (e.g., tensorflow), and it can be installed from Ubuntu’s Software & Updates app.
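Once the driver is installed (a reboot may be needed), the nvidia-smi tool that ships with it can report the driver version. The sketch below wraps that call in Python; the function name gpu_drivers is hypothetical, and it simply returns an empty list when nvidia-smi is missing or fails.

```python
import shutil
import subprocess


def gpu_drivers():
    """Return a list of (gpu_name, driver_version) pairs reported by nvidia-smi."""
    tool = shutil.which("nvidia-smi")
    if tool is None:
        return []  # driver (or at least its command-line tool) is not installed
    result = subprocess.run(
        [tool, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    rows = []
    for line in result.stdout.splitlines():
        if line.strip():
            # Split on the last comma so GPU names containing commas stay intact.
            name, _, version = line.rpartition(",")
            rows.append((name.strip(), version.strip()))
    return rows


print(gpu_drivers() or "no Nvidia driver detected")
```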
Install Python packages
The following commands install some frequently used packages.
conda install numpy scipy pandas matplotlib hdf5 pillow scikit-learn jupyterlab
conda install -c conda-forge feather-format
Additional packages (if not available from conda) can be installed with:
pip3 install dbt-core dbt-postgres dbt-mysql dbt-hive
pip3 install sqlfluff sqlfluff-templater-dbt
pip3 install sqlglot
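To confirm that the conda packages listed above ended up in the active environment, a quick import check can help. This is a minimal sketch: the missing_modules helper is hypothetical, and the list uses import names, which differ from the conda package names in two cases (pillow is imported as PIL, scikit-learn as sklearn). hdf5 is a C library without a Python module of its own, so it is left out.

```python
from importlib.util import find_spec

# Import names for the conda packages installed above (hdf5 has no Python module).
EXPECTED_MODULES = [
    "numpy",
    "scipy",
    "pandas",
    "matplotlib",
    "PIL",       # provided by the pillow package
    "sklearn",   # provided by the scikit-learn package
    "jupyterlab",
]


def missing_modules(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(EXPECTED_MODULES)
print("all packages found" if not missing else f"missing: {', '.join(missing)}")
```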
Install R packages
conda install -c r r-essentials
Note: The -c r option tells conda to look for packages in the R channel; the r-essentials package contains many frequently used packages including tidyverse and IRkernel. To make the R kernel visible to Jupyter (see below for instructions on how to configure Jupyter), run the following command in an R session:
IRkernel::installspec() # set `user = FALSE` to install the spec system-wide
See my other post for instructions on configuring R.
Additional packages can be installed from an R session with install.packages(c("reticulate", "feather")).
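To check that the R kernel was registered, the output of jupyter kernelspec list --json can be inspected. The sketch below only parses that JSON; the has_kernel helper is hypothetical, and it assumes IRkernel registered itself under the name "ir", which I believe is the default but may differ if a custom name was passed to installspec().

```python
import json


def has_kernel(kernelspec_json, name="ir"):
    """Check whether kernel `name` appears in `jupyter kernelspec list --json` output."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    return name in specs


# Hand-written sample in the shape of the command's output (paths are placeholders).
sample = """
{"kernelspecs": {
    "python3": {"resource_dir": "/path/to/kernels/python3"},
    "ir": {"resource_dir": "/path/to/kernels/ir"}
}}
"""
print(has_kernel(sample))
```

In practice the JSON text would come from running the command itself, for example by piping its output into a small script that calls has_kernel.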
Configure Jupyterlab on a server
If jupyter_notebook_config.py (usually located at ~/.jupyter/) does not exist, generate a new one with:
jupyter notebook --generate-config # optional
To enable HTTPS, generate a self-signed certificate and key with the following command; then optionally set a password:
openssl req -x509 -nodes -days 9999 -newkey rsa:2048 -keyout mykey.key -out mycert.pem # optional
jupyter notebook password # optional
Open jupyter_notebook_config.py and set the following parameters:
# Set options for certfile, keyfile, and ip, and toggle off browser auto-opening
c.NotebookApp.certfile = '/absolute/path/to/your/certificate/mycert.pem' # optional
c.NotebookApp.keyfile = '/absolute/path/to/your/certificate/mykey.key' # optional
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False # optional
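A wrong path or a mismatched certificate/key pair typically only shows up when the server starts. As a sanity check beforehand, the pair can be loaded with Python's standard ssl module. This is a minimal sketch; the function name cert_pair_is_usable is made up, and it treats relative paths as invalid because the settings above expect absolute paths.

```python
import ssl
from pathlib import Path


def cert_pair_is_usable(certfile, keyfile):
    """Return True if both files exist at absolute paths and load as a matching pair."""
    cert, key = Path(certfile), Path(keyfile)
    if not (cert.is_absolute() and key.is_absolute()):
        return False
    if not (cert.is_file() and key.is_file()):
        return False
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises ssl.SSLError if the key does not match the certificate.
        context.load_cert_chain(certfile=str(cert), keyfile=str(key))
    except (ssl.SSLError, OSError):
        return False
    return True


# Placeholder paths from the config above; replace with the real locations.
print(cert_pair_is_usable(
    "/absolute/path/to/your/certificate/mycert.pem",
    "/absolute/path/to/your/certificate/mykey.key",
))
```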
After the above setup, start the server with:
jupyter lab # Specify the `--allow-root` option if the server needs to be run as root.
By default, the server listens on port 8888. If HTTPS is enabled, the address to access the server is https://<server_address>:8888; otherwise it is http://<server_address>:8888.
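If the page does not load, it can help to separate "wrong address" from "port not reachable" (for example, because of a firewall). The sketch below builds the address as described above and tries a plain TCP connection to the port; both helper names are hypothetical, and "localhost" stands in for the real server address.

```python
import socket


def jupyter_url(server_address, port=8888, https=False):
    """Build the address of the Jupyter server from its host, port, and scheme."""
    scheme = "https" if https else "http"
    return f"{scheme}://{server_address}:{port}"


def is_listening(server_address, port=8888, timeout=2.0):
    """Return True if a TCP connection to the given host and port succeeds."""
    try:
        with socket.create_connection((server_address, port), timeout=timeout):
            return True
    except OSError:
        return False


print(jupyter_url("localhost", https=True))
print("reachable" if is_listening("localhost") else "not reachable")
```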