Up and running with Python in RStudio

Intro

I enjoy using R for data analysis but recently have been learning Python. My intention is not to move from R to Python, but to be able to work with both R and Python, treating them as complementary tools. This is facilitated by the fact that RStudio has made it easier to work with both Python and R in a single project, using RStudio as a single interface. It’s what they refer to as R & Python: A Data Science Love Story.

In an upcoming post, I will be looking at learning Python from the perspective of an R user. In this post, I describe the nuts and bolts of getting Python up and running in RStudio with the reticulate package. Given that it took me a day or two to set up, I thought this warranted its own post and may be of use to others following the same path.

Getting set up

The objective here is to use RStudio as a Python IDE. We want to set up RStudio so that we can write Python code, add Python libraries, and transition between R and Python when working on a single project.

There appear to be a few ways to do this, and to be honest it took me a while to get it working. But I got there with the approach outlined below.

This approach uses Anaconda. More precisely, it uses a minimal version called Miniconda. The Miniconda installer will give us a version of Python, a package manager called conda, and some other useful packages.

The steps outlined below were partly based on notes provided by Dr Tiffany Timbers as part of an Intro to Reticulate workshop (https://github.com/ttimbers/intro-to-reticulate). The notes below are for Windows and assume you already have RStudio up and running. If you have trouble on the steps below, or need more detail, you may want to check out the material provided by Dr Timbers above.

1 - Download Git Bash

We want to be able to interact with our computer via a command line interface. Rather than using the built-in Windows command line (CMD), we will use Git Bash. To do this we download and install the Git Bash program, which can be found here: https://git-scm.com/download/win

Once downloaded, we can run Git Bash to open the terminal. (To open Git Bash, click on the Start menu and start typing ‘Git’). This is referred to below as “the terminal”.

2 - Use Miniconda to install Python

We need to have a base version of Python installed on our machine. To get this we are going to use Miniconda, which was described above.

Download and run the Miniconda installer which can be found here. In my case I downloaded the Python 3.8 version, and installed it with the default settings. By default, this should install to a directory like C:/Users/[your name]/miniconda3 .

3 - Integrate Python with the Git Bash terminal

You should now have access to the Anaconda Prompt (to open it, click on the Start menu and start typing ‘Anaconda’).

Rather than controlling our Python installations using the Anaconda Prompt (e.g. when installing new Python packages), we want to be able to do this directly from the Git Bash terminal.

To be able to do this, we open the Anaconda Prompt and type conda init bash.

From now on we will be able to do everything from the Git Bash terminal.

Close the Anaconda Prompt window and any open terminal windows.

4 - Install Python packages

Next we want to install the Python packages that we will be using.

Recall that we will be using the package manager conda, which was downloaded using Miniconda. conda installs Python packages from different online repositories which are called “channels”. We want to add something called the conda-forge channel, which is a community-driven effort to provide the most up to date versions of Python packages.

To add this channel we open the Git Bash terminal and type the following:

conda config --add channels conda-forge

Whenever we want to install Python packages we use the terminal and type:

conda install [package names]

In this case, we want to install NumPy and Pandas, which are some the most widely used Python libraries. To do this we type the following:

conda install numpy=1.* pandas=1.*

You should then be asked if you want to proceed; enter y to do so.

5 - Using Python in RStudio

We are now ready to use Python in RStudio.

To use Python in RStudio we need to do two things:

  • install the reticulate package, which is what we use to translate between Python and R, and

  • point RStudio to the installation of Python that we are using, which in this case is Miniconda Python that we installed above.

The reticulate package can be installed the usual way in R, using

install.packages("reticulate")

The second thing we do is point RStudio to the installation of Python we are using (our Python environment). To do this we use the the function below, where the path is the directory where Miniconda was installed. If you don’t know the path, you can find it by typing which python in a terminal (Git Bash) outside of RStudio. Note the use of \\ instead of \ in the code below.

library(reticulate)
Sys.setenv(RETICULATE_PYTHON = "C:\\Users\\steve\\miniconda3")

By entering the two lines above into your R markdown file you should now be up and running.

Most instructions I have seen actually suggest adding the second line of code to an .Rprofile file in your project’s home directory. This means the environment will be specified whenever you open the project and will not need to be run in each file (e.g. R Markdown document) within the project.

If you do not already have a .Rprofile file, you can create one by opening a text editor, entering the second line of code above, i.e. Sys.setenv(RETICULATE_PYTHON = …), closing the text file, and then renaming the text file to “.Rprofile” (make sure you delete “.txt” from the end). This file should be saved in your project’s home directory. Restart RStudio for the changes to take effect.

6 - Check it’s working

If everything is working you should now be able to write Python code in RStudio. In this section, I am working with an R Markdown (.Rmd) file.

We can check that everything is configured properly by typing in py_config(). This will return a message telling us which Python environment is being used. We should see the words “Python version was forced by RETICULATE_PYTHON”, as shown below.

py_config()

A quick test run

Now let’s check that everything is working by using a simple example that requires the Pandas package.

We will create two data frames in R, one containing the names of three people and their city (data_city), and another containing the names of each city and the corresponding state (data_state):

# R code

data_city <- data.frame(
  name = c("Andy", "Beth", "Carl"),
  city = c("Dallas", "San Francisco", "New York")
)

data_city
##   name          city
## 1 Andy        Dallas
## 2 Beth San Francisco
## 3 Carl      New York
# R code
data_state <- data.frame(
  city = c("Dallas", "San Francisco", "New York"),
  state = c("Texas", "California", "New York")
)

data_state
##            city      state
## 1        Dallas      Texas
## 2 San Francisco California
## 3      New York   New York

Thanks to reticulate, we will now be able to work with these data frames these using Python code.

To do this, we use Python chunks in our R Markdown document, i.e. using the header {python} instead of {r}, as illustrated below:

To access the data frames using Python, we add r. before their names. For example, to access the data_city data frame we would use the following:

# Python code

print(r.data_city)
##    name           city
## 0  Andy         Dallas
## 1  Beth  San Francisco
## 2  Carl       New York

Now lets say we want to merge the two data frames above using the Pandas merge function, joining the two data sets based on the city name. The following Python chunk would be run, which creates a Python data frame called data_final:

# Python code

import pandas as pd

data_final = pd.merge(r.data_city, r.data_state, on='city')
print(data_final)
##    name           city       state
## 0  Andy         Dallas       Texas
## 1  Beth  San Francisco  California
## 2  Carl       New York    New York

And that’s it! We’ve successfully configured RStudio to be able to run Python code. We have seen how we can create data frames using R, then access and manipulate these data frames using Python, all in the same R Markdown file. The ability to work across two programming languages (all within RStudio) greatly increases the analytical tools at our disposal.