Using R to Select an Optimum Stock Portfolio

Part II: Selecting the Stocks


This is the second step in a three-part series on using R to select an optimum stock portfolio. This page outlines the part of the R script that selects the portfolio. Before you can run the script, you need to do the following:

  1. Download daily prices for all stocks in the S&P 500 index
  2. Download daily S&P 500 index values
  3. Calculate the risk-free rate

You can read how to do this here: Using R to Select an Optimum Stock Portolio: Part I–Preliminaries

The R Script for Portfolio Selection

The R script itself creates a list with the individual stock data as the elements: this is a powerful technique that draws on Rs origins in Scheme (itself a dialect of LISP). The script is fairly long, so I’ve divided it into several parts to make it easier to understand. Each section is marked by a header, and you can click on the following list of headers to go directly to that section.

Read Stock Data and Create Master Stock List
Calculate Daily Stock Returns
Identify and Trim Extreme Values
Separate S&P 500 Index and Check Returns for Normality
Estimate Correlation Between Each Stock and the Overall Market

Here is a detailed description of how the script works; as a reminder, you can download the entire script (and related files) here.


The script begins by including a few required libraries, clearing all objects, and setting a working directory: note that you’ll want to change this to the directory where you’ve stored the stock data. It also sets the risk-free rate.

# Preliminaries

library(xts)			# Extended time series objects
library(nortest)		# For ad.test
library(lmtest)			# For Durbin-Watson autocorrelation test


# Risk-free rate and avg market return

rf.rate <- .018255405675  			# 10 year treasury (geometric mean, YTD)

Read Stock Data and Create Master Stock List

After the preliminaries, the script reads the stock price data from each file into a new environment named stocks. It then creates a master list from the stock data: each element in the list is the data for one stock.

# Get list of files with stock returns and read each file as a data frame in the
# environment "stocks"

files <- list.files()
files.count <- length(files)	# Store as variable so we don't recalculate it each loop iteration

stocks <- new.env()
for (i in 1:files.count)
	assign(files[i], read.csv(files[i]), pos = stocks)

# Make a list of all the data frames and add the stock names

stock.list <- lapply(ls(stocks), function(x) {get(x, envir=stocks)})
names(stock.list) <- ls(stocks)

Calculate Daily Stock Returns

Next, the script subsets only the stock prices from 2016 to date and creates an extended time series (xts) object for each stock; this will make computing the daily returns easier. The script then calculates the daily returns for each stock and removes the NA values that result from using the lag() function.

# Subset YTD returns

stock.list.subset <- lapply(stock.list, function (x) { subset(x, as.Date(Date) >= '2016-01-01', select = c("Date", "Adj.Close"))

# Transform each data frame into an xts object

for(i in 1:length(stock.list.subset))
{ <- as.Date(as.character(stock.list.subset[[i]]$Date), "%Y-%m-%d")
	stock.list.subset[[i]] <- xts(stock.list.subset[[i]]$Adj.Close,
	colnames(stock.list.subset[[i]]) <- c("Adj.Close")

# Calculate market return on all stocks

for(i in 1:length(stock.list.subset))
	stock.list.subset[[i]]$Return <- log(stock.list.subset[[i]]$Adj.Close / lag(stock.list.subset[[i]]$Adj.Close))

# Get rid of NAs (usually first observation, since no lagged value)

for(i in 1:length(stock.list.subset))
	stock.list.subset[[i]] <- stock.list.subset[[i]][![[i]]$Return)]

Identify and Trim Extreme Values

The script then identifies extreme values within each set of stock returns. This is important because we are only going to include stock returns that are normally distributed in the portfolio calculation: trimming extreme values makes the normality test we use later more reliable. Outliers are defined as values that are three times the interquartile range (IQR) from the first and third quartiles; the function outliers.iqr does the calculation.

# Get outliers for each stock, then trim

outliers.iqr <- function(x, multiplier = 3)
	# Returns the lower and upper bounds for outliers as determined by a multiplier * IQR
	# Assumes no particular distribution
	# Takes a vector x and an IQR multiplier
	# Returns a two-element list with lbound,ubound as members
	temp <- quantile(x, probs=c(.25,.75), names=FALSE) 
	iqr <- temp[2] - temp[1]
	lbound <- temp[1] - multiplier * iqr
	ubound <- temp[2] + multiplier * iqr

stock.list.outliers <- lapply(stock.list.subset, function (x) {outliers.iqr(x$Return,3)})

stock.list.trimmed <- lapply(stock.list.subset, function (x) {
	temp <- outliers.iqr(x$Return,3) subset(x, Return > temp$lbound & Return < temp$ubound, select = "Return")

Separate S&P 500 Index and Check Returns for Normality

Next, the script strips out the S&P 500 index and returns and checks the remaining stock data for normality; note that we use the Hochberg correction for the p-values, since we are conducting several hundred normality tests. Since the second lapply() just returns TRUE or FALSE (depending on whether the adjusted p-value is > 0.05), we can use it to filter the list of stock data after transforming it with unlist().

# Strip SPX out of the list of stocks to analyze, and make it a separate object

spx.stock.return <- subset(stock.list.trimmed[["SPX"]], select = "Return")
stock.list.return <- stock.list.trimmed[names(stock.list.trimmed) != "SPX"]

# Test each stock return for normality, then adjust for multiple comparisons
# Returns TRUE or FALSE for each stock, depending on whether test is passed

normal.test <- lapply(stock.list.return, function (x) {ad.test(as.numeric(x))})
normal.test.adj <- lapply(normal.test, function (x) {
	temp <- p.adjust(x$p.value, method="hochberg", n=length(normal.test)) return (temp > 0.05)

# Subset the stocks that are approximately normal using normal.test.adj as a filter <- stock.list.return[unlist(normal.test.adj)]

Estimate Correlation Between Each Stock and the Overall Market

The script then estimates the correlation between each stock and the S&P 500 index using linear regression. Here, the regression coefficient \(\beta_{market}\) is an estimate of how the stock’s returns are affected by changes in the market as a whole (represented by the S&P 500 index): for example, a beta of 1.10 means that when the market changes by 1, the stock will change by 1.10. All else equal, higher betas represent riskier stocks.

(note the use of the merge() function here to make sure the regression includes only dates where both the stock and the S&P 500 index contain values; without this, the lm() function will fail because there will be a different number of cases for each variable).

# Estimate Betas for each stock

Beta.model <- lapply(, function (x){ <- merge(spx.stock.return, x)
colnames( <- c("market.return", "stock.return")
lm(stock.return ~ market.return, data =

Next Steps

At this point, we have the raw materials we need to construct an optimum portfolio. Next, we’ll check that the assumptions of linear regression are met, then finish up by selecting the portfolio itself. You can read about how to do this HERE. If something isn’t working like it should, please fill out the comment form below, and I’ll do my best to help.

Troubleshooting at a Standstill? Feel Free to Reach Out: