# Using R to Select an Optimum Stock Portfolio

Part II: Selecting the Stocks

### Introduction

This is the second part of a three-part series on using R to select an optimum stock portfolio. This page outlines the part of the R script that selects the portfolio. Before you can run the script, you need to do the following:

- Download daily prices for all stocks in the S&P 500 index
- Download daily S&P 500 index values
- Calculate the risk-free rate

You can read how to do this here: Using R to Select an Optimum Stock Portfolio: Part I–Preliminaries

### The R Script for Portfolio Selection

The R script itself creates a list with the individual stock data as the elements: this is a powerful technique that draws on R's origins in Scheme (itself a dialect of LISP). The script is fairly long, so I’ve divided it into several parts to make it easier to understand. Each section is marked by a header, and you can click on the following list of headers to go directly to that section.

- Preliminaries
- Read Stock Data and Create Master Stock List
- Calculate Daily Stock Returns
- Identify and Trim Extreme Values
- Separate S&P 500 Index and Check Returns for Normality
- Estimate Correlation Between Each Stock and the Overall Market

Here is a detailed description of how the script works; as a reminder, you can download the entire script (and related files) here.

#### Preliminaries

The script begins by including a few required libraries, clearing all objects, and setting a working directory: note that you’ll want to change this to the directory where you’ve stored the stock data. It also sets the risk-free rate.

```r
# Preliminaries
library(stats)
library(xts)      # Extended time series objects
library(nortest)  # For ad.test
library(lmtest)   # For Durbin-Watson autocorrelation test

rm(list=ls())
options(digits=5)
setwd("/Users/davidschwab/StockDownload/Data")

# Risk-free rate and avg market return
rf.rate <- .018255405675  # 10 year treasury (geometric mean, YTD)
```

#### Read Stock Data and Create Master Stock List

After the preliminaries, the script reads the stock price data from each file into a new environment named *stocks*. It then creates a master list from the stock data: each element in the list is the data for one stock.

```r
# Get list of files with stock returns and read each file as a data frame in the
# environment "stocks"
files <- list.files()
files.count <- length(files)  # Store as variable so we don't recalculate it each loop iteration

stocks <- new.env()
for (i in 1:files.count) {
  assign(files[i], read.csv(files[i]), pos = stocks)
}

# Make a list of all the data frames and add the stock names
stock.list <- lapply(ls(stocks), function(x) {get(x, envir=stocks)})
names(stock.list) <- ls(stocks)
```
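As a minimal, self-contained sketch of this environment-to-list technique (with two made-up data frames standing in for the downloaded CSV files):

```r
# Two tiny data frames stand in for the downloaded stock files
demo <- new.env()
assign("AAA", data.frame(Date = "2016-01-04", Adj.Close = 10.0), pos = demo)
assign("BBB", data.frame(Date = "2016-01-04", Adj.Close = 20.0), pos = demo)

# ls() returns the object names in sorted order, so the same call
# can supply both the list elements and their names
demo.list <- lapply(ls(demo), function(x) {get(x, envir = demo)})
names(demo.list) <- ls(demo)

demo.list[["BBB"]]$Adj.Close  # 20
```

Reading the files into a separate environment keeps the global workspace clean; once the data frames are collected into a named list, the rest of the script can operate on every stock with a single *lapply()* call.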

#### Calculate Daily Stock Returns

Next, the script subsets only the stock prices from 2016 to date and creates an extended time series (xts) object for each stock; this will make computing the daily returns easier. The script then calculates the daily returns for each stock and removes the NA values that result from using the *lag()* function.

```r
# Subset YTD returns
stock.list.subset <- lapply(stock.list, function (x) {
  subset(x, as.Date(Date) >= '2016-01-01', select = c("Date", "Adj.Close"))
})

# Transform each data frame into an xts object
for(i in 1:length(stock.list.subset)) {
  temp.date <- as.Date(as.character(stock.list.subset[[i]]$Date), "%Y-%m-%d")
  stock.list.subset[[i]] <- xts(stock.list.subset[[i]]$Adj.Close, temp.date)
  colnames(stock.list.subset[[i]]) <- c("Adj.Close")
}

# Calculate the daily log return for each stock
for(i in 1:length(stock.list.subset)) {
  stock.list.subset[[i]]$Return <- log(stock.list.subset[[i]]$Adj.Close /
                                         lag(stock.list.subset[[i]]$Adj.Close))
}

# Get rid of NAs (usually first observation, since no lagged value)
for(i in 1:length(stock.list.subset)) {
  stock.list.subset[[i]] <- stock.list.subset[[i]][!is.na(stock.list.subset[[i]]$Return)]
}
```
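For reference, the return computed in the loop above is the daily log return, \(r_t = \ln(P_t / P_{t-1})\), where \(P_t\) is the adjusted close on day \(t\). Log returns are convenient here because they are additive across time: the return over a holding period is simply the sum of the daily \(r_t\) values.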

#### Identify and Trim Extreme Values

The script then identifies extreme values within each set of stock returns. This is important because we are only going to include stock returns that are normally distributed in the portfolio calculation, and trimming extreme values makes the normality test we use later more reliable. Outliers are defined as values more than three times the interquartile range (IQR) below the first quartile or above the third quartile; the function *outliers.iqr* does the calculation.

```r
# Get outliers for each stock, then trim
outliers.iqr <- function(x, multiplier = 3) {
  # Returns the lower and upper bounds for outliers as determined by a multiplier * IQR
  # Assumes no particular distribution
  # Takes a vector x and an IQR multiplier
  # Returns a two-element list with lbound, ubound as members
  temp <- quantile(x, probs=c(.25,.75), names=FALSE)
  iqr <- temp[2] - temp[1]
  lbound <- temp[1] - multiplier * iqr
  ubound <- temp[2] + multiplier * iqr
  return(list(lbound=lbound, ubound=ubound))
}

stock.list.outliers <- lapply(stock.list.subset, function (x) {outliers.iqr(x$Return, 3)})

stock.list.trimmed <- lapply(stock.list.subset, function (x) {
  temp <- outliers.iqr(x$Return, 3)
  subset(x, Return > temp$lbound & Return < temp$ubound, select = "Return")
})
```
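As a quick sanity check of *outliers.iqr* on a made-up vector (the function definition is repeated so the snippet runs on its own):

```r
outliers.iqr <- function(x, multiplier = 3) {
  temp <- quantile(x, probs = c(.25, .75), names = FALSE)
  iqr <- temp[2] - temp[1]
  lbound <- temp[1] - multiplier * iqr
  ubound <- temp[2] + multiplier * iqr
  return(list(lbound = lbound, ubound = ubound))
}

x <- c(1:10, 100)  # 100 is an obvious outlier
bounds <- outliers.iqr(x, 3)
bounds$lbound      # -11.5 (Q1 = 3.5, IQR = 5, so 3.5 - 3*5)
bounds$ubound      #  23.5 (Q3 = 8.5, so 8.5 + 3*5)

x[x > bounds$lbound & x < bounds$ubound]  # drops the 100
```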

#### Separate S&P 500 Index and Check Returns for Normality

Next, the script strips the S&P 500 index returns out of the list and checks the remaining stock returns for normality; note that we use the Hochberg correction for the p-values, since we are conducting several hundred normality tests. Since the second *lapply()* just returns TRUE or FALSE (depending on whether the adjusted p-value is > 0.05), we can use it to filter the list of stock data after transforming it with *unlist()*.

```r
# Strip SPX out of the list of stocks to analyze, and make it a separate object
spx.stock.return <- subset(stock.list.trimmed[["SPX"]], select = "Return")
stock.list.return <- stock.list.trimmed[names(stock.list.trimmed) != "SPX"]

# Test each stock return for normality, then adjust for multiple comparisons
# Returns TRUE or FALSE for each stock, depending on whether test is passed
normal.test <- lapply(stock.list.return, function (x) {ad.test(as.numeric(x))})
normal.test.adj <- lapply(normal.test, function (x) {
  temp <- p.adjust(x$p.value, method="hochberg", n=length(normal.test))
  return (temp > 0.05)
})

# Subset the stocks that are approximately normal using normal.test.adj as a filter
stock.list.data <- stock.list.return[unlist(normal.test.adj)]
```
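The filtering step relies on the fact that a named list can be subset with a logical vector; a toy illustration, with made-up names and flags:

```r
returns <- list(AAA = 1, BBB = 2, CCC = 3)
passed  <- list(AAA = TRUE, BBB = FALSE, CCC = TRUE)

# unlist() turns the list of logicals into a logical vector,
# which keeps only the elements whose flag is TRUE
kept <- returns[unlist(passed)]
names(kept)  # "AAA" "CCC"
```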

#### Estimate Correlation Between Each Stock and the Overall Market

The script then estimates the correlation between each stock and the S&P 500 index using linear regression. Here, the regression coefficient \(\beta_{market}\) is an estimate of how the stock’s returns are affected by changes in the market as a whole (represented by the S&P 500 index): for example, a beta of 1.10 means that when the market return moves by 1%, the stock’s return moves by 1.10% on average. All else equal, higher betas represent riskier stocks.

Note the use of the *merge()* function here to make sure the regression includes only dates where both the stock and the S&P 500 index contain values; without this, the *lm()* function would fail because there would be a different number of cases for each variable.

```r
# Estimate Betas for each stock
Beta.model <- lapply(stock.list.data, function (x) {
  model.data <- merge(spx.stock.return, x)
  colnames(model.data) <- c("market.return", "stock.return")
  lm(stock.return ~ market.return, data = model.data)
})
```
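The script stops with a list of fitted models; here is a sketch of how you might pull the estimated betas out of such a list. Since the real stock data isn’t reproduced in this post, the snippet builds two fake stocks with known betas from simulated market returns:

```r
set.seed(42)
market.return <- rnorm(250, mean = 0, sd = 0.01)  # simulated daily market returns

# Two fake stocks with known betas of 0.8 and 1.2
fake.models <- lapply(c(LOW = 0.8, HIGH = 1.2), function(b) {
  stock.return <- b * market.return + rnorm(250, mean = 0, sd = 0.005)
  lm(stock.return ~ market.return)
})

# coef() returns (intercept, slope); the slope is the beta estimate
betas <- sapply(fake.models, function(m) coef(m)[["market.return"]])
betas  # close to c(LOW = 0.8, HIGH = 1.2)
```

The same *sapply()* call works unchanged on *Beta.model*, giving a named vector of betas ready for ranking the stocks by market risk.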

### Next Steps

At this point, we have the raw materials we need to construct an optimum portfolio. Next, we’ll check that the assumptions of linear regression are met, then finish up by selecting the portfolio itself. You can read about how to do this HERE. If something isn’t working as it should, please fill out the comment form below, and I’ll do my best to help.