Introduction

This document is part of the course project for the Coursera Practical Machine Learning (PML) class. The goal is to use machine learning methods to perform human activity recognition (HAR) using data from hardware devices such as accelerometers on the belt, forearm, arm, and dumbbell. For more information, please see the source: http://groupware.les.inf.puc-rio.br/har.

The Approach: Machine Learning with a Boosting Model

The following steps show how we achieve HAR using machine learning with a boosting model. First, we clean up the data. Then we slice the given training set into training and testing subsets for cross-validation. Finally, we train our candidate models and pick the one that produces the better result.

Step 1: Loading the original data sets

#load necessary libraries
rm(list=ls())
library(ggplot2); library(caret);
#load given training and testing sets
training_original <- read.csv("pml-training.csv")
testing_original <- read.csv("pml-testing.csv")

A peek into the given data set reveals a large amount of information that is irrelevant to our prediction task. In the machine learning process below, we will first clean up the data and then apply the boosting method.
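For reference, such a peek might be taken as follows (a sketch; the dimensions shown are those of the standard pml-training.csv file):

#sketch: a quick peek at the raw training data before cleaning
dim(training_original)          # 19622 observations of 160 variables
str(training_original[, 1:10])  # metadata columns followed by the first sensor readings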

Step 2.1: Data Cleaning - Remove columns with blank or NA values

We first replace blank cells with NA, then remove all columns that contain NAs. This drops the columns that held either blank or NA values.

#clean up step 1: remove columns with values of either "" (blank) or "NA"
training_original[training_original==""] <- NA  # replace blank with NA first
testing_original[testing_original==""] <- NA
training_clean <- training_original[, colSums(is.na(training_original)) == 0]  # keep only columns with no NAs
testing_clean <- testing_original[, colSums(is.na(testing_original)) == 0]

Step 2.2: Data Cleaning - Remove other irrelevant columns

Notice that columns 1 through 7 ("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window") are merely metadata from the data recording experiments. We can remove these columns and preserve only the hardware sensor data relevant to our machine learning model.

#clean up step 2: remove metadata columns (preserving only sensor data)
training_clean <- training_clean[, -(1:7)]
testing_clean <- testing_clean[, -(1:7)]
dim(training_clean)
## [1] 19622    53

Note that we now have 53 columns in the training set: 52 sensor predictors plus the outcome variable classe.

Step 3: Data Slicing

We'll leave the original testing data set intact and split the given training set into a sub-training set and a sub-testing set, using 75 percent of the rows for training and the rest for testing, as shown below.

#data slicing the original training data set into my own training and testing sets
set.seed(12345)
inTrain <- createDataPartition(y=training_clean$classe, p=0.75, list=FALSE)
training <- training_clean[inTrain,]
testing <- training_clean[-inTrain,]

The variables training and testing are now the two data sets we use to train and evaluate our machine learning models.
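Since createDataPartition samples within each class, the proportions of classe should be nearly identical in the two subsets; a quick sanity check, for example:

#verify that the stratified split preserved the class proportions
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)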

Step 4: Boosting with Trees

Boosting is known to be a highly accurate classification method that can take in many possibly weak predictors. It weights these predictors and adds them up to obtain a stronger predictor.

#train a boosted-trees model; with caret's default resampling this can take hours
modFit <- train(classe~., method="gbm", data=training, verbose=FALSE)
modFit$finalModel
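Note that the call above uses caret's default resampling scheme (25 bootstrap repetitions), which is slow on roughly 15,000 rows; this is the main reason the fit takes hours. As a sketch only (the 5-fold cross-validation below is an illustrative choice, not part of the original analysis), the same model can be tuned considerably faster with:

#sketch: the same gbm fit, but with 5-fold cross-validation instead of
#the default bootstrap resampling (the fold count is illustrative)
ctrl <- trainControl(method = "cv", number = 5)
modFit_cv <- train(classe ~ ., method = "gbm", data = training,
                   trControl = ctrl, verbose = FALSE)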

Step 5: Estimating the Expected Accuracy with a Confusion Matrix

#predict on the held-out sub-testing set
predictions <- predict(modFit, testing)
confusionMatrix(predictions, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1379   38    0    0    3
##          B    9  883   15    1    8
##          C    7   28  827   31    5
##          D    0    0   12  767   17
##          E    0    0    1    5  868
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9633          
##                  95% CI : (0.9576, 0.9684)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9536          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9885   0.9305   0.9673   0.9540   0.9634
## Specificity            0.9883   0.9917   0.9825   0.9929   0.9985
## Pos Pred Value         0.9711   0.9640   0.9209   0.9636   0.9931
## Neg Pred Value         0.9954   0.9835   0.9930   0.9910   0.9918
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2812   0.1801   0.1686   0.1564   0.1770
## Detection Prevalence   0.2896   0.1868   0.1831   0.1623   0.1782
## Balanced Accuracy      0.9884   0.9611   0.9749   0.9735   0.9809

The accuracy on the sub-testing set is estimated at about 96.33%, so the expected out-of-sample error is around 3.67%.
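The same estimate can be pulled out programmatically; a small sketch using the confusionMatrix object:

#estimate the out-of-sample error as 1 - accuracy on the sub-testing set
cm <- confusionMatrix(predictions, testing$classe)
1 - as.numeric(cm$overall["Accuracy"])  # roughly 0.0367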

Alternative training approach with PCA

Alternatively, if we pre-process the data with Principal Components Analysis (PCA) using its default parameters, the results turn out to be inferior, with noticeably lower prediction accuracy on the sub-testing set:

modFit2 <- train(classe~., method="gbm", preProcess="pca", data=training, verbose=FALSE)
modFit2$finalModel
predictions2 <- predict(modFit2, testing)
confusionMatrix(predictions2, testing$classe)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 24 predictors of which 24 had non-zero influence.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1252  106   45   30   26
##          B   32  717   61   28   67
##          C   50   84  718   98   62
##          D   53   20   17  632   38
##          E    8   22   14   16  708
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8212          
##                  95% CI : (0.8101, 0.8318)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7735          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8975   0.7555   0.8398   0.7861   0.7858
## Specificity            0.9410   0.9525   0.9274   0.9688   0.9850
## Pos Pred Value         0.8581   0.7923   0.7095   0.8316   0.9219
## Neg Pred Value         0.9585   0.9420   0.9648   0.9585   0.9533
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2553   0.1462   0.1464   0.1289   0.1444
## Detection Prevalence   0.2975   0.1845   0.2064   0.1550   0.1566
## Balanced Accuracy      0.9192   0.8540   0.8836   0.8774   0.8854

The accuracy drops to about 82.12%, well below the 96.33% achieved without PCA. Therefore we prefer the previous model without PCA.
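To see how aggressively PCA compressed the predictors, one can inspect the pre-processing object that train() stored; a sketch, assuming caret's default 95% variance threshold (the finalModel output above suggests about 24 components were retained):

#inspect the PCA pre-processing applied inside train()
modFit2$preProcess   # prints the number of principal components retained
#or run preProcess directly on the 52 predictors (column 53 is classe)
pp <- preProcess(training[, -53], method = "pca")  # default thresh = 0.95
pp$numComp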

Step 6: Making predictions on the Original Test Set

#predict the 20 original test cases with the chosen (non-PCA) gbm model
result_gbm <- predict(modFit, testing_original)
result_gbm
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Step 7: Writing the Results

#write each prediction to its own text file for submission
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(result_gbm)

Conclusion

Cleaning the data is more important than choosing the model. For this project, we first removed the invalid and irrelevant columns, preserving only the sensor data. Further compressing the predictors must be done carefully: for example, pre-processing with PCA noticeably lowered the accuracy. Our approach was to feed the cleaned data to the boosting model, which found that 45 of the 52 predictors had non-zero influence, as finalModel indicated. Training took a couple of hours on my old home desktop (Intel Celeron G1610 @ 2.6GHz, 6GB RAM, 64-bit Windows) to finish. The final model achieved about 96.33% accuracy on our sub-testing set.

I would like to thank the PML staff for putting this project together. It has been an enriching experience.