This document is part of the course project for the Coursera Practical Machine Learning (PML) class. The goal is to use machine learning methods to perform human activity recognition (HAR) using data from wearable hardware such as accelerometers on the belt, forearm, arm, and dumbbell. For more information, see the source: http://groupware.les.inf.puc-rio.br/har.
The following steps show how we achieve HAR using machine learning with a boosting model. First we clean up the data. Then we slice the given training set into training and testing subsets for cross-validation. Finally, we train our candidate models and pick the one that produces the better result.
#load necessary libraries
rm(list=ls())
library(ggplot2); library(caret);
#load given training and testing sets
training_original <- read.csv("pml-training.csv")
testing_original <- read.csv("pml-testing.csv")
A peek into the given data set reveals a large amount of information that is irrelevant to our prediction task. We therefore first clean up the data and then apply the boosting method.
We first replace blank cells with NA and then remove every column that contains an NA; this drops columns holding either blank or NA values.
#clean up step 1: remove columns with values of either "" (blank) or "NA"
training_original[training_original==""] <- NA # replace blank with NA first
testing_original[testing_original==""] <- NA
training_clean <- training_original[, colSums(is.na(training_original)) == 0] # keep only columns without any NA
testing_clean <- testing_original[, colSums(is.na(testing_original)) == 0]
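As a quick sanity check (not part of the original analysis), we can verify that the two cleaned sets keep matching columns; the only expected difference is the last column, which is classe in the training set and problem_id in the testing set:
#sanity check (illustrative): the cleaned sets should share the same predictor columns
all.equal(names(training_clean)[-ncol(training_clean)],
          names(testing_clean)[-ncol(testing_clean)])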
Notice that columns 1 through 7 (“X”, “user_name”, “raw_timestamp_part_1”, “raw_timestamp_part_2”, “cvtd_timestamp”, “new_window”, “num_window”) are merely metadata from the recording experiments. We can remove these columns and preserve only the hardware sensor data relevant to our machine learning model.
#clean up step 2: remove apparently uncorrelated metadata (preserving only sensor data)
training_clean <- training_clean[, -(1:7)]
testing_clean <- testing_clean[, -(1:7)]
dim(training_clean)
## [1] 19622 53
Note that we now have 53 variables in the training set: 52 sensor predictors plus the outcome classe.
We’ll leave the original testing data set intact and slice the training set into our own sub-training and sub-testing sets, using 75 percent of the data for training and the rest for testing, as shown below.
#data slicing the original training data set into my own training and testing sets
set.seed(12345)
inTrain <- createDataPartition(y=training_clean$classe, p=0.75, list=FALSE)
training <- training_clean[inTrain,]
testing <- training_clean[-inTrain,]
The variables training and testing now hold the two data sets we actually use to train and evaluate our machine learning models.
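As an illustrative check (not in the original write-up), we can confirm that createDataPartition preserved the class proportions in both subsets:
#illustrative check: the stratified split should keep the class mix intact
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)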
Boosting is known to be a highly accurate classifier that can take in many possibly weak predictors; it weights these predictors and adds them up to obtain a stronger predictor.
modFit <- train(classe~., method="gbm", data=training, verbose=FALSE)
modFit$finalModel
predictions <- predict(modFit, testing)
confusionMatrix(predictions, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1379 38 0 0 3
## B 9 883 15 1 8
## C 7 28 827 31 5
## D 0 0 12 767 17
## E 0 0 1 5 868
##
## Overall Statistics
##
## Accuracy : 0.9633
## 95% CI : (0.9576, 0.9684)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9536
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9885 0.9305 0.9673 0.9540 0.9634
## Specificity 0.9883 0.9917 0.9825 0.9929 0.9985
## Pos Pred Value 0.9711 0.9640 0.9209 0.9636 0.9931
## Neg Pred Value 0.9954 0.9835 0.9930 0.9910 0.9918
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2812 0.1801 0.1686 0.1564 0.1770
## Detection Prevalence 0.2896 0.1868 0.1831 0.1623 0.1782
## Balanced Accuracy 0.9884 0.9611 0.9749 0.9735 0.9809
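Because testing here is a held-out subset of the original training data, the accuracy above estimates out-of-sample performance; the expected out-of-sample error rate follows directly (a small sketch using objects already defined):
#expected out-of-sample error rate, roughly 1 - 0.9633 = 3.7%
1 - confusionMatrix(predictions, testing$classe)$overall["Accuracy"]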
Alternatively, if we pre-process the data with Principal Components Analysis (PCA) using default parameters, the results turn out to be inferior, with noticeably lower prediction accuracy on the testing set:
modFit2 <- train(classe~., method="gbm", preProcess="pca", data=training, verbose=FALSE)
modFit2$finalModel
predictions2 <- predict(modFit2, testing)
confusionMatrix(predictions2, testing$classe)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 24 predictors of which 24 had non-zero influence.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1252 106 45 30 26
## B 32 717 61 28 67
## C 50 84 718 98 62
## D 53 20 17 632 38
## E 8 22 14 16 708
##
## Overall Statistics
##
## Accuracy : 0.8212
## 95% CI : (0.8101, 0.8318)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7735
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8975 0.7555 0.8398 0.7861 0.7858
## Specificity 0.9410 0.9525 0.9274 0.9688 0.9850
## Pos Pred Value 0.8581 0.7923 0.7095 0.8316 0.9219
## Neg Pred Value 0.9585 0.9420 0.9648 0.9585 0.9533
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2553 0.1462 0.1464 0.1289 0.1444
## Detection Prevalence 0.2975 0.1845 0.2064 0.1550 0.1566
## Balanced Accuracy 0.9192 0.8540 0.8836 0.8774 0.8854
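To see how aggressively PCA compresses the predictors, we could inspect the preProcess object directly (a sketch; by default caret keeps enough components to explain 95% of the variance, consistent with the two dozen or so predictors finalModel reported above):
#illustrative: how many principal components does the default thresh=0.95 keep?
pp <- preProcess(training[, -53], method="pca") # column 53 is the outcome classe
pp$numComp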
result_gbm <- predict(modFit, testing_original)
result_gbm
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
#helper to write each prediction to its own submission file
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(result_gbm)
Cleaning the data matters as much as choosing the model. For this project, we first removed invalid and irrelevant columns and preserved only the sensor data. Further cleaning or compressing of the predictors must be done carefully: when we used PCA for predictor compression, accuracy dropped to about 82%. Our approach was to give all 52 predictors to the boosting model, which found that 45 of them had non-zero influence, as finalModel indicated. The training process took around a couple of hours on my old home desktop machine (Intel Celeron G1610 @ 2.6GHz + 6GB RAM + Windows 64-bit). The performance is good: about 96.33% accuracy on our sub-testing set.
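One practical note on the training time: train defaults to 25 bootstrap resamples, which accounts for much of the runtime. A hedged sketch (not run for this report) that swaps in 5-fold cross-validation via trainControl, which fits far fewer models:
#hypothetical speed-up: 5-fold cross-validation instead of 25 bootstrap resamples
ctrl <- trainControl(method="cv", number=5)
modFit_cv <- train(classe~., method="gbm", data=training, trControl=ctrl, verbose=FALSE)
modFit_cv$results # cross-validated accuracy for each gbm tuning setting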
I would like to thank the PML staff for putting together this project. It has been an enriching experience.