data <- read.csv("pml-training.csv")
dim(data)
## [1] 19622 160
table(data$classe) # classe is the manner (outcome) to be predicted
##
## A B C D E
## 5580 3797 3422 3216 3607
Inspecting the variables with summary(data) shows that many of them consist almost entirely of NA or blank values. The first 7 variables (row number X, user, timestamps and window indicators) are also irrelevant for prediction. All of these variables are removed from the predictors.
v <- apply(is.na(data)|data=="", 2, sum)
sum(v==0); sum(v>19000)
## [1] 60
## [1] 100
data <- data[, v==0]; data <- data[, -(1:7)]
dim(data)
## [1] 19622 53
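The column filter above can be illustrated on a small example. This is a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not taken from the real data); it only shows how counting NA-or-blank cells per column selects the complete columns.

```r
# Toy data frame standing in for the real training data (hypothetical values).
toy <- data.frame(
  X      = 1:4,
  roll   = c(1.2, 0.8, 1.1, 0.9),
  kurt   = c("", "", "0.5", ""),   # mostly blank strings
  maxval = c(NA, NA, NA, 3.1),     # mostly NA
  classe = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell is NA or an empty string.
# For an NA cell, toy == "" is NA, but TRUE | NA is TRUE, so the cell still counts.
is_missing <- is.na(toy) | toy == ""

# Number of missing cells per column.
missing <- colSums(is_missing)
missing

# Keep only the columns with no missing cells.
keep <- toy[, missing == 0, drop = FALSE]
names(keep)
```

Here `colSums()` plays the same role as `apply(..., 2, sum)`. Another option, if blanks should be treated as NA from the start, is to pass `na.strings = c("NA", "")` to `read.csv()`.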
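Now we have 53 variables: the outcome classe plus 52 predictors, which combine:
6 kinds of measurement: roll, pitch, yaw, gyros, accel and magnet
4 accelerometer locations: belt, arm, dumbbell, forearm
Next, the variables are explored with some plots, and the data are split 60/40 into training and testing sets.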
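One way to check this structure is to count columns by location from their names. The sketch below assumes names of the form measurement_location or measurement_location_axis (for example `roll_belt` or `gyros_belt_x`); the `cols` vector is a short hypothetical sample, not the full list of columns.

```r
# Hypothetical sample of column names; the naming pattern is an assumption.
cols <- c("roll_belt", "pitch_belt", "yaw_belt", "gyros_belt_x",
          "accel_arm_y", "magnet_dumbbell_z", "roll_forearm", "classe")

locations <- c("belt", "arm", "dumbbell", "forearm")

# Match "_<location>" followed by "_" or end of string, so that "arm"
# does not also match inside "forearm".
count_by_location <- sapply(locations, function(loc) {
  sum(grepl(paste0("_", loc, "(_|$)"), cols))
})
count_by_location
```

On the real data, replacing `cols` with `names(data)` would give the number of predictors recorded at each location.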
set.seed(23123)
library(caret)
inTrain <- createDataPartition(y=data$classe, p=0.6, list=FALSE)
train <- data[inTrain,]; test <- data[-inTrain,]
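For a factor outcome, createDataPartition samples within each class so that both parts keep roughly the same class proportions. The base R sketch below illustrates that idea on made-up labels; it is not caret's implementation, and the label vector `y` and the seed are arbitrary choices for the illustration.

```r
set.seed(1)  # arbitrary seed for this illustration

# Hypothetical class labels: 50 A, 30 B, 20 C.
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Row indices grouped by class.
idx_by_class <- split(seq_along(y), y)

# Sample 60% of the indices within each class, then pool them.
in_train <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.6 * length(i)))]
}))
in_train <- sort(unname(in_train))

train_y <- y[in_train]
test_y  <- y[-in_train]

# Class counts in each part keep the 5:3:2 ratio of the full vector.
table(train_y)
table(test_y)
```

A plain `sample()` over all rows would give a similar split on average, but the per-class proportions would vary from draw to draw.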
This is a classification problem.
I tried a classification tree, random forest, model-based prediction and boosting. With default settings, method="lda" was the best choice considering both accuracy and speed.
The resampled accuracy is about 0.70, so the expected out-of-sample error is about 0.30.
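A comparison of this kind can be sketched by timing each fit and scoring it on held-out rows. The example below is only an illustration under several assumptions: it uses the built-in iris data rather than the training data, calls MASS::lda and rpart::rpart directly instead of going through caret, and the helper `evaluate` is a hypothetical name. It does not reproduce the comparison described above.

```r
library(MASS)   # lda(); a recommended package shipped with R
library(rpart)  # rpart(); also a recommended package

set.seed(1)                     # arbitrary seed for this illustration
idx <- sample(nrow(iris), 100)  # simple random split, not stratified
tr  <- iris[idx, ]
te  <- iris[-idx, ]

# Fit a model on tr, time the fit, and score it on the held-out rows te.
evaluate <- function(fit_fun, predict_fun) {
  elapsed <- system.time(fit <- fit_fun(tr))["elapsed"]
  pred <- predict_fun(fit, te)
  c(accuracy = mean(pred == te$Species), seconds = unname(elapsed))
}

results <- rbind(
  lda  = evaluate(function(d) lda(Species ~ ., data = d),
                  function(f, d) predict(f, d)$class),
  tree = evaluate(function(d) rpart(Species ~ ., data = d, method = "class"),
                  function(f, d) predict(f, d, type = "class"))
)
results
```

Each row of `results` holds the held-out accuracy and the fit time in seconds for one method, which makes the accuracy/speed trade-off easy to read off.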
modFit <- train(classe~., method="lda", data=train)
modFit$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.6969599 0.6164363 0.007214281 0.009004911
pred <- predict(modFit, test)
table(pred, test$classe)
##
## pred A B C D E
## A 1838 259 141 90 61
## B 40 973 123 62 238
## C 195 159 902 149 144
## D 153 55 160 944 149
## E 6 72 42 41 850
sum(pred==test$classe)/dim(test)[1] # Testing Set's Accuracy
## [1] 0.7018863
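The same accuracy can be recomputed directly from the confusion table printed above, together with the out-of-sample error, per-class sensitivity and Cohen's kappa. This is a minimal sketch that re-enters the table by hand as the matrix `cm` (rows are predictions, columns are actual classes); the variable names are choices made for this illustration.

```r
# Confusion table from the output above: rows = predicted, columns = actual.
cm <- matrix(
  c(1838, 259, 141,  90,  61,
      40, 973, 123,  62, 238,
     195, 159, 902, 149, 144,
     153,  55, 160, 944, 149,
       6,  72,  42,  41, 850),
  nrow = 5, byrow = TRUE,
  dimnames = list(pred = LETTERS[1:5], actual = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Estimated out-of-sample error.
error <- 1 - accuracy

# Sensitivity per class: correct predictions over actual cases of that class.
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
p_chance <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, error = error, kappa = kappa), 4)
round(sensitivity, 3)
```

The per-class view shows what the single accuracy figure hides: class A is recovered most reliably, while class E is recovered least reliably.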
The accuracy on the testing set (0.702) is slightly higher than the resampled accuracy estimated on the training set (0.697).
This is not very good, but it is as expected.
For the submission, the model is applied to the original test data.
testing <- read.csv("pml-testing.csv")
answer <- predict(modFit, testing); answer
## [1] B A B C C E D D A A D A B A E A A B B B
## Levels: A B C D E