# Load and prepare the data

data <- read.csv("pml-training.csv")
dim(data)
## [1] 19622   160
table(data$classe) # manner to be predicted
##
##    A    B    C    D    E
## 5580 3797 3422 3216 3607

summary(data) was used to learn about the variables. Some variables contain too many NAs or blanks, and the first 7 variables, which hold the row number (X), user, time and window information, are irrelevant as well. These variables are removed from the predictors.

v <- apply(is.na(data)|data=="", 2, sum)
sum(v==0); sum(v>19000)
## [1] 60
## [1] 100
data <- subset(data[,v==0]); data <- data[-c(1:7)]
dim(data)
## [1] 19622    53

# Explore the data

Now we have 53 variables, a combination of:

- 6 different ways: roll, pitch, yaw, gyros, accel and magnet
- 4 accelerometer locations: belt, arm, dumbbell, forearm

Now exploring the variables with some plots.

# Machine learning

## Split data to training & testing sets for cross validation

set.seed(23123)
library(caret)
inTrain <- createDataPartition(y=data$classe, p=0.6, list=FALSE)
train <- data[inTrain,]; test <- data[-inTrain,]
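
The split above is stratified on classe, so both sets should keep roughly the same class mix. The following is a minimal base R sketch of that idea, not caret's implementation: it rebuilds only the outcome vector from the class counts reported earlier, and `stratified_idx` is a hypothetical helper introduced here for illustration.

```r
# Class counts as reported by table(data$classe) earlier in this report
counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)
classe <- factor(rep(names(counts), times = counts))

set.seed(23123)

# Hypothetical helper (not part of caret): draw a fraction p of the
# row indices separately within each class, then combine them
stratified_idx <- function(y, p) {
  per_class <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  })
  sort(unlist(per_class, use.names = FALSE))
}

in_train <- stratified_idx(classe, p = 0.6)

# Class proportions in each part of the split
props <- rbind(
  train = as.vector(table(classe[in_train])) / length(in_train),
  test  = as.vector(table(classe[-in_train])) / (length(classe) - length(in_train))
)
colnames(props) <- levels(classe)
print(round(props, 3))
```

With the real data, the same check can be done by comparing `prop.table(table(train$classe))` with `prop.table(table(test$classe))`.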

## Classification Model

This is a classification problem.

A classification tree, random forest, model-based prediction and boosting were tried. With default settings, method="lda" was chosen for its combination of accuracy and speed.

The accuracy is about 0.70, so the expected out-of-sample error is about 0.30.

modFit <- train(classe~., method="lda", data=train)
modFit$results
##   parameter  Accuracy     Kappa  AccuracySD     KappaSD
## 1      none 0.6969599 0.6164363 0.007214281 0.009004911
pred <- predict(modFit, test)
table(pred, test$classe)
##
## pred    A    B    C    D    E
##    A 1838  259  141   90   61
##    B   40  973  123   62  238
##    C  195  159  902  149  144
##    D  153   55  160  944  149
##    E    6   72   42   41  850
sum(pred==test$classe)/dim(test)[1] # Testing set's accuracy
## [1] 0.7018863
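
As a cross-check, the testing-set accuracy can be recomputed from the confusion matrix alone. The sketch below copies the counts from the table above (rows are predicted classes, columns are actual classes). The per-class recall and precision are not part of the original analysis and are added only to show which classes the model confuses most.

```r
# Confusion matrix copied from table(pred, test$classe) above:
# rows = predicted class, columns = actual class
cm <- matrix(c(1838, 259, 141,  90,  61,
                 40, 973, 123,  62, 238,
                195, 159, 902, 149, 144,
                153,  55, 160, 944, 149,
                  6,  72,  42,  41, 850),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = LETTERS[1:5], actual = LETTERS[1:5]))

# Overall accuracy: correct predictions (diagonal) over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Recall: share of each actual class that was predicted correctly
recall <- diag(cm) / colSums(cm)

# Precision: share of each predicted class that was actually correct
precision <- diag(cm) / rowSums(cm)

print(round(accuracy, 7))  # 0.7018863, matching the value reported above
print(round(rbind(recall, precision), 3))
```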

The accuracy on the testing set (0.702) is slightly higher than the resampled accuracy estimated during training (0.697).

An accuracy of about 0.70 is not very good, but it is as expected.

# Applying the machine learning algorithm to the test data

For the submission of predictions, the model is applied to the original test data.

testing <- read.csv("pml-testing.csv")
answer <- predict(modFit, testing); answer
##  [1] B A B C C E D D A A D A B A E A A B B B
## Levels: A B C D E