Before conducting the Exploratory Data Analysis, the data set was split into a training data set and testing data set, where 75% of the rows were random sampling in the training data set and the remaining 25% were put into the test set.
Using str() function, we can find the structure of the training data set. The data set contains 30,000 observations and 15 variables.
For the testing data set, the data set contains 10,000 observation with 13 variables. The two missing variables were ‘price’ and ‘sent’, which we will be using the training data set to create models to predict the variables in the testing data set.
The data type for the variables id and sent is integer, price variable is numeric and the remaining variables are character. In order to continue with the exploration of the project, all variables with the type “character” are convert into “numeric” except the variables sent, year, built, reno, bedrooms, bathrooms and environ are convert into “factor”. This is because the variables are categorical nominal with categorical data input.
str(train)## 'data.frame':    30000 obs. of  15 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ price     : num  1.62 1.234 0.969 -0.156 -0.918 ...
##  $ sent      : int  1 1 1 0 0 1 1 1 0 0 ...
##  $ cond      : chr  "1.254" "2.626" "1.285" "2.521" ...
##  $ year      : chr  "5" "2" "9" "3" ...
##  $ built     : chr  "1" "2" "2" "2" ...
##  $ lat       : chr  "0.214292386167818" "1.04368478552843" "0.606694521912293" "0.196032855601255" ...
##  $ lon       : chr  "0.0870801669691538" "0.175831949918982" "0.0777294735788834" "0.104240128073335" ...
##  $ sq.m.h    : chr  "271" "400" "270" "130" ...
##  $ sq.m.block: chr  "1686" "805" "497" "446" ...
##  $ sq.m.pool : chr  "21" "0" "0" "0" ...
##  $ reno      : chr  "0" "0" "0" "0" ...
##  $ bedrooms  : chr  "3" "5" "2" "5" ...
##  $ bathrooms : chr  "3" "4" "3" "2" ...
##  $ environ   : chr  "1" "1" "1" "1" ...str(test)## 'data.frame':    10000 obs. of  13 variables:
##  $ id        : int  21 23 26 27 30 33 35 45 49 50 ...
##  $ cond      : chr  "-0.663" "-0.215" "0.781" "0.029" ...
##  $ year      : chr  "10" "9" "2" "6" ...
##  $ built     : chr  "8" "3" "2" "8" ...
##  $ lat       : chr  "0.198934576106144" "1.12880388882484" "-3.16573946722443" "NA" ...
##  $ lon       : chr  "0.318718838529676" "0.143025189524979" "-4.05361948837381" "0.253234283710486" ...
##  $ sq.m.h    : chr  "379" "99" "298" "273" ...
##  $ sq.m.block: chr  "1031" "781" "1401" "920" ...
##  $ sq.m.pool : chr  "64" "0" "0" "82" ...
##  $ reno      : chr  "1" "0" "0" "0" ...
##  $ bedrooms  : chr  "NA" "2" "3" "5" ...
##  $ bathrooms : chr  "3" "1" "3" "4" ...
##  $ environ   : chr  "0" "3" "3" "1" ...Before converting the data type, sent, year, built, reno, bedrooms, bathrooms and environ are convert to numeric first to fill in the missing values. After filling in the missing values we then can convert the data type to factor. We do the same steps for the testing data set.
sq.m.h, sq.m.block, sq.m.pool are right-skewed by looking at the line plot, with mean to the right of the median.
lat and lon are left-skewed by looking at the line graph, with the mean to the left of the median. There could be an extreme value on lat.
Cond is approcimately normally distributed.
Year and built are comb distribution, this may be contributed because year and built are categorical nominal data.
Furthermore, reno, bedrooms, bathrooms and environ are multimodal distribution too due to the input are categorical nominal data.
For skewed covariates, such as lat, lon, sq.m.h, sq.m.block and sq.m.pool, median imputation is used tp replace missing values with the median. Due to the selected covariates are continuous variables that have skewed distributions, with median imputation, it can help to maintain the central tendency of the data.
Mode imputation is used for year, built, reno, bedrooms, bathrooms and environ. To justify my choice, I used mode imputation to preserve the category with the highest occurrence.
As the covariate, cond, is approximately normally distributed, mean imputation is used to replace the missing values with the mean of the data. For the remaining covariates, id, price, and sent do not have any missing values, so no imputation methods were used.
There’s no positive nor negative correlation but there’s strong relationship between cond and price as data points are clustered tightly.
Looking at the box plot between cond and sent, we see that there’s an outlier where someone was very satisfied of the sale price after 6 months with a negative condition score. Inversely, there are some outliers with positive condition score that are not satisfied with the sale price. We also can tell that the worse the condition score, more people will not be satisfied with the sale price.
coef(lm(price ~ cond, data = train))## (Intercept)        cond 
##  0.24332531  0.02399454ggplot(data = train) +
  geom_point(mapping = aes(x = cond, y = price)) + geom_abline(intercept = 0.24332531, slope = 0.02399454, col = "red") + ylab("Price") +xlab("Condition") + ggtitle("Distribution of Price by Condition")ggplot(data = train, mapping = aes(x = cond, y = sent)) +
    geom_boxplot()+ ylab("Satisfaction Score") +xlab("Condition") + ggtitle("Box Plot of Satisfaction Score by Condition")From the box plot below, we see that houses sold on year 2 and 3 have significant outliers having higher sold price.
From the tile plot, starting from houses sold on year 9 to year 12, we noticed that there were a significant amount of observations that are not satisfied. Overall, we noticed that there were more observations that were not satisfied compared to satisfied.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = year, y = price))+ ylab("Price") +xlab("Year Sold") + ggtitle("Box Plot of Price by Year Sold")#reorder(year, price, FUN = median)
train %>% 
  count(sent, year) %>%  
  ggplot(mapping = aes(x = year, y = sent)) +
    geom_tile(mapping = aes(fill = n)) + ylab("Satisfaction Score") +xlab("Year Sold") + ggtitle("Tile Plot of Satisfaction Score by Year Sold")From the box plot, it is based on the lowest median to the highest median price, house built on the 10th decade will have the highest median price, while house built of the 4th decade will have the lowest median price.
From the tile plot, we observe that there’s significant amount of observations are not satisfied no matter what decade the house was built.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = built, y = price)) + ylab("Price") +xlab("Decade Built") + ggtitle("Box Plot of Price by Decade Built")train %>% 
  count(sent, built) %>%  
  ggplot(mapping = aes(x = built, y = sent)) +
    geom_tile(mapping = aes(fill = n)) + ylab("Satisfaction Score") +xlab("Decade Built") + ggtitle("Tile Plot of Satisfaction Score by Decade Built")From the scatterplot, latitude between -1 and 1 has a strong relationship and clustered tightly. However, latitude less than -3 has a strong relationship and clustered tightly, having a significant gap from latitude between -1 and 1.
From the boxplot, latitude under -3 has outliers of both 1 and 0 satisfaction score, which is expected.
coef(lm(price ~ lat, data = train))## (Intercept)         lat 
##  0.24506962  0.02301766ggplot(data = train) +
  geom_point(mapping = aes(x = lat, y = price)) + geom_abline(intercept = 0.24, slope = 0.02, col = "red")+ ylab("Price") +xlab("Lat") + ggtitle("Scatterplot of Price by Lat")ggplot(data = train, mapping = aes(x = lat, y = sent)) +
  geom_boxplot() + ylab("Satisfaction Score") +xlab("Lat") + ggtitle("Box Plot of Satisfaction Score by Lat")Looking at scatterplot, longtitude has a similar plot with latitude with a significant gap from longtitude between 0 and 1. Longtitude between 0 and 1 and longtitude under -4 has a strong relationship and clustered tightly.
From the boxplot, latitude under -4 has outliers of both 1 and 0 satisfaction score, similar to the previous box plot of lat.
coef(lm(price ~ lon, data = train))## (Intercept)         lon 
##   0.2305310  -0.1688007ggplot(data = train) +
  geom_point(mapping = aes(x = lon, y = price)) + geom_abline(intercept = 0.23, slope = -0.17, col = "red") + ylab("Price") +xlab("Lon") + ggtitle("Scatterplot of Price by Lon")ggplot(data = train, mapping = aes(x = lon, y = sent)) +
  geom_boxplot() + ylab("Satisfaction Score") +xlab("Lon") + ggtitle("Box Plot of Satisfaction Score by Lon")Looking at the scatterplot, we see that square meters of living space has a positive and stronger relationship with the price of house sold at. We see also that there’s an outlier of having more than 2000 square meters of living space with price less than 5.0.
Looking at the boxplot, it could be the same outlier observation is very satisfied with their purchase of the house.
coef(lm(price ~ sq.m.h, data = train))##  (Intercept)       sq.m.h 
## -1.169765675  0.006611341ggplot(data = train) +
  geom_point(mapping = aes(x = sq.m.h, y = price)) + geom_abline(intercept = -1.17, slope = 0.006, col = "red")+ ylab("Price") +xlab("Sq.m.h") + ggtitle("Scatterplot of Price by sq.m.h")ggplot(data = train, mapping = aes(x = sq.m.h, y = sent)) +
  geom_boxplot()+ ylab("Satisfaction Score") +xlab("Sq.m.h") + ggtitle("Box Plot of Satisfaction Score by Sq.m.h")From the scatterplot, there’s not a positive relationship but a strong relationship between price and square meters of the lot size.
From the boxplot, there exist outliers.
coef(lm(price ~ sq.m.block, data = train))##  (Intercept)   sq.m.block 
## 2.140177e-01 2.034816e-05ggplot(data = train) +
  geom_point(mapping = aes(x = sq.m.block, y = price)) + geom_abline(intercept = 0.21, slope = 0.00002, col = "red")+ ylab("Price") +xlab("Sq.m.block") + ggtitle("Scatterplot of Price by Sq.m.block")ggplot(data = train, mapping = aes(x = sq.m.block, y = sent)) +
  geom_boxplot()+ ylab("Satisfaction Score") +xlab("Sq.m.block") + ggtitle("Box Plot of Satisfaction Score by Sq.m.block")Observing the scatterplot, there’s a strong and positive relationship between price and square meters of the pool size. There are two outliers, one more than 300 square meters of pool size and one between 200 and 300 square meters of pool size.
Observing the boxplot, the outliers could be the same as the scatterplot between square meters of the pool size and satisfaction score.
coef(lm(price ~ sq.m.pool, data = train))## (Intercept)   sq.m.pool 
##  0.08382284  0.00629291ggplot(data = train) +
  geom_point(mapping = aes(x = sq.m.pool, y = price)) + geom_abline(intercept = 0.08, slope = 0.006, col = "red")+ ylab("Price") +xlab("Sq.m.pool") + ggtitle("Scatterplot of Price by Sq.m.pool")ggplot(data = train, mapping = aes(x = sq.m.pool, y = sent)) +
  geom_boxplot()+ ylab("Satisfaction Score") +xlab("Sq.m.pool") + ggtitle("Box Plot of Satisfaction Score by Sq.m.pool")From the box plot, reno and price seem to be normal.
Looking at the tile plot, we see that a significant amount of observations are not satisfied when the houses are not renovated.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = reorder(reno, price, FUN = median), y = price))+ ylab("Price") +xlab("Renovation Score") + ggtitle("Box Plot of Price by Renovation Score")train %>% 
  count(sent, reno) %>%  
  ggplot(mapping = aes(x = reno, y = sent)) +
    geom_tile(mapping = aes(fill = n))+ ylab("Satisfaction Score") +xlab("Renovation Score") + ggtitle("Tile Plot of Satisfaction Score by Renovation Score")From the box plot, it seems that there’s an outlier between number of bedrooms and price when there are 3 bedrooms.
In the tile plot, a significant amount of observations indicating that houses with 3 to 4 bedrooms are not satisfied with the purchase.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = bedrooms, y = price))+ ylab("Price") +xlab("Number of Bedrooms") + ggtitle("Box Plot of Price by Number of Bedrooms")train %>% 
  count(sent, bedrooms) %>%  
  ggplot(mapping = aes(x = bedrooms, y = sent)) +
    geom_tile(mapping = aes(fill = n)) + ylab("Satisfaction Score") +xlab("Number of Bedrooms") + ggtitle("Tile Plot of Satisfaction Score by Number of Bedrooms")Observing the box plot, it seems that house with 0 bathroom exists an outlier between price and bathrooms.
Looking at the tile plot, between 1 and 3 bathrooms, significant amount of observations are not satisfied with the purchase.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = bathrooms, y = price))+ ylab("Price") +xlab("Number of Bathrooms") + ggtitle("Box Plot of Price by Number of Bathrooms")train %>% 
  count(sent, bathrooms) %>%  
  ggplot(mapping = aes(x = bathrooms, y = sent)) +
    geom_tile(mapping = aes(fill = n)) + ylab("Satisfaction Score") +xlab("Number of Bathrooms") + ggtitle("Tile Plot of Satisfaction Score by Number of Bathrooms")Looking at the box plot, it seems to be normal and there are no significant observations require further research.
In the tile plot, an environ score of 1 has a significant amount of observations that are not satisfied with the purchase.
ggplot(data = train) +
  geom_boxplot(mapping = aes(x = environ, y = price))+ ylab("Price") +xlab("Environment Score") + ggtitle("Box Plot of Price by Environment Score")train %>% 
  count(sent, environ) %>%  
  ggplot(mapping = aes(x = environ, y = sent)) +
    geom_tile(mapping = aes(fill = n)) + ylab("Satisfaction Score") +xlab("Environment Score") + ggtitle("Tile Plot of Satisfaction Score by Environment Score")Looking at the correlation matrix, with the method of pearson, we find that square meters of living space (sq.m.h) has a strong correlation with the price of the house sold with a value of 0.58.
train %>% 
  correlate() %>% 
  plot() + xlab("Variables 1") + ylab("Variables 2") + ggtitle("Correlation Matrix (Method = Pearson)")Are there more 1s than 0s for satisfaction score in the dataset?
Calculating the mode for sent, there are more 0s compared to 1s.
calc_mode(train$sent)## [1] 0
## Levels: 0 1Are there more houses renovated compared to not renovated?
After calculating the mode for reno, there are more houses that are not renovated.
calc_mode(train$reno)## [1] 0
## Levels: 0 1Is the price of the house sold higher when renovated?
After calculating the mean price of the house when renovated, it is higher than houses that are not renovated.
mean(train[train$reno == '0', 'price'])## [1] 0.2091988mean(train[train$reno == '1', 'price'])## [1] 0.503839First model: Multiple regression with backward selection method to select the best predictors
Backward selection method is chosen because forward selection method because of the so-called suppressor effects produced by the forward selection technique. Predictors are only significant when another predictor is maintained constant therefore these suppressor effects occur.
With backward selection method, the best predictors for multiple regression include cond, year, built, lat, lon, sq.m.h, sq.m.block, sq.m.pool, bedrooms and bathrooms. Looking at step.model, sent can be disregarded as it is a dependent variable in this report. After removing sent, environ p-value is significantly more than 0.05, therefore can be disregarded as well.
Mean Squared Error for this model is 0.4.
# Fit the full model 
full.model <- lm(price ~., data = train)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "backward", 
                      trace = FALSE)
anova(step.model)mr <- lm(price ~ cond + year + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + bedrooms + bathrooms, data = train)
anova(mr)Second model: Logistic Regression of Generalized Linear Model using k-fold cross validation with forward selection method based on the value of AIC.
Using a 10-fold cross validation method for logistic regression of GLM, I find the best model consisting predictors cond, year, built, lat, lon, sq.m.h, sq.m.block, sq.m.pool, bedrooms, bathrooms and environ. Using forward selection method based on AIC, this model provides the lowest AIC possible of 51454. Predictors of reno and environ were not added as MSE and AIC have no changes towards MSE and no significant decrease of AIC.
Mean Squared Error for this model is 0.3673.
#Randomly shuffle the data
yourData<-train[sample(nrow(train)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(train)),breaks=10,labels=FALSE)
set.seed(1)
#Perform 10 fold cross validation
for(i in 1:10){
    
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- train[testIndexes, ]
    trainData <- train[-testIndexes, ]
    
    model <- glm(price ~ cond + year + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + as.numeric(bedrooms) + as.numeric(bathrooms) + environ, data = trainData)
    pred.test.glm <- predict(model, newdata = testData)
    mse.out <- mean((testData$price - pred.test.glm)^2, na.rm = TRUE)
}
mse.out## [1] 0.3673352Third Model: Lasso Regression with 10-fold cross validation
Using alpha = 1 to fit lasso regression model. 10-fold cross validation method was used to find the best lambda value.
Looking at best.model.lasso, no coefficient is shown for the predictor environ. This means that environ is completely dropped from the model because it wasn’t influential enough.
Mean Squared Error for this model is 0.4274.
library(glmnet)
y.lasso <- train$price
x.lasso <- data.matrix(train[,c('cond','year','built','lat','lon','sq.m.h','sq.m.block','sq.m.pool','reno','bedrooms','bathrooms','environ')])
cv.model.lasso <- cv.glmnet(x.lasso, y.lasso, alpha = 1)
best.lambda.lasso <- cv.model.lasso$lambda.min
best.lambda.lasso## [1] 0.001287221plot(cv.model.lasso)best.model.lasso <- glmnet(x.lasso, y.lasso, alpha = 1,lambda = best.lambda.lasso)
coef(best.model.lasso)## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -1.626015e-01
## cond         2.191827e-02
## year        -1.492657e-01
## built       -1.926161e-02
## lat          4.112898e-01
## lon         -5.340585e-01
## sq.m.h       6.349427e-03
## sq.m.block   6.758345e-06
## sq.m.pool    1.025915e-03
## reno         2.913286e-02
## bedrooms    -6.429561e-02
## bathrooms    1.047535e-01
## environ      .y.predicted.lasso <- predict(best.model.lasso, s = best.lambda.lasso, newx = x.lasso)
mse.lasso <- mean((y.lasso - y.predicted.lasso)^2)
mse.lasso## [1] 0.4273711Fourth Model: Ridge Regression with 10 fold cross validation
Using alpha = 0 to fit ridge regression model. 10-fold cross validation method was used to find the best lambda value.
From this model, we see that sq.m.block is the least influential predictor in this model as it is the smallest value from coef(best.model.ridge).
Mean Squared Error for this model is 0.4359.
library(glmnet)
y.ridge <- train$price
x.ridge <- data.matrix(train[,c('cond','year','built','lat','lon','sq.m.h','sq.m.block','sq.m.pool','reno','bedrooms','bathrooms','environ')])
model.ridge <- glmnet(x.ridge, y.ridge, alpha = 0)
summary(model.ridge)##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1200   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         4   -none-    call   
## nobs         1   -none-    numericcv.model.ridge <- cv.glmnet(x.ridge, y.ridge, alpha = 0)
best.lambda.ridge <- cv.model.ridge$lambda.min
best.lambda.ridge## [1] 0.06557281plot(cv.model.ridge)best.model.ridge <- glmnet(x.ridge, y.ridge, alpha = 0, lambda = best.lambda.ridge)
coef(best.model.ridge)## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -2.852261e-01
## cond         1.700912e-02
## year        -1.412251e-01
## built       -2.332211e-02
## lat          3.036680e-01
## lon         -4.244523e-01
## sq.m.h       5.771733e-03
## sq.m.block   5.911469e-06
## sq.m.pool    9.550063e-04
## reno         4.875670e-02
## bedrooms    -5.255336e-02
## bathrooms    1.482189e-01
## environ      4.690769e-03plot(model.ridge, xvar = "lambda")y.predicted.ridge <- predict(best.model.ridge, s = best.lambda.ridge, newx = x.ridge)
mse.ridge <- mean((y.ridge - y.predicted.ridge)^2)
mse.ridge## [1] 0.4358645Fifth Model: Principal Components Regression with 10-fold cross validation
Looking at the plots, we see that in each plot, the model fit improves by adding more components, and the decrease of test RMSE is not that significant on adding one component from 33 components compared to 32 components to 33 components.
Mean Squared Error for this model is 1.0323.
library(pls)
set.seed(1)
pcr.model <- pcr(price ~ cond + year + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + reno + as.numeric(bedrooms) + as.numeric(bathrooms) + environ, data = train, scale = TRUE, validation = "CV")
summary(pcr.model)## Data:    X dimension: 30000 34 
##  Y dimension: 30000 1
## Fit method: svdpc
## Number of components considered: 34
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.139     1.07    1.046   0.9917   0.9361   0.9207   0.9175
## adjCV        1.139     1.07    1.046   0.9916   0.9359   0.9207   0.9175
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.9159   0.9144   0.9143    0.9109    0.9101    0.9091    0.9079
## adjCV   0.9158   0.9143   0.9144    0.9108    0.9105    0.9095    0.9088
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV       0.9024     0.900    0.8977    0.8958    0.8941    0.8933    0.8782
## adjCV    0.9023     0.901    0.8977    0.8956    0.8938    0.8933    0.8780
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV       0.8762    0.8755    0.8701    0.8696    0.8694    0.8522    0.8519
## adjCV    0.8761    0.8756    0.8701    0.8695    0.8694    0.8522    0.8519
##        28 comps  29 comps  30 comps  31 comps  32 comps  33 comps  34 comps
## CV       0.8419    0.8079    0.7625    0.7587    0.7223    0.6264    0.6255
## adjCV    0.8419    0.8079    0.7624    0.7589    0.7223    0.6264    0.6254
## 
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        8.309    13.57    18.61    22.83    26.93    30.68    34.20    37.58
## price   11.738    15.66    24.25    32.52    34.70    35.16    35.39    35.59
##        9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X        40.94     44.25     47.51     50.75     53.98     57.19     60.37
## price    35.61     36.14     36.20     36.36     36.43     37.38     37.58
##        16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X         63.54     66.67     69.79     72.89     75.91     78.89     81.81
## price     38.02     38.30     38.60     38.63     40.72     40.97     41.04
##        23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X         84.66     87.34     89.98     92.50     94.35     96.09     97.63
## price     41.80     41.88     41.91     44.16     44.21     45.52     49.83
##        30 comps  31 comps  32 comps  33 comps  34 comps
## X         98.41     98.91     99.39     99.73    100.00
## price     55.33     55.76     59.92     69.86     69.95#visualize cross-validation plots
validationplot(pcr.model)validationplot(pcr.model, val.type="MSEP")validationplot(pcr.model, val.type="R2")#define training and testing sets
train.pcr <- train[1:24000, c("cond", "year", "built", "lat", "lon", "sq.m.h","sq.m.block","sq.m.pool","reno","bedrooms","bathrooms","environ")]
y.test.pcr <- train[24001:nrow(train), c("price")]
test.pcr <- train[24001:nrow(train), c("cond", "year", "built", "lat", "lon", "sq.m.h","sq.m.block","sq.m.pool","reno","bedrooms","bathrooms","environ")]
    
#use model to make predictions on a test set
pcr.model.pred <- pcr(price ~ cond + year + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + reno + as.numeric(bedrooms) + as.numeric(bathrooms) + environ, data = train, scale=TRUE, validation="CV")
pcr.pred <- predict(pcr.model.pred, test.pcr, ncomp=2)
#calculate MSE
mse.pcr <- mean((pcr.pred - y.test.pcr)^2)
mse.pcr## [1] 1.032316My Kaggle public name is YuHeng26.
The best predictive model for Modelling Component 1 is the second model: Linear Regression of Generalized Linear Model. To justify my point, the criterion for the predictions is Mean Squared Error, the second model provides the lowest Mean Squared Error of 0.3673. This means that the prediction errors is the lowest, providing a better predictive model.
test.price <- test
glm.model.pred.value <- predict(model, newdata = test)
test.price$price <- glm.model.pred.value
glm.model.sub <- subset(test.price, select = c(id, price))
write.csv(glm.model.sub, "C:\\Users\\Yu Heng\\Documents\\Archive\\ANU\\2022 SEM 1\\STAT3040\\Assignment 3\\glm_predicted_sent.csv", row.names = FALSE)First model: Multiple regression with backward selection method
To create lm model, the dependent variable cannot be factor, thus as.numeric function is used for sent. As price is a dependent variable, it can be removed in this model. After removing price, the best model is mr2, consists of year, built, lat, lon, sq.m.h, sq.m.block, sq.m.pool, reno, bedrooms, bathrooms and environ. The MSE is the lowest as well, 0.14.
Correct Classification rate for this model is 0.7505.
n <- nrow(train)
K <- 10
train.sample <- sample(1:n, n/10)
set.seed(1)
fold <- sample(rep(1:K , each=n/K))
data.train <- train[fold != K,]
data.test <- train[fold == K,]
full.model.2 <- lm(as.numeric(sent) ~., data = train)
step.model.2 <- stepAIC(full.model.2, direction = "backward", trace = FALSE)
anova(step.model.2)mr2 <- lm(as.numeric(sent) ~ year + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + reno + bedrooms + bathrooms + environ, data = train)
anova(mr2)p.hat.lm.2 <- predict(mr2, data.train[-train.sample,], type="response")
y.hat.lm.2 <- rep(0, n/10)
y.hat.lm.2[p.hat.lm.2 >= 0.5] <- 1
y.lm.2 <- (data.train[-train.sample,]$sent == "1")*1
table.lm.2 <- table(y.lm.2,y.hat.lm.2)
accuracy_Test.lm.2 <- sum(diag(table.lm.2)) / sum(table.lm.2)
accuracy_Test.lm.2## [1] 0.7502368Second model: Logistic Regression of Generalized Linear Model using k-fold cross validation with forward selection method based on the value of AIC.
The best predictors for this model are cond, built, lat, lon, sq.m.h, sq.m.block, sq.m.pool, reno, bedrooms, bathrooms and environ. The following predictors provide the lowest possible AIC of 23695 with MSE of 8.6633. The predictor year was not added into this model because it did not decrease the value of AIC. Forward selection method based on the value AIC was used.
Correct Classification rate for this model is 0.7446.
n <- nrow(train)
train.sample <- sample(1:n, n/10)
# K fold
K <- 10
##
set.seed(1)
fold <- sample(rep(1:K , each=n/K))
##
for (k in 1:K) {
  
  data.train <- train[fold != k,]
  data.test <- train[fold == k,]
  
  glm.model.2 <- glm(sent ~ cond + built + lat + lon + sq.m.h + sq.m.block + sq.m.pool + reno + as.numeric(bedrooms) + as.numeric(bathrooms) + environ, family = binomial, data = data.train)
  pred.test.glm.2 <- predict(glm.model.2, newdata = data.test)
  mse.out2 <- mean((as.numeric(data.test$sent) - pred.test.glm.2)^2, na.rm = TRUE)
}
p.hat.glm.2 <- predict(glm.model.2, data.train[-train.sample,], type="response")
y.hat.glm.2 <- rep(0, n/10)
y.hat.glm.2[p.hat.glm.2 >= 0.5] <- 1
y.glm.2 <- (data.train[-train.sample,]$sent == "1")*1
## Confusion matrix
table.glm.2 <- table(y.glm.2,y.hat.glm.2)
accuracy_Test.glm.2 <- sum(diag(table.glm.2)) / sum(table.glm.2)
accuracy_Test.glm.2## [1] 0.7436114Third Model: Linear Discriminant Analysisn of 10-fold cross validation with backward elimination method
Using backward elimination method for this LDA model, the best predictors are cond, year, built, lat, lon, sq.m.h, sq.m.pool, reno, bedrooms, bathrooms and environ.
Correct classification rate for this model is 0.8123 and AUC is 0.836.
95% confidence interval for somewhat satisfied or not satisfied: [0.7471,0.7569] 95% confidence interval for very satisfied: [0.2431,0.2529]
library(MASS)
library(caret)
library(pROC)
##
set.seed(1)
fold <- sample(rep(1:K , each=n/K))
data.train.lda <- train[fold != k,]
data.test.lda <- train[fold == k,]
lda.model <- lda(sent ~ cond + year + built + lat + lon + sq.m.h + sq.m.pool + reno + bedrooms + bathrooms + environ, data = train)
predictions <- predict(lda.model, data.test.lda)
predictions.probs <- predictions$posterior[,2]
predicted.classes <- predictions$class
observed.classes <- data.test.lda$sent
accuracy <- mean(observed.classes == predicted.classes)
accuracy## [1] 0.8133333# Compute roc
res.roc <- roc(observed.classes, predictions.probs)
plot.roc(res.roc, print.auc = TRUE)# 95% confidence interval
sent.size <- 30000
p.0 <- 22561/30000
p.1 <- 7439/30000
margin.0 <- qnorm(0.975)*sqrt(p.0*(1-p.0)/sent.size)
margin.1 <- qnorm(0.975)*sqrt(p.1*(1-p.1)/sent.size)
lowerinterval.0 <- p.0 - margin.0
lowerinterval.1 <- p.1 - margin.1
upperinterval.0 <- p.0 + margin.0
upperinterval.1 <- p.1 + margin.1Fourth Model: Quadratic Discriminant Analysis of 10 fold cross validation with backward elimination method
Using backward elimination method for this model, the best predictors are cond, year, lat, sq.m.h, sq.m.pool, reno, bedrooms and bathrooms.
Correct classification rate for this model is 0.7943 with AUC of 0.758.
for (k in 1:K) {
  
  data.train.qda <- train[fold != k,]
  data.test.qda <- train[fold == k,]
  
  qda.model <- qda(sent ~ cond + year + lat + sq.m.h + sq.m.pool + reno + as.numeric(bedrooms) + as.numeric(bathrooms), data = data.train.qda)
  pred.test.qda <- predict(qda.model, newdata = data.test.qda)
  pred.prob.qda <- pred.test.qda$posterior[,2]
  pred.class.qda <- pred.test.qda$class
  observed.classes.qda <- data.test.qda$sent
  
  accuracy.qda <- mean(observed.classes.qda == pred.class.qda)
}
accuracy.qda## [1] 0.7943333# Compute roc
res.roc <- roc(observed.classes.qda, pred.prob.qda)
plot.roc(res.roc, print.auc = TRUE)Fifth Model: Decision Tree
Splitting the data to 80/20, with 80 percent of the data serves to train the model, and 20 percent to make predictions.
Looking at the confusion matrix, this model correctly predicted 4266 identities as not satisfied but classified 256 as satisfied. Furthermore, the model misclassified 904 identities as satisfied but it turned out that they are not satisfied.
Changing all the parameters in the control, I find the best accuracy for this model is 0.8067.
Correct classification rate for this model is 0.8067.
library(rpart)
library(rpart.plot)
library(caret)
n <- nrow(train)
new.train <- subset(train, select = -c(price))
total_row <- 0.8 * n
train_sample <- 1:total_row
data.train.tree <- new.train[train_sample, ]
data.test.tree <- new.train[-train_sample, ]
tree.fit <- rpart(sent~., data = data.train.tree, method = 'class')
rpart.plot(tree.fit, extra = 106)accuracy_tune <- function(fit) {
    predict_unseen <- predict(tree.fit, data.test.tree, type = 'class')
    table_mat <- table(data.test.tree$sent, predict_unseen)
    accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
    accuracy_Test
}
control <- rpart.control(minsplit = 20,
    minbucket = round(20 / 3),
    maxdepth = 30,
    cp = 0)
tune_fit <- rpart(sent~., data = data.train.tree, method = 'class', control = control)
accuracy_tune(tune_fit)## [1] 0.8066667# Predict using the tree model on the test data
predictions <- predict(tree.fit, newdata = data.test.tree, type = "class")
# Create the confusion matrix
confusion_matrix <- confusionMatrix(predictions, data.test.tree$sent)
print(confusion_matrix)## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4266  904
##          1  256  574
##                                           
##                Accuracy : 0.8067          
##                  95% CI : (0.7964, 0.8166)
##     No Information Rate : 0.7537          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3892          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9434          
##             Specificity : 0.3884          
##          Pos Pred Value : 0.8251          
##          Neg Pred Value : 0.6916          
##              Prevalence : 0.7537          
##          Detection Rate : 0.7110          
##    Detection Prevalence : 0.8617          
##       Balanced Accuracy : 0.6659          
##                                           
##        'Positive' Class : 0               
## The best predictive model for Modelling Component 2 is the third model: Linear Discriminant Analysis. To justify my point, the criterion for the predictions is Correct Classification Rate, the third model provides the highest Correct Classification Rate of 0.8123. This means that the prediction accuracy is the highest, providing a better predictive model.
test.sent <- test
lda.model.pred.value <- predict(lda.model, newdata = test)
prediction.lda.values <- lda.model.pred.value$x
test_pred_lda <- factor(ifelse(prediction.lda.values > 0.5, "True", "False"))
test.sent$sent <- test_pred_lda
test.sent$sent <- recode_factor(test.sent$sent, 'False' = '0', 'True' = '1')
test.sent$sent <- recode_factor(test.sent$sent, '0' = 0,  '1' = 1)
lda.model.sub <- subset(test.sent, select = c(id, sent))
write.csv(lda.model.sub,"C:\\Users\\Yu Heng\\Documents\\Archive\\ANU\\2022 SEM 1\\STAT3040\\Assignment 3\\lda_predicted_sent.csv",row.names = FALSE)For this report, there are limitations when using single imputation. Single imputation value is treated as the true value and do not report the uncertainty towards the prediction of the missing values. To fill in the missing values of this dataset, multiple imputation method should be used as it is more advantageous. To justify my point, multiple imputation considers the variance estimation and interval estimation for the interested parameters. As multiple imputation method can be tedious, with the lack of great knowledge towards the multiple imputation method, single imputation was used in this report.
In conclusion, for both modelling components in this report, we find sq.m.h and bathrooms play an influential role on the predicted results. Several factors can determine the price of the house and the satisfaction score.