First, I export the (already downloaded) dataset from Excel.
excel_file <- "C:\\Learning R\\Data for analysis\\housing_price_dataset.xlsx"
sheet_names <- excel_sheets(excel_file)
print(sheet_names)
my_data <- read_excel(excel_file, sheet = "housing_price_dataset"); str(my_data)
The following tables are the contents of the data frame we’ve just exported and its summary statistics.
kable(head(my_data), caption = paste("The exported housing data in a table format ", "(total number of rows: ", nrow(my_data), ")")) %>%
kable_styling(full_width = FALSE)
SquareFeet | Bedrooms | Bathrooms | Neighborhood | YearBuilt | Price |
---|---|---|---|---|---|
2126 | 4 | 1 | Rural | 1969 | 215355.3 |
2459 | 3 | 2 | Rural | 1980 | 195014.2 |
1860 | 2 | 1 | Suburb | 1970 | 306891.0 |
2294 | 2 | 1 | Urban | 1996 | 206786.8 |
2130 | 5 | 2 | Suburb | 2001 | 272436.2 |
2095 | 2 | 3 | Suburb | 2020 | 198208.8 |
## SquareFeet Bedrooms Bathrooms Neighborhood
## Min. :1000 Min. :2.000 Min. :1.000 Length:50000
## 1st Qu.:1513 1st Qu.:3.000 1st Qu.:1.000 Class :character
## Median :2007 Median :3.000 Median :2.000 Mode :character
## Mean :2006 Mean :3.499 Mean :1.995
## 3rd Qu.:2506 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :2999 Max. :5.000 Max. :3.000
## YearBuilt Price
## Min. :1950 Min. :-36588
## 1st Qu.:1967 1st Qu.:169956
## Median :1985 Median :225052
## Mean :1985 Mean :224827
## 3rd Qu.:2003 3rd Qu.:279374
## Max. :2021 Max. :492195
In order to prepare the data for the upcoming analysis, I address several issues that I’ve noticed in the two tables. I change the names of the columns to a more convenient format using clear_names() function from “janitor” package, turn the rows of the neighbourhood column into factor variables, and eliminate the column(s) containing negative prices, which apparently resulted from errors during data collection. Additionally, I remove rows with NA values.
I prefer to always keep the original dataset as a separate object, so in case of errors or mistakes in data manipulatons the original data can be re-used without needing to rethrive it from external sourse. In the code below, I create a new object called ‘data’ which contains the original data set but with the highlighted issues addressed.
For this project, we’re interested in the rows that correspond to “Rural” factor. Before separating those rows, however, let’s study the other two factors in a little more detail.
ggplot(data, aes(neighborhood)) + geom_bar() + ggtitle("The number of houses in each neighbourhood")
On the histogram, we can see that our dataset contains an approximately equal number of datapoints in each factor.
Our prime interest is the relationship between the house price and its square footage. Let’s see how this relationship varies depending on the neighbourhood of the house.
ggplot(data, aes(x = square_feet, y = price,
color = neighborhood, shape = neighborhood)) +
geom_point(alpha = 0.3)
This plot shows us the multivariate relationship between the type of neighbourhood, square footage of the house, and its price. Its purpose is to address questions such as: do the houses located in the rural area have higher square footage and lower prices than those located in the urban area? (etc.) The red circles represent the houses located in the rural area, and their respective price and square footage parameters are reflected on the axes. Similarly, the green triangles represent houses in suburban area, and blue squares - those in urban area.
As we can see, there’s no visible distinction between either of the three factors, which leads me to a thought that our synthetic dataset might not accuratly reflect the characteristics of the target population. It also might indicate that the results of our linear regression might be quite similar for each of the three factors, although it is risky to make such assumptions with such a large dataset.
In any case, we’re now going to focus on the houses located in the rural area.
*We will conduct a simple linear regression analysis with the house price being a dependent variable, and the square footage - an independent variable. The goal is to determine the intercept, the coefficient (aka slope term), and, of course, to see if our model of predicting the house prices is valid with the help of our to-be-defined test set.
We're going to use the following equation:*
Where: * Y = house price * X = house square footage * β0 = intercept * β1 = coefficient representing the linear relationship * ϵ = a random error term
###1. Preparing the data
To begin with, we’ll create a new data table by excluding rows where the factor is not “Rural”. In addition, I’m going to delete the column that contains the factor name, because now it contains the same character in every row.
rural1 <- data %>%
filter(neighborhood == "Rural") %>%
mutate(neighborhood = factor(neighborhood))
rural1 <- rural1 %>%
select(-neighborhood)
Exploratory analysis typically involves the use of a training set to discover potential relationships, while a test set is employed to evaluate the effectiveness of the discovered relationships. In the next code chunk, I set a seed to 103 for reproducibility of my analysis, and use the conventional 60/40 split, where I train my model on 60% of my data and test it on the remaining 40%.
set.seed(103)
sample <- sample(c(TRUE, FALSE), nrow(rural1), replace = T, prob = c(0.6,0.4))
train_data <- rural1[sample, ]
test_data <- rural1[!sample, ]
With the help of lm() function, I produce the best-fit linear relationship for the data which minimizes the least-squares criterion.
mod <- lm(price ~ square_feet, data = train_data)
tidy_sum <- tidy(mod)
round_tidy_sum <- tidy_sum %>%
mutate_if(is.numeric, round, digits = 2)
print(round_tidy_sum)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 26124. 1818. 14.4 0
## 2 square_feet 99.0 0.87 113. 0
The results suggest that our p-value (click the arrow in the top right corner of the table to scroll) is close to zero, whcih means that the values of the intercept and the coefficient are statistically significant. The value of the intercept estimate is not practically meaningful in the context of out data (there’s no house with 0 square feet). The coefficient for square feet of 99.03 is more insightful. It tells us that, on average, each additional square foot leads to an increase of $99.03 in price.
With these insights, our model takes the form of:
Next, I access the 95% confidence interval by using confint() function.
## 2.5 % 97.5 %
## (Intercept) 22561.24 29686.95
## square_feet 97.32 100.75
It suggests that, with a 95% confidence, an increase of 1 square feet in the house area raises its price by $97 – $101. Together, this data suggests that we have a relationship between the area and price of the house.
It is important not only to comprehend quantitative metrics related to our model accuracy, but also to access our model visually for more useful insights.
ggplot(rural1, aes(x = square_feet, y = price, )) +
geom_point(alpha = 0.3) +
geom_smooth(method = 'lm') +
geom_smooth(se = FALSE, color = 'red') +
ggtitle("Vizualization of the linear regression model")
On this plot, the linearity of our model (blue line) and a non-linear LOESS model are compared, which helps ensure that our modeling approach aligns with the structure of the training data and allows us to deciside whether the linear regression is appropriate. Clearly, the red and blue lines align very closely, so the linear model adequately captures the underlying relationship between our variables.
Next, we create the residuals vs. fitted values plot, and our primary goal is to find out if the model is linear and if the error terms have constant variance.
mod_res <- augment(mod, train_data)
ggplot(mod_res, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE) +
ggtitle("Residuals vs Fitted")
Judging by the blue line produced, we can conclude that the assumption of linearity is very accurate. Also, looking at the general shape of our residuals, we don’t see any gaps, a funnel, etc., which suggests that the spread of the residuals is consistent across the range of predicted values, in other words, that the assumption of homoscedasticity is accurate. Also, let’s check if we have any outliers and if the residuals are spread equally along ranges of predictors. The first graph is the same as “Residual vs. fitted values” graph, except for the y-axis.
res1 <- ggplot(mod_res, aes(.fitted, .std.resid)) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE) +
ggtitle("Standartized residuals vs \nfitted")
res2 <- ggplot(mod_res, aes(.fitted, sqrt(.std.resid))) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE) +
ggtitle("Scale-location plot")
grid.arrange(res1, res2, nrow = 1)
The first graph shows us where the residuals derivate by 1-4 standard derivations (standart devations are reflected on the y axis). If we consider a residual that deviates from the predicted value by over 3 standard deviations an outlier, then I can count around 12 outliers. The plot on the right is needed to to double-check the assumption of homoscedasticity, and better highlight any patterns in the variance. We see a horizontal line and, once again, no clear patter of the residuals, which confirms our claim that the error terms have constant variance.
As I mentioned, we seem to have some outliers. Let’s determine how much influence these outliers have on our regression line by using Cook’s Distance plot and residuals vs. leverage plot. These plots help us find influential data points which, when removed, might significantly alter our regression line.
## Warning in plot.window(...): "alpha" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "alpha" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "alpha" is not a
## graphical parameter
## Warning in box(...): "alpha" is not a graphical parameter
## Warning in title(...): "alpha" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "alpha" is not a
## graphical parameter
We definitely see a few outstanding points which are both to the upper right of our Residuals vs. Leverage plot. However, how influential these points really are? To determine it, I create a new tibble without the points with high Cook’s distances, and build a new residuals plot with the outliers removed, side-by-side with our old residuals plot:.
trsh <- 4/ length(cooks.distance(mod))
filtr_data <- train_data %>%
filter(cooks.distance(mod) <= trsh)
resid_plot <- ggplot(mod_res, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE) +
ggtitle("Residuals vs Fitted, with influential points")
mod_cook <- lm(price ~ square_feet, data = filtr_data)
mod_res_cook <- augment(mod_cook, filtr_data)
cook_res_plot <- ggplot(mod_res_cook, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.2) +
geom_smooth(se = FALSE) +
ggtitle("Residuals vs Fitted, influential points removed")
grid.arrange(resid_plot, cook_res_plot, nrow = 1)
We see that the residuals plot without outliers violates the assumption of homoscedasticity. Apparently, the outliers played a significant role in stabilizing the variance of our model. Since the assumption of homoscedasticity is crucial in linear regression analysis, I will stick to the original data that includes outliers.
Now, let’s access the normality of the residuals by building a Q-Q plot. If the points are falling directly on the diagonal line, we can say that the residuals are normally distributed.
The normal Q-Q plot looks great, so our data follows normal distribution.
Next, let’s analyze how accurate the model is. To begin with, we can access RSE (an estimate of a standard deviation of a random error term) using sigma() function.
## [1] 50072.13
RSE of our model is 50159.9, which means that the actual house price will derivate from out predicted price by $50,159.9. To find out if this is a significant number, we calculate the percentage error: RSE/mean price.
## [1] 0.2233276
It equals to 22.4% which is fair. For more insights, further analysis is needed. Another way to measure the fitness of our model is to find the R^2 statistic. In our case, R^2 statistic is the proportion of variation in price that can be explained by the area of the house, in square feet. I calculate it using rsquare() function.
## [1] 0.5639178
It equals to 56%, which means that 56% of variability in price is explained by the variation in square feet of a house. It is a rather moderate result, so I will further test the accuracy of the model by accessing f-statistic value and the p-value (both are found on the bottom of the model summary table).
##
## Call:
## lm(formula = price ~ square_feet, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179631 -34518 -63 33801 227403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.612e+04 1.818e+03 14.37 <2e-16 ***
## square_feet 9.903e+01 8.746e-01 113.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50070 on 9914 degrees of freedom
## Multiple R-squared: 0.5639, Adjusted R-squared: 0.5639
## F-statistic: 1.282e+04 on 1 and 9914 DF, p-value: < 2.2e-16
F-statistic value is 1.266*10^0.4, while our p-value is 2.210^-16. The results of our tests suggest that the accuracy of the produced model is moderate.
Finally, let’s make predictions on the house price based on its square feet. To begin with, I’ll access how the model performs on making predictions against our test data set (40 % of the initial data set). To find out if the predictions are accurate, I compute the mean squared prediction error for my test and training data.
mse_test <- test_data %>%
add_predictions(mod) %>%
summarise(MSE = mean((price - pred)^2))
mse_train <- train_data %>%
add_predictions(mod) %>%
summarise(MSE = mean((price - pred)^2))
cat('MSE for test data: ', mse_test$MSE,
'\nMSE for train data: ', mse_train$MSE, '.\n',
all.equal(mse_test, mse_train))
## MSE for test data: 2529908469
## MSE for train data: 2506712903 .
## Component "MSE": Mean relative difference: 0.00916854
As we can see, the difference is not that significant, which indicates that the model is performing similarly on both the training and test datasets. It is also a sign that the model is not biased towards either the test nor training data, and that is can be described as stable, in a sense that it is not very sensitive to variations in input data.