Local intelligence is a strong predictor of housing prices in Philadelphia and can sharpen Zillow’s popular and consequential Zestimate tool. Real estate professionals, home buyers, and sellers alike trust Zestimate to be an accurate prediction of local real estate values. Investing in the tool’s reliability stands to benefit not only Zillow’s customers but also the company’s dominant market position.
The following study explores a series of local intelligence variables ranging from tree-lined streets to local median income. A series of statistical tests reveals a handful of key variables that are then fed into prediction models. The best prediction model is then tested against 2023 sale price data for accuracy. The study is mindful of creating models that are generalizable across Philadelphia, and hopefully across major metro areas throughout the country.
First, the sales data itself is a tremendous resource on the internal characteristics of each property, which certainly influence price. This study leverages data such as year built, total square footage, and number of bathrooms and bedrooms to holistically understand what influences price.
However, the property itself does not tell the full story. Where it is located and the characteristics of that area are key, too. There are myriad local intelligence variables to choose from across a number of key open data sources (2021-2023), including the U.S. Census American Community Survey, Pennsylvania Spatial Data Access (PASDA), the Philadelphia Open Data Portal, and the Delaware Valley Regional Planning Commission (DVRPC). This study prioritized complete datasets that affect the city at large. For example, litter is a systemic issue whereas airports are very localized. The following table breaks down viable sources, from which we selected the local intelligence variables that we deemed generalizable for the purposes of this study:
# Create a data frame of candidate datasets and whether we judged each to be generalizable
data_frame <- data.frame(
"Dataset" = c(
"Floodplain Data",
"Sanitation Centers",
"Litter Index",
"Demographics and Socioeconomic Data",
"Housing Typology",
"Trees",
"Neighborhood",
"Proximity to Highways",
"School Catchment Performance",
"Walkscore",
"1937 Redlining Score"
),
"Generalizability" = c(
"", # Floodplain Data
"", # Sanitation Centers
"✔", # Litter Index
"✔", # Demographics and Socioeconomic Data
"", # Housing Typology
"✔", # Trees
"✔", # Neighborhood
"✔", # Proximity to Highways
"", # School Catchment Performance
"", # Walkscore
"✔" # 1937 Redlining Score
)
)
dt_url <- c("https://data-phl.opendata.arcgis.com/datasets/phl::fema-100-flood-plain/explore",
"https://opendataphilly.org/datasets/sanitation-areas/",
"https://opendataphilly.org/datasets/litter-index/",
"https://www.census.gov/",
"https://www.census.gov/",
"https://opendataphilly.org/datasets/phillytreemap/",
"https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=7047",
"https://dvrpc-dvrpcgis.opendata.arcgis.com/datasets/dvrpcgis::freight-highway-network/explore",
"https://www.philasd.org/performance/programsservices/open-data/school-information/",
"http://walkscore.com/",
"https://ncrc.org/holc/")
data_frame %>%
mutate(Dataset = cell_spec((Dataset), "html", link = dt_url)) %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE,
position = "left") %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
)
From highways to rivers, man-made and natural boundaries alike delineate neighborhoods where certain people and types of businesses congregate. Neighborhood boundaries are arguably more organic than census block groups (which are optimized for consistent populations) or highways. Additionally, because residential sales cluster within residential neighborhoods rather than census block groups, neighborhoods (as defined by the Philadelphia Planning Department) are likely a significant source of local intelligence. You can see where areas of high and low income cluster across Philadelphia, and how these boundaries are more recognizable at a neighborhood level than at a block group level.
# plot sales data using median cut off
sales_plot_mod <- sales2023 %>%
filter(toPredict == "MODELLING")
median_sale_price <- median(sales_plot_mod$sale_price)
custom_colors <- ifelse(sales_plot_mod$sale_price <= median_sale_price, "#A5C4D4", "#4B7740")
sales_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
labs(title = "Block Groups", color = "") +
theme(legend.position = "none") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(subtitle = "2021 Philadelphia Median \nSale Price is $229,900") +
mapTheme()
#Bring in neighborhoods
hoods <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/PhillyPlanning_Neighborhoods/PhillyPlanning_Neighborhoods.shp") %>%
select(NAME) %>%
rename(Neighborhood = NAME)
sales2023 <- st_join(sales2023, hoods, join = st_within)
hood_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray90", color = "gray90") +
geom_sf(data = hoods, color = "gray60") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
labs(title = "Neighborhoods") +
theme(legend.position = "none") +
guides(color = guide_legend(override.aes = list(size = 5))) +
labs(subtitle = "Neighborhoods as defined by Philadelphia \nPlanning Dept") +
mapTheme()
#Bring in highways and apply buffer of 1/8 mile (660')
highways <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Freight_Highway_Network.geojson")
highways <- highways %>%
filter(state == "PA") %>%
st_transform('EPSG:2272')
highways_phl <- st_intersection(highways, blocks21.sf) %>%
select(geometry)
highway_buffer <- st_buffer(highways_phl, 660)
highway_buffer$hwy_grade <- 1
# Make a dummy variable where 1 is a house within highway buffer and 0 is outside of buffer.
saleshighways <- st_intersection(sales2023, highway_buffer)
homes_near_hwys <- saleshighways$objectid #creates a list using unique ids from saleshighways
sales2023 <- sales2023 %>%
mutate(hwy_grade = ifelse(objectid %in% homes_near_hwys, 1, 0))
hwy_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
geom_sf(data = highway_buffer, aes(fill = "#36151E", color = "#36151E")) +
scale_fill_identity(name = "Highway", labels = "#36151E") +
scale_color_identity(name = "Highway", labels = "#36151E") +
guides(color = guide_legend(override.aes = list(size = 5))) +
labs(title = "Highway Proximity") +
labs(subtitle = "The highway buffer is 660 feet, \nabout 2 Philadelphia blocks.") +
mapTheme()
#plot all three together
plot_grid(sales_plot, hood_plot, hwy_plot, ncol = 3, align = "v")
It is important to be aware of how the sales are distributed around the city. There are many block groups without any housing sales, whereas there are fewer neighborhoods where this is true because neighborhoods are residential in nature. Geographically, because neighborhoods generally follow residential versus industrial zoning lines, it is easier to see why the neighborhood variable generalizes across the city.
Interestingly, when considering the impact of highways in Philadelphia, it would be reasonable to assume the variable is not generalizable. However, given how consistently home prices drop below the median in proximity to the highway, regardless of whether the sale is in Center City or Port Richmond, we decided that it is a variable worth further investigation.
In the 1930s, a federal agency, the Home Owners’ Loan Corporation (HOLC), created maps of mortgage risk that denied entire neighborhoods access to public lending benefits and overall economic opportunity. This institutionalized marginalization of poor, Black communities ingrained persistent residential segregation that is still evidenced by income inequality across the city.
# Get and wrangle redlining data
redlining.sf <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Redlining.geojson") %>%
st_transform('EPSG:2272') %>%
filter(state == "PA", city == "Philadelphia")
redlining.sf <- redlining.sf %>%
dplyr::select(-c(holc_id, state, city, neighborhood_id, area_description_data, name)) %>%
drop_na()
# Join each house with historic redlining grade
sales2023 <- st_join(sales2023, redlining.sf, join = st_within)
sales2023 <- sales2023 %>%
dplyr::mutate(holc_grade = ifelse(is.na(holc_grade) == TRUE, "NA", holc_grade))
# Plotting Redlining Maps
redlining_colors <- c("NA" = "gray85", "A" = "#197991", "B" = "#4B7740", "C" = "#E3D26F", "D" = "#D52F4D")
redlining_palette <- c("#197991", "#4B7740", "#E3D26F", "#D52F4D", "#A9233B")
redlining.sf.phl <- st_intersection(redlining.sf, blocks21.sf) #clip redlining to PHL boundaries
Redlining_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = redlining.sf.phl, aes(fill = holc_grade, text = holc_grade), alpha = 0.8) +
scale_fill_manual(values = redlining_colors, name = "HOLC Grade") +
labs(title = "1937 HOLC Redlining", caption = "by 2020 Block Group") +
theme(legend.position = "right") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
Income_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "grey70") +
geom_sf(data = sales2023, aes(colour = q5(MedHHInc)),
show.legend = "point", size = .5, alpha = 0.5) +
scale_colour_manual(values = rev(redlining_palette),
labels=qBr(sales2023,"MedHHInc"),
name="Income\nBrackets") +
labs(title="2023 Median Incomes", caption = "by 2020 Block Group") +
theme(legend.position = "right") +
guides(color = guide_legend(override.aes = list(size = 3))) +
mapTheme()
plot_grid(Redlining_plot, Income_plot, ncol = 2, align = "v")
NCRC states that “[t]here is significantly greater economic inequality in cities where more of the HOLC graded high-risk or ‘Hazardous’ areas are currently minority neighborhoods” (NCRC). The impact of HOLC is generalizable. One exception to note is Center City, which has many high-income earners despite its high-risk grade, likely due to post-war downtown revitalization.
When considering the patterns of redlining and neighborhoods, it is likely that neighborhoods such as Center City will account for divergences in the redlining data, and neighborhoods such as Kensington will account for consistencies in the redlining data when we develop our definition of local intelligence.
## K NEAREST NEIGHBOR: PRICE
sales2023 <- sales2023 %>%
mutate(nb = st_knn(geometry, k = 5),
wt = st_weights(nb),
price_lag = st_lag(sale_price, nb, wt))
Price Lag: To systematically investigate price clustering beyond redlining and in today’s data, a k-nearest neighbor function calculates a weighted average of the sale prices of the 5 nearest properties. This price lag variable accounts for Philadelphia’s block-by-block dynamics, borne from its density and long history in which uses from mansions to power plants have converged within mere feet of one another - and which can drastically impact sale price.
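As a rough illustration of what the `st_knn`/`st_lag` call above produces, the sketch below computes the same idea manually for a small, arbitrary subset of sales: for each property, the equal-weight average sale price of its 5 nearest neighbors. This mirrors the default row-standardized weights for a k-nearest-neighbor list; the 500-sale subset and base-R distance matrix are illustrative assumptions, not part of the modeling pipeline.
# Minimal sketch: manual price lag for a small subset of sales (illustration only)
sub <- sales2023[1:500, ]   # arbitrary subset keeps the pairwise distance matrix small
xy <- st_coordinates(sub)   # point coordinates in EPSG:2272 (feet)
d <- as.matrix(dist(xy))    # pairwise distances between sales
diag(d) <- Inf              # a property is not its own neighbor
manual_lag <- apply(d, 1, function(row) {
  mean(sub$sale_price[order(row)[1:5]]) # equal-weight average of the 5 nearest sale prices
})
# Values only roughly match price_lag above because neighbors here are limited to the subset.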
Proximity to amenities and distance from disamenities certainly have a colloquial effect on housing prices. Broker descriptions love to tout “2 Blocks from Park!” or “Tucked away, quiet street!” To test the effect of environmental data, the study includes tree density and litter conditions.
## TREES by BLOCK GROUP
trees.sf <- st_read("https://opendata.arcgis.com/api/v3/datasets/30ef36e9e880468fa74e2d5b18da4cfb_0/downloads/data?format=geojson&spatialRefId=4326") %>%
st_transform('EPSG:2272')
trees.sf <- st_join(trees.sf, blocks21.sf, join = st_within)
trees.sf <- trees.sf %>%
select(c(OBJECTID, TREE_NAME, YEAR, GEOID))
trees.sf <- trees.sf %>%
group_by(GEOID) %>% # census block group
summarize(trees = n())
blocks21.sf.trees <- st_join(blocks21.sf, trees.sf, join = st_intersects) # spatial join attaches tree counts to block groups (st_join has no `by` argument)
blocks21.sf.trees <- mutate(blocks21.sf.trees, trees = ifelse(is.na(trees), 0, trees)) %>%
select(trees, GEOID.y)
sales2023 <- sales2023 %>%
st_join(blocks21.sf.trees, join = st_intersects)
trees_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
stat_density2d(data = data.frame(st_coordinates(trees.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 30, geom = 'polygon') +
scale_fill_gradient(
high = "#4B7740",
low = "#ABCDA2",
name = "",
breaks = c(0.0000000001, 0.000000001), # Set the breaks for low and high
labels = c("low", "high")) + # Set the corresponding labels
scale_alpha(range = c(0, .7), guide = "none") +
labs(title = "Tree Density", color = "") +
theme(legend.position = "bottom") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(caption = "Tree Density is measured by \n number of trees in the block group") +
mapTheme()
# NEAREST LITTER TO HOUSE
litterindex <- st_read("https://opendata.arcgis.com/datasets/04fa63e09b284dbfbde1983eab367319_0.geojson") %>%
st_transform('EPSG:2272') %>%
rename(litterscore = HUNDRED_BLOCK_SCORE) %>%
rename(littergrade = SCORE_COLOR)
nearest_litter <- st_nearest_feature(sales2023, litterindex)
sales2023$litterscore <- litterindex$litterscore[nearest_litter]
sales2023$littergrade <- litterindex$littergrade[nearest_litter]
sales2023 <- sales2023 %>%
mutate(litterscore = round(litterscore, 2))
litter_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
stat_density2d(data = data.frame(st_coordinates(litterindex)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#E3D26F", high = "#36151E", name = "",
breaks = c(0.0000000002, 0.0000000006),
labels = c("low", "high")) +
scale_alpha(range = c(0, .7), guide = "none") +
labs(title = "Litter Score", color = "") +
theme(legend.position = "bottom") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(caption = "Litter Score is based on the Litter Index \ndeveloped by the City of Philadelphia") +
mapTheme()
# plot two maps
plot_grid(trees_plot, litter_plot, ncol = 2, align = "v")
Interestingly, these two environmental data points may send mixed signals: where there are more trees, there is also more litter. Since tree density and litter conditions send strong visual cues to potential buyers about the level of care in a neighborhood, the impact of these variables may strengthen or dilute the model’s results.
“All models are wrong, but some are useful.” - George Box
Zestimate seeks to predict housing prices. Predicting city trends is fraught. Oftentimes unpredictable political decisions, such as post-war state-sponsored suburbanization (aka white flight), buck trend predictions. No model can account for every unforeseeable scenario. However, as previously discussed, useful models are generalizable. Building a useful prediction model requires feature selection and feature engineering. These features are then used to train a model to recognize patterns associated with sale price. The predictive model uses those patterns to produce a Zestimate. While the prices of nearby sales are likely a fine starting point for building Zestimate, including local intelligence will certainly strengthen the product.
This study uses the hedonic model (reference), a theoretical framework that deconstructs housing prices into three categories: internal characteristics, amenities/public services, and the spatial process of prices.
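Concretely, the framework can be expressed as a stylized regression formula. The sketch below groups variables used later in this study into the three categories; it is an illustration of the framework, not the exact specification that is ultimately fit.
# Stylized hedonic decomposition (sketch): price as a function of three variable groups
hedonic_sketch <- sale_price ~
  total_livable_area + number_of_bathrooms + year_built + # internal characteristics
  trees + litterscore + MedHHInc + Neighborhood +          # amenities / public services
  price_lag                                                # spatial process of prices
# e.g., lm(hedonic_sketch, data = as.data.frame(philly.training))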
# SELECTING VARIABLES TO INCLUDE IN MODEL
## PREPPING SALES DATA INTO MODEL V CHALLENGE
sales2023.mod <- sales2023 %>%
filter(toPredict == "MODELLING")
sales2023.chal <- sales2023 %>%
filter(toPredict == "CHALLENGE")
all_var <- c("objectid", "census_tract","GEOID", "toPredict","sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")
sales2023.select <- sales2023.mod %>%
dplyr::select(all_of(all_var)) %>%
na.omit()
sales2023.summary <- sales2023.select %>%
st_drop_geometry() %>%
select(-c(objectid, census_tract, GEOID, toPredict))
# Tables of summary statistics, sorted by category (internal characteristics, amenities/public services, spatial structure), built with the `stargazer` package
categories_to_include <- c("Amenities / Public Services", "Internal Characteristic", "Spatial Process")
#subsetting categories
#internal characteristics
InternalCharacteristics <- sales2023 %>%
dplyr::select("fireplaces", "garage_spaces",
"number_of_bathrooms", "number_of_bedrooms", "number_stories",
"total_area", "total_livable_area", "year_built", "historic", "sale_price")
stargazer(as.data.frame(InternalCharacteristics), type="text", digits=1, title = "Internal Characteristics", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Internal Characteristics
=====================================================
Statistic Mean St. Dev. Min Max
-----------------------------------------------------
fireplaces 0.05 0.3 0 5
garage_spaces 0.4 0.7 0 72
number_of_bathrooms 1.1 0.7 0 8
number_of_bedrooms 2.6 1.3 0 31
number_stories 2.0 0.6 1 5
total_area 1,828.9 3,818.9 0 226,512
total_livable_area 1,333.6 548.8 0 15,246
year_built 1,936.7 51.1 0 2,024
historic 1.9 0.4 1 3
sale_price 271,846.8 240,083.4 0 5,990,000
-----------------------------------------------------
#amenities/public services
Amenities_Public_Services <- sales2023 %>%
dplyr::select("Neighborhood", "trees", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "litterscore")
stargazer(as.data.frame(Amenities_Public_Services), type="text", digits=1, title = "Public Services and Disamenities", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Public Services and Disamenities
===============================================
Statistic Mean St. Dev. Min Max
-----------------------------------------------
trees 116.7 112.9 2 2,136
TotalPop 1,445.2 708.6 0 4,518
MedHHInc 56,282.7 39,156.1 0 250,001
pctdetached 9.0 12.7 0.0 100.0
pctmultifam_small 12.4 12.2 0.0 75.0
litterscore 1.8 0.6 0.0 4.0
-----------------------------------------------
#spatial process
Spatial <- sales2023 %>%
dplyr::select("price_lag")
stargazer(as.data.frame(Spatial), type="text", digits=1, title = "Spatial Process of Prices", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Spatial Process of Prices
==================================================
Statistic Mean St. Dev. Min Max
--------------------------------------------------
price_lag 267,985.6 194,111.3 20,000.0 3,292,800.0
--------------------------------------------------
As we select and engineer features, here are some important takeaways from the summary tables above:
The goal is to select the variables that most significantly correlate with sale price for inclusion in the predictive model. Correlation is a type of association test. Are sale prices more closely associated with square footage or with number of bedrooms? These are the types of subtle but important distinctions we want to seek out.
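As a quick check of that specific question (a sketch outside the main pipeline, with total livable area standing in for square footage):
# Sketch: is sale price more associated with square footage or with bedroom count?
cor(sales2023.mod$sale_price, sales2023.mod$total_livable_area, use = "complete.obs")
cor(sales2023.mod$sale_price, sales2023.mod$number_of_bedrooms, use = "complete.obs")
# The correlation matrix below runs this comparison across all numeric variables at once.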
# CORRELATION MATRIX OF VARIABLES
numericVars <- sales2023.select %>%
st_drop_geometry() %>%
select(-toPredict, -GEOID, -objectid, -census_tract) %>%
select(-holc_grade, -hwy_grade, -historic, -Neighborhood) #remove categorical variables for corrtest
# Handle missing values before testing (remove rows with NAs)
if (sum(is.na(numericVars)) > 0) {
numericVars <- na.omit(numericVars) # Remove rows with missing values
}
price.corr <- corr.test(numericVars) # psych::corr.test returns pairwise correlations and p-values
correlation_matrix <- round(cor(numericVars), 1)
p_values <- cor_pmat(numericVars)
ggcorrplot(
correlation_matrix,
p.mat = p_values,
colors = c("#36151E", "white", "#197991"),
type = "lower",
insig = "blank"
) +
labs(title = "Degrees of Correlation to Sale Price") +
annotate(
geom = "rect",
xmin = .5, xmax = 16, ymin = .5, ymax = 1.5,
fill = "transparent", color = "red", alpha = 0.5
)
The correlation matrix revealed strong positive and negative correlations. These are the primary variables to note for further engineering:
It is important to note where variables correlate with each other as well. Price lag and median income had a very strong positive correlation with one another, as did number of bathrooms and number of bedrooms. To avoid multicollinearity, or two variables telling the same story, we will focus on number of bathrooms since it is also more correlated with sale price.
# str(sales2023.select)
# Home price correlation scatterplots for four open data variables of interest
#MedHHInc, trees, price_lag, litterscore
st_drop_geometry(sales2023.select) %>%
dplyr::select(sale_price, MedHHInc, trees, price_lag, litterscore) %>%
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#D52F4D") +
facet_wrap(~Variable, ncol = 2, scales = "free", labeller = labeller(Variable = c(price_lag = "Price Lag", MedHHInc = "Median Income", trees = "Tree Density", litterscore = "Litter Conditions"))) +
labs(title = "Price as a function of select variables") +
plotTheme()
The relationships between sale price and median income and between sale price and tree density are similarly positive, whereas litter is negative. The correlation between price lag and sale price is by far the strongest, which indicates a very tight fit (as supported by the values in the summary tables above).
# Create 3 maps of the 3 most interesting variables.
plot_MedHHInc <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales2023, aes(colour = q5(MedHHInc)),
show.legend = "point", size = .5) +
scale_colour_manual(values = rev(palette),
labels=qBr(sales2023,"MedHHInc"),
name="Median\nIncome") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
plot_price_lag <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales2023, aes(colour = q5(price_lag)),
show.legend = "point", size = .5) +
scale_colour_manual(values = rev(palette),
labels=qBr(sales2023,"price_lag"),
name="Nearby\nPrice") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
trees_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
stat_density2d(data = data.frame(st_coordinates(trees.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 50, geom = 'polygon') +
scale_fill_gradient(
low = "#ABCDA2",
high = "#4B7740",
name = "Tree Density") +
scale_alpha(range = c(0, .5), guide = "none") +
mapTheme()
plot_grid(plot_MedHHInc, plot_price_lag, trees_plot, align = "v", ncol = 2)
Spatially, different patterns emerge. For instance, Center City has high tree density, high median income, and high average nearby prices despite being historically marginalized by redlining in the 1930s. North Philadelphia experiences the inverse of these trends, as do many parts of West Philadelphia further from University City. Spatial exploration reinforces the breadth of variable selection.
#Filter out outliers discovered in summary statistics.
sales2023.select <- sales2023.select %>%
filter(sale_price <= 3500000, garage_spaces < 4, total_livable_area < 6500,
trees <= 1000)
After identifying correlated variables and looking at summary statistics, we also plotted the distribution of some key characteristics to identify any outliers possibly skewing the data. See Appendix A for scatterplots that visualize the outliers that were filtered out.
Data Filters:
The prediction model works by training on patterns in the relationship between sale price and our selected variables, then testing on a held-out portion of the data to see how well it performs. We use a 60:40 partition into training and test sets, respectively.
#Take note: holc_grade, historic, and Neighborhood are categorical. Stratifying the partition on these fixed effects ensures each category is represented in the training set.
inTrain <- createDataPartition(
y = paste(sales2023.select$holc_grade, sales2023.select$historic, sales2023.select$Neighborhood),
p = .60, list = FALSE)
philly.training <- sales2023.select[inTrain,]
philly.test <- sales2023.select[-inTrain,]
We established the variables of the model. Now, we need to formulate the best combination of those variables to predict sale price. To begin, we start with a simple model based only on price lag, and then add and subtract variables, analyzing whether they strengthen or weaken the model.
Price Lag Only Model:
Considering the strong association between sale price and price lag, we developed a price-lag-only regression. It is statistically significant (p-value < 0.05); however, its R-squared value indicates that only approximately 57.34% of the variation in the dependent variable, sale_price, is explained by price lag.
Local Intelligence Model:
Let’s see how including local intelligence variables impacts the regression. A regression that includes a ‘kitchen sink’ of variables from our data sources, including but not limited to fireplaces, number of bathrooms, redlining grade, trees, and litter, performs better, with an R-squared value indicating that approximately 73.44% of the variation in sale price is explained by local intelligence plus price lag.
#Price Lag only model
reg.int <- lm(sale_price ~ price_lag, data = as.data.frame(philly.training))
#Local Intelligence Regression
var_all <- c("sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")
reg_all <-
lm(sale_price ~ ., data = as.data.frame(philly.training) %>%
dplyr::select(var_all))
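For reference, the R-squared values quoted above can be read straight from the model summaries (exact figures vary slightly with the random training partition):
# Compare explanatory power of the two specifications
summary(reg.int)$r.squared # price-lag-only model (~0.57 in our run)
summary(reg_all)$r.squared # 'kitchen sink' local intelligence model (~0.73 in our run)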
To ensure that the Local Intelligence Regression does not suffer from multicollinearity, or multiple variables telling the same story about sale price, we use a series of tests. The first test, the variance inflation factor (VIF), indicates that Neighborhood and Redlining Grade have very high VIF scores relative to the other variables and are potentially conflicting. To further investigate these categorical (non-numeric) variables, we performed a chi-square test, which reveals that they do have high collinearity.
#test for multicollinearity of reg_all
library(car)
vif(reg_all)
#neighborhood (10221.05) and holc_grade (153.67) have very high vif scores compared to other variables.
#checking chisquare on categorical for collinearity
chi_test <- table(philly.training$Neighborhood, philly.training$holc_grade)
chisq.test(chi_test)
#remove holc_grade bc of high collinearity with Neighborhood
reg_slim <- lm(sale_price ~ Neighborhood + MedHHInc + price_lag + trees + number_of_bathrooms + total_livable_area + litterscore + number_stories + year_built, data = as.data.frame(philly.training))
Local Regression Slim Model:
The next regression iteration removed Redlining Grade as well as other variables that showed less association with sale price. This slimmed-down regression performed marginally worse than the ‘kitchen sink’, with an R-squared value of 72.97% of variance explained.
R-squared is not the only measure of fit. The second and third regressions have lower AIC than the price-lag-only model, a good sign of fit and parsimony, and of the value of including local intelligence generally. We also performed a likelihood ratio test, which shows a high chi-square for the ‘kitchen sink’ regression; so while both are statistically significant, the second regression seems to fit the sales data best.
AIC(reg.int, reg_all, reg_slim) #lower AIC compared to intercept is good sign of the models fit and leanness and meanness.
lrtest(reg.int, reg_all, reg_slim) #likelihood ratio test show a high chisq for reg_all so while both are stat significant, reg_all seems to fit the data best.
aic_table <- data.frame(
Regression = c("Price Lag Only", "Local Intel", "Local Intel Slim"),
AIC = c(373654.6, 367160.2, 367393.4))
chi_table <- data.frame(
Regression = c("Local Intel", "Local Intel Slim"),
Chi_Square = c(6818.45, 259.24),
Significance = c("<0.05 ***", "<0.05 ***")
)
Regression | AIC |
---|---|
Price Lag Only | 373654.6 |
Local Intel | 367160.2 |
Local Intel Slim | 367393.4 |
Regression | Chi_Square | Significance |
---|---|---|
Local Intel | 6818.45 | <0.05 *** |
Local Intel Slim | 259.24 | <0.05 *** |
With all this in mind, the predictive model will be trained with the second regression, which features a breadth of local intelligence features.
Here is an in-depth comparative table between regressions:
#compare kitchen sink and slim regressions
coef_all <- summary(reg_all)$coef
coef_slim <- summary(reg_slim)$coef
coef_df_all <- data.frame(
Model = c("reg_all"),
Coefficients = rownames(coef_all),
Estimate_all = coef_all[, 1],
Std.Error_all = coef_all[, 2],
p_value_all = coef_all[, 4])
coef_df_slim <- data.frame(
Model = c("reg_slim"),
Coefficients = rownames(coef_slim),
Estimate_slim = coef_slim[, 1],
Std.Error_slim = coef_slim[, 2],
p_value_slim = coef_slim[, 4]
)
left_merged_kable <- merge(coef_df_all, coef_df_slim, by = "Coefficients", all.x = TRUE) %>%
rename("Regression All" = Model.x, "Regression Slim" = Model.y)
kable(left_merged_kable, format = "html", caption = "Note: NAs are where variables were not included for both Regressions") %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "800px", height = "500px")
Coefficients | Regression All | Estimate_all | Std.Error_all | p_value_all | Regression Slim | Estimate_slim | Std.Error_slim | p_value_slim |
---|---|---|---|---|---|---|---|---|
(Intercept) | reg_all | -279476.7997428 | 49921.0374428 | 0.0000000 | reg_slim | -438474.3739475 | 46047.1181628 | 0.0000000 |
fireplaces | reg_all | 20055.8986415 | 3997.7262837 | 0.0000005 | NA | NA | NA | NA |
garage_spaces | reg_all | 19949.7716871 | 2494.9113408 | 0.0000000 | NA | NA | NA | NA |
historic | reg_all | 23109.8780154 | 2930.8283940 | 0.0000000 | NA | NA | NA | NA |
holc_gradeB | reg_all | -10835.7003932 | 9343.1614304 | 0.2461719 | NA | NA | NA | NA |
holc_gradeC | reg_all | -13681.0963243 | 10103.7213906 | 0.1757377 | NA | NA | NA | NA |
holc_gradeD | reg_all | -722.0059079 | 11025.4149394 | 0.9477883 | NA | NA | NA | NA |
holc_gradeNA | reg_all | -12411.2797850 | 11321.6474356 | 0.2729914 | NA | NA | NA | NA |
hwy_grade | reg_all | 11971.1364543 | 6119.6278170 | 0.0504632 | NA | NA | NA | NA |
litterscore | reg_all | -5267.4098876 | 1854.3596425 | 0.0045101 | reg_slim | -6582.2861356 | 1864.9903469 | 0.0004179 |
MedHHInc | reg_all | 0.2242095 | 0.0365627 | 0.0000000 | reg_slim | 0.2265662 | 0.0363103 | 0.0000000 |
NeighborhoodALLEGHENY WEST | reg_all | -65699.0082409 | 18729.5667850 | 0.0004533 | reg_slim | -78617.7264519 | 17418.9565627 | 0.0000064 |
NeighborhoodANDORRA | reg_all | -53626.6958289 | 26254.8610315 | 0.0411165 | reg_slim | -44715.5140164 | 25635.7775496 | 0.0811355 |
NeighborhoodASHTON | reg_all | -28074.8646915 | 27055.1471125 | 0.2994323 | reg_slim | -21237.8145788 | 27036.5642300 | 0.4321610 |
NeighborhoodBELLA VISTA | reg_all | 109341.7415589 | 20444.3351888 | 0.0000001 | reg_slim | 109732.1594553 | 18900.9530092 | 0.0000000 |
NeighborhoodBELLS CORNER | reg_all | -11101.2503662 | 20319.8413197 | 0.5848509 | reg_slim | 472.3591262 | 19253.3601216 | 0.9804271 |
NeighborhoodBELMONT | reg_all | -104943.0995189 | 23728.9404496 | 0.0000098 | reg_slim | -99477.3256346 | 22529.7147042 | 0.0000102 |
NeighborhoodBLUE BELL HILL | reg_all | -19560.9212990 | 43516.3565427 | 0.6530727 | reg_slim | -29843.8683334 | 42495.9151292 | 0.4825187 |
NeighborhoodBREWERYTOWN | reg_all | -37155.0836387 | 19201.3073030 | 0.0530071 | reg_slim | -40248.2425404 | 17662.9162070 | 0.0227008 |
NeighborhoodBRIDESBURG | reg_all | -31590.5782913 | 20265.8118929 | 0.1190638 | reg_slim | -37506.9338406 | 19105.0689525 | 0.0496437 |
NeighborhoodBROOKHAVEN | reg_all | -25643.5512275 | 37037.4650217 | 0.4887179 | reg_slim | -36185.3979561 | 37106.0231729 | 0.3294833 |
NeighborhoodBURHOLME | reg_all | -24016.7373695 | 19790.1842027 | 0.2249333 | reg_slim | -15227.1161084 | 18762.0188575 | 0.4170393 |
NeighborhoodBUSTLETON | reg_all | -9641.0202861 | 15771.9220198 | 0.5410259 | reg_slim | -5478.2779982 | 15888.7496221 | 0.7302577 |
NeighborhoodBYBERRY | reg_all | -112753.9122773 | 42597.9178231 | 0.0081317 | reg_slim | -90519.4614556 | 42731.0380078 | 0.0341628 |
NeighborhoodCABOT | reg_all | -66390.5445712 | 32361.7174698 | 0.0402356 | reg_slim | -61102.1363083 | 31623.8571900 | 0.0533609 |
NeighborhoodCALLOWHILL/CHINATOWN NORTH | reg_all | 64796.7274978 | 52473.9379988 | 0.2169123 | reg_slim | 42291.2166366 | 52562.5805279 | 0.4210713 |
NeighborhoodCARROLL PARK | reg_all | -86801.7567426 | 19289.6556555 | 0.0000069 | reg_slim | -93463.2521295 | 18069.1847659 | 0.0000002 |
NeighborhoodCASTOR GARDENS | reg_all | -48306.9491716 | 17752.2230931 | 0.0065131 | reg_slim | -42782.4758428 | 16436.3163278 | 0.0092531 |
NeighborhoodCATHEDRAL PARK | reg_all | -111863.9583836 | 32949.0487008 | 0.0006881 | reg_slim | -109757.2304376 | 32522.5898011 | 0.0007407 |
NeighborhoodCECIL B MOORE | reg_all | -125200.1873270 | 37616.8907321 | 0.0008761 | reg_slim | -128530.2754965 | 37044.9827495 | 0.0005229 |
NeighborhoodCEDAR PARK | reg_all | 4384.1862665 | 22747.0747485 | 0.8471683 | reg_slim | -7722.5142192 | 21661.3460154 | 0.7214631 |
NeighborhoodCEDARBROOK | reg_all | -65929.5791199 | 18345.7486857 | 0.0003271 | reg_slim | -57027.3432684 | 17052.2769512 | 0.0008272 |
NeighborhoodCENTRAL ROXBOROUGH | reg_all | -24678.6207347 | 17467.1426137 | 0.1577194 | reg_slim | -29885.6148045 | 16460.6246258 | 0.0694561 |
NeighborhoodCHESTNUT HILL | reg_all | 44483.1561038 | 21482.4802113 | 0.0384080 | reg_slim | 54474.7203282 | 20436.5000158 | 0.0076949 |
NeighborhoodCHINATOWN | reg_all | 54473.7351716 | 113146.6730846 | 0.6302089 | reg_slim | 58420.8113420 | 113987.5697621 | 0.6082959 |
NeighborhoodCOBBS CREEK | reg_all | -63981.5444107 | 17103.1741195 | 0.0001841 | reg_slim | -73888.9848888 | 15646.3033466 | 0.0000024 |
NeighborhoodCRESTMONT FARMS | reg_all | 26671.9898982 | 80921.8723036 | 0.7417059 | reg_slim | 65732.5487908 | 81301.3620753 | 0.4188138 |
NeighborhoodDUNLAP | reg_all | -131376.8407742 | 29913.8524480 | 0.0000113 | reg_slim | -127396.6883558 | 29029.1342570 | 0.0000115 |
NeighborhoodEAST FALLS | reg_all | -10064.7122781 | 19046.6508455 | 0.5972134 | reg_slim | -4640.7162482 | 18010.9143413 | 0.7966721 |
NeighborhoodEAST GERMANTOWN | reg_all | -72501.7393975 | 18674.9157549 | 0.0001039 | reg_slim | -82276.2877788 | 17485.3042865 | 0.0000026 |
NeighborhoodEAST KENSINGTON | reg_all | 3842.8712503 | 19795.2958595 | 0.8460765 | reg_slim | -4047.2779155 | 18179.3298099 | 0.8238262 |
NeighborhoodEAST MT. AIRY | reg_all | -84649.9436481 | 18006.7362165 | 0.0000026 | reg_slim | -76847.5754248 | 16628.0146949 | 0.0000038 |
NeighborhoodEAST OAK LANE | reg_all | -135938.3268446 | 23219.4068366 | 0.0000000 | reg_slim | -123609.7530494 | 21645.8535522 | 0.0000000 |
NeighborhoodEAST PARKSIDE | reg_all | -100322.0251927 | 26892.2036152 | 0.0001918 | reg_slim | -92772.1299044 | 26099.1750812 | 0.0003798 |
NeighborhoodEAST POPLAR | reg_all | -396.3533220 | 25894.7415980 | 0.9877880 | reg_slim | 5349.4991033 | 24787.5065327 | 0.8291356 |
NeighborhoodEAST TIOGA | reg_all | -145239.0794312 | 24910.4635948 | 0.0000000 | reg_slim | -157026.6561701 | 24153.3908417 | 0.0000000 |
NeighborhoodEAST TORRESDALE | reg_all | -9238.6538341 | 19994.9159892 | 0.6440526 | reg_slim | -9220.9858850 | 18982.4285427 | 0.6271420 |
NeighborhoodEASTWICK | reg_all | -69330.5450210 | 19278.0545896 | 0.0003238 | reg_slim | -57477.6195245 | 17776.9217236 | 0.0012267 |
NeighborhoodFAIRHILL | reg_all | -96214.4568443 | 20670.3893806 | 0.0000033 | reg_slim | -90795.1948537 | 19240.7577481 | 0.0000024 |
NeighborhoodFAIRMOUNT | reg_all | 79327.9472266 | 18413.3432530 | 0.0000166 | reg_slim | 83890.7617514 | 16454.2592408 | 0.0000003 |
NeighborhoodFELTONVILLE | reg_all | -81145.2928779 | 18933.2400007 | 0.0000183 | reg_slim | -81710.8299610 | 17715.1715893 | 0.0000040 |
NeighborhoodFERN ROCK | reg_all | -114447.5449107 | 25274.8155160 | 0.0000060 | reg_slim | -111728.1119313 | 24495.6166454 | 0.0000051 |
NeighborhoodFISHTOWN | reg_all | 48414.4583697 | 18283.4661780 | 0.0081063 | reg_slim | 47262.5499366 | 16328.9406343 | 0.0038047 |
NeighborhoodFORGOTTEN BLOCKS | reg_all | -61459.2515594 | 30873.7728943 | 0.0465376 | reg_slim | -66502.6183704 | 30329.3555583 | 0.0283469 |
NeighborhoodFOX CHASE | reg_all | -15096.7930131 | 17800.1156192 | 0.3963808 | reg_slim | -1286.2195882 | 17086.1237506 | 0.9399941 |
NeighborhoodFRANCISVILLE | reg_all | 9491.7041660 | 22289.3592951 | 0.6702309 | reg_slim | 22082.6082104 | 20669.0857723 | 0.2853640 |
NeighborhoodFRANKFORD | reg_all | -84273.7627814 | 16879.8316751 | 0.0000006 | reg_slim | -85196.4547140 | 15663.0238513 | 0.0000001 |
NeighborhoodFRANKFORD VALLEY | reg_all | -61743.8908957 | 21658.7989057 | 0.0043680 | reg_slim | -61348.2413654 | 20534.9453323 | 0.0028176 |
NeighborhoodGARDEN COURT | reg_all | 13303.9180466 | 28824.4160347 | 0.6444111 | reg_slim | 5972.4879777 | 28002.9007256 | 0.8311109 |
NeighborhoodGERMANTOWN | reg_all | -84269.4532083 | 17533.2526101 | 0.0000016 | reg_slim | -98075.7117969 | 16307.5649915 | 0.0000000 |
NeighborhoodGIRARD ESTATE | reg_all | 27630.9647696 | 30103.8360773 | 0.3587106 | reg_slim | 25135.7686960 | 29004.6716397 | 0.3861701 |
NeighborhoodGRAYS FERRY | reg_all | -46500.9043247 | 18386.4996740 | 0.0114472 | reg_slim | -41052.4016138 | 16602.8275385 | 0.0134247 |
NeighborhoodGREEN HILL FARMS | reg_all | -62868.8578340 | 53669.2962803 | 0.2414533 | reg_slim | -39580.3141921 | 53011.6575023 | 0.4552971 |
NeighborhoodHADDINGTON | reg_all | -91683.4876282 | 18590.7261149 | 0.0000008 | reg_slim | -88555.3769391 | 16899.6676403 | 0.0000002 |
NeighborhoodHARROWGATE | reg_all | -89954.8379758 | 18701.3055660 | 0.0000015 | reg_slim | -101149.6355978 | 17427.3601149 | 0.0000000 |
NeighborhoodHAWTHORNE | reg_all | 133391.3859209 | 27877.4896995 | 0.0000017 | reg_slim | 125888.7463502 | 26813.6237218 | 0.0000027 |
NeighborhoodHOLMESBURG | reg_all | -33030.8356079 | 18593.0781973 | 0.0756704 | reg_slim | -26684.3124741 | 17557.2450653 | 0.1285724 |
NeighborhoodHUNTING PARK | reg_all | -81830.7502200 | 18391.2804390 | 0.0000087 | reg_slim | -92245.6934911 | 17198.9716102 | 0.0000001 |
NeighborhoodHUNTING PARK INDUSTRIAL AREA | reg_all | -73860.0496874 | 43012.0188192 | 0.0859658 | reg_slim | -66750.3697964 | 42542.8981498 | 0.1166676 |
NeighborhoodJUNIATA PARK | reg_all | -58390.1274655 | 17809.2090079 | 0.0010456 | reg_slim | -53327.0794471 | 16662.7614704 | 0.0013756 |
NeighborhoodKENSINGTON | reg_all | -73936.6742166 | 17348.4892080 | 0.0000204 | reg_slim | -73332.2092400 | 15592.5003391 | 0.0000026 |
NeighborhoodKENSINGTON SOUTH | reg_all | 10660.3204395 | 22851.3446951 | 0.6408597 | reg_slim | 18252.7169842 | 21411.5511002 | 0.3939677 |
NeighborhoodKINGSESSING | reg_all | -71078.0125816 | 17397.6572345 | 0.0000442 | reg_slim | -79416.5903308 | 16002.6230213 | 0.0000007 |
NeighborhoodLAWNCREST | reg_all | -66141.4939686 | 18187.8036255 | 0.0002773 | reg_slim | -61558.5398941 | 17183.5961824 | 0.0003416 |
NeighborhoodLAWNDALE | reg_all | -60464.6773431 | 20705.3245826 | 0.0035032 | reg_slim | -49620.1552953 | 19593.2855456 | 0.0113359 |
NeighborhoodLEXINGTON | reg_all | 31966.1172131 | 25441.7068502 | 0.2089757 | reg_slim | 47245.4347744 | 24736.5221664 | 0.0561607 |
NeighborhoodLOGAN | reg_all | -116238.0022375 | 18566.9238177 | 0.0000000 | reg_slim | -119130.5563289 | 17242.2603564 | 0.0000000 |
NeighborhoodLOGAN SQUARE | reg_all | 62250.8192691 | 19952.7344864 | 0.0018127 | reg_slim | 69017.0749668 | 19439.3639828 | 0.0003860 |
NeighborhoodLUDLOW | reg_all | -52808.3703905 | 34230.8263943 | 0.1229230 | reg_slim | -49129.3421680 | 33592.6793599 | 0.1436265 |
NeighborhoodMANAYUNK | reg_all | -32849.5864851 | 17971.4593132 | 0.0675897 | reg_slim | -38507.7436252 | 16243.9220598 | 0.0177731 |
NeighborhoodMANTUA | reg_all | -62417.2976179 | 23852.8097917 | 0.0088863 | reg_slim | -54662.6443479 | 22720.9492087 | 0.0161489 |
NeighborhoodMAYFAIR | reg_all | -45574.5756867 | 16531.8817941 | 0.0058452 | reg_slim | -42219.6065777 | 15042.8142378 | 0.0050132 |
NeighborhoodMELROSE PARK GARDENS | reg_all | -74822.7659669 | 26573.9915337 | 0.0048749 | reg_slim | -62746.8330345 | 25612.5451040 | 0.0143039 |
NeighborhoodMILL CREEK | reg_all | -86286.9060187 | 21098.8747222 | 0.0000434 | reg_slim | -80718.0017347 | 19782.3318687 | 0.0000452 |
NeighborhoodMILLBROOK | reg_all | -18986.0464090 | 18148.4839895 | 0.2955098 | reg_slim | -17517.1150141 | 18052.9296840 | 0.3319040 |
NeighborhoodMORRELL PARK | reg_all | -23855.7611183 | 17240.1643011 | 0.1664629 | reg_slim | -25368.5216646 | 17078.4265040 | 0.1374571 |
NeighborhoodMORRIS PARK | reg_all | -74035.2260614 | 26778.6391821 | 0.0057048 | reg_slim | -71006.2224841 | 26038.1396081 | 0.0063992 |
NeighborhoodNICETOWN | reg_all | -95514.7522678 | 21504.8131378 | 0.0000090 | reg_slim | -97862.6975851 | 20342.2917389 | 0.0000015 |
NeighborhoodNORMANDY | reg_all | -504.9652983 | 38519.3590820 | 0.9895407 | reg_slim | -15862.4640491 | 38426.4109184 | 0.6797587 |
NeighborhoodNORRIS SQUARE | reg_all | -32424.5379200 | 24924.5633427 | 0.1933119 | reg_slim | -29401.7336867 | 23813.2494526 | 0.2169707 |
NeighborhoodNORTH CENTRAL | reg_all | -80050.1273523 | 18480.3989338 | 0.0000149 | reg_slim | -91827.5689139 | 16912.6671736 | 0.0000001 |
NeighborhoodNORTH DELAWARE | reg_all | -125249.9113691 | 66795.9599215 | 0.0607983 | reg_slim | -122275.9732774 | 66957.9079468 | 0.0678474 |
NeighborhoodNORTH PHILA. | reg_all | -98768.9359883 | 20987.1208531 | 0.0000025 | reg_slim | -101628.8866876 | 19625.7242436 | 0.0000002 |
NeighborhoodNORTHERN LIBERTIES | reg_all | 19683.2302778 | 21162.6740486 | 0.3523397 | reg_slim | 31407.2161736 | 19552.8861153 | 0.1082373 |
NeighborhoodNORTHWOOD | reg_all | -109857.7042768 | 27591.4371209 | 0.0000688 | reg_slim | -113248.4337823 | 26795.2691100 | 0.0000239 |
NeighborhoodOGONTZ | reg_all | -73306.8467902 | 18448.2118389 | 0.0000711 | reg_slim | -76392.5281295 | 17397.5640026 | 0.0000114 |
NeighborhoodOLD CITY | reg_all | -56147.6960712 | 113483.4492850 | 0.6207735 | reg_slim | -37773.1214624 | 114171.5101767 | 0.7407662 |
NeighborhoodOLDE KENSINGTON | reg_all | -15334.7403707 | 26307.0304497 | 0.5599606 | reg_slim | -14731.6502724 | 25124.7817587 | 0.5576570 |
NeighborhoodOLDE RICHMOND | reg_all | 15869.4741928 | 19363.6574097 | 0.4124871 | reg_slim | 10452.8543907 | 17672.2644171 | 0.5542061 |
NeighborhoodOLNEY | reg_all | -81051.0993539 | 17944.4988330 | 0.0000063 | reg_slim | -78219.1692285 | 15853.6904599 | 0.0000008 |
NeighborhoodOVERBROOK | reg_all | -95547.5871636 | 18894.8980732 | 0.0000004 | reg_slim | -95919.6370296 | 17585.7097417 | 0.0000000 |
NeighborhoodOVERBROOK FARMS | reg_all | -172684.0617529 | 32469.4522634 | 0.0000001 | reg_slim | -161275.5734957 | 30826.6156069 | 0.0000002 |
NeighborhoodOVERBROOK PARK | reg_all | -56451.7955031 | 22055.6680773 | 0.0104924 | reg_slim | -41076.3705219 | 20973.4106451 | 0.0501917 |
NeighborhoodOXFORD CIRCLE | reg_all | -66624.4141888 | 18468.3637994 | 0.0003103 | reg_slim | -57970.9662711 | 17066.7606727 | 0.0006839 |
NeighborhoodPACKER PARK | reg_all | -14381.8578155 | 24796.3898288 | 0.5619253 | reg_slim | -11369.4008805 | 24363.2785674 | 0.6407495 |
NeighborhoodPARADISE | reg_all | -53663.1677050 | 34960.8200870 | 0.1248185 | reg_slim | -56603.4633900 | 34411.6245983 | 0.1000144 |
NeighborhoodPARKWOOD | reg_all | -19199.1484625 | 17129.8742246 | 0.2623928 | reg_slim | -23270.8506505 | 16913.4072254 | 0.1688804 |
NeighborhoodPENNSPORT | reg_all | -6582.6538182 | 20130.6767112 | 0.7436757 | reg_slim | -4228.3511949 | 18204.7033291 | 0.8163341 |
NeighborhoodPENNYPACK WOODS | reg_all | 41960.6435048 | 113129.5540446 | 0.7107118 | reg_slim | 65227.5971361 | 114039.8416658 | 0.5673502 |
NeighborhoodPOINT BREEZE | reg_all | -8088.4612524 | 17309.9662729 | 0.6403127 | reg_slim | -4245.1828359 | 15356.1035523 | 0.7822072 |
NeighborhoodPORT RICHMOND | reg_all | -29512.1404750 | 17679.4934967 | 0.0950832 | reg_slim | -24532.4248997 | 15779.1562431 | 0.1200317 |
NeighborhoodPOWELTON VILLAGE | reg_all | -34605.9683188 | 52740.6332845 | 0.5117360 | reg_slim | -24015.8057978 | 52593.9763556 | 0.6479466 |
NeighborhoodQUEEN VILLAGE | reg_all | 93695.8143232 | 19964.4964507 | 0.0000027 | reg_slim | 98978.3461354 | 18273.0512294 | 0.0000001 |
NeighborhoodRHAWNHURST | reg_all | -3373.9229909 | 18579.8775991 | 0.8559070 | reg_slim | 9991.8375656 | 17428.9718079 | 0.5664583 |
NeighborhoodRITTENHOUSE SQ. | reg_all | 293773.0507151 | 18553.8054547 | 0.0000000 | reg_slim | 280872.1289137 | 17205.4170120 | 0.0000000 |
NeighborhoodSAUNDERS PARK | reg_all | -39282.6002949 | 48618.1316448 | 0.4191145 | reg_slim | -26126.8786552 | 48337.2458873 | 0.5888524 |
NeighborhoodSHARSWOOD | reg_all | -54367.5947877 | 24525.1100208 | 0.0266520 | reg_slim | -53499.8752214 | 23426.1196254 | 0.0224001 |
NeighborhoodSHAWMONT VALLEY | reg_all | -58495.1417862 | 40407.3922643 | 0.1477418 | reg_slim | -74499.8180405 | 40408.4212184 | 0.0652524 |
NeighborhoodSOCIETY HILL | reg_all | 223310.1849393 | 20144.0701817 | 0.0000000 | reg_slim | 218679.0227241 | 18190.9236294 | 0.0000000 |
NeighborhoodSOMERTON | reg_all | -2127.4648236 | 15935.0539383 | 0.8937932 | reg_slim | -3223.6079820 | 16035.2828123 | 0.8406763 |
NeighborhoodSOUTH PHILADELPHIA | reg_all | 8559.6425368 | 16568.4172324 | 0.6054269 | reg_slim | 9403.7046014 | 14640.7064377 | 0.5206900 |
NeighborhoodSOUTHWEST | reg_all | -83920.8752636 | 17066.3882595 | 0.0000009 | reg_slim | -82397.6901457 | 15812.5347453 | 0.0000002 |
NeighborhoodSOUTHWEST CENTER CITY | reg_all | 106732.5225133 | 18097.3078312 | 0.0000000 | reg_slim | 108779.3328829 | 16342.7869898 | 0.0000000 |
NeighborhoodSPRING GARDEN | reg_all | 77603.7536995 | 20260.6419528 | 0.0001286 | reg_slim | 87608.5166849 | 18407.0289858 | 0.0000020 |
NeighborhoodSPRUCE HILL | reg_all | 55453.9418361 | 26775.6943263 | 0.0383721 | reg_slim | 32932.5384605 | 25868.4027772 | 0.2030111 |
NeighborhoodST. HUGH | reg_all | -84376.6831640 | 20806.9308011 | 0.0000504 | reg_slim | -95639.2045120 | 19780.8709592 | 0.0000013 |
NeighborhoodSTRAWBERRY MANSION | reg_all | -80492.5823381 | 18714.0649261 | 0.0000171 | reg_slim | -96841.8354347 | 17115.6261400 | 0.0000000 |
NeighborhoodSUMMERDALE | reg_all | -70043.7232086 | 20226.3818041 | 0.0005358 | reg_slim | -61924.7207289 | 19163.2225759 | 0.0012346 |
NeighborhoodTACONY | reg_all | -60778.0340523 | 17277.0842742 | 0.0004365 | reg_slim | -60567.8269538 | 16140.5218212 | 0.0001758 |
NeighborhoodTIOGA | reg_all | -136940.3919822 | 19499.4280564 | 0.0000000 | reg_slim | -139746.6052022 | 18276.5480249 | 0.0000000 |
NeighborhoodUPPER HOLMESBURG | reg_all | -49117.2082574 | 22504.3607007 | 0.0290846 | reg_slim | -43100.4886552 | 21441.6702430 | 0.0444370 |
NeighborhoodUPPER NORTHWOOD | reg_all | -61942.4415593 | 20597.1073621 | 0.0026402 | reg_slim | -55845.8370408 | 19470.8696908 | 0.0041347 |
NeighborhoodUPPER ROXBOROUGH | reg_all | -39151.3185303 | 18183.7650762 | 0.0313286 | reg_slim | -29856.4679494 | 17064.3364054 | 0.0802023 |
NeighborhoodWALNUT HILL | reg_all | -39096.6410894 | 26657.3504243 | 0.1424976 | reg_slim | -43046.8258342 | 25972.3015942 | 0.0974586 |
NeighborhoodWALTON PARK | reg_all | -44212.4614591 | 30178.1304628 | 0.1429300 | reg_slim | -44753.7876904 | 30156.7736352 | 0.1378216 |
NeighborhoodWASHINGTON SQUARE WEST | reg_all | 168590.0820375 | 21203.4461655 | 0.0000000 | reg_slim | 169462.6339847 | 19569.5146383 | 0.0000000 |
NeighborhoodWEST FAIRHILL | reg_all | -127159.4012378 | 22247.9919633 | 0.0000000 | reg_slim | -124355.6183092 | 21180.6315858 | 0.0000000 |
NeighborhoodWEST KENSINGTON | reg_all | -47831.8556038 | 25278.1137276 | 0.0584819 | reg_slim | -49498.3790600 | 24372.1903475 | 0.0422804 |
NeighborhoodWEST MT. AIRY | reg_all | -63258.4744964 | 19759.4608285 | 0.0013706 | reg_slim | -54813.9137262 | 18667.2941880 | 0.0033263 |
NeighborhoodWEST OAK LANE | reg_all | -80596.4719677 | 17264.1880540 | 0.0000031 | reg_slim | -74889.6295475 | 15868.7867259 | 0.0000024 |
NeighborhoodWEST PARKSIDE | reg_all | -109853.3050213 | 36268.3923754 | 0.0024590 | reg_slim | -104608.8946361 | 35595.3540711 | 0.0032999 |
NeighborhoodWEST POPLAR | reg_all | -33032.1518695 | 36219.7385445 | 0.3617882 | reg_slim | -29878.3463061 | 35614.7995362 | 0.4015227 |
NeighborhoodWEST POWELTON | reg_all | -2738.1033258 | 33851.4863790 | 0.9355340 | reg_slim | -7886.4000773 | 33410.0426846 | 0.8133983 |
NeighborhoodWEST SHORE | reg_all | -23791.9480650 | 45311.9041412 | 0.5995425 | reg_slim | -23661.0629655 | 45071.5019339 | 0.5996143 |
NeighborhoodWHITAKER | reg_all | -79767.1412993 | 30923.3177517 | 0.0099041 | reg_slim | -81264.4461404 | 30921.6132196 | 0.0085963 |
NeighborhoodWHITMAN | reg_all | -9235.2623425 | 19762.5668074 | 0.6402848 | reg_slim | -8444.9175050 | 18249.7158755 | 0.6435563 |
NeighborhoodWINCHESTER | reg_all | -8141.1311035 | 19358.8864929 | 0.6740996 | reg_slim | 429.7532674 | 19126.7541313 | 0.9820744 |
NeighborhoodWINCHESTER PARK | reg_all | -16752.4470116 | 28317.5345494 | 0.5541330 | reg_slim | -12265.6333895 | 28463.6971471 | 0.6665318 |
NeighborhoodWISSAHICKON | reg_all | -7434.7842408 | 19752.4061702 | 0.7066261 | reg_slim | -18025.2254291 | 18681.4020772 | 0.3346240 |
NeighborhoodWISSINOMING | reg_all | -54146.6679531 | 17587.0174024 | 0.0020825 | reg_slim | -53416.9769139 | 16467.4435297 | 0.0011823 |
NeighborhoodWYNNEFIELD | reg_all | -124078.7498991 | 19788.7494761 | 0.0000000 | reg_slim | -116139.9060326 | 18971.3825342 | 0.0000000 |
NeighborhoodWYNNEFIELD HEIGHTS | reg_all | -25152.6556959 | 24626.0660526 | 0.3070906 | reg_slim | -25399.4422527 | 24507.1267002 | 0.3000287 |
NeighborhoodYORKTOWN | reg_all | -51085.7142768 | 36172.9040971 | 0.1578945 | reg_slim | -36278.3861451 | 35620.8263948 | 0.3084772 |
number_of_bathrooms | reg_all | 42042.8129161 | 2030.9653938 | 0.0000000 | reg_slim | 38010.2868205 | 1604.1250442 | 0.0000000 |
number_of_bedrooms | reg_all | -5425.9847155 | 1070.0491463 | 0.0000004 | NA | NA | NA | NA |
number_stories | reg_all | 14716.6215316 | 2141.6615697 | 0.0000000 | reg_slim | 9985.4837017 | 2121.1790827 | 0.0000025 |
pctdetached | reg_all | -240.1178379 | 102.6117756 | 0.0192947 | NA | NA | NA | NA |
pctmultifam_small | reg_all | 126.9913104 | 93.6901793 | 0.1752998 | NA | NA | NA | NA |
price_lag | reg_all | 0.3724456 | 0.0095447 | 0.0000000 | reg_slim | 0.3864421 | 0.0094745 | 0.0000000 |
total_area | reg_all | 2.4438817 | 0.3780751 | 0.0000000 | NA | NA | NA | NA |
total_livable_area | reg_all | 154.1511182 | 2.6404506 | 0.0000000 | reg_slim | 166.3417828 | 2.4510081 | 0.0000000 |
TotalPop | reg_all | -2.4162573 | 1.6397411 | 0.1406227 | NA | NA | NA | NA |
trees | reg_all | 24.8456495 | 12.0714046 | 0.0395873 | reg_slim | 30.4536953 | 11.4708127 | 0.0079426 |
year_built | reg_all | 84.7980919 | 24.6327387 | 0.0005781 | reg_slim | 180.5030307 | 22.2569316 | 0.0000000 |
Analyzing the mean absolute error (MAE) of the regression reveals its accuracy. At roughly $67,000, the error is not trivial considering the median sale price is $229,900. The mean absolute percent error (MAPE) confirms that the model errs by about 31% on average.
philly.test <- philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg_all, philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict) %>% # note: APE here is computed relative to the predicted price
filter(sale_price < 3500000)
#MAE
mae <- round(mean(philly.test$sale_price.AbsError, na.rm = TRUE), 2)
#MAPE
mape <- round(mean(philly.test$sale_price.APE, na.rm = TRUE) * 100, 2)
#Summary table of MAPE vs MAE
results_df <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Mean Absolute Percentage Error (MAPE)"),
Value = c(mae, paste0(mape, "%")) # Append "%" to MAPE value for percentage
)
kable(results_df) %>%
kable_material_dark(full_width = F) %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 67312.68 |
Mean Absolute Percentage Error (MAPE) | 30.78% |
Visualizing the data can also be instrumental in diagnosing a model’s goodness of fit. Given how tightly aligned the observed and predicted prices are, we tested dozens of variable combinations to rule out overfitting. We are confident that our variables are generalizable and do not over-fit; a sketch of that comparison process follows.
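The loop below is a simplified sketch of that comparison process (the candidate formulas are illustrative, not the exact combinations we tried): each candidate is fit on the training set and scored by mean absolute error on the held-out test set, so a specification that merely memorizes the training data is penalized.
# Sketch: score candidate variable combinations on held-out data to check for overfitting
candidates <- list(
  lag_only = sale_price ~ price_lag,
  internal = sale_price ~ price_lag + total_livable_area + number_of_bathrooms + number_stories,
  local    = sale_price ~ price_lag + total_livable_area + number_of_bathrooms + number_stories +
             MedHHInc + trees + litterscore
)
test_mae <- sapply(candidates, function(f) {
  fit <- lm(f, data = as.data.frame(philly.training))
  pred <- predict(fit, newdata = as.data.frame(philly.test))
  mean(abs(pred - philly.test$sale_price), na.rm = TRUE) # test-set MAE
})
round(test_mae) # lower MAE on unseen data = better generalization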
# Predicted prices as a function of observed prices
ggplot(philly.test, aes(x = sale_price, y = sale_price.Predict)) +
geom_point(size = 2, alpha = 0.7, color = "#A5C4D4") +
stat_smooth(aes(sale_price, sale_price),
method = "lm", se = FALSE, size = 1, colour="#A5C4D4") +
stat_smooth(aes(sale_price.Predict, sale_price),
method = "lm", se = FALSE, size = 1, colour="#D52F4D") +
labs(
title = "Predicted vs Observed Prices",
subtitle = "Blue line is predicted. Red line is observed.",
x = "Observed Prices",
y = "Predicted Prices"
) +
plotTheme()
Cross-validation is a way to test the model on subsets of the data to see whether the results hold. Here we create 100 random folds and measure goodness of fit on each. If the mean absolute error of each fold were approximately the same, we could deduce that the model is likely to generalize to new data. Unfortunately, we do not see tightly clustered MAE. This spread-out distribution of MAE indicates that the folds produced different results. So while the model is potentially not over-fitting, it is potentially not powerful enough (consistent with the roughly 31% MAPE noted above).
## Cross Validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(sales2023.select) %>%
dplyr::select(var_all),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
reg.cv$resample[1:10,]
ggplot(as.data.frame(reg.cv$resample), aes(x = MAE)) +
geom_histogram(binwidth = 2500, fill = "#A5C4D4") +
geom_vline(aes(xintercept = mean(reg.cv$resample$MAE)), colour = "red", size = 1) +
labs(
title = "Distribution of MAE",
subtitle = "Observed MAE (red line)",
x = "Mean Absolute Error (MAE)",
y = "Count"
) +
plotTheme()
To understand the relationship between errors spatially, we mapped the error across Philadelphia. The map does not immediately indicate any clustering, however, a Moran’s I test takes a closer look at spatial clustering.
#Spatial Lags Error
# What is the relationship between errors? Are they clustered? Is the error of a single observation correlated with the error of nearby observations?
coords.test <- st_coordinates(philly.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
philly.test <- philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) # store the spatial lag of errors (previously computed but not assigned)
philly.test <- philly.test[is.finite(philly.test$sale_price.Error), ]
philly.test$sale_price.Error <- as.numeric(philly.test$sale_price.Error)
test_data <- philly.test %>%
group_by(Neighborhood) %>%
mutate(MeanPrice = mean(sale_price))
test_data <- test_data %>%
group_by(Neighborhood) %>%
mutate(MAPE = mean(sale_price.APE)) %>%
filter(MAPE > 0) %>%
filter(MAPE < 5)
ggplot() +
geom_sf(data = blocks21.sf, color = "grey70") +
geom_sf(data = test_data, aes(colour = q5(sale_price.Error)),
show.legend = "point", size = .25) +
scale_colour_manual(values = palette,
labels=qBr(test_data,"sale_price.Error"),
name="Error\nRanges") +
labs(title="Map of Price Absolute Errors") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
## Do Errors Cluster? Using Moran's I
moranTest <- moran.mc(philly.test$sale_price.Error,
spatialWeights.test, nsim = 999)
moranTest
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.005, fill = "#A5C4D4") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#D52F4D",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I of 0.073929 in red",
x="Moran's I",
y="Count") +
plotTheme()
The Moran’s I value of 0.08356 suggests that there is some spatial autocorrelation in the residuals, but it is not particularly strong: similar residuals do tend to cluster somewhat in space, but the effect is not pronounced. Under the alternative hypothesis “greater”, a significant result indicates that the observed spatial autocorrelation is stronger than what we’d expect by random chance. However, the fact that the Moran’s I value is close to zero means that nearby locations have only slightly more similar prediction errors than would be expected under a random spatial distribution.
So if the error is not explained spatially, perhaps it can be explained by differences in housing quality or other local intelligence features.
Additionally, when we examine whether the mean absolute percentage error clusters in any neighborhoods, we see that only two neighborhoods in North Philadelphia have somewhat higher MAPEs than the rest, at roughly 8% compared to less than 4% elsewhere. This indicates that the model generalizes well to new data.
#provide a map of mean absolute percentage error (MAPE) by neighborhood.
st_drop_geometry(philly.test) %>%
group_by(Neighborhood) %>%
summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(hoods, by = c("Neighborhood" = "Neighborhood")) %>%
st_sf() %>%
ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = philly.test, colour = "white", size = .5, alpha = 0.2) +
scale_fill_gradient(low = palette[1], high = palette[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood",
caption = "White dots represent each sale in 2023.") +
mapTheme()
#Provide a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood.
ggplot(test_data) +
geom_point(aes(MeanPrice, MAPE)) +
geom_smooth(method = "lm", aes(MeanPrice, MAPE), se = FALSE, colour = "#D52F4D") +
labs(title = "MAPE by Neighborhood as a function of mean price by Neighborhood",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
With the exception of Center City, which flipped relatively recently from a redlined “D”-grade residential area to an expensive one, housing prices are sticky. Areas like Chestnut Hill or Fairmount were highly valued historically, are today, and, according to this model, will continue to be in the future.
# Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”.
inTrain.both <- createDataPartition(
y = paste(sales2023$holc_grade, sales2023$historic, sales2023$Neighborhood),
p = .60, list = FALSE)
philly.training.both <- sales2023[inTrain.both,]
philly.test.both <- sales2023[-inTrain.both,]
reg_all.both <-
lm(sale_price ~ ., data = as.data.frame(philly.training.both) %>%
dplyr::select(var_all))
philly.test.both <- philly.test.both %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg_all.both, philly.test.both),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
filter(sale_price < 5000000)
ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = philly.test.both, aes(colour = q5(sale_price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette,
labels=qBr(philly.test.both,"sale_price.Predict"),
name="Prices") +
labs(title="Predicted Sale Prices") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
The income context of Philadelphia is important to consider when evaluating our model. When dividing Philadelphia into low- and high-income groups (based on the citywide median income), there is clear clustering of high incomes in the center and the northwest and northeast outskirts of the city, with lower incomes in the areas dividing the outer and inner city. Price absolute errors cluster by income more than by neighborhood boundary, which suggests the income of nearby residents is highly influential in the test set.
blocks21.sf <- blocks21.sf %>%
mutate(incomeContext = ifelse(MedHHInc > median(blocks21.sf$MedHHInc), "High Income", "Low Income"))
ggplot() + geom_sf(data = na.omit(blocks21.sf), aes(fill = incomeContext), color = "gray70") +
scale_fill_manual(values = c("#4B7740", "#A5C4D4"), name="Income") +
labs(title = "Median Income by Block Group") +
mapTheme() +
theme(legend.position="bottom")
st_join(philly.test, blocks21.sf) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context") %>%
kable_material_dark() %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
)
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 24% | 41% |
The study strongly suggests that local intelligence is an effective predictor of housing prices. While nearby housing prices are themselves a strong predictor, combining nearby prices with local intelligence is an even stronger one.
The variables in the study with the strongest association to sale price, such as livable area, neighborhood, median household income, tree density, and litter conditions, drive the regression’s goodness of fit. Other variables affect the model at the margins.
In Philadelphia, and likely in many major metro areas across the United States, downtown revitalization over the past 30 years disproportionately pushed housing prices up. It would be interesting to further develop this variable beyond redlining and today’s pricing to better capture how value is dispersed across modern U.S. cities. If we could weight factors that ultimately encourage or constrict development, such as FAR or the share of permits denied, it could make the neighborhood and historical pricing variables even stronger.
The current model predicts 74% of the variation in price. Spatially, we see that areas like Center City and Chestnut Hill will continue to achieve high sale prices, and the model predicts these trends particularly well. However, the model could be improved where it discounts the rapid rise in popularity and value in areas like Fishtown, where development is new and abundant. Spatially, this model does not account for where new development is in the works. Understanding how real estate development investment can drive housing prices up, while an oversupply could push prices down, is important to factor into our predictions, too. Perhaps the model could predict more variation in price if this data were split into sales and rentals.
Also important to note is the impact of internal characteristics on predicted price. While the model predicts reasonably well on livable area and features like fireplaces and number of bathrooms, it does not account for quality. Our MAE of around $70k could be explained by differing interior finishes between two otherwise identical properties. Variables that correct for quality could be an effective addition to the model.
We recommend that Zillow incorporate local intelligence into its Zestimate algorithm. Our model is a solid starting point that can certainly be improved upon with additional feature selection and engineering. While our model predicts fairly well, it could be improved in how evenly accurate predictions are dispersed across the city. Reducing absolute errors, and the spatial clustering of those errors, will be central to future iterations.
While amenities, disamenities, and neighborhood spatial characteristics are often far more predictive of home prices than internal characteristics, these relationships may differ between Philadelphia and other cities. The key is that disparity in these spatial characteristics may highlight where there are disparities, and in turn inform where price may change significantly. In a city where trees, litter, or income are uniformly distributed, these characteristics will not impact price. Price is, in many ways, a byproduct of disparities, creating tension between neighborhoods that have what is ultimately unavailable throughout the rest of a city.
We recommend future iterations of the model include variables that offer nuance between properties that may otherwise be identical in form and location. Variables that account for urban growth dynamics, such as where building permits are issued relative to current density, could help the model account for pricing trends that are not present today but may be in the future. Another example of accounting for the impact of future development in Philadelphia is distance to transit, since the entire trolley system is undergoing an extensive modernization. Lastly, variables that capture job centers, since people often choose their home based on access or proximity to jobs or potential jobs, could explain the pricing disparity between neighborhoods, especially those near downtown. We are confident that with this model, Zillow is off to a great start in evolving its Zestimate score.
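As one hypothetical example of such a feature, the sketch below computes each sale’s distance to the nearest trolley stop with sf. This is a minimal sketch, assuming a point layer named trolley_stops.sf (e.g., SEPTA stop locations) projected to the same CRS as the sales data; that layer is not part of this study.
# Sketch: hypothetical distance-to-transit feature.
# `trolley_stops.sf` is an assumed point layer in the same CRS as sales2023.
nearest_stop <- st_nearest_feature(sales2023, trolley_stops.sf)
sales2023 <- sales2023 %>%
mutate(dist_transit = as.numeric(
st_distance(sales2023, trolley_stops.sf[nearest_stop, ], by_element = TRUE)))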
We want to note where outliers were identified and subsequently filtered to ensure the results were not skewed. For sale prices, we removed any sales above $3,500,000, since this is significantly higher than even the most expensive homes. We filtered out sales with more than 4 garage spaces, because the larger values (over 70 in some cases) likely represent an entire building rather than an individual unit. Additionally, parks with high concentrations of trees likely created outlier tree-density values, when in reality the homes near them were not nearly as tree-covered; we therefore only included block groups with fewer than 1,000 trees, already a very high number. Finally, we only included homes where the total livable area was less than 6,500 square feet to remove extreme outliers that also likely represent entire buildings rather than one unit.
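A minimal sketch, assuming the column names used throughout this analysis, of how these thresholds could be applied with dplyr (the plots below mark each threshold in red):
# Sketch: apply the outlier thresholds described above.
sales2023 <- sales2023 %>%
filter(sale_price <= 3500000,
garage_spaces <= 4,
total_livable_area < 6500,
trees < 1000)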
# Scatterplots of candidate outlier variables; red lines mark the filtering thresholds
plot(sales2023$sale_price, ylim = c(0, 6000000), main="Sales Prices")
abline(h = 3500000, col = "#D52F4D")
plot(sales2023$garage_spaces, ylim = c(0, 80), main="Garage Spaces")
abline(h = 4, col = "#D52F4D")
plot(sales2023$total_livable_area, ylim = c(0, 150000), main="Total Livable Area")
abline(h = 6500, col = "#D52F4D")
plot(sales2023$trees, ylim = c(0, 10000), main="Trees")
abline(h = 1000, col = "#D52F4D")
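As a final diagnostic, we report generalized variance inflation factors (GVIF) to check for multicollinearity among the predictors. The table below hard-codes the reported values; a minimal sketch of how values like these could be produced, assuming reg_all is the fitted baseline lm and that the car package is available, follows. Note that categorical variables with many levels, such as Neighborhood, naturally produce very large raw GVIF values; for such terms car::vif() also reports the adjusted GVIF^(1/(2*Df)).
# Sketch: generalized VIF for the fitted baseline model.
# Assumes `reg_all` is the lm fit from the modelling step and the `car` package is installed.
library(car)
vif(reg_all)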
VIF_table <- data.frame(
Variable = c("Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built", "historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade"),
GVIF = c(9643.267543, 1.338493, 1.760822, 1.943907, 2.028490, 1.515672, 1.469092, 1.868998, 1.453886, 1.690424, 150.815224, 3.716039, 2.010839, 1.491015, 1.506094, 2.286827, 1.926168, 1.415946, 1.277872)
)
kable(VIF_table, caption = "VIF Values", format = "html") %>%
kable_styling()
Variable | GVIF |
---|---|
Neighborhood | 9643.267543 |
fireplaces | 1.338493 |
garage_spaces | 1.760822 |
number_of_bathrooms | 1.943907 |
number_of_bedrooms | 2.028490 |
number_stories | 1.515672 |
total_area | 1.469092 |
total_livable_area | 1.868998 |
year_built | 1.453886 |
historic | 1.690424 |
holc_grade | 150.815224 |
price_lag | 3.716039 |
trees | 2.010839 |
litterscore | 1.491015 |
TotalPop | 1.506094 |
MedHHInc | 2.286827 |
pctdetached | 1.926168 |
pctmultifam_small | 1.415946 |
hwy_grade | 1.277872 |