Local intelligence is a strong predictor of housing prices in Philadelphia and can sharpen Zillow’s popular and consequential Zestimate tool. Real estate professionals, home buyers, and sellers alike trust Zestimate to be an accurate prediction of local real estate values. Investing in the tool’s reliability stands to benefit not only Zillow’s customers but also the company’s dominant market position.
The following study explores a series of local intelligence variables ranging from tree-lined streets to local median income. A series of statistical tests reveals a handful of key variables that are then fed into prediction models. The best prediction model is then tested against 2023 sale price data for accuracy. The study is mindful of creating models that are generalizable across Philadelphia, and hopefully across major metro areas throughout the country.
First, the sales data itself is a tremendous resource on the internal characteristics of each property, which certainly influence price. This study leverages data such as year built, total square footage, and number of bathrooms and bedrooms to holistically understand what influences price.
However, the property itself does not tell the full story. Where it is located and the characteristics of that area are key, too. There are myriad local intelligence variables to choose from across a number of key open data sources (2021-2023), including the U.S. Census American Community Survey, Pennsylvania Spatial Data Access (PASDA), the Philadelphia Open Data Portal, and the Delaware Valley Regional Planning Commission (DVRPC). This study prioritized complete datasets that affect the city at large. For example, litter is a systemic issue whereas airports are very localized. The following table breaks down viable sources, from which we selected the local intelligence variables that we deemed generalizable for the purposes of this study:
# Create a data frame of candidate datasets and whether we judged each to be generalizable
data_frame <- data.frame(
"Dataset" = c(
"Floodplain Data",
"Sanitation Centers",
"Litter Index",
"Demographics and Socioeconomic Data",
"Housing Typology",
"Trees",
"Neighborhood",
"Proximity to Highways",
"School Catchment Performance",
"Walkscore",
"1937 Redlining Score"
),
"Generalizability" = c(
"", # Floodplain Data
"", # Sanitation Centers
"✔", # Litter Index
"✔", # Demographics and Socioeconomic Data
"", # Housing Typology
"✔", # Trees
"✔", # Neighborhood
"✔", # Proximity to Highways
"", # School Catchment Performance
"", # Walkscore
"✔" # 1937 Redlining Score
)
)
dt_url <- c("https://data-phl.opendata.arcgis.com/datasets/phl::fema-100-flood-plain/explore",
"https://opendataphilly.org/datasets/sanitation-areas/",
"https://opendataphilly.org/datasets/litter-index/",
"https://www.census.gov/",
"https://www.census.gov/",
"https://opendataphilly.org/datasets/phillytreemap/",
"https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=7047",
"https://dvrpc-dvrpcgis.opendata.arcgis.com/datasets/dvrpcgis::freight-highway-network/explore",
"https://www.philasd.org/performance/programsservices/open-data/school-information/",
"http://walkscore.com/",
"https://ncrc.org/holc/")
data_frame %>%
mutate(Dataset = cell_spec((Dataset), "html", link = dt_url)) %>%
kable(format = "html", escape = FALSE) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE,
position = "left") %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
)
From highways to rivers, man-made and natural boundaries alike delineate neighborhoods where certain people and types of businesses congregate. Neighborhood boundaries are arguably more organic than census block groups (which are optimized for consistent populations) or highways. Additionally, because residential sales cluster within residential neighborhoods rather than census block groups, neighborhoods (as defined by the Philadelphia Planning Department) are likely a significant source of local intelligence. You can see where areas of high and low income cluster across Philadelphia, and how these boundaries are more recognizable at a neighborhood level than at a block group level.
# plot sales data using median cut off
sales_plot_mod <- sales2023 %>%
filter(toPredict == "MODELLING")
median_sale_price <- median(sales_plot_mod$sale_price)
custom_colors <- ifelse(sales_plot_mod$sale_price <= median_sale_price, "#A5C4D4", "#4B7740")
sales_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
labs(title = "Block Groups", color = "") +
theme(legend.position = "none") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(subtitle = "2021 Philadelphia Median \nSale Price is $229,900") +
mapTheme()
#Bring in neighborhoods
hoods <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/PhillyPlanning_Neighborhoods/PhillyPlanning_Neighborhoods.shp") %>%
select(NAME) %>%
rename(Neighborhood = NAME)
sales2023 <- st_join(sales2023, hoods, join = st_within)
hood_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray90", color = "gray90") +
geom_sf(data = hoods, color = "gray60") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
labs(title = "Neighborhoods") +
theme(legend.position = "none") +
guides(color = guide_legend(override.aes = list(size = 5))) +
labs(subtitle = "Neighborhoods as defined by Philadelphia \nPlanning Dept") +
mapTheme()
#Bring in highways and apply buffer of 1/8 mile (660')
highways <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Freight_Highway_Network.geojson")
highways <- highways %>%
filter(state == "PA") %>%
st_transform('EPSG:2272')
highways_phl <- st_intersection(highways, blocks21.sf) %>%
select(geometry)
highway_buffer <- st_buffer(highways_phl, 660)
highway_buffer$hwy_grade <- 1
# Make a dummy variable where 1 is a house within highway buffer and 0 is outside of buffer.
saleshighways <- st_intersection(sales2023, highway_buffer)
homes_near_hwys <- saleshighways$objectid #creates a list using unique ids from saleshighways
sales2023 <- sales2023 %>%
mutate(hwy_grade = ifelse(objectid %in% homes_near_hwys, 1, 0))
hwy_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
scale_color_manual(
values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
geom_sf(data = highway_buffer, aes(fill = "#36151E", color = "#36151E")) +
scale_fill_identity(name = "Highway", labels = "#36151E") +
scale_color_identity(name = "Highway", labels = "#36151E") +
guides(color = guide_legend(override.aes = list(size = 5))) +
labs(title = "Highway Proximity") +
labs(subtitle = "The highway buffer is 660 feet, \nabout 2 Philadelphia blocks.") +
mapTheme()
#plot all three together
plot_grid(sales_plot, hood_plot, hwy_plot, ncol = 3, align = "v")
It is important to be aware of how the sales are distributed around the city. There are many block groups without any housing sales, whereas there are fewer neighborhoods where this is true because neighborhoods are residential in nature. Geographically, because neighborhoods generally follow residential versus industrial zoning lines, it is easier to see why the neighborhood variable generalizes across the city.
Interestingly, when considering the impact of highways in Philadelphia, it would be reasonable to assume the variable is not generalizable. However, given how consistently home prices drop below the median in proximity to the highway, regardless of whether the sale is in Center City or Port Richmond, we decided that it is a variable worth further investigation.
In the 1930s, a federal agency, the Home Owners’ Loan Corporation (HOLC), created maps of mortgage risk that denied entire neighborhoods access to public lending benefits and overall economic opportunity. This institutionalized marginalization of poor, Black communities ingrained persistent residential segregation that is still evidenced by income inequality across the city.
# Get and wrangle redlining data
redlining.sf <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Redlining.geojson") %>%
st_transform('EPSG:2272') %>%
filter(state == "PA", city == "Philadelphia")
redlining.sf <- redlining.sf %>%
dplyr::select(-c(holc_id, state, city, neighborhood_id, area_description_data, name)) %>%
drop_na()
# Join each house with historic redlining grade
sales2023 <- st_join(sales2023, redlining.sf, join = st_within)
sales2023 <- sales2023 %>%
dplyr::mutate(holc_grade = ifelse(is.na(holc_grade) == TRUE, "NA", holc_grade))
# Plotting Redlining Maps
redlining_colors <- c("NA" = "gray85", "A" = "#197991", "B" = "#4B7740", "C" = "#E3D26F", "D" = "#D52F4D")
redlining_palette <- c("#197991", "#4B7740", "#E3D26F", "#D52F4D", "#A9233B")
redlining.sf.phl <- st_intersection(redlining.sf, blocks21.sf) #clip redlining to PHL boundaries
Redlining_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = redlining.sf.phl, aes(fill = holc_grade, text = holc_grade), alpha = 0.8) +
scale_fill_manual(values = redlining_colors, name = "HOLC Grade") +
labs(title = "1937 HOLC Redlining", caption = "by 2020 Block Group") +
theme(legend.position = "right") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
Income_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "grey70") +
geom_sf(data = sales2023, aes(colour = q5(MedHHInc)),
show.legend = "point", size = .5, alpha = 0.5) +
scale_colour_manual(values = rev(redlining_palette),
labels=qBr(sales2023,"MedHHInc"),
name="Income\nBrackets") +
labs(title="2023 Median Incomes", caption = "by 2020 Block Group") +
theme(legend.position = "right") +
guides(color = guide_legend(override.aes = list(size = 3))) +
mapTheme()
plot_grid(Redlining_plot, Income_plot, ncol = 2, align = "v")
NCRC states that “[t]here is significantly greater economic inequality in cities where more of the HOLC graded high-risk or ‘Hazardous’ areas are currently minority neighborhoods” (NCRC). The impact of HOLC is generalizable. One exception to note is Center City, which has many high-income earners despite its high-risk grade, likely due to post-war downtown revitalization.
When considering the patterns of redlining and neighborhoods, it is likely that neighborhoods such as Center City will account for divergences in the redlining data, and neighborhoods such as Kensington will account for consistencies in the redlining data when we develop our definition of local intelligence.
## K NEAREST NEIGHBOR: PRICE
sales2023 <- sales2023 %>%
mutate(nb = st_knn(geometry, k = 5),
wt = st_weights(nb),
price_lag = st_lag(sale_price, nb, wt))
Price Lag: To systematically investigate price clustering beyond redlining and in today’s data, a k-nearest neighbor function calculates a weighted average of the sale prices of the 5 nearest properties. This price lag variable accounts for Philadelphia’s block-by-block dynamics, borne from its density and long history in which uses from mansions to power plants have converged within mere feet of one another - and which can drastically impact sale price.
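As a rough illustration of what the `st_knn`/`st_lag` call above produces, the sketch below computes the same idea manually for a small, arbitrary subset of sales: for each property, the equal-weight average sale price of its 5 nearest neighbors. This mirrors the default row-standardized weights for a k-nearest-neighbor list; the 500-sale subset and base-R distance matrix are illustrative assumptions, not part of the modeling pipeline.
# Minimal sketch: manual price lag for a small subset of sales (illustration only)
sub <- sales2023[1:500, ]   # arbitrary subset keeps the pairwise distance matrix small
xy <- st_coordinates(sub)   # point coordinates in EPSG:2272 (feet)
d <- as.matrix(dist(xy))    # pairwise distances between sales
diag(d) <- Inf              # a property is not its own neighbor
manual_lag <- apply(d, 1, function(row) {
  mean(sub$sale_price[order(row)[1:5]]) # equal-weight average of the 5 nearest sale prices
})
# Values only roughly match price_lag above because neighbors here are limited to the subset.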
Proximity to amenities and distance from disamenities certainly have a colloquial effect on housing prices. Broker descriptions love to tout “2 Blocks from Park!” or “Tucked away, quiet street!” To test the effect of environmental data, the study includes tree density and litter conditions.
## TREES by BLOCK GROUP
trees.sf <- st_read("https://opendata.arcgis.com/api/v3/datasets/30ef36e9e880468fa74e2d5b18da4cfb_0/downloads/data?format=geojson&spatialRefId=4326") %>%
st_transform('EPSG:2272')
trees.sf <- st_join(trees.sf, blocks21.sf, join = st_within)
trees.sf <- trees.sf %>%
select(c(OBJECTID, TREE_NAME, YEAR, GEOID))
trees.sf <- trees.sf %>%
group_by(GEOID) %>% # census block group
summarize(trees = n())
blocks21.sf.trees <- st_join(blocks21.sf, trees.sf, join = st_intersects) # spatial join attaches tree counts to block groups (st_join has no `by` argument)
blocks21.sf.trees <- mutate(blocks21.sf.trees, trees = ifelse(is.na(trees), 0, trees)) %>%
select(trees, GEOID.y)
sales2023 <- sales2023 %>%
st_join(blocks21.sf.trees, join = st_intersects)
trees_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
stat_density2d(data = data.frame(st_coordinates(trees.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 30, geom = 'polygon') +
scale_fill_gradient(
high = "#4B7740",
low = "#ABCDA2",
name = "",
breaks = c(0.0000000001, 0.000000001), # Set the breaks for low and high
labels = c("low", "high")) + # Set the corresponding labels
scale_alpha(range = c(0, .7), guide = "none") +
labs(title = "Tree Density", color = "") +
theme(legend.position = "bottom") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(caption = "Tree Density is measured by \n number of trees in the block group") +
mapTheme()
# NEAREST LITTER TO HOUSE
litterindex <- st_read("https://opendata.arcgis.com/datasets/04fa63e09b284dbfbde1983eab367319_0.geojson") %>%
st_transform('EPSG:2272') %>%
rename(litterscore = HUNDRED_BLOCK_SCORE) %>%
rename(littergrade = SCORE_COLOR)
nearest_litter <- st_nearest_feature(sales2023, litterindex)
sales2023$litterscore <- litterindex$litterscore[nearest_litter]
sales2023$littergrade <- litterindex$littergrade[nearest_litter]
sales2023 <- sales2023 %>%
mutate(litterscore = round(litterscore, 2))
litter_plot <- ggplot() +
geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
stat_density2d(data = data.frame(st_coordinates(litterindex)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 30, geom = 'polygon') +
scale_fill_gradient(low = "#E3D26F", high = "#36151E", name = "",
breaks = c(0.0000000002, 0.0000000006),
labels = c("low", "high")) +
scale_alpha(range = c(0, .7), guide = "none") +
labs(title = "Litter Score", color = "") +
theme(legend.position = "bottom") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(caption = "Litter Score is based on the Litter Index \ndeveloped by the City of Philadelphia") +
mapTheme()
# plot two maps
plot_grid(trees_plot, litter_plot, ncol = 2, align = "v")
Interestingly, these two environmental data points may send mixed signals: where there are more trees, there is also more litter. Since tree density and litter conditions send strong visual cues to potential buyers about the level of care in a neighborhood, the impact of these variables may strengthen or dilute the model’s results.
“All models are wrong, but some are useful.” - George Box
Zestimate seeks to predict housing prices. Predicting city trends is fraught. Oftentimes unpredictable political decisions, such as post-war state-sponsored suburbanization (aka white flight), buck trend predictions. No model can account for every unforeseeable scenario. However, as previously discussed, useful models are generalizable. Building a useful prediction model requires feature selection and feature engineering. These features are then used to train a model to recognize patterns associated with sale price. The predictive model uses those patterns to produce a Zestimate. While the prices of nearby sales are likely a fine starting point for building Zestimate, including local intelligence will certainly strengthen the product.
This study uses the hedonic model (reference), a theoretical framework that deconstructs housing prices into three categories: internal characteristics, amenities/public services, and the spatial process of prices.
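Concretely, the framework can be expressed as a stylized regression formula. The sketch below groups variables used later in this study into the three categories; it is an illustration of the framework, not the exact specification that is ultimately fit.
# Stylized hedonic decomposition (sketch): price as a function of three variable groups
hedonic_sketch <- sale_price ~
  total_livable_area + number_of_bathrooms + year_built + # internal characteristics
  trees + litterscore + MedHHInc + Neighborhood +          # amenities / public services
  price_lag                                                # spatial process of prices
# e.g., lm(hedonic_sketch, data = as.data.frame(philly.training))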
# SELECTING VARIABLES TO INCLUDE IN MODEL
## PREPPING SALES DATA INTO MODEL V CHALLENGE
sales2023.mod <- sales2023 %>%
filter(toPredict == "MODELLING")
sales2023.chal <- sales2023 %>%
filter(toPredict == "CHALLENGE")
all_var <- c("objectid", "census_tract","GEOID", "toPredict","sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")
sales2023.select <- sales2023.mod %>%
dplyr::select(all_of(all_var)) %>%
na.omit()
sales2023.summary <- sales2023.select %>%
st_drop_geometry() %>%
select(-c(objectid, census_tract, GEOID, toPredict))
# Tables of summary statistics, sorted by category (internal characteristics, amenities/public services, spatial structure), built with the `stargazer` package
categories_to_include <- c("Amenities / Public Services", "Internal Characteristic", "Spatial Process")
#subsetting categories
#internal characteristics
InternalCharacteristics <- sales2023 %>%
dplyr::select("fireplaces", "garage_spaces",
"number_of_bathrooms", "number_of_bedrooms", "number_stories",
"total_area", "total_livable_area", "year_built", "historic", "sale_price")
stargazer(as.data.frame(InternalCharacteristics), type="text", digits=1, title = "Internal Characteristics", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Internal Characteristics
=====================================================
Statistic Mean St. Dev. Min Max
-----------------------------------------------------
fireplaces 0.05 0.3 0 5
garage_spaces 0.4 0.7 0 72
number_of_bathrooms 1.1 0.7 0 8
number_of_bedrooms 2.6 1.3 0 31
number_stories 2.0 0.6 1 5
total_area 1,828.9 3,818.9 0 226,512
total_livable_area 1,333.6 548.8 0 15,246
year_built 1,936.7 51.1 0 2,024
historic 1.9 0.4 1 3
sale_price 271,846.8 240,083.4 0 5,990,000
-----------------------------------------------------
#amenities/public services
Amenities_Public_Services <- sales2023 %>%
dplyr::select("Neighborhood", "trees", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "litterscore")
stargazer(as.data.frame(Amenities_Public_Services), type="text", digits=1, title = "Public Services and Disamenities", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Public Services and Disamenities
===============================================
Statistic Mean St. Dev. Min Max
-----------------------------------------------
trees 116.7 112.9 2 2,136
TotalPop 1,445.2 708.6 0 4,518
MedHHInc 56,282.7 39,156.1 0 250,001
pctdetached 9.0 12.7 0.0 100.0
pctmultifam_small 12.4 12.2 0.0 75.0
litterscore 1.8 0.6 0.0 4.0
-----------------------------------------------
#spatial process
Spatial <- sales2023 %>%
dplyr::select("price_lag")
stargazer(as.data.frame(Spatial), type="text", digits=1, title = "Spatial Process of Prices", out = "Training_PHLInternal.txt", omit.summary.stat = "N")
Spatial Process of Prices
==================================================
Statistic Mean St. Dev. Min Max
--------------------------------------------------
price_lag 267,985.6 194,111.3 20,000.0 3,292,800.0
--------------------------------------------------
As we select and engineer features, here are some important takeaways from the summary tables above:
The goal is to select the variables that most significantly correlate with sale price for inclusion in the predictive model. Correlation is a type of association test. Are sale prices more closely associated with square footage or with number of bedrooms? These are the types of subtle but important distinctions we want to seek out.
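As a quick check of that specific question (a sketch outside the main pipeline, with total livable area standing in for square footage):
# Sketch: is sale price more associated with square footage or with bedroom count?
cor(sales2023.mod$sale_price, sales2023.mod$total_livable_area, use = "complete.obs")
cor(sales2023.mod$sale_price, sales2023.mod$number_of_bedrooms, use = "complete.obs")
# The correlation matrix below runs this comparison across all numeric variables at once.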
# CORRELATION MATRIX OF VARIABLES
numericVars <- sales2023.select %>%
st_drop_geometry() %>%
select(-toPredict, -GEOID, -objectid, -census_tract) %>%
select(-holc_grade, -hwy_grade, -historic, -Neighborhood) #remove categorical variables for corrtest
# Handle missing values before testing (remove rows with NAs)
if (sum(is.na(numericVars)) > 0) {
numericVars <- na.omit(numericVars) # Remove rows with missing values
}
price.corr <- corr.test(numericVars) # psych::corr.test returns pairwise correlations and p-values
correlation_matrix <- round(cor(numericVars), 1)
p_values <- cor_pmat(numericVars)
ggcorrplot(
correlation_matrix,
p.mat = p_values,
colors = c("#36151E", "white", "#197991"),
type = "lower",
insig = "blank"
) +
labs(title = "Degrees of Correlation to Sale Price") +
annotate(
geom = "rect",
xmin = .5, xmax = 16, ymin = .5, ymax = 1.5,
fill = "transparent", color = "red", alpha = 0.5
)
The correlation matrix revealed strong positive and negative correlations. These are the primary variables to note for further engineering:
It is important to note where variables correlate with each other as well. Price lag and median income had a very strong positive correlation with one another, as did number of bathrooms and number of bedrooms. To avoid multicollinearity, or two variables telling the same story, we will focus on number of bathrooms since it is also more correlated with sale price.
# str(sales2023.select)
# Home price correlation scatterplots for four open data variables of interest
#MedHHInc, trees, price_lag, litterscore
st_drop_geometry(sales2023.select) %>%
dplyr::select(sale_price, MedHHInc, trees, price_lag, litterscore) %>%
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#D52F4D") +
facet_wrap(~Variable, ncol = 2, scales = "free", labeller = labeller(Variable = c(price_lag = "Price Lag", MedHHInc = "Median Income", trees = "Tree Density", litterscore = "Litter Conditions"))) +
labs(title = "Price as a function of select variables") +
plotTheme()
The relationships between sale price and median income and between sale price and tree density are similarly positive, whereas litter is negative. The correlation between price lag and sale price is by far the strongest, which indicates a very tight fit (as supported by the values in the summary tables above).
# Create 3 maps of the 3 most interesting variables.
plot_MedHHInc <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales2023, aes(colour = q5(MedHHInc)),
show.legend = "point", size = .5) +
scale_colour_manual(values = rev(palette),
labels=qBr(sales2023,"MedHHInc"),
name="Median\nIncome") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
plot_price_lag <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = sales2023, aes(colour = q5(price_lag)),
show.legend = "point", size = .5) +
scale_colour_manual(values = rev(palette),
labels=qBr(sales2023,"price_lag"),
name="Nearby\nPrice") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
trees_plot <- ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
stat_density2d(data = data.frame(st_coordinates(trees.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 50, geom = 'polygon') +
scale_fill_gradient(
low = "#ABCDA2",
high = "#4B7740",
name = "Tree Density") +
scale_alpha(range = c(0, .5), guide = "none") +
mapTheme()
plot_grid(plot_MedHHInc, plot_price_lag, trees_plot, align = "v", ncol = 2)
Spatially, different patterns emerge. For instance, Center City has high tree density, high median income, and high average nearby prices despite being historically marginalized by redlining in the 1930s. North Philadelphia experiences the inverse of these trends, as do many parts of West Philadelphia further from University City. Spatial exploration reinforces the breadth of variable selection.
#Filter out outliers discovered in summary statistics.
sales2023.select <- sales2023.select %>%
filter(sale_price <= 3500000, garage_spaces < 4, total_livable_area < 6500,
trees <= 1000)
After identifying correlated variables and looking at summary statistics, we also plotted the distribution of some key characteristics to identify any outliers possibly skewing the data. See Appendix A for scatterplots that visualize the outliers that were filtered out.
Data Filters:
The prediction model works by training on patterns in the relationship between sale price and our selected variables, then testing on a held-out portion of the data to see how well it performs. We use a 60:40 partition into training and test sets, respectively.
#Take note: holc_grade, historic, and Neighborhood are categorical. Stratifying the partition on these fixed effects ensures each category is represented in the training set.
inTrain <- createDataPartition(
y = paste(sales2023.select$holc_grade, sales2023.select$historic, sales2023.select$Neighborhood),
p = .60, list = FALSE)
philly.training <- sales2023.select[inTrain,]
philly.test <- sales2023.select[-inTrain,]
We established the variables of the model. Now, we need to formulate the best combination of those variables to predict sale price. To begin, we start with a simple model based only on price lag, and then add and subtract variables, analyzing whether they strengthen or weaken the model.
Price Lag Only Model:
Considering the strong association between sale price and price lag, we developed a price-lag-only regression. It is statistically significant (p-value < 0.05); however, its R-squared value indicates that only approximately 57.34% of the variation in the dependent variable, sale_price, is explained by price lag.
Local Intelligence Model:
Let’s see how including local intelligence variables impacts the regression. A regression that includes a ‘kitchen sink’ of variables from our data sources, including but not limited to fireplaces, number of bathrooms, redlining grade, trees, and litter, performs better, with an R-squared value indicating that approximately 73.44% of the variation in sale price is explained by local intelligence plus price lag.
#Price Lag only model
reg.int <- lm(sale_price ~ price_lag, data = as.data.frame(philly.training))
#Local Intelligence Regression
var_all <- c("sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")
reg_all <-
lm(sale_price ~ ., data = as.data.frame(philly.training) %>%
dplyr::select(var_all))
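For reference, the R-squared values quoted above can be read straight from the model summaries (exact figures vary slightly with the random training partition):
# Compare explanatory power of the two specifications
summary(reg.int)$r.squared # price-lag-only model (~0.57 in our run)
summary(reg_all)$r.squared # 'kitchen sink' local intelligence model (~0.73 in our run)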
To ensure that the Local Intelligence Regression does not suffer from multicollinearity, or multiple variables telling the same story about sale price, we use a series of tests. The first test, the variance inflation factor (VIF), indicates that Neighborhood and Redlining Grade have very high VIF scores relative to the other variables and are potentially conflicting. To further investigate these categorical (non-numeric) variables, we performed a chi-square test, which reveals that they do have high collinearity.
#test for multicollinearity of reg_all
library(car)
vif(reg_all)
#neighborhood (10221.05) and holc_grade (153.67) have very high vif scores compared to other variables.
#checking chisquare on categorical for collinearity
chi_test <- table(philly.training$Neighborhood, philly.training$holc_grade)
chisq.test(chi_test)
#remove holc_grade bc of high collinearity with Neighborhood
reg_slim <- lm(sale_price ~ Neighborhood + MedHHInc + price_lag + trees + number_of_bathrooms + total_livable_area + litterscore + number_stories + year_built, data = as.data.frame(philly.training))
Local Regression Slim Model:
The next regression iteration removed Redlining Grade as well as other variables that showed less association with sale price. This slimmed-down regression performed marginally worse than the ‘kitchen sink’, with an R-squared value of 72.97% of variance explained.
R-squared is not the only measure of fit. The second and third regressions have lower AIC than the price-lag-only model, a good sign of fit and parsimony, and of the value of including local intelligence generally. We also performed a likelihood ratio test, which shows a high chi-square for the ‘kitchen sink’ regression; so while both are statistically significant, the second regression seems to fit the sales data best.
AIC(reg.int, reg_all, reg_slim) #lower AIC compared to intercept is good sign of the models fit and leanness and meanness.
lrtest(reg.int, reg_all, reg_slim) #likelihood ratio test show a high chisq for reg_all so while both are stat significant, reg_all seems to fit the data best.
aic_table <- data.frame(
Regression = c("Price Lag Only", "Local Intel", "Local Intel Slim"),
AIC = c(373654.6, 367160.2, 367393.4))
chi_table <- data.frame(
Regression = c("Local Intel", "Local Intel Slim"),
Chi_Square = c(6818.45, 259.24),
Significance = c("<0.05 ***", "<0.05 ***")
)
Regression | AIC |
---|---|
Price Lag Only | 373654.6 |
Local Intel | 367160.2 |
Local Intel Slim | 367393.4 |
Regression | Chi_Square | Significance |
---|---|---|
Local Intel | 6818.45 | <0.05 *** |
Local Intel Slim | 259.24 | <0.05 *** |
With all this in mind, the predictive model will be trained with the second regression, which features a breadth of local intelligence features.
Here is an in-depth comparative table between regressions:
#compare kitchen sink and slim regressions
coef_all <- summary(reg_all)$coef
coef_slim <- summary(reg_slim)$coef
coef_df_all <- data.frame(
Model = c("reg_all"),
Coefficients = rownames(coef_all),
Estimate_all = coef_all[, 1],
Std.Error_all = coef_all[, 2],
p_value_all = coef_all[, 4])
coef_df_slim <- data.frame(
Model = c("reg_slim"),
Coefficients = rownames(coef_slim),
Estimate_slim = coef_slim[, 1],
Std.Error_slim = coef_slim[, 2],
p_value_slim = coef_slim[, 4]
)
left_merged_kable <- merge(coef_df_all, coef_df_slim, by = "Coefficients", all.x = TRUE) %>%
rename("Regression All" = Model.x, "Regression Slim" = Model.y)
kable(left_merged_kable, format = "html", caption = "Note: NAs are where variables were not included for both Regressions") %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "800px", height = "500px")
Coefficients | Regression All | Estimate_all | Std.Error_all | p_value_all | Regression Slim | Estimate_slim | Std.Error_slim | p_value_slim |
---|---|---|---|---|---|---|---|---|
(Intercept) | reg_all | -279476.7997428 | 49921.0374428 | 0.0000000 | reg_slim | -438474.3739475 | 46047.1181628 | 0.0000000 |
fireplaces | reg_all | 20055.8986415 | 3997.7262837 | 0.0000005 | NA | NA | NA | NA |
garage_spaces | reg_all | 19949.7716871 | 2494.9113408 | 0.0000000 | NA | NA | NA | NA |
historic | reg_all | 23109.8780154 | 2930.8283940 | 0.0000000 | NA | NA | NA | NA |
holc_gradeB | reg_all | -10835.7003932 | 9343.1614304 | 0.2461719 | NA | NA | NA | NA |
holc_gradeC | reg_all | -13681.0963243 | 10103.7213906 | 0.1757377 | NA | NA | NA | NA |
holc_gradeD | reg_all | -722.0059079 | 11025.4149394 | 0.9477883 | NA | NA | NA | NA |
holc_gradeNA | reg_all | -12411.2797850 | 11321.6474356 | 0.2729914 | NA | NA | NA | NA |
hwy_grade | reg_all | 11971.1364543 | 6119.6278170 | 0.0504632 | NA | NA | NA | NA |
litterscore | reg_all | -5267.4098876 | 1854.3596425 | 0.0045101 | reg_slim | -6582.2861356 | 1864.9903469 | 0.0004179 |
MedHHInc | reg_all | 0.2242095 | 0.0365627 | 0.0000000 | reg_slim | 0.2265662 | 0.0363103 | 0.0000000 |
NeighborhoodALLEGHENY WEST | reg_all | -65699.0082409 | 18729.5667850 | 0.0004533 | reg_slim | -78617.7264519 | 17418.9565627 | 0.0000064 |
NeighborhoodANDORRA | reg_all | -53626.6958289 | 26254.8610315 | 0.0411165 | reg_slim | -44715.5140164 | 25635.7775496 | 0.0811355 |
NeighborhoodASHTON | reg_all | -28074.8646915 | 27055.1471125 | 0.2994323 | reg_slim | -21237.8145788 | 27036.5642300 | 0.4321610 |
NeighborhoodBELLA VISTA | reg_all | 109341.7415589 | 20444.3351888 | 0.0000001 | reg_slim | 109732.1594553 | 18900.9530092 | 0.0000000 |
NeighborhoodBELLS CORNER | reg_all | -11101.2503662 | 20319.8413197 | 0.5848509 | reg_slim | 472.3591262 | 19253.3601216 | 0.9804271 |
NeighborhoodBELMONT | reg_all | -104943.0995189 | 23728.9404496 | 0.0000098 | reg_slim | -99477.3256346 | 22529.7147042 | 0.0000102 |
NeighborhoodBLUE BELL HILL | reg_all | -19560.9212990 | 43516.3565427 | 0.6530727 | reg_slim | -29843.8683334 | 42495.9151292 | 0.4825187 |
NeighborhoodBREWERYTOWN | reg_all | -37155.0836387 | 19201.3073030 | 0.0530071 | reg_slim | -40248.2425404 | 17662.9162070 | 0.0227008 |
NeighborhoodBRIDESBURG | reg_all | -31590.5782913 | 20265.8118929 | 0.1190638 | reg_slim | -37506.9338406 | 19105.0689525 | 0.0496437 |
NeighborhoodBROOKHAVEN | reg_all | -25643.5512275 | 37037.4650217 | 0.4887179 | reg_slim | -36185.3979561 | 37106.0231729 | 0.3294833 |
NeighborhoodBURHOLME | reg_all | -24016.7373695 | 19790.1842027 | 0.2249333 | reg_slim | -15227.1161084 | 18762.0188575 | 0.4170393 |
NeighborhoodBUSTLETON | reg_all | -9641.0202861 | 15771.9220198 | 0.5410259 | reg_slim | -5478.2779982 | 15888.7496221 | 0.7302577 |
NeighborhoodBYBERRY | reg_all | -112753.9122773 | 42597.9178231 | 0.0081317 | reg_slim | -90519.4614556 | 42731.0380078 | 0.0341628 |
NeighborhoodCABOT | reg_all | -66390.5445712 | 32361.7174698 | 0.0402356 | reg_slim | -61102.1363083 | 31623.8571900 | 0.0533609 |
NeighborhoodCALLOWHILL/CHINATOWN NORTH | reg_all | 64796.7274978 | 52473.9379988 | 0.2169123 | reg_slim | 42291.2166366 | 52562.5805279 | 0.4210713 |
NeighborhoodCARROLL PARK | reg_all | -86801.7567426 | 19289.6556555 | 0.0000069 | reg_slim | -93463.2521295 | 18069.1847659 | 0.0000002 |
NeighborhoodCASTOR GARDENS | reg_all | -48306.9491716 | 17752.2230931 | 0.0065131 | reg_slim | -42782.4758428 | 16436.3163278 | 0.0092531 |
NeighborhoodCATHEDRAL PARK | reg_all | -111863.9583836 | 32949.0487008 | 0.0006881 | reg_slim | -109757.2304376 | 32522.5898011 | 0.0007407 |
NeighborhoodCECIL B MOORE | reg_all | -125200.1873270 | 37616.8907321 | 0.0008761 | reg_slim | -128530.2754965 | 37044.9827495 | 0.0005229 |
NeighborhoodCEDAR PARK | reg_all | 4384.1862665 | 22747.0747485 | 0.8471683 | reg_slim | -7722.5142192 | 21661.3460154 | 0.7214631 |
NeighborhoodCEDARBROOK | reg_all | -65929.5791199 | 18345.7486857 | 0.0003271 | reg_slim | -57027.3432684 | 17052.2769512 | 0.0008272 |
NeighborhoodCENTRAL ROXBOROUGH | reg_all | -24678.6207347 | 17467.1426137 | 0.1577194 | reg_slim | -29885.6148045 | 16460.6246258 | 0.0694561 |
NeighborhoodCHESTNUT HILL | reg_all | 44483.1561038 | 21482.4802113 | 0.0384080 | reg_slim | 54474.7203282 | 20436.5000158 | 0.0076949 |
NeighborhoodCHINATOWN | reg_all | 54473.7351716 | 113146.6730846 | 0.6302089 | reg_slim | 58420.8113420 | 113987.5697621 | 0.6082959 |
NeighborhoodCOBBS CREEK | reg_all | -63981.5444107 | 17103.1741195 | 0.0001841 | reg_slim | -73888.9848888 | 15646.3033466 | 0.0000024 |
NeighborhoodCRESTMONT FARMS | reg_all | 26671.9898982 | 80921.8723036 | 0.7417059 | reg_slim | 65732.5487908 | 81301.3620753 | 0.4188138 |
NeighborhoodDUNLAP | reg_all | -131376.8407742 | 29913.8524480 | 0.0000113 | reg_slim | -127396.6883558 | 29029.1342570 | 0.0000115 |
NeighborhoodEAST FALLS | reg_all | -10064.7122781 | 19046.6508455 | 0.5972134 | reg_slim | -4640.7162482 | 18010.9143413 | 0.7966721 |
NeighborhoodEAST GERMANTOWN | reg_all | -72501.7393975 | 18674.9157549 | 0.0001039 | reg_slim | -82276.2877788 | 17485.3042865 | 0.0000026 |
NeighborhoodEAST KENSINGTON | reg_all | 3842.8712503 | 19795.2958595 | 0.8460765 | reg_slim | -4047.2779155 | 18179.3298099 | 0.8238262 |
NeighborhoodEAST MT. AIRY | reg_all | -84649.9436481 | 18006.7362165 | 0.0000026 | reg_slim | -76847.5754248 | 16628.0146949 | 0.0000038 |
NeighborhoodEAST OAK LANE | reg_all | -135938.3268446 | 23219.4068366 | 0.0000000 | reg_slim | -123609.7530494 | 21645.8535522 | 0.0000000 |
NeighborhoodEAST PARKSIDE | reg_all | -100322.0251927 | 26892.2036152 | 0.0001918 | reg_slim | -92772.1299044 | 26099.1750812 | 0.0003798 |
NeighborhoodEAST POPLAR | reg_all | -396.3533220 | 25894.7415980 | 0.9877880 | reg_slim | 5349.4991033 | 24787.5065327 | 0.8291356 |
NeighborhoodEAST TIOGA | reg_all | -145239.0794312 | 24910.4635948 | 0.0000000 | reg_slim | -157026.6561701 | 24153.3908417 | 0.0000000 |
NeighborhoodEAST TORRESDALE | reg_all | -9238.6538341 | 19994.9159892 | 0.6440526 | reg_slim | -9220.9858850 | 18982.4285427 | 0.6271420 |
NeighborhoodEASTWICK | reg_all | -69330.5450210 | 19278.0545896 | 0.0003238 | reg_slim | -57477.6195245 | 17776.9217236 | 0.0012267 |
NeighborhoodFAIRHILL | reg_all | -96214.4568443 | 20670.3893806 | 0.0000033 | reg_slim | -90795.1948537 | 19240.7577481 | 0.0000024 |
NeighborhoodFAIRMOUNT | reg_all | 79327.9472266 | 18413.3432530 | 0.0000166 | reg_slim | 83890.7617514 | 16454.2592408 | 0.0000003 |
NeighborhoodFELTONVILLE | reg_all | -81145.2928779 | 18933.2400007 | 0.0000183 | reg_slim | -81710.8299610 | 17715.1715893 | 0.0000040 |
NeighborhoodFERN ROCK | reg_all | -114447.5449107 | 25274.8155160 | 0.0000060 | reg_slim | -111728.1119313 | 24495.6166454 | 0.0000051 |
NeighborhoodFISHTOWN | reg_all | 48414.4583697 | 18283.4661780 | 0.0081063 | reg_slim | 47262.5499366 | 16328.9406343 | 0.0038047 |
NeighborhoodFORGOTTEN BLOCKS | reg_all | -61459.2515594 | 30873.7728943 | 0.0465376 | reg_slim | -66502.6183704 | 30329.3555583 | 0.0283469 |
NeighborhoodFOX CHASE | reg_all | -15096.7930131 | 17800.1156192 | 0.3963808 | reg_slim | -1286.2195882 | 17086.1237506 | 0.9399941 |
NeighborhoodFRANCISVILLE | reg_all | 9491.7041660 | 22289.3592951 | 0.6702309 | reg_slim | 22082.6082104 | 20669.0857723 | 0.2853640 |
NeighborhoodFRANKFORD | reg_all | -84273.7627814 | 16879.8316751 | 0.0000006 | reg_slim | -85196.4547140 | 15663.0238513 | 0.0000001 |
NeighborhoodFRANKFORD VALLEY | reg_all | -61743.8908957 | 21658.7989057 | 0.0043680 | reg_slim | -61348.2413654 | 20534.9453323 | 0.0028176 |
NeighborhoodGARDEN COURT | reg_all | 13303.9180466 | 28824.4160347 | 0.6444111 | reg_slim | 5972.4879777 | 28002.9007256 | 0.8311109 |
NeighborhoodGERMANTOWN | reg_all | -84269.4532083 | 17533.2526101 | 0.0000016 | reg_slim | -98075.7117969 | 16307.5649915 | 0.0000000 |
NeighborhoodGIRARD ESTATE | reg_all | 27630.9647696 | 30103.8360773 | 0.3587106 | reg_slim | 25135.7686960 | 29004.6716397 | 0.3861701 |
NeighborhoodGRAYS FERRY | reg_all | -46500.9043247 | 18386.4996740 | 0.0114472 | reg_slim | -41052.4016138 | 16602.8275385 | 0.0134247 |
NeighborhoodGREEN HILL FARMS | reg_all | -62868.8578340 | 53669.2962803 | 0.2414533 | reg_slim | -39580.3141921 | 53011.6575023 | 0.4552971 |
NeighborhoodHADDINGTON | reg_all | -91683.4876282 | 18590.7261149 | 0.0000008 | reg_slim | -88555.3769391 | 16899.6676403 | 0.0000002 |
NeighborhoodHARROWGATE | reg_all | -89954.8379758 | 18701.3055660 | 0.0000015 | reg_slim | -101149.6355978 | 17427.3601149 | 0.0000000 |
NeighborhoodHAWTHORNE | reg_all | 133391.3859209 | 27877.4896995 | 0.0000017 | reg_slim | 125888.7463502 | 26813.6237218 | 0.0000027 |
NeighborhoodHOLMESBURG | reg_all | -33030.8356079 | 18593.0781973 | 0.0756704 | reg_slim | -26684.3124741 | 17557.2450653 | 0.1285724 |
NeighborhoodHUNTING PARK | reg_all | -81830.7502200 | 18391.2804390 | 0.0000087 | reg_slim | -92245.6934911 | 17198.9716102 | 0.0000001 |
NeighborhoodHUNTING PARK INDUSTRIAL AREA | reg_all | -73860.0496874 | 43012.0188192 | 0.0859658 | reg_slim | -66750.3697964 | 42542.8981498 | 0.1166676 |
NeighborhoodJUNIATA PARK | reg_all | -58390.1274655 | 17809.2090079 | 0.0010456 | reg_slim | -53327.0794471 | 16662.7614704 | 0.0013756 |
NeighborhoodKENSINGTON | reg_all | -73936.6742166 | 17348.4892080 | 0.0000204 | reg_slim | -73332.2092400 | 15592.5003391 | 0.0000026 |
NeighborhoodKENSINGTON SOUTH | reg_all | 10660.3204395 | 22851.3446951 | 0.6408597 | reg_slim | 18252.7169842 | 21411.5511002 | 0.3939677 |
NeighborhoodKINGSESSING | reg_all | -71078.0125816 | 17397.6572345 | 0.0000442 | reg_slim | -79416.5903308 | 16002.6230213 | 0.0000007 |
NeighborhoodLAWNCREST | reg_all | -66141.4939686 | 18187.8036255 | 0.0002773 | reg_slim | -61558.5398941 | 17183.5961824 | 0.0003416 |
NeighborhoodLAWNDALE | reg_all | -60464.6773431 | 20705.3245826 | 0.0035032 | reg_slim | -49620.1552953 | 19593.2855456 | 0.0113359 |
NeighborhoodLEXINGTON | reg_all | 31966.1172131 | 25441.7068502 | 0.2089757 | reg_slim | 47245.4347744 | 24736.5221664 | 0.0561607 |
NeighborhoodLOGAN | reg_all | -116238.0022375 | 18566.9238177 | 0.0000000 | reg_slim | -119130.5563289 | 17242.2603564 | 0.0000000 |
NeighborhoodLOGAN SQUARE | reg_all | 62250.8192691 | 19952.7344864 | 0.0018127 | reg_slim | 69017.0749668 | 19439.3639828 | 0.0003860 |
NeighborhoodLUDLOW | reg_all | -52808.3703905 | 34230.8263943 | 0.1229230 | reg_slim | -49129.3421680 | 33592.6793599 | 0.1436265 |
NeighborhoodMANAYUNK | reg_all | -32849.5864851 | 17971.4593132 | 0.0675897 | reg_slim | -38507.7436252 | 16243.9220598 | 0.0177731 |
NeighborhoodMANTUA | reg_all | -62417.2976179 | 23852.8097917 | 0.0088863 | reg_slim | -54662.6443479 | 22720.9492087 | 0.0161489 |
NeighborhoodMAYFAIR | reg_all | -45574.5756867 | 16531.8817941 | 0.0058452 | reg_slim | -42219.6065777 | 15042.8142378 | 0.0050132 |
NeighborhoodMELROSE PARK GARDENS | reg_all | -74822.7659669 | 26573.9915337 | 0.0048749 | reg_slim | -62746.8330345 | 25612.5451040 | 0.0143039 |
NeighborhoodMILL CREEK | reg_all | -86286.9060187 | 21098.8747222 | 0.0000434 | reg_slim | -80718.0017347 | 19782.3318687 | 0.0000452 |
NeighborhoodMILLBROOK | reg_all | -18986.0464090 | 18148.4839895 | 0.2955098 | reg_slim | -17517.1150141 | 18052.9296840 | 0.3319040 |
NeighborhoodMORRELL PARK | reg_all | -23855.7611183 | 17240.1643011 | 0.1664629 | reg_slim | -25368.5216646 | 17078.4265040 | 0.1374571 |
NeighborhoodMORRIS PARK | reg_all | -74035.2260614 | 26778.6391821 | 0.0057048 | reg_slim | -71006.2224841 | 26038.1396081 | 0.0063992 |
NeighborhoodNICETOWN | reg_all | -95514.7522678 | 21504.8131378 | 0.0000090 | reg_slim | -97862.6975851 | 20342.2917389 | 0.0000015 |
NeighborhoodNORMANDY | reg_all | -504.9652983 | 38519.3590820 | 0.9895407 | reg_slim | -15862.4640491 | 38426.4109184 | 0.6797587 |
NeighborhoodNORRIS SQUARE | reg_all | -32424.5379200 | 24924.5633427 | 0.1933119 | reg_slim | -29401.7336867 | 23813.2494526 | 0.2169707 |
NeighborhoodNORTH CENTRAL | reg_all | -80050.1273523 | 18480.3989338 | 0.0000149 | reg_slim | -91827.5689139 | 16912.6671736 | 0.0000001 |
NeighborhoodNORTH DELAWARE | reg_all | -125249.9113691 | 66795.9599215 | 0.0607983 | reg_slim | -122275.9732774 | 66957.9079468 | 0.0678474 |
NeighborhoodNORTH PHILA. | reg_all | -98768.9359883 | 20987.1208531 | 0.0000025 | reg_slim | -101628.8866876 | 19625.7242436 | 0.0000002 |
NeighborhoodNORTHERN LIBERTIES | reg_all | 19683.2302778 | 21162.6740486 | 0.3523397 | reg_slim | 31407.2161736 | 19552.8861153 | 0.1082373 |
NeighborhoodNORTHWOOD | reg_all | -109857.7042768 | 27591.4371209 | 0.0000688 | reg_slim | -113248.4337823 | 26795.2691100 | 0.0000239 |
NeighborhoodOGONTZ | reg_all | -73306.8467902 | 18448.2118389 | 0.0000711 | reg_slim | -76392.5281295 | 17397.5640026 | 0.0000114 |
NeighborhoodOLD CITY | reg_all | -56147.6960712 | 113483.4492850 | 0.6207735 | reg_slim | -37773.1214624 | 114171.5101767 | 0.7407662 |
NeighborhoodOLDE KENSINGTON | reg_all | -15334.7403707 | 26307.0304497 | 0.5599606 | reg_slim | -14731.6502724 | 25124.7817587 | 0.5576570 |
NeighborhoodOLDE RICHMOND | reg_all | 15869.4741928 | 19363.6574097 | 0.4124871 | reg_slim | 10452.8543907 | 17672.2644171 | 0.5542061 |
NeighborhoodOLNEY | reg_all | -81051.0993539 | 17944.4988330 | 0.0000063 | reg_slim | -78219.1692285 | 15853.6904599 | 0.0000008 |
NeighborhoodOVERBROOK | reg_all | -95547.5871636 | 18894.8980732 | 0.0000004 | reg_slim | -95919.6370296 | 17585.7097417 | 0.0000000 |
NeighborhoodOVERBROOK FARMS | reg_all | -172684.0617529 | 32469.4522634 | 0.0000001 | reg_slim | -161275.5734957 | 30826.6156069 | 0.0000002 |
NeighborhoodOVERBROOK PARK | reg_all | -56451.7955031 | 22055.6680773 | 0.0104924 | reg_slim | -41076.3705219 | 20973.4106451 | 0.0501917 |
NeighborhoodOXFORD CIRCLE | reg_all | -66624.4141888 | 18468.3637994 | 0.0003103 | reg_slim | -57970.9662711 | 17066.7606727 | 0.0006839 |
NeighborhoodPACKER PARK | reg_all | -14381.8578155 | 24796.3898288 | 0.5619253 | reg_slim | -11369.4008805 | 24363.2785674 | 0.6407495 |
NeighborhoodPARADISE | reg_all | -53663.1677050 | 34960.8200870 | 0.1248185 | reg_slim | -56603.4633900 | 34411.6245983 | 0.1000144 |
NeighborhoodPARKWOOD | reg_all | -19199.1484625 | 17129.8742246 | 0.2623928 | reg_slim | -23270.8506505 | 16913.4072254 | 0.1688804 |
NeighborhoodPENNSPORT | reg_all | -6582.6538182 | 20130.6767112 | 0.7436757 | reg_slim | -4228.3511949 | 18204.7033291 | 0.8163341 |
NeighborhoodPENNYPACK WOODS | reg_all | 41960.6435048 | 113129.5540446 | 0.7107118 | reg_slim | 65227.5971361 | 114039.8416658 | 0.5673502 |
NeighborhoodPOINT BREEZE | reg_all | -8088.4612524 | 17309.9662729 | 0.6403127 | reg_slim | -4245.1828359 | 15356.1035523 | 0.7822072 |
NeighborhoodPORT RICHMOND | reg_all | -29512.1404750 | 17679.4934967 | 0.0950832 | reg_slim | -24532.4248997 | 15779.1562431 | 0.1200317 |
NeighborhoodPOWELTON VILLAGE | reg_all | -34605.9683188 | 52740.6332845 | 0.5117360 | reg_slim | -24015.8057978 | 52593.9763556 | 0.6479466 |
NeighborhoodQUEEN VILLAGE | reg_all | 93695.8143232 | 19964.4964507 | 0.0000027 | reg_slim | 98978.3461354 | 18273.0512294 | 0.0000001 |
NeighborhoodRHAWNHURST | reg_all | -3373.9229909 | 18579.8775991 | 0.8559070 | reg_slim | 9991.8375656 | 17428.9718079 | 0.5664583 |
NeighborhoodRITTENHOUSE SQ. | reg_all | 293773.0507151 | 18553.8054547 | 0.0000000 | reg_slim | 280872.1289137 | 17205.4170120 | 0.0000000 |
NeighborhoodSAUNDERS PARK | reg_all | -39282.6002949 | 48618.1316448 | 0.4191145 | reg_slim | -26126.8786552 | 48337.2458873 | 0.5888524 |
NeighborhoodSHARSWOOD | reg_all | -54367.5947877 | 24525.1100208 | 0.0266520 | reg_slim | -53499.8752214 | 23426.1196254 | 0.0224001 |
NeighborhoodSHAWMONT VALLEY | reg_all | -58495.1417862 | 40407.3922643 | 0.1477418 | reg_slim | -74499.8180405 | 40408.4212184 | 0.0652524 |
NeighborhoodSOCIETY HILL | reg_all | 223310.1849393 | 20144.0701817 | 0.0000000 | reg_slim | 218679.0227241 | 18190.9236294 | 0.0000000 |
NeighborhoodSOMERTON | reg_all | -2127.4648236 | 15935.0539383 | 0.8937932 | reg_slim | -3223.6079820 | 16035.2828123 | 0.8406763 |
NeighborhoodSOUTH PHILADELPHIA | reg_all | 8559.6425368 | 16568.4172324 | 0.6054269 | reg_slim | 9403.7046014 | 14640.7064377 | 0.5206900 |
NeighborhoodSOUTHWEST | reg_all | -83920.8752636 | 17066.3882595 | 0.0000009 | reg_slim | -82397.6901457 | 15812.5347453 | 0.0000002 |
NeighborhoodSOUTHWEST CENTER CITY | reg_all | 106732.5225133 | 18097.3078312 | 0.0000000 | reg_slim | 108779.3328829 | 16342.7869898 | 0.0000000 |
NeighborhoodSPRING GARDEN | reg_all | 77603.7536995 | 20260.6419528 | 0.0001286 | reg_slim | 87608.5166849 | 18407.0289858 | 0.0000020 |
NeighborhoodSPRUCE HILL | reg_all | 55453.9418361 | 26775.6943263 | 0.0383721 | reg_slim | 32932.5384605 | 25868.4027772 | 0.2030111 |
NeighborhoodST. HUGH | reg_all | -84376.6831640 | 20806.9308011 | 0.0000504 | reg_slim | -95639.2045120 | 19780.8709592 | 0.0000013 |
NeighborhoodSTRAWBERRY MANSION | reg_all | -80492.5823381 | 18714.0649261 | 0.0000171 | reg_slim | -96841.8354347 | 17115.6261400 | 0.0000000 |
NeighborhoodSUMMERDALE | reg_all | -70043.7232086 | 20226.3818041 | 0.0005358 | reg_slim | -61924.7207289 | 19163.2225759 | 0.0012346 |
NeighborhoodTACONY | reg_all | -60778.0340523 | 17277.0842742 | 0.0004365 | reg_slim | -60567.8269538 | 16140.5218212 | 0.0001758 |
NeighborhoodTIOGA | reg_all | -136940.3919822 | 19499.4280564 | 0.0000000 | reg_slim | -139746.6052022 | 18276.5480249 | 0.0000000 |
NeighborhoodUPPER HOLMESBURG | reg_all | -49117.2082574 | 22504.3607007 | 0.0290846 | reg_slim | -43100.4886552 | 21441.6702430 | 0.0444370 |
NeighborhoodUPPER NORTHWOOD | reg_all | -61942.4415593 | 20597.1073621 | 0.0026402 | reg_slim | -55845.8370408 | 19470.8696908 | 0.0041347 |
NeighborhoodUPPER ROXBOROUGH | reg_all | -39151.3185303 | 18183.7650762 | 0.0313286 | reg_slim | -29856.4679494 | 17064.3364054 | 0.0802023 |
NeighborhoodWALNUT HILL | reg_all | -39096.6410894 | 26657.3504243 | 0.1424976 | reg_slim | -43046.8258342 | 25972.3015942 | 0.0974586 |
NeighborhoodWALTON PARK | reg_all | -44212.4614591 | 30178.1304628 | 0.1429300 | reg_slim | -44753.7876904 | 30156.7736352 | 0.1378216 |
NeighborhoodWASHINGTON SQUARE WEST | reg_all | 168590.0820375 | 21203.4461655 | 0.0000000 | reg_slim | 169462.6339847 | 19569.5146383 | 0.0000000 |
NeighborhoodWEST FAIRHILL | reg_all | -127159.4012378 | 22247.9919633 | 0.0000000 | reg_slim | -124355.6183092 | 21180.6315858 | 0.0000000 |
NeighborhoodWEST KENSINGTON | reg_all | -47831.8556038 | 25278.1137276 | 0.0584819 | reg_slim | -49498.3790600 | 24372.1903475 | 0.0422804 |
NeighborhoodWEST MT. AIRY | reg_all | -63258.4744964 | 19759.4608285 | 0.0013706 | reg_slim | -54813.9137262 | 18667.2941880 | 0.0033263 |
NeighborhoodWEST OAK LANE | reg_all | -80596.4719677 | 17264.1880540 | 0.0000031 | reg_slim | -74889.6295475 | 15868.7867259 | 0.0000024 |
NeighborhoodWEST PARKSIDE | reg_all | -109853.3050213 | 36268.3923754 | 0.0024590 | reg_slim | -104608.8946361 | 35595.3540711 | 0.0032999 |
NeighborhoodWEST POPLAR | reg_all | -33032.1518695 | 36219.7385445 | 0.3617882 | reg_slim | -29878.3463061 | 35614.7995362 | 0.4015227 |
NeighborhoodWEST POWELTON | reg_all | -2738.1033258 | 33851.4863790 | 0.9355340 | reg_slim | -7886.4000773 | 33410.0426846 | 0.8133983 |
NeighborhoodWEST SHORE | reg_all | -23791.9480650 | 45311.9041412 | 0.5995425 | reg_slim | -23661.0629655 | 45071.5019339 | 0.5996143 |
NeighborhoodWHITAKER | reg_all | -79767.1412993 | 30923.3177517 | 0.0099041 | reg_slim | -81264.4461404 | 30921.6132196 | 0.0085963 |
NeighborhoodWHITMAN | reg_all | -9235.2623425 | 19762.5668074 | 0.6402848 | reg_slim | -8444.9175050 | 18249.7158755 | 0.6435563 |
NeighborhoodWINCHESTER | reg_all | -8141.1311035 | 19358.8864929 | 0.6740996 | reg_slim | 429.7532674 | 19126.7541313 | 0.9820744 |
NeighborhoodWINCHESTER PARK | reg_all | -16752.4470116 | 28317.5345494 | 0.5541330 | reg_slim | -12265.6333895 | 28463.6971471 | 0.6665318 |
NeighborhoodWISSAHICKON | reg_all | -7434.7842408 | 19752.4061702 | 0.7066261 | reg_slim | -18025.2254291 | 18681.4020772 | 0.3346240 |
NeighborhoodWISSINOMING | reg_all | -54146.6679531 | 17587.0174024 | 0.0020825 | reg_slim | -53416.9769139 | 16467.4435297 | 0.0011823 |
NeighborhoodWYNNEFIELD | reg_all | -124078.7498991 | 19788.7494761 | 0.0000000 | reg_slim | -116139.9060326 | 18971.3825342 | 0.0000000 |
NeighborhoodWYNNEFIELD HEIGHTS | reg_all | -25152.6556959 | 24626.0660526 | 0.3070906 | reg_slim | -25399.4422527 | 24507.1267002 | 0.3000287 |
NeighborhoodYORKTOWN | reg_all | -51085.7142768 | 36172.9040971 | 0.1578945 | reg_slim | -36278.3861451 | 35620.8263948 | 0.3084772 |
number_of_bathrooms | reg_all | 42042.8129161 | 2030.9653938 | 0.0000000 | reg_slim | 38010.2868205 | 1604.1250442 | 0.0000000 |
number_of_bedrooms | reg_all | -5425.9847155 | 1070.0491463 | 0.0000004 | NA | NA | NA | NA |
number_stories | reg_all | 14716.6215316 | 2141.6615697 | 0.0000000 | reg_slim | 9985.4837017 | 2121.1790827 | 0.0000025 |
pctdetached | reg_all | -240.1178379 | 102.6117756 | 0.0192947 | NA | NA | NA | NA |
pctmultifam_small | reg_all | 126.9913104 | 93.6901793 | 0.1752998 | NA | NA | NA | NA |
price_lag | reg_all | 0.3724456 | 0.0095447 | 0.0000000 | reg_slim | 0.3864421 | 0.0094745 | 0.0000000 |
total_area | reg_all | 2.4438817 | 0.3780751 | 0.0000000 | NA | NA | NA | NA |
total_livable_area | reg_all | 154.1511182 | 2.6404506 | 0.0000000 | reg_slim | 166.3417828 | 2.4510081 | 0.0000000 |
TotalPop | reg_all | -2.4162573 | 1.6397411 | 0.1406227 | NA | NA | NA | NA |
trees | reg_all | 24.8456495 | 12.0714046 | 0.0395873 | reg_slim | 30.4536953 | 11.4708127 | 0.0079426 |
year_built | reg_all | 84.7980919 | 24.6327387 | 0.0005781 | reg_slim | 180.5030307 | 22.2569316 | 0.0000000 |
Analyzing the mean absolute error (MAE) of the regression reveals its accuracy. At roughly $67,000, the error is not trivial considering the median sale price is $229,900. The mean absolute percent error (MAPE) confirms that the model errs by about 31% on average.
philly.test <- philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg_all, philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict) %>% # note: APE here is computed relative to the predicted price
filter(sale_price < 3500000)
#MAE
mae <- round(mean(philly.test$sale_price.AbsError, na.rm = TRUE), 2)
#MAPE
mape <- round(mean(philly.test$sale_price.APE, na.rm = TRUE) * 100, 2)
#Summary table of MAPE vs MAE
results_df <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Mean Absolute Percentage Error (MAPE)"),
Value = c(mae, paste0(mape, "%")) # Append "%" to MAPE value for percentage
)
kable(results_df) %>%
kable_material_dark(full_width = F) %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;")
Metric | Value |
---|---|
Mean Absolute Error (MAE) | 67312.68 |
Mean Absolute Percentage Error (MAPE) | 30.78% |
Visualizing the data can also be instrumental in diagnosing a model’s goodness of fit. Given how tightly aligned the observed and predicted prices are, we tested dozens of variable combinations to rule out overfitting. We are confident that our variables are generalizable and do not over-fit; a sketch of that comparison process follows.
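The loop below is a simplified sketch of that comparison process (the candidate formulas are illustrative, not the exact combinations we tried): each candidate is fit on the training set and scored by mean absolute error on the held-out test set, so a specification that merely memorizes the training data is penalized.
# Sketch: score candidate variable combinations on held-out data to check for overfitting
candidates <- list(
  lag_only = sale_price ~ price_lag,
  internal = sale_price ~ price_lag + total_livable_area + number_of_bathrooms + number_stories,
  local    = sale_price ~ price_lag + total_livable_area + number_of_bathrooms + number_stories +
             MedHHInc + trees + litterscore
)
test_mae <- sapply(candidates, function(f) {
  fit <- lm(f, data = as.data.frame(philly.training))
  pred <- predict(fit, newdata = as.data.frame(philly.test))
  mean(abs(pred - philly.test$sale_price), na.rm = TRUE) # test-set MAE
})
round(test_mae) # lower MAE on unseen data = better generalization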
# Predicted prices as a function of observed prices
ggplot(philly.test, aes(x = sale_price, y = sale_price.Predict)) +
geom_point(size = 2, alpha = 0.7, color = "#A5C4D4") +
stat_smooth(aes(sale_price, sale_price),
method = "lm", se = FALSE, size = 1, colour="#A5C4D4") +
stat_smooth(aes(sale_price.Predict, sale_price),
method = "lm", se = FALSE, size = 1, colour="#D52F4D") +
labs(
title = "Predicted vs Observed Prices",
subtitle = "Blue line is predicted. Red line is observed.",
x = "Observed Prices",
y = "Predicted Prices"
) +
plotTheme()
Cross-validation is a way to test the model on subsets of the data to see whether the results hold. Here we create 100 random folds and measure goodness of fit on each. If the mean absolute error of each fold were approximately the same, we could deduce that the model is likely to generalize to new data. Unfortunately, we do not see tightly clustered MAE. This spread-out distribution of MAE indicates that the folds produced different results. So while the model is potentially not over-fitting, it is potentially not powerful enough (consistent with the roughly 31% MAPE noted above).
## Cross Validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(sales2023.select) %>%
dplyr::select(var_all),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
reg.cv$resample[1:10,]
ggplot(as.data.frame(reg.cv$resample), aes(x = MAE)) +
geom_histogram(binwidth = 2500, fill = "#A5C4D4") +
geom_vline(aes(xintercept = mean(reg.cv$resample$MAE)), colour = "red", size = 1) +
labs(
title = "Distribution of MAE",
subtitle = "Observed MAE (red line)",
x = "Mean Absolute Error (MAE)",
y = "Count"
) +
plotTheme()
To understand the relationship between errors spatially, we mapped the error across Philadelphia. The map does not immediately indicate any clustering, however, a Moran’s I test takes a closer look at spatial clustering.
#Spatial Lags Error
# What is the relationship between errors? Are they clustered? Is the error of a single observation correlated with the error of nearby observations?
coords.test <- st_coordinates(philly.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
philly.test <- philly.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) # store the spatial lag of errors (previously computed but not assigned)
philly.test <- philly.test[is.finite(philly.test$sale_price.Error), ]
philly.test$sale_price.Error <- as.numeric(philly.test$sale_price.Error)
test_data <- philly.test %>%
group_by(Neighborhood) %>%
mutate(MeanPrice = mean(sale_price))
test_data <- test_data %>%
group_by(Neighborhood) %>%
mutate(MAPE = mean(sale_price.APE)) %>%
filter(MAPE > 0) %>%
filter(MAPE < 5)
ggplot() +
geom_sf(data = blocks21.sf, color = "grey70") +
geom_sf(data = test_data, aes(colour = q5(sale_price.Error)),
show.legend = "point", size = .25) +
scale_colour_manual(values = palette,
labels=qBr(test_data,"sale_price.Error"),
name="Error\nRanges") +
labs(title="Map of Price Absolute Errors") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
## Do Errors Cluster? Using Moran's I
moranTest <- moran.mc(philly.test$sale_price.Error,
spatialWeights.test, nsim = 999)
moranTest
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.005, fill = "#A5C4D4") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#D52F4D",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I of 0.073929 in red",
x="Moran's I",
y="Count") +
plotTheme()
The Moran’s I value of 0.08356 suggests that there is some spatial autocorrelation in the residuals, but it is not particularly strong: similar residuals do tend to cluster somewhat in space, but the effect is not pronounced. Under the alternative hypothesis “greater”, a significant result indicates that the observed spatial autocorrelation is stronger than what we’d expect by random chance. However, the fact that the Moran’s I value is close to zero means that nearby locations have only slightly more similar prediction errors than would be expected under a random spatial distribution.
So if the error is not explained spatially, perhaps it can be explained by differences in housing quality or other local intelligence features.
Additionally, when we examine whether the mean absolute percentage error clusters in any neighborhoods, we see that only two neighborhoods in North Philadelphia have somewhat higher MAPEs than the rest, at roughly 8% compared to less than 4% elsewhere. This indicates that the model generalizes well to new data.
#provide a map of mean absolute percentage error (MAPE) by neighborhood.
st_drop_geometry(philly.test) %>%
group_by(Neighborhood) %>%
summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(hoods, by = c("Neighborhood" = "Neighborhood")) %>%
st_sf() %>%
ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = philly.test, colour = "white", size = .5, alpha = 0.2) +
scale_fill_gradient(low = palette[1], high = palette[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood",
caption = "White dots represent each sale in 2023.") +
mapTheme()
#Provide a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood.
ggplot(test_data) +
geom_point(aes(MeanPrice, MAPE)) +
geom_smooth(method = "lm", aes(MeanPrice, MAPE), se = FALSE, colour = "#D52F4D") +
labs(title = "MAPE by Neighborhood as a function of mean price by Neighborhood",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
With the exception of Center City, which flipped relatively recently from a redlined “D”-grade residential area to an expensive one, housing prices are sticky. Areas like Chestnut Hill or Fairmount were highly valued historically, are today, and, according to this model, will continue to be in the future.
# Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”.
inTrain.both <- createDataPartition(
y = paste(sales2023$holc_grade, sales2023$historic, sales2023$Neighborhood),
p = .60, list = FALSE)
philly.training.both <- sales2023[inTrain.both,]
philly.test.both <- sales2023[-inTrain.both,]
reg_all.both <-
lm(sale_price ~ ., data = as.data.frame(philly.training.both) %>%
dplyr::select(var_all))
philly.test.both <- philly.test.both %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg_all.both, philly.test.both),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
filter(sale_price < 5000000)
ggplot() +
geom_sf(data = blocks21.sf, color = "gray70") +
geom_sf(data = philly.test.both, aes(colour = q5(sale_price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette,
labels=qBr(philly.test.both,"sale_price.Predict"),
name="Prices") +
labs(title="Predicted Sale Prices") +
guides(color = guide_legend(override.aes = list(size = 5))) +
mapTheme()
The income context of Philadelphia is important to consider when evaluating our model. When dividing Philadelphia into low- and high-income groups (based on the citywide median income), there is clear clustering of high incomes in the center and the northwest and northeast outskirts of the city, with lower incomes in the areas dividing the outer and inner city. Price absolute errors cluster by income more than by neighborhood boundary, which suggests the income of nearby residents is highly influential in the test set.
blocks21.sf <- blocks21.sf %>%
mutate(incomeContext = ifelse(MedHHInc > median(blocks21.sf$MedHHInc), "High Income", "Low Income"))
ggplot() + geom_sf(data = na.omit(blocks21.sf), aes(fill = incomeContext), color = "gray70") +
scale_fill_manual(values = c("#4B7740", "#A5C4D4"), name="Income") +
labs(title = "Median Income by Block Group") +
mapTheme() +
theme(legend.position="bottom")
st_join(philly.test, blocks21.sf) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context") %>%
kable_material_dark() %>%
column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
)
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 24% | 41% |
The study strongly suggests that local intelligence is an effective predictor of housing prices. While nearby housing prices are themselves a strong predictor, combining nearby prices with local intelligence is an even stronger one.
The variables in the study with the strongest association to sale price, such as livable area, neighborhood, median household income, tree density, and litter conditions, drive the regression’s goodness of fit. Other variables affect the model at the margins.
In Philadelphia, and likely in many major metro areas across the United States, downtown revitalization over the past 30 years disproportionately pushed housing prices up. It would be interesting to further develop this variable beyond redlining and today’s pricing to better capture how value is dispersed across modern U.S. cities. If we could weight factors that ultimately encourage or constrict development, such as FAR or the share of permits denied, it could make the neighborhood and historical pricing variables even stronger.
The current model predicts 74% of the variation in price. Spatially, we see that areas like Center City and Chestnut Hill will continue to achieve high sale prices, and the model predicts these trends particularly well. However, the model could be improved where it discounts the rapid rise in popularity and value in areas like Fishtown, where development is new and abundant. Spatially, this model does not account for where new development is in the works. Understanding how real estate development investment can drive housing prices up, while an oversupply could push prices down, is important to factor into our predictions, too. Perhaps the model could predict more variation in price if this data were split into sales and rentals.
Also important to note is the impact of internal characteristics on predicted price. While the model predicts reasonably well on livable area and features like fireplaces and number of bathrooms, it does not account for quality. Our MAE of around $70k could be explained by differing interior finishes between two otherwise identical properties. Variables that correct for quality could be an effective addition to the model.
We recommend that Zillow incorporate local intelligence into its Zestimate algorithm. Our model is a solid starting point that can certainly be improved upon with additional feature selection and engineering. While our model predicts fairly well, it could be improved in how evenly accurate predictions are dispersed across the city. Reducing absolute errors, and the spatial clustering of those errors, will be central to future iterations.
While amenities, disamenities, and neighborhood spatial characteristics are often far more predictive of home prices than internal characteristics, these relationships may differ between Philadelphia and other cities. The key is that disparity in these spatial characteristics may highlight where there are disparities, and in turn inform where price may change significantly. In a city where trees, litter, or income are uniformly distributed, these characteristics will not impact price. Price is, in many ways, a byproduct of disparities, creating tension between neighborhoods that have what is ultimately unavailable throughout the rest of a city.
We recommend future iterations of the model include variables that offer nuance between properties that may otherwise be identical in form and location. Variables that account for urban growth dynamics, such as where building permits are issued relative to current density, could help the model account for pricing trends that are not present today but may be in the future. Another example of accounting for the impact of future development in Philadelphia is distance to transit, since the entire trolley system is undergoing an extensive modernization. Lastly, variables that capture job centers, since people often choose their home based on access or proximity to jobs or potential jobs, could explain the pricing disparity between neighborhoods, especially those near downtown. We are confident that with this model, Zillow is off to a great start in evolving its Zestimate score.
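As one hypothetical example of such a feature, the sketch below computes each sale’s distance to the nearest trolley stop with sf. This is a minimal sketch, assuming a point layer named trolley_stops.sf (e.g., SEPTA stop locations) projected to the same CRS as the sales data; that layer is not part of this study.
# Sketch: hypothetical distance-to-transit feature.
# `trolley_stops.sf` is an assumed point layer in the same CRS as sales2023.
nearest_stop <- st_nearest_feature(sales2023, trolley_stops.sf)
sales2023 <- sales2023 %>%
mutate(dist_transit = as.numeric(
st_distance(sales2023, trolley_stops.sf[nearest_stop, ], by_element = TRUE)))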
We want to note where outliers were identified and subsequently filtered to ensure the results were not skewed. For sale prices, we removed any sales above $3,500,000, since this is significantly higher than even the most expensive homes. We filtered out sales with more than 4 garage spaces, because the larger values (over 70 in some cases) likely represent an entire building rather than an individual unit. Additionally, parks with high concentrations of trees likely created outlier tree-density values, when in reality the homes near them were not nearly as tree-covered; we therefore only included block groups with fewer than 1,000 trees, already a very high number. Finally, we only included homes where the total livable area was less than 6,500 square feet to remove extreme outliers that also likely represent entire buildings rather than one unit.
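A minimal sketch, assuming the column names used throughout this analysis, of how these thresholds could be applied with dplyr (the plots below mark each threshold in red):
# Sketch: apply the outlier thresholds described above.
sales2023 <- sales2023 %>%
filter(sale_price <= 3500000,
garage_spaces <= 4,
total_livable_area < 6500,
trees < 1000)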
# Scatterplots of candidate outlier variables; red lines mark the filtering thresholds
plot(sales2023$sale_price, ylim = c(0, 6000000), main="Sales Prices")
abline(h = 3500000, col = "#D52F4D")
plot(sales2023$garage_spaces, ylim = c(0, 80), main="Garage Spaces")
abline(h = 4, col = "#D52F4D")
plot(sales2023$total_livable_area, ylim = c(0, 150000), main="Total Livable Area")
abline(h = 6500, col = "#D52F4D")
plot(sales2023$trees, ylim = c(0, 10000), main="Trees")
abline(h = 1000, col = "#D52F4D")
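As a final diagnostic, we report generalized variance inflation factors (GVIF) to check for multicollinearity among the predictors. The table below hard-codes the reported values; a minimal sketch of how values like these could be produced, assuming reg_all is the fitted baseline lm and that the car package is available, follows. Note that categorical variables with many levels, such as Neighborhood, naturally produce very large raw GVIF values; for such terms car::vif() also reports the adjusted GVIF^(1/(2*Df)).
# Sketch: generalized VIF for the fitted baseline model.
# Assumes `reg_all` is the lm fit from the modelling step and the `car` package is installed.
library(car)
vif(reg_all)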
VIF_table <- data.frame(
Variable = c("Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built", "historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade"),
GVIF = c(9643.267543, 1.338493, 1.760822, 1.943907, 2.028490, 1.515672, 1.469092, 1.868998, 1.453886, 1.690424, 150.815224, 3.716039, 2.010839, 1.491015, 1.506094, 2.286827, 1.926168, 1.415946, 1.277872)
)
kable(VIF_table, caption = "VIF Values", format = "html") %>%
kable_styling()
Variable | GVIF |
---|---|
Neighborhood | 9643.267543 |
fireplaces | 1.338493 |
garage_spaces | 1.760822 |
number_of_bathrooms | 1.943907 |
number_of_bedrooms | 2.028490 |
number_stories | 1.515672 |
total_area | 1.469092 |
total_livable_area | 1.868998 |
year_built | 1.453886 |
historic | 1.690424 |
holc_grade | 150.815224 |
price_lag | 3.716039 |
trees | 2.010839 |
litterscore | 1.491015 |
TotalPop | 1.506094 |
MedHHInc | 2.286827 |
pctdetached | 1.926168 |
pctmultifam_small | 1.415946 |
hwy_grade | 1.277872 |