Overview

Local intelligence is a strong predictor of housing prices in Philadelphia that can optimize Zillow’s popular and consequential Zestimate tool. Real estate professionals as well as home buyers and sellers alike trust Zestimate to be an accurate prediction of local real estate values. Investing in the tool’s reliability stands to not only benefit Zillow’s customers, but the company’s dominant market position, too.

The following study explores a series of local intelligence variables that range from tree-lined street to local median income. A series of statistical processes reveal a handful of key variables that are then processed by prediction models. The best prediction model is then tested on 2023 sale price data for accuracy. The study is mindful of creating models that are generalizable across Philadelphia, and hopefully generalizable across major metro areas throughout the country.

Select and Load Data

Firstly, sales data is a tremendous resource in terms of the internal characteristics of each property which can certainly influence price. This study leverages data such as year built, total square footage, number of bathrooms and bedrooms to holistically understand what influences price.

However, the property itself does not tell the full story. Where it is located and the characteristics of that area are key, too. There are myriad local intelligence variables to chose from between a number of key open data sources (2021-2023), including: U.S. Census American Community Survey, Pennsylvania Spatial Data Access (PASDA), Philadelphia Open Data Portal, or from the Delaware Valley Regional Planning Commission (DVRPC). This study priorizited complete datasets that affect the city at large. For example, litter is a systemic issue whereas airports are very localized. The following table breaks down viable sources and from which we selected local intelligence variables that we deemed to be generalizable for the purposes of this study:

# Create a data frame with your data
data_frame <- data.frame(
  "Dataset" = c(
    "Floodplain Data",
    "Sanitation Centers",
    "Litter Index",
    "Demographics and Socioeconomic Data",
    "Housing Typology",
    "Trees",
    "Neighborhood",
    "Proximity to Highways",
    "School Catchment Performance",
    "Walkscore", 
    "1937 Redlining Score"
  ),
  "Generalizability" = c(
    "",   # Floodplain Data 
    "",   # Sanitation Centers 
    "✔",  # Litter Index 
    "✔",  # Demographics and Socioeconomic Data 
    "",   # Housing Typology 
    "✔",  # Trees 
    "✔",  # Neighborhood
    "✔",  # Proximity to Highways 
    "",   # School Catchment Performance 
    "",   # Walkscore 
    "✔"  # 1937 Redlining Score
  )
)


dt_url <- c("https://data-phl.opendata.arcgis.com/datasets/phl::fema-100-flood-plain/explore",
            "https://opendataphilly.org/datasets/sanitation-areas/",
            "https://opendataphilly.org/datasets/litter-index/",
            "https://www.census.gov/",
            "https://www.census.gov/",
            "https://opendataphilly.org/datasets/phillytreemap/",
           "https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=7047",
           "https://dvrpc-dvrpcgis.opendata.arcgis.com/datasets/dvrpcgis::freight-highway-network/explore",
           "https://www.philasd.org/performance/programsservices/open-data/school-information/",
           "http://walkscore.com/",
           "https://ncrc.org/holc/")

data_frame %>%
  mutate(Dataset = cell_spec((Dataset), "html", link = dt_url)) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("hover", "condensced"), full_width = FALSE, 
                position = "left") %>%
  column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
               )
Dataset Generalizability
Floodplain Data
Sanitation Centers
Litter Index
Demographics and Socioeconomic Data
Housing Typology
Trees
Neighborhood
Proximity to Highways
School Catchment Performance
Walkscore
1937 Redlining Score

Boundaries

From highways to rivers, man-made and natural boundaries alike delineate neighborhoods where certain people and types of businesses congregate. Neighborhood boundaries are arguably more organic than census block groups (which are optimized for consistent populations) or highway. Additionally, because residential sales are generalized by residential neighborhoods not by census block groups, neighborhoods (as defined by the Philly Planning Dept) are likely significant on local intelligence.You can see where areas of high income and low income cluster across Philadelphia, and how these boundaries are more recognizable at a neighbrhood level than block group level.

# plot sales data using median cut off  

sales_plot_mod <- sales2023 %>%
  filter(toPredict == "MODELLING")

median_sale_price <- median(sales_plot_mod$sale_price)
custom_colors <- ifelse(sales_plot_mod$sale_price <= median_sale_price, "#A5C4D4", "#4B7740")

sales_plot <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  labs(title = "Block Groups", color = "") +
  theme(legend.position = "none") + 
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(subtitle = "2021 Philadelphia Median \nSale Price is $229,900") +
  mapTheme()

#Bring in neighborhoods

hoods <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/PhillyPlanning_Neighborhoods/PhillyPlanning_Neighborhoods.shp") %>%
  select(NAME) %>%
  rename(Neighborhood = NAME)

sales2023 <- st_join(sales2023, hoods, join = st_within)

hood_plot <- ggplot() +
  geom_sf(data = blocks21.sf, fill = "gray90", color = "gray90") +
  geom_sf(data = hoods, color = "gray60") +
  geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  labs(title = "Neighborhoods") +
  theme(legend.position = "none") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(subtitle = "Neighborhoods as defined by Philadelphia \nPlanning Dept") +
  mapTheme()

#Bring in highways and apply buffer of 1/8 mile (660')

highways <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Freight_Highway_Network.geojson")

highways <- highways %>%
  filter(state == "PA") %>%
  st_transform('EPSG:2272')

highways_phl <-  st_intersection(highways, blocks21.sf) %>%
  select(geometry)

highway_buffer <- st_buffer(highways_phl, 660)
highway_buffer$hwy_grade <- 1

# Make a dummy variable where 1 is a house within highway buffer and 0 is outside of buffer. 

saleshighways <- st_intersection(sales2023, highway_buffer)

homes_near_hwys <- saleshighways$objectid  #creates a list using unique ids from saleshighways

sales2023 <- sales2023 %>%
  mutate(hwy_grade = ifelse(objectid %in% homes_near_hwys, 1, 0))

hwy_plot <- ggplot() + 
  geom_sf(data = blocks21.sf, color = "gray70") +
   geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  geom_sf(data = highway_buffer, aes(fill = "#36151E", color = "#36151E")) +
  scale_fill_identity(name = "Highway", labels = "#36151E") +
  scale_color_identity(name = "Highway", labels = "#36151E") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(title = "Highway Proximity") +
  labs(subtitle = "The highway buffer is 660 feet, \nabout 2 Philadelphia blocks.") +
  mapTheme()

#plot all three together

plot_grid(sales_plot, hood_plot, hwy_plot, ncol = 3, align = "v")

It is important to be aware of how the sales are distributed around the city. There are many block groups without any housing sales, whereas there are fewer neighborhoods where this is true because neighborhoods are residential in nature. Geographically, because neighborhoods are generally divided along residential vs industrial zoning lines, it is simpler to comprehend how it is generalizable across the city.

Interestingly, when considering the impact of highways in Philadelphia, it is reasonable to assess that it is not generalizable. However, given how consistently home prices drop below the median within proximity to the highway regardless of the sale is in Center City or Port Richmond, we decided that it is a variable worth further investigation.

Redlining and Price Clustering

In the 1930s, a federal agency, the Home Owners’ Loan Corporation (HOLC), created maps of mortgage risk which denied entire neighborhoods access to public lending benefits and overall economic opportunity. The institutionalize of marginalization of poor, Black communities ingrained persistent residential segregation that is still evidenced by income inequality across the city.

# Get and wrangle redlining data

redlining.sf <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Redlining.geojson") %>% 
  st_transform('EPSG:2272') %>% 
  filter(state == "PA", city == "Philadelphia") 

redlining.sf <- redlining.sf %>%
  dplyr::select(-c(holc_id, state, city, neighborhood_id, area_description_data, name)) %>%
  drop_na() 

# Join each house with historic redlining grade 

sales2023 <- st_join(sales2023, redlining.sf, join = st_within) 

sales2023 <- sales2023 %>% 
  dplyr::mutate(holc_grade = ifelse(is.na(holc_grade) == TRUE, "NA", holc_grade))

# Plotting Redlining Maps 

redlining_colors <- c("NA" = "gray85", "A" = "#197991", "B" = "#4B7740", "C" = "#E3D26F", "D" = "#D52F4D")
redlining_palette <- c("#197991", "#4B7740", "#E3D26F", "#D52F4D", "#A9233B")

redlining.sf.phl <- st_intersection(redlining.sf, blocks21.sf) #clip redlining to PHL boundaries
  
Redlining_plot <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = redlining.sf.phl, aes(fill = holc_grade, text = holc_grade), alpha = 0.8) +
  scale_fill_manual(values = redlining_colors, name = "HOLC Grade") +
  labs(title = "1937 HOLC Redlining", caption = "by 2020 Block Group") +
  theme(legend.position = "right") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  mapTheme() 

Income_plot <- ggplot() +
  geom_sf(data = blocks21.sf, color = "grey70") +
  geom_sf(data = sales2023, aes(colour = q5(MedHHInc)), 
          show.legend = "point", size = .5, alpha = 0.5) +
  scale_colour_manual(values = rev(redlining_palette),
                   labels=qBr(sales2023,"MedHHInc"),
                   name="Income\nBrackets") +
  labs(title="2023 Median Incomes", caption = "by 2020 Block Group") +
  theme(legend.position = "right") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  mapTheme()

plot_grid(Redlining_plot, Income_plot, ncol = 2, align = "v")

NCRC states that, “There is significantly greater economic inequality in cities where more of the HOLC graded high-risk or “Hazardous” areas are currently minority neighborhoods.” (NCRC) The impact of HOLC is generalizable. One exception to note is Center City where it has many high income earners despite its high-risk grade likely due to post-war downtown revitalization.

When considering the patterns of redlining and neighborhoods, it is likely that neighborhoods such as Center City will account for divergences in the redlining data, and neighborhoods such as Kensington will account for consistencies in the redlining data when we develop our definition of local intelligence.

## K NEAREST NEIGHBOR: PRICE

sales2023 <- sales2023 %>% 
  mutate(nb = st_knn(geometry, k = 5),
         wt = st_weights(nb),
         price_lag = st_lag(sale_price, nb, wt))

Price Lag : To systemically investigate price clustering beyond redlining and in today’s data, a function called “k-nearest neighbor” calculates a weighted average of the 5 nearest properties’ sales price. This price lag variable accounts Philadelphia’s ‘block-by-block’ dynamics borne from its density and long history where many uses from mansions to power plants have converged within mere feet of one another - but can drastically impact sale price potentially.

Environmental Conditions

Proximity to amenities and distance from disamenities certainly has a colloquial effect on housing prices. Broker descriptions love to tout, “2 Blocks from Park!” or “Tucked away, quiet street!” To test the effect of environmental data, the study inclues tree density and litter conditions.

## TREES by BLOCK GROUP

trees.sf <- st_read("https://opendata.arcgis.com/api/v3/datasets/30ef36e9e880468fa74e2d5b18da4cfb_0/downloads/data?format=geojson&spatialRefId=4326") %>%
  st_transform('EPSG:2272') 

trees.sf <- st_join(trees.sf, blocks21.sf, join = st_within)

trees.sf <- trees.sf %>%
  select(c(OBJECTID, TREE_NAME, YEAR, GEOID))

trees.sf <- trees.sf %>%
  group_by(GEOID) %>%  # census block group
  summarize(trees = n()) 

blocks21.sf.trees <- st_join(blocks21.sf, trees.sf, by = "GEOID")

blocks21.sf.trees <- mutate(blocks21.sf.trees, trees = ifelse(is.na(trees), 0, trees)) %>% 
  select(trees, GEOID.y) 

sales2023 <- sales2023 %>%
  st_join(blocks21.sf.trees, join = st_intersects)

trees_plot <- ggplot() +
  geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
  stat_density2d(data = data.frame(st_coordinates(trees.sf)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 30, geom = 'polygon') +
  scale_fill_gradient(
    high = "#4B7740", 
    low = "#ABCDA2", 
    name = "",
   breaks = c(0.0000000001, 0.000000001),  # Set the breaks for low and high
    labels = c("low", "high")) +  # Set the corresponding labels
  scale_alpha(range = c(0, .7), guide = "none") +
  labs(title = "Tree Density", color = "") +
  theme(legend.position = "bottom") + 
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(caption = "Tree Density is measured by \n number of trees in the block group") +
  mapTheme()

# NEAREST LITTER TO HOUSE

litterindex <- st_read("https://opendata.arcgis.com/datasets/04fa63e09b284dbfbde1983eab367319_0.geojson") %>%
  st_transform('EPSG:2272') %>% 
  rename(litterscore = HUNDRED_BLOCK_SCORE) %>% 
  rename(littergrade = SCORE_COLOR)

nearest_litter <- st_nearest_feature(sales2023, litterindex)
sales2023$litterscore <- litterindex$litterscore[nearest_litter]
sales2023$littergrade <- litterindex$littergrade[nearest_litter]

sales2023 <- sales2023 %>%
  mutate(litterscore = round(litterscore, 2))

litter_plot <- ggplot() +
  geom_sf(data = blocks21.sf, fill = "gray20", color = "gray40") +
  stat_density2d(data = data.frame(st_coordinates(litterindex)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 30, geom = 'polygon') +
  scale_fill_gradient(low = "#E3D26F", high = "#36151E", name = "",
                      breaks = c(0.0000000002, 0.0000000006),
                      labels = c("low", "high")) +
  scale_alpha(range = c(0, .7), guide = "none") +
  labs(title = "Litter Score", color = "") +
  theme(legend.position = "bottom") + 
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(caption = "Litter Score is based on the Litter Index \ndeveloped by the City of Philadelphia") +
  mapTheme()

# plot two maps
plot_grid(trees_plot, litter_plot, ncol = 2, align = "v")

Interestingly, these two environmental data points may send mixed signals. Where there are more trees, there is more litter. Since tree density and litter conditions send strong visual cues to potential buyers of the level of care of the neighborhood, the impact of these variables on the model may strengthen or dilute the results.

Build Prediction Models

“All models are wrong, but some are useful.” - George Box

Zestimate seeks to predict housing prices. Predicting city trends is fraught. Oftentimes unpredictable political decisions such as post-war state-sponsored suburbanization (aka white flight) buck trend predictions. There is no model than account for every unforeseeable scenario. However as previously discussed, useful models are generalizable. Building a useful prediction model requires feature selection and feature engineering. These features are then used to train a model to recognize patterns associated with sale price. The predictive model will use those patterns to produce a Zestimate. While the nearest prices of nearby sales are likely a fine starting point to build Zestimate, including local intelligence will certainly strengthen the product.

Categorize Variables

This study uses the hedonic model (reference) theoretical framework that deconstructs housing prices into three categories:

  1. Internal Characteristics
  2. Public Services and Disamenities
  3. Spatial Process of Prices
# SELECTING VARIABLES TO INCLUDE IN MODEL

## PREPPING SALES DATA INTO MODEL V CHALLENGE

sales2023.mod <- sales2023 %>%
  filter(toPredict == "MODELLING")

sales2023.chal <- sales2023 %>%
  filter(toPredict == "CHALLENGE")

all_var <- c("objectid", "census_tract","GEOID", "toPredict","sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")

sales2023.select <- sales2023.mod %>% 
  dplyr::select(all_of(all_var)) %>%
  na.omit()

sales2023.summary <- sales2023.select %>%
  st_drop_geometry() %>% 
  select(-c(objectid, census_tract, GEOID, toPredict))


# table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the `stargazer` package for this.

categories_to_include <- c("Amenities / Public Services", "Internal Characteristic", "Spatial Process")

#subsetting categories

#internal characteristics
InternalCharacteristics <- sales2023 %>%
  dplyr::select("fireplaces", "garage_spaces", 
          "number_of_bathrooms", "number_of_bedrooms", "number_stories", 
           "total_area", "total_livable_area", "year_built", "historic", "sale_price")

stargazer(as.data.frame(InternalCharacteristics), type="text", digits=1, title = "Internal Characteristics", out = "Training_PHLInternal.txt", omit.summary.stat = "N")

Internal Characteristics
=====================================================
Statistic             Mean    St. Dev.  Min    Max   
-----------------------------------------------------
fireplaces            0.05       0.3     0      5    
garage_spaces          0.4       0.7     0     72    
number_of_bathrooms    1.1       0.7     0      8    
number_of_bedrooms     2.6       1.3     0     31    
number_stories         2.0       0.6     1      5    
total_area           1,828.9   3,818.9   0   226,512 
total_livable_area   1,333.6    548.8    0   15,246  
year_built           1,936.7    51.1     0    2,024  
historic               1.9       0.4     1      3    
sale_price          271,846.8 240,083.4  0  5,990,000
-----------------------------------------------------
#amenities/public services

Amenities_Public_Services <- sales2023 %>%
  dplyr::select("Neighborhood", "trees", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "litterscore")

stargazer(as.data.frame(Amenities_Public_Services), type="text", digits=1, title = "Public Services and Disamenities", out = "Training_PHLInternal.txt", omit.summary.stat = "N")

Public Services and Disamenities
===============================================
Statistic           Mean   St. Dev. Min   Max  
-----------------------------------------------
trees              116.7    112.9    2   2,136 
TotalPop          1,445.2   708.6    0   4,518 
MedHHInc          56,282.7 39,156.1  0  250,001
pctdetached         9.0      12.7   0.0  100.0 
pctmultifam_small   12.4     12.2   0.0  75.0  
litterscore         1.8      0.6    0.0   4.0  
-----------------------------------------------
#spatial process

Spatial <- sales2023 %>%
  dplyr::select("price_lag")

stargazer(as.data.frame(Spatial), type="text", digits=1, title = "Spatial Process of Prices", out = "Training_PHLInternal.txt", omit.summary.stat = "N")

Spatial Process of Prices
==================================================
Statistic   Mean    St. Dev.    Min        Max    
--------------------------------------------------
price_lag 267,985.6 194,111.3 20,000.0 3,292,800.0
--------------------------------------------------

As we feature select and engineer, here are some important takeaways from the above summary tables:

  • “price_lag” is the weighted average of nearby sales. These values are closely mirrored by actual sale price. Nearby prices are likely very influential so it is important to select features that can tell another side of the story.
  • Sales price ranges considerably in Philadelphia, and identifying other data that contains the same variety may be influential to help inform spatial price variation.
  • The variation of trees is quite high, with a minimum of only 2 in some areas, and a maximum of nearly 2150. This may be error or it may be important disparity to account for.
  • Total Population is based on census block groups which are engineered to normalize to 600 to 3,000 people.

Test for Correlation

The goal is to select variables that most significantly correlate to sale price to include in the predictive model. Correlation is a type of association test. Are sale prices more closely associated to square footage or to number of bedrooms? These are the types of subtle but important distinction we want to seek out.

# CORRELATION MATRIX OF VARIABLES

numericVars <- sales2023.select %>%
  st_drop_geometry() %>%
  select(-toPredict, -GEOID, -objectid, -census_tract) %>%
  select(-holc_grade, -hwy_grade, -historic, -Neighborhood) #remove categorical variables for corrtest

price.corr <- corr.test(numericVars)

if (sum(is.na(numericVars)) > 0) {
  # Handle missing values (e.g., impute or remove)
  numericVars <- na.omit(numericVars)  # Remove rows with missing values
}

correlation_matrix <- round(cor(numericVars), 1)
p_values <- cor_pmat(numericVars)

ggcorrplot(
  correlation_matrix,
  p.mat = p_values,
  colors = c("#36151E", "white", "#197991"),
  type = "lower",
  insig = "blank"
) +
labs(title = "Degrees of Correlation to Sale Price") +
  annotate(
  geom = "rect",
  xmin = .5, xmax = 16, ymin = .5, ymax = 1.5,
  fill = "transparent", color = "red", alpha = 0.5
)

The correlation matrix revealed strong positive and negative correlations. These are the primary variables to note for further engineering:

  • Fireplaces
  • Number of bathrooms
  • Total livable area (sf)
  • Tree density
  • Median Household Income
  • Litter conditions
  • Price lag

It is important to note where variables correlate with each other as well. Price lag and median income had very strong positive correlation with one another, as did number of bathrooms and number of bedrooms. The number of bedrooms and the number of bathrooms are highly correlated to one another. Therefore to avoid multicollinearity, or two variables telling the same story, we will focus on number of bathrooms since it is also more correlated to sale price.

# str(sales2023.select)

# Create 4 home price correlation scatterplots that you think are of interest (using open data variables)

#MedHHInc, trees, price_lag, litterscore

st_drop_geometry(sales2023.select) %>% 
  dplyr::select(sale_price, MedHHInc, trees, price_lag, litterscore) %>%
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#D52F4D") +
  facet_wrap(~Variable, ncol = 2, scales = "free", labeller = labeller(Variable = c(price_lag = "Price Lag", MedHHInc = "Median Income", trees = "Tree Density", litterscore = "Litter Conditions"))) +
     labs(title = "Price as a function of select variables") +
     plotTheme()

The relationship between sale price and median income and tree density are similarly positive, where as litter is negative. The correlation between price lag and sale price is nearly 1, which indicates a very tight fit (as supported by the values in the categorizing summary table.)

Examine Spatially

# Create 3 maps of the 3 most interesting variables.

plot_MedHHInc <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = sales2023, aes(colour = q5(MedHHInc)), 
          show.legend = "point", size = .5) +
  scale_colour_manual(values = rev(palette),
                   labels=qBr(sales2023,"MedHHInc"),
                   name="Median\nIncome") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  mapTheme()

plot_price_lag <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = sales2023, aes(colour = q5(price_lag)), 
          show.legend = "point", size = .5) +
  scale_colour_manual(values = rev(palette),
                   labels=qBr(sales2023,"price_lag"),
                   name="Nearby\nPrice") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  mapTheme()

trees_plot <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  stat_density2d(data = data.frame(st_coordinates(trees.sf)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 50, geom = 'polygon') +
  scale_fill_gradient(
    low = "#ABCDA2",
    high = "#4B7740", 
    name = "Tree Density") +
  scale_alpha(range = c(0, .5), guide = "none") +
mapTheme()

plot_grid(plot_MedHHInc, plot_price_lag, trees_plot, align = "v", ncol = 2)

Spatially, different patterns emerge. For instance, Center City has high tree density, high median income, and high average nearby prices despite being historically marginalized by redlining in the 1930s. North Philadelphia experiences the inverse of these trends, as do many parts of West Philadelphia further from University City. Spatial exploration reinforces the breadth of variable selection.

Filter Data

#Filter out outliers discovered in summary statistics.
sales2023.select <- sales2023.select %>%
   filter(sale_price <= 3500000, garage_spaces < 4, total_livable_area < 6500,
          trees <= 1000)

After identifying correlated variables and looking at summary statistics, we also plotted the distribution of some key characteristics to identify any outliers possibly skewing the data. See Appendix A for scatterplots that visualize the outliers that were filtered out.

Data Filters:

  • Sale price <= $3,500,000
  • Trees < 1000
  • Garage Spaces < 4
  • Total Livable Area < 6,500 sf

Train and Test Prediction Models

The way the prediction model works is that we will train it to understand patterns about the relationship between sale price and our selected variables. Then we will test a portion of the data to see how well it performs. We use a 60:40 data partition to training and test, respectively.

#take note: all of these are categorical. Specifying categorical variables, fixed effects so the model is trained

inTrain <- createDataPartition(
              y = paste(sales2023.select$holc_grade, sales2023.select$historic, sales2023.select$Neighborhood),
              p = .60, list = FALSE)
philly.training <- sales2023.select[inTrain,] 
philly.test <- sales2023.select[-inTrain,]  

We established the variables of the model. Now, we need to formulate the best combination of those variables to predict sale price. To begin, we start with a simple model only based on price lag, and then we add and subtract variables and analyze if they strengthen or weaken the model.

Price Lag Only Model:

Considering the strong association between sale price and price lag, we developed a price lag only regression. The price lag only regression is statistically significant (p-value < 0.05) however its R-squared value indicates that approximately 57.34% of the variation in the dependent variable, sale_price, is explained by price lag.

Local Intelligence Model:

Let’s see how including local intelligence variables impacts the regression. A regression that includes a ‘kitchen sink’ of variables from our data sources including but not limited to fireplaces, number of bathrooms, redlining grade, trees, and litter performs better with an R-squared value that indicated approximately 73.44% of the variation in sale price is explained by local intelligence + price lag.

#Price Lag only model
reg.int <- lm(sale_price ~ price_lag, data = as.data.frame(philly.training))

#Local Intelligence Regression 
var_all <- c("sale_price", "Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built","historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade")

reg_all <-
  lm(sale_price ~ ., data = as.data.frame(philly.training) %>% 
                             dplyr::select(var_all))

To ensure that the Local Intelligence Regression does not have multicollinearity, or multiple values telling the same story about sale price, we use a series of tests. The first test, VIF, indicates that Neighborhood and Redlining Grade have very high VIF scores relative to other variables and are potentially conflicting. To further investigate these categorical (non-numeric) variables, we performed a chi-square test which reveals that they do have high collinearity.

#test for multicollinearity of reg_all
library(car)
vif(reg_all)
#neighborhood (10221.05) and holc_grade (153.67) have very high vif scores compared to other variables.

#checking chisquare on categorical for collinearity 
chi_test <- table(philly.training$Neighborhood, philly.training$holc_grade)
chisq.test(chi_test)

#remove holc_grade bc of high collinearity with Neighborhood

reg_slim <- lm(sale_price ~ Neighborhood + MedHHInc + price_lag + trees + number_of_bathrooms + total_livable_area + litterscore + number_stories + year_built, data = as.data.frame(philly.training))

Local Regression Slim Model:

The next regression iteration removed Redlining Grade as well as other variables that presented less association with sale price. This slimmed down regression performed marginally less well than the ‘kitchen sink’ with an R-Squared value of 72.97% variance explained.

R-Squared is not the only measure of fit. The second and third regressions have a lower AIC compared to price lag only is good sign of the models fit and leanness and meanness of including local intelligence generally. We also performed a likelihood ratio test which indicates a high chi-square for the ‘kitchen sink’ so while both are stat significant, the second regression seems to fit the sales data best.

AIC(reg.int, reg_all, reg_slim) #lower AIC compared to intercept is good sign of the models fit and leanness and meanness.

lrtest(reg.int, reg_all, reg_slim) #likelihood ratio test show a high chisq for reg_all so while both are stat significant, reg_all seems to fit the data best.

aic_table <- data.frame(
  Regression = c("Price Lag Only", "Local Intel", "Local Intel Slim"),
  AIC = c(373654.6, 367160.2, 367393.4))

chi_table <- data.frame(
  Regression = c("Local Intel", "Local Intel Slim"),
  Chi_Square = c(6818.45, 259.24),
  Significance = c("<0.05 ***", "<0.05 ***")
)
AIC Value per Regressions
Regression AIC
Price Lag Only 373654.6
Local Intel 367160.2
Local Intel Slim 367393.4
Chi-Square per Regression
Regression Chi_Square Significance
Local Intel 6818.45 <0.05 ***
Local Intel Slim 259.24 <0.05 ***

With all this in mind, the predictive model will be trained with second regression that features a breadth of local intelligence features.

Here is an in-depth comparative table between regressions:

#compare kitchen sink and slim regressions

coef_all <- summary(reg_all)$coef
coef_slim <- summary(reg_slim)$coef

coef_df_all <- data.frame(
  Model = c("reg_all"),
  Coefficients = rownames(coef_all),
  Estimate_all = coef_all[, 1],
  Std.Error_all = coef_all[, 2],
  p_value_all = coef_all[, 4])


coef_df_slim <- data.frame(
  Model = c("reg_slim"),
  Coefficients = rownames(coef_slim),
  Estimate_slim = coef_slim[, 1],
  Std.Error_slim = coef_slim[, 2],
  p_value_slim = coef_slim[, 4]
)

left_merged_kable <- merge(coef_df_all, coef_df_slim, by = "Coefficients", all.x = TRUE) %>%
  rename("Regression All" = Model.x, "Regression Slim" = Model.y)

kable(left_merged_kable, format = "html", caption = "Note: NAs are where variables were not included for both Regressions") %>% 
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "800px", height = "500px")
Note: NAs are where variables were not included for both Regressions
Coefficients Regression All Estimate_all Std.Error_all p_value_all Regression Slim Estimate_slim Std.Error_slim p_value_slim
(Intercept) reg_all -279476.7997428 49921.0374428 0.0000000 reg_slim -438474.3739475 46047.1181628 0.0000000
fireplaces reg_all 20055.8986415 3997.7262837 0.0000005 NA NA NA NA
garage_spaces reg_all 19949.7716871 2494.9113408 0.0000000 NA NA NA NA
historic reg_all 23109.8780154 2930.8283940 0.0000000 NA NA NA NA
holc_gradeB reg_all -10835.7003932 9343.1614304 0.2461719 NA NA NA NA
holc_gradeC reg_all -13681.0963243 10103.7213906 0.1757377 NA NA NA NA
holc_gradeD reg_all -722.0059079 11025.4149394 0.9477883 NA NA NA NA
holc_gradeNA reg_all -12411.2797850 11321.6474356 0.2729914 NA NA NA NA
hwy_grade reg_all 11971.1364543 6119.6278170 0.0504632 NA NA NA NA
litterscore reg_all -5267.4098876 1854.3596425 0.0045101 reg_slim -6582.2861356 1864.9903469 0.0004179
MedHHInc reg_all 0.2242095 0.0365627 0.0000000 reg_slim 0.2265662 0.0363103 0.0000000
NeighborhoodALLEGHENY WEST reg_all -65699.0082409 18729.5667850 0.0004533 reg_slim -78617.7264519 17418.9565627 0.0000064
NeighborhoodANDORRA reg_all -53626.6958289 26254.8610315 0.0411165 reg_slim -44715.5140164 25635.7775496 0.0811355
NeighborhoodASHTON reg_all -28074.8646915 27055.1471125 0.2994323 reg_slim -21237.8145788 27036.5642300 0.4321610
NeighborhoodBELLA VISTA reg_all 109341.7415589 20444.3351888 0.0000001 reg_slim 109732.1594553 18900.9530092 0.0000000
NeighborhoodBELLS CORNER reg_all -11101.2503662 20319.8413197 0.5848509 reg_slim 472.3591262 19253.3601216 0.9804271
NeighborhoodBELMONT reg_all -104943.0995189 23728.9404496 0.0000098 reg_slim -99477.3256346 22529.7147042 0.0000102
NeighborhoodBLUE BELL HILL reg_all -19560.9212990 43516.3565427 0.6530727 reg_slim -29843.8683334 42495.9151292 0.4825187
NeighborhoodBREWERYTOWN reg_all -37155.0836387 19201.3073030 0.0530071 reg_slim -40248.2425404 17662.9162070 0.0227008
NeighborhoodBRIDESBURG reg_all -31590.5782913 20265.8118929 0.1190638 reg_slim -37506.9338406 19105.0689525 0.0496437
NeighborhoodBROOKHAVEN reg_all -25643.5512275 37037.4650217 0.4887179 reg_slim -36185.3979561 37106.0231729 0.3294833
NeighborhoodBURHOLME reg_all -24016.7373695 19790.1842027 0.2249333 reg_slim -15227.1161084 18762.0188575 0.4170393
NeighborhoodBUSTLETON reg_all -9641.0202861 15771.9220198 0.5410259 reg_slim -5478.2779982 15888.7496221 0.7302577
NeighborhoodBYBERRY reg_all -112753.9122773 42597.9178231 0.0081317 reg_slim -90519.4614556 42731.0380078 0.0341628
NeighborhoodCABOT reg_all -66390.5445712 32361.7174698 0.0402356 reg_slim -61102.1363083 31623.8571900 0.0533609
NeighborhoodCALLOWHILL/CHINATOWN NORTH reg_all 64796.7274978 52473.9379988 0.2169123 reg_slim 42291.2166366 52562.5805279 0.4210713
NeighborhoodCARROLL PARK reg_all -86801.7567426 19289.6556555 0.0000069 reg_slim -93463.2521295 18069.1847659 0.0000002
NeighborhoodCASTOR GARDENS reg_all -48306.9491716 17752.2230931 0.0065131 reg_slim -42782.4758428 16436.3163278 0.0092531
NeighborhoodCATHEDRAL PARK reg_all -111863.9583836 32949.0487008 0.0006881 reg_slim -109757.2304376 32522.5898011 0.0007407
NeighborhoodCECIL B MOORE reg_all -125200.1873270 37616.8907321 0.0008761 reg_slim -128530.2754965 37044.9827495 0.0005229
NeighborhoodCEDAR PARK reg_all 4384.1862665 22747.0747485 0.8471683 reg_slim -7722.5142192 21661.3460154 0.7214631
NeighborhoodCEDARBROOK reg_all -65929.5791199 18345.7486857 0.0003271 reg_slim -57027.3432684 17052.2769512 0.0008272
NeighborhoodCENTRAL ROXBOROUGH reg_all -24678.6207347 17467.1426137 0.1577194 reg_slim -29885.6148045 16460.6246258 0.0694561
NeighborhoodCHESTNUT HILL reg_all 44483.1561038 21482.4802113 0.0384080 reg_slim 54474.7203282 20436.5000158 0.0076949
NeighborhoodCHINATOWN reg_all 54473.7351716 113146.6730846 0.6302089 reg_slim 58420.8113420 113987.5697621 0.6082959
NeighborhoodCOBBS CREEK reg_all -63981.5444107 17103.1741195 0.0001841 reg_slim -73888.9848888 15646.3033466 0.0000024
NeighborhoodCRESTMONT FARMS reg_all 26671.9898982 80921.8723036 0.7417059 reg_slim 65732.5487908 81301.3620753 0.4188138
NeighborhoodDUNLAP reg_all -131376.8407742 29913.8524480 0.0000113 reg_slim -127396.6883558 29029.1342570 0.0000115
NeighborhoodEAST FALLS reg_all -10064.7122781 19046.6508455 0.5972134 reg_slim -4640.7162482 18010.9143413 0.7966721
NeighborhoodEAST GERMANTOWN reg_all -72501.7393975 18674.9157549 0.0001039 reg_slim -82276.2877788 17485.3042865 0.0000026
NeighborhoodEAST KENSINGTON reg_all 3842.8712503 19795.2958595 0.8460765 reg_slim -4047.2779155 18179.3298099 0.8238262
NeighborhoodEAST MT. AIRY reg_all -84649.9436481 18006.7362165 0.0000026 reg_slim -76847.5754248 16628.0146949 0.0000038
NeighborhoodEAST OAK LANE reg_all -135938.3268446 23219.4068366 0.0000000 reg_slim -123609.7530494 21645.8535522 0.0000000
NeighborhoodEAST PARKSIDE reg_all -100322.0251927 26892.2036152 0.0001918 reg_slim -92772.1299044 26099.1750812 0.0003798
NeighborhoodEAST POPLAR reg_all -396.3533220 25894.7415980 0.9877880 reg_slim 5349.4991033 24787.5065327 0.8291356
NeighborhoodEAST TIOGA reg_all -145239.0794312 24910.4635948 0.0000000 reg_slim -157026.6561701 24153.3908417 0.0000000
NeighborhoodEAST TORRESDALE reg_all -9238.6538341 19994.9159892 0.6440526 reg_slim -9220.9858850 18982.4285427 0.6271420
NeighborhoodEASTWICK reg_all -69330.5450210 19278.0545896 0.0003238 reg_slim -57477.6195245 17776.9217236 0.0012267
NeighborhoodFAIRHILL reg_all -96214.4568443 20670.3893806 0.0000033 reg_slim -90795.1948537 19240.7577481 0.0000024
NeighborhoodFAIRMOUNT reg_all 79327.9472266 18413.3432530 0.0000166 reg_slim 83890.7617514 16454.2592408 0.0000003
NeighborhoodFELTONVILLE reg_all -81145.2928779 18933.2400007 0.0000183 reg_slim -81710.8299610 17715.1715893 0.0000040
NeighborhoodFERN ROCK reg_all -114447.5449107 25274.8155160 0.0000060 reg_slim -111728.1119313 24495.6166454 0.0000051
NeighborhoodFISHTOWN reg_all 48414.4583697 18283.4661780 0.0081063 reg_slim 47262.5499366 16328.9406343 0.0038047
NeighborhoodFORGOTTEN BLOCKS reg_all -61459.2515594 30873.7728943 0.0465376 reg_slim -66502.6183704 30329.3555583 0.0283469
NeighborhoodFOX CHASE reg_all -15096.7930131 17800.1156192 0.3963808 reg_slim -1286.2195882 17086.1237506 0.9399941
NeighborhoodFRANCISVILLE reg_all 9491.7041660 22289.3592951 0.6702309 reg_slim 22082.6082104 20669.0857723 0.2853640
NeighborhoodFRANKFORD reg_all -84273.7627814 16879.8316751 0.0000006 reg_slim -85196.4547140 15663.0238513 0.0000001
NeighborhoodFRANKFORD VALLEY reg_all -61743.8908957 21658.7989057 0.0043680 reg_slim -61348.2413654 20534.9453323 0.0028176
NeighborhoodGARDEN COURT reg_all 13303.9180466 28824.4160347 0.6444111 reg_slim 5972.4879777 28002.9007256 0.8311109
NeighborhoodGERMANTOWN reg_all -84269.4532083 17533.2526101 0.0000016 reg_slim -98075.7117969 16307.5649915 0.0000000
NeighborhoodGIRARD ESTATE reg_all 27630.9647696 30103.8360773 0.3587106 reg_slim 25135.7686960 29004.6716397 0.3861701
NeighborhoodGRAYS FERRY reg_all -46500.9043247 18386.4996740 0.0114472 reg_slim -41052.4016138 16602.8275385 0.0134247
NeighborhoodGREEN HILL FARMS reg_all -62868.8578340 53669.2962803 0.2414533 reg_slim -39580.3141921 53011.6575023 0.4552971
NeighborhoodHADDINGTON reg_all -91683.4876282 18590.7261149 0.0000008 reg_slim -88555.3769391 16899.6676403 0.0000002
NeighborhoodHARROWGATE reg_all -89954.8379758 18701.3055660 0.0000015 reg_slim -101149.6355978 17427.3601149 0.0000000
NeighborhoodHAWTHORNE reg_all 133391.3859209 27877.4896995 0.0000017 reg_slim 125888.7463502 26813.6237218 0.0000027
NeighborhoodHOLMESBURG reg_all -33030.8356079 18593.0781973 0.0756704 reg_slim -26684.3124741 17557.2450653 0.1285724
NeighborhoodHUNTING PARK reg_all -81830.7502200 18391.2804390 0.0000087 reg_slim -92245.6934911 17198.9716102 0.0000001
NeighborhoodHUNTING PARK INDUSTRIAL AREA reg_all -73860.0496874 43012.0188192 0.0859658 reg_slim -66750.3697964 42542.8981498 0.1166676
NeighborhoodJUNIATA PARK reg_all -58390.1274655 17809.2090079 0.0010456 reg_slim -53327.0794471 16662.7614704 0.0013756
NeighborhoodKENSINGTON reg_all -73936.6742166 17348.4892080 0.0000204 reg_slim -73332.2092400 15592.5003391 0.0000026
NeighborhoodKENSINGTON SOUTH reg_all 10660.3204395 22851.3446951 0.6408597 reg_slim 18252.7169842 21411.5511002 0.3939677
NeighborhoodKINGSESSING reg_all -71078.0125816 17397.6572345 0.0000442 reg_slim -79416.5903308 16002.6230213 0.0000007
NeighborhoodLAWNCREST reg_all -66141.4939686 18187.8036255 0.0002773 reg_slim -61558.5398941 17183.5961824 0.0003416
NeighborhoodLAWNDALE reg_all -60464.6773431 20705.3245826 0.0035032 reg_slim -49620.1552953 19593.2855456 0.0113359
NeighborhoodLEXINGTON reg_all 31966.1172131 25441.7068502 0.2089757 reg_slim 47245.4347744 24736.5221664 0.0561607
NeighborhoodLOGAN reg_all -116238.0022375 18566.9238177 0.0000000 reg_slim -119130.5563289 17242.2603564 0.0000000
NeighborhoodLOGAN SQUARE reg_all 62250.8192691 19952.7344864 0.0018127 reg_slim 69017.0749668 19439.3639828 0.0003860
NeighborhoodLUDLOW reg_all -52808.3703905 34230.8263943 0.1229230 reg_slim -49129.3421680 33592.6793599 0.1436265
NeighborhoodMANAYUNK reg_all -32849.5864851 17971.4593132 0.0675897 reg_slim -38507.7436252 16243.9220598 0.0177731
NeighborhoodMANTUA reg_all -62417.2976179 23852.8097917 0.0088863 reg_slim -54662.6443479 22720.9492087 0.0161489
NeighborhoodMAYFAIR reg_all -45574.5756867 16531.8817941 0.0058452 reg_slim -42219.6065777 15042.8142378 0.0050132
NeighborhoodMELROSE PARK GARDENS reg_all -74822.7659669 26573.9915337 0.0048749 reg_slim -62746.8330345 25612.5451040 0.0143039
NeighborhoodMILL CREEK reg_all -86286.9060187 21098.8747222 0.0000434 reg_slim -80718.0017347 19782.3318687 0.0000452
NeighborhoodMILLBROOK reg_all -18986.0464090 18148.4839895 0.2955098 reg_slim -17517.1150141 18052.9296840 0.3319040
NeighborhoodMORRELL PARK reg_all -23855.7611183 17240.1643011 0.1664629 reg_slim -25368.5216646 17078.4265040 0.1374571
NeighborhoodMORRIS PARK reg_all -74035.2260614 26778.6391821 0.0057048 reg_slim -71006.2224841 26038.1396081 0.0063992
NeighborhoodNICETOWN reg_all -95514.7522678 21504.8131378 0.0000090 reg_slim -97862.6975851 20342.2917389 0.0000015
NeighborhoodNORMANDY reg_all -504.9652983 38519.3590820 0.9895407 reg_slim -15862.4640491 38426.4109184 0.6797587
NeighborhoodNORRIS SQUARE reg_all -32424.5379200 24924.5633427 0.1933119 reg_slim -29401.7336867 23813.2494526 0.2169707
NeighborhoodNORTH CENTRAL reg_all -80050.1273523 18480.3989338 0.0000149 reg_slim -91827.5689139 16912.6671736 0.0000001
NeighborhoodNORTH DELAWARE reg_all -125249.9113691 66795.9599215 0.0607983 reg_slim -122275.9732774 66957.9079468 0.0678474
NeighborhoodNORTH PHILA. reg_all -98768.9359883 20987.1208531 0.0000025 reg_slim -101628.8866876 19625.7242436 0.0000002
NeighborhoodNORTHERN LIBERTIES reg_all 19683.2302778 21162.6740486 0.3523397 reg_slim 31407.2161736 19552.8861153 0.1082373
NeighborhoodNORTHWOOD reg_all -109857.7042768 27591.4371209 0.0000688 reg_slim -113248.4337823 26795.2691100 0.0000239
NeighborhoodOGONTZ reg_all -73306.8467902 18448.2118389 0.0000711 reg_slim -76392.5281295 17397.5640026 0.0000114
NeighborhoodOLD CITY reg_all -56147.6960712 113483.4492850 0.6207735 reg_slim -37773.1214624 114171.5101767 0.7407662
NeighborhoodOLDE KENSINGTON reg_all -15334.7403707 26307.0304497 0.5599606 reg_slim -14731.6502724 25124.7817587 0.5576570
NeighborhoodOLDE RICHMOND reg_all 15869.4741928 19363.6574097 0.4124871 reg_slim 10452.8543907 17672.2644171 0.5542061
NeighborhoodOLNEY reg_all -81051.0993539 17944.4988330 0.0000063 reg_slim -78219.1692285 15853.6904599 0.0000008
NeighborhoodOVERBROOK reg_all -95547.5871636 18894.8980732 0.0000004 reg_slim -95919.6370296 17585.7097417 0.0000000
NeighborhoodOVERBROOK FARMS reg_all -172684.0617529 32469.4522634 0.0000001 reg_slim -161275.5734957 30826.6156069 0.0000002
NeighborhoodOVERBROOK PARK reg_all -56451.7955031 22055.6680773 0.0104924 reg_slim -41076.3705219 20973.4106451 0.0501917
NeighborhoodOXFORD CIRCLE reg_all -66624.4141888 18468.3637994 0.0003103 reg_slim -57970.9662711 17066.7606727 0.0006839
NeighborhoodPACKER PARK reg_all -14381.8578155 24796.3898288 0.5619253 reg_slim -11369.4008805 24363.2785674 0.6407495
NeighborhoodPARADISE reg_all -53663.1677050 34960.8200870 0.1248185 reg_slim -56603.4633900 34411.6245983 0.1000144
NeighborhoodPARKWOOD reg_all -19199.1484625 17129.8742246 0.2623928 reg_slim -23270.8506505 16913.4072254 0.1688804
NeighborhoodPENNSPORT reg_all -6582.6538182 20130.6767112 0.7436757 reg_slim -4228.3511949 18204.7033291 0.8163341
NeighborhoodPENNYPACK WOODS reg_all 41960.6435048 113129.5540446 0.7107118 reg_slim 65227.5971361 114039.8416658 0.5673502
NeighborhoodPOINT BREEZE reg_all -8088.4612524 17309.9662729 0.6403127 reg_slim -4245.1828359 15356.1035523 0.7822072
NeighborhoodPORT RICHMOND reg_all -29512.1404750 17679.4934967 0.0950832 reg_slim -24532.4248997 15779.1562431 0.1200317
NeighborhoodPOWELTON VILLAGE reg_all -34605.9683188 52740.6332845 0.5117360 reg_slim -24015.8057978 52593.9763556 0.6479466
NeighborhoodQUEEN VILLAGE reg_all 93695.8143232 19964.4964507 0.0000027 reg_slim 98978.3461354 18273.0512294 0.0000001
NeighborhoodRHAWNHURST reg_all -3373.9229909 18579.8775991 0.8559070 reg_slim 9991.8375656 17428.9718079 0.5664583
NeighborhoodRITTENHOUSE SQ. reg_all 293773.0507151 18553.8054547 0.0000000 reg_slim 280872.1289137 17205.4170120 0.0000000
NeighborhoodSAUNDERS PARK reg_all -39282.6002949 48618.1316448 0.4191145 reg_slim -26126.8786552 48337.2458873 0.5888524
NeighborhoodSHARSWOOD reg_all -54367.5947877 24525.1100208 0.0266520 reg_slim -53499.8752214 23426.1196254 0.0224001
NeighborhoodSHAWMONT VALLEY reg_all -58495.1417862 40407.3922643 0.1477418 reg_slim -74499.8180405 40408.4212184 0.0652524
NeighborhoodSOCIETY HILL reg_all 223310.1849393 20144.0701817 0.0000000 reg_slim 218679.0227241 18190.9236294 0.0000000
NeighborhoodSOMERTON reg_all -2127.4648236 15935.0539383 0.8937932 reg_slim -3223.6079820 16035.2828123 0.8406763
NeighborhoodSOUTH PHILADELPHIA reg_all 8559.6425368 16568.4172324 0.6054269 reg_slim 9403.7046014 14640.7064377 0.5206900
NeighborhoodSOUTHWEST reg_all -83920.8752636 17066.3882595 0.0000009 reg_slim -82397.6901457 15812.5347453 0.0000002
NeighborhoodSOUTHWEST CENTER CITY reg_all 106732.5225133 18097.3078312 0.0000000 reg_slim 108779.3328829 16342.7869898 0.0000000
NeighborhoodSPRING GARDEN reg_all 77603.7536995 20260.6419528 0.0001286 reg_slim 87608.5166849 18407.0289858 0.0000020
NeighborhoodSPRUCE HILL reg_all 55453.9418361 26775.6943263 0.0383721 reg_slim 32932.5384605 25868.4027772 0.2030111
NeighborhoodST. HUGH reg_all -84376.6831640 20806.9308011 0.0000504 reg_slim -95639.2045120 19780.8709592 0.0000013
NeighborhoodSTRAWBERRY MANSION reg_all -80492.5823381 18714.0649261 0.0000171 reg_slim -96841.8354347 17115.6261400 0.0000000
NeighborhoodSUMMERDALE reg_all -70043.7232086 20226.3818041 0.0005358 reg_slim -61924.7207289 19163.2225759 0.0012346
NeighborhoodTACONY reg_all -60778.0340523 17277.0842742 0.0004365 reg_slim -60567.8269538 16140.5218212 0.0001758
NeighborhoodTIOGA reg_all -136940.3919822 19499.4280564 0.0000000 reg_slim -139746.6052022 18276.5480249 0.0000000
NeighborhoodUPPER HOLMESBURG reg_all -49117.2082574 22504.3607007 0.0290846 reg_slim -43100.4886552 21441.6702430 0.0444370
NeighborhoodUPPER NORTHWOOD reg_all -61942.4415593 20597.1073621 0.0026402 reg_slim -55845.8370408 19470.8696908 0.0041347
NeighborhoodUPPER ROXBOROUGH reg_all -39151.3185303 18183.7650762 0.0313286 reg_slim -29856.4679494 17064.3364054 0.0802023
NeighborhoodWALNUT HILL reg_all -39096.6410894 26657.3504243 0.1424976 reg_slim -43046.8258342 25972.3015942 0.0974586
NeighborhoodWALTON PARK reg_all -44212.4614591 30178.1304628 0.1429300 reg_slim -44753.7876904 30156.7736352 0.1378216
NeighborhoodWASHINGTON SQUARE WEST reg_all 168590.0820375 21203.4461655 0.0000000 reg_slim 169462.6339847 19569.5146383 0.0000000
NeighborhoodWEST FAIRHILL reg_all -127159.4012378 22247.9919633 0.0000000 reg_slim -124355.6183092 21180.6315858 0.0000000
NeighborhoodWEST KENSINGTON reg_all -47831.8556038 25278.1137276 0.0584819 reg_slim -49498.3790600 24372.1903475 0.0422804
NeighborhoodWEST MT. AIRY reg_all -63258.4744964 19759.4608285 0.0013706 reg_slim -54813.9137262 18667.2941880 0.0033263
NeighborhoodWEST OAK LANE reg_all -80596.4719677 17264.1880540 0.0000031 reg_slim -74889.6295475 15868.7867259 0.0000024
NeighborhoodWEST PARKSIDE reg_all -109853.3050213 36268.3923754 0.0024590 reg_slim -104608.8946361 35595.3540711 0.0032999
NeighborhoodWEST POPLAR reg_all -33032.1518695 36219.7385445 0.3617882 reg_slim -29878.3463061 35614.7995362 0.4015227
NeighborhoodWEST POWELTON reg_all -2738.1033258 33851.4863790 0.9355340 reg_slim -7886.4000773 33410.0426846 0.8133983
NeighborhoodWEST SHORE reg_all -23791.9480650 45311.9041412 0.5995425 reg_slim -23661.0629655 45071.5019339 0.5996143
NeighborhoodWHITAKER reg_all -79767.1412993 30923.3177517 0.0099041 reg_slim -81264.4461404 30921.6132196 0.0085963
NeighborhoodWHITMAN reg_all -9235.2623425 19762.5668074 0.6402848 reg_slim -8444.9175050 18249.7158755 0.6435563
NeighborhoodWINCHESTER reg_all -8141.1311035 19358.8864929 0.6740996 reg_slim 429.7532674 19126.7541313 0.9820744
NeighborhoodWINCHESTER PARK reg_all -16752.4470116 28317.5345494 0.5541330 reg_slim -12265.6333895 28463.6971471 0.6665318
NeighborhoodWISSAHICKON reg_all -7434.7842408 19752.4061702 0.7066261 reg_slim -18025.2254291 18681.4020772 0.3346240
NeighborhoodWISSINOMING reg_all -54146.6679531 17587.0174024 0.0020825 reg_slim -53416.9769139 16467.4435297 0.0011823
NeighborhoodWYNNEFIELD reg_all -124078.7498991 19788.7494761 0.0000000 reg_slim -116139.9060326 18971.3825342 0.0000000
NeighborhoodWYNNEFIELD HEIGHTS reg_all -25152.6556959 24626.0660526 0.3070906 reg_slim -25399.4422527 24507.1267002 0.3000287
NeighborhoodYORKTOWN reg_all -51085.7142768 36172.9040971 0.1578945 reg_slim -36278.3861451 35620.8263948 0.3084772
number_of_bathrooms reg_all 42042.8129161 2030.9653938 0.0000000 reg_slim 38010.2868205 1604.1250442 0.0000000
number_of_bedrooms reg_all -5425.9847155 1070.0491463 0.0000004 NA NA NA NA
number_stories reg_all 14716.6215316 2141.6615697 0.0000000 reg_slim 9985.4837017 2121.1790827 0.0000025
pctdetached reg_all -240.1178379 102.6117756 0.0192947 NA NA NA NA
pctmultifam_small reg_all 126.9913104 93.6901793 0.1752998 NA NA NA NA
price_lag reg_all 0.3724456 0.0095447 0.0000000 reg_slim 0.3864421 0.0094745 0.0000000
total_area reg_all 2.4438817 0.3780751 0.0000000 NA NA NA NA
total_livable_area reg_all 154.1511182 2.6404506 0.0000000 reg_slim 166.3417828 2.4510081 0.0000000
TotalPop reg_all -2.4162573 1.6397411 0.1406227 NA NA NA NA
trees reg_all 24.8456495 12.0714046 0.0395873 reg_slim 30.4536953 11.4708127 0.0079426
year_built reg_all 84.7980919 24.6327387 0.0005781 reg_slim 180.5030307 22.2569316 0.0000000

Understand Errors

Analyzing the mean absolute error of the regression reveals the regression’s accuracy. At $67,083, the error is not trivial considering the mean sale price is $229,000. The Mean Absolute Percent Error confirms that the model errs by 38%.

philly.test <- philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg_all, philly.test), 
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 3500000)

#MAE
mae <- round(mean(philly.test$sale_price.AbsError, na.rm = TRUE), 2)

#MAPE
mape <- round(mean(philly.test$sale_price.APE, na.rm = TRUE) * 100, 2)  

#Summary table of MAPE vs MAE 
results_df <- data.frame(
  Metric = c("Mean Absolute Error (MAE)", "Mean Absolute Percentage Error (MAPE)"),
  Value = c(mae, paste0(mape, "%"))  # Append "%" to MAPE value for percentage
)

kable(results_df) %>%
  kable_material_dark(full_width = F) %>%
  column_spec(column = 2, width = "20%", extra_css = "text-align:center;")
Metric Value
Mean Absolute Error (MAE) 67312.68
Mean Absolute Percentage Error (MAPE) 30.78%

Visualizing the data can also be instrumental in diagnosing the goodness of a model’s fit. Given how tightly aligned the observed and predicted prices are we performed dozens of variable combinations to rule out over fitting. We are confident that our variables are generalizable and do not over-fit.

# Predicted prices as a function of observed prices

ggplot(philly.test, aes(x = sale_price, y = sale_price.Predict)) +
  geom_point(size = 2, alpha = 0.7, color = "#A5C4D4") +
stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="#A5C4D4") + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="#D52F4D") +  
  labs(
    title = "Predicted vs Observed Prices",
    subtitle = "Blue line is predicted. Red line is observed.",
    x = "Observed Prices",
    y = "Predicted Prices"
  ) +
  plotTheme()

Cross Validation

Cross validation is a way to test out the model on a subset of the data to see if the results hold. How this test works, is we create 100 random subsets to measure the goodness of fit. If the mean absolute error of each subset was approximately the same, we could deduce that the model is likely to generalize to new data. Unfortunately, we do not see clustered MAE. This spread out distribution of MAE indicates that the subsets produced different results. So while the model is potentially not over-fitting, it is potentially not powerful enough (which is confirmed by the high 38% MAE).

## Cross Validation

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(sales2023.select) %>% 
                                dplyr::select(var_all), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv

reg.cv$resample[1:10,]

ggplot(as.data.frame(reg.cv$resample), aes(x = MAE)) +
  geom_histogram(binwidth = 2500, fill = "#A5C4D4") +
  geom_vline(aes(xintercept = mean(reg.cv$resample$MAE)), colour = "red", size = 1) +
  labs(
    title = "Distribution of MAE",
    subtitle = "Observed MAE (red line)",
    x = "Mean Absolute Error (MAE)",
    y = "Count"
  ) +
  plotTheme()

Spatial Lag Errors

To understand the relationship between errors spatially, we mapped the error across Philadelphia. The map does not immediately indicate any clustering, however, a Moran’s I test takes a closer look at spatial clustering.

#Spatial Lags Error

# What is the relationship between errors? Are they clustered? Is the error of a single observation correlated with the error of nearby observations?

coords.test <-  st_coordinates(philly.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) 

philly.test <- philly.test[is.finite(philly.test$sale_price.Error), ]
philly.test$sale_price.Error <- as.numeric(philly.test$sale_price.Error)

test_data <- philly.test %>%
  group_by(Neighborhood) %>%
  mutate(MeanPrice = mean(sale_price))

test_data <- test_data %>%
  group_by(Neighborhood) %>%
  mutate(MAPE = mean(sale_price.APE)) %>%
  filter(MAPE > 0) %>% 
  filter(MAPE < 5)

ggplot() +
  geom_sf(data = blocks21.sf, color = "grey70") +
  geom_sf(data = test_data, aes(colour = q5(sale_price.Error)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = palette,
                   labels=qBr(test_data,"sale_price.Error"),
                   name="Error\nRanges") +
  labs(title="Map of Price Absolute Errors") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  mapTheme()

## Do Errors Cluster? Using Moran's I

moranTest <- moran.mc(philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)
moranTest

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.005, fill = "#A5C4D4") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#D52F4D",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I of 0.073929 in red",
       x="Moran's I",
       y="Count") +
  plotTheme()

The Moran’s I value of 0.08356 suggests that there is some spatial autocorrelation in the residuals, but it’s not particularly strong. In other words, while similar values (residuals) do tend to be somewhat spatially clustered, the effect is not very pronounced. The alternative hypothesis “greater” indicates that the observed spatial autocorrelation is stronger than what we’d expect by random chance. However, the fact that the Moran’s I value is close to zero means that nearby locations have only slightly more similar prediction errors than would be expected under random spatial distribution.

So if the error is not explained spatially, perhaps it can be explained by difference in housing quality or other local intelligence features.

Additionally, when we examine whether the mean absolute percentage error is clustered around any neighborhoods, we see that only two neighborhoods in North Philadelphia have slightly higher MAPEs than the other neighborhoods, and only at 8% compared to less than 4%. This indicates that the model is generalizable to new test information.

#provide a map of mean absolute percentage error (MAPE) by neighborhood.

st_drop_geometry(philly.test) %>%
  group_by(Neighborhood) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(hoods, by = c("Neighborhood" = "Neighborhood")) %>%
    st_sf() %>%
 ggplot() + 
      geom_sf(data = blocks21.sf, color = "gray70") +    
    geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = philly.test, colour = "white", size = .5, alpha = 0.2) +
      scale_fill_gradient(low = palette[1], high = palette[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood",
           caption = "White dots represent each sale in 2023.") +
      mapTheme()

#Provide a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood.

ggplot(test_data) +
  geom_point(aes(MeanPrice, MAPE)) +
  geom_smooth(method = "lm", aes(MeanPrice, MAPE), se = FALSE, colour = "#D52F4D") +
  labs(title = "MAPE by Neighborhood as a function of mean price by Neighborhood",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme()

Predict Prices

With the exception of Center City, which more recently flipped from a “D” grade (re: Redlining) residential neighborhood to an expensive one, housing prices are sticky. Areas like Chestnut Hill or Fairmount were historically highly valued, are today, and will continue to be in the future according to this model.

# Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”.

inTrain.both <- createDataPartition(
              y = paste(sales2023$holc_grade, sales2023$historic, sales2023$Neighborhood),
              p = .60, list = FALSE)

philly.training.both <- sales2023[inTrain.both,] 
philly.test.both <- sales2023[-inTrain.both,]  

reg_all.both <- 
  lm(sale_price ~ ., data = as.data.frame(philly.training.both) %>% 
                             dplyr::select(var_all))

philly.test.both <- philly.test.both %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg_all.both, philly.test.both), 
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000)

ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = philly.test.both, aes(colour = q5(sale_price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette,
                   labels=qBr(test_data,"sale_price.Predict"),
                   name="Prices") +
  labs(title="Predicted Sale Prices") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  mapTheme()

Contextualize

The income context of Philadelphia is important to consider when evaluating our model. When dividing Philadelphia into low or high income groups (bsed on the median income), there is clear clustering of high incomes in the center, northwest and northeast outskirts of the city, and lower incomes in dividing the outer and inner areas. Price absolute errors are clustered by income more so than by neighborhood boundary, which suggests income of nearby residents is highly influential in test set.

blocks21.sf <- blocks21.sf %>% 
  mutate(incomeContext = ifelse(MedHHInc > median(blocks21.sf$MedHHInc), "High Income", "Low Income"))

ggplot() + geom_sf(data = na.omit(blocks21.sf, color = "gray70"), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#4B7740", "#A5C4D4"), name="Income") +
    labs(title = "Median Income by Block Group") +
    mapTheme() + 
  theme(legend.position="bottom")

st_join(philly.test, blocks21.sf) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context") %>%
  kable_material_dark() %>%
  column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
               )
Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 24% 41%

Conclusion

Discussion

The study strongly suggests that local intelligence is an effective variable to predict housing prices. While nearby housing prices are a strong predictor of housing prices, combining nearby prices with local intelligence is an even stronger predictor.

The variables in the study with the strongest association to sale price, such as livable area, neighborhood, median household income, tree density, and litter conditions drive the the regression’s goodness of fit. Other variables affect the model at the margins.

In Philadelphia and likely in many major metro areas across the United States, downtown revitalization in the past 30 years disproportionately pushed housing prices. It would be interesting to further develop this variable beyond redlining and today’s pricing to better capture how value is dispersed across modern U.S. cities. If we could weigh factors, such as FAR or % of permits denied, that ultimately encourage or constrict development, it could make neighborhood and historical pricing variables even stronger.

The current model predicts 74% of the variation in price. Spatially, we do see that areas like Center City and Chestnut Hill are going to continue to achieve high sales prices. The model predicts these trends particulary well. However, where the model could be impoved is how the model may be discounting the rapid popularity and value in areas like Fishtown where development is new and abundant. Spatially, this model does not account for where new development is in the works. Understanding how real estate development investment can drive housing prices, but an oversupply could potentially push pricing down is important to factor into our predictions, too. Perhaps the model could predict more variation in price with this data split into sales and rentals.

Also important to note is the impact of internal characteristics on predicted price. While the model does predict on liveable area and features like fireplaces and number of bathrooms relatively well, it does not account for quality. Our MAE of around $70k could be explained by the interior finishes between two otherwise identical properties. Variables that can correct for quailty could be an effective addition to the model.

Recommendation

We recommend that Zillow incorporate local intelligence into its Zestimate algorithm. Our model is a solid starting point that can certianly be improved upon with additional feature selection and engineering. While our model predicts fairly well, it could be improved in how well accurate predictions are dispersed across the city. Reducing and clustering absolute errors will be central to future iterations.

While Amenities, dis-amenities and neighborhood spatial characteristics are often far more predictive of home prices than internal characteristics, these may be different in Philadelphia than in other cities. The key is that disparity in these spatial characteristics may highlight where there are disparities, and in turn inform where price may change significantly. In a city where trees, litter, or income are uniformly distributed, these characteristics will not impact price. Price is in many ways, therefore, a byproduct of disparities, creating tension between neighborhoods that have what is ultimately not available throughout the rest of a city.

We recommend future iterations of the model include variables that offer nuance between properties that may be otherwise identical in terms of form and location. Furthermore, variables that account for urban growth dynamics such as where building permits are issued relative to current density. This kind of variable could help the model account for pricing trends that are not present today but may be in the future. Another example of accounting for the impact of future development in Philadelphia is distance to transit since the entire trolley system is undergoing an extensive modernization. Lastly, variables that capture job centers since people often choose their home based on access or proximity to jobs or potential jobs could explain the pricing disparity between neighborhoods, especially those in downtown. We are confident with this model, Zillow is off to a great start to evolve its Zestimate score.

Appendices

Appendix A: Filtering Data

We want to note where outliers were identified and subsequently filtered to ensure the results were not skewed. For sales prices, we remove any sales above 3,500,000 since this is significantly higher than even the most expensive homes. We filtered sales with more than 4 garages because the values larger were over 70, likely representing an entire building rather than an individual unit. Additionally, parks with high concentrations of trees likely created outlier tree density values, when in reality the homes near them were not nearly as tree covered. Therefore, we only included values where trees were less than 1,000 in the block group, an already very high number. Finally, we only included homes where the total livable area was less than 6,500 to remove extreme outlier homes that were also likely for entire buildings, rather than one unit.

plot(sales2023$sale_price, ylim = c(0,6000000), main="Sales Prices") +
abline(h = 3500000, col = "#D52F4D") 

plot(sales2023$garage_spaces, ylim = c(0, 80), main="Garage Spaces") + 
abline(h = 4, col = "#D52F4D") 

plot(sales2023$total_livable_area, ylim = c(0, 150000), main="Total Livable Area") +
  abline(h = 6500, col = "#D52F4D") 

plot(sales2023$trees, ylim = c(0, 10000), main="Trees") + 
  abline(h = 1000, col = "#D52F4D") 

Appendix B: VIF Scores

VIF_table <- data.frame(
  Variable = c("Neighborhood", "fireplaces", "garage_spaces", "number_of_bathrooms", "number_of_bedrooms", "number_stories", "total_area", "total_livable_area", "year_built", "historic", "holc_grade", "price_lag", "trees", "litterscore", "TotalPop", "MedHHInc", "pctdetached", "pctmultifam_small", "hwy_grade"),
  GVIF = c(9643.267543, 1.338493, 1.760822, 1.943907, 2.028490, 1.515672, 1.469092, 1.868998, 1.453886, 1.690424, 150.815224, 3.716039, 2.010839, 1.491015, 1.506094, 2.286827, 1.926168, 1.415946, 1.277872)
)

kable(VIF_table, caption = "VIF Values", format = "html") %>%
  kable_styling()
VIF Values
Variable GVIF
Neighborhood 9643.267543
fireplaces 1.338493
garage_spaces 1.760822
number_of_bathrooms 1.943907
number_of_bedrooms 2.028490
number_stories 1.515672
total_area 1.469092
total_livable_area 1.868998
year_built 1.453886
historic 1.690424
holc_grade 150.815224
price_lag 3.716039
trees 2.010839
litterscore 1.491015
TotalPop 1.506094
MedHHInc 2.286827
pctdetached 1.926168
pctmultifam_small 1.415946
hwy_grade 1.277872