Overview

Local intelligence is a strong predictor of housing prices in Philadelphia that can optimize Zillow’s popular and consequential Zestimate tool. Real estate professionals as well as home buyers and sellers alike trust Zestimate to be an accurate prediction of local real estate values. Investing in the tool’s reliability stands to not only benefit Zillow’s customers, but the company’s dominant market position, too.

The following study explores a series of local intelligence variables that range from tree-lined street to local median income. A series of statistical processes reveal a handful of key variables that are then processed by prediction models. The best prediction model is then tested on 2023 sale price data for accuracy. The study is mindful of creating models that are generalizable across Philadelphia, and hopefully generalizable across major metro areas throughout the country.

Select and Load Data

Firstly, sales data is a tremendous resource in terms of the internal characteristics of each property which can certainly influence price. This study leverages data such as year built, total square footage, number of bathrooms and bedrooms to holistically understand what influences price.

However, the property itself does not tell the full story. Where it is located and the characteristics of that area are key, too. There are myriad local intelligence variables to chose from between a number of key open data sources (2021-2023), including: U.S. Census American Community Survey, Pennsylvania Spatial Data Access (PASDA), Philadelphia Open Data Portal, or from the Delaware Valley Regional Planning Commission (DVRPC). This study priorizited complete datasets that affect the city at large. For example, litter is a systemic issue whereas airports are very localized. The following table breaks down viable sources and from which we selected local intelligence variables that we deemed to be generalizable for the purposes of this study:

# Create a data frame with your data
data_frame <- data.frame(
  "Dataset" = c(
    "Floodplain Data",
    "Sanitation Centers",
    "Litter Index",
    "Demographics and Socioeconomic Data",
    "Housing Typology",
    "Trees",
    "Neighborhood",
    "Proximity to Highways",
    "School Catchment Performance",
    "Walkscore", 
    "1937 Redlining Score"
  ),
  "Generalizability" = c(
    "",   # Floodplain Data 
    "",   # Sanitation Centers 
    "✔",  # Litter Index 
    "✔",  # Demographics and Socioeconomic Data 
    "",   # Housing Typology 
    "✔",  # Trees 
    "✔",  # Neighborhood
    "✔",  # Proximity to Highways 
    "",   # School Catchment Performance 
    "",   # Walkscore 
    "✔"  # 1937 Redlining Score
  )
)


dt_url <- c("https://data-phl.opendata.arcgis.com/datasets/phl::fema-100-flood-plain/explore",
            "https://opendataphilly.org/datasets/sanitation-areas/",
            "https://opendataphilly.org/datasets/litter-index/",
            "https://www.census.gov/",
            "https://www.census.gov/",
            "https://opendataphilly.org/datasets/phillytreemap/",
           "https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=7047",
           "https://dvrpc-dvrpcgis.opendata.arcgis.com/datasets/dvrpcgis::freight-highway-network/explore",
           "https://www.philasd.org/performance/programsservices/open-data/school-information/",
           "http://walkscore.com/",
           "https://ncrc.org/holc/")

data_frame %>%
  mutate(Dataset = cell_spec((Dataset), "html", link = dt_url)) %>%
  kable(format = "html", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("hover", "condensced"), full_width = FALSE, 
                position = "left") %>%
  column_spec(column = 2, width = "20%", extra_css = "text-align:center;"
               )
Dataset Generalizability
Floodplain Data
Sanitation Centers
Litter Index
Demographics and Socioeconomic Data
Housing Typology
Trees
Neighborhood
Proximity to Highways
School Catchment Performance
Walkscore
1937 Redlining Score

Boundaries

From highways to rivers, man-made and natural boundaries alike delineate neighborhoods where certain people and types of businesses congregate. Neighborhood boundaries are arguably more organic than census block groups (which are optimized for consistent populations) or highway. Additionally, because residential sales are generalized by residential neighborhoods not by census block groups, neighborhoods (as defined by the Philly Planning Dept) are likely significant on local intelligence.You can see where areas of high income and low income cluster across Philadelphia, and how these boundaries are more recognizable at a neighbrhood level than block group level.

# plot sales data using median cut off  

sales_plot_mod <- sales2023 %>%
  filter(toPredict == "MODELLING")

median_sale_price <- median(sales_plot_mod$sale_price)
custom_colors <- ifelse(sales_plot_mod$sale_price <= median_sale_price, "#A5C4D4", "#4B7740")

sales_plot <- ggplot() +
  geom_sf(data = blocks21.sf, color = "gray70") +
  geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  labs(title = "Block Groups", color = "") +
  theme(legend.position = "none") + 
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(subtitle = "2021 Philadelphia Median \nSale Price is $229,900") +
  mapTheme()

#Bring in neighborhoods

hoods <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/PhillyPlanning_Neighborhoods/PhillyPlanning_Neighborhoods.shp") %>%
  select(NAME) %>%
  rename(Neighborhood = NAME)

sales2023 <- st_join(sales2023, hoods, join = st_within)

hood_plot <- ggplot() +
  geom_sf(data = blocks21.sf, fill = "gray90", color = "gray90") +
  geom_sf(data = hoods, color = "gray60") +
  geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  labs(title = "Neighborhoods") +
  theme(legend.position = "none") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(subtitle = "Neighborhoods as defined by Philadelphia \nPlanning Dept") +
  mapTheme()

#Bring in highways and apply buffer of 1/8 mile (660')

highways <- st_read("/Users/LauraFrances_1/Library/CloudStorage/GoogleDrive-lauramuellersoppart@gmail.com/My Drive/Penn/Courses/MUSA508_PPA/PPA_Midterm/PPA_Midterm_Git/data/Freight_Highway_Network.geojson")

highways <- highways %>%
  filter(state == "PA") %>%
  st_transform('EPSG:2272')

highways_phl <-  st_intersection(highways, blocks21.sf) %>%
  select(geometry)

highway_buffer <- st_buffer(highways_phl, 660)
highway_buffer$hwy_grade <- 1

# Make a dummy variable where 1 is a house within highway buffer and 0 is outside of buffer. 

saleshighways <- st_intersection(sales2023, highway_buffer)

homes_near_hwys <- saleshighways$objectid  #creates a list using unique ids from saleshighways

sales2023 <- sales2023 %>%
  mutate(hwy_grade = ifelse(objectid %in% homes_near_hwys, 1, 0))

hwy_plot <- ggplot() + 
  geom_sf(data = blocks21.sf, color = "gray70") +
   geom_sf(data = sales_plot_mod, aes(color = custom_colors), size = 0.5, alpha = 0.5) +
  scale_color_manual(
    values = c("#A5C4D4" = "#A5C4D4", "#4B7740" = "#4B7740"),
    labels = c("#4B7740" = "Above Median", "#A5C4D4" = "Below Median")) +
  geom_sf(data = highway_buffer, aes(fill = "#36151E", color = "#36151E")) +
  scale_fill_identity(name = "Highway", labels = "#36151E") +
  scale_color_identity(name = "Highway", labels = "#36151E") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(title = "Highway Proximity") +
  labs(subtitle = "The highway buffer is 660 feet, \nabout 2 Philadelphia blocks.") +
  mapTheme()

#plot all three together

plot_grid(sales_plot, hood_plot, hwy_plot, ncol = 3, align = "v")