King-County-House-Price-Prediction

PROJECT OVERVIEW

As a data scientist at FlyHomes, I am tasked with analyzing house sales data in the King County area to build predictive models for sale prices and identify the significant factors influencing these prices.

Business Questions:

Location: Which areas in King County have the highest average house prices?

Inner Factors: What specific house features (e.g., square footage, number of bedrooms) have the most substantial impact on the sale price?

External Factors: How do external elements such as crime rates, school ratings, and demographic data affect house prices in King County?

Goal:

My goal is to support potential investors and homebuyers in making well-informed decisions by offering data and insights drawn from a comprehensive analysis of the King County real estate market. I also aim to make home buying a more confident experience by helping buyers navigate the market with certainty.

DATA USED FOR THIS PROJECT

The main dataset is the 'House Sales in King County, USA' dataset from Kaggle. It covers sales from May 2014 to May 2015 and contains 21 variables and more than 21,000 rows, each representing a single house sale. I downloaded it in CSV format and removed unnecessary columns, retaining only essential columns such as price, bedrooms, bathrooms, square footage of living space, square footage of the lot, floors, view, condition, grade, year built, and zip code.

To enhance my analysis, I added location-based data from federal and state databases. This additional data brought in crucial factors such as property tax rate, the number of crimes, school ratings, unemployment rate, average commute time, median household income, total population, and median age. I then merged five different datasets on zip code to construct a comprehensive dataset for predicting house prices. To review the process of merging the datasets, you can check my GitHub under the 'data merging' folder.
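The zip-code merge described above can be sketched roughly as follows. This is a minimal illustration, assuming each location table has one row per zip code and shares a `zipcode` column with the sales table; the frame names, the helper `merge_on_zip`, and the tiny inline values are made up for the example and are not the project's actual files.

```python
import pandas as pd

# Hypothetical miniature tables; the real project merges five datasets.
sales = pd.DataFrame({
    "price": [450000, 820000, 300000],
    "zipcode": ["98004", "98039", "98001"],
})
schools = pd.DataFrame({
    "zipcode": ["98004", "98039", "98001"],
    "school_rate": [9, 10, 6],
})
crime = pd.DataFrame({
    "zipcode": ["98004", "98001"],  # 98039 is deliberately missing
    "area_crime": [120, 340],
})


def merge_on_zip(base, extras, key="zipcode"):
    """Left-join each zip-level lookup table onto the sales table."""
    merged = base
    for extra in extras:
        # validate="m:1" raises if a zip code appears twice in a lookup table,
        # which would otherwise silently duplicate sales rows.
        merged = merged.merge(extra, on=key, how="left", validate="m:1")
    return merged


merged = merge_on_zip(sales, [schools, crime])
print(merged)
```

A left join keeps every sale even when a zip code is absent from a lookup table; the resulting missing values then show up explicitly during cleaning instead of rows disappearing.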

DATA PROCESSING

The primary tool I used for data cleaning and modeling is Python. During cleaning, I identified and addressed missing values and duplicates, and fixed incorrect data types and values. A significant step was applying a log transformation to the 'price', 'sqft_living', and 'sqft_lot' variables. These variables are skewed, and the log transformation makes their distributions more symmetric while keeping every observation, which I preferred over removing outliers. Compressing the extreme values in this way also tends to stabilize the variance of the residuals and bring them closer to a normal distribution, both of which are desirable in regression analysis, although the transformation does not guarantee either property.
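As a rough sketch of the log transformation described above, the snippet below adds `_log` columns with NumPy. The three sample rows are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the dataset described above.
df = pd.DataFrame({
    "price": [250000.0, 540000.0, 7700000.0],
    "sqft_living": [1180.0, 2570.0, 12050.0],
    "sqft_lot": [5650.0, 7242.0, 27600.0],
})

for col in ["price", "sqft_living", "sqft_lot"]:
    # Natural log; this assumes the column is strictly positive.
    df[f"{col}_log"] = np.log(df[col])

# A value on the log scale is converted back to dollars with np.exp.
price_back = np.exp(df["price_log"])
print(df[["price", "price_log"]])
```

Because the model is trained on `price_log`, its predictions are also on the log scale and need `np.exp` before they can be read as prices.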

For the modeling phase, I have selected the following variables: 'price_log', 'condition', 'grade', 'floors', 'bedrooms', 'bathrooms', 'sqft_living_log', 'sqft_lot_log', 'house_age', 'school_rate', 'unemployment_rate', 'travel_time_to_work', 'total_population', 'typical_levy_rate', 'median_age', 'median_household_income', and 'area_crime'. To delve into the specifics of the data cleaning process, you can check my GitHub under the 'data cleaning' folder.

METHODOLOGIES

To provide potential investors and homebuyers with actionable insights into the King County real estate market, I employed a combination of exploratory and predictive analytics. Initially, I utilized Exploratory Data Analysis (EDA) to clean, enrich, and visualize the data, thereby establishing a solid foundation for the subsequent predictive modeling.

Following this, I built predictive models using two methods: Multiple Linear Regression (MLR) and Extreme Gradient Boosting (XGBoost). I chose them for their complementary strengths. MLR, a classical statistical method, is easy to interpret, so the factors influencing house prices can be read directly from its coefficients. XGBoost, a machine-learning technique based on gradient-boosted decision trees, is known for high predictive accuracy. Although the techniques differ, both serve the same objective: a reliable predictive model for house prices in King County, with each approach contributing its own advantages to a more rounded analysis.
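To illustrate why MLR is considered easy to interpret, here is a minimal sketch that fits a linear model by ordinary least squares with NumPy. It is not the project's modeling code: the data is synthetic, the coefficients (5.0, 0.8, 0.1) are chosen arbitrarily for the example, and `np.linalg.lstsq` stands in for whichever regression library was actually used. XGBoost is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
sqft_living_log = rng.uniform(6.0, 9.0, size=n)
grade = rng.integers(5, 12, size=n).astype(float)

# Synthetic target with known coefficients and no noise (illustration only).
price_log = 5.0 + 0.8 * sqft_living_log + 0.1 * grade

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), sqft_living_log, grade])
coef, *_ = np.linalg.lstsq(X, price_log, rcond=None)
intercept, b_sqft, b_grade = coef
print(intercept, b_sqft, b_grade)
```

With both price and living area on the log scale, a coefficient such as 0.8 reads as "a 1% larger living area goes with roughly a 0.8% higher price". For an untransformed predictor such as grade, a coefficient of 0.1 corresponds to a price ratio of exp(0.1), about 10.5% per grade step.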

Upon the completion of both models, I proceeded to the comparison and validation phase. In this stage, I used key performance metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared value to assess their predictive power and reliability.
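The four metrics named above can be computed directly from their definitions. The sketch below uses NumPy only; the helper name `regression_metrics` and the sample values are assumptions made for the example.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up values purely to exercise the function.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

When the target is `price_log`, MAE, MSE, and RMSE are expressed in log units rather than dollars, so they are best compared between models trained on the same target.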

EXPLORATION

City vs. Price

Insights:

The top five cities with the highest average house prices are Medina, Mercer Island, Bellevue, Sammamish, and Redmond.
A significant portion of the houses are located in Seattle.
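A ranking like the one above can be produced with a group-by on city. This sketch assumes a `city` column has already been derived (for example from zip code); the rows and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; the real data has more than 21,000 sales.
df = pd.DataFrame({
    "city": ["Medina", "Seattle", "Seattle", "Bellevue", "Medina", "Seattle"],
    "price": [2000000, 500000, 600000, 900000, 2400000, 550000],
})

summary = (
    df.groupby("city")["price"]
      .agg(avg_price="mean", n_sales="count")   # average price and sale count
      .sort_values("avg_price", ascending=False)
)
top_cities = summary.head(5)
print(top_cities)
```

Keeping the sale count next to the average helps flag cities whose average rests on only a handful of sales.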

Bedrooms vs. Price

Insights:

The majority of houses in the dataset have three bedrooms.
Generally, more bedrooms tend to go with a higher house price. However, houses with 0, 11, and 33 bedrooms are anomalies. On closer inspection, I found that these houses have impractical layouts, which affects the dataset's accuracy.
To maintain data integrity, I removed rows with 0, 11, or 33 bedrooms.
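The row removal described above can be sketched with a boolean mask. The helper `drop_values` and the sample rows are hypothetical; only the column name and the anomalous bedroom counts come from the text.

```python
import pandas as pd

# Hypothetical rows including the anomalous bedroom counts.
df = pd.DataFrame({
    "bedrooms": [3, 0, 4, 33, 11, 2],
    "price": [400000, 350000, 620000, 640000, 520000, 280000],
})


def drop_values(frame, column, values):
    """Split a frame into (kept, removed) rows based on values in one column."""
    mask = frame[column].isin(values)
    return frame[~mask].reset_index(drop=True), frame[mask]


cleaned, removed = drop_values(df, "bedrooms", [0, 11, 33])
print(cleaned)
print(f"removed {len(removed)} rows")
```

Returning the removed rows as well makes it easy to inspect what was dropped and to report how many sales the filter affected.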

Bathrooms vs. Price

Insights:

The majority of houses in the dataset have 2.5 bathrooms.
As with bedrooms, a higher number of bathrooms tends to go with a higher house price. However, houses with 0 and 7.5 bathrooms are anomalies. On closer inspection, I found that these houses have impractical layouts, which affects the dataset's accuracy.
To maintain data integrity, I removed rows with 0 or 7.5 bathrooms.

Floors vs. Price

Insights:

Most houses are designed with one or two floors.
Houses boasting 2.5 floors have the highest average prices in the market.

Condition vs. Price

Insights:

A large number of houses have received a condition score of 3.
Average house prices do not differ much among homes with condition scores between 3 and 5.

Grade vs. Price

Insights:

The majority of houses receive a grade score of 7.
There is a positive relationship between the grade score and the house price: higher grades go with higher prices.

MODEL COMPARISON

Insights:

Inner Factors:
Both models indicated that sqft_living_log, sqft_lot_log, grade, and bathrooms have a significant impact on house prices. However, both produced low R-squared values, meaning that the inner factors alone explain only a limited share of the variation in price. In addition, before fitting the MLR model, I found that every variable had a high Variance Inflation Factor (VIF), which signals multicollinearity. When predictors are highly correlated, the MLR coefficient estimates become unstable, which can distort the results and understate the importance of some variables. I tried removing the high-VIF variables one at a time, but this left no variables to analyze. By contrast, an article on Medium described XGBoost as handling multicollinearity effectively, suggesting that it might provide more reliable predictions than the MLR model in this context.
External Factors:
Both models showed that the unemployment rate, travel time to work, and median age have a substantial impact on house prices. However, they ran into the same issues as in the inner factors analysis: low R-squared values, a high Variance Inflation Factor (VIF) for every variable, and no variables left for modeling after removing the high-VIF variables one by one.
Overall Factors:
Both models exhibited high explanatory power when considering the overall factors, with XGBoost slightly outperforming MLR. The sqft_living_log and grade variables were significant in both models, highlighting their importance in the prediction.
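For reference, the VIF mentioned in the insights above is defined for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it with NumPy on synthetic data; the function `vif` and the three generated columns are assumptions for illustration, not the project's code or variables.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)               # unrelated to a
c = a + 0.05 * rng.normal(size=200)    # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this toy data only the two near-duplicate columns get a large VIF, and dropping one of them would resolve it. As a general observation rather than a diagnosis of the results above, VIFs computed without an intercept term can come out inflated for every predictor whose mean is far from zero, so that is worth checking before variables are removed.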

Findings

Prime Locations for Investment
Based on my analysis of the various cities in King County, I identified that premium areas such as Medina, Mercer Island, Bellevue, Sammamish, and Redmond exhibit the highest average house prices. Therefore, investments in these regions are most likely to yield substantial returns due to the premium value attached to properties in these localities.
Actionable Insight: Investors looking for high-value investments should prioritize these areas, while homebuyers aspiring to reside in upscale neighborhoods should focus their search in these cities.
Desirable Floor Plans
My analysis showed that homes with 2.5 floors have the highest average prices in the market.
Actionable Insight: Sellers should consider renovating older homes to meet this standard, potentially increasing their market value. Conversely, buyers should be prepared to pay a premium for homes with this feature, viewing them as valuable assets in terms of both living experience and future resale value.
The Significance of House Features
There are two factors that consistently influence a home’s market price: grade and square footage of living space. Houses that are graded higher and have more living space are priced higher on the market.
Actionable Insight: For buyers, it’s a smart move to look for homes with larger living areas and higher grades, as these homes not only offer a better living experience but are also likely to appreciate more in value over time. Sellers, on the other hand, can increase the market value of their homes by making improvements that enhance the grade of the home, helping to secure a better offer when it comes time to sell.
Influential External Factors
The unemployment rate, commute time, and median age all have a significant influence on housing prices.
Actionable Insight: Buyers might consider targeting areas with low unemployment rates and manageable commute times to work, aiming to secure a property that not only ensures a high quality of life but also holds the potential for value appreciation over time.

Recommendation

Data-Driven Marketing:
Use the insights from the analysis to create data-driven marketing strategies. For example, highlight the most valued house features (such as the number of floors and the grade score) in marketing materials.
Advisory Services:
Offer advisory services to clients, helping them make informed decisions based on the significant factors influencing house prices in King County.
Leveraging XGBoost:
Consider using XGBoost for future predictive analyses due to its slightly better performance and its enhanced ability to handle multicollinearity issues compared to the MLR model.

Further Work

More sales data:
The dataset spans from May 2014 to May 2015, a period that concluded several years ago. To obtain more accurate and timely insights, it is vital to acquire more recent data that also covers a broader time frame.
Temporal Analysis:
Conduct a temporal analysis to understand how house prices have evolved over time and identify any seasonal trends or patterns.
Interactive Dashboard:
Develop an interactive dashboard that allows users to explore the data and insights visually and to generate custom reports based on their preferences.

Contact

Thank you for reading. If you have any questions regarding this project, you can email me at candice.wu.555@gmail.com, or connect with me on
