 

Prediction Analysis of House Sales in King County, USA, Using Python Programming

 

King County will begin receiving a significant sum of money this biennium to help fund affordable housing, from a source created by the Washington State Legislature during the 2019 session.

The funds come from the state's HB 1406, which passed in April. It permits cities and counties to receive a state sales-and-use tax credit, which can be put toward affordable housing. The King County Council approved a pair of ordinances on Aug. 20 that will allow the county to begin receiving funds this biennium.

Under the state law, cities and counties had until this coming January to adopt a resolution of intent, and until July to pass an ordinance officially allowing the state money to start coming in. King County, however, approved its ordinance first, which permits it to collect at the maximum rate of 0.0146 percent. Statewide, HB 1406 is making some $500 million available over the following 20 years. The money will be used to serve people earning under 60 percent of the area median income.

 

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

 

id: A notation for a house

date: Date the house was sold

price: Price (the prediction target)

bedrooms: Number of bedrooms

bathrooms: Number of bathrooms

sqft_living: Square footage of the home

sqft_lot: Square footage of the lot

floors: Total floors (levels) in the house

waterfront: Whether the house has a view to a waterfront

view: Has been viewed

condition: How good the condition is overall

grade: Overall grade given to the housing unit, based on the King County grading system

sqft_above: Square footage of the house apart from the basement

sqft_basement: Square footage of the basement

yr_built: Year built

yr_renovated: Year when the house was renovated

zipcode: ZIP code

lat: Latitude coordinate

long: Longitude coordinate

sqft_living15: Living-room area in 2015 (implies some renovations); this might or might not have affected the lot-size area

sqft_lot15: Lot-size area in 2015 (implies some renovations)

 

We will require the following libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
%matplotlib inline
 

Importing the DataSet

 

Loading the CSV file:

In [2]:
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)
 

We will use the method head to display the first 5 rows of the dataframe.

In [3]:
df.head()
(Output: the first five rows of the dataframe — 5 rows × 22 columns, from Unnamed: 0, id, date, price, bedrooms, and bathrooms through sqft_living15 and sqft_lot15.)
 

Displaying the data types of each column using the attribute dtypes.

In [4]:
print(df.dtypes)
 
Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
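The date column is stored as a plain object (string). The analysis below leaves it as-is, but for time-based work it could be parsed to datetime. This is an optional sketch, not part of the original notebook, assuming the '20141013T000000' format seen in the data:

# Optional sketch: parse the 'date' strings into datetime values.
# The format string assumes dates like '20141013T000000' as shown above.
dates = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%S')
print(dates.head())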
 

We will use the method describe to obtain a statistical summary of the dataframe.

In [5]:
df.describe()
Out[5]:
(Output: a statistical summary — count, mean, std, min, 25%, 50%, 75%, max — for each numeric column; 8 rows × 21 columns. Note that count is 21600 for bedrooms and 21603 for bathrooms, versus 21613 elsewhere, hinting at missing values.)

 

Data Wrangling with Python Programming

 

Dropping the columns "id" and "Unnamed: 0" along axis 1 using the method drop(), then using the method describe() to obtain a statistical summary of the data.

In [6]:
df.drop("id", axis = 1, inplace = True)
df.drop("Unnamed: 0", axis = 1, inplace = True)
df.describe()
Out[6]:
(Output: the same statistical summary for the remaining 19 numeric columns, now without id and Unnamed: 0; 8 rows × 19 columns.)
 

We can see that we have missing values in the columns bedrooms and bathrooms.

In [7]:
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
 
number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10
 

We can replace the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace(). Don't forget to set the inplace parameter to True.

In [8]:
mean=df['bedrooms'].mean()
df['bedrooms'].replace(np.nan,mean, inplace=True)
 

We will also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms', using the method replace().

In [9]:
mean=df['bathrooms'].mean()
df['bathrooms'].replace(np.nan,mean, inplace=True)
In [10]:
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
 
number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0
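As an aside, the same imputation can be written more idiomatically with fillna; this sketch (not in the original notebook) is equivalent to the replace calls above:

# Equivalent imputation sketch using fillna instead of replace:
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].mean())
df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].mean())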
 

Exploratory Data Analysis Using Python

 

Using the method value_counts to count the number of houses with unique floor values, and the method .to_frame() to convert the result to a dataframe.

In [11]:
df['floors'].value_counts().to_frame()
Out[11]:
floors
1.0    10680
2.0     8241
1.5     1910
3.0      613
2.5      161
3.5        8
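For reference, passing normalize=True to value_counts gives the same breakdown as proportions instead of counts (an optional sketch):

# Optional sketch: floor-count breakdown as proportions rather than counts.
df['floors'].value_counts(normalize=True).to_frame()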
 

Using the function boxplot in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.

In [12]:
sns.boxplot(x="waterfront", y="price", data=df)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1c54cd7d68>
(Figure: boxplot of price for houses with and without a waterfront view.)
 
 

Using the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.

In [13]:
sns.regplot(x="sqft_above", y="price", data=df, ci = None)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1c54bd3ba8>
(Figure: regression plot of price against sqft_above; the upward slope indicates a positive correlation.)
 
 

We can use the Pandas method corr() to find the feature other than price that is most correlated with price.

In [14]:
df.corr()['price'].sort_values()
Out[14]:
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64
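To see all pairwise correlations at once, rather than just the price column, a seaborn heatmap is a common companion to corr(); this is an optional sketch, not part of the original analysis:

# Optional sketch: visualize the full correlation matrix as a heatmap.
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap='coolwarm')
plt.title('Correlation matrix of the King County housing features')
plt.show()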
 

Model Development with Python Functions

 

Importing libraries

In [15]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
 

We can fit a linear regression model using the longitude feature 'long' and calculate the R^2.

In [16]:
X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)
Out[16]:
0.00046769430149007363
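An R^2 of roughly 0.0005 means longitude alone explains almost none of the variation in price. For reference, the fitted line itself can be inspected through the model's attributes (a quick sketch):

# Quick sketch: inspect the fitted line, price ≈ intercept + coef * long.
print("intercept:", lm.intercept_)
print("coefficient:", lm.coef_)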
 

Fit a linear regression model to predict the 'price' using the feature ‘sqft_living’.

In [17]:
X = df[['sqft_living']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)
Out[17]:
0.49285321790379316
 

Fit a linear regression model to predict the ‘price’ using the list of features:

In [18]:
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]
 
 
In [19]:
X2 = df[features]
Y2 = df['price']
lm.fit(X2,Y2)
lm.score(X2,Y2)
Out[19]:
0.6576527411217378
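With the multi-feature model fitted, predictions come from the predict method; a brief sketch (not in the original notebook) comparing the first few predictions to the actual prices:

# Sketch: predict prices for the first five houses and compare to actuals.
Yhat = lm.predict(X2)
print("predicted:", Yhat[0:5])
print("actual:   ", Y2[0:5].values)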
 


Create a list of tuples, where the first element in each tuple is the name of the estimator:

'scale'

'polynomial'

'model'

The second element in each tuple is the model constructor:

StandardScaler()

PolynomialFeatures(include_bias=False)

LinearRegression()

In [20]:
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
 

Using the list, we create a pipeline object, fit it on the features in the list features to predict the 'price', and calculate the R^2.

In [21]:
pipe=Pipeline(Input)
pipe
Out[21]:
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])
In [22]:
pipe.fit(X2,Y2)
Out[22]:
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])
In [23]:
pipe.score(X2,Y2)
Out[23]:
0.7513408553309376
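The pipeline is shorthand for applying each step in order: scale, expand to polynomial features, then fit the regression. A sketch of the equivalent manual sequence, for illustration only:

# Illustration sketch: the manual equivalent of the pipeline's three steps.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(X2)
poly = PolynomialFeatures(include_bias=False)
x_poly = poly.fit_transform(x_scaled)
lr = LinearRegression()
lr.fit(x_poly, Y2)
print(lr.score(x_poly, Y2))  # should match pipe.score(X2, Y2)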
 

Model Evaluation and Refinement (Data Training and Cross-Validation)

 
 
In [24]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
 
done
 

We will split the data into training and testing sets.

In [25]:
features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
 
number of test samples : 3242
number of training samples: 18371
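Since cross_val_score was imported above, here is a quick sketch (not in the original notebook) of 4-fold cross-validated R^2 for a plain linear regression on the same features:

# Sketch: 4-fold cross-validated R^2 for a plain linear regression.
lre = LinearRegression()
scores = cross_val_score(lre, X, Y, cv=4)
print("mean R^2:", scores.mean(), "std:", scores.std())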
 

Creating and fitting a Ridge regression object using the training data, setting the regularization parameter to 0.1, and calculating the R^2 using the test data.

In [26]:
from sklearn.linear_model import Ridge
In [27]:
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train, y_train)
ridge_model.score(x_test, y_test)
Out[27]:
0.6478759163939115
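The value alpha=0.1 is taken as given here; a small sketch of sweeping a few other values shows how sensitive the test R^2 is to the regularization strength (the alpha grid below is an arbitrary illustrative choice):

# Sketch: compare test R^2 across a few regularization strengths.
for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    rm = Ridge(alpha=alpha)
    rm.fit(x_train, y_train)
    print(alpha, rm.score(x_test, y_test))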
 

Performing a second-order polynomial transform on both the training and testing data, then creating and fitting a Ridge regression object using the training data, setting the regularization parameter to 0.1, and calculating the R^2 using the test data.

In [28]:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(x_train_pr, y_train)
ridge_model.score(x_test_pr, y_test)
Out[28]:
0.700274427924385
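As a final sanity check (a sketch, not part of the original notebook), the polynomial Ridge model's predictions can be plotted against the actual test prices; points near the diagonal indicate good fits:

# Sketch: predicted vs. actual prices on the held-out test set.
y_pred = ridge_model.predict(x_test_pr)
plt.scatter(y_test, y_pred, alpha=0.2)
plt.xlabel('actual price')
plt.ylabel('predicted price')
plt.title('Polynomial Ridge: predicted vs. actual (test set)')
plt.show()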
