Boston Analysis
Predicting Boston Housing Prices
This project was developed as part of the Udemy Data Analysis course.
In this project, we evaluate the performance and predictive power of a model trained and tested on data collected from homes in suburbs of Boston, Massachusetts. The model's goal is to predict the prices of Boston homes.
The Boston housing data was collected in 1978; each of the 506 entries represents aggregated data about 14 features of homes from various suburbs of Boston, Massachusetts.
# IMPORTS
# Staple Inputs
import numpy as np
import pandas as pd
from pandas import DataFrame
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# Dataset (note: load_boston was removed in scikit-learn 1.2)
from sklearn.datasets import load_boston
# Loading the Boston dataset
data_boston = load_boston()
print(data_boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
boston_df = DataFrame(data_boston.data)
boston_df.columns = data_boston.feature_names
# Add the target column to complete the df
boston_df['PRICE'] = data_boston.target
# Separate the features (X) from the target (y)
X_fts = boston_df.drop('PRICE', axis=1)
y_price = boston_df['PRICE']
boston_df.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Data Exploration
boston_df.describe()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.593761 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.596783 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.647423 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
boston_df.dtypes
CRIM float64
ZN float64
INDUS float64
CHAS float64
NOX float64
RM float64
AGE float64
DIS float64
RAD float64
TAX float64
PTRATIO float64
B float64
LSTAT float64
PRICE float64
dtype: object
# Use apply to count the unique values in each column
boston_df.apply(lambda x: len(x.unique()))
CRIM 504
ZN 26
INDUS 76
CHAS 2
NOX 81
RM 446
AGE 356
DIS 412
RAD 9
TAX 66
PTRATIO 46
B 357
LSTAT 455
PRICE 229
dtype: int64
- **Contains a lot of continuous variables**
# Use apply to count the NULL/NA values in each column
boston_df.apply(lambda x: sum(x.isnull()))
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
PRICE 0
dtype: int64
- Great news: there are no NaN values
## In-depth look at price
min_price = np.min(y_price)
max_price = np.max(y_price)
mean_price = np.mean(y_price)
median_price = np.median(y_price)
std_price = np.std(y_price)
print("Descriptive Statistics\n")
print("The maximum price for the PRICE is {0:.2f}".format(max_price*1000))
print("The minimum price for the PRICE is {0:.2f}".format(min_price*1000))
print("The mean price for the PRICE is {0:.2f}".format(mean_price*1000))
print("The median price for the PRICE is {0:.2f}".format(median_price*1000))
print("The std fo the price for the PRICE is {0:.2f}".format(std_price*1000))
Descriptive Statistics
The maximum PRICE is 50000.00
The minimum PRICE is 5000.00
The mean PRICE is 22532.81
The median PRICE is 21200.00
The std. deviation of PRICE is 9188.01
Data Visualization
all_cols = boston_df.columns
# Histogram of prices (this is the target of our dataset)
plt.hist(y_price,bins=60)
plt.xlabel('Price in $1000s')
plt.ylabel('Number of houses')
plt.show()
# Scatter plot with a regression fit for each feature against PRICE
for col in all_cols.drop('PRICE'):
    sns.lmplot(x=col, y='PRICE', data=boston_df)
plt.show()
# Compute the correlation matrix
corr = boston_df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, annot=True, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x120ed1320>
# Correlation of each feature with PRICE, from strongest positive to strongest negative
corr_matrix = boston_df.corr()
corr_matrix['PRICE'].sort_values(ascending=False)
PRICE 1.000000
RM 0.695360
ZN 0.360445
B 0.333461
DIS 0.249929
CHAS 0.175260
AGE -0.376955
RAD -0.381626
CRIM -0.385832
NOX -0.427321
TAX -0.468536
INDUS -0.483725
PTRATIO -0.507787
LSTAT -0.737663
Name: PRICE, dtype: float64
Feature Selection
We use backward elimination: fit an OLS model on all features, then repeatedly drop the feature with the highest p-value until every remaining feature is significant at the chosen threshold.
import statsmodels.api as sm
from collections import defaultdict
# Records each surviving feature set and its adjusted R-squared
dict_adjus_R = defaultdict(list)
def HighestPvalue(model, threshold):
    """Return the predictor with the highest p-value, or True if
    no p-value exceeds the threshold."""
    highest_pvalue = 0
    for index, current_pvalue in model.pvalues.items():
        if index == 'const':
            continue  # never eliminate the intercept
        if current_pvalue > highest_pvalue:
            highest_pvalue = current_pvalue
            highest_index = index
    if highest_pvalue > threshold: return highest_index
    else: return True
def CreateLinearReg(x, y):
    """Fit an OLS model with an intercept term."""
    X2 = sm.add_constant(x)
    est = sm.OLS(y, X2)
    return est.fit()
def BackwardElimination(Xs, y, stats_signf):
    model_info = CreateLinearReg(Xs, y)
    p_results = HighestPvalue(model_info, stats_signf)
    dict_adjus_R[len(Xs.columns)].append([Xs.columns, model_info.rsquared_adj])
    if p_results is True: return model_info
    else:
        # Drop the least significant feature and refit
        Xs.drop(p_results, axis=1, inplace=True)
        return BackwardElimination(Xs, y, stats_signf)
# Statistical significance level to use
stats_signf = 0.05
# Note: BackwardElimination drops columns from X_fts in place
final_model = BackwardElimination(X_fts, y_price, stats_signf)
final_model.summary()
Dep. Variable: | PRICE | R-squared: | 0.741 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.735 |
Method: | Least Squares | F-statistic: | 128.2 |
Date: | Wed, 10 Jan 2018 | Prob (F-statistic): | 5.74e-137 |
Time: | 13:58:52 | Log-Likelihood: | -1498.9 |
No. Observations: | 506 | AIC: | 3022. |
Df Residuals: | 494 | BIC: | 3073. |
Df Model: | 11 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>\|t\| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 36.3694 | 5.069 | 7.176 | 0.000 | 26.411 | 46.328 |
CRIM | -0.1076 | 0.033 | -3.296 | 0.001 | -0.172 | -0.043 |
ZN | 0.0458 | 0.014 | 3.387 | 0.001 | 0.019 | 0.072 |
CHAS | 2.7212 | 0.854 | 3.185 | 0.002 | 1.043 | 4.400 |
NOX | -17.3956 | 3.536 | -4.920 | 0.000 | -24.343 | -10.448 |
RM | 3.7966 | 0.406 | 9.343 | 0.000 | 2.998 | 4.595 |
DIS | -1.4934 | 0.186 | -8.039 | 0.000 | -1.858 | -1.128 |
RAD | 0.2991 | 0.063 | 4.719 | 0.000 | 0.175 | 0.424 |
TAX | -0.0118 | 0.003 | -3.488 | 0.001 | -0.018 | -0.005 |
PTRATIO | -0.9471 | 0.129 | -7.337 | 0.000 | -1.201 | -0.693 |
B | 0.0094 | 0.003 | 3.508 | 0.000 | 0.004 | 0.015 |
LSTAT | -0.5232 | 0.047 | -11.037 | 0.000 | -0.616 | -0.430 |
Omnibus: | 178.444 | Durbin-Watson: | 1.078 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 786.944 |
Skew: | 1.524 | Prob(JB): | 1.31e-171 |
Kurtosis: | 8.295 | Cond. No. | 1.47e+04 |
# Dictionary of the candidate models explored during elimination.
# Each entry pairs a group of variables with its adjusted R-squared.
dict_adjus_R
defaultdict(list,
{11: [[Index(['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B',
'LSTAT'],
dtype='object'),
0.73476802182854828],
[Index(['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B',
'LSTAT'],
dtype='object'), 0.73476802182854828],
[Index(['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B',
'LSTAT'],
dtype='object'), 0.73476802182854828]],
12: [[Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'],
dtype='object'), 0.73429218983604261]],
13: [[Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT'],
dtype='object'), 0.7337538824121872]]})
# Based on the analysis using backward elimination, I will look into using these variables
X_fts_1 = X_fts[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS',
'RAD', 'TAX', 'PTRATIO', 'B','LSTAT']]
X_fts_1.head()
CRIM | ZN | CHAS | NOX | RM | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 0.0 | 0.538 | 6.575 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 0.0 | 0.469 | 6.421 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 0.0 | 0.469 | 7.185 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 0.0 | 0.458 | 6.998 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 0.0 | 0.458 | 7.147 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Data Modeling
Comparison: MAE vs. RMSE
Similarities: both MAE (mean absolute error) and RMSE (root mean squared error) express average model prediction error in the units of the variable of interest. Both metrics range from 0 to ∞ and are indifferent to the direction of errors. They are negatively oriented scores: lower values are better.
Differences: taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, RMSE gives a relatively high weight to large errors, so it is more useful when large errors are particularly undesirable.
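As a quick illustration using scikit-learn's built-in metrics (the toy values below are made up for demonstration), note how a single large error inflates RMSE far more than MAE:
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Toy true vs. predicted prices (in $1000s); one prediction is badly off
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 21.0, 35.0, 23.4])
mae = mean_absolute_error(y_true, y_pred)           # mean of the absolute errors
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the mean squared error
print("MAE:  {0:.2f}".format(mae))   # (1.0 + 0.6 + 0.3 + 10.0) / 4 ≈ 2.98
print("RMSE: {0:.2f}".format(rmse))  # dominated by the 10.0 error, ≈ 5.04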
Implementation: Define a Performance Metric
We will calculate the coefficient of determination, ${R}^2$, to quantify our model’s performance. The coefficient of determination for a model is a useful statistic in regression analysis, as it often describes how “good” that model is at making predictions.
The values for ${R}^2$ range from 0 to 1 and indicate how much of the variance in the target variable is captured by the model:
- ${R}^2$ of 0 means the model is no better than one that always predicts the mean of the target variable
- ${R}^2$ of 1 means the model perfectly predicts the target variable
- Any value between 0 and 1 indicates what fraction of the target variable's variance, using this model, can be explained by the features.
from sklearn.metrics import r2_score
def Score_R2(y_true, y_predict):
    """ Calculates and returns the R^2 performance score between
    the true and predicted values. """
    score = r2_score(y_true, y_predict)
    return score
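As a quick sanity check of the properties above (toy values of my own), a model that always predicts the mean of the target scores exactly 0, while perfect predictions score 1:
# Toy check of the R^2 boundary cases
y_true = [3.0, 5.0, 7.0, 9.0]
y_mean = [np.mean(y_true)] * len(y_true)   # always predict the mean
print(Score_R2(y_true, y_mean))   # 0.0
print(Score_R2(y_true, y_true))   # 1.0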
Training/Testing the Dataset
A training set is used to build the model, while a held-out validation (test) set is used to evaluate it; data points in the training set are excluded from the validation set. The correct way to assign samples to the training or validation set is at random.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
from sklearn.model_selection import train_test_split
# Split the dataset to see how the model performs on a simple train/test split
X_train, X_test, y_train, y_test = train_test_split(X_fts_1, y_price, test_size=0.2)
model = linreg.fit(X_train, y_train)
predictions = linreg.predict(X_test)
# Visualize predicted vs. true prices on the test set
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.show()
model_score = model.score(X_test, y_test)
print ("Model R_Square Performance:, {0:.2f}".format(model_score))
Model R_Square Performance:, 0.72
Cross Validation
With k-fold cross-validation, the dataset is split into k folds; the model is trained on k−1 folds and validated on the remaining fold, rotating until every fold has served as the validation set. Averaging the k scores gives a more reliable performance estimate than a single train/test split.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
splits = 10
# Folds are taken in order here; pass shuffle=True (with a random_state) to randomize them
kfold = KFold(n_splits=splits)
scoring = 'neg_mean_squared_error'
results = cross_val_score(linreg, X_fts_1, y_price, cv=kfold, scoring=scoring)
# Per-fold RMSE (the scores are negative MSEs, hence abs before the square root)
all_RMSE = [np.sqrt(np.abs(result)) for result in results]
# Note: these take the square root of the mean/std of the MSEs,
# which only approximates the mean/std of the per-fold RMSEs
RMSE = np.sqrt(np.abs(results.mean()))
RMSE_std = np.sqrt(results.std())
print("For the RMSE, the mean is {0:.3f} and the std. deviation is {1:.3f}\n".format(RMSE, RMSE_std))
print("The RMSE for all the CV are {0}".format('\n'.join(str(r) for r in all_RMSE)))
For the RMSE, the mean is 5.723 and the std. deviation is 6.451
The RMSE for all the CV are 3.0325086999
3.7300523343
3.49240202316
5.93185975761
5.45459716069
4.40422039266
3.14890618193
12.4212881393
5.77112150767
3.22201336351
scoring = 'r2'
results = cross_val_score(linreg, X_fts_1, y_price, cv=kfold, scoring=scoring)
print("For the R Squared, the mean is {0:.3f} and the std. deviation is {1:.3f}\n".format(results.mean(), results.std()))
print("The R Squared for all the CV are \n{0}".format('\n'.join(str(r) for r in results)))
For the R Squared, the mean is 0.247 and the std. deviation is 0.543
The R Squared for all the CV are
0.736364963982
0.481934238314
-0.738769807462
0.641343334412
0.577913070194
0.742233036329
0.380262591077
-0.0347512291036
-0.767164243387
0.449640823275
Decision Tree Regressors
# visuals.py is a plotting helper distributed with the course project files
import visuals as vs
# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(X_fts_1, y_price)
**RESULTS**
Depth of 1:
- This is a high-bias scenario: the scores are low, so the model is underfitting.
- The testing score (green line) increases with the number of observations, but only to approximately 0.4, a low score.
- The training score also decreases to a low score of approximately 0.4.
- This indicates the model does not fit the data well.
- Consequently, more training points would not benefit the model; instead, one should increase the model complexity to better fit the dataset.
Depth of 3:
- Ideal scenario.
- The testing score increases, then plateaus at a good score (~0.7).
- The model does a good job of generalizing.
- The training score decreases, then levels off at a high score (~0.8).
- There appears to be no high-bias or high-variance problem.
- More training points might still benefit the model slightly.
Depth of 6:
- Slight high-variance problem.
- The training score stays very high (~1.0), indicating the model is picking up a lot of the noise: it is overfitting.
- The testing score increases, but plateaus at a good score (~0.8).
- More training points might reduce the overfitting somewhat.
Depth of 10:
- High-variance problem.
- The training score stays very high (~0.9), again indicating a model that picks up a lot of the noise: it is overfitting.
- The testing score increases, but plateaus at a good score (~0.8).
- Try smaller sets of features (because the model is overfitting). A minimal sklearn sketch of such a learning curve follows below.
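Since visuals is a course-provided helper, here is a minimal sketch of a comparable learning curve for a single depth using scikit-learn's learning_curve (the depth, fold count, and plot styling are my own choices):
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
# Training vs. validation R^2 for a depth-3 tree as the training set grows
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth=3), X_fts_1, y_price,
    train_sizes=np.linspace(0.1, 1.0, 9), cv=10, scoring='r2')
plt.plot(sizes, np.mean(train_scores, axis=1), 'o-', label='Training score')
plt.plot(sizes, np.mean(test_scores, axis=1), 'o-', label='Validation score')
plt.xlabel('Number of training points')
plt.ylabel('R^2 score')
plt.legend()
plt.show()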
vs.ModelComplexity(X_train, y_train)
**RESULTS**
The ideal depth appears to be 3 or 4. A depth of 4 requires a bit more computation but may indicate a better model; a maximum depth of 3 also looks like a good choice.
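Similarly, a rough equivalent of the complexity curve can be produced with scikit-learn's validation_curve (again a sketch of my own, not the helper's implementation):
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
# Training vs. validation R^2 as max_depth grows from 1 to 10
depths = list(range(1, 11))
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(), X_train, y_train,
    param_name='max_depth', param_range=depths, cv=10, scoring='r2')
plt.plot(depths, np.mean(train_scores, axis=1), 'o-', label='Training score')
plt.plot(depths, np.mean(test_scores, axis=1), 'o-', label='Validation score')
plt.xlabel('max_depth')
plt.ylabel('R^2 score')
plt.legend()
plt.show()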
Grid Search
Grid search exhaustively tries the specified values of the given parameters and, after fitting the supplied data, returns the parameter values that perform best. This takes the guesswork out of seeking optimal parameter values for an estimator.
Although we will be using GridSearchCV, it may be computationally expensive for a bigger dataset. Other techniques can save time during hyperparameter optimization, such as RandomizedSearchCV, which, instead of exploring the whole parameter space, samples a fixed number of parameter settings from specified distributions.
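As a brief sketch of that alternative (not used for the results below; the distribution and n_iter are illustrative), RandomizedSearchCV samples parameter settings from a distribution instead of enumerating them all:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from scipy.stats import randint
# Sample 5 random 'max_depth' settings rather than trying every value
rand_search = RandomizedSearchCV(
    DecisionTreeRegressor(), param_distributions={'max_depth': randint(1, 11)},
    n_iter=5, scoring='r2', cv=10, random_state=0)
rand_search.fit(X_fts_1, y_price)
print(rand_search.best_params_)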
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor
# Grid search over max_depth using the R^2 scorer defined earlier
def GridSearch_DecReg(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
    decision tree regressor trained on the input data [X, y]. """
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': range(1, 11)}
    # Transform 'Score_R2' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(Score_R2)
    # Create the grid search object
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the fitted search (the best model is grid.best_estimator_)
    return grid
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
def VisualizeDecisionTree(parameters, x, y):
params = parameters.best_estimator_.get_params()
dt_model = DecisionTreeRegressor(**params)
dt_fit = dt_model.fit(x, y)
####
dot_data = StringIO()
export_graphviz(dt_fit, out_file=dot_data, special_characters=True,
filled=True, rounded=True, feature_names=x.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
return Image(graph.create_png())
grid_p = GridSearch_DecReg(X_fts_1, y_price)
# Produce the value for 'max_depth'
print("Parameter 'max_depth' is {0} for the optimal model.".format(grid_p.best_estimator_.get_params()['max_depth']))
Parameter 'max_depth' is 5 for the optimal model.
# Checking the scores
# print(grid_p.cv_results_)
# Create a decision tree with the optimal parameters and evaluate it on a fresh split
X_train, X_test, y_train, y_test = train_test_split(X_fts_1, y_price, test_size=0.2)
params = grid_p.best_estimator_.get_params()
dt_model = DecisionTreeRegressor(**params)
dt_fit = dt_model.fit(X_train, y_train)
dt_scores = cross_val_score(dt_fit, X_train, y_train, cv = 10)
r2_sqr_ytest = r2_score(y_test, grid_p.best_estimator_.predict(X_test))
score_ytest = dt_fit.score(X_test, y_test)
print("""R Squared using the predicted model using the gridCV parameters
for the y test is {0:.2f}""".format(r2_sqr_ytest))
print("The score fitting for the testing set is {0:.2f}".format(score_ytest))
print("Mean cross validation score: {0:.2f}".format(np.mean(dt_scores)))
R Squared using the predicted model using the gridCV parameters
for the y test is 0.96
The score fitting for the testing set is 0.74
Mean cross validation score: 0.64
VisualizeDecisionTree(grid_p, X_train, y_train)
**Quite interesting. The variables used were LSTAT, RM, DIS, CRIM, NOX, PTRATIO, TAX**
# New, reduced set of variables: the top features used by the tree
X_fts_2 = X_fts_1[['LSTAT', 'RM', 'DIS', 'CRIM', 'NOX']]
grid_p_2 = GridSearch_DecReg(X_fts_2, y_price)
# Produce the value for 'max_depth'
print("Parameter 'max_depth' is {0} for the optimal model.".format(grid_p_2.best_estimator_.get_params()['max_depth']))
Parameter 'max_depth' is 5 for the optimal model.
# Create a decision tree with the optimal parameters and evaluate it on a fresh split
X_train, X_test, y_train, y_test = train_test_split(X_fts_2, y_price, test_size=0.2)
params = grid_p_2.best_estimator_.get_params()
dt_model = DecisionTreeRegressor(**params)
dt_fit = dt_model.fit(X_train, y_train)
dt_scores = cross_val_score(dt_fit, X_train, y_train, cv = 10)
r2_sqr_ytest = r2_score(y_test, grid_p_2.best_estimator_.predict(X_test))
score_ytest = dt_fit.score(X_test, y_test)
print("""R Squared using the predicted model using the gridCV parameters
for the y test is {0:.2f}""".format(r2_sqr_ytest))
print("The score fitting for the testing set is {0:.2f}".format(score_ytest))
print("Mean cross validation score: {0:.2f}".format(np.mean(dt_scores)))
R Squared using the predicted model using the gridCV parameters
for the y test is 0.90
The score fitting for the testing set is 0.86
Mean cross validation score: 0.65
# Will redo the process with the top 5 variables!
vs.ModelLearning(X_fts_2, y_price)
VisualizeDecisionTree(grid_p_2, X_train, y_train)
We will use the first model for predictions.
# Generate hypothetical client data by sampling uniformly within each feature's observed range
client_data = {}
for val in boston_df.columns:
    min_val = np.min(boston_df[val])
    max_val = np.max(boston_df[val])
    sampl = np.random.uniform(low=min_val, high=max_val, size=(10,))
    client_data[val] = sampl
client_df = DataFrame.from_dict(client_data)
# Keep only the features used by the first model
client_df = client_df[['CRIM', 'ZN', 'CHAS', 'NOX', 'RM', 'DIS',
                       'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
for i, price in enumerate(grid_p.best_estimator_.predict(client_df)):
print ("Predicted selling price for Client {0}'s home: ${1:,.2f}".format(i+1, price*1000))
Predicted selling price for Client 1's home: $20,967.76
Predicted selling price for Client 2's home: $14,410.00
Predicted selling price for Client 3's home: $21,900.00
Predicted selling price for Client 4's home: $9,810.87
Predicted selling price for Client 5's home: $14,410.00
Predicted selling price for Client 6's home: $9,810.87
Predicted selling price for Client 7's home: $15,539.29
Predicted selling price for Client 8's home: $26,168.42
Predicted selling price for Client 9's home: $15,539.29
Predicted selling price for Client 10's home: $15,539.29
Results
- Min Price: $5,000.00
- Max Price: $50,000.00
- Std. Deviation: $9,188.01
- Median Price: $21,200.00
- Mean Price: $22,532.81
# Where the predicted prices fall within the overall distribution of PRICE
plt.hist(y_price, bins=20)
for price in grid_p.best_estimator_.predict(client_df):
    plt.axvline(price, lw=5, c='r')
plt.show()
**Most of the predicted prices fall within the distribution**
# Re-run the grid search over several random trials using the course helper
def Temp_GridSearch(X, y):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    regressor = DecisionTreeRegressor()
    params = {'max_depth': range(1, 11)}
    scoring_fnc = make_scorer(Score_R2)
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(X, y)
    return grid.best_estimator_
vs.PredictTrials(X_fts_1, y_price, Temp_GridSearch, client_df.values)
Trial 1: $20.34
Trial 2: $20.50
Trial 3: $10.90
Trial 4: $19.93
Trial 5: $20.34
Trial 6: $20.67
Trial 7: $21.05
Trial 8: $20.58
Trial 9: $20.89
Trial 10: $21.77
Range in prices: $10.87