The frame above is an example of using a trained machine learning model to
make predictions of what a Google App star rating might be, given the inputs.
This model is for illustration purposes only and is an example of applying machine
learning to help a business. For example, a company that is designing and developing a Google App
might find such a trained model useful in deciding what to name the application, or
which audience to target, for a better chance of receiving a higher app rating before going to
market.
The code example below illustrates one process for creating such a machine learning model.
The steps include data exploration through examination and visualization of both numeric and
text data (language processing); data cleaning and processing (sometimes referred to as data
wrangling); more in-depth treatment of missing data (through imputation), handling of outliers,
and label encoding so that the data can be passed to machine learning and AI algorithms; and,
finally, building better prediction models through dimension reduction and parameter tuning,
a process called hyperparameter optimization.
Machine Learning Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('/content/drive')
#read dataset
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/housing_train.csv')
df.head()
df.shape
#to display all columns and not get the ...
pd.options.display.max_columns = None
df.head()
#to check if any missing value is present or not
df.isnull().values.any()
#to display which columns have missing values
df.isnull().sum()
df.isnull().sum().sort_values(ascending=False)
## remove rows where both lat & long are missing values
df.dropna(axis='index',how='all', subset=['lat','long'],inplace=True)
# checking number of rows again to see that the above NA rows for lat long have been removed
df.shape
df.isnull().sum()
### now we have a huge number of missing values in laundry_options & parking_options
df.dtypes
1. Drop the rows that contain missing values, but that is not a professional approach.
2. Fill missing values with statistical measures such as the mean, median, or mode (in the case of categorical data); however, if there are many missing values this distorts the distribution of the data.
3. The professional way is to fill missing values in a manner that does not affect the distribution of the feature, i.e., using a more advanced approach such as Random Value Imputation (a short sketch of the difference follows this list).
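As a minimal sketch of the difference between the statistical and random-sample approaches, using a small hypothetical Series rather than the rental data:
# minimal sketch on a toy Series (hypothetical data, not the rental dataset)
import numpy as np
import pandas as pd
toy = pd.Series(['a', 'a', 'a', 'b', 'c', np.nan, np.nan, np.nan])
# statistical approach: fill every NaN with the mode ('a'), which inflates that category
mode_filled = toy.fillna(toy.mode()[0])
# random value imputation: draw replacements from the observed values,
# which roughly preserves the original proportions
sampled = toy.dropna().sample(toy.isnull().sum(), replace=True, random_state=0)
sampled.index = toy[toy.isnull()].index
random_filled = toy.copy()
random_filled.loc[toy.isnull()] = sampled
print(mode_filled.value_counts(normalize=True))
print(random_filled.value_counts(normalize=True))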
# first create a copy of the data, as this is just an example
data=df.copy()
# there are over 54k rows of data without laundry_options data
data['laundry_options'].isnull().sum()
# we can visualize the values of laundry_options and their relative counts
data['laundry_options'].value_counts().plot(kind='bar')
# and we can take an actual count
data['laundry_options'].value_counts()
# find the mode
data['laundry_options'].mode()[0]
# insert the mode in rows with missing laundry_options values
data['laundry_options'].fillna('w/d in unit',inplace=True)
data['laundry_options'].value_counts()
data['laundry_options'].value_counts().plot(kind='bar')
Aim: Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values.
When should it be used? It assumes that the data are missing completely at random (MCAR).
#To fetch a random sample
df['laundry_options'].dropna().sample()
df['laundry_options'].isnull().sum()
## considering sample of size 54127
random_sample=df['laundry_options'].dropna().sample(54127)
## random_sample=df['laundry_options'].dropna().sample(df['laundry_options'].isnull().sum())
random_sample
random_sample.index
df[df['laundry_options'].isnull()].index
random_sample.index=df[df['laundry_options'].isnull()].index
random_sample.index
random_sample
df.loc[df['laundry_options'].isnull(),'laundry_options']=random_sample
df['laundry_options'].value_counts().plot(kind='bar')
We see from the distribution above that this method preserves the original distribution better than the statistical (mode) replacement method, as the quick check below confirms.
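A quick numeric check of that claim, reusing the variables already defined above (data holds the mode-imputed copy, df the randomly imputed one):
# compare the category proportions produced by the two imputation approaches
print(data['laundry_options'].value_counts(normalize=True))  # mode imputation inflates 'w/d in unit'
print(df['laundry_options'].value_counts(normalize=True))    # random imputation stays close to the original proportions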
#### We can automate the steps above by creating a function
def impute_nan(df, variable):
    ## create the random sample to fill in the NaNs
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum())
    ## pandas needs matching indices in order to assign the values back
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable] = random_sample
# Exploration steps
df['parking_options'].isnull().sum()
df['parking_options'].value_counts()/len(df)*100
## The initial ratio between off-street parking & carport (the top two values present in the data) is:
33/10
## imputing NaNs of parking_options
impute_nan(df,'parking_options')
df['parking_options'].value_counts()/len(df)*100
## After imputing NaNs, let's check the ratio between off-street parking & carport
52/16
We see that the ratio between the top two values prior to imputation and post imputation is not very different (3.3 vs 3.25); a programmatic version of this check follows.
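Rather than hard-coding 33/10 and 52/16, the ratio can be computed directly; a small sketch (to compare before and after, capture the value counts before calling impute_nan):
# ratio of the two most common parking_options categories after imputation
counts = df['parking_options'].value_counts()
print(counts.index[:2].tolist())
print(counts.iloc[0] / counts.iloc[1])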
# And finally, we can check to make sure the NaNs have been taken care of
df.isnull().sum()
df.head()
df.columns
# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
total_description = ''
stopwords = set(STOPWORDS)
# iterate through the first 10k rows of the dataframe
for val in df['description'][0:10000]:
    # typecast each val to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    total_description = total_description + " ".join(tokens) + " "
Below is Craig's simplified version of the code above, which runs much more quickly.
# iterate through the first 10k rows of the dataframe
for val in df['description'][0:10000]:
    if isinstance(val, str):
        total_description += val.lower()
### generating the WordCloud table
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(total_description)
# plotting the WordCloud image
plt.figure(figsize = (8, 8))
plt.imshow(wordcloud)
plt.axis("off")
from nltk.tokenize import RegexpTokenizer as regextoken
# First let's take care of the NaNs
df['description'].isnull().sum()
df['description'].fillna('no description',inplace=True)
df['description'].isnull().sum()
# Converting all the text to lowercase
df['description'] = df['description'].apply(lambda x: x.lower())
## Creating a regular expression tokenizer that keeps only alphabetic tokens, i.e., removes all special characters and digits
tokenizer = regextoken("[a-zA-Z]+")
tokenizer
df['description'][0]
print(tokenizer.tokenize(df['description'][0]))
sample=df.sample(10000)
sample.head()
# Applying the tokenizer to each row of the sampled descriptions
sample_tokens = sample['description'].apply(tokenizer.tokenize)
sample_tokens.index
sample_index = sample_tokens.index[0]
# Examining the tokens created for the first sampled row
print(sample_tokens[sample_index])
Viewing the first tokenized list of words, we see that we still have some stopwords like an, and, it, etc., so we need to remove them.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# These are common English words, defined in NLTK's stopwords corpus, that typically don't add meaning to the text and can be removed
stop = stopwords.words("english")
print(stop)
### with respect to very first row, how to remove stopwords
rev=sample_tokens[sample_index]
print(rev)
print([token for token in rev if token not in stop])
len(sample_tokens)
### using function
def remove_stopwords(text):
    updated_text = [token for token in text if token not in stop]
    return updated_text
sample_tokens=sample_tokens.apply(remove_stopwords)
type(sample_tokens)
len(sample_tokens)
indices=[i for i in range(0,10000)]
rev=pd.Series(data=sample_tokens.values,index=indices)
rev
# Concatenating all the descriptions, because we need to count the frequency of each word in order to plot the words with the highest counts
all_reviews = sample_tokens.astype(str).str.cat()
type(all_reviews)
len(all_reviews)
## perform tokenization to convert the string (all_reviews) back into a list so that we can count word frequencies
cleaned_reviews = tokenizer.tokenize(all_reviews)
len(cleaned_reviews)
type(cleaned_reviews)
# obtain the frequency of individual words in the descriptions; for this we use FreqDist
from nltk import FreqDist, bigrams, trigrams
fd = FreqDist()
for word in cleaned_reviews:
    fd[word] += 1
# Examining the top 5 most frequent words
fd.most_common(5)
# Plotting the top 20 most frequent words
plt.figure(figsize = (15, 8))
fd.plot(20)
#Bi-grams
from nltk import bigrams
# Generating bigrams from the descriptions (renamed to avoid shadowing the imported bigrams function)
bigram_list = bigrams(cleaned_reviews)
# Getting the bigram frequency distribution
fd_bigrams = FreqDist()
for bigram in bigram_list:
    fd_bigrams[bigram] += 1
# Examining the top 5 most frequent bigrams
fd_bigrams.most_common(5)
# Plotting the top 50 most frequent bigrams
plt.figure(figsize = (15, 8))
fd_bigrams.plot(50)
from nltk import trigrams
# Generating trigrams from the descriptions (renamed to avoid shadowing the imported trigrams function)
trigram_list = trigrams(cleaned_reviews)
fd_trigrams = FreqDist()
for trigram in trigram_list:
    fd_trigrams[trigram] += 1
fd_trigrams.most_common(5)
plt.figure(figsize = (10, 5))
fd_trigrams.plot(50)
!pip install --upgrade folium
import folium
from folium.plugins import HeatMap
# Create a base map
m = folium.Map(zoom_start=2)
m
df.columns
HeatMap(data=df[['lat', 'long','price']], radius=15).add_to(m)
# Show the map
m
feature_list = ['beds',
                'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed',
                'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished']
def label_distribution(feature):
    return sns.countplot(df[feature])
for i in feature_list:
    # in this case, we have to create the figure first and then draw the distribution
    plt.figure(figsize=(15,5))
    label_distribution(i)
df.columns
plt.figure(figsize=(12,8))
sns.distplot(df[df['dogs_allowed']==0]['price'],hist=False,label="Price where dogs are not allowed")
sns.distplot(df[df['dogs_allowed']==1]['price'],hist=False,label="Price where dogs are allowed")
plt.legend()
plt.title("Price Distribution")
imp_features=['price',
'sqfeet',
'beds',
'baths']
sns.boxplot(df['price'])
sns.stripplot(df['price'])
sns.boxplot(df['price'])
for feature in imp_features:
    plt.figure()
    sns.stripplot(df[feature])
    sns.boxplot(df[feature])
### using Q-Q plots, we will determine whether or not we have outliers in our data
import statsmodels.api as sm
sm.qqplot(df['price'],line='45')
## Automating this with a function
import statsmodels.api as sm
def qq_plots(df, col):
    plt.figure(figsize=(10, 4))
    sm.qqplot(df[col], line='45')
    plt.title("Normal Q-Q plot of {}".format(col))
for feature in imp_features:
    qq_plots(df, feature)
After detecting outliers we should remove or treat them.
Outliers badly affect the mean and standard deviation of the dataset.
They increase the error variance and reduce the power of statistical tests.
Most machine learning algorithms do not work well in the presence of outliers, so it is desirable to detect and treat them.
For all these reasons we must be careful about outliers and treat them before building an ML model.
There are several techniques for dealing with outliers.
1. Deleting observations, but that is not a professional approach, as it loses information from our data.
We delete outlier values only if they are due to data entry errors.
2. Transforming values.
Transforming variables can also reduce the impact of outliers; the transformed values compress the variation caused by extreme values.
1. Scaling
2. Log transformation
3. Cube root normalization
4. Box-Cox transformation
* These techniques convert larger data values into smaller ones.
* If the data has too many extreme values or is skewed, they help make the data closer to normal.
* However, these techniques do not always give the best results.
* There is no loss of data with these methods.
* Of these methods, the Box-Cox transformation often gives the best result; a short sketch of these transformations follows this list.
3. Imputation, using statistical techniques to deal with outliers such as the median, Z-score, IQR, or robust Z-score.
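As a rough sketch of the transformations mentioned in point 2, using scipy.stats.boxcox and assuming only strictly positive prices are kept for the illustration:
# sketch: compare a log transform and a Box-Cox transform of price
# (Box-Cox requires strictly positive values, so zero/negative prices are excluded here)
from scipy import stats
positive_price = df.loc[df['price'] > 0, 'price']
log_price = np.log(positive_price)
boxcox_price, fitted_lambda = stats.boxcox(positive_price)
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1); sns.distplot(positive_price); plt.title('raw price')
plt.subplot(1, 3, 2); sns.distplot(log_price); plt.title('log(price)')
plt.subplot(1, 3, 3); sns.distplot(boxcox_price); plt.title('Box-Cox(price)')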
df.shape
data=df.copy()
data['price'].nlargest(400)
data['price'].median()
data['price'].mean()
### wherever price is > 5000, replace it with the median, because the median is not affected by outliers
data['price']=np.where(data['price']>5000,data['price'].median(),data['price'])
### Automating this with a function
def deal_with_outliers(feature, threshold):
    data[feature] = np.where(data[feature] > threshold, data[feature].median(), data[feature])
data['price'].mean()
data['price'].median()
## distribution of price before dealing with outliers
sns.distplot(df['price'])
## the price is now approximately normally distributed, which makes the data better suited to ML algorithms (a quick skewness check follows below)
sns.distplot(data['price'])
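A quick numeric sanity check of the "approximately normal" claim (skewness near zero indicates a roughly symmetric distribution):
# skewness of price before and after replacing extreme values with the median
print('skew before:', df['price'].skew())
print('skew after :', data['price'].skew())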
data['sqfeet'].nlargest(200)
deal_with_outliers('sqfeet',5000)
sns.distplot(df['sqfeet'])
sns.distplot(data['sqfeet'])
#now for beds
deal_with_outliers('beds',999)
sns.boxplot(df['beds'])
sns.boxplot(data['beds'])
data['baths'].nlargest(50)
## before dealing with outliers
sns.boxplot(df['baths'])
## imputing your outliers
deal_with_outliers('baths',10)
## after dealing with outliers
sns.boxplot(data['baths'])
# now plotting the distribution of each feature
for feature in imp_features:
    plt.figure()  # in this case, we have to create the figure first and then draw the distribution
    sns.distplot(data[feature])
plt.figure(figsize=(12,8))
sns.distplot(data[data['dogs_allowed']==0]['price'],hist=False,label="Price where dogs are not allowed")
sns.distplot(data[data['dogs_allowed']==1]['price'],hist=False,label="Price where dogs are allowed")
plt.legend()
plt.title("Price Distribution")
(data['dogs_allowed'] == 0)
def price_distribution(feature, label):
    plt.figure(figsize=(12,8))
    sns.distplot(data[data[feature]==0]['price'],hist=False,label="Price where {} are not allowed".format(label))
    sns.distplot(data[data[feature]==1]['price'],hist=False,label="Price where {} are allowed".format(label))
    plt.legend()
    plt.title("Price Distribution")
price_distribution('cats_allowed','pets')
df.columns
df['electric_vehicle_charge'].unique()
price_distribution('comes_furnished','furnished units')
price_distribution('electric_vehicle_charge','electric_vehicle_charge')
import plotly.express as px
fig=px.scatter(data, x="price", y="sqfeet")
fig.show()
#### There is a complex relationship between price & sqfeet, which means a plain linear regression will not perform well
### use ensemble algorithms to predict price; they will definitely perform better (see the sketch below)
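To illustrate that point, a rough sketch that fits a linear model and a random forest on sqfeet alone and compares their R² scores (a 20k-row subsample keeps it quick; this is an illustration, not a final model):
# sketch: price ~ sqfeet, linear model vs random forest (illustration only)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
sample_xy = data[['sqfeet', 'price']].dropna().sample(20000, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(sample_xy[['sqfeet']], sample_xy['price'],
                                      test_size=0.2, random_state=0)
for m in (LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)):
    m.fit(Xtr, ytr)
    print(type(m).__name__, r2_score(yte, m.predict(Xte)))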
data.corr()
## highlighting results
data.corr().style.background_gradient(cmap='Reds')
## Highly correlated pairs:
## sqfeet -- beds
## sqfeet -- baths
## dogs_allowed -- cats_allowed
## this means we can drop beds, baths, and cats_allowed (a programmatic check follows below)
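The same observation can be made programmatically; a small sketch that lists numeric feature pairs whose absolute correlation exceeds an arbitrary threshold of 0.6:
# print feature pairs with |correlation| above the threshold (0.6 chosen only for illustration)
corr = data.corr()
threshold = 0.6
for i, col in enumerate(corr.columns):
    for other in corr.columns[i + 1:]:
        if abs(corr.loc[col, other]) > threshold:
            print(col, other, round(corr.loc[col, other], 2))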
data.columns
dataframe=data.copy()
dataframe.drop(['id','url','region_url','beds','baths','cats_allowed','image_url','description','lat','long'],axis=1,inplace=True)
dataframe.dtypes
cat_features=[feature for feature in dataframe.columns if dataframe[feature].dtype=='O']
cat_features
for feature in cat_features:
    print('total distinct values in {}: {}'.format(feature, len(dataframe[feature].unique())))
dataframe.shape
region_count=dataframe['region'].value_counts()
region_count
pd.set_option('display.max_rows',298)
region_count=dataframe['region'].value_counts()
region_count
len(region_count[region_count>500])
important=region_count[region_count>500].index
important
def remove(x):
    if x not in important:
        return 'other'
    else:
        return x
dataframe['region'].tail(100)
dataframe['region']=dataframe['region'].apply(remove)
dataframe['region'].tail(50)
len(dataframe['region'].unique())
def get_stats(feature):
    count = dataframe[feature].value_counts()
    pd.set_option('display.max_rows', df[feature].nunique())
    return count
get_stats('state')
dataframe.shape
def extract_imp_sub_categories(feature, threshold):
    count = dataframe[feature].value_counts()
    important = count[count > threshold].index
    return important
sub_cat=extract_imp_sub_categories('state',2000)
sub_cat
dataframe['state']=dataframe['state'].apply(lambda x:'other' if x not in sub_cat else x)
dataframe['state'].nunique()
get_stats('type')
imp2=extract_imp_sub_categories('type',3000)
imp2
dataframe['type']=dataframe['type'].apply(lambda x:'other' if x not in imp2 else x)
for feature in cat_features:
    print('total distinct values in {}: {}'.format(feature, len(dataframe[feature].unique())))
## frequency-encode region: map each region to its count in the data
dictionary=dict(dataframe['region'].value_counts())
dictionary
dataframe['region']=dataframe['region'].map(dictionary)
dataframe['region']
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for feature in ['laundry_options','parking_options','type']:
    dataframe[feature] = le.fit_transform(dataframe[feature])
!pip install category_encoders
from category_encoders import CountEncoder
dataframe['state'].head()
pd.value_counts(dataframe['state'])
pd.set_option('display.max_rows',33)
pd.value_counts(dataframe['state'])
ce=CountEncoder()
dataframe['state']=ce.fit_transform(dataframe['state'])
dataframe['state'].head()
dataframe.dtypes
dataframe.head()
y=dataframe['price']
x=dataframe.drop('price',axis=1)
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
pca_fit = pca.fit_transform(x[:1000])
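It is worth checking how much variance those five components actually retain; PCA is also scale-sensitive, so standardizing the features first is generally recommended. A hedged sketch on the same 1000-row subset:
# variance explained by the 5 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
# PCA is scale-sensitive; a standardized variant is often preferable (sketch)
from sklearn.preprocessing import StandardScaler
x_scaled = StandardScaler().fit_transform(x[:1000])
pca_scaled = PCA(n_components=5).fit(x_scaled)
print(pca_scaled.explained_variance_ratio_.sum())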
#split dataset into train and test
from sklearn.model_selection import train_test_split
#x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8,random_state=0)
x_train, x_test, y_train, y_test = train_test_split(pca_fit, y[:1000], test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
dt=DecisionTreeRegressor(random_state=0)
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)
y_pred
# evaluate how well our model performs
from sklearn.metrics import r2_score
r2=r2_score(y_test,y_pred)
r2
#fit Regression models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
### regression models
models = []
models.append(('LinearRegression', LinearRegression()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('Decision Tree', DecisionTreeRegressor()))
models.append(('KNN', KNeighborsRegressor(n_neighbors = 5)))
models
from sklearn.metrics import r2_score
for name, model in models:
    print(name)
    model.fit(x_train, y_train)
    # Make predictions.
    predictions = model.predict(x_test)
    # Compute the R^2 score (true values first, then predictions).
    print(r2_score(y_test, predictions))
    print('\n')
## RF Performs best
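Since a single train/test split can be noisy, a cross-validated comparison gives a more reliable ranking; a quick sketch using cross_val_score on the same PCA-reduced subset:
# cross-validated R^2 for each model (3 folds to keep it quick)
from sklearn.model_selection import cross_val_score
for name, model in models:
    scores = cross_val_score(model, pca_fit, y[:1000], cv=3, scoring='r2')
    print(name, scores.mean())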
dataframe.shape
final=dataframe[0:10000]
dep=final['price']
ind=final.drop('price',axis=1)
#split dataset into train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(ind,dep,train_size=0.8,random_state=0)
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#Randomized Search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 4)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 3)]
# Minimum number of samples required to split a node
min_samples_split = [ 5, 15, 100]
# Minimum number of samples required at each leaf node
#min_samples_leaf = [1, 5, 10]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split}
random_grid
4*2*3*3  # total number of parameter combinations in the grid
reg_rf=RandomForestRegressor()
# Random search of parameters, using 3-fold cross-validation;
# RandomizedSearchCV samples only a subset of the combinations (n_iter defaults to 10)
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid, cv = 3, verbose=2, n_jobs = -1)
rf_random.fit(x_train,y_train)
rf_random.best_params_
prediction = rf_random.predict(x_test)
sns.distplot(y_test-prediction)
r2_score(y_test,prediction)
# Save the trained model as a pickle file.
import pickle
with open('drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/Models/Apps.pickle', 'wb') as f:
    pickle.dump(rf_random, f)
# with open('drive/MyDrive/Colab Notebooks/GoogleAppRating_Prediction/Models/AppsPCA.pickle', 'wb') as f:
#     pickle.dump(pca, f)
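To reuse the saved model later, it can be loaded back with pickle.load; a minimal sketch using the same path as above:
# reload the pickled search object and sanity-check a few predictions
with open('drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/Models/Apps.pickle', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict(x_test[:5]))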
dir(rf_random)
rf_random.best_estimator_.feature_importances_