The frame above is an example of using a trained machine learning model to
make predictions of what a Google App star rating might be, given the inputs.
This model is for illustration purposes only and is an example of applying machine
learning to help a business. For example, a company that is designing and developing a Google App
might find such a trained model useful in deciding what to name the application, or
which audience to target, for a better chance of receiving a higher app rating before going to
market.
The code example below illustrates one process for creating such a machine learning model.
The steps include data exploration through examination and visualization of both numeric and
text data (language processing); data cleaning and processing (sometimes referred to as data
wrangling); more in-depth treatment of missing data (through imputation), handling of outliers,
and label encoding so that the data can be passed to machine learning and AI algorithms; and,
finally, building better prediction models through dimension reduction and parameter tuning,
a process called hyperparameter optimization.
Machine Learning Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('/content/drive')
#read dataset
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/housing_train.csv')
df.head()
df.shape
#to display all columns and not get the ...
pd.options.display.max_columns = None
df.head()
#to check if any missing value is present or not
df.isnull().values.any()
#to display which columns have missing values
df.isnull().sum()
df.isnull().sum().sort_values(ascending=False)
## remove rows where both lat & long are missing values
df.dropna(axis='index',how='all', subset=['lat','long'],inplace=True)
# checking number of rows again to see that the above NA rows for lat long have been removed
df.shape
df.isnull().sum()
### now we have a huge number of missing values in laundry_options & parking_options
df.dtypes
1. Drop the rows that contain missing values, but that is not a professional approach.
2. Fill missing values with statistical measures such as the mean, median, or mode (in the case of categorical data); however, if there are many missing values this distorts the distribution of the data.
3. The professional way is to fill missing values in a manner that does not affect the distribution of the feature, i.e., using a more advanced approach such as Random Value Imputation (a short sketch of the difference follows this list).
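As a minimal sketch of the difference between the statistical and random-sample approaches, using a small hypothetical Series rather than the rental data:
# minimal sketch on a toy Series (hypothetical data, not the rental dataset)
import numpy as np
import pandas as pd
toy = pd.Series(['a', 'a', 'a', 'b', 'c', np.nan, np.nan, np.nan])
# statistical approach: fill every NaN with the mode ('a'), which inflates that category
mode_filled = toy.fillna(toy.mode()[0])
# random value imputation: draw replacements from the observed values,
# which roughly preserves the original proportions
sampled = toy.dropna().sample(toy.isnull().sum(), replace=True, random_state=0)
sampled.index = toy[toy.isnull()].index
random_filled = toy.copy()
random_filled.loc[toy.isnull()] = sampled
print(mode_filled.value_counts(normalize=True))
print(random_filled.value_counts(normalize=True))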
# first create a copy of the data, as this is just an example
data=df.copy()
# there are over 54k rows of data without laundry_options data
data['laundry_options'].isnull().sum()
# we can visualize the values of laundry_options and their relative counts
data['laundry_options'].value_counts().plot(kind='bar')
# and we can take an actual count
data['laundry_options'].value_counts()
# find the mode
data['laundry_options'].mode()[0]
# insert the mode in rows with missing laundry_options values
data['laundry_options'].fillna('w/d in unit',inplace=True)
data['laundry_options'].value_counts()
data['laundry_options'].value_counts().plot(kind='bar')
Aim: Random sample imputation consists of taking random observations from the dataset and using them to replace the NaN values.
When should it be used? It assumes that the data are missing completely at random (MCAR).
#To fetch a random sample
df['laundry_options'].dropna().sample()
df['laundry_options'].isnull().sum()
## considering sample of size 54127
random_sample=df['laundry_options'].dropna().sample(54127)
## random_sample=df['laundry_options'].dropna().sample(df['laundry_options'].isnull().sum())
random_sample
random_sample.index
df[df['laundry_options'].isnull()].index
random_sample.index=df[df['laundry_options'].isnull()].index
random_sample.index
random_sample
df.loc[df['laundry_options'].isnull(),'laundry_options']=random_sample
df['laundry_options'].value_counts().plot(kind='bar')
We see from the distribution above that this method preserves the original distribution better than the statistical (mode) replacement method, as the quick check below confirms.
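A quick numeric check of that claim, reusing the variables already defined above (data holds the mode-imputed copy, df the randomly imputed one):
# compare the category proportions produced by the two imputation approaches
print(data['laundry_options'].value_counts(normalize=True))  # mode imputation inflates 'w/d in unit'
print(df['laundry_options'].value_counts(normalize=True))    # random imputation stays close to the original proportions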
#### We can automate the steps above by creating a function
def impute_nan(df, variable):
    ## create the random sample to fill in the NaNs
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum())
    ## pandas needs matching indices in order to assign the values back
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable] = random_sample
# Exploration steps
df['parking_options'].isnull().sum()
df['parking_options'].value_counts()/len(df)*100
## The initial ratio between off-street parking & carport (the top two values present in the data) is:
33/10
## imputing NaNs of parking_options
impute_nan(df,'parking_options')
df['parking_options'].value_counts()/len(df)*100
## After imputing NaNs, let's check the ratio between off-street parking & carport
52/16
We see that the ratio between the top two values prior to imputation and post imputation is not very different (3.3 vs 3.25); a programmatic version of this check follows.
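Rather than hard-coding 33/10 and 52/16, the ratio can be computed directly; a small sketch (to compare before and after, capture the value counts before calling impute_nan):
# ratio of the two most common parking_options categories after imputation
counts = df['parking_options'].value_counts()
print(counts.index[:2].tolist())
print(counts.iloc[0] / counts.iloc[1])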
# And finally, we can check to make sure the NaNs have been taken care of
df.isnull().sum()
df.head()
df.columns
# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
total_description = ''
stopwords = set(STOPWORDS)
# iterate through the first 10k rows of the dataframe
for val in df['description'][0:10000]:
    # typecast each val to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    total_description = total_description + " ".join(tokens) + " "
Below is Craig's simplified version of the code above, which runs much more quickly.
# iterate through the first 10k rows of the dataframe
for val in df['description'][0:10000]:
    if isinstance(val, str):
        total_description += val.lower()
### generating the WordCloud table
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = stopwords,
min_font_size = 10).generate(total_description)
# plotting the WordCloud image
plt.figure(figsize = (8, 8))
plt.imshow(wordcloud)
plt.axis("off")
from nltk.tokenize import RegexpTokenizer as regextoken
# First let's take care of the NaNs
df['description'].isnull().sum()
df['description'].fillna('no description',inplace=True)
df['description'].isnull().sum()
# Converting all the text to lowercase
df['description'] = df['description'].apply(lambda x: x.lower())
## Creating a regular expression tokenizer that keeps only alphabetic tokens, i.e., removes all special characters and digits
tokenizer = regextoken("[a-zA-Z]+")
tokenizer
df['description'][0]
print(tokenizer.tokenize(df['description'][0]))
sample=df.sample(10000)
sample.head()
# Applying the tokenizer to each row of the sampled descriptions
sample_tokens = sample['description'].apply(tokenizer.tokenize)
sample_tokens.index
sample_index = sample_tokens.index[0]
# Examining the tokens created for the first sampled row
print(sample_tokens[sample_index])
Viewing the first tokenized list of words, we see that we still have some stopwords like an, and, it, etc., so we need to remove them.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# These are common English words, defined in NLTK's stopwords corpus, that typically don't add meaning to the text and can be removed
stop = stopwords.words("english")
print(stop)
### with respect to very first row, how to remove stopwords
rev=sample_tokens[sample_index]
print(rev)
print([token for token in rev if token not in stop])
len(sample_tokens)
### using function
def remove_stopwords(text):
    updated_text = [token for token in text if token not in stop]
    return updated_text
sample_tokens=sample_tokens.apply(remove_stopwords)
type(sample_tokens)
len(sample_tokens)
indices=[i for i in range(0,10000)]
rev=pd.Series(data=sample_tokens.values,index=indices)
rev
# Concatenating all the descriptions, because we need to count the frequency of each word in order to plot the words with the highest counts
all_reviews = sample_tokens.astype(str).str.cat()
type(all_reviews)
len(all_reviews)
## perform tokenization to convert the string (all_reviews) back into a list so that we can count word frequencies
cleaned_reviews = tokenizer.tokenize(all_reviews)
len(cleaned_reviews)
type(cleaned_reviews)
# obtain the frequency of individual words in the descriptions; for this we use FreqDist
from nltk import FreqDist, bigrams, trigrams
fd = FreqDist()
for word in cleaned_reviews:
    fd[word] += 1
# Examining the top 5 most frequent words
fd.most_common(5)
# Plotting the top 20 most frequent words
plt.figure(figsize = (15, 8))
fd.plot(20)
#Bi-grams
from nltk import bigrams
# Generating bigrams from the descriptions (renamed to avoid shadowing the imported bigrams function)
bigram_list = bigrams(cleaned_reviews)
# Getting the bigram frequency distribution
fd_bigrams = FreqDist()
for bigram in bigram_list:
    fd_bigrams[bigram] += 1
# Examining the top 5 most frequent bigrams
fd_bigrams.most_common(5)
# Plotting the top 50 most frequent bigrams
plt.figure(figsize = (15, 8))
fd_bigrams.plot(50)
from nltk import trigrams
# Generating trigrams from the descriptions (renamed to avoid shadowing the imported trigrams function)
trigram_list = trigrams(cleaned_reviews)
fd_trigrams = FreqDist()
for trigram in trigram_list:
    fd_trigrams[trigram] += 1
fd_trigrams.most_common(5)
plt.figure(figsize = (10, 5))
fd_trigrams.plot(50)
!pip install --upgrade folium
import folium
from folium.plugins import HeatMap
# Create a base map
m = folium.Map(zoom_start=2)
m
df.columns
HeatMap(data=df[['lat', 'long','price']], radius=15).add_to(m)
# Show the map
m
feature_list = ['beds',
                'baths', 'cats_allowed', 'dogs_allowed', 'smoking_allowed',
                'wheelchair_access', 'electric_vehicle_charge', 'comes_furnished']
def label_distribution(feature):
    return sns.countplot(df[feature])
for i in feature_list:
    # in this case, we have to create the figure first and then draw the distribution
    plt.figure(figsize=(15,5))
    label_distribution(i)
df.columns
plt.figure(figsize=(12,8))
sns.distplot(df[df['dogs_allowed']==0]['price'],hist=False,label="Price where dogs are not allowed")
sns.distplot(df[df['dogs_allowed']==1]['price'],hist=False,label="Price where dogs are allowed")
plt.legend()
plt.title("Price Distribution")
imp_features=['price',
'sqfeet',
'beds',
'baths']
sns.boxplot(df['price'])
sns.stripplot(df['price'])
sns.boxplot(df['price'])
for feature in imp_features:
    plt.figure()
    sns.stripplot(df[feature])
    sns.boxplot(df[feature])
### using Q-Q plots, we will determine whether or not we have outliers in our data
import statsmodels.api as sm
sm.qqplot(df['price'],line='45')
## Automating this with a function
import statsmodels.api as sm
def qq_plots(df, col):
    plt.figure(figsize=(10, 4))
    sm.qqplot(df[col], line='45')
    plt.title("Normal Q-Q plot of {}".format(col))
for feature in imp_features:
    qq_plots(df, feature)
After detecting outliers we should remove or treat them.
Outliers badly affect the mean and standard deviation of the dataset.
They increase the error variance and reduce the power of statistical tests.
Most machine learning algorithms do not work well in the presence of outliers, so it is desirable to detect and treat them.
For all these reasons we must be careful about outliers and treat them before building an ML model.
There are several techniques for dealing with outliers.
1. Deleting observations, but that is not a professional approach, as it loses information from our data.
We delete outlier values only if they are due to data entry errors.
2. Transforming values.
Transforming variables can also reduce the impact of outliers; the transformed values compress the variation caused by extreme values.
1. Scaling
2. Log transformation
3. Cube root normalization
4. Box-Cox transformation
* These techniques convert larger data values into smaller ones.
* If the data has too many extreme values or is skewed, they help make the data closer to normal.
* However, these techniques do not always give the best results.
* There is no loss of data with these methods.
* Of these methods, the Box-Cox transformation often gives the best result; a short sketch of these transformations follows this list.
3. Imputation, using statistical techniques to deal with outliers such as the median, Z-score, IQR, or robust Z-score.
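As a rough sketch of the transformations mentioned in point 2, using scipy.stats.boxcox and assuming only strictly positive prices are kept for the illustration:
# sketch: compare a log transform and a Box-Cox transform of price
# (Box-Cox requires strictly positive values, so zero/negative prices are excluded here)
from scipy import stats
positive_price = df.loc[df['price'] > 0, 'price']
log_price = np.log(positive_price)
boxcox_price, fitted_lambda = stats.boxcox(positive_price)
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1); sns.distplot(positive_price); plt.title('raw price')
plt.subplot(1, 3, 2); sns.distplot(log_price); plt.title('log(price)')
plt.subplot(1, 3, 3); sns.distplot(boxcox_price); plt.title('Box-Cox(price)')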
df.shape
data=df.copy()
data['price'].nlargest(400)
data['price'].median()
data['price'].mean()
### wherever price is > 5000, replace it with the median, because the median is not affected by outliers
data['price']=np.where(data['price']>5000,data['price'].median(),data['price'])
### Automating this with a function
def deal_with_outliers(feature, threshold):
    data[feature] = np.where(data[feature] > threshold, data[feature].median(), data[feature])
data['price'].mean()
data['price'].median()
## distribution of price before dealing with outliers
sns.distplot(df['price'])
## the price is now approximately normally distributed, which makes the data better suited to ML algorithms (a quick skewness check follows below)
sns.distplot(data['price'])
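A quick numeric sanity check of the "approximately normal" claim (skewness near zero indicates a roughly symmetric distribution):
# skewness of price before and after replacing extreme values with the median
print('skew before:', df['price'].skew())
print('skew after :', data['price'].skew())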
data['sqfeet'].nlargest(200)
deal_with_outliers('sqfeet',5000)
sns.distplot(df['sqfeet'])
sns.distplot(data['sqfeet'])
#now for beds
deal_with_outliers('beds',999)
sns.boxplot(df['beds'])
sns.boxplot(data['beds'])
data['baths'].nlargest(50)
## before dealing with outliers
sns.boxplot(df['baths'])
## imputing your outliers
deal_with_outliers('baths',10)
## after dealing with outliers
sns.boxplot(data['baths'])
# now plotting the distribution of each feature
for feature in imp_features:
    plt.figure()  # in this case, we have to create the figure first and then draw the distribution
    sns.distplot(data[feature])
plt.figure(figsize=(12,8))
sns.distplot(data[data['dogs_allowed']==0]['price'],hist=False,label="Price where dogs are not allowed")
sns.distplot(data[data['dogs_allowed']==1]['price'],hist=False,label="Price where dogs are allowed")
plt.legend()
plt.title("Price Distribution")
(data['dogs_allowed'] == 0)
def price_distribution(feature, label):
    plt.figure(figsize=(12,8))
    sns.distplot(data[data[feature]==0]['price'],hist=False,label="Price where {} are not allowed".format(label))
    sns.distplot(data[data[feature]==1]['price'],hist=False,label="Price where {} are allowed".format(label))
    plt.legend()
    plt.title("Price Distribution")
price_distribution('cats_allowed','pets')
df.columns
df['electric_vehicle_charge'].unique()
price_distribution('comes_furnished','furnished units')
price_distribution('electric_vehicle_charge','electric_vehicle_charge')
import plotly.express as px
fig=px.scatter(data, x="price", y="sqfeet")
fig.show()
#### There is a complex relationship between price & sqfeet, which means a plain linear regression will not perform well
### use ensemble algorithms to predict price; they will definitely perform better (see the sketch below)
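To illustrate that point, a rough sketch that fits a linear model and a random forest on sqfeet alone and compares their R² scores (a 20k-row subsample keeps it quick; this is an illustration, not a final model):
# sketch: price ~ sqfeet, linear model vs random forest (illustration only)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
sample_xy = data[['sqfeet', 'price']].dropna().sample(20000, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(sample_xy[['sqfeet']], sample_xy['price'],
                                      test_size=0.2, random_state=0)
for m in (LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)):
    m.fit(Xtr, ytr)
    print(type(m).__name__, r2_score(yte, m.predict(Xte)))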
data.corr()
## highlighting results
data.corr().style.background_gradient(cmap='Reds')
## Highly correlated pairs:
## sqfeet -- beds
## sqfeet -- baths
## dogs_allowed -- cats_allowed
## this means we can drop beds, baths, and cats_allowed (a programmatic check follows below)
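The same observation can be made programmatically; a small sketch that lists numeric feature pairs whose absolute correlation exceeds an arbitrary threshold of 0.6:
# print feature pairs with |correlation| above the threshold (0.6 chosen only for illustration)
corr = data.corr()
threshold = 0.6
for i, col in enumerate(corr.columns):
    for other in corr.columns[i + 1:]:
        if abs(corr.loc[col, other]) > threshold:
            print(col, other, round(corr.loc[col, other], 2))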
data.columns
dataframe=data.copy()
dataframe.drop(['id','url','region_url','beds','baths','cats_allowed','image_url','description','lat','long'],axis=1,inplace=True)
dataframe.dtypes
cat_features=[feature for feature in dataframe.columns if dataframe[feature].dtype=='O']
cat_features
for feature in cat_features:
    print('total distinct values in {}: {}'.format(feature, len(dataframe[feature].unique())))
dataframe.shape
region_count=dataframe['region'].value_counts()
region_count
pd.set_option('display.max_rows',298)
region_count=dataframe['region'].value_counts()
region_count
len(region_count[region_count>500])
important=region_count[region_count>500].index
important
def remove(x):
    if x not in important:
        return 'other'
    else:
        return x
dataframe['region'].tail(100)
dataframe['region']=dataframe['region'].apply(remove)
dataframe['region'].tail(50)
len(dataframe['region'].unique())
def get_stats(feature):
    count = dataframe[feature].value_counts()
    pd.set_option('display.max_rows', df[feature].nunique())
    return count
get_stats('state')
dataframe.shape
def extract_imp_sub_categories(feature, threshold):
    count = dataframe[feature].value_counts()
    important = count[count > threshold].index
    return important
sub_cat=extract_imp_sub_categories('state',2000)
sub_cat
dataframe['state']=dataframe['state'].apply(lambda x:'other' if x not in sub_cat else x)
dataframe['state'].nunique()
get_stats('type')
imp2=extract_imp_sub_categories('type',3000)
imp2
dataframe['type']=dataframe['type'].apply(lambda x:'other' if x not in imp2 else x)
for feature in cat_features:
    print('total distinct values in {}: {}'.format(feature, len(dataframe[feature].unique())))
## frequency-encode region: map each region to its count in the data
dictionary=dict(dataframe['region'].value_counts())
dictionary
dataframe['region']=dataframe['region'].map(dictionary)
dataframe['region']
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for feature in ['laundry_options','parking_options','type']:
    dataframe[feature] = le.fit_transform(dataframe[feature])
!pip install category_encoders
from category_encoders import CountEncoder
dataframe['state'].head()
pd.value_counts(dataframe['state'])
pd.set_option('display.max_rows',33)
pd.value_counts(dataframe['state'])
ce=CountEncoder()
dataframe['state']=ce.fit_transform(dataframe['state'])
dataframe['state'].head()
dataframe.dtypes
dataframe.head()
y=dataframe['price']
x=dataframe.drop('price',axis=1)
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
pca_fit = pca.fit_transform(x[:1000])
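It is worth checking how much variance those five components actually retain; PCA is also scale-sensitive, so standardizing the features first is generally recommended. A hedged sketch on the same 1000-row subset:
# variance explained by the 5 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
# PCA is scale-sensitive; a standardized variant is often preferable (sketch)
from sklearn.preprocessing import StandardScaler
x_scaled = StandardScaler().fit_transform(x[:1000])
pca_scaled = PCA(n_components=5).fit(x_scaled)
print(pca_scaled.explained_variance_ratio_.sum())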
#split dataset into train and test
from sklearn.model_selection import train_test_split
#x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8,random_state=0)
x_train, x_test, y_train, y_test = train_test_split(pca_fit, y[:1000], test_size=0.2)
from sklearn.tree import DecisionTreeRegressor
dt=DecisionTreeRegressor(random_state=0)
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)
y_pred
# evaluate how well our model performs
from sklearn.metrics import r2_score
r2=r2_score(y_test,y_pred)
r2
#fit Regression models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
### regression models
models = []
models.append(('LinearRegression', LinearRegression()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('Decision Tree', DecisionTreeRegressor()))
models.append(('KNN', KNeighborsRegressor(n_neighbors = 5)))
models
from sklearn.metrics import r2_score
for name, model in models:
    print(name)
    model.fit(x_train, y_train)
    # Make predictions.
    predictions = model.predict(x_test)
    # Compute the R^2 score (true values first, then predictions).
    print(r2_score(y_test, predictions))
    print('\n')
## RF Performs best
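Since a single train/test split can be noisy, a cross-validated comparison gives a more reliable ranking; a quick sketch using cross_val_score on the same PCA-reduced subset:
# cross-validated R^2 for each model (3 folds to keep it quick)
from sklearn.model_selection import cross_val_score
for name, model in models:
    scores = cross_val_score(model, pca_fit, y[:1000], cv=3, scoring='r2')
    print(name, scores.mean())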
dataframe.shape
final=dataframe[0:10000]
dep=final['price']
ind=final.drop('price',axis=1)
#split dataset into train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(ind,dep,train_size=0.8,random_state=0)
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#Randomized Search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 4)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 3)]
# Minimum number of samples required to split a node
min_samples_split = [ 5, 15, 100]
# Minimum number of samples required at each leaf node
#min_samples_leaf = [1, 5, 10]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split}
random_grid
4*2*3*3  # total number of parameter combinations in the grid
reg_rf=RandomForestRegressor()
# Random search of parameters, using 3-fold cross-validation;
# RandomizedSearchCV samples only a subset of the combinations (n_iter defaults to 10)
rf_random = RandomizedSearchCV(estimator = reg_rf, param_distributions = random_grid, cv = 3, verbose=2, n_jobs = -1)
rf_random.fit(x_train,y_train)
rf_random.best_params_
prediction = rf_random.predict(x_test)
sns.distplot(y_test-prediction)
r2_score(y_test,prediction)
# Save the trained model as a pickle file.
import pickle
with open('drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/Models/Apps.pickle', 'wb') as f:
    pickle.dump(rf_random, f)
# with open('drive/MyDrive/Colab Notebooks/GoogleAppRating_Prediction/Models/AppsPCA.pickle', 'wb') as f:
#     pickle.dump(pca, f)
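To reuse the saved model later, it can be loaded back with pickle.load; a minimal sketch using the same path as above:
# reload the pickled search object and sanity-check a few predictions
with open('drive/MyDrive/Colab Notebooks/ApartmentRent_Prediction/Models/Apps.pickle', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict(x_test[:5]))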
dir(rf_random)
rf_random.best_estimator_.feature_importances_