Data Analytics Program:

Build a simple linear regression model for Salary Prediction from years of experience.(Download Salary dataset from kaggle). Find the accuracy of the model.

In [ ]:

In [61]:

​
import pandas as pd
​

In [62]:

​
data = pd.read_csv("Salary.csv")
data
​

Out[62]:

	experience	salary
0	1	100
1	2	200
2	3	300
3	4	405
4	5	499
5	7	700
6	8	800
7	9	900

In [63]:

​
data.isnull()
​

Out[63]:

	experience	salary
0	False	False
1	False	False
2	False	False
3	False	False
4	False	False
5	False	False
6	False	False
7	False	False

In [64]:

​
data.describe()
​

Out[64]:

	experience	salary
count	8.000000	8.00000
mean	4.875000	488.00000
std	2.900123	289.79648
min	1.000000	100.00000
25%	2.750000	275.00000
50%	4.500000	452.00000
75%	7.250000	725.00000
max	9.000000	900.00000

In [65]:

​
data.head()
​

Out[65]:

	experience	salary
0	1	100
1	2	200
2	3	300
3	4	405
4	5	499

In [66]:

​
data.tail()
​

Out[66]:

	experience	salary
3	4	405
4	5	499
5	7	700
6	8	800
7	9	900

In [67]:

​
data.sample()
​

Out[67]:

	experience	salary
3	4	405

In [68]:

​
data.dtypes
​

Out[68]:

experience    int64
salary        int64
dtype: object

In [69]:

​
corr = data.corr()
corr
corr.style.background_gradient(cmap='rainbow')

Out[69]:

	experience	salary
experience	1.000000	0.999980
salary	0.999980	1.000000

In [70]:

y = data['salary']
x = data.drop(['salary'],axis = 1)
y

Out[70]:

0    100
1    200
2    300
3    405
4    499
5    700
6    800
7    900
Name: salary, dtype: int64

In [71]:

​
x
​

Out[71]:

	experience
0	1
1	2
2	3
3	4
4	5
5	7
6	8
7	9

In [72]:

​
import matplotlib.pyplot as plt
​

In [73]:

plt.scatter(x,y)
plt.title("Salary vs Experience (Training Dataset)") 
plt.xlabel(" Experience in Year") 
plt.ylabel("Salary(In Rupees)") 
plt.show()
​

In [74]:

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)

In [75]:

​
print(x_train)
​

   experience
4           5
7           9
0           1
3           4
5           7
1           2

In [76]:

​
print(y_train)
​

4    499
7    900
0    100
3    405
5    700
1    200
Name: salary, dtype: int64

In [77]:

​
print(x_test.shape)
​

(2, 1)

In [78]:

​
from sklearn.linear_model import LinearRegression
slr=LinearRegression()
​

In [79]:

​
slr.fit(x_train,y_train)
​

Out[79]:

LinearRegression()

In [80]:

​
y_pred=slr.predict(x_test)
y_pred
​

Out[80]:

array([300.80147059, 800.39705882])

In [81]:

df=pd.DataFrame({'Actual':y_test,'predicted':y_pred})
df

Out[81]:

	Actual	predicted
2	300	300.801471
6	800	800.397059

In [82]:

​
print(slr.predict([[10]]))
​

[1000.23529412]

In [83]:

plt.scatter(x_test, y_test, color="green") 
plt.plot(x_test, y_pred, color="red") 
plt.title("Salary vs Experience (Testing Dataset)") 
plt.xlabel("Years of Experience") 
plt.ylabel("Salary(In Rupees)") 
plt.show() 

In [ ]:

In [ ]:

In [ ]:

In [ ]:

In [ ]:

Q2. Use the house price prediction dataset to build a multiple linear regression model for predicting purchases. Identify independent and target variable. Split the variables into training and testing sets and print them. Find the accuracy of the model.

In [1]:

​
import pandas as pd
​

In [88]:

​
data = pd.read_csv("HousePrice.csv")
data
​

Out[88]:

	area	bedrooms	age	price
0	1000	3.0	20	550000
1	1004	4.0	21	56000
2	1200	5.0	22	60000
3	1300	NaN	23	70000
4	2000	6.0	30	80000
5	3000	8.0	34	90000
6	4000	9.0	40	99000

In [89]:

​
data.describe
​

Out[89]:

<bound method NDFrame.describe of    area  bedrooms  age   price
0  1000       3.0   20  550000
1  1004       4.0   21   56000
2  1200       5.0   22   60000
3  1300       NaN   23   70000
4  2000       6.0   30   80000
5  3000       8.0   34   90000
6  4000       9.0   40   99000>

In [90]:

data.dtypes

Out[90]:

area          int64
bedrooms    float64
age           int64
price         int64
dtype: object

In [91]:

​
data.head()
​

Out[91]:

	area	bedrooms	age	price
0	1000	3.0	20	550000
1	1004	4.0	21	56000
2	1200	5.0	22	60000
3	1300	NaN	23	70000
4	2000	6.0	30	80000

In [92]:

​
data.isna()
​

Out[92]:

	area	bedrooms	age	price
0	False	False	False	False
1	False	False	False	False
2	False	False	False	False
3	False	True	False	False
4	False	False	False	False
5	False	False	False	False
6	False	False	False	False

In [93]:

​
data.isna().sum()
​

Out[93]:

area        0
bedrooms    1
age         0
price       0
dtype: int64

In [94]:

​
data['bedrooms'] = data['bedrooms'].fillna(data['bedrooms'].mean())
data
​

Out[94]:

	area	bedrooms	age	price
0	1000	3.000000	20	550000
1	1004	4.000000	21	56000
2	1200	5.000000	22	60000
3	1300	5.833333	23	70000
4	2000	6.000000	30	80000
5	3000	8.000000	34	90000
6	4000	9.000000	40	99000

In [95]:

​
from sklearn.linear_model import LinearRegression
​

In [96]:

#  Set dependent and independent variables
y = data['price']
x = data.drop(['price'], axis=1)
x

Out[96]:

	area	bedrooms	age
0	1000	3.000000	20
1	1004	4.000000	21
2	1200	5.000000	22
3	1300	5.833333	23
4	2000	6.000000	30
5	3000	8.000000	34
6	4000	9.000000	40

In [97]:

​
y
​

Out[97]:

0    550000
1     56000
2     60000
3     70000
4     80000
5     90000
6     99000
Name: price, dtype: int64

In [98]:

​
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)

In [99]:

​
x_train
​

Out[99]:

	area	bedrooms	age
1	1004	4.000000	21
4	2000	6.000000	30
2	1200	5.000000	22
0	1000	3.000000	20
3	1300	5.833333	23

In [100]:

​
y_train
​

Out[100]:

1     56000
4     80000
2     60000
0    550000
3     70000
Name: price, dtype: int64

In [101]:

​
print(x_test.shape)
​

(2, 3)

In [102]:

​
from sklearn.linear_model import LinearRegression
slr=LinearRegression()
​

In [103]:

​
slr.fit(x_train,y_train)
​

Out[103]:

LinearRegression()

In [104]:

​
y_pred=slr.predict(x_test)
y_pred

Out[104]:

array([2578443.38316721, 1507365.03322259])

In [105]:

​
df=pd.DataFrame({'Actual':y_test,'predicted':y_pred})
df
​

Out[105]:

	Actual	predicted
6	99000	2.578443e+06
5	90000	1.507365e+06

In [ ]:

​
slr.predict([[10]])
​

In [ ]:

Create ‘User’ Data set having 5 columns namely: User ID, Gender, Age, EstimatedSalary and Purchased. Build a logistic regression model that can predict whether on the given parameter a person will buy a car or not. Display confusion matrix.

In [20]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

In [21]:

df = pd.read_csv("car.csv")
df

Out[21]:

	Gender	Age	Salary	Purchased
0	male	44	72000	No
1	male	27	48000	Yes
2	female	30	54000	No
3	male	38	61000	No
4	female	40	30000	Yes
5	female	35	58000	Yes
6	male	37	52000	No
7	male	48	79000	Yes
8	male	50	83000	No
9	female	37	67000	Yes

In [22]:

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Gender']=le.fit_transform(df['Gender'])
df

Out[22]:

	Gender	Age	Salary	Purchased
0	1	44	72000	No
1	1	27	48000	Yes
2	0	30	54000	No
3	1	38	61000	No
4	0	40	30000	Yes
5	0	35	58000	Yes
6	1	37	52000	No
7	1	48	79000	Yes
8	1	50	83000	No
9	0	37	67000	Yes

In [23]:

df['Purchased']=le.fit_transform(df['Purchased'])
df

Out[23]:

	Gender	Age	Salary	Purchased
0	1	44	72000	0
1	1	27	48000	1
2	0	30	54000	0
3	1	38	61000	0
4	0	40	30000	1
5	0	35	58000	1
6	1	37	52000	0
7	1	48	79000	1
8	1	50	83000	0
9	0	37	67000	1

In [24]:

​
df.notnull()
​

Out[24]:

	Gender	Age	Salary	Purchased
0	True	True	True	True
1	True	True	True	True
2	True	True	True	True
3	True	True	True	True
4	True	True	True	True
5	True	True	True	True
6	True	True	True	True
7	True	True	True	True
8	True	True	True	True
9	True	True	True	True

In [25]:

x = df.iloc[:, [0,1,2]].values
y = df.iloc[:, 3].values
​

In [26]:

​
x
​

Out[26]:

array([[    1,    44, 72000],
       [    1,    27, 48000],
       [    0,    30, 54000],
       [    1,    38, 61000],
       [    0,    40, 30000],
       [    0,    35, 58000],
       [    1,    37, 52000],
       [    1,    48, 79000],
       [    1,    50, 83000],
       [    0,    37, 67000]], dtype=int64)

In [27]:

​
y
​

Out[27]:

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [28]:

​
# Splitting the dataset into training and test set. 
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.50, random_state=
0) 

In [29]:

#feature Scaling 
from sklearn.preprocessing import StandardScaler 
st_x= StandardScaler() 
x_train= st_x.fit_transform(x_train) 
x_test= st_x.fit_transform(x_test) 
x_test
x_train
​

Out[29]:

array([[ 0.5       , -0.70226353, -1.26927717],
       [ 0.5       ,  1.56976553,  1.49447151],
       [ 0.5       , -0.49571543, -0.34802761],
       [ 0.5       ,  0.74357315,  0.77794407],
       [-2.        , -1.11535972, -0.6551108 ]])

In [30]:

#Fitting Logistic Regression to the training set 
from sklearn.linear_model import LogisticRegression 
classifier= LogisticRegression(random_state=10) 
classifier.fit(x_train, y_train)
​

Out[30]:

LogisticRegression(random_state=10)

In [31]:

#Predicting the test set result 
y_pred= classifier.predict(x_test) 
y_pred
​
​

Out[31]:

array([0, 0, 0, 1, 0])

In [39]:

#Creating the Confusion matrix 
from sklearn.metrics import confusion_matrix 
import seaborn as sn
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
​

Out[39]:

<AxesSubplot:xlabel='Predicted', ylabel='Actual'>

In [51]:

confusion_matrix
from sklearn.metrics import confusion_matrix 
​

In [ ]:

​
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
​

In [ ]:

​
Accuracy: 0.6
    

In [ ]:

Frequent itemset and Association rule mining

#dataset creation in python transactions = [['bread','milk'],['bread', 'diaper''milk','eggs'],['milk', 'diaper','beer','coke'], ['bread','milk','diaper','beer'], ['bread','milk','diaper','coke']]

In [4]:

​
transactions
​

Out[4]:

[['bread', 'milk'],
 ['bread', 'diapermilk', 'eggs'],
 ['milk', 'diaper', 'beer', 'coke'],
 ['bread', 'milk', 'diaper', 'beer'],
 ['bread', 'milk', 'diaper', 'coke']]

In [ ]:

​
pip install mlxtend
​

In [ ]:

​
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
​

In [ ]:

from mlxtend.preprocessing import TransactionEncoder
te=TransactionEncoder()
te_array=te.fit(transactions).transform(transactions)
df=pd.DataFrame(te_array, columns=te.columns_)
df
​

In [ ]:

beer bread coke diaper diapermilk eggs milk
False True False False False False True
False True False False True True False
True False True True False False True
True True False True False False True
False True True True False False True

In [ ]:

​
from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)
​

In [ ]:

beer bread coke diaper diapermilk eggs milk
0 1 0 0 0 0 1
0 1 0 0 1 1 0
1 0 1 1 0 0 1
1 1 0 1 0 0 1
0 1 1 1 0 0 1
​

In [ ]:

​
Find the frequent itemsets Generate frequent itemsets that have a support value of at least 50%. By default,
apriori returns the column indices of the items, For better readability, we can set use_colnames=True to
convert these integer values into the respective item names:
    

In [ ]:

​
freq_items = apriori(df, min_support = 0.5, use_colnames = True)
print(freq_items)
​

In [ ]:

​
support itemsets
0 0.8 (bread)
1 0.6 (diaper)
2 0.8 (milk)
3 0.6 (bread, milk)
4 0.6 (diaper, milk)
​

In [ ]:

​
Generate the association rules Generate association rules that have a support value of at least 5%
​

In [ ]:

​
rules = association_rules(freq_items, metric ='support', min_threshold=0.05)
#rules = rules.sort_values(['support', 'confidence'], ascending =[False,False])
print(rules)
In [ ]:
antecedents consequents antecedent support consequent support support
\
0 (bread) (milk) 0.8 0.8 0.6
1 (milk) (bread) 0.8 0.8 0.6
2 (diaper) (milk) 0.6 0.8 0.6
3 (milk) (diaper) 0.8 0.6 0.6
 confidence lift leverage conviction 
0 0.75 0.9375 -0.04 0.8 
1 0.75 0.9375 -0.04 0.8 
2 1.00 1.2500 0.12 inf 
3 0.75 1.2500 0.12 1.6

In [ ]:

Search This Blog

Musicworld

Data Analytics Program

Q2. Use the house price prediction dataset to build a multiple linear regression model for predicting purchases. Identify independent and target variable. Split the variables into training and testing sets and print them. Find the accuracy of the model.

Create ‘User’ Data set having 5 columns namely: User ID, Gender, Age, EstimatedSalary and Purchased. Build a logistic regression model that can predict whether on the given parameter a person will buy a car or not. Display confusion matrix.

Frequent itemset and Association rule mining

Comments

Post a Comment

Popular posts from this blog

Full Stack Developement Practical Slips Programs

Practical slips programs : Machine Learning

Some slips for full stacks