Data Analytics Program
- Get link
- X
- Other Apps
Data Analytics Program:
Build a simple linear regression model for Salary Prediction from years of experience.(Download Salary dataset from kaggle). Find the accuracy of the model.
In [ ]:
1
In [61]:
1
2
import pandas as pd3
In [62]:
1
2
data = pd.read_csv("Salary.csv")3
data4
Out[62]:
| experience | salary | |
|---|---|---|
| 0 | 1 | 100 |
| 1 | 2 | 200 |
| 2 | 3 | 300 |
| 3 | 4 | 405 |
| 4 | 5 | 499 |
| 5 | 7 | 700 |
| 6 | 8 | 800 |
| 7 | 9 | 900 |
In [63]:
1
2
data.isnull()3
Out[63]:
| experience | salary | |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | False |
| 5 | False | False |
| 6 | False | False |
| 7 | False | False |
In [64]:
1
2
data.describe()3
Out[64]:
| experience | salary | |
|---|---|---|
| count | 8.000000 | 8.00000 |
| mean | 4.875000 | 488.00000 |
| std | 2.900123 | 289.79648 |
| min | 1.000000 | 100.00000 |
| 25% | 2.750000 | 275.00000 |
| 50% | 4.500000 | 452.00000 |
| 75% | 7.250000 | 725.00000 |
| max | 9.000000 | 900.00000 |
In [65]:
1
2
data.head()3
Out[65]:
| experience | salary | |
|---|---|---|
| 0 | 1 | 100 |
| 1 | 2 | 200 |
| 2 | 3 | 300 |
| 3 | 4 | 405 |
| 4 | 5 | 499 |
In [66]:
1
2
data.tail()3
Out[66]:
| experience | salary | |
|---|---|---|
| 3 | 4 | 405 |
| 4 | 5 | 499 |
| 5 | 7 | 700 |
| 6 | 8 | 800 |
| 7 | 9 | 900 |
In [67]:
1
2
data.sample()3
Out[67]:
| experience | salary | |
|---|---|---|
| 3 | 4 | 405 |
In [68]:
1
2
data.dtypes3
Out[68]:
experience int64 salary int64 dtype: object
In [69]:
1
2
corr = data.corr()3
corr4
corr.style.background_gradient(cmap='rainbow')Out[69]:
| experience | salary | |
|---|---|---|
| experience | 1.000000 | 0.999980 |
| salary | 0.999980 | 1.000000 |
In [70]:
1
y = data['salary']2
x = data.drop(['salary'],axis = 1)3
yOut[70]:
0 100 1 200 2 300 3 405 4 499 5 700 6 800 7 900 Name: salary, dtype: int64
In [71]:
1
2
x3
Out[71]:
| experience | |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 7 |
| 6 | 8 |
| 7 | 9 |
In [72]:
1
2
import matplotlib.pyplot as plt3
In [73]:
1
plt.scatter(x,y)2
plt.title("Salary vs Experience (Training Dataset)") 3
plt.xlabel(" Experience in Year") 4
plt.ylabel("Salary(In Rupees)") 5
plt.show()6
In [74]:
1
from sklearn.model_selection import train_test_split2
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)In [75]:
1
2
print(x_train)3
experience 4 5 7 9 0 1 3 4 5 7 1 2
In [76]:
1
2
print(y_train)3
4 499 7 900 0 100 3 405 5 700 1 200 Name: salary, dtype: int64
In [77]:
1
2
print(x_test.shape)3
(2, 1)
In [78]:
1
2
from sklearn.linear_model import LinearRegression3
slr=LinearRegression()4
In [79]:
1
2
slr.fit(x_train,y_train)3
Out[79]:
LinearRegression()
In [80]:
1
2
y_pred=slr.predict(x_test)3
y_pred4
Out[80]:
array([300.80147059, 800.39705882])
In [81]:
1
df=pd.DataFrame({'Actual':y_test,'predicted':y_pred})2
dfOut[81]:
| Actual | predicted | |
|---|---|---|
| 2 | 300 | 300.801471 |
| 6 | 800 | 800.397059 |
In [82]:
1
2
print(slr.predict([[10]]))3
[1000.23529412]
In [83]:
1
plt.scatter(x_test, y_test, color="green") 2
plt.plot(x_test, y_pred, color="red") 3
plt.title("Salary vs Experience (Testing Dataset)") 4
plt.xlabel("Years of Experience") 5
plt.ylabel("Salary(In Rupees)") 6
plt.show() In [ ]:
1
In [ ]:
1
In [ ]:
1
In [ ]:
1
In [ ]:
1
Q2. Use the house price prediction dataset to build a multiple linear regression model for predicting purchases. Identify independent and target variable. Split the variables into training and testing sets and print them. Find the accuracy of the model.
In [1]:12import pandas as pd3In [88]:12data = pd.read_csv("HousePrice.csv")3data4Out[88]:area bedrooms age price 0 1000 3.0 20 550000 1 1004 4.0 21 56000 2 1200 5.0 22 60000 3 1300 NaN 23 70000 4 2000 6.0 30 80000 5 3000 8.0 34 90000 6 4000 9.0 40 99000
In [89]:12data.describe3Out[89]:<bound method NDFrame.describe of area bedrooms age price
0 1000 3.0 20 550000
1 1004 4.0 21 56000
2 1200 5.0 22 60000
3 1300 NaN 23 70000
4 2000 6.0 30 80000
5 3000 8.0 34 90000
6 4000 9.0 40 99000>
In [90]:1data.dtypesOut[90]:area int64
bedrooms float64
age int64
price int64
dtype: object
In [91]:12data.head()3Out[91]:area bedrooms age price 0 1000 3.0 20 550000 1 1004 4.0 21 56000 2 1200 5.0 22 60000 3 1300 NaN 23 70000 4 2000 6.0 30 80000
In [92]:12data.isna()3Out[92]:area bedrooms age price 0 False False False False 1 False False False False 2 False False False False 3 False True False False 4 False False False False 5 False False False False 6 False False False False
In [93]:12data.isna().sum()3Out[93]:area 0
bedrooms 1
age 0
price 0
dtype: int64
In [94]:12data['bedrooms'] = data['bedrooms'].fillna(data['bedrooms'].mean())3data4Out[94]:area bedrooms age price 0 1000 3.000000 20 550000 1 1004 4.000000 21 56000 2 1200 5.000000 22 60000 3 1300 5.833333 23 70000 4 2000 6.000000 30 80000 5 3000 8.000000 34 90000 6 4000 9.000000 40 99000
In [95]:12from sklearn.linear_model import LinearRegression3In [96]:1# Set dependent and independent variables2y = data['price']3x = data.drop(['price'], axis=1)4xOut[96]:area bedrooms age 0 1000 3.000000 20 1 1004 4.000000 21 2 1200 5.000000 22 3 1300 5.833333 23 4 2000 6.000000 30 5 3000 8.000000 34 6 4000 9.000000 40
In [97]:12y3Out[97]:0 550000
1 56000
2 60000
3 70000
4 80000
5 90000
6 99000
Name: price, dtype: int64
In [98]:12from sklearn.model_selection import train_test_split3x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)In [99]:12x_train3Out[99]:area bedrooms age 1 1004 4.000000 21 4 2000 6.000000 30 2 1200 5.000000 22 0 1000 3.000000 20 3 1300 5.833333 23
In [100]:12y_train3Out[100]:1 56000
4 80000
2 60000
0 550000
3 70000
Name: price, dtype: int64
In [101]:12print(x_test.shape)3(2, 3)
In [102]:12from sklearn.linear_model import LinearRegression3slr=LinearRegression()4In [103]:12slr.fit(x_train,y_train)3Out[103]:LinearRegression()
In [104]:12y_pred=slr.predict(x_test)3y_predOut[104]:array([2578443.38316721, 1507365.03322259])
In [105]:12df=pd.DataFrame({'Actual':y_test,'predicted':y_pred})3df4Out[105]:Actual predicted 6 99000 2.578443e+06 5 90000 1.507365e+06
In [ ]:12slr.predict([[10]])3In [ ]:
1
In [1]:
1
2
import pandas as pd3
In [88]:
1
2
data = pd.read_csv("HousePrice.csv")3
data4
Out[88]:
| area | bedrooms | age | price | |
|---|---|---|---|---|
| 0 | 1000 | 3.0 | 20 | 550000 |
| 1 | 1004 | 4.0 | 21 | 56000 |
| 2 | 1200 | 5.0 | 22 | 60000 |
| 3 | 1300 | NaN | 23 | 70000 |
| 4 | 2000 | 6.0 | 30 | 80000 |
| 5 | 3000 | 8.0 | 34 | 90000 |
| 6 | 4000 | 9.0 | 40 | 99000 |
In [89]:
1
2
data.describe3
Out[89]:
<bound method NDFrame.describe of area bedrooms age price 0 1000 3.0 20 550000 1 1004 4.0 21 56000 2 1200 5.0 22 60000 3 1300 NaN 23 70000 4 2000 6.0 30 80000 5 3000 8.0 34 90000 6 4000 9.0 40 99000>
In [90]:
1
data.dtypesOut[90]:
area int64 bedrooms float64 age int64 price int64 dtype: object
In [91]:
1
2
data.head()3
Out[91]:
| area | bedrooms | age | price | |
|---|---|---|---|---|
| 0 | 1000 | 3.0 | 20 | 550000 |
| 1 | 1004 | 4.0 | 21 | 56000 |
| 2 | 1200 | 5.0 | 22 | 60000 |
| 3 | 1300 | NaN | 23 | 70000 |
| 4 | 2000 | 6.0 | 30 | 80000 |
In [92]:
1
2
data.isna()3
Out[92]:
| area | bedrooms | age | price | |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | True | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
In [93]:
1
2
data.isna().sum()3
Out[93]:
area 0 bedrooms 1 age 0 price 0 dtype: int64
In [94]:
1
2
data['bedrooms'] = data['bedrooms'].fillna(data['bedrooms'].mean())3
data4
Out[94]:
| area | bedrooms | age | price | |
|---|---|---|---|---|
| 0 | 1000 | 3.000000 | 20 | 550000 |
| 1 | 1004 | 4.000000 | 21 | 56000 |
| 2 | 1200 | 5.000000 | 22 | 60000 |
| 3 | 1300 | 5.833333 | 23 | 70000 |
| 4 | 2000 | 6.000000 | 30 | 80000 |
| 5 | 3000 | 8.000000 | 34 | 90000 |
| 6 | 4000 | 9.000000 | 40 | 99000 |
In [95]:
1
2
from sklearn.linear_model import LinearRegression3
In [96]:
1
# Set dependent and independent variables2
y = data['price']3
x = data.drop(['price'], axis=1)4
xOut[96]:
| area | bedrooms | age | |
|---|---|---|---|
| 0 | 1000 | 3.000000 | 20 |
| 1 | 1004 | 4.000000 | 21 |
| 2 | 1200 | 5.000000 | 22 |
| 3 | 1300 | 5.833333 | 23 |
| 4 | 2000 | 6.000000 | 30 |
| 5 | 3000 | 8.000000 | 34 |
| 6 | 4000 | 9.000000 | 40 |
In [97]:
1
2
y3
Out[97]:
0 550000 1 56000 2 60000 3 70000 4 80000 5 90000 6 99000 Name: price, dtype: int64
In [98]:
1
2
from sklearn.model_selection import train_test_split3
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25)In [99]:
1
2
x_train3
Out[99]:
| area | bedrooms | age | |
|---|---|---|---|
| 1 | 1004 | 4.000000 | 21 |
| 4 | 2000 | 6.000000 | 30 |
| 2 | 1200 | 5.000000 | 22 |
| 0 | 1000 | 3.000000 | 20 |
| 3 | 1300 | 5.833333 | 23 |
In [100]:
1
2
y_train3
Out[100]:
1 56000 4 80000 2 60000 0 550000 3 70000 Name: price, dtype: int64
In [101]:
1
2
print(x_test.shape)3
(2, 3)
In [102]:
1
2
from sklearn.linear_model import LinearRegression3
slr=LinearRegression()4
In [103]:
1
2
slr.fit(x_train,y_train)3
Out[103]:
LinearRegression()
In [104]:
1
2
y_pred=slr.predict(x_test)3
y_predOut[104]:
array([2578443.38316721, 1507365.03322259])
In [105]:
1
2
df=pd.DataFrame({'Actual':y_test,'predicted':y_pred})3
df4
Out[105]:
| Actual | predicted | |
|---|---|---|
| 6 | 99000 | 2.578443e+06 |
| 5 | 90000 | 1.507365e+06 |
In [ ]:
1
2
slr.predict([[10]])3
In [ ]:
1
Create ‘User’ Data set having 5 columns namely: User ID, Gender, Age, EstimatedSalary and Purchased. Build a logistic regression model that can predict whether on the given parameter a person will buy a car or not. Display confusion matrix.
In [20]:
1import pandas as pd2import numpy as np3from sklearn.model_selection import train_test_split4from sklearn.linear_model import LogisticRegression5import matplotlib.pyplot as pltIn [21]:
1df = pd.read_csv("car.csv")2dfOut[21]:Gender Age Salary Purchased 0 male 44 72000 No 1 male 27 48000 Yes 2 female 30 54000 No 3 male 38 61000 No 4 female 40 30000 Yes 5 female 35 58000 Yes 6 male 37 52000 No 7 male 48 79000 Yes 8 male 50 83000 No 9 female 37 67000 Yes
In [22]:
1from sklearn.preprocessing import LabelEncoder2le=LabelEncoder()3df['Gender']=le.fit_transform(df['Gender'])4dfOut[22]:Gender Age Salary Purchased 0 1 44 72000 No 1 1 27 48000 Yes 2 0 30 54000 No 3 1 38 61000 No 4 0 40 30000 Yes 5 0 35 58000 Yes 6 1 37 52000 No 7 1 48 79000 Yes 8 1 50 83000 No 9 0 37 67000 Yes
In [23]:
1df['Purchased']=le.fit_transform(df['Purchased'])2dfOut[23]:Gender Age Salary Purchased 0 1 44 72000 0 1 1 27 48000 1 2 0 30 54000 0 3 1 38 61000 0 4 0 40 30000 1 5 0 35 58000 1 6 1 37 52000 0 7 1 48 79000 1 8 1 50 83000 0 9 0 37 67000 1
In [24]:12df.notnull()3Out[24]:Gender Age Salary Purchased 0 True True True True 1 True True True True 2 True True True True 3 True True True True 4 True True True True 5 True True True True 6 True True True True 7 True True True True 8 True True True True 9 True True True True
In [25]:1x = df.iloc[:, [0,1,2]].values2y = df.iloc[:, 3].values3In [26]:12x3Out[26]:array([[ 1, 44, 72000],
[ 1, 27, 48000],
[ 0, 30, 54000],
[ 1, 38, 61000],
[ 0, 40, 30000],
[ 0, 35, 58000],
[ 1, 37, 52000],
[ 1, 48, 79000],
[ 1, 50, 83000],
[ 0, 37, 67000]], dtype=int64)In [27]:12y3Out[27]:array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
In [28]:12# Splitting the dataset into training and test set. 3from sklearn.model_selection import train_test_split 4x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.50, random_state=50) In [29]:1#feature Scaling 2from sklearn.preprocessing import StandardScaler 3st_x= StandardScaler() 4x_train= st_x.fit_transform(x_train) 5x_test= st_x.fit_transform(x_test) 6x_test7x_train8Out[29]:array([[ 0.5 , -0.70226353, -1.26927717],
[ 0.5 , 1.56976553, 1.49447151],
[ 0.5 , -0.49571543, -0.34802761],
[ 0.5 , 0.74357315, 0.77794407],
[-2. , -1.11535972, -0.6551108 ]])In [30]:1#Fitting Logistic Regression to the training set 2from sklearn.linear_model import LogisticRegression 3classifier= LogisticRegression(random_state=10) 4classifier.fit(x_train, y_train)5Out[30]:LogisticRegression(random_state=10)
In [31]:
1#Predicting the test set result 2y_pred= classifier.predict(x_test) 3y_pred45Out[31]:array([0, 0, 0, 1, 0])
In [39]:
1#Creating the Confusion matrix 2from sklearn.metrics import confusion_matrix 3import seaborn as sn4confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])5sn.heatmap(confusion_matrix, annot=True)6Out[39]:<AxesSubplot:xlabel='Predicted', ylabel='Actual'>
In [51]:
1confusion_matrix2from sklearn.metrics import confusion_matrix 3In [ ]:12print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))3In [ ]:12Accuracy: 0.63 In [ ]:
1
In [20]:
1
import pandas as pd2
import numpy as np3
from sklearn.model_selection import train_test_split4
from sklearn.linear_model import LogisticRegression5
import matplotlib.pyplot as pltIn [21]:
1
df = pd.read_csv("car.csv")2
dfOut[21]:
| Gender | Age | Salary | Purchased | |
|---|---|---|---|---|
| 0 | male | 44 | 72000 | No |
| 1 | male | 27 | 48000 | Yes |
| 2 | female | 30 | 54000 | No |
| 3 | male | 38 | 61000 | No |
| 4 | female | 40 | 30000 | Yes |
| 5 | female | 35 | 58000 | Yes |
| 6 | male | 37 | 52000 | No |
| 7 | male | 48 | 79000 | Yes |
| 8 | male | 50 | 83000 | No |
| 9 | female | 37 | 67000 | Yes |
In [22]:
1
from sklearn.preprocessing import LabelEncoder2
le=LabelEncoder()3
df['Gender']=le.fit_transform(df['Gender'])4
dfOut[22]:
| Gender | Age | Salary | Purchased | |
|---|---|---|---|---|
| 0 | 1 | 44 | 72000 | No |
| 1 | 1 | 27 | 48000 | Yes |
| 2 | 0 | 30 | 54000 | No |
| 3 | 1 | 38 | 61000 | No |
| 4 | 0 | 40 | 30000 | Yes |
| 5 | 0 | 35 | 58000 | Yes |
| 6 | 1 | 37 | 52000 | No |
| 7 | 1 | 48 | 79000 | Yes |
| 8 | 1 | 50 | 83000 | No |
| 9 | 0 | 37 | 67000 | Yes |
In [23]:
1
df['Purchased']=le.fit_transform(df['Purchased'])2
dfOut[23]:
| Gender | Age | Salary | Purchased | |
|---|---|---|---|---|
| 0 | 1 | 44 | 72000 | 0 |
| 1 | 1 | 27 | 48000 | 1 |
| 2 | 0 | 30 | 54000 | 0 |
| 3 | 1 | 38 | 61000 | 0 |
| 4 | 0 | 40 | 30000 | 1 |
| 5 | 0 | 35 | 58000 | 1 |
| 6 | 1 | 37 | 52000 | 0 |
| 7 | 1 | 48 | 79000 | 1 |
| 8 | 1 | 50 | 83000 | 0 |
| 9 | 0 | 37 | 67000 | 1 |
In [24]:
1
2
df.notnull()3
Out[24]:
| Gender | Age | Salary | Purchased | |
|---|---|---|---|---|
| 0 | True | True | True | True |
| 1 | True | True | True | True |
| 2 | True | True | True | True |
| 3 | True | True | True | True |
| 4 | True | True | True | True |
| 5 | True | True | True | True |
| 6 | True | True | True | True |
| 7 | True | True | True | True |
| 8 | True | True | True | True |
| 9 | True | True | True | True |
In [25]:
1
x = df.iloc[:, [0,1,2]].values2
y = df.iloc[:, 3].values3
In [26]:
1
2
x3
Out[26]:
array([[ 1, 44, 72000],
[ 1, 27, 48000],
[ 0, 30, 54000],
[ 1, 38, 61000],
[ 0, 40, 30000],
[ 0, 35, 58000],
[ 1, 37, 52000],
[ 1, 48, 79000],
[ 1, 50, 83000],
[ 0, 37, 67000]], dtype=int64)In [27]:
1
2
y3
Out[27]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
In [28]:
1
2
# Splitting the dataset into training and test set. 3
from sklearn.model_selection import train_test_split 4
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.50, random_state=5
0) In [29]:
1
#feature Scaling 2
from sklearn.preprocessing import StandardScaler 3
st_x= StandardScaler() 4
x_train= st_x.fit_transform(x_train) 5
x_test= st_x.fit_transform(x_test) 6
x_test7
x_train8
Out[29]:
array([[ 0.5 , -0.70226353, -1.26927717],
[ 0.5 , 1.56976553, 1.49447151],
[ 0.5 , -0.49571543, -0.34802761],
[ 0.5 , 0.74357315, 0.77794407],
[-2. , -1.11535972, -0.6551108 ]])In [30]:
1
#Fitting Logistic Regression to the training set 2
from sklearn.linear_model import LogisticRegression 3
classifier= LogisticRegression(random_state=10) 4
classifier.fit(x_train, y_train)5
Out[30]:
LogisticRegression(random_state=10)
In [31]:
1
#Predicting the test set result 2
y_pred= classifier.predict(x_test) 3
y_pred4
5
Out[31]:
array([0, 0, 0, 1, 0])
In [39]:
1
#Creating the Confusion matrix 2
from sklearn.metrics import confusion_matrix 3
import seaborn as sn4
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])5
sn.heatmap(confusion_matrix, annot=True)6
Out[39]:
<AxesSubplot:xlabel='Predicted', ylabel='Actual'>
In [51]:
1
confusion_matrix2
from sklearn.metrics import confusion_matrix 3
In [ ]:
1
2
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))3
In [ ]:
1
2
Accuracy: 0.63
In [ ]:
1
Frequent itemset and Association rule mining
#dataset creation in python transactions = [['bread','milk'],['bread', 'diaper''milk','eggs'],['milk', 'diaper','beer','coke'], ['bread','milk','diaper','beer'], ['bread','milk','diaper','coke']]
In [4]:12transactions3Out[4]:[['bread', 'milk'],
['bread', 'diapermilk', 'eggs'],
['milk', 'diaper', 'beer', 'coke'],
['bread', 'milk', 'diaper', 'beer'],
['bread', 'milk', 'diaper', 'coke']]
In [ ]:12pip install mlxtend3In [ ]:12import pandas as pd3from mlxtend.frequent_patterns import apriori, association_rules4In [ ]:1from mlxtend.preprocessing import TransactionEncoder2te=TransactionEncoder()3te_array=te.fit(transactions).transform(transactions)4df=pd.DataFrame(te_array, columns=te.columns_)5df6In [ ]:
1beer bread coke diaper diapermilk eggs milk20 False True False False False False True31 False True False False True True False42 True False True True False False True53 True True False True False False True64 False True True True False False TrueIn [ ]:12from sklearn.preprocessing import LabelEncoder3df.apply(LabelEncoder().fit_transform)4In [ ]:1beer bread coke diaper diapermilk eggs milk20 0 1 0 0 0 0 131 0 1 0 0 1 1 042 1 0 1 1 0 0 153 1 1 0 1 0 0 164 0 1 1 1 0 0 17In [ ]:12Find the frequent itemsets Generate frequent itemsets that have a support value of at least 50%. By default,3apriori returns the column indices of the items, For better readability, we can set use_colnames=True to4convert these integer values into the respective item names:5 In [ ]:12freq_items = apriori(df, min_support = 0.5, use_colnames = True)3print(freq_items)4In [ ]:12support itemsets30 0.8 (bread)41 0.6 (diaper)52 0.8 (milk)63 0.6 (bread, milk)74 0.6 (diaper, milk)8In [ ]:12Generate the association rules Generate association rules that have a support value of at least 5%3In [ ]:12rules = association_rules(freq_items, metric ='support', min_threshold=0.05)3#rules = rules.sort_values(['support', 'confidence'], ascending =[False,False])4print(rules)5In [ ]:6antecedents consequents antecedent support consequent support support7\80 (bread) (milk) 0.8 0.8 0.691 (milk) (bread) 0.8 0.8 0.6102 (diaper) (milk) 0.6 0.8 0.6113 (milk) (diaper) 0.8 0.6 0.612 confidence lift leverage conviction 130 0.75 0.9375 -0.04 0.8 141 0.75 0.9375 -0.04 0.8 152 1.00 1.2500 0.12 inf 163 0.75 1.2500 0.12 1.6In [ ]:
1
#dataset creation in python transactions = [['bread','milk'],['bread', 'diaper''milk','eggs'],['milk', 'diaper','beer','coke'], ['bread','milk','diaper','beer'], ['bread','milk','diaper','coke']]
In [4]:
1
2
transactions3
Out[4]:
[['bread', 'milk'], ['bread', 'diapermilk', 'eggs'], ['milk', 'diaper', 'beer', 'coke'], ['bread', 'milk', 'diaper', 'beer'], ['bread', 'milk', 'diaper', 'coke']]
In [ ]:
1
2
pip install mlxtend3
In [ ]:
1
2
import pandas as pd3
from mlxtend.frequent_patterns import apriori, association_rules4
In [ ]:
1
from mlxtend.preprocessing import TransactionEncoder2
te=TransactionEncoder()3
te_array=te.fit(transactions).transform(transactions)4
df=pd.DataFrame(te_array, columns=te.columns_)5
df6
In [ ]:
1
beer bread coke diaper diapermilk eggs milk2
0 False True False False False False True3
1 False True False False True True False4
2 True False True True False False True5
3 True True False True False False True6
4 False True True True False False TrueIn [ ]:
1
2
from sklearn.preprocessing import LabelEncoder3
df.apply(LabelEncoder().fit_transform)4
In [ ]:
1
beer bread coke diaper diapermilk eggs milk2
0 0 1 0 0 0 0 13
1 0 1 0 0 1 1 04
2 1 0 1 1 0 0 15
3 1 1 0 1 0 0 16
4 0 1 1 1 0 0 17
In [ ]:
1
2
Find the frequent itemsets Generate frequent itemsets that have a support value of at least 50%. By default,3
apriori returns the column indices of the items, For better readability, we can set use_colnames=True to4
convert these integer values into the respective item names:5
In [ ]:
1
2
freq_items = apriori(df, min_support = 0.5, use_colnames = True)3
print(freq_items)4
In [ ]:
1
2
support itemsets3
0 0.8 (bread)4
1 0.6 (diaper)5
2 0.8 (milk)6
3 0.6 (bread, milk)7
4 0.6 (diaper, milk)8
In [ ]:
1
2
Generate the association rules Generate association rules that have a support value of at least 5%3
In [ ]:
1
2
rules = association_rules(freq_items, metric ='support', min_threshold=0.05)3
#rules = rules.sort_values(['support', 'confidence'], ascending =[False,False])4
print(rules)5
In [ ]:6
antecedents consequents antecedent support consequent support support7
\8
0 (bread) (milk) 0.8 0.8 0.69
1 (milk) (bread) 0.8 0.8 0.610
2 (diaper) (milk) 0.6 0.8 0.611
3 (milk) (diaper) 0.8 0.6 0.612
confidence lift leverage conviction 13
0 0.75 0.9375 -0.04 0.8 14
1 0.75 0.9375 -0.04 0.8 15
2 1.00 1.2500 0.12 inf 16
3 0.75 1.2500 0.12 1.6In [ ]:
1
z
z
- Get link
- X
- Other Apps
Comments
Post a Comment
hey