Evaluating Multivariate Data with Data Mining Methods
An Example of Using NumPy and Pandas to Load and Represent Data from the Web, and Scikit-Learn for Prediction
This notebook is under construction. Goals are:
* Collect data mining methods for data prediction
* Make understandable examples with some data
* Emphasize interesting code with Markdown
* Use dynamic code so that different data can be used instead of the examples, making it easier to save the code as classes and methods in a Python module in the future

Cells where the user can take different decisions are marked RED.
1. Compressive Strength of Concrete
Get Data from the Webpage
In [2]:
from sympy import init_printing
init_printing()
In [3]:
import csv
import urllib
import numpy as np
# Python 2: urllib.urlopen; in Python 3 this would be urllib.request.urlopen
get_data = urllib.urlopen('http://www.bliasoft.com/Documents/DataConcrete.txt')
Convert the decimal separator from comma to point
How many columns does your data have?
In [4]:
No_Col=9
Does the data use a comma as the decimal separator?
If the answer is "Yes":
In [5]:
conv = lambda valstr: float(valstr.replace(',','.'))
In [6]:
c = {}
for i in range(No_Col):
    c[i] = conv  # apply the comma-to-point converter to every column
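As a quick sanity check (the sample string '23,5' below is purely illustrative), each entry in c now converts a comma-decimal string into a float:
c[0]('23,5')  # -> 23.5: the comma is replaced by a point before float() is applied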
If the answer is "No": just skip the cell above.
Set the Delimiter of the Data
In [7]:
Data = np.genfromtxt(get_data, dtype=np.float64, delimiter='\t', skip_header=0, names=True, converters=c)
np.random.shuffle(Data)  # randomizes the data row-wise
In [8]:
Data
Out[8]:
Number of Data Records
In [9]:
Data.shape[0]
Out[9]:
In [10]:
Data=np.reshape(Data, (Data.shape[0],-1))
In [11]:
type(Data[0])
Out[11]:
Accessing columns by name is really intuitive
In [12]:
Data[:]['Compressive_strength']
Out[12]:
We split the dataset into 2 groups:
- 70% of the data for training
- 30% for evaluation/testing
In [13]:
test_ind=int(Data.shape[0]*0.3)
print("test_ind=%s" % test_ind)
train_ind=Data.shape[0]-test_ind
print('train_ind={0}'.format(train_ind))
instances=np.arange(Data.shape[0])
instances
Out[13]:
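As an alternative to the manual index arithmetic above, scikit-learn's train_test_split could produce a comparable 70/30 split of the row indices (a minimal sketch; the names train_rows and test_rows are just illustrative):
from sklearn.model_selection import train_test_split  # lives in sklearn.cross_validation in older versions

train_rows, test_rows = train_test_split(instances, test_size=0.3)  # random 70/30 split of the row indices
print(len(train_rows), len(test_rows))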
Determine which independent variables should explain the dependent variable
The first two selected columns will be used for the plots against the dependent variable
In [52]:
# SET THE INDICES OF THE INDEPENDENTS; you do not need to keep them in ascending order,
# but the first and second in this list will be plotted!
select_feat = [3, 7, 4]
select_out = [-1]  # SET THE COLUMN INDEX OF THE DEPENDENT
feat_col = []
for elem in select_feat:
    feat_col.append(Data.dtype.names[elem])
Let us look at the independent variables we will use to explain the dependent variable
In [53]:
feat_col
Out[53]:
Now we define the Arrays for our model
In [54]:
[int(x) for x in select_feat]
Out[54]:
In [55]:
[feat_col[x] for x in range(len(select_feat))]
Out[55]:
In [56]:
Data[0:2]
Out[56]:
If you choose one element, e.g. Data.dtype.names[0], it is fine, but if you choose more you have to convert the tuple to a list()
In [57]:
Data.dtype.names[0]
Out[57]:
In [58]:
list(Data.dtype.names[:])
Out[58]:
In [59]:
X_data=Data[feat_col]
Y_true=Data[Data.dtype.names[select_out[0]]] #Holds the dependent
In [60]:
X_data
Out[60]:
In [61]:
X_data[:][X_data.dtype.names[0]] #this hides our column names now, but we need another type of array to make our calculations
Out[61]:
In [62]:
#X_data[:][list(X_data.dtype.names[:])]
#this would not work to hide columns...
\[\widehat{Comp_{Strength}} = f(Concrete, Water, \ldots)\]
Extract the explaining variables without column names for further use
The following concatenates all the columns you chose as input for the model
In [63]:
build = X_data[feat_col[0]]
for i in range(0, len(feat_col) - 1, 1):
    build = np.concatenate((build, X_data[feat_col[i + 1]]), axis=1)
X_data = build
In [64]:
X_data #Columns are gone!
Out[64]:
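The same concatenation can be written more compactly with np.column_stack, which stacks the selected (n, 1) columns side by side (an equivalent sketch; X_alt is just an illustrative name):
import numpy as np

# stack the selected feature columns side by side: one column per entry in feat_col
X_alt = np.column_stack([Data[name] for name in feat_col])
X_alt.shape  # (number of rows, number of selected features)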
Data Mining Methods
Support Vector Regression
In [65]:
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=0.1)# free parameters can be optimized ** to do **
SVR_function=svr_rbf.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
y_train = SVR_function.predict(X_data[:train_ind])
#flatten because sklearn needs 1D for the Y
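Regarding the "free parameters can be optimized" note above: one common way to tune C and gamma is a cross-validated grid search. A minimal sketch, with a hand-picked (illustrative) parameter grid:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.svm import SVR

param_grid = {'C': [1e2, 1e3, 1e4, 1e5], 'gamma': [0.01, 0.1, 1.0]}  # illustrative values, adjust to your data
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
search.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
print(search.best_params_)  # best C/gamma combination found by cross-validation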
Training Performance
In [66]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))
plt.scatter(instances[:train_ind],y_train,label='Performance of SVR on Trained Data')
plt.scatter(instances[:train_ind],Y_true[:train_ind].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
#plt.scatter()
plt.legend(loc='best')
plt.show()
In [67]:
from sklearn.metrics import mean_squared_error,r2_score
R2_train=r2_score(Y_true[:train_ind].flatten(),y_train)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_train)
Test Performance
In [68]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))
y_test = SVR_function.predict(X_data[train_ind:])
plt.scatter(instances[train_ind:],y_test,label='Performance of SVR on \"New\" Data')
plt.scatter(instances[train_ind:],Y_true[train_ind:].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
plt.legend(loc='best')
plt.show()
In [69]:
from sklearn.metrics import mean_squared_error,r2_score
R2_test=r2_score(Y_true[train_ind:].flatten(),y_test)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_test)
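mean_squared_error is imported above but not used yet; the test error could also be reported as an RMSE in the units of the dependent variable (a small sketch reusing the arrays defined above):
import numpy as np
from sklearn.metrics import mean_squared_error

mse_test = mean_squared_error(Y_true[train_ind:].flatten(), y_test)
rmse_test = np.sqrt(mse_test)  # root mean squared error
print('RMSE_test=%s' % rmse_test)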
2 independent variables are not enough for a good prediction
The SVR we designed (only 2 variables) is a pretty bad estimator, as you can see from the coefficient of determination \(R^{2}\)
Long Expression: Takes the index of max in column
In [70]:
float(Data[Data[feat_col[0]].argmax()][feat_col[0]])#Data[ind]['Col']
Out[70]:
In [71]:
X_data_mean=np.mean(X_data,axis=0)
In [72]:
X_data_mean
Out[72]:
Determine Min and Max in the Data for the Plotting Ranges (this means no extrapolation in the plot)
In [74]:
# If you want to plot other variables than x0=feat_col[0] and x1=feat_col[1], change the sequence of feat_col at the top
max_x0=float(Data[Data[feat_col[0]].argmax()][feat_col[0]])
max_x1=float(Data[Data[feat_col[1]].argmax()][feat_col[1]])
min_x0=float(Data[Data[feat_col[0]].argmin()][feat_col[0]])
min_x1=float(Data[Data[feat_col[1]].argmin()][feat_col[1]])
#Generate Plotting Data
num_arr=200
x0_3D=np.linspace(min_x0,max_x0,num=num_arr)  # remember argmax returns the index of the max
x1_3D=np.linspace(min_x1,max_x1,num=num_arr) #water nearly no effect in this range??
#reshape for columns view
x0_3D=np.reshape(x0_3D,(x0_3D.shape[0],-1))
x1_3D=np.reshape(x1_3D,(x1_3D.shape[0],-1))
#Concatenated to send them to Estimation function
X_3D_to_pred=np.concatenate((x0_3D, x1_3D),axis=1)#concatenate mean() if more than two independents
The following cell fills the columns of those independents not shown in the 3D plot with their corresponding column means
In [75]:
if len(select_feat) >= 3:
    fill_vals = X_data_mean[2:].tolist()
    for i in range(0, len(fill_vals), 1):
        filler = [fill_vals[i] for x in range(num_arr)]
        filler_arr = np.array(filler)
        filler_arr = np.reshape(filler_arr, (filler_arr.shape[0], -1))  # makes those vertical arrays like MATLAB does
        X_3D_to_pred = np.concatenate((X_3D_to_pred, filler_arr), axis=1)  # concatenates from the right side
Z_pred = SVR_function.predict(X_3D_to_pred)  # now the prediction function gets an array with the right number of columns
X0_3D, X1_3D = np.meshgrid(x0_3D, x1_3D)
2D Plot with first Independent as chosen before
In [76]:
plt.plot(x0_3D,Z_pred)
plt.xlabel(feat_col[0])# you can access the columns title also dynamically!
plt.ylabel(Data.dtype.names[select_out[0]])
plt.show()
3D Plot with first 2 Independents as chosen before
In [130]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
import pylab as p
fig = plt.figure()
ax = fig.gca(projection='3d')
wire = ax.plot_wireframe(X0_3D , X1_3D, Z_pred, cmap='autumn', cstride=20, rstride=20)
plt.show()
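Note that plot_wireframe expects a two-dimensional Z, while Z_pred above holds one prediction per linspace point. To draw a full surface, the model could be evaluated on every meshgrid point instead (a sketch under that assumption; grid_points and grid_pred are illustrative names):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# evaluate the SVR on every (x0, x1) combination so Z has the same shape as X0_3D/X1_3D
grid_points = np.column_stack((X0_3D.ravel(), X1_3D.ravel()))
if len(select_feat) >= 3:
    # fill the remaining independents with their means, one column per variable
    fill = np.tile(X_data_mean[2:], (grid_points.shape[0], 1))
    grid_points = np.concatenate((grid_points, fill), axis=1)
grid_pred = SVR_function.predict(grid_points).reshape(X0_3D.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X0_3D, X1_3D, grid_pred, cstride=20, rstride=20)
plt.show()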
Method 2: Linear Regression
Another, and prettier, way of representing data is "Pandas", with its Series and DataFrame structures.
DataFrame
In [117]:
import pandas as pd
X_data_DF=pd.DataFrame(X_data,columns=feat_col)
X_data_DF.head()# .head() shows the beginning of data
Out[117]:
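For completeness, a single column can also be wrapped in a pandas Series, which keeps a name attached to the values (a small sketch; y_series is just an illustrative name):
import pandas as pd

# wrap the dependent variable in a Series labelled with its column name
y_series = pd.Series(Y_true.flatten(), name=Data.dtype.names[select_out[0]])
y_series.head()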
We train our Linear Model now with the same input variables as before
In [125]:
from sklearn import linear_model
Lin_regr = linear_model.LinearRegression()
Lin_regr.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and Y.
R2_lin_reg = Lin_regr.score(X_data[train_ind:], Y_true[train_ind:].flatten())
coeff = pd.Series(Lin_regr.coef_, index=feat_col)
plt.figure(); coeff.plot(kind='bar', label='Coefficient on ' + Data.dtype.names[-1]); plt.legend(loc='best')
print(coeff)
print("R2_lin_reg=%s" % R2_lin_reg)
In [142]:
X_data_DF[X_data_DF.Water>=150][:5]#how to look up some data
Out[142]: