Saturday, June 21, 2014

Evaluating Multivariate Data with Data Mining Methods: An Example of Using Numpy and Pandas to Load and Represent Data from the Web and Sklearn for Prediction


This Notebook is under construction:

Goals are:
  • Collect data mining methods for data prediction
  • Make understandable examples with some data
  • Emphasize interesting code with Markdown
  • Use dynamic code so that different data can be used instead of the examples
  • This will make it easier to save the code as classes and methods in a Python module in the future

Cells where the user can make different decisions are marked RED

1. Compressive Strength of Concrete

Get Data from Webpage

In [2]:
from sympy import init_printing
init_printing() 
In [3]:
import csv 
import urllib
import numpy as np

get_data  = urllib.urlopen('http://www.bliasoft.com/Documents/DataConcrete.txt')
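Note: urllib.urlopen is the Python 2 API (this notebook runs on Python 2.7). If you re-run the notebook under Python 3, the same download could look like the following sketch (not part of the original notebook):

# Python 3 equivalent of the download above (sketch; the rest of the notebook assumes Python 2.7)
from urllib.request import urlopen
import io

raw = urlopen('http://www.bliasoft.com/Documents/DataConcrete.txt').read().decode('utf-8')
get_data = io.StringIO(raw)  # file-like object that np.genfromtxt can read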
Convert the decimal separator from comma to point
How Many Columns does your Data have?
In [4]:
No_Col=9
Does the Data use a "comma" as decimal separator?
If the Answer is "Yes":
In [5]:
conv = lambda valstr: float(valstr.replace(',','.'))
In [6]:
c={}
for i in range(0,No_Col,1):
    c[i] = conv  # register the comma-to-point converter for every column
If the Answer is "No": just skip the code above.
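For illustration, this is what the converter does to a sample value string (hypothetical input; the dict c above registers the same converter for every column):

conv('28,63')   # returns 28.63 as a float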
Set the Delimiter of the Data
In [7]:
Data=np.genfromtxt(get_data, dtype=np.float64, delimiter='\t', skip_header=0, names=True, converters=c)

np.random.shuffle(Data)  # shuffles the data row-wise
In [8]:
Data
Out[8]:
array([(237.0, 92.0, 71.0, 247.0, 6.0, 853.0, 695.0, 28.0, 28.63),
       (305.0, 0.0, 100.0, 196.0, 10.0, 959.0, 705.0, 28.0, 30.12),
       (259.9, 100.6, 78.4, 170.6, 10.4, 935.7, 762.9, 28.0, 49.77), ...,
       (298.1, 0.0, 107.0, 186.4, 6.1, 879.0, 815.2, 28.0, 42.64),
       (380.0, 95.0, 0.0, 228.0, 0.0, 932.0, 594.0, 90.0, 40.56),
       (200.0, 133.0, 0.0, 192.0, 0.0, 965.4, 806.2, 3.0, 11.41)], 
      dtype=[('Cement', '<f8'), ('Blast_furnace_slags', '<f8'), ('Fly_ashes', '<f8'), ('Water', '<f8'), ('Superplasticizers', '<f8'), ('Coarse_aggregates', '<f8'), ('Fine_aggregates', '<f8'), ('Age', '<f8'), ('Compressive_strength', '<f8')])
Number of Instances in the Dataset
In [9]:
Data.shape[0]
Out[9]:
$$1030$$
In [10]:
Data=np.reshape(Data, (Data.shape[0],-1))
In [11]:
type(Data[0])
Out[11]:
numpy.ndarray
Accessing columns by name is really intuitive
In [12]:
Data[:]['Compressive_strength']
Out[12]:
array([[ 28.63],
       [ 30.12],
       [ 49.77],
       ..., 
       [ 42.64],
       [ 40.56],
       [ 11.41]])

We split the Dataset into 2 Groups

  • 70% of Data for Training
  • 30% for Evaluating/Test
In [13]:
test_ind=int(Data.shape[0]*0.3)
print("test_ind=%s" % test_ind) 
train_ind=Data.shape[0]-test_ind
print('train_ind={0}'.format(train_ind))

instances=np.arange(Data.shape[0])
instances
test_ind=309
train_ind=721

Out[13]:
array([   0,    1,    2, ..., 1027, 1028, 1029])
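As an aside, the same 70/30 split can be produced with scikit-learn directly; a minimal sketch (not used in the rest of this notebook), assuming the train_test_split helper (sklearn.cross_validation in old releases, sklearn.model_selection in newer ones):

# Alternative 70/30 split of the row indexes with scikit-learn (sketch, not used below)
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

train_idx, test_idx = train_test_split(np.arange(Data.shape[0]), test_size=0.3, random_state=0)
print(len(train_idx), len(test_idx))  # roughly 721 / 309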

Determine which Independent Variables should explain the Dependent Variable

The first two selected variables (in that order) will be used for the plots against the Dependent Variable
In [52]:
#SET THE INDEXES OF THE INDEPENDENTS; they do not need to be in ascending order,
#but the first and second entries (in this order) will be plotted!
select_feat=[3,7,4]

select_out=[-1]  # SET THE COLUMN INDEX OF THE DEPENDENT



feat_col=[]
for elem in select_feat:
    feat_col.append(Data.dtype.names[elem])  # translate the column indexes into column names

    

Let us look at the Independent Variables we will use to explain the Dependent Variable

In [53]:
feat_col
Out[53]:
['Water', 'Age', 'Superplasticizers']

Now we define the Arrays for our model

In [54]:
[int(x) for x in select_feat]

Out[54]:
$$\begin{bmatrix}3, & 7, & 4\end{bmatrix}$$
In [55]:
[feat_col[x] for x in range(len(select_feat))]
Out[55]:
['Water', 'Age', 'Superplasticizers']
In [56]:
Data[0:2]
Out[56]:
array([[(237.0, 92.0, 71.0, 247.0, 6.0, 853.0, 695.0, 28.0, 28.63)],
       [(305.0, 0.0, 100.0, 196.0, 10.0, 959.0, 705.0, 28.0, 30.12)]], 
      dtype=[('Cement', '<f8'), ('Blast_furnace_slags', '<f8'), ('Fly_ashes', '<f8'), ('Water', '<f8'), ('Superplasticizers', '<f8'), ('Coarse_aggregates', '<f8'), ('Fine_aggregates', '<f8'), ('Age', '<f8'), ('Compressive_strength', '<f8')])
If you choose a single element, Data.dtype.names[0] is fine, but if you choose several you have to convert the tuple to a list()
In [57]:
Data.dtype.names[0]
Out[57]:
'Cement'
In [58]:
list(Data.dtype.names[:])
Out[58]:
['Cement',
 'Blast_furnace_slags',
 'Fly_ashes',
 'Water',
 'Superplasticizers',
 'Coarse_aggregates',
 'Fine_aggregates',
 'Age',
 'Compressive_strength']
In [59]:
X_data=Data[feat_col]

Y_true=Data[Data.dtype.names[select_out[0]]] #Holds the dependent
In [60]:
X_data
Out[60]:
array([[(247.0, 28.0, 6.0)],
       [(196.0, 28.0, 10.0)],
       [(170.6, 28.0, 10.4)],
       ..., 
       [(186.4, 28.0, 6.1)],
       [(228.0, 90.0, 0.0)],
       [(192.0, 3.0, 0.0)]], 
      dtype=[('Water', '<f8'), ('Age', '<f8'), ('Superplasticizers', '<f8')])
In [61]:
X_data[:][X_data.dtype.names[0]] #this hides our column names now, but we need another type of array to make our calculations 
Out[61]:
array([[ 247. ],
       [ 196. ],
       [ 170.6],
       ..., 
       [ 186.4],
       [ 228. ],
       [ 192. ]])
In [62]:
#X_data[:][list(X_data.dtype.names[:])]
#this would not work to hide columns...
\[\widehat{Comp_{Strength}}=f(Water, Age, Superplasticizers, \ldots)\]

Extract the explanatory variables without column names for further use

The following concatenates all the columns you chose as input for the model
In [63]:
build=X_data[feat_col[0]]
for i in range(0,len(feat_col)-1,1):
    build=np.concatenate((build,X_data[feat_col[i+1]]),axis=1)  # append the next selected column on the right
X_data=build
In [64]:
X_data #Columns are gone!
Out[64]:
array([[ 247. ,   28. ,    6. ],
       [ 196. ,   28. ,   10. ],
       [ 170.6,   28. ,   10.4],
       ..., 
       [ 186.4,   28. ,    6.1],
       [ 228. ,   90. ,    0. ],
       [ 192. ,    3. ,    0. ]])
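The same column extraction can also be written in one call; a sketch equivalent to the concatenation loop above (not part of the original notebook):

# Equivalent one-liner: stack the selected fields of the structured array into a plain 2-D float array (sketch)
X_alt = np.column_stack([Data[name].ravel() for name in feat_col])
print(X_alt.shape)  # same shape as X_data above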

Data Mining Methods

Support Vector Regression

In [65]:
from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', C=1e5, gamma=0.1)# free parameters can be optimized ** to do **

SVR_function=svr_rbf.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
# flatten because sklearn needs a 1D array for the Y

y_train = SVR_function.predict(X_data[:train_ind])

Training Performance

In [66]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))

plt.scatter(instances[:train_ind],y_train,label='Performance of SVR on Trained Data')
plt.scatter(instances[:train_ind],Y_true[:train_ind].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
#plt.scatter()
plt.legend(loc='best')
plt.show()
In [67]:
from sklearn.metrics import mean_squared_error,r2_score
R2_train=r2_score(Y_true[:train_ind].flatten(),y_train)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_train)
$R^{2}$=0.931387303246

Test Performance

In [68]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))
y_test = SVR_function.predict(X_data[train_ind:])
plt.scatter(instances[train_ind:],y_test,label='Performance of SVR on \"New\" Data')
plt.scatter(instances[train_ind:],Y_true[train_ind:].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
plt.legend(loc='best')
plt.show()
In [69]:
from sklearn.metrics import mean_squared_error,r2_score
R2_test=r2_score(Y_true[train_ind:].flatten(),y_test)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_test)
$R^{2}$=-2.65098593552

Three Independent Variables are not enough for a good prediction
The SVR we designed (using only 3 of the 8 available variables) is a pretty bad estimator on the test data, as you can see from the Coefficient of Determination \(R^{2}\)
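The free parameters C and gamma of the SVR (marked "to do" in the cell above) can also be tuned automatically; a minimal sketch using a grid search over the training data (sklearn.grid_search in old scikit-learn releases, sklearn.model_selection in newer ones; not run in this notebook):

# Grid search over the SVR hyperparameters on the training split (sketch)
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
from sklearn.svm import SVR

param_grid = {'C': [1e0, 1e2, 1e4], 'gamma': [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
search.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
print(search.best_params_)
print(search.best_estimator_.score(X_data[train_ind:], Y_true[train_ind:].flatten()))  # test R^2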
Long expression: looks up the value at the index of the column's maximum
In [70]:
float(Data[Data[feat_col[0]].argmax()][feat_col[0]])#Data[ind]['Col']
Out[70]:
$$247.0$$
In [71]:
X_data_mean=np.mean(X_data,axis=0)
In [72]:
X_data_mean
Out[72]:
array([ 181.56728155,   45.66213592,    6.20466019])
Determine Min and Max in the Data for the Plotting Ranges (this means no Extrapolation in the Plot)
In [74]:
#If you want to plot variables other than x0=feat_col[0] and x1=feat_col[1], change the sequence of feat_col at the top

max_x0=float(Data[Data[feat_col[0]].argmax()][feat_col[0]])
max_x1=float(Data[Data[feat_col[1]].argmax()][feat_col[1]])

min_x0=float(Data[Data[feat_col[0]].argmin()][feat_col[0]])
min_x1=float(Data[Data[feat_col[1]].argmin()][feat_col[1]])

#Generate Plotting Data
num_arr=200
x0_3D=np.linspace(min_x0,max_x0,num=num_arr)  # remember: argmax returns the index of the max
x1_3D=np.linspace(min_x1,max_x1,num=num_arr)  # water seems to have nearly no effect in this range??

#reshape for columns view
x0_3D=np.reshape(x0_3D,(x0_3D.shape[0],-1))
x1_3D=np.reshape(x1_3D,(x1_3D.shape[0],-1))

#Concatenated to send them to Estimation function
X_3D_to_pred=np.concatenate((x0_3D, x1_3D),axis=1)  # the means of the remaining independents are concatenated below if more than two were selected

The following cell fills the columns of those Independents not shown in the 3D Plot with their corresponding column means

In [75]:
if len(select_feat)>=3:
    fill_vals=X_data_mean[2:].tolist()
    for i in range(0,len(fill_vals),1):
        filler=[fill_vals[i] for x in range(num_arr)]
        filler_arr=np.array(filler)
        filler_arr=np.reshape(filler_arr, (filler_arr.shape[0],-1))  # makes those column vectors, like MATLAB does
        X_3D_to_pred=np.concatenate((X_3D_to_pred,filler_arr),axis=1)# concatenates from right side 
    

Z_pred=SVR_function.predict(X_3D_to_pred)  # now the prediction function receives an array with the right number of columns at prediction time

X0_3D,X1_3D=np.meshgrid(x0_3D,x1_3D)

2D Plot with first Independent as chosen before

In [76]:
plt.plot(x0_3D,Z_pred)
plt.xlabel(feat_col[0])  # you can also access the column title dynamically!
plt.ylabel(Data.dtype.names[select_out[0]])
plt.show()

3D Plot with first 2 Independents as chosen before

In [130]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
import pylab as p
fig = plt.figure()
ax = fig.gca(projection='3d')
wire = ax.plot_wireframe(X0_3D , X1_3D, Z_pred, cmap='autumn', cstride=20, rstride=20)

plt.show()
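Note that Z_pred above was evaluated along a single 1-D path (x0 and x1 increase together), not on every grid point. If a full response surface over the two plotted variables is wanted, the grid can be predicted explicitly; a sketch under that assumption (not part of the original notebook):

# Predict on the full x0/x1 grid instead of a single path, keeping the remaining independents at their means (sketch)
grid_pts = np.column_stack([X0_3D.ravel(), X1_3D.ravel()])
for mean_val in X_data_mean[2:]:
    grid_pts = np.column_stack([grid_pts, np.full(grid_pts.shape[0], mean_val)])

Z_grid = SVR_function.predict(grid_pts).reshape(X0_3D.shape)

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_wireframe(X0_3D, X1_3D, Z_grid, cstride=20, rstride=20)
plt.show()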

Method 2: Linear Regression

Another, prettier way of representing data is a Pandas Series or DataFrame
Data Frame
In [117]:
import pandas as pd
X_data_DF=pd.DataFrame(X_data,columns=feat_col)
X_data_DF.head()# .head() shows the beginning of data
Out[117]:
   Water  Age  Superplasticizers
0  247.0   28                6.0
1  196.0   28               10.0
2  170.6   28               10.4
3  145.0   28               13.1
4  203.5    3                0.0

5 rows × 3 columns

We now train our Linear Model with the same input variables as before

In [125]:
from sklearn import linear_model
Lin_regr = linear_model.LinearRegression()
Lin_regr.fit(X_data[:train_ind], Y_true[:train_ind].flatten())

# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and Y.
R2_lin_reg=Lin_regr.score(X_data[train_ind:], Y_true[train_ind:].flatten()) ;
coeff=pd.Series(Lin_regr.coef_ ,index=feat_col)
plt.figure();coeff.plot(kind='bar',label='Coefficient on '+ Data.dtype.names[-1]);plt.legend(loc='best')
print(coeff)
print("R2_lin_reg=%s" % R2_lin_reg)
Water               -0.169086
Age                  0.117719
Superplasticizers    0.865676
dtype: float64
R2_lin_reg=0.31371105198
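Because the selected inputs live on very different scales (e.g. Water vs. Age), the raw coefficients in the bar plot above are not directly comparable. Standardizing the features before fitting puts them on a common scale; a minimal sketch (not part of the original analysis), assuming sklearn's StandardScaler:

# Linear regression on standardized inputs, so coefficient magnitudes are comparable across features (sketch)
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model

scaler = StandardScaler().fit(X_data[:train_ind])
lin_std = linear_model.LinearRegression()
lin_std.fit(scaler.transform(X_data[:train_ind]), Y_true[:train_ind].flatten())
print(pd.Series(lin_std.coef_, index=feat_col))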

In [142]:
X_data_DF[X_data_DF.Water>=150][:5]  # how to filter and look up some data
Out[142]:
   Water  Age  Superplasticizers
0  247.0   28                6.0
1  196.0   28               10.0
2  170.6   28               10.4
4  203.5    3                0.0
5  228.0   28                0.0

5 rows × 3 columns

More to come...
