Saturday, June 21, 2014

Evaluating Multivariate Data with Data Mining Methods: An Example of Using Numpy and Pandas to Load and Represent Data from the Web and Sklearn for Prediction


This Notebook is under construction:

Goals are:
  • Collect data mining methods for data prediction
  • Make understandable examples with some data
  • Emphasize interesting code with Markdown
  • Use dynamic code so that different data can be used instead of the examples
  • This will make it easier to save the code as classes and methods in a Python module in the future

Cells where the user can make different decisions are marked RED

1. Compressive Strength of Concrete

Get Data from Webpage

In [2]:
from sympy import init_printing
init_printing() 
In [3]:
import csv 
import urllib
import numpy as np

get_data  = urllib.urlopen('http://www.bliasoft.com/Documents/DataConcrete.txt')
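Note: urllib.urlopen is the Python 2 API (this notebook runs on Python 2.7). If you re-run the notebook under Python 3, the same download could look like the following sketch (not part of the original notebook):

# Python 3 equivalent of the download above (sketch; the rest of the notebook assumes Python 2.7)
from urllib.request import urlopen
import io

raw = urlopen('http://www.bliasoft.com/Documents/DataConcrete.txt').read().decode('utf-8')
get_data = io.StringIO(raw)  # file-like object that np.genfromtxt can read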
Convert the decimal separator from comma to point
How Many Columns does your Data have?
In [4]:
No_Col=9
Does the Data use a "comma" as decimal separator?
If the Answer is "Yes":
In [5]:
conv = lambda valstr: float(valstr.replace(',','.'))
In [6]:
c={}
for i in range(0,No_Col,1):
    c[i] = conv  # register the comma-to-point converter for every column
If the Answer is "No": just skip the code above.
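For illustration, this is what the converter does to a sample value string (hypothetical input; the dict c above registers the same converter for every column):

conv('28,63')   # returns 28.63 as a float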
Set the Delimiter of the Data
In [7]:
Data=np.genfromtxt(get_data, dtype=np.float64, delimiter='\t', skip_header=0, names=True, converters=c)

np.random.shuffle(Data)  # shuffles the data row-wise
In [8]:
Data
Out[8]:
array([(237.0, 92.0, 71.0, 247.0, 6.0, 853.0, 695.0, 28.0, 28.63),
       (305.0, 0.0, 100.0, 196.0, 10.0, 959.0, 705.0, 28.0, 30.12),
       (259.9, 100.6, 78.4, 170.6, 10.4, 935.7, 762.9, 28.0, 49.77), ...,
       (298.1, 0.0, 107.0, 186.4, 6.1, 879.0, 815.2, 28.0, 42.64),
       (380.0, 95.0, 0.0, 228.0, 0.0, 932.0, 594.0, 90.0, 40.56),
       (200.0, 133.0, 0.0, 192.0, 0.0, 965.4, 806.2, 3.0, 11.41)], 
      dtype=[('Cement', '<f8'), ('Blast_furnace_slags', '<f8'), ('Fly_ashes', '<f8'), ('Water', '<f8'), ('Superplasticizers', '<f8'), ('Coarse_aggregates', '<f8'), ('Fine_aggregates', '<f8'), ('Age', '<f8'), ('Compressive_strength', '<f8')])
Number of Instances in the Dataset
In [9]:
Data.shape[0]
Out[9]:
$$1030$$
In [10]:
Data=np.reshape(Data, (Data.shape[0],-1))
In [11]:
type(Data[0])
Out[11]:
numpy.ndarray
Accessing columns by name is really intuitive
In [12]:
Data[:]['Compressive_strength']
Out[12]:
array([[ 28.63],
       [ 30.12],
       [ 49.77],
       ..., 
       [ 42.64],
       [ 40.56],
       [ 11.41]])

We split the Dataset into 2 Groups

  • 70% of Data for Training
  • 30% for Evaluating/Test
In [13]:
test_ind=int(Data.shape[0]*0.3)
print("test_ind=%s" % test_ind) 
train_ind=Data.shape[0]-test_ind
print('train_ind={0}'.format(train_ind))

instances=np.arange(Data.shape[0])
instances
test_ind=309
train_ind=721

Out[13]:
array([   0,    1,    2, ..., 1027, 1028, 1029])
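As an aside, the same 70/30 split can be produced with scikit-learn directly; a minimal sketch (not used in the rest of this notebook), assuming the train_test_split helper (sklearn.cross_validation in old releases, sklearn.model_selection in newer ones):

# Alternative 70/30 split of the row indexes with scikit-learn (sketch, not used below)
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

train_idx, test_idx = train_test_split(np.arange(Data.shape[0]), test_size=0.3, random_state=0)
print(len(train_idx), len(test_idx))  # roughly 721 / 309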

Determine which Independent Variables should explain the Dependent Variable

The first two selected variables (in that order) will be used for the plots against the Dependent Variable
In [52]:
#SET THE INDEXES OF THE INDEPENDENTS; they do not need to be in ascending order,
#but the first and second entries (in this order) will be plotted!
select_feat=[3,7,4]

select_out=[-1]  # SET THE COLUMN INDEX OF THE DEPENDENT



feat_col=[]
for elem in select_feat:
    feat_col.append(Data.dtype.names[elem])  # translate the column indexes into column names

    

Let us look at the Independent Variables we will use to explain the Dependent Variable

In [53]:
feat_col
Out[53]:
['Water', 'Age', 'Superplasticizers']

Now we define the Arrays for our model

In [54]:
[int(x) for x in select_feat]

Out[54]:
$$\begin{bmatrix}3, & 7, & 4\end{bmatrix}$$
In [55]:
[feat_col[x] for x in range(len(select_feat))]
Out[55]:
['Water', 'Age', 'Superplasticizers']
In [56]:
Data[0:2]
Out[56]:
array([[(237.0, 92.0, 71.0, 247.0, 6.0, 853.0, 695.0, 28.0, 28.63)],
       [(305.0, 0.0, 100.0, 196.0, 10.0, 959.0, 705.0, 28.0, 30.12)]], 
      dtype=[('Cement', '<f8'), ('Blast_furnace_slags', '<f8'), ('Fly_ashes', '<f8'), ('Water', '<f8'), ('Superplasticizers', '<f8'), ('Coarse_aggregates', '<f8'), ('Fine_aggregates', '<f8'), ('Age', '<f8'), ('Compressive_strength', '<f8')])
If you choose a single element, Data.dtype.names[0] is fine, but if you choose several you have to convert the tuple to a list()
In [57]:
Data.dtype.names[0]
Out[57]:
'Cement'
In [58]:
list(Data.dtype.names[:])
Out[58]:
['Cement',
 'Blast_furnace_slags',
 'Fly_ashes',
 'Water',
 'Superplasticizers',
 'Coarse_aggregates',
 'Fine_aggregates',
 'Age',
 'Compressive_strength']
In [59]:
X_data=Data[feat_col]

Y_true=Data[Data.dtype.names[select_out[0]]] #Holds the dependent
In [60]:
X_data
Out[60]:
array([[(247.0, 28.0, 6.0)],
       [(196.0, 28.0, 10.0)],
       [(170.6, 28.0, 10.4)],
       ..., 
       [(186.4, 28.0, 6.1)],
       [(228.0, 90.0, 0.0)],
       [(192.0, 3.0, 0.0)]], 
      dtype=[('Water', '<f8'), ('Age', '<f8'), ('Superplasticizers', '<f8')])
In [61]:
X_data[:][X_data.dtype.names[0]] #this hides our column names now, but we need another type of array to make our calculations 
Out[61]:
array([[ 247. ],
       [ 196. ],
       [ 170.6],
       ..., 
       [ 186.4],
       [ 228. ],
       [ 192. ]])
In [62]:
#X_data[:][list(X_data.dtype.names[:])]
#this would not work to hide columns...
\[\widehat{Comp_{Strength}}=f(Water, Age, Superplasticizers, \ldots)\]

Extract the explanatory variables without column names for further use

The following concatenates all the columns you chose as input for the model
In [63]:
build=X_data[feat_col[0]]
for i in range(0,len(feat_col)-1,1):
    build=np.concatenate((build,X_data[feat_col[i+1]]),axis=1)  # append the next selected column on the right
X_data=build
In [64]:
X_data #Columns are gone!
Out[64]:
array([[ 247. ,   28. ,    6. ],
       [ 196. ,   28. ,   10. ],
       [ 170.6,   28. ,   10.4],
       ..., 
       [ 186.4,   28. ,    6.1],
       [ 228. ,   90. ,    0. ],
       [ 192. ,    3. ,    0. ]])
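The same column extraction can also be written in one call; a sketch equivalent to the concatenation loop above (not part of the original notebook):

# Equivalent one-liner: stack the selected fields of the structured array into a plain 2-D float array (sketch)
X_alt = np.column_stack([Data[name].ravel() for name in feat_col])
print(X_alt.shape)  # same shape as X_data above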

Data Mining Methods

Support Vector Regression

In [65]:
from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', C=1e5, gamma=0.1)# free parameters can be optimized ** to do **

SVR_function=svr_rbf.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
# flatten because sklearn needs a 1D array for the Y

y_train = SVR_function.predict(X_data[:train_ind])

Training Performance

In [66]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))

plt.scatter(instances[:train_ind],y_train,label='Performance of SVR on Trained Data')
plt.scatter(instances[:train_ind],Y_true[:train_ind].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
#plt.scatter()
plt.legend(loc='best')
plt.show()
In [67]:
from sklearn.metrics import mean_squared_error,r2_score
R2_train=r2_score(Y_true[:train_ind].flatten(),y_train)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_train)
$R^{2}$=0.931387303246

Test Performance

In [68]:
import matplotlib.pyplot as plt
import sympy as sy
tex_ylabel=sy.latex(Data.dtype.names[-1].replace('_',' ').replace('$',''))
y_test = SVR_function.predict(X_data[train_ind:])
plt.scatter(instances[train_ind:],y_test,label='Performance of SVR on \"New\" Data')
plt.scatter(instances[train_ind:],Y_true[train_ind:].flatten(),c='r',label='True Data')
plt.xlabel(r'$Instance$')
plt.ylabel(r'$%s$' % tex_ylabel)
plt.legend(loc='best')
plt.show()
In [69]:
from sklearn.metrics import mean_squared_error,r2_score
R2_test=r2_score(Y_true[train_ind:].flatten(),y_test)
#tex=sy.latex('R**2').replace('$','')
print('$R^{2}$'+ '=%s' % R2_test)
$R^{2}$=-2.65098593552

Three Independent Variables are not enough for a good prediction
The SVR we designed (using only 3 of the 8 available variables) is a pretty bad estimator on the test data, as you can see from the Coefficient of Determination \(R^{2}\)
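The free parameters C and gamma of the SVR (marked "to do" in the cell above) can also be tuned automatically; a minimal sketch using a grid search over the training data (sklearn.grid_search in old scikit-learn releases, sklearn.model_selection in newer ones; not run in this notebook):

# Grid search over the SVR hyperparameters on the training split (sketch)
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
from sklearn.svm import SVR

param_grid = {'C': [1e0, 1e2, 1e4], 'gamma': [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=5)
search.fit(X_data[:train_ind], Y_true[:train_ind].flatten())
print(search.best_params_)
print(search.best_estimator_.score(X_data[train_ind:], Y_true[train_ind:].flatten()))  # test R^2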
Long expression: looks up the value at the index of the column's maximum
In [70]:
float(Data[Data[feat_col[0]].argmax()][feat_col[0]])#Data[ind]['Col']
Out[70]:
$$247.0$$
In [71]:
X_data_mean=np.mean(X_data,axis=0)
In [72]:
X_data_mean
Out[72]:
array([ 181.56728155,   45.66213592,    6.20466019])
Determine Min and Max in the Data for the Plotting Ranges (this means no Extrapolation in the Plot)
In [74]:
#If you want to plot variables other than x0=feat_col[0] and x1=feat_col[1], change the sequence of feat_col at the top

max_x0=float(Data[Data[feat_col[0]].argmax()][feat_col[0]])
max_x1=float(Data[Data[feat_col[1]].argmax()][feat_col[1]])

min_x0=float(Data[Data[feat_col[0]].argmin()][feat_col[0]])
min_x1=float(Data[Data[feat_col[1]].argmin()][feat_col[1]])

#Generate Plotting Data
num_arr=200
x0_3D=np.linspace(min_x0,max_x0,num=num_arr)  # remember: argmax returns the index of the max
x1_3D=np.linspace(min_x1,max_x1,num=num_arr)  # water seems to have nearly no effect in this range??

#reshape for columns view
x0_3D=np.reshape(x0_3D,(x0_3D.shape[0],-1))
x1_3D=np.reshape(x1_3D,(x1_3D.shape[0],-1))

#Concatenated to send them to Estimation function
X_3D_to_pred=np.concatenate((x0_3D, x1_3D),axis=1)  # the means of the remaining independents are concatenated below if more than two were selected

The following cell fills the columns of those Independents not shown in the 3D Plot with their corresponding column means

In [75]:
if len(select_feat)>=3:
    fill_vals=X_data_mean[2:].tolist()
    for i in range(0,len(fill_vals),1):
        filler=[fill_vals[i] for x in range(num_arr)]
        filler_arr=np.array(filler)
        filler_arr=np.reshape(filler_arr, (filler_arr.shape[0],-1))  # makes those column vectors, like MATLAB does
        X_3D_to_pred=np.concatenate((X_3D_to_pred,filler_arr),axis=1)# concatenates from right side 
    

Z_pred=SVR_function.predict(X_3D_to_pred)  # now the prediction function receives an array with the right number of columns at prediction time

X0_3D,X1_3D=np.meshgrid(x0_3D,x1_3D)

2D Plot with first Independent as chosen before

In [76]:
plt.plot(x0_3D,Z_pred)
plt.xlabel(feat_col[0])  # you can also access the column title dynamically!
plt.ylabel(Data.dtype.names[select_out[0]])
plt.show()

3D Plot with first 2 Independents as chosen before

In [130]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
import pylab as p
fig = plt.figure()
ax = fig.gca(projection='3d')
wire = ax.plot_wireframe(X0_3D , X1_3D, Z_pred, cmap='autumn', cstride=20, rstride=20)

plt.show()
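Note that Z_pred above was evaluated along a single 1-D path (x0 and x1 increase together), not on every grid point. If a full response surface over the two plotted variables is wanted, the grid can be predicted explicitly; a sketch under that assumption (not part of the original notebook):

# Predict on the full x0/x1 grid instead of a single path, keeping the remaining independents at their means (sketch)
grid_pts = np.column_stack([X0_3D.ravel(), X1_3D.ravel()])
for mean_val in X_data_mean[2:]:
    grid_pts = np.column_stack([grid_pts, np.full(grid_pts.shape[0], mean_val)])

Z_grid = SVR_function.predict(grid_pts).reshape(X0_3D.shape)

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_wireframe(X0_3D, X1_3D, Z_grid, cstride=20, rstride=20)
plt.show()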

Method 2: Linear Regression

Another, prettier way of representing data is a Pandas Series or DataFrame
Data Frame
In [117]:
import pandas as pd
X_data_DF=pd.DataFrame(X_data,columns=feat_col)
X_data_DF.head()# .head() shows the beginning of data
Out[117]:
   Water  Age  Superplasticizers
0  247.0   28                6.0
1  196.0   28               10.0
2  170.6   28               10.4
3  145.0   28               13.1
4  203.5    3                0.0

5 rows × 3 columns

We now train our Linear Model with the same input variables as before

In [125]:
from sklearn import linear_model
Lin_regr = linear_model.LinearRegression()
Lin_regr.fit(X_data[:train_ind], Y_true[:train_ind].flatten())

# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and Y.
R2_lin_reg=Lin_regr.score(X_data[train_ind:], Y_true[train_ind:].flatten()) ;
coeff=pd.Series(Lin_regr.coef_ ,index=feat_col)
plt.figure();coeff.plot(kind='bar',label='Coefficient on '+ Data.dtype.names[-1]);plt.legend(loc='best')
print(coeff)
print("R2_lin_reg=%s" % R2_lin_reg)
Water               -0.169086
Age                  0.117719
Superplasticizers    0.865676
dtype: float64
R2_lin_reg=0.31371105198
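Because the selected inputs live on very different scales (e.g. Water vs. Age), the raw coefficients in the bar plot above are not directly comparable. Standardizing the features before fitting puts them on a common scale; a minimal sketch (not part of the original analysis), assuming sklearn's StandardScaler:

# Linear regression on standardized inputs, so coefficient magnitudes are comparable across features (sketch)
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model

scaler = StandardScaler().fit(X_data[:train_ind])
lin_std = linear_model.LinearRegression()
lin_std.fit(scaler.transform(X_data[:train_ind]), Y_true[:train_ind].flatten())
print(pd.Series(lin_std.coef_, index=feat_col))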

In [142]:
X_data_DF[X_data_DF.Water>=150][:5]  # how to filter and look up some data
Out[142]:
   Water  Age  Superplasticizers
0  247.0   28                6.0
1  196.0   28               10.0
2  170.6   28               10.4
4  203.5    3                0.0
5  228.0   28                0.0

5 rows × 3 columns

More to come...
