# Linear vs Logistic Regression, all in Numpy

The two entry-level machine learning algorithms, linear and logistic regression, are quite easy to understand and provide a good way to practice coding the general machine learning pipeline in vectorized form. Namely:

• Prepping the dataset, e.g. removing outliers, adding features (polynomial combinations of existing features), normalization, feature scaling.
• Implementing the learning algorithm function.
• Calculating the loss from the chosen loss function and hypothesis.
• Optimization algorithm: updating your model's parameters depending on the loss and ground truth.
• Maintaining a config file for the whole training process.
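As a rough illustration of the last bullet, a training config can be as simple as a dictionary serialized to JSON. This is only a sketch: the `learning_rate`, `num_iters` and `thresh` values mirror defaults used in the code later in this post, while `data_split_ratios` and `seed_values` are assumed values chosen for the example.

```python
import json

# Hypothetical config; split ratios and seeds are assumptions for illustration.
config = {
    "learning_rate": 0.001,                  # default of both training functions
    "num_iters": 500,                        # value used by the linear runner
    "thresh": 0.5,                           # decision threshold
    "data_split_ratios": [0.7, 0.15, 0.15],  # assumed train/val/test split
    "seed_values": list(range(10)),          # assumed seeds, one per trial
}

# Round-trip through JSON text, as one would when saving and reloading a file.
config_text = json.dumps(config, indent=2)
restored = json.loads(config_text)
print(restored["learning_rate"])
```

Keeping these values in one place makes it easier to rerun an experiment with exactly the same settings.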

An entry-level toy dataset, the Pima Indians Diabetes dataset, is used in this tutorial. It has a single target variable per datapoint (a binary classification task: predicting whether a person has diabetes or not).

You can read all about the dataset on Kaggle. It has `768` datapoints and `8` features, which is quite a sweet spot for getting a decent model without worrying about underfitting. The data describes female patients: their BMI, insulin level, age, skin thickness, glucose level, etc. The target tells whether the person has diabetes. Of all the datapoints, `500` are non-diabetic and `268` are diabetic. The data is not highly skewed, so plain test accuracy should suffice (precision, recall and `F1` score are not strictly needed).
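To put the class balance in perspective, the majority-class baseline can be computed directly from the counts quoted above. This is a small sketch; the variable names are illustrative and not part of the original code.

```python
# Class counts quoted in the text above.
non_diabetic = 500
diabetic = 268
total = non_diabetic + diabetic

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(non_diabetic, diabetic) / total

print(total)                        # 768
print(round(majority_baseline, 3))  # 0.651
```

A trained model should beat roughly 65% accuracy to be more useful than always predicting "non-diabetic".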

## Helper Functions

#### For normalizing the dataset

This rescales every feature column to the range `0` to `1` (min-max scaling).

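```python
import numpy as np

def scalify_min_max(np_dataframe):
    # Column-wise minimum and maximum of the feature matrix
    minimum_array = np.amin(np_dataframe, axis=0)
    maximum_array = np.amax(np_dataframe, axis=0)
    range_array = maximum_array - minimum_array

    # Map every column to [0, 1]; a constant column would divide by zero here
    scaled = (np_dataframe - minimum_array) / range_array
    return scaled
```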
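A tiny usage example may make the effect of min-max scaling concrete. The function is repeated in compact form so the snippet runs on its own, and the `toy` array is made up for illustration.

```python
import numpy as np

def scalify_min_max(np_dataframe):
    # Same column-wise min-max scaling as above, in compact form
    minimum_array = np.amin(np_dataframe, axis=0)
    range_array = np.amax(np_dataframe, axis=0) - minimum_array
    return (np_dataframe - minimum_array) / range_array

# Two features on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
scaled = scalify_min_max(toy)
print(scaled)  # both columns become 0, 0.5, 1
```

After scaling, both columns cover the same range, so neither dominates the gradient just because of its units.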

#### For calculating the accuracy

```python
def accuracy_calculator(Y_out, Y):
    # Fraction of predictions that match the labels (XNOR of the two 0/1 arrays)
    accuracy = np.sum(np.logical_not(np.logical_xor(Y_out, Y))) / Y.shape[0]
    true_positives = np.sum(np.logical_and(Y_out, Y))
    false_positives = np.sum(np.logical_and(Y_out, np.logical_not(Y)))
    false_negatives = np.sum(np.logical_and(np.logical_not(Y_out), Y))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    print("Precision:", precision, ".Recall:", recall)
    # F1 is the harmonic mean of precision and recall
    F1_score = 2 * precision * recall / (precision + recall)
    return [accuracy, precision, recall, F1_score]
```
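The metric definitions can be checked by hand on a few made-up predictions. The sketch below recomputes the same quantities on five hypothetical datapoints; the arrays are invented for illustration only.

```python
import numpy as np

# Hypothetical labels and predictions as column vectors
Y     = np.array([[1], [0], [0], [1], [1]])  # ground truth
Y_out = np.array([[1], [1], [0], [0], [1]])  # model output

true_positives  = np.sum(np.logical_and(Y_out, Y))                  # 2
false_positives = np.sum(np.logical_and(Y_out, np.logical_not(Y)))  # 1
false_negatives = np.sum(np.logical_and(np.logical_not(Y_out), Y))  # 1

accuracy  = np.sum(Y_out == Y) / Y.shape[0]                         # 3 of 5 correct
precision = true_positives / (true_positives + false_positives)
recall    = true_positives / (true_positives + false_negatives)
f1_score  = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1_score)
```

Here precision and recall are both 2/3, so their harmonic mean (the F1 score) is also 2/3.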

#### For preparing the dataset – creating train/val/test splits

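```python
import gzip

def pre_data_prep(filename, dest_fileloc):
    # Read the gzipped CSV, scale the features, and save the result as plain CSV
    with open(filename, 'rb') as f:
        gzip_fd = gzip.GzipFile(fileobj=f)
        next(gzip_fd)  # Skip first row (header)
        # Assumed loading step: the listing does not show how diabetes_df is read
        diabetes_df = np.loadtxt(gzip_fd, delimiter=',')
    Y = diabetes_df[:, -1]
    scaled_diabetes_df = scalify_min_max(diabetes_df[:, :-1])
    concat_diabetes = np.concatenate((scaled_diabetes_df, np.array([Y]).T), axis=1)
    np.savetxt(dest_fileloc, concat_diabetes, delimiter=',')

def dataprep(fileloc, split):
    # split holds the (train, val, test) fractions
    assert len(split) == 3
    assert np.isclose(sum(split), 1)  # tolerate floating-point rounding
    # Assumed loading step: read the scaled CSV written by pre_data_prep
    diabetes_data = np.loadtxt(fileloc, delimiter=',')
    Y = np.array([diabetes_data[:, -1]]).T  # column vector of labels
    classes = np.unique(Y)
    assert len(classes) == 2
    X = diabetes_data[:, :-1]
    data_size = X.shape[0]
    print(data_size, X.shape, Y.shape)

    # Rows are taken in file order: train first, then validation, then test
    split_size = int(split[0] * data_size)
    val_split = int(split[1] * data_size)
    X_train = X[:split_size]
    X_val = X[split_size:split_size + val_split]
    X_test = X[split_size + val_split:]
    Y_train = Y[:split_size]
    Y_val = Y[split_size:split_size + val_split]
    Y_test = Y[split_size + val_split:]
    return X_train, X_val, X_test, Y_train, Y_val, Y_test
```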
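The index arithmetic of the split is easy to verify in isolation. The helper `split_sizes` below is hypothetical, and the ratios `(0.7, 0.15, 0.15)` are an assumption, since the post does not state which split was used. That said, the resulting test set of 116 rows is consistent with the test accuracies reported further down (for example, 82/116 ≈ 0.7069).

```python
def split_sizes(n_rows, split):
    # Mirrors the slicing logic of dataprep: the test set receives
    # whatever remains after the train and validation rows are taken.
    train = int(split[0] * n_rows)
    val = int(split[1] * n_rows)
    test = n_rows - train - val
    return train, val, test

sizes = split_sizes(768, (0.7, 0.15, 0.15))
print(sizes)  # (537, 115, 116)
```

Because `int()` truncates, the rounding leftovers end up in the test set. The rows are also not shuffled, so the split depends on the order of rows in the file.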

#### Evaluation function

For finding the accuracy of the learned model on the test dataset.

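```python
def evaluate(theta_params, X, Y=None, thresh=0.5):
    data_size = X.shape[0]
    # Prepend a column of ones for the bias term
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1)
    scores = np.matmul(X_extend, theta_params)
    pred = np.greater(scores, thresh) * 1
    # Squared-error cost; only computable when labels are supplied
    cost = None if Y is None else np.sum(np.square(scores - Y)) / (data_size * 2)
    return pred, cost
```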

## Logistic Regression Function

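```python
from tqdm import tqdm

def sigmoid_func(theta, X):
    # theta: (n+1)x1, X: (n+1)xm, result: 1xm
    retval = 1 / (1 + np.exp(-1 * np.matmul(theta.T, X)))
    return retval

def logistic_regression(X, Y, learning_rate=0.001, num_iters=100, thresh=0.5, rand_seed=None):
    if rand_seed is not None:  # For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params = np.array([np.random.randn(X.shape[1] + 1)]).T  # (n+1)x1
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1).T  # (n+1)xm
    cost = []  # Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters), desc="Training.."):
        h_theta = sigmoid_func(theta_params, X_extend).T  # mx1
        cost.append(-1 * np.sum(Y * np.log(h_theta) + (1 - Y) * np.log(1 - h_theta)) / data_size)
        # Gradient step (assumed: the listing shows no update, so this is the
        # standard cross-entropy gradient X^T (h - Y) / m)
        theta_params = theta_params - learning_rate * np.matmul(X_extend, h_theta - Y) / data_size
    # Threshold the raw linear score, as evaluate() does
    final_pred = np.greater(np.matmul(X_extend.T, theta_params), thresh) * 1
    accuracy = np.sum(np.logical_not(np.logical_xor(final_pred, Y))) / data_size
    cost = np.array(cost)
    return theta_params, accuracy, cost
```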
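A common way to gain confidence in a hand-written gradient is to compare it against finite differences. The sketch below does this for the cross-entropy cost of logistic regression on random toy data. The helper names (`cross_entropy`, `analytic_gradient`) and the data are assumptions for illustration, and `X_ext` is laid out as m x (n+1) here, i.e. not transposed as in the function above.

```python
import numpy as np

def cross_entropy(theta, X_ext, Y):
    # Mean binary cross-entropy for parameters theta
    h = 1.0 / (1.0 + np.exp(-X_ext @ theta))
    return -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / X_ext.shape[0]

def analytic_gradient(theta, X_ext, Y):
    # Closed-form gradient of the cost above: X^T (h - Y) / m
    h = 1.0 / (1.0 + np.exp(-X_ext @ theta))
    return X_ext.T @ (h - Y) / X_ext.shape[0]

rng = np.random.default_rng(0)
X_ext = np.concatenate((np.ones((20, 1)), rng.random((20, 3))), axis=1)
Y = (rng.random((20, 1)) > 0.5) * 1.0
theta = rng.standard_normal((4, 1))

# Central finite differences, one parameter at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cross_entropy(theta + step, X_ext, Y)
                  - cross_entropy(theta - step, X_ext, Y)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_gradient(theta, X_ext, Y)))
print(max_diff)  # should be very small if the gradient formula is right
```

If the two gradients disagree by more than a tiny amount, the update rule (or the cost) most likely contains a bug.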

## Linear Regression Function

```python
def linear_regression(X, Y, learning_rate=0.001, num_iters=100, thresh=0.5, rand_seed=None):
    if rand_seed is not None:  # For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params = np.array([np.random.randn(X.shape[1] + 1)]).T  # (n+1)x1
    # Prepend a column of ones for the bias term: mx(n+1)
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1)
    cost = []  # Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters), desc="Training.."):
        # Gradient step on the squared-error cost
        theta_params = theta_params - learning_rate * np.matmul((np.matmul(theta_params.T, X_extend.T) - Y.T), X_extend).T / data_size
        cost.append(np.sum(np.square(np.matmul(X_extend, theta_params) - Y)) / (data_size * 2))
    # Turn the regression output into a class label with a threshold
    final_pred = np.greater(np.matmul(X_extend, theta_params), thresh) * 1
    accuracy = np.sum(np.logical_not(np.logical_xor(final_pred, Y))) / data_size
    cost = np.array(cost)
    return theta_params, accuracy, cost
```
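For linear regression, gradient descent can be sanity-checked against the closed-form least-squares solution. The sketch below uses made-up noiseless data and a larger learning rate than the default above so that it converges quickly. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = rng.random((m, 2))
X_ext = np.concatenate((np.ones((m, 1)), X), axis=1)  # add bias column
true_theta = np.array([[0.5], [2.0], [-1.0]])
Y = X_ext @ true_theta  # noiseless targets, so the exact fit is true_theta

# Plain batch gradient descent on the squared-error cost
theta = np.zeros((3, 1))
learning_rate = 0.5
for _ in range(5000):
    gradient = X_ext.T @ (X_ext @ theta - Y) / m
    theta = theta - learning_rate * gradient

# Closed-form least-squares solution for comparison
closed_form, *_ = np.linalg.lstsq(X_ext, Y, rcond=None)
max_diff = np.max(np.abs(theta - closed_form))
print(max_diff)  # should be close to zero
```

On a real dataset with the small default learning rate, many more iterations may be needed before the two solutions agree this closely.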

## Runner functions for Linear and Logistic Regressions

```python
import matplotlib.pyplot as plt

#######################--------Linear RUNNER---------###############################
def regression_runner(fileloc, data_split_ratios, seed_values):
    X_train, X_val, X_test, Y_train, Y_val, Y_test = dataprep(fileloc, data_split_ratios)
    all_models = []
    all_val_accuracies = []
    random_seeds = seed_values
    num_iters = 500
    x_axis = np.arange(num_iters)
    # Train one model per random seed and plot its cost curve
    for i in range(len(random_seeds)):
        model, train_accuracy, cost = linear_regression(X_train, Y_train, rand_seed=random_seeds[i], num_iters=num_iters)
        print("Trial:", i, ".Train Accuracy:", train_accuracy)
        all_models.append(model)
        plt.plot(x_axis, cost, label=str(random_seeds[i]))

        val_prediction, val_cost = evaluate(model, X_val, Y_val)
        # accuracy_calculator returns [accuracy, precision, recall, F1]
        accuracy_precision = accuracy_calculator(val_prediction, Y_val)
        all_val_accuracies.append(accuracy_precision)
        print("Validation Accuracy:", accuracy_precision)
        print("Validation Cost:", val_cost)

    #plt.legend()
    plt.title("Linear Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    # Pick the model with the highest validation accuracy (first entry of each metrics list)
    max_accuracy_idx = int(np.argmax([metrics[0] for metrics in all_val_accuracies]))
    best_model = all_models[max_accuracy_idx]
    print(best_model.shape)
    test_pred, test_cost = evaluate(best_model, X_test, Y_test)
    print(test_pred.shape, test_cost)
    test_accuracy, test_precision, test_recall, test_f1 = accuracy_calculator(test_pred, Y_test)
    print("Test accuracy:", test_accuracy, ".Test cost:", test_cost)

#####################-------------LOGISTIC RUNNER--------------##########################
def logistic_runner(fileloc, data_split_ratios, seed_values):
    X_train, X_val, X_test, Y_train, Y_val, Y_test = dataprep(fileloc, data_split_ratios)
    all_models = []
    all_val_accuracies = []
    random_seeds = seed_values
    num_iters = 1500
    x_axis = np.arange(num_iters)
    # Train one model per random seed and plot its cost curve
    for i in range(len(random_seeds)):
        model, train_accuracy, cost = logistic_regression(X_train, Y_train, rand_seed=random_seeds[i], num_iters=num_iters)
        print("Trial:", i, ".Train Accuracy:", train_accuracy)
        all_models.append(model)
        plt.plot(x_axis, cost, label=str(random_seeds[i]))

        val_prediction, val_cost = evaluate(model, X_val, Y_val)
        accuracy_precision = accuracy_calculator(val_prediction, Y_val)
        all_val_accuracies.append(accuracy_precision)
        print("Validation Accuracy:", accuracy_precision)
        print("Validation Cost:", val_cost)

    #plt.legend()
    plt.title("Logistic Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    # Pick the model with the highest validation accuracy
    max_accuracy_idx = int(np.argmax([metrics[0] for metrics in all_val_accuracies]))
    best_model = all_models[max_accuracy_idx]

    test_pred, test_cost = evaluate(best_model, X_test, Y_test)
    test_accuracy, test_precision, test_recall, test_f1 = accuracy_calculator(test_pred, Y_test)
    print("Test accuracy:", test_accuracy, ".Test cost:", test_cost)
```
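The model-selection step of the runners can be looked at in isolation. In the sketch below, the validation metrics and model placeholders are invented for illustration. Each row follows the `[accuracy, precision, recall, F1]` layout returned by `accuracy_calculator`.

```python
import numpy as np

# Hypothetical validation metrics, one row per random seed
all_val_metrics = [
    [0.70, 0.60, 0.50, 0.55],
    [0.74, 0.66, 0.52, 0.58],
    [0.72, 0.70, 0.45, 0.55],
]
all_models = ["model_seed_0", "model_seed_1", "model_seed_2"]  # placeholders

# Select by the accuracy column only; argmax returns the first best index
val_accuracies = np.array(all_val_metrics)[:, 0]
best_idx = int(np.argmax(val_accuracies))
best_model = all_models[best_idx]
print(best_idx, best_model)  # 1 model_seed_1
```

Using `np.argmax` on a single metric yields one plain integer index, which can be used directly to index the list of models.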

## Training Curves

Note that each of the two trainings below was performed with `10` different initial values of theta. The initial value of theta affects the overall training performance. The best of the `10`, judged by validation accuracy, was used for the final evaluation on the test dataset.

## Test Accuracies

```
Linear Regression:
Test accuracy: 0.7068965517241379 .Test cost: 0.14745936729023856

Logistic Regression:
Test accuracy: 0.646551724137931 .Test cost: 0.2865915372479961
```

## Assimilated code 