
Linear vs Logistic Regression, all in NumPy

The two entry-level machine learning algorithms, linear and logistic regression, are quite easy to understand and provide a good way to practice coding the general machine learning pipeline in its vectorized form. Namely,

  • Prepping the dataset, e.g. removing outliers, adding features (polynomial multiples of existing features), normalization, feature scaling.
  • Implementing the learning algorithm function.
  • Calculating the loss through the chosen loss function and hypothesis.
  • Optimization algorithm: updating your model's parameters based on the loss and the ground truth.
  • Maintaining a config for the whole training process (a minimal sketch follows this list).
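
To make the last point concrete, here is a minimal sketch of such a config. The keys and values are hypothetical (not from the original code), though they mirror the hyperparameters used later in this post:

# Hypothetical config collecting the knobs used throughout this post
CONFIG = {
    "data_file": "diabetes_scaled.csv",  # assumed path to the preprocessed CSV
    "split": (0.7, 0.15, 0.15),          # assumed train/val/test ratios
    "learning_rate": 0.001,
    "num_iters_linear": 500,
    "num_iters_logistic": 1500,
    "threshold": 0.5,
    "seeds": list(range(10)),            # 10 random initializations of theta
}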

An entry-level toy dataset, the Pima Indians Diabetes dataset, will be used for this tutorial. Each datapoint has a single target variable (a binary classification task: predicting whether a person has diabetes or not).

About the dataset

You can read all about the dataset on Kaggle. It has 768 datapoints and 8 features, which is quite a sweet spot for getting a decent model without worrying about underfitting. The data is about female patients, specifically their BMI, insulin level, age, skin thickness, glucose level, etc. The target tells whether the person has diabetes. 500 of the datapoints are non-diabetic and 268 are diabetic. The classes are not highly skewed, so plain test accuracy should suffice (the helper below reports precision, recall, and F1 score as well, but accuracy is the headline number).
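
A quick, hedged sanity check of that class balance (assumes the raw CSV has been downloaded from Kaggle; the filename is hypothetical):

import numpy as np

# Labels live in the last column; skip the header row
data = np.loadtxt("diabetes.csv", delimiter=",", skiprows=1)
labels = data[:, -1].astype(int)
print(np.bincount(labels))  # expected: [500 268]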

Helper Functions

For normalizing the dataset

This normalizes the data, bringing each feature into the range 0 to 1 (min-max scaling).

import gzip
import numpy as np
import matplotlib.pyplot as plt
from numpy import loadtxt, savetxt
from tqdm import tqdm

def scalify_min_max(np_dataframe):
    # Per-column min/max; scale every column into [0, 1]
    minimum_array = np.amin(np_dataframe, axis=0)
    maximum_array = np.amax(np_dataframe, axis=0)
    range_array = maximum_array - minimum_array
    scaled = (np_dataframe - minimum_array) / range_array
    return scaled
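
A quick usage sketch on a toy array (the values are made up):

toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(scalify_min_max(toy))
# Each column is scaled independently:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]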

For calculating the accuracy

def accuracy_calculator(Y_out, Y):
    # Fraction of predictions that match the ground truth
    accuracy = np.sum(np.logical_not(np.logical_xor(Y_out, Y))) / Y.shape[0]
    true_positives = np.sum(np.logical_and(Y_out, Y))
    false_positives = np.sum(np.logical_and(Y_out, np.logical_not(Y)))
    false_negatives = np.sum(np.logical_and(np.logical_not(Y_out), Y))
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    print("Precision:", precision, ". Recall:", recall)
    # F1 is the harmonic mean of precision and recall
    F1_score = 2 * precision * recall / (precision + recall)
    return [accuracy, precision, recall, F1_score]
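
A hedged sanity check with tiny hand-made arrays. Here TP = 2, FP = 1, FN = 1, so precision = recall = 2/3 and F1 = 2/3:

Y_out = np.array([1, 0, 1, 1])
Y = np.array([1, 1, 1, 0])
print(accuracy_calculator(Y_out, Y))
# [0.5, 0.666..., 0.666..., 0.666...]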

For preparing the dataset – creating train/val/test splits

def pre_data_prep(filename, dest_fileloc):
    # Read the gzipped CSV, skipping the header row
    with open(filename, 'rb') as f:
        gzip_fd = gzip.GzipFile(fileobj=f)
        next(gzip_fd)  # Skip the header row
        diabetes_df = loadtxt(gzip_fd, delimiter=',', dtype=np.float32)
    # Min-max scale the features; leave the target column untouched
    Y = diabetes_df[:, -1]
    scaled_diabetes_df = scalify_min_max(diabetes_df[:, :-1])
    concat_diabetes = np.concatenate((scaled_diabetes_df, np.array([Y]).T), axis=1)
    savetxt(dest_fileloc, concat_diabetes, delimiter=',')
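
A hypothetical invocation (both filenames are assumptions, not from the original post):

# One-time preprocessing: gzipped raw CSV in, min-max scaled CSV out
pre_data_prep("diabetes.csv.gz", "diabetes_scaled.csv")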

def dataprep(fileloc, split):
    assert len(split) == 3
    assert sum(split) == 1
    diabetes_data = loadtxt(fileloc, delimiter=',', dtype=np.float32)
    Y = np.array([diabetes_data[:, -1]]).T
    classes = np.unique(Y)
    assert len(classes) == 2
    X = diabetes_data[:, :-1]
    data_size = X.shape[0]
    print(data_size, X.shape, Y.shape)

    # Contiguous train/val/test split (note: the data is not shuffled)
    split_size = int(split[0] * data_size)
    val_split = int(split[1] * data_size)  # use the second ratio for validation
    X_train = X[:split_size]
    X_val = X[split_size:split_size + val_split]
    X_test = X[split_size + val_split:]
    Y_train = Y[:split_size]
    Y_val = Y[split_size:split_size + val_split]
    Y_test = Y[split_size + val_split:]
    return X_train, X_val, X_test, Y_train, Y_val, Y_test
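
A hedged usage example. The exact ratios are an assumption, though a 70/15/15 split of 768 datapoints yields a 116-point test set, which is consistent with the test accuracies reported at the end of this post:

X_train, X_val, X_test, Y_train, Y_val, Y_test = dataprep("diabetes_scaled.csv", (0.7, 0.15, 0.15))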

Evaluation function

For finding the accuracy and cost of a learned model on the validation or test dataset.

def evaluate(theta_params, X, Y=None, thresh=0.5):
    data_size = X.shape[0]
    # Add the bias column, then threshold the raw scores into 0/1 predictions
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1)
    pred = np.greater(np.matmul(X_extend, theta_params), thresh) * 1
    if Y is None:  # No labels supplied: predictions only
        return pred, None
    # Mean squared error (used for both models here, for comparability)
    cost = np.sum(np.square(np.matmul(X_extend, theta_params) - Y)) / (data_size * 2)
    return pred, cost

Logistic Regression Function
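
For reference, the loop below implements the standard sigmoid hypothesis, binary cross-entropy cost, and batch gradient descent update (m datapoints, learning rate \alpha):

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \right]

\theta \leftarrow \theta - \frac{\alpha}{m} X^T \big( h_\theta(X) - y \big)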

def sigmoid_func(theta, X):
    # Element-wise sigmoid of theta^T X
    retval = 1 / (1 + np.exp(-1 * np.matmul(theta.T, X)))
    return retval

def logistic_regression(X, Y, learning_rate=0.001, num_iters=100, thresh=0.5, rand_seed=None):
    if rand_seed is not None:  # For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params = np.array([np.random.randn(X.shape[1] + 1)]).T
    # Add bias column to X
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1).T
    cost = []  # Keep track of cost after each iteration of learning
    for i in tqdm(range(num_iters), desc="Training.."):
        h_theta = sigmoid_func(theta_params, X_extend).T  # shape (m, 1)
        grad = np.matmul(X_extend, (h_theta - Y)) / data_size  # (n, m) x (m, 1) = (n, 1)
        theta_params = theta_params - learning_rate * grad
        cost.append(-1 * np.sum(Y * np.log(h_theta) + (1 - Y) * np.log(1 - h_theta)) / data_size)
    final_pred = np.greater(np.matmul(X_extend.T, theta_params), thresh) * 1
    accuracy = np.sum(np.logical_not(np.logical_xor(final_pred, Y))) / data_size
    cost = np.array(cost)
    return theta_params, accuracy, cost

Linear Regression Function
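
For reference, the cost tracked below is the mean squared error and the update is the corresponding batch gradient step:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x_i - y_i \right)^2

\theta \leftarrow \theta - \frac{\alpha}{m} X^T \left( X\theta - y \right)

Since the targets are 0/1 labels, the regression output is simply thresholded at 0.5 to turn it into class predictions.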

def linear_regression(X, Y, learning_rate=0.001, num_iters=100, thresh=0.5, rand_seed=None):
    if rand_seed is not None:  # For reproducible results
        np.random.seed(rand_seed)
    data_size = X.shape[0]
    theta_params = np.array([np.random.randn(X.shape[1] + 1)]).T
    # Add bias column to X
    X_extend = np.concatenate((np.ones((data_size, 1)), X), axis=1)
    cost = []
    for i in tqdm(range(num_iters), desc="Training.."):
        # Gradient step: theta <- theta - (alpha/m) * X^T (X theta - y)
        grad = np.matmul((np.matmul(theta_params.T, X_extend.T) - Y.T), X_extend).T / data_size
        theta_params = theta_params - learning_rate * grad
        # Mean squared error over all datapoints
        cost.append(np.sum(np.square(np.matmul(X_extend, theta_params) - Y)) / (data_size * 2))
    # Threshold the regression output to get 0/1 predictions
    final_pred = np.greater(np.matmul(X_extend, theta_params), thresh) * 1
    accuracy = np.sum(np.logical_not(np.logical_xor(final_pred, Y))) / data_size
    cost = np.array(cost)
    return theta_params, accuracy, cost

Runner functions for Linear and Logistic Regressions

#######################--------Linear RUNNER---------###############################
def regression_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model, train_accuracy, cost = linear_regression(X_train, Y_train, rand_seed=random_seeds[i], num_iters=num_iters)
        print("Trial:", i, ". Train Accuracy:", train_accuracy)
        all_models.append(model)
        plt.plot(x_axis, cost, label=str(random_seeds[i]))

        val_prediction, val_cost = evaluate(model, X_val, Y_val)
        accuracy_precision = accuracy_calculator(val_prediction, Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation [accuracy, precision, recall, F1]:", accuracy_precision)
        print("Validation Cost:", val_cost)

    #plt.legend()
    plt.title("Linear Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    # Pick the model with the best validation accuracy for the final test run
    max_accuracy_idx = int(np.argmax(all_val_accuracies))
    best_model = all_models[max_accuracy_idx]
    test_pred, test_cost = evaluate(best_model, X_test, Y_test)
    test_accuracy, test_precision, test_recall, test_f1 = accuracy_calculator(test_pred, Y_test)
    print("Test accuracy:", test_accuracy, ". Test cost:", test_cost)

#####################-------------LOGISTIC RUNNER--------------##########################
def logistic_runner(fileloc,data_split_ratios,seed_values):
    X_train,X_val,X_test,Y_train,Y_val,Y_test = dataprep(fileloc,data_split_ratios)
    all_models=[]
    all_val_accuracies=[]
    random_seeds=seed_values
    num_iters=1500
    x_axis=np.arange(num_iters)
    for i in range(len(random_seeds)):
        model, train_accuracy, cost = logistic_regression(X_train, Y_train, rand_seed=random_seeds[i], num_iters=num_iters)
        print("Trial:", i, ". Train Accuracy:", train_accuracy)
        all_models.append(model)
        plt.plot(x_axis, cost, label=str(random_seeds[i]))

        val_prediction, val_cost = evaluate(model, X_val, Y_val)
        accuracy_precision = accuracy_calculator(val_prediction, Y_val)
        all_val_accuracies.append(accuracy_precision[0])
        print("Validation [accuracy, precision, recall, F1]:", accuracy_precision)
        print("Validation Cost:", val_cost)
    #plt.legend()
    plt.title("Logistic Regression")
    plt.xlabel('Number of iterations')
    plt.ylabel('Cost')
    plt.show()
    # Pick the model with the best validation accuracy for the final test run
    max_accuracy_idx = int(np.argmax(all_val_accuracies))
    best_model = all_models[max_accuracy_idx]

    test_pred, test_cost = evaluate(best_model, X_test, Y_test)
    test_accuracy, test_precision, test_recall, test_f1 = accuracy_calculator(test_pred, Y_test)
    print("Test accuracy:", test_accuracy, ". Test cost:", test_cost)

Training Curves

Note that each of the two trainings below was performed with 10 different initial values of theta, since the initialization affects overall training performance. The best of the 10 (by validation accuracy) was used for the final evaluation on the test dataset.

[Figure: linear regression training curves, cost vs. number of iterations]
[Figure: logistic regression training curves, cost vs. number of iterations]

Test Accuracies

Linear Regression:
Test accuracy: 0.7068965517241379. Test cost: 0.14745936729023856

Logistic Regression:
Test accuracy: 0.646551724137931. Test cost: 0.2865915372479961

Complete code

Gist Link

Additional Notes

  • The above code does not use regularization (a sketch of adding L2 regularization follows this list).
  • It may appear that training was stopped prematurely for a few of the curves, but in fact the test results were closer to optimal with the above training parameters.
  • Although linear regression appears to perform better in this case, it might give poorer results on other datasets.
  • Note that logistic regression took 3x as many iterations as linear regression to converge.
  • Initial model parameters (theta) are chosen randomly by varying the seed value. Since the initialization affects overall training performance, 10 such seeds were tried.
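
As mentioned in the first note, regularization is not used above. A minimal sketch of what L2 regularization would add to the logistic gradient, assuming a hypothetical reg_lambda parameter (the bias term is conventionally left unregularized):

def l2_regularized_grad(X_extend, h_theta, Y, theta_params, reg_lambda, data_size):
    # Plain cross-entropy gradient, exactly as in logistic_regression above
    grad = np.matmul(X_extend, (h_theta - Y)) / data_size
    # L2 penalty on all weights except the bias (row 0 of theta)
    grad[1:] += (reg_lambda / data_size) * theta_params[1:]
    return grad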
