Home Credit Default Risk Competition on Kaggle



Banks take on significant risk when they lend money. Personal loans are a growing category of debt, which means the risk of default is growing as well. Assessing default risk without a credit score is difficult, so other factors such as financial and socioeconomic history may be used instead. For this Kaggle competition we took the following steps:

1) Exploring the data with an exploratory notebook in order to learn about the dataset.

2) Engineering useful features for our machine learning models by manipulating tabular data with pandas and numpy.

3) Creating three machine learning models: Logistic Regression, Random Forest, and a Multilayer Perceptron Neural Network.

4) Using Kaggle to score our models and measure progress.

Exploratory Analysis

Here we explore our dataset to find out what general ideas and inferences we can draw from it. It is extremely beneficial to learn about the data you are working with in order to build some domain knowledge. This will help us decide which variables to feed our machine learning models; variables more closely related to the target will yield better learning and predictive results.



Defining a function that will find missing values

# MISSING VALUE CHECK
import pandas as pd

# Define function
def check_missing_value(df):
    # Count how many values are missing in each column
    missing_values = df.isnull().sum()
    # Percentage of missing values per column
    missing_values_percent = 100 * missing_values / len(df)
    # Combine counts and percentages into one table
    missing_values_table = pd.concat([missing_values, missing_values_percent], axis=1)
    # Rename the columns
    missing_values_table = missing_values_table.rename(columns={0: 'Missing values', 1: '% of total values'})
    # Return the summary table
    return missing_values_table

# Create a dataframe using the previously defined function to summarize the missing data
expl_missing_values_df = check_missing_value(apptrain)

# Keep only columns that actually have missing values, then sort by percentage missing
mvdf = expl_missing_values_df.loc[~(expl_missing_values_df == 0).all(axis=1)]
mvdf.sort_values(by=['% of total values'], ascending=False).head(30)
  

Here we define a function that lets us take a look at the missing values within the dataframe. This is beneficial because it lets us make an informed decision about the missing data: whether the model would benefit more from dropping a feature altogether or from imputing the missing values.



Function Results

The results of the function show we are missing over half the data in certain columns. The machine learning library we are using does not accept NaN values as input to its algorithms, so we must decide what to do with the missing data before training. The approach we took was to impute the missing values using a median strategy, which lets us keep the columns and still get a good sense of each feature's relevance to the target.
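
As a minimal sketch of that imputation step, assuming scikit-learn's SimpleImputer and the apptrain dataframe from above (the exact columns imputed depend on the feature set chosen later):

# Median imputation sketch: fill NaNs in the numeric columns with each column's median
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
numeric_columns = apptrain.select_dtypes('number').columns
apptrain[numeric_columns] = imputer.fit_transform(apptrain[numeric_columns])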



Target Balance

In our case it is helpful to know the ratio of borrowers who default versus those who do not, simply because it lets us look at our machine learning models' output and judge whether the predictions are a plausible interpretation of the data.
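
A quick way to check that balance, assuming the TARGET column of the training table:

# Count and proportion of each class in the target column
print(apptrain['TARGET'].value_counts())
print(apptrain['TARGET'].value_counts(normalize=True))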



Correlation

correlations = apptrain.corr()['TARGET'].sort_values()
print('Top 15 positive correlations', correlations.tail(15))
print('Top 15 negative correlations', correlations.head(15))
                                

Running the code above yields the following results.

The correlation serves as a starting point that tells us which features to focus on. The range, -1 to 1, indicates a strongly negative or strongly positive correlation at the extremes, and a strong correlation is good for a machine learning model because it suggests a possible relationship in the data. However, this makes the large assumption that the relationship is linear, which may not be the case. We can see from the low values that we do not have strong linear correlations. The result is still helpful for identifying features whose correlation is extremely negligible.



Age Dependent Analysis

Here we categorized different age groups and created a bar graph for visual analysis. Age can be closely related to other features, so we must be careful to avoid collinearity, which would add noise to our machine learning model.
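
A minimal sketch of the binning step, assuming the DAYS_BIRTH column (recorded as negative days in this dataset) and the hypothetical column names YEARS_BIRTH and AGE_GROUP; plotting requires matplotlib:

# Convert DAYS_BIRTH (negative days) to years, then bin into age groups
apptrain['YEARS_BIRTH'] = apptrain['DAYS_BIRTH'] / -365
apptrain['AGE_GROUP'] = pd.cut(apptrain['YEARS_BIRTH'], bins=[20, 30, 40, 50, 60, 70])

# Average default rate per age group, ready to plot as a bar graph
apptrain.groupby('AGE_GROUP')['TARGET'].mean().plot(kind='bar')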

Feature Engineering

Kaggle competitions and machine learning come down to two things: feature engineering and model tuning. Effort is spent using domain knowledge to build features that relate to the target variable, and a lot of effort is also spent discarding low-correlation features to prevent overfitting.

# Creating copies of the train and test data
apptrain_domain = apptrain.copy()
apptest_domain = apptest.copy()

# Creating domain features for the train data
apptrain_domain['CREDIT_INCOME_PERCENT'] = apptrain_domain['AMT_CREDIT'] / apptrain_domain['AMT_INCOME_TOTAL']
apptrain_domain['ANNUITY_INCOME_PERCENT'] = apptrain_domain['AMT_ANNUITY'] / apptrain_domain['AMT_INCOME_TOTAL']
apptrain_domain['CREDIT_TERM'] = apptrain_domain['AMT_ANNUITY'] / apptrain_domain['AMT_CREDIT']
apptrain_domain['DAYS_EMPLOYED_PERCENT'] = apptrain_domain['DAYS_EMPLOYED'] / apptrain_domain['DAYS_BIRTH']

# Creating the same features for the test data (formulas must match the train data exactly)
apptest_domain['CREDIT_INCOME_PERCENT'] = apptest_domain['AMT_CREDIT'] / apptest_domain['AMT_INCOME_TOTAL']
apptest_domain['ANNUITY_INCOME_PERCENT'] = apptest_domain['AMT_ANNUITY'] / apptest_domain['AMT_INCOME_TOTAL']
apptest_domain['CREDIT_TERM'] = apptest_domain['AMT_ANNUITY'] / apptest_domain['AMT_CREDIT']
apptest_domain['DAYS_EMPLOYED_PERCENT'] = apptest_domain['DAYS_EMPLOYED'] / apptest_domain['DAYS_BIRTH']

Domain Knowledge

With our limited knowledge of the banking/financial industry, we created features whose relationship to the target is intuitive. These domain features rank near the top of the correlation results given by pandas.
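
As a quick check, a minimal sketch of how the new features can be compared against the target, assuming the apptrain_domain dataframe from the block above:

# Correlation of the hand-made domain features with the target
domain_features = ['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT', 'TARGET']
print(apptrain_domain[domain_features].corr()['TARGET'].sort_values())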



External Sources and Their Correlation to Target



During exploratory data analysis, we found that features beginning with EXT_SOURCE have relatively high correlation to the target variable. We can attempt to amplify the relationship by squaring or cubing these values, as well as multiplying them with each other. In other words, we are expanding the feature set with polynomial features.
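
A minimal sketch of that expansion, assuming the three EXT_SOURCE columns and a degree of 3 (the exact columns and degree are our assumptions, not shown in the report):

# Polynomial feature expansion sketch
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

ext_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']

# Impute missing values before the polynomial expansion
poly_imputer = SimpleImputer(strategy='median')
ext_features = poly_imputer.fit_transform(apptrain[ext_cols])

# Generate squares, cubes, and interaction terms of the EXT_SOURCE features
poly_transformer = PolynomialFeatures(degree=3)
ext_poly = poly_transformer.fit_transform(ext_features)

# Inspect the generated feature names (requires scikit-learn >= 1.0)
print(poly_transformer.get_feature_names_out(ext_cols))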



Automated Feature Engineering

Automated feature engineering is an excellent way to generate many features very quickly. For this task, we employed a Python library called Featuretools. With Featuretools we were theoretically able to create 1,700+ features by relating the tables in our dataset. However, because of hardware limitations we were not able to compute those features, as our data was already quite sizable. With access to appropriate hardware, we would have been able to perform feature selection over roughly 3,000,000 rows by 1,700 features.
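
The report does not show the Featuretools code that was attempted, but a minimal sketch of the idea, assuming the Featuretools 1.x API and the competition's bureau table (keyed to applications by SK_ID_CURR), might look like this:

# Sketch only: relate the application table to a child table and let DFS build aggregate features
import featuretools as ft

es = ft.EntitySet(id='home_credit')
es = es.add_dataframe(dataframe_name='app', dataframe=apptrain, index='SK_ID_CURR')
es = es.add_dataframe(dataframe_name='bureau', dataframe=bureau, index='SK_ID_BUREAU')
es = es.add_relationship('app', 'SK_ID_CURR', 'bureau', 'SK_ID_CURR')

# Deep Feature Synthesis: aggregation primitives applied across the relationship
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_dataframe_name='app',
                                       agg_primitives=['mean', 'max', 'min', 'sum', 'count'],
                                       max_depth=2)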


Logistic Regression Model

Logistic regression is a machine learning model that makes predictions according to a linear function of the input features, passed through the logistic (sigmoid) function. Despite its name, it is a classification model, and here we utilize it as a classifier.




Model Tuning

For our logistic regression model, we opted for strong regularization in order to improve generalization performance. In scikit-learn the parameter C is the inverse of the regularization strength, so lower values mean stronger regularization; we used C = 0.0001 to prevent overfitting.

from sklearn.linear_model import LogisticRegression

# Develop the model with our chosen parameters
logistic_regression = LogisticRegression(C=0.0001, verbose=2, n_jobs=-1)

# Train the model on the scaled training data
logistic_regression.fit(training_data_scaled, train_labels)



The default parameters given to us were a good fit for our model. For regularization, scikit-learn defaults to L2, which does not assume that only a few features are important. Since we have a large dataset and we know more than a few features will be relevant to the target variable, the default regularization type is sufficient. It is hard to visualize a logistic regression model with more than two feature dimensions, but our model can apply many features as: y = w[0] * x[0] + w[1] * x[1] + ... + w[n] * x[n] + b, with the logistic function turning that score into a probability.
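
One way to connect that formula back to the fitted model, as a minimal sketch assuming a feature_names list matching the columns of training_data_scaled:

import pandas as pd

# w[0..n] and b from the fitted model
weights = logistic_regression.coef_[0]
bias = logistic_regression.intercept_[0]

# Pair each weight with its feature name and show the largest magnitudes
coef_table = pd.DataFrame({'feature': feature_names, 'weight': weights})
print(coef_table.reindex(coef_table['weight'].abs().sort_values(ascending=False).index).head(15))
print('bias (b):', bias)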



Kaggle Score



We submitted two models using logistic regression. One was submitted untuned and with limited feature engineering; the other was tuned and used features that were engineered manually and selected via Random Forest feature importance.

Random Forest

Random Forest is a great classifier to use in this particular instance. It is powerful, widely implemented, and performs well even with default settings because it averages over many trees. Averaging many trees, each trained on a different bootstrap sample of the data, reduces the overfitting seen in individual deep trees.

Random forests are great classifiers. They are comprised of an ensemble of decision trees, hence the name "Random Forest". The "random" part of the name comes from the fact that each split searches for the best feature among only a random subset of the features. A great quality of Random Forest that we took advantage of for our other models is feature importance: we took the features the forest considered important and fed them back into the algorithm, as well as into the Logistic Regression and MLP Neural Net models.
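
As a minimal sketch of that selection step, assuming the fitted Random_Forest from the Model Tuning block below and a feature_names list matching the columns of training_data:

import pandas as pd

# Rank features by the forest's impurity-based importance
importances = pd.DataFrame({'feature': feature_names,
                            'importance': Random_Forest.feature_importances_})
importances = importances.sort_values('importance', ascending=False)
print(importances.head(20))

# Keep only the most important features for the other models (cutoff of 50 is an assumption)
selected_features = importances['feature'].head(50).tolist()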




Model Tuning

For tuning the random forest, n_estimators was both our biggest friend and our biggest enemy. Increasing n_estimators increases the number of trees in the forest, which helps counter the overfitting that happens in deep decision trees; however, computation time increases greatly. On our i7-4790K it took about 90 minutes to compute 10,000 trees with the n_jobs parameter set to -1.

from sklearn.ensemble import RandomForestClassifier

# Declare the model, tune parameters, fit data

Random_Forest = RandomForestClassifier(n_estimators = 1000, verbose = 1, n_jobs = -1, max_features = 'auto')
Random_Forest.fit(training_data, train_labels)
                                        




Kaggle Score

Our Random Forest model scored 0.678 with just the base features, default settings, and very few trees. We strongly believe that with better hardware we could increase this score with minimal feature engineering.


Multilayer Perceptron Neural Network

The Multilayer Perceptron Neural Network is part of scikit-learn's neural network module.



The multilayer perceptron neural network is efficient on large datasets, can build very complex models, and has the most parameters to tune. Having so many parameters does mean there is more chance of a bad configuration. With this model we had to take feature scaling into account, and although we believe Random Forest is our strongest model, the Multilayer Perceptron Neural Network does boast a high base score.




Model Tuning

# Impute missing values and scale features to [0, 1] before training the network
# (SimpleImputer replaces the deprecated sklearn.preprocessing.Imputer)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

imputer = SimpleImputer(strategy='median')

fe_training_data = imputer.fit_transform(fe_training_data)
fe_testing_data = imputer.transform(fe_testing_data)

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(fe_training_data)
fe_training_data = scaler.transform(fe_training_data)
fe_testing_data = scaler.transform(fe_testing_data)

from sklearn.neural_network import MLPClassifier

# Declare the network, tune parameters, and fit the data
mlp = MLPClassifier(activation='relu', learning_rate='adaptive', max_iter=1000, verbose=True)
mlp.fit(fe_training_data, train_labels)


To tune our neural network, we allowed up to 1,000 iterations; although we allowed many iterations, training loss did not improve significantly over two consecutive epochs, so training was cut short. We chose rectified linear units as our activation function, which made sense since we scaled our data from zero to one. Our learning rate was set to 'adaptive', which lowers the learning rate when training loss stops improving (in scikit-learn this setting only takes effect with the SGD solver).
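
For completeness, a minimal sketch of how the fitted network's probabilities could be turned into a Kaggle submission, assuming apptest still holds the SK_ID_CURR column that pairs row-for-row with fe_testing_data:

# Predicted probability of default for each test applicant
test_probabilities = mlp.predict_proba(fe_testing_data)[:, 1]

# Build the submission file in the format Kaggle expects (SK_ID_CURR, TARGET)
submission = apptest[['SK_ID_CURR']].copy()
submission['TARGET'] = test_probabilities
submission.to_csv('mlp_submission.csv', index=False)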



Kaggle Score



With feature engineering, the MLP Neural Network gave us our highest Kaggle public score. This is not surprising given how many features we had and how complex the observations were. However, we still believe that with even more feature engineering and better hardware, Random Forest would edge out the Neural Network.