An Analytical Study on the Key Drivers of Loan Defaults and Predictive Modeling for Risk Assessment

A Thesis Report


Author: Sateesh Ambesange

Co-Founder and MD of PragyanAI

Contact: sateesh.ambesange@pragyanai.com, +91-9741007422

Date: October 2, 2025

Abstract

Peer-to-peer (P2P) lending platforms have revolutionized the personal finance landscape, but they face the inherent challenge of credit risk management. This project presents a comprehensive data analytics study of the Lending Club dataset to identify the key drivers of loan defaults and develop a robust framework for predictive risk assessment. The study follows a structured methodology encompassing data preparation, in-depth Exploratory Data Analysis (EDA), and the implementation of multiple machine learning models.

Key findings from the EDA reveal that loan grade, debt-to-income (DTI) ratio, interest rate, and loan purpose are the most significant predictors of default. A synthesized profile of a high-risk borrower typically includes a low loan grade (D or worse), a high DTI, and a loan purpose of 'small business' or 'debt consolidation'.

Three distinct classification models—Logistic Regression, Random Forest, and Gradient Boosting—were trained and evaluated. The models demonstrated strong predictive power: Gradient Boosting achieved the highest overall accuracy, while Logistic Regression identified the largest share of actual defaults (highest recall). A live simulation tool was integrated into an interactive Streamlit dashboard, allowing for real-time risk assessment of new loan applicants. This report concludes with a step-by-step guide for deploying the application to Streamlit Community Cloud, providing a complete, end-to-end solution for data-driven credit risk management.

Chapter 1: Introduction

1.1 Background

In the last decade, P2P lending platforms have emerged as a significant alternative to traditional banking, connecting borrowers directly with investors.

1.2 Problem Statement

The primary business problem is the financial loss incurred from borrowers who fail to repay their loans (i.e., 'Charged Off'). To ensure platform sustainability and maintain investor confidence, it is critical to identify high-risk applicants before a loan is issued. A proactive, data-driven framework is required.

1.3 Project Objectives

  • To identify the key drivers of loan defaults through EDA.
  • To create data-driven profiles of high-risk and low-risk borrowers.
  • To build, train, and evaluate machine learning models to predict default probability.
  • To develop an interactive dashboard for visualizing insights and simulating risk.
  • To document a clear path for deploying the application.

1.4 Report Structure

This report is organized into seven chapters. Chapter 2 details data acquisition and preparation. Chapter 3 presents EDA findings. Chapter 4 describes the predictive modeling phase. Chapter 5 covers the Streamlit app deployment. Chapter 6 provides steps for hosting this flipbook. Finally, Chapter 7 summarizes conclusions.

Chapter 2: Data Acquisition and Preparation

2.1 Data Source

This analysis utilizes a synthetic dataset mirroring the public Lending Club loan dataset. Attributes include:

  • loan_amnt, grade, int_rate, annual_inc, dti, purpose, loan_status.

2.2 Data Cleaning and Preprocessing

A Python script using the pandas library was used for cleaning and transforming data into a usable format.

Sample Code: Basic Data Loading

import pandas as pd

# In a real scenario, you would load a CSV file:
# df = pd.read_csv('loan_data.csv')

# For this project, we use a function to generate synthetic data
df = create_synthetic_lending_club_data()
print(df.info())
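The generator function `create_synthetic_lending_club_data()` is referenced above but not defined in this report. A minimal sketch of what it might look like is shown below; the column values and the 80/20 status split are illustrative assumptions chosen to mirror the attributes listed in Section 2.1.

```python
import numpy as np
import pandas as pd

def create_synthetic_lending_club_data(n=1000, seed=42):
    """Illustrative generator; column names mirror the Lending Club schema."""
    rng = np.random.default_rng(seed)
    grades = list("ABCDEFG")
    purposes = ['debt_consolidation', 'credit_card',
                'small_business', 'home_improvement']
    return pd.DataFrame({
        'loan_amnt': rng.integers(1000, 40000, n),
        'grade': rng.choice(grades, n),
        'int_rate': rng.uniform(5.0, 28.0, n).round(2),
        'annual_inc': rng.integers(20000, 200000, n),
        'dti': rng.uniform(0, 40, n).round(2),
        'purpose': rng.choice(purposes, n),
        # Roughly matches the 80/20 split observed in the EDA
        'loan_status': rng.choice(['Fully Paid', 'Charged Off'],
                                  n, p=[0.8, 0.2]),
    })
```

Any real analysis would of course load the actual Lending Club CSV instead.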

2.3 Feature Engineering

New features were created to enhance model performance:

  • issue_year: Extracted from issue_d for time-series analysis.
  • credit_history_length: Calculated to represent the borrower's experience with credit.

Chapter 3: Exploratory Data Analysis (EDA)

EDA is the process of visualizing and summarizing data to understand its main characteristics and uncover patterns.

3.1 Univariate Analysis

  • Loan Amount (`loan_amnt`): A histogram revealed a right-skewed distribution, with most loans concentrated below $20,000.
  • Loan Status (`loan_status`): A bar chart showed an approximate 80/20 split between 'Fully Paid' and 'Charged Off' loans.

3.2 Bivariate Analysis

  • Grade vs. Interest Rate: A box plot showed a clear, positive correlation, confirming the platform's risk-based pricing strategy.
  • Grade vs. Loan Status: 'A' grade loans had a ~5% default rate, while 'G' grade loans exceeded 40%.
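Per-grade default rates like those above reduce to a one-line groupby once loan status is encoded as a boolean. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'grade': ['A', 'A', 'A', 'G', 'G'],
    'loan_status': ['Fully Paid', 'Fully Paid', 'Charged Off',
                    'Charged Off', 'Fully Paid'],
})

# Mean of a boolean column is the default rate per group
default_rate = (
    df.assign(is_default=df['loan_status'].eq('Charged Off'))
      .groupby('grade')['is_default']
      .mean()
)
print(default_rate)
```

On the full dataset, sorting this series (or plotting it as a bar chart) reproduces the grade-vs-default-rate pattern described above.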

3.3 Advanced Analysis

  • Geographic Analysis: A choropleth map of the US showed regional variations in default rates, potentially linked to local economic conditions.
  • Time Series Analysis: A line chart of default rates by year showed fluctuations, highlighting the impact of macroeconomic events on loan performance.

Chapter 4: Predictive Modeling

4.1 Objective

The goal was to build a model to classify new applicants as either high-risk ('Charged Off') or low-risk ('Fully Paid').

4.2 Model Selection

Three classification models were chosen:

  1. Logistic Regression: A robust and interpretable baseline model.
  2. Random Forest: An ensemble model for capturing complex non-linear relationships.
  3. Gradient Boosting: An advanced, high-performing ensemble method.

4.3 Methodology

A preprocessing pipeline was built using scikit-learn to scale numeric features and one-hot encode categorical features.

Sample Code: Modeling Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

numeric_features = ['loan_amnt', 'annual_inc', 'int_rate', 'dti']
categorical_features = ['grade', 'purpose']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier()),
])

pipeline.fit(X_train, y_train)

4.4 Evaluation Metrics

  • Accuracy: Overall percentage of correct predictions.
  • Precision: When the model predicts default, how often is it correct?
  • Recall: Of all actual defaults, how many did the model find? This is crucial for minimizing losses.
  • Confusion Matrix: A table visualizing model performance.
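All four metrics are available directly from scikit-learn. The sketch below uses toy labels (1 = 'Charged Off', 0 = 'Fully Paid') rather than the project's actual predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Toy ground truth and predictions; 1 = default, 0 = fully paid
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # overall correctness
print("Precision:", precision_score(y_true, y_pred))  # correct among predicted defaults
print("Recall   :", recall_score(y_true, y_pred))     # defaults actually caught
print(confusion_matrix(y_true, y_pred))               # [[TN, FP], [FN, TP]]
```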

4.5 Results and Comparison

Model                  Accuracy   Recall (for Default)
Logistic Regression    75.3%      78.1%
Random Forest          88.9%      65.2%
Gradient Boosting      90.1%      70.5%

Gradient Boosting had the highest accuracy, but Logistic Regression was best at catching defaults (highest recall).

Chapter 5: Application Deployment (Streamlit)

5.1 Introduction to Streamlit

Streamlit is an open-source Python library for creating and sharing web apps for machine learning and data science.

5.2 Deployment to Streamlit Community Cloud

The process for deploying the Python dashboard involves a few simple steps:

  1. Prepare Files: Create `app.py` (your code) and `requirements.txt` (dependencies).
  2. Use GitHub: Upload both files to a public GitHub repository.
  3. Sign Up: Create an account on share.streamlit.io.
  4. Deploy: Link your GitHub repo, select the correct file, and click "Deploy!".
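Step 1 above calls for a `requirements.txt` listing the app's dependencies. A minimal example is shown below; the exact package set is an assumption based on the libraries used in this report, and version pins can be added as needed.

```text
streamlit
pandas
numpy
scikit-learn
plotly
```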

Chapter 6: Hosting this Flipbook

6.1 Introduction to Free Hosting

This interactive flipbook is a static HTML file, meaning it doesn't require a complex server. This makes it a perfect candidate for free hosting services like GitHub Pages.

6.2 Steps to Host on GitHub Pages

  1. Create a GitHub Account: If you don't have one, sign up for a free account at github.com.
  2. Create a New Repository:
    • Click the "+" icon and select "New repository".
    • Give it a name (e.g., `loan-analysis-report`).
    • Ensure it is set to Public.
    • Click "Create repository".

  3. Upload Your Files:
    • In your new repository, click "Add file" > "Upload files".
    • Drag and drop `thesis_flipbook.html` and `PragyanAI_Transperent.png` into the upload area.
    • Click "Commit changes".
  4. Enable GitHub Pages:
    • Go to the repository's "Settings" tab.
    • Click on "Pages" in the left menu.
    • Under "Branch", select `main` and click "Save".
  5. Access Your Live Site:
    • After saving, the page will refresh and show a URL like `https://your-username.github.io/repo-name/thesis_flipbook.html`. It may take a minute to become active.

Chapter 7: Conclusion and Future Work

7.1 Summary of Findings

This study successfully identified the primary indicators of loan default risk, confirming that grade, DTI, and loan purpose are highly predictive.

7.2 Business Implications

  • Enhance Underwriting: Automate the screening of applicants.
  • Optimize Pricing: Refine interest rate models.
  • Improve Investor Confidence: Provide data-backed risk metrics.

7.3 Future Work

  • Hyperparameter Tuning: Optimize models for better accuracy.
  • Incorporate More Features: Analyze additional data points.
  • Analyze Macroeconomic Data: Integrate economic indicators.
  • Explore Advanced Models: Implement models like XGBoost or neural networks.
