Credit Card Fraud Detection

潘彥銘
7 min read · Mar 1, 2021

This article is about my practice on the Kaggle imbalanced dataset — Credit Card Fraud Detection.

Import packages

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import NearMiss
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

Simple EDA

Import data

all_data = pd.read_csv('creditcard.csv')
all_data.shape

Output

(284807, 31)

Inspect the columns

all_data.columns

Output

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'], dtype='object')

The dataset has 31 columns in total:

  • Time: number of seconds elapsed between this transaction and the first transaction in the dataset
  • V1~V28: the result of a PCA dimensionality reduction
  • Amount: transaction amount
  • Class: label of the transaction. 1 for fraudulent transactions, 0 otherwise

There are no missing values in this dataset.
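A quick way to confirm that claim (a minimal sanity check, not part of the original notebook) is to count nulls across every column:

all_data.isnull().sum().sum()  # returns 0 when there are no missing values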

Inspect the number of transactions in each class

sns.countplot(x='Class', data=all_data)

Output

There are 284315 non-fraudulent and 492 fraudulent transactions. Clearly, this dataset is highly imbalanced.

non_fraud_num, fraud_num = all_data['Class'].value_counts()
print('Non-fraud ratio: {} %'.format(round(non_fraud_num/all_data.shape[0]*100, 2)))
print('Fraud ratio: {} %'.format(round(fraud_num/all_data.shape[0]*100, 2)))

Output

Non-fraud ratio: 99.83 %
Fraud ratio: 0.17 %
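The same percentages can be obtained more directly (a small alternative, not used in the rest of this article) with the normalize option of value_counts:

all_data['Class'].value_counts(normalize=True) * 100  # percentage of each class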

Split the dataset (training & testing)

I split the data into a training set and a testing set, each with the same label distribution as the original dataset. The training set is three times as large as the testing set.

X, y = all_data.iloc[:, 0:-1], all_data.iloc[:, -1]
orig_x_train, orig_y_train, orig_x_test, orig_y_test = pd.DataFrame(), pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
orig_train_idx, orig_test_idx = [], []
skf = StratifiedKFold(n_splits=4)
for train_idx, test_idx in skf.split(X, y):
    orig_x_train, orig_y_train = X.iloc[train_idx], y.iloc[train_idx]
    orig_x_test, orig_y_test = X.iloc[test_idx], y.iloc[test_idx]
    orig_train_idx, orig_test_idx = train_idx, test_idx
    break

train_non_fraud_num, train_fraud_num = orig_y_train.value_counts()
test_non_fraud_num, test_fraud_num = orig_y_test.value_counts()
print('Training set non-fraud ratio: {:.2f} %'.format(train_non_fraud_num/len(train_idx)*100))
print('Training set fraud ratio: {:.2f} %'.format(train_fraud_num/len(train_idx)*100))
print('Testing set non-fraud ratio: {:.2f} %'.format(test_non_fraud_num/len(test_idx)*100))
print('Testing set fraud ratio: {:.2f} %'.format(test_fraud_num/len(test_idx)*100))

Output

Training set non-fraud ratio: 99.83 %
Training set fraud ratio: 0.17 %
Testing set non-fraud ratio: 99.83 %
Testing set fraud ratio: 0.17 %
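For reference, a stratified 75/25 split can also be produced in one call with train_test_split (an alternative sketch, not the code used above; the alt_ names and random_state are just for illustration):

from sklearn.model_selection import train_test_split

# stratify=y keeps the fraud ratio identical in the training and testing sets
alt_x_train, alt_x_test, alt_y_train, alt_y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)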

Feature scaling

Here I scale the Time and Amount features. I fit each scaler on the training set only and then use it to transform the testing set, to prevent data leakage.

orig_x_train_deep = orig_x_train.copy(deep=True)
orig_x_test_deep = orig_x_test.copy(deep=True)
time_transformer = RobustScaler()
orig_x_train_deep = orig_x_train_deep.assign(Scaled_time=time_transformer.fit_transform(orig_x_train_deep['Time'].values.reshape(-1,1)))
orig_x_test_deep = orig_x_test_deep.assign(Scaled_time=time_transformer.transform(orig_x_test_deep['Time'].values.reshape(-1,1)))
amount_transformer = RobustScaler()
orig_x_train_deep = orig_x_train_deep.assign(Scaled_amount=amount_transformer.fit_transform(orig_x_train_deep['Amount'].values.reshape(-1,1)))
orig_x_test_deep = orig_x_test_deep.assign(Scaled_amount=amount_transformer.transform(orig_x_test_deep['Amount'].values.reshape(-1,1)))
orig_x_train_deep.drop(['Time', 'Amount'], axis=1, inplace=True)
orig_x_test_deep.drop(['Time', 'Amount'], axis=1, inplace=True)
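For reference, RobustScaler centers each feature on its median and divides by the interquartile range, so its effect on Amount is roughly equivalent to this manual sketch (an illustration only, assuming the default 25%-75% quantile range; variable names are mine):

import numpy as np

amount = orig_x_train['Amount'].values
median = np.median(amount)
iqr = np.percentile(amount, 75) - np.percentile(amount, 25)
scaled_amount = (amount - median) / iqr  # approximately what RobustScaler computes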

Logistic regression

# Logistic Regression 
orig_log_reg = LogisticRegression(max_iter=1000)
orig_log_reg.fit(orig_x_train_deep.values, orig_y_train.values.ravel())
# Prediction
orig_y_pred = orig_log_reg.predict(orig_x_test_deep)
print('Classification report: \n\n', classification_report(orig_y_test, orig_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, orig_y_pred))

Output

Classification report: 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     71079
           1       0.74      0.85      0.79       123

    accuracy                           1.00     71202
   macro avg       0.87      0.93      0.90     71202
weighted avg       1.00      1.00      1.00     71202

AUROC: 0.9265689945128041

Here I choose the recall of the fraudulent class and the AUROC as the indicators of model quality.

  • Recall: the ability to detect a class, computed as TP / (TP + FN).
  • AUROC: the area under the ROC curve, which indicates how well the model separates the two classes. The ideal value is 1.

In this kind of scenario, accuracy is a poor indicator of model quality because non-fraudulent cases vastly outnumber fraudulent ones. For example, a model that predicts every case as non-fraudulent gets a very high accuracy score even though it has learned nothing.
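As a quick demonstration of that point (a toy check, not part of the original notebook), a classifier that always predicts the majority class scores almost perfect accuracy here while its recall on the fraud class is zero:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

naive_pred = np.zeros(len(orig_y_test), dtype=int)  # always predict "non-fraud"
print(accuracy_score(orig_y_test, naive_pred))      # roughly 0.998
print(recall_score(orig_y_test, naive_pred))        # 0.0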

The logistic regression model reaches 0.85 recall and an AUROC of about 0.93, which is good. But we can do better.

Logistic regression with class weight

The default logistic regression uses the same misclassification penalty for each class. However, for an imbalanced dataset, misclassifying a fraudulent case should cost more than misclassifying a non-fraudulent one.

Thus, let's adjust the penalty for each class.
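As an aside, scikit-learn can also derive weights automatically with class_weight='balanced', which uses the heuristic n_samples / (n_classes * samples_per_class). A minimal sketch of that option (the balanced_log_reg name is just for illustration):

# let scikit-learn set weights inversely proportional to class frequencies
balanced_log_reg = LogisticRegression(max_iter=1000, class_weight='balanced')
balanced_log_reg.fit(orig_x_train_deep.values, orig_y_train.values.ravel())

Below, though, I set the weights manually.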

# Logistic Regression 
w = {0:1, 1:99}
weighted_log_reg = LogisticRegression(max_iter=1000, class_weight=w)
weighted_log_reg.fit(orig_x_train_deep.values, orig_y_train.values.ravel())
# Prediction
weighted_y_pred = weighted_log_reg.predict(orig_x_test_deep)
print('Classification report: \n\n', classification_report(orig_y_test, weighted_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, weighted_y_pred))

Output

Classification report: 

              precision    recall  f1-score   support

           0       1.00      0.99      0.99     71079
           1       0.10      0.88      0.18       123

    accuracy                           0.99     71202
   macro avg       0.55      0.93      0.58     71202
weighted avg       1.00      0.99      0.99     71202

AUROC: 0.9320251358931095

We get a better recall and AUROC: 0.88 and 0.932, respectively.

But this is not enough: the weights above are just a guess. Let's search over other weight combinations with grid search.

w = [{0:25, 1:9975}, {0:2.5, 1:997.5}, {0:0.25, 1:99.75}, {0:0.025, 1:9.975}, {0:0.0025, 1:0.9975}, {0:0.00025, 1:0.09975},
{0:20, 1:9980}, {0:2.0, 1:999.0}, {0:0.2, 1:99.8}, {0:0.02, 1:9.98}, {0:0.002, 1:0.998}, {0:0.0002, 1:0.0998},
{0:17, 1:9983}, {0:1.7, 1:998.3}, {0:0.17, 1:99.83}, {0:0.017, 1:9.983}, {0:0.0017, 1:0.9983}, {0:0.00017, 1:0.09983},
{0:10, 1:9990}, {0:1, 1:999}, {0:0.1, 1:99.9}, {0:0.01, 1:9.99}, {0:0.001, 1:0.999}, {0:0.0001, 1:0.0999},
{0:100, 1:9900}, {0:10, 1:990}, {0:1, 1:99}, {0:0.1, 1:9.9}, {0:0.01, 1:0.99}]
log_reg_params = {'class_weight': w}
# Logistic Regression with grid search
grid_weighted_log_reg = GridSearchCV(LogisticRegression(max_iter=1000), log_reg_params, scoring='roc_auc')
grid_weighted_log_reg.fit(orig_x_train_deep.values, orig_y_train.values.ravel())
# Best hyperparameters
best_hyperparams = grid_weighted_log_reg.best_params_
print('Best hyperparameters: ', best_hyperparams)
best_weighted_log_reg = grid_weighted_log_reg.best_estimator_
# Prediction
best_weighted_y_pred = best_weighted_log_reg.predict(orig_x_test_deep)
print('Classification report: \n\n', classification_report(orig_y_test, best_weighted_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, best_weighted_y_pred))

Output

Best hyperparameters:  {'class_weight': {0: 0.00025, 1: 0.09975}}
Classification report:

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     71079
           1       0.07      0.97      0.13       123

    accuracy                           0.98     71202
   macro avg       0.54      0.97      0.56     71202
weighted avg       1.00      0.98      0.99     71202

AUROC: 0.9727590976580851

Now the recall and AUROC look great. Let's visualize the confusion matrix.

cfs_mat = confusion_matrix(orig_y_test, best_weighted_y_pred)
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(pd.DataFrame(cfs_mat), cmap="Blues", annot=True, fmt='d', ax=ax, annot_kws={"fontsize":16})
ax.set_xlabel('Predicted label', fontsize=16)
ax.set_ylabel('True label', fontsize=16)

Output

Our model identifies 119 of the 123 fraudulent cases in the testing set! That's awesome.

I also tried some common resampling methods, namely RandomUnderSampler, NearMiss and SMOTE, together with logistic regression.

RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
rus_X, rus_y = rus.fit_resample(orig_x_train, orig_y_train)
rus_y.value_counts()

Output

0    369
1    369
Name: Class, dtype: int64

To balance the dataset, RandomUnderSampler randomly keeps only as many non-fraudulent cases as there are fraudulent ones.
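By default the sampler equalizes the two classes; its sampling_strategy parameter can keep a different ratio instead (a small illustration, not used below; the rus_half names are mine):

# keep roughly two non-fraud cases per fraud case (minority/majority ratio = 0.5)
rus_half = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
rus_half_X, rus_half_y = rus_half.fit_resample(orig_x_train, orig_y_train)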

# Feature scaling
time_transformer = RobustScaler()
rus_X = rus_X.assign(Scaled_time=time_transformer.fit_transform(rus_X['Time'].values.reshape(-1,1)))
rus_X_test = orig_x_test.assign(Scaled_time=time_transformer.transform(orig_x_test['Time'].values.reshape(-1,1)))
amount_transformer = RobustScaler()
rus_X = rus_X.assign(Scaled_amount=amount_transformer.fit_transform(rus_X['Amount'].values.reshape(-1,1)))
rus_X_test = rus_X_test.assign(Scaled_amount=amount_transformer.transform(rus_X_test['Amount'].values.reshape(-1,1)))
rus_X.drop(['Time', 'Amount'], axis=1, inplace=True)
rus_X_test.drop(['Time', 'Amount'], axis=1, inplace=True)
# Logistic Regression
rus_log_reg = LogisticRegression(max_iter=1000)
rus_log_reg.fit(rus_X.values, rus_y.values)
# Prediction
rus_y_pred = rus_log_reg.predict(rus_X_test)
print('Classification report: \n\n', classification_report(orig_y_test, rus_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, rus_y_pred))

Output

Classification report: 

              precision    recall  f1-score   support

           0       1.00      0.92      0.96     71079
           1       0.02      0.90      0.04       123

    accuracy                           0.92     71202
   macro avg       0.51      0.91      0.50     71202
weighted avg       1.00      0.92      0.96     71202

AUROC: 0.9107363877842551

NearMiss (undersampling)

nm_X, nm_y = NearMiss().fit_resample(orig_x_train, orig_y_train)
nm_y.value_counts()

Output

0    369
1    369
Name: Class, dtype: int64

The class sizes are the same as with RandomUnderSampler, but NearMiss uses a different strategy to select the non-fraudulent cases.
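NearMiss comes in three variants selected by its version parameter; roughly, version 1 keeps the majority samples closest on average to their nearest minority neighbours, version 2 uses the farthest minority samples instead, and version 3 applies a two-step nearest-neighbour selection. A minimal sketch of trying a different variant (an illustration only, default settings otherwise):

# NearMiss-3: two-step nearest-neighbour selection of majority samples
nm3_X, nm3_y = NearMiss(version=3).fit_resample(orig_x_train, orig_y_train)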

# Feature scaling
time_transformer = RobustScaler()
nm_X = nm_X.assign(Scaled_time=time_transformer.fit_transform(nm_X['Time'].values.reshape(-1,1)))
nm_x_test = orig_x_test.assign(Scaled_time=time_transformer.transform(orig_x_test['Time'].values.reshape(-1,1)))
amount_transformer = RobustScaler()
nm_X = nm_X.assign(Scaled_amount=amount_transformer.fit_transform(nm_X['Amount'].values.reshape(-1,1)))
nm_x_test = nm_x_test.assign(Scaled_amount=amount_transformer.transform(nm_x_test['Amount'].values.reshape(-1,1)))
nm_X.drop(['Time', 'Amount'], axis=1, inplace=True)
nm_x_test.drop(['Time', 'Amount'], axis=1, inplace=True)
# Logistic Regression
nm_log_reg = LogisticRegression(max_iter=1000)
nm_log_reg.fit(nm_X.values, nm_y.values)
# Prediction
nm_y_pred = nm_log_reg.predict(nm_x_test)
print('Classification report: \n\n', classification_report(orig_y_test, nm_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, nm_y_pred))

Output

Classification report: 

              precision    recall  f1-score   support

           0       1.00      0.49      0.66     71079
           1       0.00      0.94      0.01       123

    accuracy                           0.49     71202
   macro avg       0.50      0.72      0.33     71202
weighted avg       1.00      0.49      0.66     71202

AUROC: 0.7177637112124298

SMOTE (oversampling)

smt_X, smt_y = SMOTE().fit_resample(orig_x_train, orig_y_train)
smt_y.value_counts()

Output

0    213236
1    213236
Name: Class, dtype: int64

Instead of discarding non-fraudulent cases, SMOTE synthesizes new fraudulent cases. Both classes still end up with the same number of samples, but the total training set now exceeds 400,000 rows.
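Conceptually, SMOTE creates each synthetic sample by interpolating between a minority sample and one of its nearest minority-class neighbours; a minimal sketch of that idea (an illustration, not imblearn's actual implementation):

import numpy as np

def smote_like_sample(x_i, x_neighbor, rng=np.random):
    # the synthetic point lies somewhere on the segment between x_i and its neighbour
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)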

# Feature scaling
time_transformer = RobustScaler()
smt_X = smt_X.assign(Scaled_time=time_transformer.fit_transform(smt_X['Time'].values.reshape(-1,1)))
smt_x_test = orig_x_test.assign(Scaled_time=time_transformer.transform(orig_x_test['Time'].values.reshape(-1,1)))
amount_transformer = RobustScaler()
smt_X = smt_X.assign(Scaled_amount=amount_transformer.fit_transform(smt_X['Amount'].values.reshape(-1,1)))
smt_x_test = smt_x_test.assign(Scaled_amount=amount_transformer.transform(smt_x_test['Amount'].values.reshape(-1,1)))
smt_X.drop(['Time', 'Amount'], axis=1, inplace=True)
smt_x_test.drop(['Time', 'Amount'], axis=1, inplace=True)
# Logistic Regression
smt_log_reg = LogisticRegression(max_iter=1000)
smt_log_reg.fit(smt_X.values, smt_y.values)
# Prediction
smt_y_pred = smt_log_reg.predict(smt_x_test)
print('Classification report: \n\n', classification_report(orig_y_test, smt_y_pred))
# Calculate AUROC
print('AUROC: ', roc_auc_score(orig_y_test, smt_y_pred))

Output

Classification report: 

              precision    recall  f1-score   support

           0       1.00      0.97      0.98     71079
           1       0.05      0.98      0.09       123

    accuracy                           0.97     71202
   macro avg       0.52      0.97      0.54     71202
weighted avg       1.00      0.97      0.98     71202

AUROC: 0.9749380541541034

Confusion matrix visualization:

cfs_mat = confusion_matrix(orig_y_test, smt_y_pred)
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(pd.DataFrame(cfs_mat), cmap="Blues", annot=True, fmt='d', ax=ax, annot_kws={"fontsize":16})
ax.set_xlabel('Predicted label', fontsize=16)
ax.set_ylabel('True label', fontsize=16)

Output

SMOTE gives the best result of the three resampling methods, and it is slightly better than the weight-adjusted logistic regression (0.98 vs. 0.97 recall, 0.975 vs. 0.973 AUROC).

Conclusion

In this article, I used two approaches to deal with an imbalanced dataset: class-weight adjustment and data under/over-sampling. Both work well in this scenario.

Any feedback/discussion is welcome!
