Data Preparation

Overview

This section examines summary statistics for the fraud_data.csv dataset and splits the data into training and test sets, which are used in the next section to train several models and evaluate how well they detect fraudulent credit card transactions. Because the classes are heavily imbalanced, the project focuses on selecting appropriate model evaluation metrics.

A description of the dataset and code for importing the data are provided in the previous section.

Construction of the model and analysis are presented in the next section.

This project is based on assignments from Applied Machine Learning in Python by the University of Michigan on Coursera.

The analysis for this project was performed in Python.

Data Exploration and Processing

The following code outputs summary statistics for each of the features:

# Read data and output summary statistics 
df = read_transactions_data()

print(round(df.describe().transpose(), 3))
print('\nThe number of missing values across all attributes and samples: ', df.isnull().sum().sum())

The summary statistics below show that the data contain 21,693 transactions, of which 1.6% are fraudulent. The mean transaction amount (86.78) is substantially higher than the median (21.95), which suggests that a relatively small number of very large transactions pulls the mean upward. The dataset has no missing values.

          count    mean      std     min    25%     50%     75%       max
V1      21693.0  -0.032    2.107 -41.929 -0.929   0.008   1.316     2.452
V2      21693.0   0.048    1.691 -40.804 -0.593   0.075   0.820    21.467
V3      21693.0  -0.092    1.870 -31.104 -0.963   0.177   1.021     4.070
V4      21693.0   0.058    1.540  -4.849 -0.850  -0.013   0.772    12.115
V5      21693.0  -0.034    1.531 -32.092 -0.698  -0.064   0.615    29.162
V6      21693.0  -0.023    1.341 -20.368 -0.779  -0.282   0.384    21.393
V7      21693.0  -0.074    1.597 -41.507 -0.565   0.031   0.564    34.303
V8      21693.0   0.002    1.413 -38.987 -0.206   0.023   0.328    20.007
V9      21693.0  -0.044    1.159 -13.434 -0.670  -0.074   0.590     9.126
V10     21693.0  -0.091    1.355 -24.403 -0.555  -0.099   0.445    12.702
V11     21693.0   0.067    1.154  -3.996 -0.739   0.006   0.786    12.019
V12     21693.0  -0.094    1.365 -18.600 -0.439   0.127   0.614     3.970
V13     21693.0  -0.001    0.990  -3.845 -0.634  -0.019   0.652     4.099
V14     21693.0  -0.091    1.356 -19.214 -0.438   0.045   0.490     6.441
V15     21693.0  -0.004    0.917  -4.499 -0.582   0.049   0.642     5.720
V16     21693.0  -0.055    1.096 -14.130 -0.493   0.060   0.525     6.443
V17     21693.0  -0.098    1.425 -24.019 -0.499  -0.076   0.390     6.609
V18     21693.0  -0.033    0.937  -9.499 -0.513  -0.019   0.495     3.790
V19     21693.0   0.022    0.844  -4.400 -0.444   0.022   0.485     4.850
V20     21693.0  -0.002    0.728 -21.025 -0.210  -0.057   0.139    13.120
V21     21693.0   0.012    0.850 -21.454 -0.225  -0.024   0.193    27.203
V22     21693.0   0.004    0.741  -8.887 -0.538   0.007   0.530     8.362
V23     21693.0  -0.002    0.630 -21.304 -0.162  -0.012   0.147    15.626
V24     21693.0  -0.002    0.600  -2.767 -0.356   0.037   0.432     4.014
V25     21693.0  -0.000    0.521  -4.542 -0.317   0.012   0.354     5.542
V26     21693.0   0.002    0.478  -1.855 -0.326  -0.045   0.239     3.463
V27     21693.0   0.002    0.425  -7.764 -0.070   0.002   0.096     9.880
V28     21693.0   0.003    0.302  -6.520 -0.053   0.012   0.082     9.876
Amount  21693.0  86.776  235.644   0.000  5.370  21.950  76.480  7712.430
Class   21693.0   0.016    0.127   0.000  0.000   0.000   0.000     1.000

The number of missing values across all attributes and samples:  0
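The class imbalance and the skew in transaction amounts noted above can be confirmed directly. The following is a minimal sketch, assuming df has been loaded as shown earlier:

# Confirm the class imbalance and the skew in transaction amounts
print(df['Class'].value_counts(normalize=True))            # ~98.4% legitimate, ~1.6% fraudulent
print('Mean amount:  ', round(df['Amount'].mean(), 2))     # ~86.78
print('Median amount:', round(df['Amount'].median(), 2))   # ~21.95
print('Skewness:     ', round(df['Amount'].skew(), 2))     # expected to be strongly positive

A large positive skewness statistic for Amount is consistent with a small number of very large transactions driving the mean well above the median.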

The following code splits the data into training and test sets (by default, train_test_split reserves 25% of the observations for the test set):

# Split the data into X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]   # all columns except Class
y = df.iloc[:, -1]    # Class: 1 = fraudulent, 0 = legitimate

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
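With only 1.6% of transactions labeled fraudulent, it is worth verifying that both splits contain a similar share of fraud cases. The sketch below checks this; note that train_test_split also accepts a stratify=y argument, which would enforce matching proportions exactly:

# Check that the fraud rate is similar in the training and test sets
print('Fraud rate in training set:', round(y_train.mean(), 4))
print('Fraud rate in test set:    ', round(y_test.mean(), 4))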

Features need to be scaled before we train the models described in the next section. The code below fits a scaler to the training data and transforms both the training and test data using the fitted scaler.

It is important to note that the scaler should be fitted to the training data only (rather than to the entire dataset) in order to prevent leakage of information from the test data.

# Scale the data: fit the scaler on the training set only
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)   # learn means and standard deviations from the training data
X_train = scaler.transform(X_train)      # standardize the training features
X_test = scaler.transform(X_test)        # apply the same transformation to the test features
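As a sanity check (a minimal sketch, assuming the arrays produced above), the scaled training features should have mean 0 and standard deviation 1 in each column, while the test features will be only approximately standardized, since they were transformed with statistics learned from the training data:

import numpy as np

# Training features are standardized exactly (up to floating-point error)
print('Train means ~ 0:', np.allclose(X_train.mean(axis=0), 0))
print('Train stds  ~ 1:', np.allclose(X_train.std(axis=0), 1))

# Test features are only approximately standardized
print('Max |test column mean|:', np.abs(X_test.mean(axis=0)).max())

That the test means are close to, but not exactly, zero is precisely the point of fitting the scaler on the training data alone: no information from the test set leaks into the transformation.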

Next step: Analysis