Tutorial
Automated Data Augmentation for Tabular Data!
March 13, 2025
If you’re new to machine learning and wondering how to make your models smarter without collecting more data, you’re in the right place.
What is Data Augmentation?
Imagine you’re teaching a child to recognize cats. If you only show them one picture of a cat, they might struggle to identify cats in different poses or lighting. But if you show them many variations—cats sitting, standing, or in different colors—they’ll learn better. Data augmentation does the same for machine learning models. It creates modified copies of your existing data to give your model more examples to learn from, making it more robust and less likely to “memorize” the training data (a problem called overfitting). For tabular data—like a CSV file with customer transactions—this might mean tweaking numbers slightly or generating new rows based on patterns in your data. Let’s dive into how to do this, step by step.
Step 1: Understand Your Data
Before you start changing anything, take a good look at your dataset. What’s in it? For example, imagine a file called customer_transactions.csv with columns like:
customer_id (a unique number for each customer),
age (a number),
transaction_amount (a dollar value),
gender (categories like “M” or “F”),
purchase (0 for “no,” 1 for “yes”).
Ask yourself:
Are the features numerical (like age) or categorical (like gender)?
Are there missing values (empty cells)?
Is the data balanced? For instance, do you have way more “no” purchases than “yes” purchases?
Understanding these details helps you decide how to augment your data effectively.
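A few pandas one-liners can answer these questions directly. This is just a quick sketch, assuming the column names from the example above:
import pandas as pd
# Load the data
df = pd.read_csv('customer_transactions.csv')
# Column names, types, and non-null counts
df.info()
# Missing values per column
print(df.isna().sum())
# Class balance for the target
print(df['purchase'].value_counts())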
Step 2: Handle Missing Values
Missing data is like a puzzle with pieces gone—it can confuse your model. You need to fill those gaps. Here are simple ways to do it:
For numbers: Use the average (mean) or middle value (median). If some age values are missing, you could fill them with the average age of all customers.
For categories: Use the most common value (mode). If gender is missing, fill it with whichever is more frequent, like “F” if most customers are female.
Here’s how you’d do this in Python using the pandas library:
import pandas as pd
# Load your data
df = pd.read_csv('customer_transactions.csv')
# Fill missing ages with the average
# (assignment avoids pandas' deprecated inplace fillna pattern)
df['age'] = df['age'].fillna(df['age'].mean())
# Fill missing gender with the most common value
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
This ensures your dataset is complete before you start augmenting.
Step 3: Apply Transformations to Skewed Data
Numerical features like transaction_amount are often skewed: most values are small, but a few are very large. A log transformation compresses that long tail:
import numpy as np
# Apply log transformation to transaction_amount
df['transaction_amount'] = np.log1p(df['transaction_amount'])
The np.log1p function is handy because it works even if some values are zero (plain log doesn’t like zeros). After this, your data might look more “normal,” which many models prefer.
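Two handy companions to this step, sketched here: pandas can measure how skewed a feature is, so you can decide whether the transformation is worth it, and np.expm1 reverses np.log1p when you want dollar values back:
# Run this before transforming to decide whether it's needed:
# skewness near 0 is roughly symmetric; large positive means a long right tail
print(df['transaction_amount'].skew())
# And to recover original dollar amounts from the log scale later:
dollars = np.expm1(df['transaction_amount'])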
Step 4: Generate Synthetic Data with Noise
Now, let’s create new data by tweaking what you have. One easy way is to add random noise—small random changes—to numerical features. For example, if a customer’s age is 30, you might add or subtract a tiny bit (like 1 or 2 years) to create a new “synthetic” customer.
Here’s a Python example:
# Make a copy of the original data
df_aug = df.copy()
# Add random noise to age and transaction_amount
df_aug['age'] += np.random.uniform(-2, 2, size=len(df)) # Changes age by -2 to +2
df_aug['transaction_amount'] += np.random.uniform(-0.1, 0.1, size=len(df)) # Small changes on log scale
You can make several copies, each with its own fresh random noise, and combine them:
# Create 5 augmented versions
num_augmentations = 5
augmented_dfs = [df.copy() for _ in range(num_augmentations)]
for aug_df in augmented_dfs:
    aug_df['age'] += np.random.uniform(-2, 2, size=len(df))
    aug_df['transaction_amount'] += np.random.uniform(-0.1, 0.1, size=len(df))
# Combine original and augmented data
df_augmented = pd.concat([df] + augmented_dfs, ignore_index=True)
This gives you a bigger dataset with realistic variations.
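One practical tip: seed NumPy’s random generator if you want the same “random” noise every run. Here’s a minimal sketch using the modern Generator API (the seed value 42 is arbitrary):
# A seeded generator makes the augmentation reproducible
rng = np.random.default_rng(42)
df_aug = df.copy()
df_aug['age'] += rng.uniform(-2, 2, size=len(df))
df_aug['transaction_amount'] += rng.uniform(-0.1, 0.1, size=len(df))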
Step 5: Balance the Dataset with SMOTE (If Needed)
What if your data is imbalanced? Say you have 70 “no” purchases and only 30 “yes” purchases. Your model might just guess “no” all the time because it’s more common. SMOTE (Synthetic Minority Over-sampling Technique) fixes this by creating new examples of the minority class (here, “yes” purchases) based on existing ones.
SMOTE needs all features to be numbers, so first convert categories like gender into numeric columns. For a two-category column like this one, one-hot encoding with drop_first=True produces a single 0/1 indicator. Then apply SMOTE:
from imblearn.over_sampling import SMOTE
# One-hot encode categorical features
df_encoded = pd.get_dummies(df_augmented, columns=['gender'], drop_first=True)
# Separate features and target (customer_id is an identifier, not a feature)
X = df_encoded.drop(['purchase', 'customer_id'], axis=1)
y = df_encoded['purchase']
# Apply SMOTE
smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
# Combine back into a DataFrame
df_balanced = pd.DataFrame(X_res, columns=X.columns)
df_balanced['purchase'] = y_res
SMOTE picks a “yes” purchase, finds similar “yes” purchases, and creates new ones in between—like averaging their age and transaction_amount. Now your classes are balanced!
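It’s worth verifying with a quick count of each class before and after resampling:
# Before balancing
print(y.value_counts())
# After SMOTE: the counts should now be equal
print(pd.Series(y_res).value_counts())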
Step 6: Ensure Data Consistency
After all this augmenting, double-check your data:
Are age values still sensible (e.g., not negative)?
Do transaction_amount values make sense after noise or log transformations?
For categories, did encoding or augmentation mess anything up?
If something looks off (like an age of -5), adjust your noise range or add rules to cap values.
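For example, pandas’ clip method can enforce bounds. The 18–100 age range here is just an illustrative assumption about your customers, not a rule:
# Cap ages to a plausible range (bounds are assumptions for this example)
df_balanced['age'] = df_balanced['age'].clip(lower=18, upper=100)
# Log-scale transaction amounts should never be negative
df_balanced['transaction_amount'] = df_balanced['transaction_amount'].clip(lower=0)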
Step 7: Save the Augmented Data
Once you’re happy with your augmented dataset, save it for your machine learning project:
df_balanced.to_csv('customer_transactions_augmented.csv', index=False)
Now you have a shiny new customer_transactions_augmented.csv file, ready to train a better model!
A Simple Example
Let’s tie this together with an example. Suppose your original dataset has 100 rows with age, transaction_amount, and purchase. The picture at the top shows a scatter plot: the left side is the original data (blue for “no” purchases, red for “yes”), and the right side shows it after adding noise (slightly shifted points).
Here’s what you’d do:
Load and clean: Load customer_transactions.csv and fill any missing age values with the mean.
Transform: Apply np.log1p to transaction_amount if it’s skewed.
Augment with noise: Add small random changes to age and transaction_amount, making 5 new versions and combining them with the original (600 rows total).
Balance with SMOTE: If “yes” purchases are rare, use SMOTE to even things out.
Save: Export the result as customer_transactions_augmented.csv.
The code snippets above show exactly how to do each part.
A Quick Note for Beginners
For categorical features like gender, augmentation is trickier. You could keep them the same for synthetic rows or sample them randomly (as sketched below), but more advanced methods exist too, such as generative models or SMOTENC, a SMOTE variant from imblearn designed for mixed numeric and categorical data. For now, focusing on numbers keeps it simple.
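If you do want synthetic rows to vary in gender, one simple (and admittedly crude) option is to resample the column from its observed distribution. Note this ignores any correlation between gender and the other features:
# Resample gender for the augmented rows, matching the original mix
rng = np.random.default_rng(0)
df_aug['gender'] = rng.choice(df['gender'].to_numpy(), size=len(df_aug))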
Also, in real machine learning, you’d only augment your training data, not your test data, to avoid cheating. Here, we augmented the whole dataset for simplicity, but keep this in mind for projects!
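Concretely, that means splitting first and augmenting only the training portion. A minimal sketch with scikit-learn:
from sklearn.model_selection import train_test_split
# Split BEFORE augmenting so the test set stays untouched real data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# ...apply noise, SMOTE, etc. to train_df only...
# Evaluate on test_df exactly as it came from the source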
Conclusion
Automated data augmentation for tabular data is like giving your machine learning model a bigger, more diverse classroom to learn from. By understanding your data, fixing missing values, transforming skewed features, adding noise, balancing classes with SMOTE, checking consistency, and saving your work, you can boost your model’s performance without collecting more data. It’s a practical skill that’s easy to start with Python libraries like pandas, numpy, and imblearn.