Overfitting and Underfitting in Machine Learning: What You Need to Know

Ever wondered why your machine learning model performs brilliantly during training but crashes spectacularly when faced with real-world data? Or why your model sometimes seems too “dumb” to capture even basic patterns? You’re not alone: you’re dealing with two of ML’s most frustrating challenges, overfitting and underfitting.

Here’s the thing: building a machine learning model is a lot like cooking the perfect bowl of porridge. Too hot, too cold, or just right—remember Goldilocks? That’s exactly what overfitting and underfitting in machine learning represent. And trust me, getting this balance right can make or break your entire ML project, whether you’re a B.Tech student working on your final year project or a data scientist deploying models in production.

By the end of this piece, you’ll understand what causes these problems, how to identify them in your own models, and most importantly, how to fix them. Because let’s be honest, nobody wants to be that person who proudly demonstrates a model with 99% training accuracy only to watch it fail miserably on test data. Moreover, understanding these concepts will save you countless debugging hours and significantly improve your model’s real-world performance.

Understanding the Core Problem

Before we dive into overfitting in machine learning and underfitting in machine learning, let’s talk about what we’re really trying to achieve. The goal of any ML model is simple: learn patterns from training data and apply them successfully to new, unseen data. It’s like studying for an exam—you want to understand concepts, not just memorize answers.

Think about it this way: if you memorize exact questions and answers before an exam, you’ll ace those specific questions but fail when the teacher slightly modifies them. That’s overfitting. On the flip side, if you barely study at all, you’ll fail everything. That’s underfitting.

What is Underfitting in Machine Learning?

Let’s start with the simpler concept. Underfitting in machine learning occurs when your model is too simple to capture the underlying patterns in your data. It’s like trying to fit a straight line through data that clearly follows a curve—it just doesn’t work.

An underfitted model has high bias because it makes strong assumptions about the data without actually learning from it properly. The model performs poorly on both training data and test data, which makes it relatively easy to identify.

Real-World Example of Underfitting

Imagine you’re building a model to predict house prices. You only use the number of bedrooms as a feature and apply linear regression. Obviously, house prices depend on many factors—location, size, amenities, age of the property, and more. Your oversimplified model will perform terribly because it’s fundamentally incapable of capturing the complexity of real estate pricing.

Here’s a simple code example demonstrating underfitting:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate sample data with a non-linear pattern
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Underfitted model: simple linear regression
model_underfit = LinearRegression()
model_underfit.fit(X, y)

# Predictions
X_test = np.linspace(0, 5, 100).reshape(-1, 1)
y_pred_underfit = model_underfit.predict(X_test)

# Plotting
plt.scatter(X, y, color='blue', s=30, marker='o', label='Training data')
plt.plot(X_test, y_pred_underfit, color='red', linewidth=2, label='Underfitted model')
plt.legend()
plt.title('Underfitting Example')
plt.show()

In this example, we’re trying to fit a straight line to data that follows a sine wave pattern. The model simply can’t capture the complexity, resulting in poor performance.

Signs of Underfitting

  • Low training accuracy
  • Low validation/test accuracy
  • High bias
  • The model is too simple for the problem
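These symptoms are easy to check in code: compare the model’s score on the training set and a held-out set. Here’s a minimal sketch, reusing the synthetic sine data from the example above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same non-linear data as in the underfitting example
np.random.seed(42)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# An underfitted model scores poorly on BOTH sets
train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
print(f"train R^2: {train_r2:.2f}, validation R^2: {val_r2:.2f}")
```

When both numbers come out mediocre and close together, you’re looking at underfitting rather than overfitting.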

What is Overfitting in Machine Learning?

Now, let’s tackle the more common and trickier problem: overfitting in machine learning. This happens when your model learns the training data too well—including all the noise and random fluctuations. It’s like a student who memorizes every single practice problem but can’t solve anything slightly different.

An overfitted model has high variance because it’s too sensitive to small fluctuations in the training data. It performs exceptionally well on training data but fails on new, unseen data. This is particularly dangerous because your model might look great during development but completely fail in production.

Real-World Example of Overfitting

Let’s say you’re building a spam detection system. During training, you notice that every spam email in your dataset contains the word “congratulations.” Your overfitted model might learn that “congratulations” always means spam. However, in the real world, legitimate emails also use this word—birthday wishes, job offers, and genuine promotions. Your model will incorrectly flag these as spam because it memorized specific patterns rather than learning general ones.

Here’s a code example demonstrating overfitting:

# Overfitted model: very high degree polynomial
poly_features = PolynomialFeatures(degree=15)
X_poly = poly_features.fit_transform(X)

model_overfit = LinearRegression()
model_overfit.fit(X_poly, y)

# Predictions
X_test_poly = poly_features.transform(X_test)
y_pred_overfit = model_overfit.predict(X_test_poly)

# Plotting
plt.scatter(X, y, color='blue', s=30, marker='o', label='Training data')
plt.plot(X_test, y_pred_overfit, color='green', linewidth=2, label='Overfitted model')
plt.legend()
plt.title('Overfitting Example')
plt.show()

Notice how the polynomial of degree 15 creates a wildly complex curve that passes through almost every training point but would perform horribly on new data.

Signs of Overfitting

  • Very high training accuracy (sometimes near 100%)
  • Significantly lower validation/test accuracy
  • Large gap between training and validation performance
  • High variance
  • The model is too complex for the problem

The Sweet Spot: A Well-Fitted Model

The goal is finding the balance—a model that’s complex enough to capture real patterns but simple enough to generalize to new data. This is where overfitting and underfitting in machine learning converge into the concept of model optimization.

# Well-fitted model: appropriate polynomial degree
poly_features_good = PolynomialFeatures(degree=3)
X_poly_good = poly_features_good.fit_transform(X)

model_good = LinearRegression()
model_good.fit(X_poly_good, y)

# Predictions
X_test_poly_good = poly_features_good.transform(X_test)
y_pred_good = model_good.predict(X_test_poly_good)

# Plotting
plt.scatter(X, y, color='blue', s=30, marker='o', label='Training data')
plt.plot(X_test, y_pred_good, color='purple', linewidth=2, label='Well-fitted model')
plt.legend()
plt.title('Well-Fitted Model')
plt.show()

This model strikes the right balance, capturing the general trend without memorizing noise.

How to Prevent Overfitting?

1. Use More Training Data: More data helps your model learn general patterns rather than memorizing specific examples. It’s harder to overfit when you have diverse training samples.

2. Cross-Validation: Split your data into multiple folds and validate your model across different subsets. This technique helps you understand how well your model generalizes.
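With scikit-learn’s cross_val_score, a 5-fold check of the degree-3 model from earlier takes just a few lines. A sketch, regenerating the same synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X = 5 * np.random.rand(80, 1)  # unsorted, so folds mix the x-range
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Score the degree-3 polynomial model on 5 different train/validation splits
X_poly = PolynomialFeatures(degree=3).fit_transform(X)
scores = cross_val_score(LinearRegression(), X_poly, y, cv=5)
print("fold R^2 scores:", np.round(scores, 2))
print(f"mean R^2: {scores.mean():.2f}")
```

If the fold scores vary wildly or sit far below the training score, that’s a generalization problem worth investigating.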

3. Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add penalties for complexity, forcing your model to stay simpler.

from sklearn.linear_model import Ridge

# Ridge regression with L2 regularization; larger alpha means a
# stronger penalty and a simpler model
model_ridge = Ridge(alpha=1.0)
model_ridge.fit(X_poly_good, y)

4. Dropout (for Neural Networks): Randomly drop neurons during training to prevent the network from relying too heavily on specific features.
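Deep learning frameworks such as TensorFlow and PyTorch ship dropout as a built-in layer, but the underlying mechanism is simple enough to sketch in plain NumPy (the rate and layer output below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero out a random subset of units during
    training and rescale the survivors so the expected sum is unchanged."""
    if not training:
        return activations  # at inference time, dropout is a no-op
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

hidden = np.ones((4, 8))           # pretend layer output
dropped = dropout(hidden, rate=0.5)
print(dropped)                      # roughly half the units are zeroed
```

Because each forward pass sees a different random subset of units, no single neuron can dominate the prediction.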

5. Early Stopping: Monitor validation performance and stop training when it starts degrading, even if training performance is still improving.
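The logic is framework-independent. A minimal sketch with a patience counter (the validation losses are made-up numbers illustrating a run that begins to overfit):

```python
# Stop when validation loss hasn't improved for `patience` epochs
val_losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.47, 0.52]

patience = 2
best_loss = float("inf")
best_epoch = 0
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping at epoch {epoch}; best was epoch {best_epoch}")
            break
```

In practice you would also restore the weights saved at the best epoch, which libraries like Keras handle for you via callbacks.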

6. Feature Selection: Remove irrelevant or redundant features that might cause your model to learn noise.
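scikit-learn’s SelectKBest automates a simple version of this: score each feature against the target and keep the top k. A sketch on synthetic data (the informative/noise split below is an assumption for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Three informative columns plus two pure-noise columns
X_informative = rng.normal(size=(200, 3))
X_noise = rng.normal(size=(200, 2))
X = np.hstack([X_informative, X_noise])
y = X_informative @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

# Keep the 3 features most correlated with the target
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of surviving columns
```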

How to Prevent Underfitting?

1. Increase Model Complexity: Use more powerful algorithms or add more layers/parameters to your model. Although this might seem counterintuitive after discussing overfitting, sometimes your model genuinely needs more capacity.
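To see the effect of capacity, compare a straight line against a more flexible learner on the same non-linear data (a sketch; the random forest is just one example of a higher-capacity model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = 5 * np.random.rand(200, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

simple = LinearRegression().fit(X_train, y_train)
flexible = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# The higher-capacity model captures the sine pattern the line misses
print(f"linear R^2: {simple.score(X_val, y_val):.2f}")
print(f"forest R^2: {flexible.score(X_val, y_val):.2f}")
```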

2. Feature Engineering: Create new, more informative features from existing ones. For instance, combining “square footage” and “number of rooms” might give you “average room size,” which could be more predictive.
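In code this can be as simple as deriving a new column from existing ones (the housing records below are hypothetical):

```python
# Hypothetical housing records: derive "average room size"
# from two existing features
houses = [
    {"sqft": 1800, "rooms": 6},
    {"sqft": 1400, "rooms": 4},
    {"sqft": 2500, "rooms": 5},
]

for house in houses:
    house["avg_room_size"] = house["sqft"] / house["rooms"]

print([h["avg_room_size"] for h in houses])  # [300.0, 350.0, 500.0]
```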

3. Reduce Regularization: If you’ve applied too much regularization, your model might be artificially constrained.

4. Train Longer: Sometimes underfitting happens simply because you haven’t trained enough. Let your model learn for more epochs.

Practical Tips for Your Projects

When working on your ML projects, always split your data into training, validation, and test sets. Train on the training set, tune hyperparameters using the validation set, and only evaluate final performance on the test set—which you shouldn’t touch until the very end.
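With scikit-learn, a 60/20/20 split can be done with two calls to train_test_split (the proportions here are a common convention, not a rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# First carve off the test set, then split the rest into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```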

Moreover, plot learning curves showing training and validation performance over time. If both curves are low and converging, you’re underfitting. If training accuracy is high but validation accuracy is low, you’re overfitting. If both are high and close together, congratulations—you’ve nailed it!
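scikit-learn’s learning_curve computes exactly these numbers for you. A sketch using the straight-line model, where training and validation scores come out close together but well below what the cubic model achieves—the signature of underfitting (the data generation mirrors the earlier examples):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

np.random.seed(42)
X = 5 * np.random.rand(200, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Score the model at 5 increasing training-set sizes, 5 folds each
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=0)

print("train:", np.round(train_scores.mean(axis=1), 2))
print("valid:", np.round(val_scores.mean(axis=1), 2))
```

Plotting these two arrays against sizes gives you the learning curves described above.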

Also, remember that the “right” model complexity depends entirely on your problem, data size, and requirements. There’s no one-size-fits-all solution, which is why machine learning is as much an art as it is a science.

Wrapping Up

Understanding overfitting and underfitting in machine learning is fundamental to building models that actually work in production. These aren’t just theoretical concepts—they’re practical challenges you’ll face in every ML project, from your college assignments to industry applications.

The key takeaway? Always validate your models on unseen data, monitor the gap between training and validation performance, and be willing to iterate. Because ultimately, a model that generalizes well is worth far more than one that simply memorizes training data, no matter how impressive those training metrics look.
