Cross-Entropy Loss Function in Machine Learning: Enhancing Model Accuracy

Explore cross-entropy in machine learning in our guide on optimizing model accuracy and effectiveness in classification with TensorFlow and PyTorch examples.
Jan 2024  · 12 min read

Almost all of the applications we use today incorporate a form of machine learning to enhance or automate a business process.

However, these models cannot simply be pushed to a live environment, especially when the stakes are high (e.g., predicting whether someone has cancer). They must be optimized to operate efficiently and effectively before moving into production. This means the model parameters must be fine-tuned to ensure they contribute to an improved customer experience.

Practitioners typically use loss (or cost) functions to find the optimal solution for their machine learning model during training. Cross-entropy is one of the most popular loss functions used to optimize classification models.

In this article, we will dive deeper into:

  • What is cross-entropy?
  • How cross-entropy relates to entropy and KL divergence
  • Cross-entropy as a loss function
  • Implementing cross-entropy loss in TensorFlow and PyTorch

Understanding Entropy

The concept of cross-entropy traces its roots back to the field of information theory, where information entropy, also known as Shannon entropy, was formally introduced in 1948 by Claude Shannon in a paper titled “A Mathematical Theory of Communication.” Before we tackle cross-entropy, let’s deal with entropy.

Entropy calculates the degree of randomness or disorder within a system. In the context of information theory, the entropy of a random variable is the average uncertainty, surprise, or information inherent to the possible outcomes. To put things simply, it measures the uncertainty of an event.

For example, if a fair coin were tossed, we would have two possible outcomes, where the probability of the coin landing on heads or tails is ½.

This means: P(X=heads) = P(X=tails) = ½.

The Shannon entropy equation: H(X) = −Σ_i p(x_i) log₂ p(x_i), where the sum runs over all possible outcomes x_i.

Using the Shannon entropy equation, the entropy of the fair coin toss works out to 1 bit, the maximum possible for two outcomes: because heads and tails are equally likely, we are maximally uncertain about the result. By contrast, a coin that almost always lands on the same side would have an entropy close to 0, meaning almost no uncertainty at all.

The greater the value of entropy, H(x), the greater the uncertainty for the probability distribution, and the smaller the value, the less uncertainty.
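
As a quick illustration, here is a small NumPy sketch (the entropy helper below is our own, not a library function) comparing a fair coin with a heavily biased one:

import numpy as np

def entropy(probs):
    # Shannon entropy, in bits, of a discrete probability distribution
    probs = np.asarray(probs)
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit (maximum uncertainty)
print(entropy([0.99, 0.01]))  # biased coin -> ~0.08 bits (almost certain outcome)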

Now, we’re ready to delve into cross-entropy.

What is Cross-Entropy?

Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine learning to measure the performance of a classification model.

It measures the average number of bits required to identify an event drawn from one probability distribution, p, when using a coding scheme optimized for another distribution, q. In machine learning terms, cross-entropy measures the difference between the true probability distribution of the labels and the probability distribution predicted by the model.

The cross-entropy loss function is used to find the optimal solution by adjusting the weights of a machine learning model during training. The objective is to minimize the error between the actual and predicted outcomes, so a loss close to 0 is a sign of a good model, whereas larger loss values signal a poor-performing model (unlike accuracy, cross-entropy is not capped at 1).

If you’re familiar with Kullback-Leibler (KL) Divergence, you may wonder, “What’s the difference between cross-entropy and KL divergence?” And that’s a fair question. Both concepts are widely used to measure probability distributions' differences or similarities. Although they share some similarities, they serve different purposes.

As mentioned above, cross-entropy measures the average number of bits required to identify an event from one probability distribution, P, using the optimal code for another probability distribution, Q. In machine learning, it is typically used to evaluate the performance of a model, where the objective is to minimize the error between the predicted probability distribution and the true distribution.

In contrast, KL divergence measures the difference between two probability distributions, P and Q. Namely, KL divergence quantifies the amount of information loss when Q is used to approximate P. This is incredibly useful in unsupervised learning tasks where the objective is to uncover structure in data by minimizing the divergence between the true and learned data distributions.
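
The two quantities are linked by the identity H(P, Q) = H(P) + KL(P || Q). The short NumPy sketch below (with made-up distributions) illustrates this relationship:

import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution P
q = np.array([0.5, 0.3, 0.2])  # approximating distribution Q

entropy_p = -np.sum(p * np.log2(p))         # H(P)
cross_entropy = -np.sum(p * np.log2(q))     # H(P, Q)
kl_divergence = np.sum(p * np.log2(p / q))  # KL(P || Q)

print(cross_entropy)              # ~1.28 bits
print(entropy_p + kl_divergence)  # same value: H(P, Q) = H(P) + KL(P || Q)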

Cross Entropy as a Loss Function

In machine learning, loss functions help a model determine how wrong its predictions are and improve itself based on that error. They are mathematical functions that quantify the difference between predicted and actual values, but this isn’t all they do.

The measure of error from a loss function also serves as a guide during the optimization process by providing feedback to the model on how well it fits the data. Hence, most machine learning models implement a loss function during the optimization phase, where the model parameters are chosen to help the model minimize the error and arrive at an optimal solution – the smaller the error, the better the model.

We can measure the error between two probability distributions using the cross-entropy loss function. For example, let’s assume we’re conducting a binary classification task (a classification task with two classes, 0 and 1).

In this instance, we must use binary cross-entropy, which is the average cross-entropy across all data samples:

Binary cross-entropy formula: L = −(1/N) Σ_i [y_i log(p_i) + (1 − y_i) log(1 − p_i)], where N is the number of samples, y_i is the true label, and p_i is the predicted probability of the positive class. [Source: Cross-Entropy Loss Function]

If we were to calculate the loss of a single data point where the correct value is y=1, here’s how our equation would look:

Binary cross-entropy for a single instance where the true value is 1: l = −log(p)

The predicted probability, p, determines the value of the loss, l. If p is high, the model is rewarded for making a correct prediction, which shows up as a low value of the loss, l.

However, a low predicted probability, p, indicates the model got it wrong, and the binary cross-entropy loss function reflects this by making the value of l higher.
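
A quick numerical check of this behaviour (probabilities chosen arbitrarily for illustration):

import numpy as np

# loss for a single positive example (y = 1) is -log(p)
for p in [0.9, 0.6, 0.1]:
    print(f"p = {p}: loss = {-np.log(p):.3f}")

# p = 0.9: loss = 0.105  (confident and correct -> small loss)
# p = 0.6: loss = 0.511
# p = 0.1: loss = 2.303  (confident and wrong  -> large loss)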

For a multi-class classification task, cross-entropy (or categorical cross-entropy as it is often referred to) can be simply extended as follows:

Categorical cross-entropy for a single instance: L = −Σ_c y_c log(p_c), where the sum runs over the C classes, y_c is 1 for the true class and 0 otherwise, and p_c is the predicted probability for class c.

In other words, to apply cross-entropy to a multi-class classification task, the loss for each class is calculated separately and then summed to determine the total loss.
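
As a quick worked example (with made-up probabilities), here is how that sum looks in NumPy for a single three-class instance:

import numpy as np

y_true = np.array([0, 1, 0])        # one-hot encoded true class
y_pred = np.array([0.2, 0.7, 0.1])  # predicted class probabilities

loss = -np.sum(y_true * np.log(y_pred))
print(loss)  # -log(0.7) ≈ 0.357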

Implementing Cross-Entropy Loss in PyTorch and TensorFlow

In this part of the tutorial, we will learn how to use the cross-entropy loss function in TensorFlow and PyTorch.

Let’s start by creating the dataset. We will use scikit-learn’s make_classification function to help us:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# create training data
X, y = make_classification(
    n_samples=10000,
    n_informative=10,
    random_state=2022
)

# split into train and test
X_new, X_test = X[:9000, :], X[9000:, ]
y_new, y_test = y[:9000], y[9000:]

X_train, X_val, y_train, y_val = train_test_split(
    X_new, y_new,
    test_size=0.3
)
print(f"Train data: {X_train.shape}\n\
Train labels: {y_train.shape}\n\
Test data: {X_test.shape}\n\
Test labels: {y_test.shape}")

"""
Train data: (6300, 20)
Train labels: (6300,)
Test data: (1000, 20)
Test labels: (1000,)
"""

Cross-entropy in TensorFlow

The model we will build will consist of an input layer, a hidden layer, and an output layer.

Since this is a binary classification task, we will use binary cross-entropy as our loss function.

import tensorflow as tf

# building and training the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(X_train.shape[1],), activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(
    loss="binary_crossentropy",  # binary cross-entropy loss function
    optimizer="adam",
    metrics=["accuracy"]
)

history = model.fit(X_train, y_train, epochs=20, validation_data=(X_val, y_val), verbose=0)
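
As an optional sanity check, we can also evaluate the trained model on the held-out test split we created earlier:

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {test_loss:.4f} - Test accuracy: {test_acc:.4f}")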

Next, we will plot the loss to see whether the model is improving – meaning the error decreases with each epoch until it can no longer improve.

import numpy as np
import matplotlib.pyplot as plt

# plotting the training and validation loss
fig, ax = plt.subplots(figsize=(8, 5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.xticks(np.arange(0, 20, step=1))
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

A plot of the loss of our neural network in TensorFlow.

Cross-entropy in PyTorch

In PyTorch, the inputs, outputs, and parameters of the model are encoded using tensors, which means we must convert our NumPy arrays to tensors. That’s the first thing we do in the code below; then we build the neural network and print its dimensions.

import torch
import torch.nn as nn

# convert numpy arrays to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).reshape(-1, 1)

# build the model
input_dim = X_train.shape[1]
hidden_dim = 10
output_dim = 1

model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, output_dim),
    nn.Sigmoid()
)

print(model)

"""
Sequential(
  (0): Linear(in_features=20, out_features=10, bias=True)
  (1): ReLU()
  (2): Linear(in_features=10, out_features=10, bias=True)
  (3): ReLU()
  (4): Linear(in_features=10, out_features=1, bias=True)
  (5): Sigmoid()
)
"""

Next, we define the binary cross-entropy loss function and optimizer:

import torch.optim as optim

loss_fn = nn.BCELoss()  # binary cross-entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)
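
Unlike Keras, PyTorch does not train the model with a single fit call, so we need a training loop that records the loss at each epoch. Below is a minimal full-batch training sketch (20 epochs, to mirror the TensorFlow run) that fills the train_loss and val_loss lists we plot next:

n_epochs = 20
train_loss, val_loss = [], []

for epoch in range(n_epochs):
    # training step
    model.train()
    optimizer.zero_grad()
    y_pred = model(X_train_tensor)
    loss = loss_fn(y_pred, y_train_tensor)
    loss.backward()
    optimizer.step()
    train_loss.append(loss.item())

    # validation step (no gradient updates)
    model.eval()
    with torch.no_grad():
        val_pred = model(X_val_tensor)
        val_loss.append(loss_fn(val_pred, y_val_tensor).item())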

And now to plot the loss:

# plotting the loss of the models
fig, ax = plt.subplots(figsize=(8,5))
plt.plot(train_loss)
plt.plot(val_loss)
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.xticks(np.arange(0,20, step=1))
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

A plot of the loss of our neural network in PyTorch.

Key Takeaways

Here’s a quick recap of what we’ve learned about cross-entropy loss:

  • Entropy calculates the degree of randomness or disorder within a system to measure the uncertainty of an event. If an outcome is certain, the measure of entropy will be low.
  • Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine learning to measure the performance of a classification model. Namely, it measures the difference between the true probability distribution of the labels and the probability distribution predicted by the model.
  • Binary cross-entropy is used when performing binary classification, and categorical cross-entropy is used for multi-class classification.
  • Cross-entropy is similar to KL divergence, but they serve different purposes. Cross-entropy is typically used in machine learning to evaluate the performance of a model, where the objective is to minimize the error between the predicted probability distribution and the true distribution. KL divergence is more useful in unsupervised learning tasks, where the objective is to uncover structure in data by minimizing the divergence between the true and learned data distributions.
