Understanding Dataset Difficulty

In machine learning, a few hard questions come up again and again. How difficult is it for a model to classify a given example? How easy is the test dataset for a given model? Is one dataset for a task like sentiment analysis harder than another for a given model? These considerations matter in everyday machine learning work.

Every day, I grapple with the question: "Have I crafted a dataset that truly challenges the capabilities of the model, or have I inadvertently made it too simplistic?" I also keep wondering whether I have partitioned the dataset in a way that genuinely challenges the model's learning process. In pursuit of answers to these questions, I read this paper, which offers insightful solutions.

<aside> 🤔 “Have I crafted a dataset that truly challenges the capabilities of the model, or have I inadvertently made it too simplistic?”

</aside>

In this paper, the authors use information-theoretic measures to address the questions surrounding dataset difficulty. Fear not: calculating them boils down to training classification models. They introduce a measure known as $\mathcal{V}$-usable information to do this.

Let's consider an illustrative example: "The food is delicious" (a classic choice for various scenarios), which clearly conveys a positive sentiment. Now, juxtapose this with the text "Uif gppe jt efmjdjpvt," which is actually an encryption of the initial sentence. The question that arises is: which of these two sentences provides more actionable information for a predictive AI model to discern the positive sentiment? Undoubtedly, it's the first one.
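To see what is going on in that example, here is a small Python sketch. The "encryption" is a Caesar shift that moves every letter forward by one position. The helper name `caesar_shift` is made up for illustration. Because the shift can be undone exactly, the second sentence still contains everything the first one does; it is just much harder for a model to use.

```python
def caesar_shift(text, shift=1):
    """Shift each ASCII letter by `shift` positions, wrapping around the alphabet.

    Hypothetical helper for illustration; non-letters are left unchanged.
    """
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


plain = "The food is delicious"
cipher = caesar_shift(plain, 1)

print(cipher)                    # Uif gppe jt efmjdjpvt
print(caesar_shift(cipher, -1))  # The food is delicious
```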

The $\mathcal{V}$-usable information is a single metric that quantifies this notion of usable information. It is elegant and can be calculated for any classification problem, as we will see next.

Formal Definitions and Calculating Them

Let's consider a scenario where we have a dataset of inputs denoted as $X$, which is used for predicting classes $Y$. Additionally, we are working with a family of models represented as $\mathcal{V}$, such as the transformer family of models. $\mathcal{V}$-usable information is loosely based on mutual information. It can be calculated in two steps.

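Step 1: Train a model on the dataset while deliberately providing it with empty input. Here, $\varnothing$ signifies that the model receives no input, and $f[\varnothing][Y]$ is the probability that the model $f$ assigns to the label $Y$. "inf" stands for infimum, the greatest lower bound of a set; informally, it is the lowest value the expected loss can reach over the model family.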

$$ H(Y) = \inf_{f \in \mathcal{V}} E[-\log_2 f[\varnothing][Y]] $$
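One way to build intuition for Step 1: with no input, the best a model can do is output the same label distribution for every example. Assuming the model family is flexible enough to output any fixed distribution (an assumption, not something guaranteed for every family), the infimum is reached by the label frequencies, and $H(Y)$ becomes the ordinary Shannon entropy of the labels. A minimal sketch with made-up labels and a hypothetical helper name:

```python
import math
from collections import Counter


def label_entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical label distribution.

    Hypothetical helper: this equals H(Y) from Step 1 only under the
    assumption that the model family can output any fixed distribution.
    """
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy, made-up sentiment labels: perfectly balanced, so 1 bit of uncertainty.
balanced = ["pos", "pos", "neg", "neg"]
print(label_entropy_bits(balanced))  # 1.0

# A skewed label set is easier to guess blindly, so its entropy is lower.
skewed = ["pos", "pos", "pos", "neg"]
print(label_entropy_bits(skewed) < label_entropy_bits(balanced))  # True
```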

Step 2: Train a model on the training dataset, this time with actual input:

$$ H(Y|X) = \inf_{f \in \mathcal{V}} E[-\log_2 f[X][Y]] $$

Then calculate the $\mathcal{V}$-usable information as the difference between the two:

$$ I_{\mathcal{V}} = H(Y) - H(Y|X) $$

<aside> 🤔 This is a reduction in entropy, which is information. Imagine you are learning a concept and are confused about it. Then you read a blog post, or someone helps you with the part you are stuck on. You now understand the concept better. Your confusion clears. You have gained some information.

</aside>
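To make the two-step recipe concrete, here is a minimal Python sketch. It assumes you have already trained two models from the same family, one on empty inputs and one on the real inputs, and that for each held-out example you recorded the probability each model gave to the correct label. The function names and the toy probabilities are invented for illustration. A trained model only approximates the infimum in the definitions, so treat the result as an estimate.

```python
import math


def v_entropy(gold_label_probs):
    """Average of -log2 p(correct label) over examples, in bits.

    `gold_label_probs` holds, per example, the probability a trained model
    assigned to the correct label (hypothetical input format).
    """
    return sum(-math.log2(p) for p in gold_label_probs) / len(gold_label_probs)


def v_usable_information(null_probs, input_probs):
    """Estimate I_V = H(Y) - H(Y|X) from the two models' probabilities."""
    h_y = v_entropy(null_probs)           # Step 1: model saw empty input
    h_y_given_x = v_entropy(input_probs)  # Step 2: model saw the real input
    return h_y - h_y_given_x


# Toy numbers, not real results: four held-out examples of a balanced binary task.
null_probs = [0.5, 0.5, 0.5, 0.5]    # the empty-input model can only guess
input_probs = [1.0, 1.0, 0.5, 0.5]   # the real-input model is sure on two examples

print(v_entropy(null_probs))                          # 1.0
print(v_entropy(input_probs))                         # 0.5
print(v_usable_information(null_probs, input_probs))  # 0.5
```

In this toy case the input removes half a bit of uncertainty about the label. A value near zero would mean the model family can extract almost nothing from the input, as with the encrypted sentence earlier.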