This is the first challenge in a series being hosted under the Winter In Data Science Initiative (Project ID: 62).
Welcome to the Linearly Sparse challenge! In this competition, your task is to build a predictive model that achieves strong accuracy on a target variable. The core idea is simple: a good model should be both accurate and efficient. Modern ML workflows often involve hundreds of input features, but real-world constraints frequently favor models that are compact, interpretable, and deployable. This competition evaluates both prediction quality and feature usage, rewarding solutions that achieve an effective balance.
You are provided with:
- A training dataset containing input features and corresponding target values
- A test dataset containing only input features
- A submission format where you will provide:
  - A vector of model weights
  - Predictions for the test set
Your goal is to minimize the score metric described below. The evaluation metric combines two aspects into a single score. Models that rely on many features may achieve high accuracy but will receive a penalty. Models that are sparse but inaccurate will also score poorly.
The best solutions find the right balance between the two.
Each submission contains:
- Your model’s weight vector
- Your predictions for the test set
The evaluation metric interprets this structure and computes a combined score based on:
- Mean squared error (MSE) of your predictions
- A complexity penalty based on how many features your model uses
Lower scores are better.
Head over to the Data, Evaluation, and Submission tabs to understand the dataset structure, scoring mechanism, and submission requirements. Good luck — and may the most efficient model win!
In many real-world machine learning tasks, large feature sets make models powerful — but also harder to interpret, more expensive to deploy, and more prone to overfitting. This competition challenges you to design a model that is both accurate and efficient, using only the features that truly matter. You are provided with a tabular dataset consisting of:
- A training set with input features and target values
- A test set with input features only
- A custom scoring function that considers:
  - How precisely you predict the target
  - How many features your model relies on
Your objective is to discover a model that makes strong predictions while minimizing unnecessary complexity. Models that use many features may perform well on accuracy alone, but they will receive a penalty during evaluation.
Your task is to find the sweet spot between predictive performance and feature economy.
Unlike standard regression challenges where only prediction accuracy matters, this competition rewards solutions that:
- Identify and leverage the most important features
- Avoid depending on the full feature set
- Balance accuracy with interpretability and simplicity

This setup mimics real-world constraints where computational budgets, latency requirements, or domain knowledge push practitioners toward more compact models.
This competition evaluates two aspects of your model:
- How accurately it predicts the target values for the test set, and
- How efficiently it uses the available features.

To capture this trade-off, we use a custom scoring function that combines:
- Mean Squared Error (MSE) on the test predictions
- A sparsity penalty based on how many features your model uses
The final score is computed as:

Score = MSE + alpha * (f / m)^p

where:
- f = number of features your model actually uses (the number of non-zero coefficients in your weight vector)
- m = total number of features
- alpha and p = penalty parameters (we choose both as 2)
- Lower scores are better
In other words:
Two models can have similar accuracy, but the one using fewer features will achieve a better score. In general, you will want a weight vector that is sparse (i.e., one that contains many zeros).
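For a rough local check, here is a minimal sketch of how such a score could be computed, assuming the combined form Score = MSE + alpha * (f / m)^p given above with alpha = p = 2 (the competition's own scorer may differ in detail):

```python
import numpy as np

def combined_score(y_true, y_pred, weights, alpha=2.0, p=2.0):
    """Sketch of the combined metric: prediction MSE plus a sparsity
    penalty that grows with the fraction of non-zero weights."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    f = np.count_nonzero(weights)   # features the model actually uses
    m = len(weights)                # total number of features (200 here)
    return mse + alpha * (f / m) ** p
```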
To make submission creation easier and to avoid formatting mistakes, we provide a create_submission utility.
You only need to supply:
- Your weight vector (length = 200)
- Your predictions on the test dataset (length = 500)
- A filename for the submission CSV

The helper function will automatically generate a properly formatted `filename.csv` file that you can upload. Check the `create_submission.txt` for the helper function.
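The official helper lives in `create_submission.txt`, so the call below is only a hypothetical usage sketch, assuming you have pasted that helper into your script; adapt the argument names and order to whatever the provided function actually expects:

```python
import numpy as np

# Hypothetical example values -- replace with your real model outputs.
weights = np.zeros(200)                    # length-200 weight vector
weights[[3, 17, 42]] = [1.2, -0.7, 0.4]    # e.g. only a few non-zero entries
test_predictions = np.zeros(500)           # length-500 test predictions

# Assumed argument order (weights, predictions, filename); verify against
# the helper defined in create_submission.txt.
create_submission(weights, test_predictions, "my_submission.csv")
```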
The dataset for this competition consists of a training set and a test set, both containing 200 numerical features. Your goal is to learn from the training data and generate predictions for the unseen test samples.
The training file contains:
- 1000 samples (rows)
- 200 numerical features (`x1` to `x200`)
- 1 target variable (`y`)

Each row represents a single observation with 200 input variables and a corresponding output value. You will use this data to train your model and compute the weight vector.

Columns in `train.csv`:
| Column Name | Description |
|---|---|
| y | Target variable to be predicted |
| x1–x200 | Numeric input features |
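For example, the training file can be loaded and split into a feature matrix and target vector with pandas (assuming it is saved locally as `train.csv`):

```python
import pandas as pd

train = pd.read_csv("train.csv")                 # 1000 rows: y, x1..x200
feature_cols = [f"x{i}" for i in range(1, 201)]  # x1 through x200

X_train = train[feature_cols].to_numpy()         # shape (1000, 200)
y_train = train["y"].to_numpy()                  # shape (1000,)
print(X_train.shape, y_train.shape)
```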
The test set contains:
- 500 samples (rows)
- The same 200 numerical features as the training set
- No target variable

Your model should generate a prediction for each of these 500 samples.

Columns in `test.csv`:

| Column Name | Description |
|---|---|
| x1–x200 | Numeric input features (same format as training data) |
- Feature meanings are not explicitly provided — part of the challenge is determining which features are important.
- The training and test sets share the same feature structure.
- You will submit a weight vector (based on these 200 features) and predictions for the 500 test rows.

The data has been prepared to encourage thoughtful feature selection and model design. Use the training data to learn effective patterns, and then apply your model to the test set to produce your final predictions.
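Because the metric rewards sparse weight vectors, an L1-regularized linear model is one natural starting point. The sketch below uses scikit-learn's Lasso as an illustration; the model choice and the regularization strength are assumptions, not part of the competition specification:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

feature_cols = [f"x{i}" for i in range(1, 201)]
X_train, y_train = train[feature_cols].to_numpy(), train["y"].to_numpy()
X_test = test[feature_cols].to_numpy()

# L1 regularization drives many coefficients to exactly zero, trading a
# little accuracy for a much smaller set of used features.
model = Lasso(alpha=0.1, max_iter=10_000)
model.fit(X_train, y_train)

weights = model.coef_                      # length-200 weight vector
test_predictions = model.predict(X_test)   # length-500 predictions
print("non-zero features:", np.count_nonzero(weights))
```

Note that Lasso fits an intercept by default; if the score treats your weight vector as a pure linear map of the 200 features, consider `fit_intercept=False`, and tune the regularization strength (e.g. with cross-validation) to balance MSE against the number of non-zero coefficients.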