This is the first challenge in a series being hosted under the Winter In Data Science Initiative (Project ID: 62).
Welcome to the Linearly Sparse challenge! In this competition, your task is to build a predictive model that achieves strong accuracy on a target variable. The core idea is simple: a good model should be both accurate and efficient. Modern ML workflows often involve hundreds of input features, but real-world constraints frequently favor models that are compact, interpretable, and deployable. This competition evaluates both prediction quality and feature usage, rewarding solutions that achieve an effective balance.
You are provided with:
- A training dataset containing input features and corresponding target values
- A test dataset containing only input features
- A submission format where you will provide:
  - A vector of model weights
  - Predictions for the test set
Your goal is to minimize the score metric described below. The evaluation metric combines two aspects into a single score. Models that rely on many features may achieve high accuracy but will receive a penalty. Models that are sparse but inaccurate will also score poorly.
The best solutions find the right balance between the two.
Each submission contains:
- Your model’s weight vector
- Your predictions for the test set
The evaluation metric interprets this structure and computes a combined score based on:
- Mean squared error (MSE) of your predictions
- A complexity penalty based on how many features your model uses
Lower scores are better.
Head over to the Data, Evaluation, and Submission tabs to understand the dataset structure, scoring mechanism, and submission requirements. Good luck — and may the most efficient model win!
In many real-world machine learning tasks, large feature sets make models powerful — but also harder to interpret, more expensive to deploy, and more prone to overfitting. This competition challenges you to design a model that is both accurate and efficient, using only the features that truly matter. You are provided with a tabular dataset consisting of:
- A training set with input features and target values
- A test set with input features only
- A custom scoring function that considers:
  - How precisely you predict the target
  - How many features your model relies on
Your objective is to discover a model that makes strong predictions while minimizing unnecessary complexity. Models that use many features may perform well on accuracy alone, but they will receive a penalty during evaluation.
Your task is to find the sweet spot between predictive performance and feature economy.
Unlike standard regression challenges where only prediction accuracy matters, this competition rewards solutions that:
- Identify and leverage the most important features
- Avoid depending on the full feature set
- Balance accuracy with interpretability and simplicity

This setup mimics real-world constraints where computational budgets, latency requirements, or domain knowledge push practitioners toward more compact models.
This competition evaluates two aspects of your model:
- How accurately it predicts the target values for the test set, and
- How efficiently it uses the available features.

To capture this trade-off, we use a custom scoring function that combines:
- Mean Squared Error (MSE) on the test predictions
- A sparsity penalty based on how many features your model uses
The final score is computed as:

Score = MSE + alpha * (f / m)^p

where:
- f = number of features your model actually uses (the number of non-zero coefficients in your weight vector)
- m = total number of features
- alpha and p = penalty parameters (we choose both as 2)
- Lower scores are better
In other words:
Two models can have similar accuracy, but the one using fewer features will achieve a better score. In general, you will want a weight vector that is sparse (i.e., one that contains many zeros).
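For a rough local check, here is a minimal sketch of how such a score could be computed, assuming the combined form Score = MSE + alpha * (f / m)^p given above with alpha = p = 2 (the competition's own scorer may differ in detail):

```python
import numpy as np

def combined_score(y_true, y_pred, weights, alpha=2.0, p=2.0):
    """Sketch of the combined metric: prediction MSE plus a sparsity
    penalty that grows with the fraction of non-zero weights."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    f = np.count_nonzero(weights)   # features the model actually uses
    m = len(weights)                # total number of features (200 here)
    return mse + alpha * (f / m) ** p
```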
To make submission creation easier and to avoid formatting mistakes, we provide a create_submission utility.
You only need to supply:
- Your weight vector (length = 200)
- Your predictions on the test dataset (length = 500)
- A filename for the submission CSV

The helper function will automatically generate a properly formatted `filename.csv` file that you can upload. Check the `create_submission.txt` for the helper function.
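The official helper lives in `create_submission.txt`, so the call below is only a hypothetical usage sketch, assuming you have pasted that helper into your script; adapt the argument names and order to whatever the provided function actually expects:

```python
import numpy as np

# Hypothetical example values -- replace with your real model outputs.
weights = np.zeros(200)                    # length-200 weight vector
weights[[3, 17, 42]] = [1.2, -0.7, 0.4]    # e.g. only a few non-zero entries
test_predictions = np.zeros(500)           # length-500 test predictions

# Assumed argument order (weights, predictions, filename); verify against
# the helper defined in create_submission.txt.
create_submission(weights, test_predictions, "my_submission.csv")
```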
The dataset for this competition consists of a training set and a test set, both containing 200 numerical features. Your goal is to learn from the training data and generate predictions for the unseen test samples.
The training file contains:
- 1000 samples (rows)
- 200 numerical features (`x1` to `x200`)
- 1 target variable (`y`)

Each row represents a single observation with 200 input variables and a corresponding output value. You will use this data to train your model and compute the weight vector.

Columns in `train.csv`:
| Column Name | Description |
|---|---|
| y | Target variable to be predicted |
| x1–x200 | Numeric input features |
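For example, the training file can be loaded and split into a feature matrix and target vector with pandas (assuming it is saved locally as `train.csv`):

```python
import pandas as pd

train = pd.read_csv("train.csv")                 # 1000 rows: y, x1..x200
feature_cols = [f"x{i}" for i in range(1, 201)]  # x1 through x200

X_train = train[feature_cols].to_numpy()         # shape (1000, 200)
y_train = train["y"].to_numpy()                  # shape (1000,)
print(X_train.shape, y_train.shape)
```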
The test set contains:
- 500 samples (rows)
- The same 200 numerical features as the training set
- No target variable

Your model should generate a prediction for each of these 500 samples.

Columns in `test.csv`:

| Column Name | Description |
|---|---|
| x1–x200 | Numeric input features (same format as training data) |
- Feature meanings are not explicitly provided — part of the challenge is determining which features are important.
- The training and test sets share the same feature structure.
- You will submit a weight vector (based on these 200 features) and predictions for the 500 test rows.

The data has been prepared to encourage thoughtful feature selection and model design. Use the training data to learn effective patterns, and then apply your model to the test set to produce your final predictions.
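Because the metric rewards sparse weight vectors, an L1-regularized linear model is one natural starting point. The sketch below uses scikit-learn's Lasso as an illustration; the model choice and the regularization strength are assumptions, not part of the competition specification:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

feature_cols = [f"x{i}" for i in range(1, 201)]
X_train, y_train = train[feature_cols].to_numpy(), train["y"].to_numpy()
X_test = test[feature_cols].to_numpy()

# L1 regularization drives many coefficients to exactly zero, trading a
# little accuracy for a much smaller set of used features.
model = Lasso(alpha=0.1, max_iter=10_000)
model.fit(X_train, y_train)

weights = model.coef_                      # length-200 weight vector
test_predictions = model.predict(X_test)   # length-500 predictions
print("non-zero features:", np.count_nonzero(weights))
```

Note that Lasso fits an intercept by default; if the score treats your weight vector as a pure linear map of the 200 features, consider `fit_intercept=False`, and tune the regularization strength (e.g. with cross-validation) to balance MSE against the number of non-zero coefficients.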