- Overview
- Conversion Philosophy
- Detailed Feature Mappings
- Motivation and Rationale
- Technical Implementation
- Limitations and Assumptions
This document explains how user responses from the questionnaire are converted into numerical features that the XGBoost machine learning model uses to predict life expectancy. The conversion process bridges the gap between user-friendly questions and the statistical features derived from population-level health data.
The machine learning model was trained on regional population data from Finnish municipalities (2013-2021), where each data point represents aggregate statistics for thousands of people. However, we want to predict life expectancy for individual users based on their personal responses. This creates a fundamental challenge: how do we map individual characteristics to population-level features?
Our conversion strategy uses feature engineering to estimate what the population-level statistics would be for a region where everyone has similar characteristics to the user. This is not perfect, but it provides a reasonable approximation that allows us to leverage population-level patterns for individual predictions.
- Population Proxy Approach: We estimate population statistics as if the user lived in a region where their characteristics are typical
- Evidence-Based Scaling: Conversion factors are based on real-world relationships found in epidemiological research
- Conservative Estimates: When uncertain, we default to median population values to avoid over-prediction
- Correlation Preservation: We maintain the statistical relationships observed in the training data
The model expects features like "percentage of daily smokers in the region" (0-30%), not "does this person smoke" (yes/no). Direct substitution doesn't work because:
- The model learned relationships at the population level
- Feature scales and distributions would be completely different
- Statistical interactions between features would break down
Model Features: "Average age, both sexes"; "Average age, men"; "Average age, women"
Feature Importance: ~0.010 (moderate)
- What is your date of birth?
# Subtract one year if this year's birthday has not happened yet
age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
features['Average age, both sexes'] = float(age)
features['Average age, men'] = float(age)
features['Average age, women'] = float(age)
Age is the most straightforward conversion. We use the user's actual age directly as a proxy for the "average age" in their demographic context. This is reasonable because age-related health risks are highly individual.
Model Features: "1. EARNED INCOME, mean"; "Disposable cash income, median"
Feature Importance: 0.614 (HIGHEST - most important factor!)
- What is your annual income (before taxes)? (€10,000 - €100,000)
income = user_data['income']
features['1. EARNED INCOME, mean'] = float(income)
features['Disposable cash income, median'] = float(income * 1.15)
Income is the single strongest predictor of life expectancy in our model. The conversion is direct for earned income. Disposable income is estimated as 115% of earned income, based on typical Finnish tax structures where disposable income is slightly higher due to various benefits and deductions.
Why Income Matters:
- Access to quality healthcare and nutrition
- Ability to live in safer neighborhoods
- Lower stress levels
- Better health literacy and preventive care
- Access to fitness facilities and healthy food options
Model Features:
- Share of persons aged 15 or over with tertiary level qualification, %
- Share of persons aged 15 or over with at least upper secondary qualification, %
- Share of persons aged 15 or over without upper secondary qualification, %
Feature Importance: 0.028 (low-moderate)
- What is your highest level of education?
- Less than high school → 10%
- High school diploma → 20%
- Bachelor's degree → 35%
- Master's degree or higher → 50%
education_map = {
    'none': 10.0,         # Low tertiary education region
    'high_school': 20.0,  # Below average
    'bachelor': 35.0,     # Above average
    'master_plus': 50.0   # High education region
}
tertiary_pct = education_map[education]
Education percentages represent what proportion of the population in a "typical" region has tertiary education. We map individual education to population percentages based on Finnish educational distribution data:
- ~10%: Regions with low educational attainment
- ~27%: National median
- ~50%: Urban areas with universities
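The lookup above can be sketched as a small helper. The function name `tertiary_share` and the fallback to the ~27% national median for unrecognised answers are assumptions made for illustration, in the spirit of the "Conservative Estimates" principle. They are not confirmed behaviour of the converter.

```python
# National median tertiary share quoted in this guide (~27%).
NATIONAL_MEDIAN_TERTIARY_PCT = 27.0

education_map = {
    "none": 10.0,         # Low tertiary education region
    "high_school": 20.0,  # Below average
    "bachelor": 35.0,     # Above average
    "master_plus": 50.0,  # High education region
}


def tertiary_share(education: str) -> float:
    """Return the proxy tertiary-education share (%) for a questionnaire answer."""
    # Assumption: unknown or missing answers fall back to the national median.
    return education_map.get(education, NATIONAL_MEDIAN_TERTIARY_PCT)


print(tertiary_share("bachelor"))           # 35.0
print(tertiary_share("prefer_not_to_say"))  # 27.0 (median fallback)
```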
Why Education Matters:
- Health literacy and awareness
- Better health-related decision making
- Higher income correlation
- Access to information and resources
- Social networks that promote healthy behaviors
Model Feature: daily_smokers (percentage of daily smokers in population)
Feature Importance: 0.027 (moderate)
- Do you smoke?
- Never / Not currently → 5%
- Occasionally → 12%
- Daily → 25%
smoking_map = {
    'never': 5.0,        # Low smoking region (healthy lifestyle area)
    'occasional': 12.0,  # Near median (13.6% national median)
    'daily': 25.0        # High smoking region
}
We map personal smoking status to regional smoking prevalence. A daily smoker likely lives in or represents characteristics of a region where smoking is more common (25%), while non-smokers represent healthier regions (5%). The national median is 13.6%.
Why Smoking Matters:
- Single most preventable cause of premature death
- Increases risk of cancer, heart disease, stroke
- Reduces lung capacity and immune function
- Accelerates aging processes
Model Features: alcohol_sales (liters per capita), binge_drinking (%)
Feature Importance: 0.035 for sales, 0.018 for binge drinking
- How many alcoholic drinks do you consume per week? (0-30)
alcohol_units = user_data['alcohol_units']
features['alcohol_sales'] = alcohol_units * 0.5  # Scale to regional sales
features['binge_drinking'] = min(alcohol_units * 1.5, 25.0)  # Binge percentage
- Alcohol sales: Scaled down by 0.5 to convert weekly personal consumption to annual regional per-capita sales (in liters)
- Binge drinking: Higher multiplier (1.5x) because people who drink more are more likely to engage in binge drinking. Capped at 25% (maximum observed in data)
Epidemiological Basis:
- 0 drinks/week → 0 L/capita, 0% binge drinking (very healthy region)
- 5 drinks/week → 2.5 L/capita, 7.5% binge drinking (near median)
- 20 drinks/week → 10 L/capita, 25% binge drinking (high-risk region)
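The three reference points above follow directly from the two formulas. A minimal sketch that reproduces them is shown here. The wrapper function `convert_alcohol` is a hypothetical name and is not taken from the implementation.

```python
def convert_alcohol(drinks_per_week: float) -> dict:
    """Map weekly drinks to the two regional alcohol features."""
    return {
        # 0.5 scaling factor from the conversion above
        "alcohol_sales": drinks_per_week * 0.5,
        # 1.5x multiplier, capped at the 25% maximum observed in the data
        "binge_drinking": min(drinks_per_week * 1.5, 25.0),
    }


for drinks in (0, 5, 20):
    print(drinks, convert_alcohol(drinks))
```

The binge-drinking cap starts to bind above roughly 16.7 drinks per week (25 / 1.5), so every answer from 17 to 30 maps to the same 25%.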
Model Feature: physical_activity (percentage engaging in regular activity)
Feature Importance: 0.0009 (very low, surprisingly!)
- How many days per week do you exercise for at least 30 minutes? (0-7)
exercise_days = user_data['exercise_days']
features['physical_activity'] = (exercise_days / 7.0) * 40
Converts weekly exercise frequency to a percentage (0-40%). The scaling assumes:
- 0 days → 0% (inactive region)
- 3.5 days → 20% (median activity level)
- 7 days → 40% (highly active region)
Note: Despite low importance in our model, extensive research shows physical activity strongly impacts health. The low importance may reflect:
- Physical activity is correlated with other factors (income, education)
- Regional data may not capture individual variation well
- Finnish population is generally active (high baseline)
Model Feature: disability_ratio (percentage with disability benefits)
Feature Importance: 0.056 (moderate)
- Do you have any of these chronic health conditions? (checkboxes)
- Diabetes, Heart disease, High blood pressure, Respiratory disease, Arthritis, Cancer, Other
chronic_conditions = user_data['chronic_conditions']
features['disability_ratio'] = min(len(chronic_conditions) * 3.0, 20.0)
Each chronic condition adds 3 percentage points to the disability ratio, capped at 20% (max observed):
- 0 conditions → 0% (healthy region)
- 2-3 conditions → 6-9% (near median of 8.3%)
- 6 conditions → 18%
- 7 conditions → 20% (capped; maximum disability rate)
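As a quick check of where the cap takes effect, the formula can be evaluated for each possible number of ticked checkboxes. This is a sketch only. The function name `disability_ratio` and the placeholder condition labels are illustrative.

```python
def disability_ratio(chronic_conditions: list) -> float:
    """3 percentage points per reported condition, capped at 20%."""
    return min(len(chronic_conditions) * 3.0, 20.0)


# Placeholder labels; only the number of conditions matters for the formula.
for count in range(8):
    conditions = [f"condition_{i}" for i in range(count)]
    print(count, disability_ratio(conditions))
```

With this formula the 20% cap is reached only when all seven checkboxes are ticked (7 × 3 = 21, capped to 20), while six conditions give 18%.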
The 3% per condition factor is based on:
- Finnish disability benefit statistics
- Disease burden in the population
- Overlap between multiple conditions
Model Features: mental_health, severe_mental_strain (%)
Feature Importance: 0.003 for mental_health, 0.002 for severe_mental_strain
- How would you rate your mental and emotional well-being? (1-10 scale)
mental_health_score = user_data['mental_health_score']  # User: 1=poor, 10=excellent
features['mental_health'] = (11 - mental_health_score) * 15  # Model: higher=worse
features['severe_mental_strain'] = (11 - mental_health_score) * 2  # Percentage
IMPORTANT: The model features have a negative correlation with life expectancy (higher values = worse outcomes). Therefore:
- User score of 10 (excellent) → mental_health = 15 (low strain)
- User score of 5 (moderate) → mental_health = 90 (medium strain)
- User score of 1 (poor) → mental_health = 150 (high strain)
The inversion formula (11 - score) converts from "higher is better" (user perspective) to "higher is worse" (model perspective).
Scaling Factors:
- Mental health: Multiplied by 15 to match the model's training range (15-150)
- Severe strain: Multiplied by 2 to represent percentage of population with severe strain (2-20%)
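The inversion and scaling can be sketched together as one function. The function name `convert_mental_health` and the input validation are assumptions added for illustration. The guide itself does not say how out-of-range scores are handled.

```python
def convert_mental_health(score: int) -> dict:
    """Invert a 1-10 well-being score into the model's 'higher is worse' features."""
    # Assumption: reject scores outside the questionnaire's 1-10 scale.
    if not 1 <= score <= 10:
        raise ValueError("mental_health_score must be between 1 and 10")
    inverted = 11 - score  # 10 (excellent) -> 1, 1 (poor) -> 10
    return {
        "mental_health": inverted * 15,        # range 15-150
        "severe_mental_strain": inverted * 2,  # range 2-20 (%)
    }


for score in (10, 5, 1):
    print(score, convert_mental_health(score))
```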
Model Feature: percentage_happy (% satisfied with life)
Feature Importance: 0.004 (low)
- Overall, how satisfied are you with your life? (1-10 scale)
happiness_score = user_data['happiness_score']
features['percentage_happy'] = (happiness_score / 10.0) * 80
Linear scaling from user score (1-10) to population percentage (8-80%):
- Score of 5 → 40% happy (below median)
- Score of 7 → 56% happy (near median of 52%)
- Score of 10 → 80% happy (very satisfied region)
The 80% cap reflects that even in the happiest regions, not everyone reports being satisfied.
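A short sketch of the scaling makes the end points explicit. The function name `percentage_happy` is hypothetical and simply mirrors the feature name.

```python
def percentage_happy(happiness_score: float) -> float:
    """Scale a 1-10 life-satisfaction score linearly onto a 0-80% axis."""
    return (happiness_score / 10.0) * 80


for score in (1, 5, 7, 10):
    # Rounded for display to hide any floating-point noise.
    print(score, round(percentage_happy(score), 1))
```

Because the lowest answer on the questionnaire is 1, the smallest value this formula can produce is 8%, not 0%.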
Model Feature: work_until_retired (average years until retirement)
Feature Importance: 0.001 (low)
- What is your current employment status?
- Employed / Unemployed / Retired / Student
retirement_age = 65  # Finnish retirement age
if employment_status == 'retired':
    features['work_until_retired'] = 0
elif employment_status == 'employed':
    features['work_until_retired'] = max(retirement_age - age, 0)
else:  # unemployed, student
    features['work_until_retired'] = 30  # Average working years remaining
- Retired: 0 years (already retired)
- Employed: Calculate remaining working years until age 65
- Unemployed/Student: Use 30 as a placeholder (median in training data is 27.75)
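The three branches can be wrapped in a small function for testing. This is a sketch under the assumption that status values are the lowercase strings used in the snippet above. The function and constant names are illustrative.

```python
RETIREMENT_AGE = 65           # Finnish retirement age used in this guide
DEFAULT_YEARS_REMAINING = 30  # Placeholder for unemployed/student


def work_until_retired(employment_status: str, age: int) -> int:
    """Estimate years of work remaining from employment status and age."""
    if employment_status == "retired":
        return 0
    if employment_status == "employed":
        # Never negative: people working past 65 map to 0.
        return max(RETIREMENT_AGE - age, 0)
    return DEFAULT_YEARS_REMAINING


print(work_until_retired("employed", 40))  # 25
print(work_until_retired("employed", 70))  # 0
print(work_until_retired("student", 22))   # 30
```

Note that the placeholder ignores age, so an unemployed 60-year-old and a 20-year-old student both receive 30 years.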
Why Employment Matters:
- Social engagement and purpose
- Financial security
- Structured routine
- Social connections
- Mental stimulation
Several features are derived from combinations of user inputs:
base_obesity = 20.0  # National median
exercise_factor = (7 - exercise_days) * 1.0
features['obesity_rate'] = min(base_obesity + exercise_factor, 35.0)
Rationale: Less exercise correlates with higher obesity. Each day of inactivity adds ~1 percentage point to the obesity rate.
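Evaluating the derived feature over every valid answer shows the range it can actually take. The sketch below assumes `exercise_days` is an integer from 0 to 7, as in the questionnaire. The function name `obesity_rate` is illustrative.

```python
def obesity_rate(exercise_days: int) -> float:
    """Derive a regional obesity rate (%) from weekly exercise days."""
    base_obesity = 20.0  # National median
    exercise_factor = (7 - exercise_days) * 1.0
    return min(base_obesity + exercise_factor, 35.0)


rates = {days: obesity_rate(days) for days in range(8)}
print(rates)
print(max(rates.values()))  # 27.0
```

For inputs in the 0-7 range the result stays between 20% and 27%, so the 35% cap acts only as a safeguard against out-of-range input.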
The model learned that in regions where:
- Income is high
- Education is high
- Smoking is low
- People exercise more
...life expectancy tends to be higher. Our conversion preserves these relationships by mapping individual characteristics to corresponding regional characteristics.
Conversion factors are based on:
- Finnish national statistics (THL, Statistics Finland)
- Epidemiological research linking individual behaviors to population health
- Regional health data patterns observed during model training
We use medians for features that can't be reasonably estimated from individual data (e.g., population density, healthcare worker availability). This prevents the model from making predictions based on unrealistic feature combinations.
This approach is based on ecological inference - using population-level data to make individual-level predictions. While not perfect, it's justified because:
- Contextual Effects: Individual health is influenced by community characteristics
- Behavioral Clustering: People with similar characteristics tend to live in similar areas
- Resource Availability: Regional resources (healthcare, education) are shared by individuals in that region
The conversion process follows these steps:
def get_feature_vector(user_data):
    # 1. Calculate derived values (age, etc.)
    age = calculate_age(user_data['birth_date'])
    # 2. Initialize with population medians
    features = feature_medians.copy()
    # 3. Override with user-specific conversions
    features['1. EARNED INCOME, mean'] = user_data['income']
    features['daily_smokers'] = smoking_map[user_data['smoking']]
    # ... (continue for all mapped features)
    # 4. Keep medians for non-mappable features
    # (population density, healthcare access, etc. stay at median)
    # 5. Return ordered vector matching model's expected input
    return [features[name] for name in feature_names]
We validate conversions by:
- Range checking: Ensuring converted values fall within training data ranges
- Correlation preservation: Verifying that converted features maintain expected relationships
- Sanity testing: Comparing predictions for known scenarios to expected outcomes
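The range check could look roughly like the sketch below. The upper bounds are the maxima quoted elsewhere in this guide (30% daily smokers, 25% binge drinking, 20% disability ratio). The lower bounds of 0, the `TRAINING_RANGES` table, and the function name `find_out_of_range` are assumptions. In practice the ranges would be read from the training data or the feature specification file.

```python
# Assumed (low, high) bounds; upper limits are the maxima quoted in this guide.
TRAINING_RANGES = {
    "daily_smokers": (0.0, 30.0),
    "binge_drinking": (0.0, 25.0),
    "disability_ratio": (0.0, 20.0),
}


def find_out_of_range(features: dict, training_ranges: dict = TRAINING_RANGES) -> dict:
    """Return {feature: (value, low, high)} for every value outside its range."""
    violations = {}
    for name, (low, high) in training_ranges.items():
        if name in features and not (low <= features[name] <= high):
            violations[name] = (features[name], low, high)
    return violations


converted = {"daily_smokers": 25.0, "binge_drinking": 30.0, "disability_ratio": 9.0}
print(find_out_of_range(converted))
```

An empty result means every checked feature lies inside its range. Features without a known range are skipped, not flagged.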
Issue: Population-level relationships don't always hold for individuals
Mitigation: Focus on well-established individual-level risk factors (smoking, income, education)
Issue: Relationships may not be perfectly linear (e.g., 2x alcohol ≠ 2x risk)
Mitigation: Use caps and non-linear scaling where research supports it
Issue: Genetics, family history, specific medical conditions not captured
Mitigation: Clear disclaimers; emphasize educational purpose
Issue: Model trained on Finnish data may not generalize to other populations
Mitigation: Document data source; target Finnish users primarily
Issue: Current behaviors may not reflect lifetime patterns
Mitigation: Interpret as "current trajectory" prediction, not destiny
Issue: Complex interactions between features may not be perfectly preserved
Mitigation: XGBoost model captures non-linear interactions during training
Issue: Some model features (population density, healthcare access) use median values
Impact: These represent "average" conditions; predictions assume typical Finnish context
This conversion approach balances practical constraints (individual questionnaire → population model) with scientific rigor. While imperfect, it provides reasonable estimates that:
- Educate users about factors influencing longevity
- Preserve statistical relationships learned by the model
- Generate plausible predictions based on established health research
- Acknowledge limitations through appropriate disclaimers
The goal is not perfect individual prediction (which is impossible), but rather to demonstrate how lifestyle and socioeconomic factors collectively influence life expectancy based on population-level patterns.
- THL (Finnish Institute for Health and Welfare): https://sotkanet.fi/sotkanet/en/haku
- Statistics Finland: https://pxdata.stat.fi/PXWeb/pxweb/en/StatFin/
- Regional health indicators methodology (THL)
- Ecological inference in epidemiology
- XGBoost for survival prediction
- Population health determinants research
- README.md: Project overview
- backend/utils/converter.py: Implementation code
- backend/model/feature_info.json: Feature specifications
- FEATURE_CONVERSION_GUIDE.md: This document