Skip to content

g8rdier/airbnb-berlin-superhost-analysis

Repository files navigation

Berlin Airbnb Superhost Pricing Differentiation: Multi-methodological Statistical Analysis

Abstract

This study presents a comprehensive statistical analysis of Superhost pricing differentiation strategies in Berlin's Airbnb market using a multi-methodological approach. The research combines Welch's t-tests, quantile regression (τ = 0.25/0.5/0.75/0.9), tertile-based market segmentation, and linear regression models with 70/30 train-test validation to reveal sophisticated pricing strategies and demonstrate practical applicability.

Academic Context: Business Informatics, 5th Semester - Data Analytics & Big Data
Institution: IU International University of Applied Sciences
Research Supervisor: Prof. Dr. rer. nat. Michael Barth
Student: Gregor Kobilarov (Matriculation No.: 4233113)
Dataset: InsideAirbnb Berlin (n=8,783, July 2025)
Primary Contribution: Empirical documentation of inverse pricing differentiation with statistical validation and predictive utility

Research Objective and Hypotheses

Research Question: Empirical analysis of Airbnb Superhost pricing differentiation strategies in Berlin's market, focusing on post-COVID market dynamics that enable analysis of evolutionary strategy adaptations.

H₀: Superhost premiums are identical between private rooms and entire apartments
H₁: Superhost premiums differ significantly between private rooms and entire apartments

Empirical Results: H₀ rejected with strong statistical evidence:

  • Private Rooms: -22.19% (95% Confidence Interval: [-27.33%, -15.06%])
  • Entire Apartments: +16.79% (95% Confidence Interval: [18.52%, 29.85%])
  • Difference: 38.98 percentage points (p < 2.2e-16, Cohen's d = -0.559)

Methodology

Sample Design

  • 3-Sigma Outlier Removal: 14,187→8,783 observations
  • Analysis Pipeline: (1) Welch's t-tests; (2) Quantile regression (25%/50%/75%/90%); (3) Tertile-based market segmentation; (4) Linear regression models
  • Train-Test Validation: 70/30 stratified split (R²=0.0087, RMSE=€399.05)
  • Robustness: Bootstrap confidence intervals (1,000 iterations)

Stage 1: Statistical Hypothesis Testing

  • Method: Welch's t-tests with unequal variance assumption
  • Sample: 8,783 listings after rigorous data cleaning
  • Statistical Evidence: t=-6.78 and t=8.37, both p < 2.2e-16
  • Primary Finding: Inverse pricing pattern with substantial effect sizes

Stage 2: Quantile Regression Analysis

  • Method: Robust regression across price distribution quantiles (τ = 0.25, 0.5, 0.75, 0.9)
  • Sample: Complete dataset of 8,808 listings including outliers
  • Key Insight: Consistent inverse pattern across complete price distribution
  • Methodological Innovation: Outlier-inclusive analysis ensuring complete market coverage

Stage 3: Interaction Effects Analysis

  • Method: Tertile-based price segmentation with statistical significance testing
  • Approach: Systematic analysis of 6 market segments (3 price tiers × 2 accommodation types)
  • Key Finding: Budget segments (+5.3% to +14.0%) vs. luxury segments (-19.9% to -40.5%)
  • Statistical Validation: 4/6 segments statistically significant

Stage 4: Predictive Model Validation

  • Method: Supervised learning with stratified 70/30 train-test split
  • Performance Metrics: R² = 0.0087, RMSE = €399.05, MAE = €88.93
  • Validation Approach: Out-of-sample validation confirms practical relevance
  • Innovation: Machine learning confirmation of statistical findings

Empirical Results

Primary Findings

Inverse pricing differentiation with substantial effect sizes:

  • Private Rooms: -22.19% (95% Confidence Interval: [-27.33%, -15.06%])
  • Entire Apartments: +16.79% (95% Confidence Interval: [18.52%, 29.85%])

Market Segmentation Analysis

Budget Segments: Private rooms show +8.3% to +14.0% Superhost premiums Luxury Segments: Private rooms show -30.2% to -40.5% Superhost discounts Entire Apartments: Consistent premiums across price segments (+5.3% to +18.7%)

Statistical Validation

Inference Validation: Welch's t-tests (t=-6.78 and t=8.37, p < 2.2e-16), bootstrap confidence intervals, and quantile estimators converge to consistent conclusions. Performance: Out-of-sample validation (RMSE=€399.05) confirms practical relevance.

Statistical Evidence Summary

Analysis Method Technique Key Finding Statistical Significance
Hypothesis Testing Welch's t-test Inverse pricing pattern p < 2.2e-16
Quantile Regression Robust estimation Consistent across price spectrum Significant across all quantiles
Interaction Effects Segmented analysis Strategic differentiation 4/6 segments statistically significant
Predictive Modeling Machine learning validation Practical applicability R² = 0.009 on test data

Technical Implementation

Statistical Methods

  • Hypothesis Testing: Welch's t-tests with bootstrap confidence intervals
  • Robust Analysis: Quantile regression with outlier integration
  • Interaction Modeling: Multi-factor ANOVA with price tier segmentation
  • Predictive Modeling: Linear regression with cross-validation

Software Environment

  • Platform: R Statistical Computing Environment (v4.3+)
  • Core Packages: quantreg, caret, effsize, car, psych, ggplot2, patchwork
  • Visualization: Publication-quality graphics with academic formatting
  • Reproducibility: Git version control with comprehensive documentation

Repository Architecture

airbnb-berlin-superhost-analysis/
├── data/
│   ├── raw/                    # InsideAirbnb source data (14,187 listings)
│   └── processed/              # Analysis-ready datasets (8,783 listings)
├── scripts/                    # Complete analysis pipeline (8 scripts)
│   ├── 01_data_import.R        # Data acquisition and validation
│   ├── 02_data_cleaning.R      # 3-sigma outlier removal and preprocessing
│   ├── 03_exploratory_analysis.R    # Descriptive statistics and diagnostics
│   ├── 04_hypothesis_testing.R      # Formal statistical inference
│   ├── 05_visualization.R           # Publication-quality graphics
│   ├── 06_quantile_regression_analysis.R  # Robust regression analysis
│   ├── 07_interaction_effects_analysis.R  # Strategic segmentation analysis
│   └── 08_predictive_model_analysis.R     # Machine learning validation
├── outputs/
│   ├── figures/                # Statistical visualizations (12 figures + PDF)
│   │   ├── 01_premium_comparison.png
│   │   ├── 02_price_distributions.png
│   │   ├── 03_sample_sizes.png
│   │   ├── 04_statistical_evidence.png
│   │   ├── 05_combined_academic_presentation.png
│   │   ├── 06_quantile_regression_analysis.png
│   │   ├── 07_interaction_effects_heatmap.png
│   │   ├── 07_interaction_effects_context.png
│   │   ├── 08_prediction_accuracy.png
│   │   ├── 08_model_validation_dashboard.png
│   │   ├── 08_feature_importance.png
│   │   └── visualization_compilation.pdf
│   ├── tables/                 # Comprehensive results (26+ CSV files)
│   │   ├── final_project_completion_summary.csv
│   │   ├── model_performance_comparison.csv
│   │   ├── feature_importance_analysis.csv
│   │   ├── interaction_effects_analysis.csv
│   │   └── [22+ additional analytical tables]
│   └── results/                # Research conclusions and summaries (6 files)
│       ├── academic_summary.txt
│       ├── hypothesis_test_results.csv
│       ├── predictive_model_validation_summary.txt
│       └── [3+ additional result summaries]
└── README.md                   # Project documentation

Data Processing Methodology

This research implements dual data cleaning approaches optimized for distinct analytical objectives:

Conservative Cleaning Pipeline (02_data_cleaning.R)

  • Objective: Statistical inference and hypothesis testing
  • Approach: Strict outlier removal using IQR-based bounds
  • Resulting Dataset: 8,783 listings with rigorous quality control
  • Statistical Rationale: Ensures robustness for inferential analyses through systematic exclusion of extreme values and incomplete records
  • Applications: t-tests, quantile regression, interaction effects analysis

Analytical Components

01_data_import.R - Data Acquisition Module

Function: Downloads and validates InsideAirbnb Berlin dataset
Technical Implementation:

  • Automated retrieval of listings.csv from InsideAirbnb repository
  • Data quality validation protocol (missing values, data type verification)
  • Source documentation generation with descriptive statistics
  • Standardized directory structure initialization
  • Output: Raw dataset (20,000+ listings) ready for preprocessing
  • Execution Time: Approximately 30 seconds

02_data_cleaning.R - Data Preprocessing Module

Function: Comprehensive data preparation for statistical analysis and modeling
Technical Implementation:

  • IQR-based outlier detection and removal
  • Systematic exclusion of incomplete records
  • Categorical variable standardization and price normalization
  • Accommodation type filtering (Entire Place, Private Room focus)
  • Comprehensive data quality assessment
  • Output: 8,783 listings optimized for analysis and modeling
  • Execution Time: Approximately 45 seconds

03_exploratory_analysis.R - Descriptive Analysis Module

Function: Comprehensive statistical exploration and pattern identification
Technical Implementation:

  • Multi-group summary statistics generation
  • Distribution analysis with normality testing protocols
  • Correlation matrix construction and diagnostic testing
  • Confounding variable identification procedures
  • Statistical profiling with publication-ready tables
  • Output: 15+ statistical tables and diagnostic visualizations
  • Execution Time: Approximately 60 seconds

04_hypothesis_testing.R - Statistical Inference Module

Function: Formal hypothesis testing with effect size quantification
Technical Implementation:

  • Welch's t-tests for populations with unequal variances
  • Cohen's d effect size calculation with confidence intervals
  • Statistical assumption testing (normality, homoscedasticity)
  • Bootstrap confidence interval generation for robust inference
  • Publication-ready statistical result compilation
  • Output: Hypothesis test results with p < 2.2e-16 significance
  • Execution Time: Approximately 45 seconds

05_visualization.R - Statistical Graphics Module

Function: Publication-quality visualization for research presentation
Technical Implementation:

  • Box plot, violin plot, and distribution comparison graphics
  • Academic color scheme implementation with professional formatting
  • Confidence interval visualization with error bar plots
  • Multi-panel figure composition using patchwork methodology
  • Robust PDF compilation using ggsave method for reliable output
  • Consistent theme application across all statistical graphics
  • Output: 8+ high-resolution figures for publication + PDF compilation
  • Execution Time: Approximately 30 seconds

06_quantile_regression_analysis.R - Robust Regression Module

Function: Quantile regression across price distribution for robustness validation
Technical Implementation:

  • Quantile regression implementation at τ = 0.25, 0.5, 0.75, 0.9
  • Complete dataset utilization including previously excluded outliers
  • Coefficient visualization across quantile spectrum
  • Robust statistical inference resistant to extreme values
  • Comprehensive quantile-specific result tabulation
  • Output: Validated findings across complete price distribution
  • Execution Time: Approximately 90 seconds

07_interaction_effects_analysis.R - Strategic Segmentation Module

Function: Price tier analysis revealing strategic differentiation patterns
Technical Implementation:

  • Tertile-based price segmentation within accommodation categories
  • Systematic t-test implementation across price segments
  • Professional heatmap construction for strategic positioning visualization
  • Interaction effect analysis between price and Superhost status
  • Strategic market positioning insight generation
  • Output: Professional heatmap and 6 segment-specific analyses
  • Execution Time: Approximately 60 seconds

08_predictive_model_analysis.R - Machine Learning Validation Module

Function: Predictive modeling demonstrating practical research applicability
Technical Implementation:

  • Stratified 70/30 train-test data partitioning
  • Linear regression model construction with comprehensive feature engineering
  • K-fold cross-validation for model robustness assessment
  • Performance metric calculation (R², RMSE, MAE)
  • Residual analysis and diagnostic plot generation
  • Output: Validated predictive model with quantified performance metrics
  • Execution Time: Approximately 60 seconds

Reproduction Protocol

Complete Analysis Pipeline

# Repository acquisition
git clone https://github.com/g8rdier/airbnb-berlin-superhost-analysis.git
cd airbnb-berlin-superhost-analysis

# Sequential script execution
source("scripts/01_data_import.R")
source("scripts/02_data_cleaning.R")
source("scripts/03_exploratory_analysis.R")
source("scripts/04_hypothesis_testing.R")
source("scripts/05_visualization.R")
source("scripts/06_quantile_regression_analysis.R")
source("scripts/07_interaction_effects_analysis.R")
source("scripts/08_predictive_model_analysis.R")

Total Pipeline Execution Time: 5-6 minutes for complete analysis

Research Contributions

Methodological Innovations

  • Multi-Method Validation Framework: Analytical robustness through diverse statistical approaches
  • Outlier-Inclusive Analysis: Quantile regression methodology ensuring complete market coverage
  • Predictive Validation Protocol: Machine learning confirmation of statistical findings for practical relevance

Empirical Insights

  • Strategic Differentiation Evidence: Quantitative proof of sophisticated pricing strategies across market segments
  • Platform Economics Contribution: Addition to sharing economy pricing behavior literature
  • Practical Applications: Actionable insights for accommodation hosts and platform operators

Scientific Contribution

Knowledge Gain

First-time documentation of sophisticated Superhost pricing differentiation strategies

  • Paradigm Shift: Move away from premium strategies toward segmentation approaches
  • Innovation: Integration of parametric tests, quantile regression, and out-of-sample validation
  • Implications: Price optimization, platform design, regulatory guidance

Key Empirical Findings

Entire Apartment Accommodation:

  • Budget segment (cheap): +12.0% Superhost premium
  • Medium segment: +0.04% premium
  • Luxury segment (90th percentile): +18.7% premium

Private Room Accommodation:

  • Budget segment (cheap): +8.3% Superhost premium
  • Medium segment: -4.1% discount
  • Luxury segment (90th percentile): -30.2% discount

Overall Pattern: Inverse pricing differentiation - Superhosts compete aggressively in budget segments for private rooms but leverage reputation for higher premiums in luxury entire apartment segments.

Research Status

Methodological Status: Multi-method statistical analysis with predictive validation
Academic Standard: Publication-ready methodology with complete reproducibility
Completion Date: August 2025
Recent Updates: Enhanced PDF visualization compilation with reliable ggsave method (August 2025)

Analytical Components Status

Quantile Regression Analysis

  • Validated findings across price distribution (τ = 0.25, 0.5, 0.75, 0.9)
  • Confirmed inverse pricing pattern through robust methodology
  • Complete dataset analysis incorporating price outliers

Interaction Effects Analysis

  • Systematic analysis of Superhost pricing strategies by price segment within accommodation types
  • Tertile-based segmentation with statistical significance testing
  • Professional heatmap visualization of systematic pricing differentiation

Predictive Model Validation

  • Stratified train-test split implementation (70/30 ratio)
  • Practical applicability validation: R² = 0.0087, RMSE = €399.05, MAE = €88.93
  • Empirical demonstration of Superhost status and accommodation feature predictive power

Academic Documentation

Principal Investigator: Gregor Kobilarov
Academic Level: Business Informatics, 5th Semester
Institution: IU International University of Applied Sciences
Research Supervisor: Prof. Dr. rer. nat. Michael Barth
Research Period: August 2025

Recommended Citation

Kobilarov, G. (2025). Advanced Statistical Analysis of Superhost Pricing Strategies
in Berlin's Airbnb Market: A Multi-Method Approach to Pricing Differentiation and
Predictive Validation. Business Informatics Research Project,
IU International University, Germany.

Statistical Validation: Multiple methodologies confirm robust, statistically significant findings
Practical Utility: Demonstrated through successful predictive modeling implementation
Academic Standard: Publication-ready methodology with complete reproducibility protocols

This research demonstrates advanced data analytics methodology with significant contributions to understanding platform-based pricing strategies in the sharing economy.

About

Superhost Premium Analysis - Berlin Airbnb Data (DABD University project)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages