This study presents a comprehensive statistical analysis of Superhost pricing differentiation strategies in Berlin's Airbnb market using a multi-methodological approach. The research combines Welch's t-tests, quantile regression (τ = 0.25/0.5/0.75/0.9), tertile-based market segmentation, and linear regression models with 70/30 train-test validation to reveal sophisticated pricing strategies and demonstrate practical applicability.
Academic Context: Business Informatics, 5th Semester - Data Analytics & Big Data
Institution: IU International University of Applied Sciences
Research Supervisor: Prof. Dr. rer. nat. Michael Barth
Student: Gregor Kobilarov (Matriculation No.: 4233113)
Dataset: InsideAirbnb Berlin (n=8,783, July 2025)
Primary Contribution: Empirical documentation of inverse pricing differentiation with statistical validation and predictive utility
Research Question: Empirical analysis of Airbnb Superhost pricing differentiation strategies in Berlin's market, focusing on post-COVID market dynamics that enable analysis of evolutionary strategy adaptations.
H₀: Superhost premiums are identical between private rooms and entire apartments
H₁: Superhost premiums differ significantly between private rooms and entire apartments
Empirical Results: H₀ rejected with strong statistical evidence:
- Private Rooms: -22.19% (95% Confidence Interval: [-27.33%, -15.06%])
- Entire Apartments: +16.79% (95% Confidence Interval: [18.52%, 29.85%])
- Difference: 38.98 percentage points (p < 2.2e-16, Cohen's d = -0.559)
- 3-Sigma Outlier Removal: 14,187→8,783 observations
- Analysis Pipeline: (1) Welch's t-tests; (2) Quantile regression (25%/50%/75%/90%); (3) Tertile-based market segmentation; (4) Linear regression models
- Train-Test Validation: 70/30 stratified split (R²=0.0087, RMSE=€399.05)
- Robustness: Bootstrap confidence intervals (1,000 iterations)
- Method: Welch's t-tests with unequal variance assumption
- Sample: 8,783 listings after rigorous data cleaning
- Statistical Evidence: t=-6.78 and t=8.37, both p < 2.2e-16
- Primary Finding: Inverse pricing pattern with substantial effect sizes
- Method: Robust regression across price distribution quantiles (τ = 0.25, 0.5, 0.75, 0.9)
- Sample: Complete dataset of 8,808 listings including outliers
- Key Insight: Consistent inverse pattern across complete price distribution
- Methodological Innovation: Outlier-inclusive analysis ensuring complete market coverage
- Method: Tertile-based price segmentation with statistical significance testing
- Approach: Systematic analysis of 6 market segments (3 price tiers × 2 accommodation types)
- Key Finding: Budget segments (+5.3% to +14.0%) vs. luxury segments (-19.9% to -40.5%)
- Statistical Validation: 4/6 segments statistically significant
- Method: Supervised learning with stratified 70/30 train-test split
- Performance Metrics: R² = 0.0087, RMSE = €399.05, MAE = €88.93
- Validation Approach: Out-of-sample validation confirms practical relevance
- Innovation: Machine learning confirmation of statistical findings
Inverse pricing differentiation with substantial effect sizes:
- Private Rooms: -22.19% (95% Confidence Interval: [-27.33%, -15.06%])
- Entire Apartments: +16.79% (95% Confidence Interval: [18.52%, 29.85%])
Budget Segments: Private rooms show +8.3% to +14.0% Superhost premiums Luxury Segments: Private rooms show -30.2% to -40.5% Superhost discounts Entire Apartments: Consistent premiums across price segments (+5.3% to +18.7%)
Inference Validation: Welch's t-tests (t=-6.78 and t=8.37, p < 2.2e-16), bootstrap confidence intervals, and quantile estimators converge to consistent conclusions. Performance: Out-of-sample validation (RMSE=€399.05) confirms practical relevance.
| Analysis Method | Technique | Key Finding | Statistical Significance |
|---|---|---|---|
| Hypothesis Testing | Welch's t-test | Inverse pricing pattern | p < 2.2e-16 |
| Quantile Regression | Robust estimation | Consistent across price spectrum | Significant across all quantiles |
| Interaction Effects | Segmented analysis | Strategic differentiation | 4/6 segments statistically significant |
| Predictive Modeling | Machine learning validation | Practical applicability | R² = 0.009 on test data |
- Hypothesis Testing: Welch's t-tests with bootstrap confidence intervals
- Robust Analysis: Quantile regression with outlier integration
- Interaction Modeling: Multi-factor ANOVA with price tier segmentation
- Predictive Modeling: Linear regression with cross-validation
- Platform: R Statistical Computing Environment (v4.3+)
- Core Packages: quantreg, caret, effsize, car, psych, ggplot2, patchwork
- Visualization: Publication-quality graphics with academic formatting
- Reproducibility: Git version control with comprehensive documentation
airbnb-berlin-superhost-analysis/
├── data/
│ ├── raw/ # InsideAirbnb source data (14,187 listings)
│ └── processed/ # Analysis-ready datasets (8,783 listings)
├── scripts/ # Complete analysis pipeline (8 scripts)
│ ├── 01_data_import.R # Data acquisition and validation
│ ├── 02_data_cleaning.R # 3-sigma outlier removal and preprocessing
│ ├── 03_exploratory_analysis.R # Descriptive statistics and diagnostics
│ ├── 04_hypothesis_testing.R # Formal statistical inference
│ ├── 05_visualization.R # Publication-quality graphics
│ ├── 06_quantile_regression_analysis.R # Robust regression analysis
│ ├── 07_interaction_effects_analysis.R # Strategic segmentation analysis
│ └── 08_predictive_model_analysis.R # Machine learning validation
├── outputs/
│ ├── figures/ # Statistical visualizations (12 figures + PDF)
│ │ ├── 01_premium_comparison.png
│ │ ├── 02_price_distributions.png
│ │ ├── 03_sample_sizes.png
│ │ ├── 04_statistical_evidence.png
│ │ ├── 05_combined_academic_presentation.png
│ │ ├── 06_quantile_regression_analysis.png
│ │ ├── 07_interaction_effects_heatmap.png
│ │ ├── 07_interaction_effects_context.png
│ │ ├── 08_prediction_accuracy.png
│ │ ├── 08_model_validation_dashboard.png
│ │ ├── 08_feature_importance.png
│ │ └── visualization_compilation.pdf
│ ├── tables/ # Comprehensive results (26+ CSV files)
│ │ ├── final_project_completion_summary.csv
│ │ ├── model_performance_comparison.csv
│ │ ├── feature_importance_analysis.csv
│ │ ├── interaction_effects_analysis.csv
│ │ └── [22+ additional analytical tables]
│ └── results/ # Research conclusions and summaries (6 files)
│ ├── academic_summary.txt
│ ├── hypothesis_test_results.csv
│ ├── predictive_model_validation_summary.txt
│ └── [3+ additional result summaries]
└── README.md # Project documentation
This research implements dual data cleaning approaches optimized for distinct analytical objectives:
- Objective: Statistical inference and hypothesis testing
- Approach: Strict outlier removal using IQR-based bounds
- Resulting Dataset: 8,783 listings with rigorous quality control
- Statistical Rationale: Ensures robustness for inferential analyses through systematic exclusion of extreme values and incomplete records
- Applications: t-tests, quantile regression, interaction effects analysis
Function: Downloads and validates InsideAirbnb Berlin dataset
Technical Implementation:
- Automated retrieval of listings.csv from InsideAirbnb repository
- Data quality validation protocol (missing values, data type verification)
- Source documentation generation with descriptive statistics
- Standardized directory structure initialization
- Output: Raw dataset (20,000+ listings) ready for preprocessing
- Execution Time: Approximately 30 seconds
Function: Comprehensive data preparation for statistical analysis and modeling
Technical Implementation:
- IQR-based outlier detection and removal
- Systematic exclusion of incomplete records
- Categorical variable standardization and price normalization
- Accommodation type filtering (Entire Place, Private Room focus)
- Comprehensive data quality assessment
- Output: 8,783 listings optimized for analysis and modeling
- Execution Time: Approximately 45 seconds
Function: Comprehensive statistical exploration and pattern identification
Technical Implementation:
- Multi-group summary statistics generation
- Distribution analysis with normality testing protocols
- Correlation matrix construction and diagnostic testing
- Confounding variable identification procedures
- Statistical profiling with publication-ready tables
- Output: 15+ statistical tables and diagnostic visualizations
- Execution Time: Approximately 60 seconds
Function: Formal hypothesis testing with effect size quantification
Technical Implementation:
- Welch's t-tests for populations with unequal variances
- Cohen's d effect size calculation with confidence intervals
- Statistical assumption testing (normality, homoscedasticity)
- Bootstrap confidence interval generation for robust inference
- Publication-ready statistical result compilation
- Output: Hypothesis test results with p < 2.2e-16 significance
- Execution Time: Approximately 45 seconds
Function: Publication-quality visualization for research presentation
Technical Implementation:
- Box plot, violin plot, and distribution comparison graphics
- Academic color scheme implementation with professional formatting
- Confidence interval visualization with error bar plots
- Multi-panel figure composition using patchwork methodology
- Robust PDF compilation using ggsave method for reliable output
- Consistent theme application across all statistical graphics
- Output: 8+ high-resolution figures for publication + PDF compilation
- Execution Time: Approximately 30 seconds
Function: Quantile regression across price distribution for robustness validation
Technical Implementation:
- Quantile regression implementation at τ = 0.25, 0.5, 0.75, 0.9
- Complete dataset utilization including previously excluded outliers
- Coefficient visualization across quantile spectrum
- Robust statistical inference resistant to extreme values
- Comprehensive quantile-specific result tabulation
- Output: Validated findings across complete price distribution
- Execution Time: Approximately 90 seconds
Function: Price tier analysis revealing strategic differentiation patterns
Technical Implementation:
- Tertile-based price segmentation within accommodation categories
- Systematic t-test implementation across price segments
- Professional heatmap construction for strategic positioning visualization
- Interaction effect analysis between price and Superhost status
- Strategic market positioning insight generation
- Output: Professional heatmap and 6 segment-specific analyses
- Execution Time: Approximately 60 seconds
Function: Predictive modeling demonstrating practical research applicability
Technical Implementation:
- Stratified 70/30 train-test data partitioning
- Linear regression model construction with comprehensive feature engineering
- K-fold cross-validation for model robustness assessment
- Performance metric calculation (R², RMSE, MAE)
- Residual analysis and diagnostic plot generation
- Output: Validated predictive model with quantified performance metrics
- Execution Time: Approximately 60 seconds
# Repository acquisition
git clone https://github.com/g8rdier/airbnb-berlin-superhost-analysis.git
cd airbnb-berlin-superhost-analysis
# Sequential script execution
source("scripts/01_data_import.R")
source("scripts/02_data_cleaning.R")
source("scripts/03_exploratory_analysis.R")
source("scripts/04_hypothesis_testing.R")
source("scripts/05_visualization.R")
source("scripts/06_quantile_regression_analysis.R")
source("scripts/07_interaction_effects_analysis.R")
source("scripts/08_predictive_model_analysis.R")Total Pipeline Execution Time: 5-6 minutes for complete analysis
- Multi-Method Validation Framework: Analytical robustness through diverse statistical approaches
- Outlier-Inclusive Analysis: Quantile regression methodology ensuring complete market coverage
- Predictive Validation Protocol: Machine learning confirmation of statistical findings for practical relevance
- Strategic Differentiation Evidence: Quantitative proof of sophisticated pricing strategies across market segments
- Platform Economics Contribution: Addition to sharing economy pricing behavior literature
- Practical Applications: Actionable insights for accommodation hosts and platform operators
First-time documentation of sophisticated Superhost pricing differentiation strategies
- Paradigm Shift: Move away from premium strategies toward segmentation approaches
- Innovation: Integration of parametric tests, quantile regression, and out-of-sample validation
- Implications: Price optimization, platform design, regulatory guidance
Entire Apartment Accommodation:
- Budget segment (cheap): +12.0% Superhost premium
- Medium segment: +0.04% premium
- Luxury segment (90th percentile): +18.7% premium
Private Room Accommodation:
- Budget segment (cheap): +8.3% Superhost premium
- Medium segment: -4.1% discount
- Luxury segment (90th percentile): -30.2% discount
Overall Pattern: Inverse pricing differentiation - Superhosts compete aggressively in budget segments for private rooms but leverage reputation for higher premiums in luxury entire apartment segments.
Methodological Status: Multi-method statistical analysis with predictive validation
Academic Standard: Publication-ready methodology with complete reproducibility
Completion Date: August 2025
Recent Updates: Enhanced PDF visualization compilation with reliable ggsave method (August 2025)
Quantile Regression Analysis
- Validated findings across price distribution (τ = 0.25, 0.5, 0.75, 0.9)
- Confirmed inverse pricing pattern through robust methodology
- Complete dataset analysis incorporating price outliers
Interaction Effects Analysis
- Systematic analysis of Superhost pricing strategies by price segment within accommodation types
- Tertile-based segmentation with statistical significance testing
- Professional heatmap visualization of systematic pricing differentiation
Predictive Model Validation
- Stratified train-test split implementation (70/30 ratio)
- Practical applicability validation: R² = 0.0087, RMSE = €399.05, MAE = €88.93
- Empirical demonstration of Superhost status and accommodation feature predictive power
Principal Investigator: Gregor Kobilarov
Academic Level: Business Informatics, 5th Semester
Institution: IU International University of Applied Sciences
Research Supervisor: Prof. Dr. rer. nat. Michael Barth
Research Period: August 2025
Kobilarov, G. (2025). Advanced Statistical Analysis of Superhost Pricing Strategies
in Berlin's Airbnb Market: A Multi-Method Approach to Pricing Differentiation and
Predictive Validation. Business Informatics Research Project,
IU International University, Germany.
Statistical Validation: Multiple methodologies confirm robust, statistically significant findings
Practical Utility: Demonstrated through successful predictive modeling implementation
Academic Standard: Publication-ready methodology with complete reproducibility protocols
This research demonstrates advanced data analytics methodology with significant contributions to understanding platform-based pricing strategies in the sharing economy.