This repository contains scripts for performing two advanced statistical analysis techniques: Principal Component Analysis (PCA) and Non-Metric Multidimensional Scaling (MDS). Both techniques are used to reduce the dimensionality of data and to visualize relationships between variables or observations.
- Introduction
- Principal Component Analysis (PCA)
- Non-Metric Multidimensional Scaling (MDS)
- Usage
- Dependencies
- Contributing
- License
Dimensionality reduction techniques are essential tools in data analysis for simplifying complex datasets, improving visualization, and uncovering hidden patterns. This repository includes scripts for two widely used techniques:
- Principal Component Analysis (PCA): PCA is a statistical technique that transforms a dataset into a set of linearly uncorrelated variables called principal components. The components are ordered so that each one captures as much of the remaining variance as possible, which allows the first few components to stand in for the full dataset while retaining most of the original information.
- Non-Metric Multidimensional Scaling (MDS): MDS is a means of visualizing the level of similarity (or dissimilarity) of individual cases in a dataset. Non-metric MDS, in particular, preserves only the rank order of the dissimilarities rather than their exact values, making it useful for ordinal data.
The pca_analysis.R script performs Principal Component Analysis on a dataset. The key steps involved in this script are:
- Loading the Data: The script starts by loading the required packages and data, and checks that the data is in the appropriate format for PCA.
- Data Transformation: The data is transformed from long to wide format, making it suitable for correlation analysis and PCA.
- Handling Missing Data: Cases with missing values are dropped. They make up only a small proportion of the dataset, so removing them is unlikely to introduce much bias.
- Correlation Matrix Calculation: A correlation matrix is calculated, excluding variables collected for descriptive purposes only.
- PCA Construction: The script constructs the PCA model, checks the explained variance for each component, and visualizes this through a scree plot and a log-eigenvalue diagram.
- Component Interpretation: The script visualizes the PCA results, including scatter plots and bar plots of participant scores on the principal components.
- Biplot: Finally, a biplot is created to visualize the relationship between the first two principal components and the original variables.
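The PCA steps listed above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the contents of pca_analysis.R: the column names (participant, item, score) are hypothetical, and base R functions (reshape, prcomp) are assumed here in place of whatever the script actually calls from reshape2 or psych, so that the example runs without extra packages.

```r
# Simulated long-format data: one row per participant/item pair.
# All names and values here are made up for illustration.
set.seed(1)
long <- expand.grid(participant = 1:30, item = paste0("item", 1:5))
long$score <- rnorm(nrow(long))
long$score[c(3, 47)] <- NA  # two missing responses (participants 3 and 17)

# Long to wide: one row per participant, one column per item.
wide <- reshape(long, idvar = "participant", timevar = "item",
                direction = "wide")

# Drop participants with any missing value (listwise deletion).
complete <- na.omit(wide)

# Keep only the analysis variables; the id column stands in for
# "descriptive only" variables that are excluded from the PCA.
vars <- complete[, setdiff(names(complete), "participant")]

# Correlation matrix of the analysis variables.
R <- cor(vars)

# PCA on standardized variables (equivalent to PCA of the correlation matrix).
pca <- prcomp(vars, center = TRUE, scale. = TRUE)

# Eigenvalues and proportion of variance explained per component.
eigenvalues <- pca$sdev^2
explained <- eigenvalues / sum(eigenvalues)
print(round(explained, 3))

# Plots are only drawn in an interactive session.
if (interactive()) {
  screeplot(pca, type = "lines", main = "Scree plot")
  plot(log(eigenvalues), type = "b",
       xlab = "Component", ylab = "log(eigenvalue)")
  plot(pca$x[, 1:2], main = "Participant scores")  # scores on PC1 and PC2
  biplot(pca)                                      # scores and loadings
}
```

Standardizing with scale. = TRUE is a design choice that matches the correlation-matrix step; on variables measured in comparable units, an unscaled PCA may be preferable.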
The mds_analysis.R script performs Non-Metric Multidimensional Scaling on a dataset of country similarities. The key steps in this script include:
- Loading Packages and Data: The script loads the necessary packages and the dataset for analysis.
- Data Inspection: The dataset is checked for any coding mistakes, outliers, or missing values that could affect the analysis.
- Similarity to Dissimilarity Conversion: The similarity data is converted into dissimilarities using the sim2diss function. MDS operates on dissimilarities, so this conversion is required before the analysis can run.
- MDS Execution: Non-metric MDS is carried out using the isoMDS function. The stress value, an indicator of how well the MDS configuration represents the dissimilarities (lower is better), is checked.
- Visualization: The coordinates from the MDS analysis are plotted in a two-dimensional space, with additional visual aids such as vertical and horizontal reference lines.
- Shepard Plot: A Shepard plot is generated to assess the goodness of fit, comparing the dissimilarities with the distances between points in the MDS configuration.
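The MDS steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the script itself: the similarity matrix is simulated, the country labels are placeholders, and the similarity-to-dissimilarity conversion is done by hand (subtracting each similarity from the maximum) instead of calling sim2diss, so that the example needs only the MASS package.

```r
library(MASS)  # provides isoMDS() and Shepard()

# Simulated similarity ratings between eight placeholder "countries"
# on a 1-9 scale. The values are made up for illustration.
set.seed(42)
countries <- paste("Country", LETTERS[1:8])
truth <- matrix(rnorm(16), ncol = 2)
d_true <- as.matrix(dist(truth))
sim <- 9 - 8 * d_true / max(d_true)   # 9 = identical, 1 = least similar
dimnames(sim) <- list(countries, countries)

# Convert similarities to dissimilarities by reversing the scale.
# (Assumption: a simple hand conversion standing in for sim2diss.)
diss <- as.dist(max(sim) - sim)

# Non-metric MDS in two dimensions.
fit <- isoMDS(diss, k = 2, trace = FALSE)
cat("Stress (percent):", fit$stress, "\n")

# Shepard diagram data: dissimilarities (x), configuration
# distances (y), and the fitted monotone regression values (yf).
shep <- Shepard(diss, fit$points)

# Plots are only drawn in an interactive session.
if (interactive()) {
  plot(fit$points, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
  text(fit$points, labels = countries)
  abline(h = 0, v = 0, lty = 2)        # horizontal and vertical reference lines
  plot(shep, pch = 20, xlab = "Dissimilarity", ylab = "Distance")
  lines(shep$x, shep$yf, type = "S")   # monotone fit
}
```

Note that isoMDS reports stress as a percentage, so values should be read on a 0 to 100 scale.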
To run the analysis scripts:
- Ensure that all dependencies are installed (see Dependencies below).
- Load your dataset into the R environment.
- Run the scripts in your R environment.
The PCA script expects a dataset formatted for PCA, while the MDS script expects a dataset of similarity measures between entities.
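One way to run both scripts from an R session is sketched below. It assumes the working directory is the repository root; run_if_present is a hypothetical helper written for this example, not something the repository provides.

```r
# Hypothetical helper: source a script only if the file exists,
# and report clearly when it does not.
run_if_present <- function(path) {
  if (!file.exists(path)) {
    message("Script not found: ", path)
    return(invisible(FALSE))
  }
  source(path, echo = FALSE)
  invisible(TRUE)
}

run_if_present("pca_analysis.R")
run_if_present("mds_analysis.R")
```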
Both scripts require R and several R packages:
- tidyverse
- reshape2
- psych
- smacof
- MASS
- ggplot2
- gridExtra
Install the necessary packages using:
install.packages(c("tidyverse", "reshape2", "psych", "smacof", "MASS", "ggplot2", "gridExtra"))
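To avoid reinstalling packages that are already present, a small check like the following may help. It is a sketch, with missing_packages being a hypothetical helper name; it only reports what is missing and leaves the installation to you.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "reshape2", "psych", "smacof",
              "MASS", "ggplot2", "gridExtra")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(to_install)
} else {
  message("All required packages are installed.")
}
```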