An initial effort to use NLP methods on user reviews to normalize star score bias. Many reviews are misaligned with their star scores, and five-star scores are handed out too easily. I wanted to test classification models to produce a better star distribution. I used the Yelp Academic Dataset, available at https://www.yelp.com/dataset
Examples of misaligned reviews:

Also, star scores in general skew heavily toward 5 stars.

If we look more categorically,

After importing, merging, and sampling the datasets down to a manageable size, I filtered for food-based business reviews.
- I then engineered a binary feature called 'fan', based on the user's average score and whether they scored the current business higher than the business's average score. It was the target variable in my models.
- I put the reviews through TF-IDF vectorization after normalizing them with a tokenizer and lemmatizer.
- I then ran a Random Forest Classifier and a 2-input, 1-output LSTM / Sequential neural net.
The Random Forest gave better baseline results and is much less resource-intensive.
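
The steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the column names (`user_id`, `business_id`, `stars`, `text`) are hypothetical, and the 'fan' rule shown here (stars above both the business average and the user's own average) is only one plausible reading of the feature described above. The sketch also relies on `TfidfVectorizer`'s default tokenization and skips the lemmatizer step.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in for the merged, food-filtered review table (hypothetical columns).
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "business_id": ["b1", "b2", "b1", "b2", "b1", "b2"],
    "stars": [5, 2, 3, 4, 4, 1],
    "text": [
        "amazing tacos and friendly staff",
        "cold fries and slow service",
        "decent burger but nothing special",
        "great coffee and tasty pastries",
        "good pizza with a crispy crust",
        "terrible soup and rude waiter",
    ],
})

# Mean stars per user and per business, broadcast back onto each review row.
reviews["user_avg"] = reviews.groupby("user_id")["stars"].transform("mean")
reviews["business_avg"] = reviews.groupby("business_id")["stars"].transform("mean")

# ASSUMED definition of 'fan': the review scores the business above the
# business average and above the user's own average.
reviews["fan"] = (
    (reviews["stars"] > reviews["business_avg"])
    & (reviews["stars"] > reviews["user_avg"])
).astype(int)

# TF-IDF features feeding a Random Forest, with 'fan' as the target.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(reviews["text"], reviews["fan"])

# Probability that each review reads like a 'fan' review (column order
# follows model.classes_, so column 1 is the positive class).
fan_probability = model.predict_proba(reviews["text"])[:, 1]
```

On real data the model would be fit on a training split and evaluated on held-out reviews; the toy table here only shows the shape of the pipeline.
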
With the results, I built a function to normalize the star scores based on reviews. See the new distribution.
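
The normalization function itself is not described here, so the following is only a hypothetical sketch of what such a function could look like. It assumes the classifier provides a 'fan' probability for each review, and it shifts the original star score up or down by at most `max_shift` stars. The name `normalize_stars`, the `max_shift` parameter, and the linear rule are all assumptions for illustration.

```python
def normalize_stars(stars, fan_probability, max_shift=1.0):
    """Shift a star score toward what the review text suggests.

    Hypothetical rule: a fan probability of 0.5 leaves the score unchanged,
    0.0 lowers it by max_shift, and 1.0 raises it by max_shift.
    """
    if not 0.0 <= fan_probability <= 1.0:
        raise ValueError("fan_probability must be in [0, 1]")
    shifted = stars + (2.0 * fan_probability - 1.0) * max_shift
    # Keep the result on the 1-5 star scale.
    return min(5.0, max(1.0, shifted))


# A five-star score whose text does not read like a fan review is pulled down.
adjusted = normalize_stars(5, 0.0)
```

Under this rule, an easy five-star score attached to lukewarm text drops toward four stars, while a score whose text matches its rating stays roughly where it was.
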
