A machine learning analysis of hotels in Ho Chi Minh City, Vietnam, using data scraped from Booking.com. The project applies regression, clustering, and classification models to hotel reviews, images, and metadata to produce actionable insights for the hospitality industry.
- Predictive Analytics: Develop robust models for review score prediction
- Market Segmentation: Identify distinct hotel segments using unsupervised learning
- Quality Classification: Create a reliable hotel quality classification system
- Image Analysis: Incorporate visual data in prediction models
- Combined Model Architecture:
- ResNet18 backbone for image feature extraction
- Fusion with numerical/categorical features
- Custom head for regression
- Traditional ML Approach:
- Ridge Regression with VIF-based feature selection
- Hyperparameter optimization via cross-validation
- RMSE-focused model evaluation
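The traditional pipeline above might look like the sketch below. It assumes an iterative "drop the worst VIF" strategy and uses synthetic data; the helper names `vif` and `select_by_vif`, the alpha grid, and the data itself are illustrative assumptions rather than the project's code. The threshold of 5.0 mirrors the `--vif_threshold 5.0` flag used in the usage commands.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def select_by_vif(X: np.ndarray, threshold: float = 5.0) -> list:
    """Drop the column with the largest VIF until every VIF is <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [vif(X[:, keep], i) for i in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        keep.pop(worst)
    return keep


# Synthetic stand-in data: column c is nearly collinear with column a
rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.01 * rng.normal(size=n)
X = np.column_stack([a, b, c])
y = 2.0 * a - 1.0 * b + 0.1 * rng.normal(size=n)

kept = select_by_vif(X, threshold=5.0)
X_train, X_test, y_train, y_test = train_test_split(
    X[:, kept], y, test_size=0.25, random_state=0
)
# Cross-validated search over the regularisation strength
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print("kept columns:", kept, "alpha:", model.alpha_, "RMSE:", round(rmse, 3))
```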
- Clustering Algorithms:
- K-means for basic segmentation
- DBSCAN for density-based clustering
- Evaluation Metrics:
- Silhouette Score
- Elbow Method for optimal cluster selection
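As a rough illustration of the clustering workflow above, the sketch below runs K-means over a range of cluster counts, records inertia (for the elbow plot) and the silhouette score, and then runs DBSCAN for comparison. The data is synthetic, and the `eps` and `min_samples` values are assumptions picked for this toy data, not the project's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled hotel features: three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.7,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # plot against k to find the elbow
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)

# DBSCAN needs no cluster count; the label -1 marks noise points
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels) - {-1})
print("DBSCAN clusters:", n_db_clusters)
```

The elbow method and the silhouette score can disagree on real data, so it helps to inspect both before fixing the number of segments.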
- Multi-class Classification:
- Softmax Regression baseline
- Stacking Ensemble:
- Base models: SVM, KNN, Decision Tree, Random Forest
- Meta-model: Logistic Regression
- Class Definition:
def quality_mapping(score):
    if score < 7.0:
        return "Standard"  # Basic amenities, lower prices
    elif score < 9.0:
        return "Superior"  # Good quality, competitive pricing
    else:
        return "Exceptional"  # Premium experience, luxury segment
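The stacking ensemble described above could be assembled with scikit-learn's `StackingClassifier`, as in this sketch. The features and scores are synthetic, and every hyperparameter shown (tree depth, number of neighbours, number of folds) is an assumption for illustration. The labels come from the same score thresholds as `quality_mapping`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def quality_mapping(score):
    """Map a 0-10 review score to one of three quality classes."""
    if score < 7.0:
        return "Standard"
    elif score < 9.0:
        return "Superior"
    return "Exceptional"


# Synthetic stand-in: four numeric features and a noisy review score
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
scores = np.clip(8.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=600), 0, 10)
y = np.array([quality_mapping(s) for s in scores])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        # Distance- and margin-based models get scaled inputs
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=42))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    cv=5,  # out-of-fold base predictions feed the meta-model
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))
```

Training the meta-model on out-of-fold predictions keeps it from simply trusting whichever base model overfits the training set most.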
- Python 3.8+
- PyTorch 1.9+
- scikit-learn 0.24+
- pandas 1.3+
- NumPy 1.19+
git clone https://github.com/username/booking-hotel-analysis.git
cd booking-hotel-analysis
pip install -r requirements.txt
- Regression Models
# Deep Learning Approach
python evaluate.py \
--task_type regression \
--model_type dl \
--dataset 'booking_images' \
--n_epoch 5 \
--batch_size 32 \
--lr 0.01 \
--save_model
# Traditional ML Approach
python evaluate.py \
--task_type regression \
--model_type ml \
--model Vanilla_LinearRegression \
--vif_threshold 5.0
- Classification Models
# Stacking Ensemble
python evaluate.py \
--task_type classification \
--model_type ml \
--model Ensemble \
--save_model
- Clustering Analysis
python evaluate.py \
--task_type clustering \
--model_type ml \
--model KMeans \
--save_model
- Build Docker Image
# Build the image with a tag
docker build -t hotel-analysis:latest .
- Run Container
# Run the container with mounted volumes
docker run -it --name hotel-analysis \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/results:/app/results" \
hotel-analysis:latest
- Run Specific Tasks
# Regression task
docker run -it --name hotel-regression \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/results:/app/results" \
hotel-analysis:latest \
python -u task_regression/evaluate.py \
--task_type regression \
--model_type ml \
--model Ridge_Regression
# Classification task
docker run -it --name hotel-classification \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/results:/app/results" \
hotel-analysis:latest \
python -u task_classification/evaluate.py \
--task_type classification \
--model_type ml
# Clustering task
docker run -it --name hotel-clustering \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/results:/app/results" \
hotel-analysis:latest \
python -u task_clustering/evaluate.py \
--task_type clustering \
--model_type ml
- Useful Docker Commands
# List containers
docker ps -a
# Stop container
docker stop hotel-analysis
# Remove container
docker rm hotel-analysis
# View logs
docker logs -f hotel-analysis
# Clean up
docker system prune -a
- Docker Compose (Optional)
# docker-compose.yml
version: '3.8'
services:
  hotel-analysis:
    build: .
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./results:/app/results
    environment:
      - PYTHONPATH=/app
      - TASK_DIR=/app/data
      - MODEL_DIR=/app/models
      - RESULTS_DIR=/app/results
Run with docker-compose:
docker-compose up --build
- Docker Engine 19.03+
- Docker Compose 1.27+ (optional)
- At least 8GB RAM
- 20GB free disk space
project/
├── data/
│   ├── raw/              # Raw scraped data
│   ├── processed/        # Cleaned & preprocessed data
│   └── hotel_images/     # Hotel image repository
├── models/
│   ├── regression/
│   ├── classification/
│   └── clustering/
├── notebooks/            # Analysis & experimentation
└── src/                  # Source code
| Feature | Type | Description |
|---|---|---|
| review_score | float | Rating (0-10) |
| price | float | Room price (VND) |
| facilities | list | Available amenities |
| location | str | Hotel location |
| images | tensor | Processed hotel images |
- Regression:
- RMSE: 0.85
- R²: 0.78
- MAE: 0.67
- Classification:
- Accuracy: 0.84
- F1-Score: 0.82
- ROC-AUC: 0.89
- Clustering:
- Silhouette Score: 0.76
- Optimal Clusters: 3
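For readers less familiar with the regression metrics reported above, this short sketch shows how RMSE, MAE, and R² are computed with scikit-learn. The four score pairs are made-up toy values chosen for easy arithmetic; they are not project results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy review scores (hypothetical values, not taken from the dataset)
y_true = np.array([8.0, 7.5, 9.0, 6.0])
y_pred = np.array([7.5, 8.0, 8.5, 7.0])

mae = mean_absolute_error(y_true, y_pred)                  # mean of |error|
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # root of mean squared error
r2 = r2_score(y_true, y_pred)                              # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

RMSE penalises large errors more heavily than MAE, which is why the two can rank models differently.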
- Model Enhancements:
- Implement attention mechanisms for image analysis
- Explore transformer architectures
- Incorporate temporal features
- Feature Engineering:
- Develop more sophisticated text features
- Create location-based features
- Extract deeper image features
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
@misc{booking_analysis_2024,
  author = {Your Name},
  title = {Booking.com Hotel Analytics},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/khang3004/Comprehensive-ML-DL-Approaches-for-Hotel-Room-Review-Score-Prediction.git}
}
For any queries, please reach out to gausseuler159357@gmail.com
