This data analysis project consists of three main components, each handled in a separate Jupyter notebook:
- Web Scraping (QuotesScraped.ipynb)
  - Extract quotes data from quotes.toscrape.com
  - Store scraped data in CSV format
- SQL Insights (SQL_insights.ipynb)
  - Import CSV data into MySQL database
  - Perform SQL queries for data insights
  - Answer specific analytical questions
- Exploratory Data Analysis and Visualization (Analysis & Visualization.ipynb)
  - Perform basic statistical analysis
  - Create visualizations to understand patterns in the data
- Uses Requests library to fetch web pages
- Implements BeautifulSoup for HTML parsing
- Handles pagination to scrape all available quotes (not just the first page)
- Extracts three key data points:
  - Quote text
  - Author name
  - Associated tags
- Starts at page 1 of quotes.toscrape.com
- Locates quote blocks via HTML/CSS selectors (div.quote)
- Extracts content from structured elements within each quote block
- Follows "next" navigation links to subsequent pages
- Terminates when no more pages are available
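
The scraping steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: only the `div.quote` selector comes from the description above, while `span.text`, `small.author`, `a.tag`, and `li.next a` are assumptions about the site's markup, and `parse_page` and `scrape_all` are hypothetical helper names.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com/"


def parse_page(html):
    """Return (rows, next_href) for one page of HTML.

    Selectors other than div.quote are assumptions about the site's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.select("div.quote"):
        rows.append({
            "author": block.select_one("small.author").get_text(strip=True),
            "quote": block.select_one("span.text").get_text(strip=True),
            # Tags are joined into one comma-separated string per quote.
            "tag_name": ",".join(
                a.get_text(strip=True) for a in block.select("a.tag")
            ),
        })
    next_link = soup.select_one("li.next a")
    return rows, (next_link["href"] if next_link else None)


def scrape_all(start_url=BASE_URL):
    """Follow "next" links from the first page until none remain."""
    url, rows = start_url, []
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        page_rows, next_href = parse_page(response.text)
        rows.extend(page_rows)
        # The next link is relative, so resolve it against the current URL.
        url = urljoin(url, next_href) if next_href else None
    return rows
```

Keeping the parsing in `parse_page` separate from the network loop means the extraction logic can be checked against a saved HTML snippet without fetching the site.
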
- Creates a structured CSV file (quotes.csv) with three columns:
  - author
  - quote
  - tag_name (comma-separated tag strings)
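
One way to produce a file with this layout is the standard library's `csv` module. The sketch below assumes each scraped quote is a dict keyed by the three column names; `write_quotes_csv` is a hypothetical helper, and the notebook may write the file differently (for example via pandas).

```python
import csv

FIELDNAMES = ["author", "quote", "tag_name"]


def write_quotes_csv(rows, path="quotes.csv"):
    """Write quote dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings;
    # utf-8 preserves the curly quotes found in the quote text.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Because `tag_name` itself contains commas, the writer quotes that field automatically, so the file still parses back into exactly three columns.
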
- Connects to MySQL using PyMySQL
- Loads scraped data into a 'quotes' table in the 'usersystem' database
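
A minimal sketch of the load step is shown below. The database and table names come from the description above; the column types, the connection credentials, and the helper names `read_quote_rows` and `load_into_mysql` are assumptions made for illustration.

```python
import csv


def read_quote_rows(csv_path):
    """Read quotes.csv into (author, quote, tag_name) tuples."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            (row["author"], row["quote"], row["tag_name"])
            for row in csv.DictReader(f)
        ]


def load_into_mysql(rows, host="localhost", user="root", password=""):
    """Insert rows into usersystem.quotes (credentials are placeholders)."""
    # Imported here so the CSV helper above works without a MySQL driver.
    import pymysql

    conn = pymysql.connect(
        host=host, user=user, password=password, database="usersystem"
    )
    try:
        with conn.cursor() as cur:
            # Assumed schema: the notebook's real column types are not shown.
            cur.execute(
                "CREATE TABLE IF NOT EXISTS quotes ("
                "author VARCHAR(255), quote TEXT, tag_name TEXT)"
            )
            # Parameterized insert; PyMySQL uses %s placeholders.
            cur.executemany(
                "INSERT INTO quotes (author, quote, tag_name) "
                "VALUES (%s, %s, %s)",
                rows,
            )
        conn.commit()
    finally:
        conn.close()
```

A typical call would be `load_into_mysql(read_quote_rows("quotes.csv"), user=..., password=...)` with real credentials supplied.
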
- Author Quote Frequency
  - Counts quotes by each author
  - Orders results by frequency (descending)
- Tag Popularity Analysis
  - Implements complex string parsing to separate comma-delimited tags
  - Counts occurrence frequency of each tag
  - Identifies top 5 most common tags
- Prolific Author Identification
  - Filters authors who have contributed more than 5 quotes
  - Results revealed: Albert Einstein (10), J.K. Rowling (9), Marilyn Monroe (7), Dr. Seuss (6), Mark Twain (6)
- Content Length Analysis
  - Identifies the longest quote in the dataset
  - Associates this quote with its author
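
The four analyses above could be expressed along the following lines. This is an illustrative sketch that runs against an in-memory SQLite database so it can be tried without a MySQL server; it is not the notebook's SQL. In particular, the recursive CTE used to split tags is a substituted technique (the notebook's own string parsing is not shown), and in MySQL `CHAR_LENGTH` would be the usual choice for counting characters.

```python
import sqlite3

# Quotes per author, most frequent first.
AUTHOR_COUNTS = """
    SELECT author, COUNT(*) AS quote_count
    FROM quotes
    GROUP BY author
    ORDER BY quote_count DESC, author
"""

# Authors with more than 5 quotes (HAVING filters after grouping).
PROLIFIC_AUTHORS = """
    SELECT author, COUNT(*) AS quote_count
    FROM quotes
    GROUP BY author
    HAVING COUNT(*) > 5
    ORDER BY quote_count DESC, author
"""

# Longest quote together with its author.
LONGEST_QUOTE = """
    SELECT author, quote, LENGTH(quote) AS quote_length
    FROM quotes
    ORDER BY quote_length DESC
    LIMIT 1
"""

# Split the comma-separated tag_name column one tag at a time,
# then count each tag and keep the five most common.
TOP_TAGS = """
    WITH RECURSIVE split(tag, rest) AS (
        SELECT '', tag_name || ',' FROM quotes
        UNION ALL
        SELECT substr(rest, 1, instr(rest, ',') - 1),
               substr(rest, instr(rest, ',') + 1)
        FROM split
        WHERE rest <> ''
    )
    SELECT TRIM(tag) AS tag_value, COUNT(*) AS uses
    FROM split
    WHERE TRIM(tag) <> ''
    GROUP BY tag_value
    ORDER BY uses DESC, tag_value
    LIMIT 5
"""


def make_demo_db(rows):
    """Create an in-memory stand-in for the 'quotes' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quotes (author TEXT, quote TEXT, tag_name TEXT)")
    conn.executemany("INSERT INTO quotes VALUES (?, ?, ?)", rows)
    return conn
```

Each query string can then be run with `conn.execute(QUERY).fetchall()` on a connection returned by `make_demo_db`.
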
- Dataset overview (using pandas head() function)
- Missing value detection
- Unique author count
- Unique tag count
- Quote length statistics (average character count)
- Descriptive statistics
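
The checks listed above map onto a handful of pandas calls. The sketch below assumes the DataFrame has the three columns of quotes.csv; `summarize` is a hypothetical helper, and the notebook may compute these values cell by cell instead.

```python
import pandas as pd


def summarize(df):
    """Collect the basic EDA figures for a quotes DataFrame."""
    # One row per tag: split the comma-separated string, then flatten.
    tags = df["tag_name"].fillna("").str.split(",").explode().str.strip()
    tags = tags[tags != ""]
    return {
        "missing": {col: int(n) for col, n in df.isnull().sum().items()},
        "unique_authors": int(df["author"].nunique()),
        "unique_tags": int(tags.nunique()),
        "mean_quote_length": float(df["quote"].str.len().mean()),
    }
```

For the overview and descriptive statistics, `df.head()` and `df["quote"].str.len().describe()` on a frame loaded with `pd.read_csv("quotes.csv")` cover the remaining items.
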
- Author Distribution (Bar Chart)
  - Horizontal bar chart of top 10 most quoted authors
  - Uses Seaborn's barplot with viridis color palette
- Word Frequency (Word Cloud)
  - Pre-processes quote text (removing punctuation, lowercasing)
  - Generates word cloud visualization of most frequent words
  - Uses viridis color palette on white background
- Tag Distribution (Pie Chart)
  - Extracts and processes tag data
  - Creates pie chart showing proportional representation of top 5 tags
  - Includes percentage labels
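
The three charts could be produced roughly as follows. This is a sketch under assumptions rather than the notebook's code: the helper names, figure sizes, and titles are illustrative, and punctuation removal here only covers ASCII punctuation, so the curly quotation marks in the scraped text would need extra handling.

```python
import string

import pandas as pd


def top_authors(df, n=10):
    """Quote counts for the n most quoted authors."""
    return df["author"].value_counts().head(n)


def clean_text(quotes):
    """Lowercase all quotes and strip ASCII punctuation."""
    text = " ".join(quotes).lower()
    return text.translate(str.maketrans("", "", string.punctuation))


def top_tags(df, n=5):
    """Counts for the n most common tags."""
    tags = df["tag_name"].dropna().str.split(",").explode().str.strip()
    return tags[tags != ""].value_counts().head(n)


def plot_all(df):
    """Draw the bar chart, word cloud, and pie chart."""
    # Imported here so the helpers above work without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns
    from wordcloud import WordCloud

    # Horizontal bar chart: counts on x, author names on y.
    # Newer seaborn releases prefer hue= alongside palette= and may warn.
    authors = top_authors(df)
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.barplot(x=authors.to_numpy(), y=authors.index.to_list(),
                palette="viridis", ax=ax)
    ax.set(title="Top 10 most quoted authors",
           xlabel="Number of quotes", ylabel="Author")

    # Word cloud of the cleaned quote text on a white background.
    cloud = WordCloud(width=800, height=400, background_color="white",
                      colormap="viridis").generate(clean_text(df["quote"]))
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.imshow(cloud, interpolation="bilinear")
    ax.axis("off")

    # Pie chart of the top 5 tags with percentage labels.
    tags = top_tags(df)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.pie(tags.to_numpy(), labels=tags.index.to_list(), autopct="%1.1f%%")
    ax.set_title("Top 5 tags")

    plt.show()
```

Separating the data preparation from the drawing code keeps the counts easy to inspect before anything is plotted.
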
From the SQL analysis, we can see that:
- Albert Einstein is the most quoted author with 10 quotes
- Five authors have more than 5 quotes each
- The longest quote in the dataset, together with its author, is identified by a dedicated query
The visualizations provide additional insights:
- Clear visualization of quote distribution across authors
- Common themes and words used across all quotes
- Relative popularity of different tags
- Efficient Web Scraping
  - The scraping script efficiently navigates through all pages
  - Robust extraction of structured data elements
- Advanced SQL Techniques
  - Use of GROUP BY and HAVING clauses for filtering
  - Complex string parsing for tag analysis
  - Character length functions for content analysis
- Effective Data Visualization
  - Appropriate chart types for different data aspects
  - Consistent color scheme (viridis palette)
  - Clear labeling and titles
This project demonstrates a complete data pipeline from web scraping to visual analysis. It showcases skills in:
- Web data extraction
- Database querying and analysis
- Data cleaning and transformation
- Statistical analysis
- Data visualization
The implementation successfully answers key questions about author contribution frequency, tag popularity, and content characteristics within the quotes dataset.