Skip to content

Vikas-Gour713/ScrapifyQuotes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ScrapifyQuotes

Quotes Scraping and Analysis Project - Comprehensive Overview

Project Structure

This data analysis project consists of three main components, each handled in a separate Jupyter notebook:

  1. Web Scraping (QuotesScraped.ipynb)

    • Extract quotes data from quotes.toscrape.com
    • Store scraped data in CSV format
  2. SQL Insights (SQL_insights.ipynb)

    • Import CSV data into MySQL database
    • Perform SQL queries for data insights
    • Answer specific analytical questions
  3. Exploratory Data Analysis and Visualization (Analysis & Visualization.ipynb)

    • Perform basic statistical analysis
    • Create visualizations to understand patterns in the data

Component 1: Web Scraping Implementation

Key Features

  • Uses Requests library to fetch web pages
  • Implements BeautifulSoup for HTML parsing
  • Handles pagination to scrape all available quotes (not just the first page)
  • Extracts three key data points:
    • Quote text
    • Author name
    • Associated tags

Data Collection Process

  1. Starts at page 1 of quotes.toscrape.com
  2. Locates quote blocks via HTML/CSS selectors (div.quote)
  3. Extracts content from structured elements within each quote block
  4. Follows "next" navigation links to subsequent pages
  5. Terminates when no more pages are available

Output

  • Creates a structured CSV file (quotes.csv) with three columns:
    • author
    • quote
    • tag_name (comma-separated tag strings)

Component 2: SQL Analysis

Database Implementation

  • Connects to MySQL using PyMySQL
  • Loads scraped data into a 'quotes' table in 'usersystem' database

Key SQL Queries Implemented

  1. Author Quote Frequency

    • Counts quotes by each author
    • Orders results by frequency (descending)
  2. Tag Popularity Analysis

    • Implements complex string parsing to separate comma-delimited tags
    • Counts occurrence frequency of each tag
    • Identifies top 5 most common tags
  3. Prolific Author Identification

    • Filters authors who have contributed more than 5 quotes
    • Results revealed: Albert Einstein (10), J.K. Rowling (9), Marilyn Monroe (7), Dr. Seuss (6), Mark Twain (6)
  4. Content Length Analysis

    • Identifies the longest quote in the dataset
    • Associates this quote with its author

Component 3: Exploratory Data Analysis & Visualization

Basic EDA Operations

  • Dataset overview (using pandas head() function)
  • Missing value detection
  • Unique author count
  • Unique tag count
  • Quote length statistics (average character count)
  • Descriptive statistics

Visualizations

  1. Author Distribution (Bar Chart)

    • Horizontal bar chart of top 10 most quoted authors
    • Uses Seaborn's barplot with viridis color palette
  2. Word Frequency (Word Cloud)

    • Pre-processes quote text (removing punctuation, lowercasing)
    • Generates word cloud visualization of most frequent words
    • Uses viridis color palette on white background
  3. Tag Distribution (Pie Chart)

    • Extracts and processes tag data
    • Creates pie chart showing proportional representation of top 5 tags
    • Includes percentage labels

Key Insights

From the SQL analysis, we can see that:

  • Albert Einstein is the most quoted author with 10 quotes
  • Five authors have more than 5 quotes each
  • The longest quote in the dataset belongs to a specific author (query implemented)

The visualizations provide additional insights:

  • Clear visualization of quote distribution across authors
  • Common themes and words used across all quotes
  • Relative popularity of different tags

Technical Implementation Highlights

  1. Efficient Web Scraping

    • The scraping script efficiently navigates through all pages
    • Robust extraction of structured data elements
  2. Advanced SQL Techniques

    • Use of GROUP BY and HAVING clauses for filtering
    • Complex string parsing for tag analysis
    • Character length functions for content analysis
  3. Effective Data Visualization

    • Appropriate chart types for different data aspects
    • Consistent color scheme (viridis palette)
    • Clear labeling and titles

Conclusion

This project demonstrates a complete data pipeline from web scraping to visual analysis. It showcases skills in:

  • Web data extraction
  • Database querying and analysis
  • Data cleaning and transformation
  • Statistical analysis
  • Data visualization

The implementation successfully answers key questions about author contribution frequency, tag popularity, and content characteristics within the quotes dataset.

About

This repository contains a project focused on scraping, storing, and analyzing quotes from the web.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors