This project demonstrates data cleaning techniques using SQL on the Nashville housing dataset. The SQL scripts standardize dates, handle missing values, split addresses, remove duplicates, and prepare the data for analysis.
The dataset is stored in housing.db, a SQLite database containing Nashville housing sales data.
- SQLite3 installed on your system
- Basic knowledge of SQL
Clone the repository:
git clone https://github.com/mrithip/nashville-housing-data-cleaning-sql.git
cd nashville-housing-data-cleaning-sql
Ensure SQLite3 is installed:
sqlite3 --version
Run the data cleaning script:
sqlite3 housing.db < datacleaning.sql
The datacleaning.sql script performs the following operations:
- Table Management: Creates a backup and renames tables as needed
- Date Standardization: Converts sale dates to ISO format (YYYY-MM-DD)
- Missing Data Handling: Populates null property addresses using related records
- Address Parsing: Splits property addresses into separate address and city columns
- Owner Address Parsing: Splits owner addresses into address, city, and state columns
- Data Standardization: Converts "Y"/"N" values in SoldAsVacant to "Yes"/"No"
- Duplicate Removal: Identifies and removes duplicate records based on key fields
- Column Cleanup: Removes unnecessary columns (OwnerAddress, PropertyAddress, TaxDistrict)
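Several of the operations above can be sketched on a throwaway in-memory database. This is an illustration only, not the contents of datacleaning.sql: the table name housing_demo, the sample rows, the use of ParcelID to find "related records", the comma as the address separator, and the set of columns treated as "key fields" for duplicates are all assumptions made for the example.

```python
import sqlite3

# In-memory database so the sketch never touches housing.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical demo table with a few made-up rows.
CREATE TABLE housing_demo (
    UniqueID INTEGER, ParcelID TEXT, PropertyAddress TEXT,
    SaleDate TEXT, SalePrice INTEGER, LegalReference TEXT,
    SoldAsVacant TEXT,
    PropertySplitAddress TEXT, PropertySplitCity TEXT
);
INSERT INTO housing_demo
    (UniqueID, ParcelID, PropertyAddress, SaleDate, SalePrice,
     LegalReference, SoldAsVacant)
VALUES
    (1, 'P-001', '100 MAIN ST, NASHVILLE',    '2013-04-09', 240000, 'REF-1', 'N'),
    (2, 'P-001', NULL,                        '2014-06-10', 250000, 'REF-2', 'Y'),
    (3, 'P-002', '7 OAK AVE, GOODLETTSVILLE', '2015-01-05', 180000, 'REF-3', 'No'),
    (4, 'P-002', '7 OAK AVE, GOODLETTSVILLE', '2015-01-05', 180000, 'REF-3', 'No');

-- Missing data: copy the address from another row with the same ParcelID
-- (assumption: rows sharing a ParcelID share a property address).
UPDATE housing_demo
SET PropertyAddress = (
    SELECT b.PropertyAddress
    FROM housing_demo AS b
    WHERE b.ParcelID = housing_demo.ParcelID
      AND b.PropertyAddress IS NOT NULL
    LIMIT 1)
WHERE PropertyAddress IS NULL;

-- Address parsing: split on the first comma
-- (assumption: every address contains exactly one comma).
UPDATE housing_demo
SET PropertySplitAddress =
        TRIM(SUBSTR(PropertyAddress, 1, INSTR(PropertyAddress, ',') - 1)),
    PropertySplitCity =
        TRIM(SUBSTR(PropertyAddress, INSTR(PropertyAddress, ',') + 1));

-- Data standardization: map 'Y'/'N' to 'Yes'/'No', leave other values alone.
UPDATE housing_demo
SET SoldAsVacant = CASE SoldAsVacant
                       WHEN 'Y' THEN 'Yes'
                       WHEN 'N' THEN 'No'
                       ELSE SoldAsVacant
                   END;

-- Duplicate removal: keep the lowest UniqueID in each group of rows that
-- agree on the chosen key fields (the field list is an assumption).
DELETE FROM housing_demo
WHERE UniqueID NOT IN (
    SELECT MIN(UniqueID)
    FROM housing_demo
    GROUP BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference);
""")

rows = conn.execute(
    "SELECT UniqueID, PropertySplitAddress, PropertySplitCity, SoldAsVacant "
    "FROM housing_demo ORDER BY UniqueID"
).fetchall()
for row in rows:
    print(row)
```

The duplicate step uses GROUP BY with MIN(UniqueID) rather than a window function so that the sketch also runs on older SQLite builds; the actual script may take a different approach.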
After cleaning, the main table nashvillehousingdata contains standardized columns including:
- UniqueID
- ParcelID
- LandUse
- PropertyAddress (original)
- SaleDate (ISO format)
- SalePrice
- LegalReference
- SoldAsVacant
- OwnerName
- PropertySplitAddress
- PropertySplitCity
- OwnerSplitAddress
- OwnerSplitCity
- OwnerSplitState
- And others...
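One way to check which columns a table actually has after cleaning is SQLite's PRAGMA table_info. The sketch below is a minimal example run against an in-memory stand-in table with only three of the columns listed above; the helper name list_columns is hypothetical. To inspect the real data, connect to housing.db instead of ":memory:" and skip the CREATE TABLE line.

```python
import sqlite3

def list_columns(conn, table):
    """Return the column names of `table` in declaration order."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]

# Stand-in table for illustration; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nashvillehousingdata "
    "(UniqueID INTEGER, ParcelID TEXT, SaleDate TEXT)"
)

columns = list_columns(conn, "nashvillehousingdata")
print(columns)
```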
Feel free to submit issues and enhancement requests.