Python | Pandas | Matplotlib | Seaborn
Analyzed 17,880 job listings to find what separates fake posts from real ones — and built a simple Trust Score system to flag suspicious listings before they reach job seekers.
866 out of 17,880 job postings in this dataset are fake. That is 4.8% — and at the scale of a real job platform serving millions of listings, that means hundreds of thousands of fake posts reaching real job seekers every year.
The problem is not identifying that fake posts exist. The problem is catching them before they go live — fast enough to scale, without slowing down legitimate employers.
This analysis finds the five signals that consistently separate fake posts from real ones, and turns them into a scoring system any platform can implement without machine learning.
A post with no company logo, no company profile, a short description, and suspicious keywords has a 66.7% chance of being fake — up from a 4.8% baseline.
That jump is achieved using four simple checks. No model training. No complex infrastructure. Just pattern-based rules derived directly from the data.
This is the most important finding in the entire analysis.
| Red Flag Combination | Fraud Rate |
|---|---|
| All jobs (baseline) | 4.8% |
| No company logo | 15.9% |
| No logo + no profile | 21.8% |
| + Short description | 26.7% |
| + Suspicious keywords | 66.7% |
Individual signals are useful. Combined, they become a reliable detection system.
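The stacked fraud rates in the table come from simple boolean masks combined with pandas. A minimal sketch over toy rows — the column names (`has_company_logo`, `company_profile`, `description`, `fraudulent`) follow the Kaggle schema, but the rows and thresholds here are illustrative, not the notebook's exact logic:

```python
import pandas as pd

# Toy rows standing in for the real dataset; column names follow the
# Kaggle schema (has_company_logo, company_profile, fraudulent).
df = pd.DataFrame({
    "has_company_logo": [0, 0, 1, 0],
    "company_profile":  ["", "", "We build rockets.", ""],
    "description":      ["Easy money now!", "x" * 250, "Detailed role ...", "short"],
    "fraudulent":       [1, 0, 0, 1],
})

def fraud_rate(mask) -> float:
    """Share of fraudulent posts within the subset selected by mask."""
    subset = df.loc[mask, "fraudulent"]
    return subset.mean() if len(subset) else 0.0

# Each red flag is a boolean mask; stacking flags = AND-ing masks
no_logo    = df["has_company_logo"] == 0
no_profile = df["company_profile"].str.strip() == ""
short_desc = df["description"].str.len() < 200

print(f"no logo:             {fraud_rate(no_logo):.1%}")
print(f"+ no profile:        {fraud_rate(no_logo & no_profile):.1%}")
print(f"+ short description: {fraud_rate(no_logo & no_profile & short_desc):.1%}")
```

Each additional mask narrows the subset, which is why the fraud rate climbs as conditions stack.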
Real companies show who they are. Scammers hide. Fake posts almost never include a company logo or profile — and that pattern holds consistently across the entire dataset.
Real job posts are detailed; scammers rarely bother writing a thorough role description. Company profiles on real posts average ~96 words, while fake ones average ~32.
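The length comparison boils down to a word count and a groupby. A sketch with two toy rows mirroring the reported averages, assuming the Kaggle `company_profile` and `fraudulent` columns:

```python
import pandas as pd

# Two toy rows mirroring the reported averages (~96 vs ~32 words)
df = pd.DataFrame({
    "company_profile": ["word " * 96, "word " * 32],
    "fraudulent":      [0, 1],  # 0 = real, 1 = fake
})

# Word count per profile, then the average length for each class
df["profile_words"] = df["company_profile"].str.split().str.len()
avg_words = df.groupby("fraudulent")["profile_words"].mean()
print(avg_words)
```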
Short profile + vague language = significantly higher fraud risk.
Words like "easy money," "fast cash," "urgent," and "work from home" appear in ~12% of fake posts vs ~4% of real ones. One keyword alone is not enough. Two or more together is a strong signal.
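A plain substring check is enough to implement the two-keyword rule. The phrase list below is illustrative, not the exact list used in the notebook:

```python
# Illustrative flag list; the notebook's actual list may differ
SUSPICIOUS_PHRASES = ["easy money", "fast cash", "urgent", "work from home"]

def suspicious_keyword_count(text: str) -> int:
    """Number of distinct flagged phrases appearing in the text."""
    lowered = text.lower()
    return sum(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

def keyword_red_flag(text: str) -> bool:
    """One keyword alone is weak; two or more together is a strong signal."""
    return suspicious_keyword_count(text) >= 2

print(keyword_red_flag("URGENT: easy money, work from home!"))   # True
print(keyword_red_flag("Senior backend engineer, hybrid role"))  # False
```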
A single missing field is normal — not every employer fills everything in. But four or more missing fields is a pattern. The fraud rate drops sharply as post completeness goes up.
This one is counterintuitive. Posts that include salary details actually have a slightly higher fraud rate. Some scammers add fake salary ranges to look more convincing. Salary presence alone tells you nothing.
Fake listings lean toward remote claims because it widens the target audience. Not a strong standalone signal — but it contributes to the overall score.
Fake posts are spread evenly across full-time, part-time, contract, and other types. This is not a useful filter.
Based on these patterns, a simple five-point scoring system can flag risky posts at submission time — before they go live.
| Signal | Red Flag Condition | Points |
|---|---|---|
| Company logo | Missing | +1 |
| Company profile | Missing or blank | +1 |
| Description length | Below minimum character threshold | +1 |
| Suspicious keywords | Contains 2+ flagged words | +1 |
| Missing fields | 3+ fields left blank | +1 |
How the score works:
| Score | Risk Level | Action |
|---|---|---|
| 0–1 | Low | Post goes live immediately |
| 2–3 | Medium | Flag for quick review |
| 4–5 | High | Hold for manual review before publishing |
This check runs in milliseconds. It catches most fake posts without adding friction for legitimate employers, who naturally fill in their details properly.
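The whole scorer fits in a few lines of Python. A sketch, where the `post` field names, the 200-character floor, and the keyword list are assumptions chosen for illustration:

```python
# Illustrative flag list; swap in the notebook's actual keywords
SUSPICIOUS_PHRASES = ("easy money", "fast cash", "urgent", "work from home")

def trust_score(post: dict) -> int:
    """Five-signal risk score, 0-5; one point per red flag."""
    score = 0
    if not post.get("has_company_logo"):                  # logo missing
        score += 1
    if not post.get("company_profile", "").strip():       # profile missing/blank
        score += 1
    if len(post.get("description", "")) < 200:            # below char threshold
        score += 1
    text = (post.get("title", "") + " " + post.get("description", "")).lower()
    if sum(p in text for p in SUSPICIOUS_PHRASES) >= 2:   # 2+ flagged words
        score += 1
    checked = ("salary_range", "location", "required_education", "industry")
    if sum(not post.get(f) for f in checked) >= 3:        # 3+ fields blank
        score += 1
    return score

def action(score: int) -> str:
    """Map a score to the risk tiers in the table above."""
    if score <= 1:
        return "publish"
    if score <= 3:
        return "flag for quick review"
    return "hold for manual review"
```

A lazy scam post with no logo, no profile, and a short keyword-stuffed description scores 5 and is held; a complete, detailed listing scores 0 and publishes immediately.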
Action 1 — Make company identity fields mandatory
Require company name, logo upload, and profile description at submission. This single change blocks the most common pattern in fake posts. Real employers already provide this information; scammers rarely invest the effort to fabricate it.
Action 2 — Set minimum content standards
Add a minimum character count for job descriptions and requirements. A floor of 200–300 characters filters out the laziest scams without affecting real listings.
Action 3 — Deploy the Trust Score at submission
Build the five-signal scorer as a pre-publication check. Posts scoring 4–5 go to manual review. Posts scoring 2–3 get flagged in the moderation queue. This scales without needing a dedicated fraud team reviewing every listing.
These three changes require no machine learning, no model retraining, and no complex infrastructure. They are implementable immediately.
| Factor | Result |
|---|---|
| Employment type | Fake posts spread evenly across all types — not a useful filter |
| Salary presence | Slightly higher fraud rate in posts with salary — counterintuitive and unreliable |
| Remote work alone | Slightly elevated risk but too weak to use as a standalone signal |
Testing and ruling these out is just as important as finding what does work.
Before any analysis, the data was validated:
- Missing values identified and handled across all 18 columns
- Blank text fields filled with "Unknown" for consistency
- Binary flags created for missing company info, salary, location, education, and industry
- All features validated before being used in scoring
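These steps translate into a handful of pandas calls. A sketch over toy rows, assuming the Kaggle column names (`company_profile`, `salary_range`, `location`):

```python
import pandas as pd

# Toy rows; column names follow the Kaggle schema
df = pd.DataFrame({
    "company_profile": [None, "We build rockets."],
    "salary_range":    ["40000-60000", None],
    "location":        ["US, NY, New York", None],
})

# Binary flag for missing salary, taken before filling blanks
df["missing_salary"] = df["salary_range"].isna().astype(int)

# Blank text fields filled with "Unknown" for consistency
text_cols = ["company_profile", "location"]
df[text_cols] = df[text_cols].fillna("Unknown")

# Binary flag for missing company info
df["missing_profile"] = (df["company_profile"] == "Unknown").astype(int)

print(df[["missing_profile", "missing_salary"]])
```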
| Detail | Info |
|---|---|
| Source | Kaggle — Real or Fake Job Posting Prediction |
| Total Rows | 17,880 |
| Fake Posts | 866 (4.8%) |
| Columns | 18 |
| Type | Mix of text fields and binary flags |
| Tool | Used For |
|---|---|
| Python | Data cleaning, feature engineering, scoring logic |
| Pandas | Data manipulation and grouping |
| NumPy | Numerical operations |
| Matplotlib | All charts and visualizations |
| Seaborn | Statistical plots and styled visuals |
| Jupyter Notebook | Full end-to-end analysis |
fake-job-detection/
│
├── scam_job_detection_and_risk_analysis.ipynb ← Full analysis notebook
├── README.md ← You are reading this
│
└── images/
├── missing_data_chart_high_res.png
├── real_vs_fake_jobs_high_res.png
├── top_missing_columns_high_res.png
├── company_logo_comparison_high_res.png
├── telecommuting_comparison_high_res.png
├── employment_type_distribution_high_res.png
├── missing_company_profile_fraud_high_res.png
├── description_length_fixed_legend.png
├── salary_transparency_fraud_rate.png
├── completeness_vs_fraud_rate.png
└── risk_factor_bold_high_res.png
- Clone this repo: `git clone https://github.com/analytics-ak/real-vs-fake-job-postings.git`
- Install required libraries: `pip install pandas numpy matplotlib seaborn`
- Open the notebook: `jupyter notebook scam_job_detection_and_risk_analysis.ipynb`
- Run all cells — charts generate automatically
Fake job posts are not sophisticated. They are lazy — missing logos, missing profiles, short descriptions, and obvious keyword patterns. The data shows this consistently across the entire dataset.
The five signals identified here are enough to catch most fraud at submission time, before a single job seeker sees the listing. No model training required. No complex infrastructure. Just pattern-based rules that work because scammers consistently cut corners in the same places.
This analysis shows that fraud detection does not have to be complicated to be effective — it has to be applied at the right point in the process.
Ashish Kumar Dongre
🔗 LinkedIn | 💻 GitHub | 📂 Dataset on Kaggle