Real vs. Fake Job Posting Analysis

Python | Pandas | Matplotlib | Seaborn

Analyzed 17,880 job listings to find what separates fake posts from real ones — and built a simple Trust Score system to flag suspicious listings before they reach job seekers.

The Short Version

866 of the 17,880 job postings in this dataset are fake, or 4.8%. If a similar rate held on a real job platform handling millions of listings a year, tens to hundreds of thousands of fake posts would reach real job seekers.
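The arithmetic behind that figure is easy to check. The sketch below assumes the dataset's fraud rate carries over unchanged to a larger platform, which is an assumption rather than something the data shows. The listing volumes and the `expected_fakes` helper are hypothetical.

```python
# Counts taken from the dataset summary in this README.
TOTAL_POSTS = 17_880
FAKE_POSTS = 866

fraud_rate = FAKE_POSTS / TOTAL_POSTS  # share of postings labeled fake


def expected_fakes(listings: int, rate: float = fraud_rate) -> int:
    """Expected fake posts if `rate` held at this volume (an assumption)."""
    return round(listings * rate)


if __name__ == "__main__":
    print(f"Baseline fraud rate: {fraud_rate:.1%}")
    # Hypothetical yearly listing volumes, for illustration only.
    for volume in (1_000_000, 5_000_000):
        print(f"{volume:>9,} listings -> ~{expected_fakes(volume):,} fake posts")
```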

The problem is not identifying that fake posts exist. The problem is catching them before they go live — fast enough to scale, without slowing down legitimate employers.

This analysis finds the five signals that consistently separate fake posts from real ones, and turns them into a scoring system any platform can implement without machine learning.

The Core Finding

In this dataset, 66.7% of posts with no company logo, no company profile, a short description, and suspicious keywords are fake, up from a 4.8% baseline.

That jump is achieved using four simple checks. No model training. No complex infrastructure. Just pattern-based rules derived directly from the data.

How Fraud Rate Climbs as Red Flags Stack

This is the most important finding in the entire analysis.

Red Flag Combination (each row adds one condition)	Fraud Rate
All jobs (baseline)	4.8%
No company logo	15.9%
No logo + no profile	21.8%
No logo + no profile + short description	26.7%
No logo + no profile + short description + suspicious keywords	66.7%

Individual signals are useful. Combined, they become a reliable detection system.
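The stacking in the table above can be sketched as a chain of filters, where each step keeps only the posts that also show the next red flag. The records and key names (`has_logo`, `short_desc`, and so on) are invented for illustration and are not rows from the dataset, so the rates they produce will not match the table.

```python
from typing import Callable, Dict, List, Tuple

Post = Dict[str, bool]

# Toy records, invented for illustration; not rows from the dataset.
POSTS: List[Post] = [
    dict(has_logo=True, has_profile=True, short_desc=False, keywords=False, fake=False),
    dict(has_logo=True, has_profile=True, short_desc=False, keywords=False, fake=False),
    dict(has_logo=True, has_profile=True, short_desc=True, keywords=False, fake=False),
    dict(has_logo=True, has_profile=False, short_desc=False, keywords=False, fake=False),
    dict(has_logo=True, has_profile=True, short_desc=False, keywords=True, fake=False),
    dict(has_logo=False, has_profile=True, short_desc=False, keywords=False, fake=False),
    dict(has_logo=False, has_profile=False, short_desc=False, keywords=False, fake=False),
    dict(has_logo=False, has_profile=False, short_desc=True, keywords=False, fake=True),
    dict(has_logo=False, has_profile=False, short_desc=True, keywords=True, fake=True),
    dict(has_logo=False, has_profile=False, short_desc=True, keywords=True, fake=True),
]

# Each step narrows the previous subset by one more red flag.
STEPS: List[Tuple[str, Callable[[Post], bool]]] = [
    ("No company logo", lambda p: not p["has_logo"]),
    ("+ no profile", lambda p: not p["has_profile"]),
    ("+ short description", lambda p: p["short_desc"]),
    ("+ suspicious keywords", lambda p: p["keywords"]),
]


def fraud_rate(posts: List[Post]) -> float:
    """Share of posts labeled fake; NaN when the subset is empty."""
    return sum(p["fake"] for p in posts) / len(posts) if posts else float("nan")


def stacked_rates(posts: List[Post]) -> List[Tuple[str, int, float]]:
    """Return (label, subset size, fraud rate) for the baseline and each step."""
    rows = [("All jobs (baseline)", len(posts), fraud_rate(posts))]
    subset = posts
    for label, has_flag in STEPS:
        subset = [p for p in subset if has_flag(p)]
        rows.append((label, len(subset), fraud_rate(subset)))
    return rows


if __name__ == "__main__":
    for label, size, rate in stacked_rates(POSTS):
        print(f"{label:<24} n={size:<3} fraud rate={rate:.1%}")
```

Reporting the subset size next to each rate is deliberate: every added condition shrinks the subset, and a rate computed on a handful of posts is far less stable than one computed on thousands.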

What the Data Shows

Missing company info is the strongest single signal

Real companies show who they are. Scammers hide. Posts without a company logo are fake 15.9% of the time, more than three times the 4.8% baseline, and the rate rises to 21.8% when the company profile is missing too.

Fake descriptions are short and vague

Real job posts are detailed. Scammers do not bother writing thorough role descriptions, and the same gap shows up in company profiles: real ones average ~96 words, fake ones ~32.

Short profile + vague language = significantly higher fraud risk.
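A length comparison like the one above can be reproduced with a per-class mean of word counts. This is a minimal sketch on invented text. The function names are hypothetical, and counting a missing profile as zero words is an assumption about how blanks should be treated.

```python
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple


def word_count(text: Optional[str]) -> int:
    """Whitespace-separated word count; a missing text counts as 0 words."""
    return len((text or "").split())


def mean_words_by_label(rows: Iterable[Tuple[Optional[str], bool]]) -> Dict[str, float]:
    """Average word count for real and fake posts, given (text, is_fake) pairs."""
    groups: Dict[bool, List[int]] = {False: [], True: []}
    for text, is_fake in rows:
        groups[bool(is_fake)].append(word_count(text))
    return {
        ("fake" if is_fake else "real"): mean(counts)
        for is_fake, counts in groups.items()
        if counts  # skip a class with no rows
    }


if __name__ == "__main__":
    # Toy profiles, invented for illustration.
    sample = [
        ("We are a family-owned bakery hiring two full-time bakers", False),
        ("Global logistics firm with offices in nine countries", False),
        ("Great company", True),
        (None, True),
    ]
    print(mean_words_by_label(sample))
```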

Suspicious keywords appear 3x more in fake posts

Words like "easy money," "fast cash," "urgent," and "work from home" appear in ~12% of fake posts vs ~4% of real ones. One keyword alone is not enough. Two or more together is a strong signal.
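The "two or more keywords" rule can be sketched as a small phrase matcher. The phrase list holds only the four examples quoted above; the full list used in the notebook is not shown here, so treat it as a placeholder. Matching on whole words is a design choice made for this sketch.

```python
import re
from typing import Iterable, List

# The four phrases quoted above; a production list would likely be longer.
FLAGGED_PHRASES = ("easy money", "fast cash", "urgent", "work from home")


def find_flagged(text: str, phrases: Iterable[str] = FLAGGED_PHRASES) -> List[str]:
    """Return the distinct flagged phrases found in `text`, case-insensitively."""
    hits = []
    for phrase in phrases:
        # \b anchors the match to word boundaries, so "urgent" does not
        # match inside "insurgent" or "urgently".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


def has_keyword_flag(text: str, min_hits: int = 2) -> bool:
    """True when at least `min_hits` distinct flagged phrases appear."""
    return len(find_flagged(text)) >= min_hits


if __name__ == "__main__":
    print(find_flagged("URGENT: easy money, work from home!"))
    print(has_keyword_flag("Urgent hire for a senior accountant"))
```

Counting distinct phrases rather than total occurrences keeps one repeated word from tripping the flag on its own.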

Missing fields compound the risk

A single missing field is normal — not every employer fills everything in. But four or more missing fields is a pattern. The fraud rate drops sharply as post completeness goes up.
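The completeness pattern can be checked by grouping posts on how many fields they leave blank and computing the fraud rate per group. The field names and the `fraudulent` label key below are assumptions, and the records are toy data.

```python
from collections import defaultdict
from typing import Dict, List, Mapping

# Assumed field names; substitute the columns actually being checked.
FIELDS = ("company_profile", "requirements", "benefits", "salary_range", "industry")


def count_missing(post: Mapping[str, object]) -> int:
    """Number of FIELDS that are absent, None, or whitespace-only."""
    return sum(1 for field in FIELDS if not str(post.get(field) or "").strip())


def fraud_rate_by_missing(posts: List[Mapping[str, object]]) -> Dict[int, float]:
    """Map each missing-field count to the share of fake posts in that group."""
    totals: Dict[int, int] = defaultdict(int)
    fakes: Dict[int, int] = defaultdict(int)
    for post in posts:
        n_missing = count_missing(post)
        totals[n_missing] += 1
        fakes[n_missing] += int(post["fraudulent"])
    return {n: fakes[n] / totals[n] for n in sorted(totals)}


if __name__ == "__main__":
    full = {
        "company_profile": "Regional bakery chain",
        "requirements": "2 years of experience",
        "benefits": "Health plan",
        "salary_range": "30000-35000",
        "industry": "Food",
    }
    # Toy records, invented for illustration.
    posts = [
        {**full, "fraudulent": 0},
        {**full, "fraudulent": 0},
        {**full, "salary_range": None, "fraudulent": 0},
        {"requirements": "None needed", "fraudulent": 1},
        {"requirements": "Driving licence", "fraudulent": 0},
        {"fraudulent": 1},
    ]
    for n_missing, rate in fraud_rate_by_missing(posts).items():
        print(f"{n_missing} missing -> fraud rate {rate:.0%}")
```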

Salary presence is not a safety signal

This one is counterintuitive. Posts that include salary details actually have a slightly higher fraud rate. Some scammers add fake salary ranges to look more convincing. Salary presence alone tells you nothing.

Remote job claims carry slightly higher risk

Fake listings lean toward remote claims, likely because a remote role reaches a wider pool of applicants. The effect is too weak to use as a standalone signal, so it is not one of the five Trust Score signals below.

Employment type does not matter

Fake posts are spread evenly across full-time, part-time, contract, and other types. This is not a useful filter.

The Trust Score System

Based on these patterns, a simple five-point scoring system can flag risky posts at submission time — before they go live.

Signal	Red Flag Condition	Points
Company logo	Missing	+1
Company profile	Missing or blank	+1
Description length	Below minimum character threshold	+1
Suspicious keywords	Contains 2+ flagged words	+1
Missing fields	3+ fields left blank	+1

How the score works:

Score	Risk Level	Action
0–1	Low	Post goes live immediately
2–3	Medium	Flag for quick review
4–5	High	Hold for manual review before publishing

The check is a handful of field lookups and string comparisons, so it adds negligible latency at submission. It targets the patterns most common in fake posts without creating friction for real employers who fill in their details properly, which they naturally do.
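The two tables above translate into a short scoring function. This is a minimal sketch, not the notebook's implementation: the field names, the 200-character description floor, the phrase list, and the set of fields counted as "missing" are all assumptions chosen for illustration. Despite the name, a higher score here means a riskier post.

```python
from typing import Mapping, Optional, Tuple

# Illustrative assumptions; tune these against real data.
MIN_DESCRIPTION_CHARS = 200
FLAGGED_PHRASES = ("easy money", "fast cash", "urgent", "work from home")
CHECKED_FIELDS = ("location", "salary_range", "requirements", "benefits", "industry")

Post = Mapping[str, Optional[str]]


def _blank(value: Optional[str]) -> bool:
    """True for None, empty, or whitespace-only values."""
    return value is None or not str(value).strip()


def trust_score(post: Post) -> int:
    """Count red flags (0-5) using the five signals from the table."""
    description = str(post.get("description") or "")
    text = description.lower()
    # Simple substring match; counts distinct phrases, not occurrences.
    keyword_hits = sum(phrase in text for phrase in FLAGGED_PHRASES)
    missing_fields = sum(_blank(post.get(field)) for field in CHECKED_FIELDS)
    flags = (
        _blank(post.get("company_logo")),                   # logo missing
        _blank(post.get("company_profile")),                # profile missing or blank
        len(description.strip()) < MIN_DESCRIPTION_CHARS,   # description too short
        keyword_hits >= 2,                                  # 2+ flagged phrases
        missing_fields >= 3,                                # 3+ fields left blank
    )
    return sum(flags)


def risk_level(score: int) -> Tuple[str, str]:
    """Map a score to the (risk level, action) pair from the table."""
    if score <= 1:
        return "Low", "Post goes live immediately"
    if score <= 3:
        return "Medium", "Flag for quick review"
    return "High", "Hold for manual review before publishing"


if __name__ == "__main__":
    suspicious = {"description": "Urgent! Easy money, work from home."}
    score = trust_score(suspicious)
    print(score, risk_level(score))
```

Keeping each signal as one boolean in a tuple makes the score easy to audit: a reviewer can see exactly which flags fired for a held post.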

What Should Be Done

Action 1 — Make company identity fields mandatory

Require company name, logo upload, and profile description at submission. This single change blocks the most common pattern in fake posts. Real employers already provide this information. Scammers cannot fake it without effort.

Action 2 — Set minimum content standards

Add a minimum character count for job descriptions and requirements. A floor of 200–300 characters filters out the laziest scams without affecting real listings.

Action 3 — Deploy the Trust Score at submission

Build the five-signal scorer as a pre-publication check. Posts scoring 4–5 go to manual review. Posts scoring 2–3 get flagged in the moderation queue. This scales without needing a dedicated fraud team reviewing every listing.

These three changes require no machine learning, no model retraining, and no complex infrastructure. They are implementable immediately.
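Actions 1 and 2 amount to form validation at submission time. The sketch below shows one way that could look. The field names are hypothetical and the 200-character minimum is taken from the low end of the range suggested in Action 2.

```python
from typing import List, Mapping, Optional

# Hypothetical form field names.
REQUIRED_FIELDS = ("company_name", "company_logo", "company_profile")
# Minimum lengths in characters; 200 is the low end of the suggested range.
MIN_CHARS = {"description": 200, "requirements": 200}


def validate_submission(form: Mapping[str, Optional[str]]) -> List[str]:
    """Return a list of validation errors; an empty list means the post passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not (form.get(field) or "").strip():
            errors.append(f"{field} is required")
    for field, minimum in MIN_CHARS.items():
        length = len((form.get(field) or "").strip())
        if length < minimum:
            errors.append(
                f"{field} must be at least {minimum} characters (got {length})"
            )
    return errors


if __name__ == "__main__":
    draft = {"company_name": "Acme Ltd", "description": "Apply now."}
    for problem in validate_submission(draft):
        print("-", problem)
```

Returning every error at once, rather than stopping at the first, lets a legitimate employer fix the whole form in one pass.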

What Did Not Matter

Factor	Result
Employment type	Fake posts spread evenly across all types — not a useful filter
Salary presence	Slightly higher fraud rate in posts with salary — counterintuitive and unreliable
Remote work alone	Slightly elevated risk but too weak to use as a standalone signal

Testing and ruling these out is just as important as finding what does work.

Data Quality

Before any analysis, the data was validated:

Missing values identified and handled across all 18 columns
Text fields with blanks filled as "Unknown" for consistency
Binary flags created for missing company info, salary, location, education, and industry
All features validated before being used in scoring
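The flag-then-fill steps listed above could look like this in pandas. The column names are assumptions, the rows are toy data, and the helper name is hypothetical. The order matters: the binary flag is recorded before the blank is overwritten with "Unknown", otherwise the information about what was missing is lost.

```python
from typing import List

import pandas as pd


def add_missing_flags_and_fill(frame: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Add a 0/1 `missing_<col>` flag per column, then fill blanks with "Unknown"."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    for col in columns:
        # Normalize: NaN/None become "", surrounding whitespace is stripped.
        cleaned = out[col].fillna("").astype(str).str.strip()
        is_missing = cleaned == ""
        out[f"missing_{col}"] = is_missing.astype(int)
        out[col] = cleaned.where(~is_missing, "Unknown")
    return out


if __name__ == "__main__":
    # Toy rows, invented for illustration.
    df = pd.DataFrame(
        {
            "company_profile": ["We build logistics software.", None, "  "],
            "salary_range": [None, "40000-50000", None],
            "location": ["US, NY, New York", None, "GB, LND, London"],
        }
    )
    print(add_missing_flags_and_fill(df, ["company_profile", "salary_range", "location"]))
```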

Dataset

Detail	Info
Source	Kaggle — Real or Fake Job Posting Prediction
Total Rows	17,880
Fake Posts	866 (4.8%)
Columns	18
Type	Mix of text fields and binary flags

Tools Used

Tool	Used For
Python	Data cleaning, feature engineering, scoring logic
Pandas	Data manipulation and grouping
NumPy	Numerical operations
Matplotlib	All charts and visualizations
Seaborn	Statistical plots and styled visuals
Jupyter Notebook	Full end-to-end analysis

Project Structure

real-vs-fake-job-postings/
│
├── scam_job_detection_and_risk_analysis.ipynb  ← Full analysis notebook
├── README.md                                    ← You are reading this
│
└── images/
    ├── missing_data_chart_high_res.png
    ├── real_vs_fake_jobs_high_res.png
    ├── top_missing_columns_high_res.png
    ├── company_logo_comparison_high_res.png
    ├── telecommuting_comparison_high_res.png
    ├── employment_type_distribution_high_res.png
    ├── missing_company_profile_fraud_high_res.png
    ├── description_length_fixed_legend.png
    ├── salary_transparency_fraud_rate.png
    ├── completeness_vs_fraud_rate.png
    └── risk_factor_bold_high_res.png

How to Run This

Clone this repo and move into it

git clone https://github.com/analytics-ak/real-vs-fake-job-postings.git
cd real-vs-fake-job-postings

Install required libraries (the notebook package provides the jupyter notebook command)

pip install pandas numpy matplotlib seaborn notebook

Open the notebook

jupyter notebook scam_job_detection_and_risk_analysis.ipynb

Run all cells; the charts generate automatically

Conclusion

Fake job posts are not sophisticated. They are lazy — missing logos, missing profiles, short descriptions, and obvious keyword patterns. The data shows this consistently across the entire dataset.

The five signals identified here are enough to catch most fraud at submission time, before a single job seeker sees the listing. No model training required. No complex infrastructure. Just pattern-based rules that work because scammers consistently cut corners in the same places.

This analysis shows that fraud detection does not have to be complicated to be effective — it has to be applied at the right point in the process.

Author

Ashish Kumar Dongre

🔗 LinkedIn | 💻 GitHub | 📂 Dataset on Kaggle
