Repository: https://github.com/singhdivyank/multi-echelon-rl-inventory
Inventory Optimization is a critical problem in supply chain systems, where traditional heuristics such as (s, S) policies struggle under stochastic demand, lead time variability, and multi-echelon dependencies.
This project demonstrates how Deep Reinforcement Learning (DRL) can learn adaptive inventory control policies that optimize long-term cost and service-level trade-offs.
We model the system as a Markov Decision Process (MDP):
- State (s): Inventory levels, pipeline stock, demand signals
- Action (a): Replenishment quantities
- Reward (r): Negative total cost
- Transition: Inventory dynamics + stochastic demand
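As an illustrative sketch only (the project's actual dynamics live in `meis_env.py` / `env.py`), a single-echelon version of the transition above can be written as: receive the oldest pipeline shipment, place a new order, net demand against on-hand stock, and pay holding/backorder costs.

```python
def step(on_hand, pipeline, order_qty, demand,
         holding_cost=1.0, backorder_cost=5.0):
    """One illustrative inventory transition. Negative on_hand
    represents backorders; the reward is the negative total cost."""
    arrived = pipeline.pop(0)        # shipment whose lead time just elapsed
    pipeline.append(order_qty)       # new order enters the back of the pipeline
    on_hand += arrived - demand
    cost = holding_cost * max(on_hand, 0) + backorder_cost * max(-on_hand, 0)
    return on_hand, pipeline, -cost

# Example: lead time of 2 periods, demand of 4 units this period
on_hand, pipeline, reward = step(on_hand=10, pipeline=[3, 0], order_qty=5, demand=4)
```

All parameter names and cost coefficients here are hypothetical; the real environments add multi-echelon coupling and stochastic lead times.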
The objective is to minimize the expected cumulative discounted cost:

$$\min_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, c(s_t, a_t)\right]$$

where $c(s_t, a_t)$ is the total cost incurred at step $t$ (equivalently, the reward is $r_t = -c(s_t, a_t)$).
Asynchronous Advantage Actor-Critic (A3C)
- Parallel actor-learners
- Advantage-based updates
- Stable and efficient training
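The advantage-based updates above rely on n-step discounted returns minus the critic's value baseline. A minimal sketch (not the repo's `a3c_agent.py` implementation, which operates on tensors):

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A3C-style advantage estimates: n-step discounted return R_t
    minus the critic's value baseline V(s_t)."""
    returns = []
    R = bootstrap_value                  # V(s_T) estimated by the critic
    for r in reversed(rewards):
        R = r + gamma * R                # accumulate discounted return backwards
        returns.append(R)
    returns.reverse()
    return [R - v for R, v in zip(returns, values)]

# Example: two steps, zero bootstrap value
adv = n_step_advantages([1.0, 1.0], [0.5, 0.5], bootstrap_value=0.0)
```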
Proximal Policy Optimization (PPO)
- Clipped objective for stability
- Generalized Advantage Estimation (GAE)
- Strong empirical performance
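The clipped objective takes the more pessimistic of the unclipped and clipped policy-ratio terms, which caps how far a single update can move the policy. A per-sample sketch in plain Python (the repo's `ppo.py` presumably works on batched tensors):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate loss (to be minimized).
    ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)

# A ratio far above 1 + eps gets clipped, so the incentive saturates
loss_clipped = ppo_clip_loss(ratio=1.5, advantage=2.0)
loss_inside = ppo_clip_loss(ratio=1.0, advantage=2.0)
```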
Classical (s, S) inventory heuristic used as the benchmark
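The (s, S) rule is simple: whenever the inventory position drops to or below the reorder point s, order up to the order-up-to level S; otherwise order nothing. A minimal sketch (the repo's version is in `s_s_policy.py`):

```python
def s_S_order(inventory_position, s, S):
    """(s, S) replenishment rule: order up to S when inventory
    position falls to or below the reorder point s."""
    return S - inventory_position if inventory_position <= s else 0

low = s_S_order(3, s=5, S=20)    # below reorder point: order up to S
high = s_S_order(8, s=5, S=20)   # above reorder point: no order
```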
Each algorithm is trained and evaluated on two environments: the original stationary formulation (Env-1) and a harder non-stationary variant with seasonal demand, correlated retailer demand, demand shocks, heavy-tailed lead times, and stochastic capacity caps (Env-2).
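To make the non-stationarity concrete, a demand process like Env-2's seasonal-plus-shock component might be sketched as follows. This is purely illustrative; the actual generator and its parameters live in the environment configs, and the function and parameter names here are hypothetical.

```python
import math
import random

def seasonal_demand(t, base=10.0, amplitude=4.0, period=52,
                    shock_prob=0.05, rng=None):
    """Illustrative non-stationary demand: sinusoidal seasonality,
    Gaussian noise, and rare multiplicative demand shocks."""
    rng = rng or random.Random(0)
    mean = base + amplitude * math.sin(2 * math.pi * t / period)
    demand = rng.gauss(mean, 1.0)
    if rng.random() < shock_prob:
        demand *= 3.0                    # rare demand spike
    return max(0.0, demand)              # demand cannot be negative

samples = [seasonal_demand(t) for t in range(20)]
```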
| Comparison | Env-1 (stationary) | Env-2 (non-stationary) |
|---|---|---|
| PPO vs. fixed base-stock heuristic | +36.76% cost reduction | +94.20% cost reduction |
| A3C vs. retuned (s,S) policy | +1.64% cost reduction | -21.79% (heuristic wins) |
The negative A3C/Env-2 result is reported deliberately and discussed in the paper: evaluating a deep-RL inventory policy on a single environment is not reliable evidence of real-world robustness.
Full writeup with figures, methodology, and threats to validity: see the AAAI-26-formatted paper at docs/paper.pdf (source: docs/paper.tex; Markdown mirror: docs/report.md).
multi-echelon-rl-inventory
├── actor-critic/
│   ├── configs/
│   │   ├── config.yaml
│   │   └── meisConfig.yaml
│   ├── src/
│   │   ├── __init__.py
│   │   ├── a3c_agent.py
│   │   ├── meis_env.py
│   │   ├── s_s_policy.py
│   │   └── trainer.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── evaluation.py
│   │   ├── helpers.py
│   │   └── visualisation.py
│   ├── results/
│   │   ├── checkpoints/
│   │   ├── logs/
│   │   └── plots/
│   ├── main.py
│   └── README.md
├── ppo/
│   ├── configs/
│   │   └── config.yaml
│   ├── models/
│   │   ├── __init__.py
│   │   ├── actor_critic.py
│   │   ├── baseline.py
│   │   ├── env.py
│   │   ├── replay_buffer.py
│   │   └── ppo.py
│   ├── src/
│   │   ├── __init__.py
│   │   ├── train.py
│   │   ├── evaluate.py
│   │   └── visualise.py
│   ├── utils/
│   │   ├── metrics.py
│   │   ├── logger.py
│   │   └── helpers.py
│   ├── results/
│   │   ├── checkpoints/
│   │   ├── logs/
│   │   ├── plots/
│   │   ├── evaluation_results.json
│   │   └── training_stats.json
│   ├── main.py
│   └── README.md
├── .gitignore
├── README.md
└── requirements.txt
python3 -m venv .venv # initialise virtual environment
source .venv/bin/activate # activate virtual environment
pip install -r requirements.txt # install all required dependencies

Each algorithm accepts --env to pick the environment variant. Artifacts from Env-1 and Env-2 are written to separate directories (results/ vs. results_complex/) so the two experiments never clobber each other.
cd actor-critic
python3 main.py --mode train --env meis # train on Env-1 (original MEIS)
python3 main.py --mode train --env complex # train on Env-2 (non-stationary MEIS)
python3 main.py --mode eval --env meis # evaluate on Env-1
python3 main.py --mode eval --env complex # evaluate on Env-2

cd ppo
python3 main.py --mode train --env divergent # train on Env-1 (divergent supply chain)
python3 main.py --mode train --env complex # train on Env-2 (non-stationary)
python3 main.py --mode eval --env divergent # evaluate on Env-1
python3 main.py --mode eval --env complex # evaluate on Env-2

From the repo root:
python scripts/smoke_env.py # end-to-end sanity check for all 4 (algo, env) combos
python scripts/make_report.py # rebuild docs/report.md + figures from saved JSONs
python scripts/build_paper.py # rebuild docs/paper.pdf from docs/paper.tex (needs pdflatex)

This project runs on multiple operating systems and hardware backends with no code changes required.
| OS | Supported | Notes |
|---|---|---|
| macOS | ✅ | Intel and Apple Silicon (M1 / M2 / M3) |
| Windows | ✅ | Windows 10 / 11 |
| Linux | ✅ | Recommended for multi-process A3C training |
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Windows (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Windows (Command Prompt):
python -m venv .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt

On Windows, if script execution is blocked, run:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
PyTorch auto-detects the best available device at runtime — no manual configuration needed.
| Backend | Hardware | Auto-selected when |
|---|---|---|
| CUDA | NVIDIA GPU | torch.cuda.is_available() returns True |
| MPS | Apple Silicon (M1 / M2 / M3) | torch.backends.mps.is_available() returns True |
| CPU | Any machine | Fallback if neither CUDA nor MPS is available |
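The selection order in the table above can be sketched as a small helper. This is an illustrative snippet using standard PyTorch availability checks, not necessarily the exact code in this repo:

```python
import torch

def pick_device():
    """Select the best available backend: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
```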
CUDA users: install the CUDA-enabled PyTorch wheel that matches your driver before running
pip install -r requirements.txt. See pytorch.org/get-started for the correct install command.
Requires Python 3.8+. Recommended: Python 3.10 or 3.11.
