Skip to content

singhdivyank/multi-echelon-rl-inventory

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deep Reinforcement Learning for Multi-Echelon Inventory Optimization

Repository: https://github.com/singhdivyank/multi-echelon-rl-inventory

Python PyTorch Platform GPU License

Inventory Optimization is a critical problem in supply chain systems, where traditional heuristics such as (s, S) policies struggle under stochastic demand, lead time variability, and multi-echelon dependencies.

This project demonstrates how Deep Reinforcement Learning (DRL) can learn adaptive invetory control policies that optimize long-term cost and service level trade-offs.

Problem Formulation

We model the system as a Markov Decision Process (MDP):

  • State (s): Inventory levels, pipeline stock, demand signals
  • Action (a): Replenishment quantities
  • Reward (r): Negative total cost
  • Transition: Inventory dynamics + stochastic demand

Objective

Minimize expected cumulative cost: $$J(\pi) = \mathbb{E} \left[ \sum_{t=0}^{T} \gamma^t C(s_t, a_t) \right]$$

Algorithms Implemented

Asynchronous Advantage Actor-Critic (A3C)

  • Parallel actor-learners
  • Advantage-based updates
  • Stable and efficient training

Proximal Policy Optimization (PPO)

  • Clipped objective for stability
  • Generalized Advantage Estimation (GAE)
  • Strong emperical performance

Baseline: (s, S) policy

Classical inventory heuristic used as benchmark

Results Summary

Each algorithm is trained and evaluated on two environments: the original stationary formulation (Env-1) and a harder non-stationary variant with seasonal demand, correlated retailer demand, demand shocks, heavy-tailed lead times, and stochastic capacity caps (Env-2).

Algorithm Env-1 (stationary) Env-2 (non-stationary)
PPO vs. fixed base-stock heuristic +36.76% cost reduction +94.20% cost reduction
A3C vs. retuned (s,S) policy +1.64% cost reduction -21.79% (heuristic wins)

The negative A3C/Env-2 result is intentional and discussed in the report: single-environment evaluation of deep-RL inventory policies is not reliable evidence of real robustness.

Full writeup with figures, methodology, and threats to validity: see the AAAI-26-formatted paper at docs/paper.pdf (source: docs/paper.tex; Markdown mirror: docs/report.md).

Project Structure

.multi-echelon-rl-inventory
├──actor-critic/
|   ├── configs/
│   |   ├── config.yaml
│   |   └── meisConfig.yaml
|   ├── src/
│   |   ├── __init__.py
│   |   ├── a3c_agent.py
│   |   ├── meis_env.py
│   |   ├── s_s_policy.py
│   |   └── trainer.py
|   ├── utils/
│   |   ├── __init__.py
│   |   ├── evaluation.py
│   |   ├── helpers.py
│   |   └── visualisation.py
|   ├── results/
│   |   ├── checkpoints/
│   |   ├── logs/
│   |   └── plots/
│   ├── main.py
|   └── README.md
├──ppo/
|   ├── configs/
│   |   └── config.yaml
|   ├── models/
│   |   ├── __init__.py
│   |   ├── actor_critic.py
│   |   ├── baseline.py
│   |   ├── env.py
│   |   ├── replay_buffer.py
│   |   └── ppo.py
|   ├── src/
│   |   ├── __init__.py
│   |   ├── train.py
│   |   ├── evaluate.py
│   |   └── visualise.py
|   ├── utils/
│   |   ├── metrics.py
│   |   ├── logger.py
│   |   └── helpers.py
|   ├── results/
│   |   ├── checkpoints/
│   |   ├── logs/
│   |   ├── plots/
│   |   ├── evaluation_results.json
│   |   └── training_stats.json
│   ├── main.py
|   └── README.md
├── .gitignore
├── README.md
└── requirements.txt

Setup and Execution

python3 -m venv .venv           # initialise virtual environment
source .venv/bin/activate       # activate virtual environment
pip install -r requirements.txt # install all required dependencies

Each algorithm accepts --env to pick the environment variant. Artifacts from Env-1 and Env-2 are written to separate directories (results/ vs results_complex/) so the two experiments never clobber each other.

Run A3C

cd actor-critic
python3 main.py --mode train --env meis      # train on Env-1 (original MEIS)
python3 main.py --mode train --env complex   # train on Env-2 (non-stationary MEIS)
python3 main.py --mode eval  --env meis      # evaluate on Env-1
python3 main.py --mode eval  --env complex   # evaluate on Env-2

Run PPO

cd ppo
python3 main.py --mode train --env divergent # train on Env-1 (divergent supply chain)
python3 main.py --mode train --env complex   # train on Env-2 (non-stationary)
python3 main.py --mode eval  --env divergent # evaluate on Env-1
python3 main.py --mode eval  --env complex   # evaluate on Env-2

Smoke test + report

From the repo root:

python scripts/smoke_env.py     # end-to-end sanity check for all 4 (algo, env) combos
python scripts/make_report.py   # rebuild docs/report.md + figures from saved JSONs
python scripts/build_paper.py   # rebuild docs/paper.pdf from docs/paper.tex (needs pdflatex)

Platform Independence

This project runs on multiple operating systems and hardware backends with no code changes required.

Operating System

OS Supported Notes
macOS Intel and Apple Silicon (M1 / M2 / M3)
Windows Windows 10 / 11
Linux Recommended for multi-process A3C training

Virtual Environment Setup

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Windows (PowerShell):

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Windows (Command Prompt):

python -m venv .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt

On Windows, if script execution is blocked run: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned

Hardware / GPU

PyTorch auto-detects the best available device at runtime — no manual configuration needed.

Backend Hardware Auto-selected when
CUDA NVIDIA GPU torch.cuda.is_available() returns True
MPS Apple Silicon (M1 / M2 / M3) torch.backends.mps.is_available() returns True
CPU Any machine Fallback if neither CUDA nor MPS is available

CUDA users: install the CUDA-enabled PyTorch wheel that matches your driver before running pip install -r requirements.txt. See pytorch.org/get-started for the correct install command.

Python Version

Requires Python 3.8+. Recommended: Python 3.10 or 3.11.

About

Reinforcement Learning–based optimization for multi-echelon inventory management, leveraging A3C and PPO to minimize costs and improve service levels under stochastic demand. Includes benchmarking against classical (s, S) policies, with scalable simulations and performance analysis

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages