Benchmarking FSDP with LLaMA Factory vs. 3D Parallelism with Nanotron

In my recent experiments, I compared two distributed training frameworks: LLaMA Factory using FSDP (Fully Sharded Data Parallel) via accelerate, and Nanotron using 3D parallelism. Below, I summarize the benchmark results and share my observations. The model used in all experiments is meta-llama/Meta-Llama-3.1-8B.
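
Since the FSDP runs go through accelerate, here is a minimal sketch of what that wiring looks like programmatically. This is not LLaMA Factory's actual code: the plugin is left at its library defaults, the linear layer is a toy stand-in for Meta-Llama-3.1-8B, and in practice LLaMA Factory drives all of this through `accelerate launch` and a YAML config.

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Sketch only: LLaMA Factory configures FSDP via `accelerate launch` + a YAML
# config; this shows the equivalent programmatic setup with library defaults.
fsdp_plugin = FullyShardedDataParallelPlugin()  # defaults to full sharding in current accelerate versions
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = torch.nn.Linear(4096, 4096)  # toy stand-in for Meta-Llama-3.1-8B
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# prepare() wraps the model in FSDP, sharding parameters, gradients,
# and optimizer state across the ranks of the launch.
model, optimizer = accelerator.prepare(model, optimizer)

# Launch with, e.g.: accelerate launch --num_processes <world_size> train.py
```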

Context

I came across a related benchmark in a LightOn blog post, where they trained a roughly 2-billion-parameter model (reported as 1.6B in their table) on a single node with 4x A100-64GB GPUs. Their results are summarized in the table below:

| Parallelism Type | Model | Batch Size | Block-wise Activation Recomputation | Throughput | TFLOPs |
|---|---|---|---|---|---|
| FSDP | 1.6B | 5 | Yes | 11,000 | 96.25 |
| 3D (DP=2, PP=2, TP=1) | 1.6B | 2 | Not supported | 4,880 | 51 |

I found these results very interesting. However, they could benefit from more comprehensive experiments for two key reasons:

  • First, the training was conducted on only one node, which doesn’t account for the overhead introduced by inter-node communication.
  • Second, the two setups used different global batch sizes, so the throughput numbers are not strictly comparable: runs with different batch sizes can converge at different rates.

For these reasons, I conducted a more elaborate, though still incomplete, benchmark.

Benchmark Setup

Setup:

  • H100 80GB GPUs
  • GPUs per Node: 4
  • Examples per GPU (micro-batch size): 4
  • Cutoff Length (maximum sequence length): 4096 tokens
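
With 4 GPUs per node, the world size is simply nnodes × 4, and any Nanotron layout has to satisfy DP × TP × PP = world size. Below is a tiny sanity-check sketch of that constraint; the layouts are taken from the results table that follows, and the helper name is mine, not part of either framework.

```python
GPUS_PER_NODE = 4  # from the setup above

def check_layout(nodes: int, dp: int, tp: int, pp: int) -> None:
    """Check that a (DP, TP, PP) layout fills exactly nodes * GPUS_PER_NODE GPUs."""
    world_size = nodes * GPUS_PER_NODE
    assert dp * tp * pp == world_size, f"{dp}*{tp}*{pp} != {world_size}"
    print(f"{nodes} nodes -> {world_size} GPUs: DP={dp}, TP={tp}, PP={pp} OK")

# Layouts from the results table below (rows 3, 4, and 6).
check_layout(nodes=4, dp=4, tp=4, pp=1)
check_layout(nodes=4, dp=4, tp=2, pp=2)
check_layout(nodes=8, dp=8, tp=4, pp=1)
```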

Results:

| # | Framework | Settings | Nodes | Gradient Accum. (global batch size) | Throughput (tokens/s) |
|---|---|---|---|---|---|
| 1 | LLaMA Factory | accelerate + FSDP | 4 | 1 (64) | 93.6K |
| 2 | Nanotron | DP=2, PP=1, TP=8 | 4 | 8 (64) | 83.3K |
| 3 | Nanotron | DP=4, PP=1, TP=4 | 4 | 4 (64) | 128K |
| 4 | Nanotron | DP=4, PP=2, TP=2 | 4 | 1 (16) | 49.2K |
| 5 | Nanotron | DP=4, PP=1, TP=4 | 4 | 1 (16) | 108K |
| 6 | Nanotron | DP=8, PP=1, TP=4 | 8 | 4 (128) | 253K |
| 7 | Nanotron | DP=4, PP=1, TP=8 | 8 | 8 (128) | 154K |
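
The global batch sizes in parentheses follow from DP degree × examples per GPU × gradient-accumulation steps, where the FSDP run counts every GPU as a data-parallel rank (DP=16 on 4 nodes). Here is a short sketch reproducing that column, under my reading of how "examples per GPU" is counted:

```python
EXAMPLES_PER_GPU = 4  # micro-batch size from the setup above

def global_batch_size(dp: int, grad_accum: int) -> int:
    """Global batch size per optimizer step = DP * micro-batch * accumulation steps."""
    return dp * EXAMPLES_PER_GPU * grad_accum

# Row 1: FSDP on 4 nodes x 4 GPUs -> every GPU is a data-parallel rank (DP=16).
print(global_batch_size(dp=16, grad_accum=1))  # 64
# Row 3: Nanotron DP=4 with 4 gradient-accumulation steps.
print(global_batch_size(dp=4, grad_accum=4))   # 64
# Row 6: Nanotron DP=8 with 4 gradient-accumulation steps.
print(global_batch_size(dp=8, grad_accum=4))   # 128
```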

Observations

  • Comparing rows 1 and 3, 2D parallelism (DP=4, TP=4, PP=1) delivers roughly 1.37 times the throughput of FSDP (128K vs. 93.6K tokens/s).
  • Rows 1 and 5 show that even with a smaller global batch size and no gradient accumulation (16 vs. 64), 2D parallelism still achieves higher throughput than FSDP (108K vs. 93.6K tokens/s).
  • Comparing rows 2 and 3, as well as rows 6 and 7, suggests that the optimal TP degree is 4, matching the number of GPUs per node: with TP=8, each tensor-parallel group spans two nodes, pushing its frequent collectives over the slower inter-node links.
  • Throughput scales fairly predictably with the number of nodes. For the FSDP setup, it can be approximated as nnodes × 23,405 tokens/s (row 1: 93.6K tokens/s on 4 nodes).
  • With Nanotron at TP=4 and PP=1, throughput appears to scale as roughly nnodes × 32,000 tokens/s as DP is increased (rows 3 and 6; see the sketch after this list).
  • There is a noticeable drop in throughput when PP > 1. For example, DP=4, PP=2, TP=2 (row 4) reaches only 49.2K tokens/s, versus 108K tokens/s for DP=4, PP=1, TP=4 at the same global batch size (row 5), likely because with no gradient accumulation there are too few micro-batches per step to hide the pipeline bubble.
  • Increasing DP while holding TP=4 and PP=1 constant scales throughput almost linearly: with DP=8 on 8 nodes, throughput reached 253K tokens/s (row 6), up from 128K on 4 nodes (row 3).
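
To make the two scaling rules of thumb above concrete, here is a small sketch that turns them into per-node estimates and compares them with the measured numbers. The per-node constants are just measured throughput divided by node count, so treat them as extrapolations rather than guarantees.

```python
# Per-node throughput estimates derived from the results table above.
FSDP_TOKENS_PER_NODE = 23_405       # row 1: 93.6K tokens/s on 4 nodes
NANOTRON_TOKENS_PER_NODE = 32_000   # rows 3 and 6: TP=4, PP=1, DP scaled

def estimate_throughput(nnodes: int, tokens_per_node: int) -> int:
    """Rule-of-thumb estimate: throughput scales linearly with the number of nodes."""
    return nnodes * tokens_per_node

print(estimate_throughput(4, FSDP_TOKENS_PER_NODE))      # 93,620  (row 1: 93.6K measured)
print(estimate_throughput(4, NANOTRON_TOKENS_PER_NODE))  # 128,000 (row 3: 128K measured)
print(estimate_throughput(8, NANOTRON_TOKENS_PER_NODE))  # 256,000 (row 6: 253K measured)
```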

Known Issues with Nanotron

While Nanotron shows promise in certain configurations, there are currently some known issues:

  1. Gradient Accumulation with PP > 1: There is a bug that prevents gradient accumulation from working across multiple nodes when PP > 1. You can track this issue here.
  2. Checkpoint Resumption with PP > 1: Another bug exists that prevents resumption from checkpoints when PP > 1. This issue is being tracked here.

Acknowledgment

I would like to thank ChatGPT for helping refine the writing in this post.