According to the discussion in "Is average the correct way for the gradient in DistributedDataParallel", I think we should set the learning rate to 8×lr. I will state my reasoning for a scenario with 1 node, 8 GPUs, and a local batch size of 64 (images processed by one GPU in each iteration):
(1) Consider a batch of images (batch size = 512). In the DataParallel scenario, a complete forward-backward pipeline is:

the input data are split into 8 slices (each containing 64 images), and each slice is fed to the net to compute its output

the outputs are concatenated on the master GPU (usually GPU 0) to form a [512, C] output

compute the loss against the ground truth (same dimensions: [512, C]): loss = \frac{1}{512} \sum_{i=1}^{512} mse(output[i], groundtruth[i])
(using MSE loss as an illustration)

use loss.backward() to compute the gradients.
So the final [512, C] outputs are the same as those computed on one GPU. The learning rate here should therefore be set to 8×lr, to stay consistent with a batch size of 512 in the one-GPU, one-node scenario.
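The DataParallel steps above can be sketched in plain Python. This is only an illustration of the split, compute, and gather arithmetic, not of the real `DataParallel` implementation: the one-parameter model `forward`, the helper `mean_mse`, and the random data are all hypothetical.

```python
import random

NUM_GPUS = 8
LOCAL_BATCH = 64

def forward(w, xs):
    """Hypothetical one-parameter model: output = w * x."""
    return [w * x for x in xs]

def mean_mse(outputs, targets):
    """Mean squared error averaged over all samples."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

random.seed(0)
total = NUM_GPUS * LOCAL_BATCH  # 512 samples
xs = [random.uniform(-1.0, 1.0) for _ in range(total)]
ys = [random.uniform(-1.0, 1.0) for _ in range(total)]
w = 0.5

# Single-GPU reference: one forward pass over all 512 samples.
single_outputs = forward(w, xs)

# DataParallel-style: split into 8 slices, run each slice, gather in order.
slices = [xs[k * LOCAL_BATCH:(k + 1) * LOCAL_BATCH] for k in range(NUM_GPUS)]
gathered_outputs = [o for s in slices for o in forward(w, s)]

outputs_match = gathered_outputs == single_outputs
losses_match = mean_mse(gathered_outputs, ys) == mean_mse(single_outputs, ys)
print(outputs_match, losses_match)
```

Because the loss is computed once over the gathered 512 outputs, it is the same quantity as in the single-GPU case.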
(2) Second, when DistributedDataParallel is used, the pipeline is:

the input data are also split into 8 slices

the outputs are computed on each GPU to form a [64, C] output

On each GPU, compute the loss loss = \frac{1}{64} \sum_{i=1}^{64} mse(output[i], groundtruth[i])
and compute the gradients grad_k (k is the GPU index, k=0,1,...,7).
(This differs from DataParallel, which needs to collect all outputs on the master GPU.)

Average the gradients across all GPUs: avg_grad = \frac{1}{8} \sum_{k=0}^{7} grad_k
This way, the averaged gradients are also the same as the gradients computed in the one-GPU, one-node scenario. So I think the learning rate here needs to be set to 8×lr to stay consistent with a batch size of 512 in the one-GPU, one-node scenario.
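The claim that the averaged gradient matches the one-GPU gradient can be checked numerically. The sketch below is plain Python, not real DistributedDataParallel code: it assumes a hypothetical one-parameter model `w * x` with a mean-reduced squared-error loss, and `mean_mse_grad` is an illustrative helper.

```python
import math
import random

NUM_GPUS = 8
LOCAL_BATCH = 64

def mean_mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2), averaged over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
total = NUM_GPUS * LOCAL_BATCH  # 512 samples
xs = [random.uniform(-1.0, 1.0) for _ in range(total)]
ys = [random.uniform(-1.0, 1.0) for _ in range(total)]
w = 0.5

# One GPU, batch size 512: gradient of the loss averaged over all samples.
full_grad = mean_mse_grad(w, xs, ys)

# 8 workers, local batch 64: each worker averages over its own samples...
local_grads = [
    mean_mse_grad(
        w,
        xs[k * LOCAL_BATCH:(k + 1) * LOCAL_BATCH],
        ys[k * LOCAL_BATCH:(k + 1) * LOCAL_BATCH],
    )
    for k in range(NUM_GPUS)
]
# ...and the local gradients are then averaged across workers.
avg_grad = sum(local_grads) / NUM_GPUS

grads_match = math.isclose(full_grad, avg_grad, rel_tol=1e-9, abs_tol=1e-12)
print(grads_match)
```

The two values agree up to floating-point rounding because every worker holds the same number of samples, so a mean of per-worker means equals the mean over all 512 samples.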
The main difference between your view and mine is that, when the local batch size is 64, I think the local gradients are averaged over the local samples, resulting in torch.ones_like(param)*64/64, whereas you think the local gradients are summed over the local samples, resulting in torch.ones_like(param) * 64. I think the local gradients are averaged mainly because loss functions in PyTorch, like mse(), compute the average loss over all input samples by default, so the gradients computed from such a loss are also averaged over all input samples.
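The two views can be compared with a small plain-Python sketch. It assumes, purely for illustration, that every sample contributes a gradient of exactly 1 to the parameter, mirroring the `torch.ones_like(param)` example. In PyTorch the two cases would correspond to `reduction="mean"` (the default) and `reduction="sum"` on the loss.

```python
NUM_GPUS = 8
LOCAL_BATCH = 64

# Assumption for illustration: each sample contributes a gradient of 1.
per_sample_grads = [1.0] * LOCAL_BATCH

# Mean-reduced loss: the local gradient is averaged over local samples (64/64).
local_grad_mean = sum(per_sample_grads) / LOCAL_BATCH

# Sum-reduced loss: the local gradient is summed over local samples (64).
local_grad_sum = sum(per_sample_grads)

# Averaging identical local gradients across 8 workers leaves them unchanged.
avg_grad_mean = sum([local_grad_mean] * NUM_GPUS) / NUM_GPUS
avg_grad_sum = sum([local_grad_sum] * NUM_GPUS) / NUM_GPUS

print(avg_grad_mean, avg_grad_sum)
```

Under the mean-reduced loss the averaged gradient stays at the per-sample scale, while under the sum-reduced loss it is 64 times larger, which is where the two positions diverge.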
I do not know whether I understand DistributedDataParallel correctly. Please let me know if anything is wrong.