(2 nodes with 2 GPUs each) modes. If the test uses only 2 GPUs, set the distributed backend to "mp" so that Ray does not schedule all workers on a node other than the head node, which can ...
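As a rough illustration of the rule above, the sketch below picks "mp" whenever the job fits on a single node and falls back to "ray" otherwise. The helper name choose_distributed_backend and the engine_kwargs keys are hypothetical, modelled on vLLM-style engine arguments; the source does not name the exact setting, so treat them as assumptions rather than the project's actual API.

```python
def choose_distributed_backend(gpus_needed: int, gpus_per_node: int) -> str:
    """Return "mp" when the job fits on one node, otherwise "ray".

    Hypothetical helper: it only encodes the rule of thumb from the text.
    """
    if gpus_needed <= 0 or gpus_per_node <= 0:
        raise ValueError("GPU counts must be positive")
    # A job that fits on the local node does not need Ray to place workers.
    return "mp" if gpus_needed <= gpus_per_node else "ray"


# Assumed settings for a 2-GPU test on a cluster of 2 nodes with 2 GPUs each.
# The key names are assumptions, not confirmed by the source.
engine_kwargs = {
    "tensor_parallel_size": 2,
    "distributed_executor_backend": choose_distributed_backend(2, 2),
}
print(engine_kwargs)
```

A test that needs all 4 GPUs would call choose_distributed_backend(4, 2) and get "ray", since the workers must then span both nodes.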
self.q_proj = torch.nn.Linear(hidden_dim, hidden_dim)
self.k_proj = torch.nn.Linear(hidden_dim, hidden_dim)
self.v_proj = torch.nn.Linear(hidden_dim, hidden_dim)
...