a Python reference implementation, a CUDA C++ GPU implementation. The CUDA version splits the algorithm into explicit computation stages and executes independent updates in parallel, improving ...