I'm running a matrix transpose program on my PC with a GTX 850M. I used the transpose from this blog post (http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/), but I want to handle huge matrices, with sizes up to 20000 and 30000, and I get an out-of-memory error. Is it related to my GPU's thread limits (1024 threads per block)? What do you recommend I do to solve this problem?
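For reference, here is a rough host-side sketch of how much device memory I think these sizes need. It assumes square n x n matrices of `float` and separate input and output buffers, as in the blog post. The function names (`matrix_bytes`, `transpose_bytes`, `fits_on_device`) are made up for illustration and are not part of the blog's code.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one n x n matrix of float (4 bytes per element).
// 64-bit arithmetic is used on purpose: n * n * 4 overflows a signed
// 32-bit int once n exceeds about 23170.
constexpr std::uint64_t matrix_bytes(std::uint64_t n) {
    return n * n * sizeof(float);
}

// An out-of-place transpose keeps both the input and the output
// matrix in device memory at the same time.
constexpr std::uint64_t transpose_bytes(std::uint64_t n) {
    return 2 * matrix_bytes(n);
}

// True if both buffers fit into free_bytes of device memory.
// free_bytes is an assumption here; it would come from the device.
inline bool fits_on_device(std::uint64_t n, std::uint64_t free_bytes) {
    assert(n > 0);
    return transpose_bytes(n) <= free_bytes;
}
```

By this estimate, n = 20000 needs 1.6 GB per matrix (3.2 GB for both), and n = 30000 needs 3.6 GB per matrix (7.2 GB for both). The free device memory to compare against could be queried with `cudaMemGetInfo(&free, &total)`.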