How to optimize core and threads in CUDA C