Improving MPI_Pack performance in CUDA-aware MPI
