Abstract:
To address the new security challenges that Internet of Things (IoT) devices face in the post-quantum era, this paper proposes a parallel optimization scheme for the babyKyber algorithm based on the CUDA architecture. The work focuses on core algorithm modules such as polynomial multiplication and the number-theoretic transform (NTT). Fine-grained parallelism decomposes these computations down to the level of individual GPU threads to accelerate each operation, while coarse-grained parallelism uses a multi-thread-block architecture to raise overall throughput. In addition, GPU resource utilization is investigated through experiments with dynamic thread block configurations. Experimental data show that the optimized parallel scheme achieves a throughput in the tens of millions on the NVIDIA GeForce MX150 GPU platform, a speedup of three orders of magnitude over the CPU-based platform. This research presents a feasible engineering solution for implementing post-quantum cryptographic algorithms on resource-constrained IoT devices.
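As an illustration of the two levels of parallelism described above, the following is a minimal CPU-side sketch in Python of how the work could be decomposed. It is not the paper's implementation. The parameters `N = 4` and `Q = 17`, the ring Z_q[x]/(x^n + 1), and the function names `coeff_thread`, `block_multiply`, and `grid_multiply` are all assumptions chosen for illustration. The sketch also uses schoolbook multiplication rather than an NTT.

```python
# Hypothetical toy parameters (assumed, not taken from the abstract):
# polynomials of degree < N over Z_Q, reduced modulo x^N + 1.
N = 4
Q = 17


def coeff_thread(a, b, k, n=N, q=Q):
    """Work assigned to one hypothetical GPU thread (fine-grained level):
    compute output coefficient k of a * b mod (x^n + 1, q)."""
    acc = 0
    for i in range(n):
        j = (k - i) % n          # index of the matching coefficient of b
        term = a[i] * b[j]
        # When i > k the product wraps past x^n, and x^n = -1 in this ring.
        if i > k:
            acc -= term
        else:
            acc += term
    return acc % q


def block_multiply(a, b, n=N, q=Q):
    """Work assigned to one hypothetical thread block:
    n independent threads, one per output coefficient."""
    return [coeff_thread(a, b, k, n, q) for k in range(n)]


def grid_multiply(pairs, n=N, q=Q):
    """Work assigned to the hypothetical grid (coarse-grained level):
    one block per independent polynomial multiplication."""
    return [block_multiply(a, b, n, q) for a, b in pairs]


if __name__ == "__main__":
    batch = [([0, 1, 0, 0], [0, 0, 0, 1]),   # x * x^3
             ([1, 1, 0, 0], [1, 1, 0, 0])]   # (1 + x)^2
    print(grid_multiply(batch))
```

In a CUDA kernel written along these lines, `k` would typically come from `threadIdx.x` and the index into the batch from `blockIdx.x`, so no two threads write the same output coefficient and no synchronization is needed inside a block.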