I try to solve this problem by creating a new kernel thread, "kccd", to populate the TLB buffer in the background.
Specifically,
1. A new kernel thread is created with the help of "arch_initcall"; this kthread is responsible for allocating memory and setting its attributes (private or shared), as sketched after this list;
2. The "swiotlb_tbl_map_single" routine only use the spin_lock protected TLB buffers pre-allocated by the kthread;
a) which actually includes ONE memory allocation brought by xarray insertion "__xa_insert__".
3. After each allocation, the water level of the pre-allocated TLB resources is checked; if it falls below the preset value (half of the watermark), the kthread is woken up to replenish the pool.
4. TLB buffer allocation in the kthread is batched in units of "(MAX_ORDER_NR_PAGES << PAGE_SHIFT)" bytes to reduce the spinlock hold time and the number of calls to set_memory_decrypted().
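
For illustration only, here is a minimal sketch of how the kccd side could look. All identifiers below (kccd_chunk, kccd_lock, kccd_free_chunks, kccd_nr_chunks, KCCD_WATERMARK, kccd_refill_one_batch, ...) are hypothetical, not the identifiers from the actual patch, and error handling is abbreviated. The thread is registered via arch_initcall, allocates in (MAX_ORDER_NR_PAGES << PAGE_SHIFT)-byte batches so that one set_memory_decrypted() call covers a whole batch, and sleeps while the pool is at or above the watermark:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <asm/set_memory.h>

#define KCCD_BATCH_PAGES	MAX_ORDER_NR_PAGES
#define KCCD_WATERMARK		64	/* chunks; illustrative value */

struct kccd_chunk {
	struct list_head node;
	void *vaddr;
};

static DEFINE_SPINLOCK(kccd_lock);
static LIST_HEAD(kccd_free_chunks);
static unsigned int kccd_nr_chunks;
static struct task_struct *kccd_task;

/* Allocate one batch and convert it to shared with a single call. */
static int kccd_refill_one_batch(void)
{
	struct kccd_chunk *chunk;
	struct page *page;
	unsigned long flags;

	chunk = kmalloc(sizeof(*chunk), GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;

	page = alloc_pages(GFP_KERNEL,
			   get_order(KCCD_BATCH_PAGES << PAGE_SHIFT));
	if (!page) {
		kfree(chunk);
		return -ENOMEM;
	}
	chunk->vaddr = page_address(page);

	/*
	 * One set_memory_decrypted() per batch; a real implementation
	 * must handle failure here, which this sketch omits.
	 */
	set_memory_decrypted((unsigned long)chunk->vaddr, KCCD_BATCH_PAGES);

	spin_lock_irqsave(&kccd_lock, flags);
	list_add_tail(&chunk->node, &kccd_free_chunks);
	kccd_nr_chunks++;
	spin_unlock_irqrestore(&kccd_lock, flags);
	return 0;
}

static int kccd_thread_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* Refill up to the watermark, then go back to sleep. */
		while (READ_ONCE(kccd_nr_chunks) < KCCD_WATERMARK) {
			if (kccd_refill_one_batch())
				break;
		}
		set_current_state(TASK_INTERRUPTIBLE);
		if (READ_ONCE(kccd_nr_chunks) >= KCCD_WATERMARK)
			schedule();
		__set_current_state(TASK_RUNNING);
	}
	return 0;
}

static int __init kccd_init(void)
{
	kccd_task = kthread_run(kccd_thread_fn, NULL, "kccd");
	return PTR_ERR_OR_ZERO(kccd_task);
}
arch_initcall(kccd_init);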
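
And a matching sketch of the map-side fast path: it only takes from the pre-allocated pool under the spinlock (spin_lock_irqsave, since swiotlb mappings can be requested from atomic context) and wakes kccd once the pool drops below half the watermark. The xarray bookkeeping mentioned in 2a) is deliberately left out, and again the names are illustrative:

/* Take one pre-allocated, already-shared chunk; no page allocation here. */
static void *kccd_take_chunk(void)
{
	struct kccd_chunk *chunk;
	void *vaddr = NULL;
	unsigned long flags;

	spin_lock_irqsave(&kccd_lock, flags);
	chunk = list_first_entry_or_null(&kccd_free_chunks,
					 struct kccd_chunk, node);
	if (chunk) {
		list_del(&chunk->node);
		kccd_nr_chunks--;
	}
	spin_unlock_irqrestore(&kccd_lock, flags);

	if (chunk) {
		vaddr = chunk->vaddr;
		kfree(chunk);
	}

	/* Low-water check: wake the refill thread at half the watermark. */
	if (READ_ONCE(kccd_nr_chunks) < KCCD_WATERMARK / 2)
		wake_up_process(kccd_task);

	return vaddr;
}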