Re: [PATCH v5 10/10] Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
From: Nuno Das Neves
Date: Wed Mar 19 2025 - 14:06:34 EST
On 3/19/2025 8:26 AM, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Tuesday, March 18, 2025 7:10 PM
>>
>> From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Tuesday, March
>> 18, 2025 5:34 PM
>>>
>>> On 3/17/2025 4:51 PM, Michael Kelley wrote:
>>>> From: Nuno Das Neves <nunodasneves@xxxxxxxxxxxxxxxxxxx> Sent: Wednesday, February 26, 2025 3:08 PM
>
> [snip]
>
>>>>> +
>>>>> + region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
>>>>> + if (!region)
>>>>> + return -EINVAL;
>>> <snip>
>>>>> + case MSHV_GPAP_ACCESS_TYPE_ACCESSED:
>>>>> + hv_type_mask = 1;
>>>>> + if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>>>>> + hv_flags.clear_accessed = 1;
>>>>> + /* not accessed implies not dirty */
>>>>> + hv_flags.clear_dirty = 1;
>>>>> + } else { // MSHV_GPAP_ACCESS_OP_SET
>>>>
>>>> Avoid C++ style comments.
>>>>
>>> Ack
>>>
>>>>> + hv_flags.set_accessed = 1;
>>>>> + }
>>>>> + break;
>>>>> + case MSHV_GPAP_ACCESS_TYPE_DIRTY:
>>>>> + hv_type_mask = 2;
>>>>> + if (args.access_op == MSHV_GPAP_ACCESS_OP_CLEAR) {
>>>>> + hv_flags.clear_dirty = 1;
>>>>> + } else { // MSHV_GPAP_ACCESS_OP_SET
>>>>
>>>> Same here.
>>>>
>>> Ack
>>>
>>>>> + hv_flags.set_dirty = 1;
>>>>> + /* dirty implies accessed */
>>>>> + hv_flags.set_accessed = 1;
>>>>> + }
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + states = vzalloc(states_buf_sz);
>>>>> + if (!states)
>>>>> + return -ENOMEM;
>>>>> +
>>>>> + ret = hv_call_get_gpa_access_states(partition->pt_id, args.page_count,
>>>>> + args.gpap_base, hv_flags, &written,
>>>>> + states);
>>>>> + if (ret)
>>>>> + goto free_return;
>>>>> +
>>>>> + /*
>>>>> + * Overwrite states buffer with bitmap - the bits in hv_type_mask
>>>>> + * correspond to bitfields in hv_gpa_page_access_state
>>>>> + */
>>>>> + for (i = 0; i < written; ++i)
>>>>> + assign_bit(i, (ulong *)states,
>>>>
>>>> Why the cast to ulong *? I think this argument to assign_bit() is void *, in
>>>> which case the cast wouldn't be needed.
>>>>
>>> It looks like assign_bit() and friends resolve to a set of functions which do
>>> take an unsigned long pointer, e.g.:
>>>
>>> __set_bit() -> generic___set_bit(unsigned long nr, volatile unsigned long *addr)
>>> set_bit() -> arch_set_bit(unsigned int nr, volatile unsigned long *p)
>>> etc...
>>>
>>> So a cast is necessary.
>>
>> Indeed, you are right. Seems like set_bit() and friends should take a void *.
>> But that's a different kettle of fish.
>>
>>>
>>>> Also, assign_bit() does atomic bit operations. Doing such in a loop like
>>>> here will really hammer the hardware memory bus with atomic
>>>> read-modify-write cycles. Use __assign_bit() instead, which does
>>>> non-atomic operations. You don't need atomic here as no other
>>>> threads are modifying the bit array.
>>>>
>>> I didn't realize it was atomic. I'll change it to __assign_bit().
>>>
>>>>> + states[i].as_uint8 & hv_type_mask);
>>>>
>>>> OK, so the starting contents of "states" is an array of bytes. The ending
>>>> contents is an array of bits. This works because every bit in the ending
>>>> bit array is set to either 0 or 1. Overlap occurs on the first iteration
>>>> where the code reads the 0th byte, and writes the 0th bit, which is part of
>>>> the 0th byte. The second iteration reads the 1st byte, and writes the 1st bit,
>>>> which doesn't overlap, and there's no overlap from then on.
>>>>
>>>> Suppose "written" is not a multiple of 8. The last byte of "states" as an
>>>> array of bits will have some bits that have not been set to either 0 or 1 and
>>>> might be leftover garbage from when "states" was an array of bytes. That
>>>> garbage will get copied to user space. Is that OK? Even if user space knows
>>>> enough to ignore those bits, it seems a little dubious to be copying even
>>>> a few bits of garbage to user space.
>>>>
>>>> Some comments might help here.
>>>>
>>> This is a good point. The expectation is indeed that userspace knows which
>>> bits are valid from the returned "written" value, but I agree it's a bit
>>> odd to have some garbage bits in the last byte. How does this look (to be
>>> inserted here directly after the loop):
>>>
>>> + /* zero the unused bits in the last byte of the returned bitmap */
>>> + if (written > 0) {
>>> + u8 last_bits_mask;
>>> + int last_byte_idx;
>>> + int bits_rem = written % 8;
>>> +
>>> + /* bits_rem == 0 when all bits in the last byte were assigned */
>>> + if (bits_rem > 0) {
>>> + /* written > 0 ensures last_byte_idx >= 0 */
>>> + last_byte_idx = ((written + 7) / 8) - 1;
>>> + /* bits_rem > 0 ensures this masks 1 to 7 bits */
>>> + last_bits_mask = (1 << bits_rem) - 1;
>>> + states[last_byte_idx].as_uint8 &= last_bits_mask;
>>> + }
>>> + }
>>
>> A simpler approach is to "continue" the previous loop. And if "written"
>> is zero, this additional loop won't do anything either:
>>
>> for (i = written; i < ALIGN(written, 8); ++i)
>> __clear_bit(i, (ulong *)states);
>>
>
> One further thought here: Could "written" be less than
> args.page_count at this point? That would require
> hv_call_get_gpa_access_states() to not fail, but still return
> a value for written that is less than args.page_count. If that
> could happen, then the above loop should be:
>
> for (i = written; i < bitmap_buf_sz * 8; ++i)
> __clear_bit(i, (ulong *)states);
>
> so that all the uninitialized bits and bytes that will be written
> back to user space are cleared.
>

Hmmm...now I'm not so sure where the need for "written" came from in
the first place - in practice "written" will always be equal to
args.page_count except on error, but in that case there's a goto
free_return anyway, so the number is never copied to userspace. And
I checked the userspace code - it doesn't expect a partial result
either.

So it seems to be redundant, but I don't really want to remove it just
now.

Your suggestion with bitmap_buf_sz * 8 should be fine, and will make it
straightforward to remove "written" in a future cleanup if that ends up
looking like a good idea.
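
For illustration only (not the patch itself), here is a minimal userspace
sketch of the in-place conversion with the tail cleared up to
bitmap_buf_sz * 8. The sketch_* names are hypothetical stand-ins, and the
helper works on bytes rather than unsigned long, which matches the kernel's
bit numbering only on little-endian machines. The caller is assumed to pass a
buffer at least max(written, bitmap_bytes) bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's non-atomic
 * __assign_bit(): set or clear bit nr in a byte-addressed bitmap.
 */
static void sketch_assign_bit(size_t nr, uint8_t *buf, int value)
{
	uint8_t mask = (uint8_t)(1u << (nr % 8));

	if (value)
		buf[nr / 8] |= mask;
	else
		buf[nr / 8] &= (uint8_t)~mask;
}

/*
 * Compact one state byte per page into one bit per page, in place.
 *
 * Iteration i reads byte i and then writes bit i, which lives in byte
 * i / 8. Since i / 8 <= i, a write only ever lands in a byte that has
 * already been read (for i == 0 the read happens before the write).
 */
static void sketch_states_to_bitmap(uint8_t *states, size_t written,
				    size_t bitmap_bytes, uint8_t type_mask)
{
	size_t i;

	assert(written <= bitmap_bytes * 8);

	for (i = 0; i < written; ++i)
		sketch_assign_bit(i, states, states[i] & type_mask);

	/* Clear every bit the first loop did not assign. */
	for (i = written; i < bitmap_bytes * 8; ++i)
		sketch_assign_bit(i, states, 0);
}
```

With 10 state bytes and a 2-byte bitmap, the second loop clears bits 10..15,
so nothing left over from the byte array reaches the bytes that would be
copied out.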
>>>
>>> The remaining bytes could be memset() to zero but I think it's fine to leave
>>> them.
>>
>> I agree. The remaining bytes aren't written back to user space anyway
>> since the copy_to_user() uses bitmap_buf_sz.
>
> Maybe I misunderstood what you meant by "remaining bytes". I think
> all bits and bytes that are written back to user space should have
> valid data or zeros so that no garbage is written back.
>
Agreed.
Nuno
> Michael
>
>>
>>>
>>>>> +
>>>>> + args.page_count = written;
>>>>> +
>>>>> + if (copy_to_user(user_args, &args, sizeof(args))) {
>>>>> + ret = -EFAULT;
>>>>> + goto free_return;
>>>>> + }
>>>>> + if (copy_to_user((void __user *)args.bitmap_ptr, states, bitmap_buf_sz))
>>>>> + ret = -EFAULT;
>>>>> +
>>>>> +free_return:
>>>>> + vfree(states);
>>>>> + return ret;
>>>>> +}