Re: [PATCH] afs: Fix ENOSPC, EDQUOT and other errors to fail a write rather than retrying

From: Jeffrey E Altman
Date: Thu Nov 04 2021 - 11:15:44 EST


On 11/3/2021 8:22 PM, Matthew Wilcox (willy@xxxxxxxxxxxxx) wrote:
> On Wed, Nov 03, 2021 at 11:43:20PM +0000, David Howells wrote:
>> Currently, at the completion of a storage RPC from writepages, the errors
>> ENOSPC, EDQUOT, ENOKEY, EACCES, EPERM, EKEYREJECTED and EKEYREVOKED cause
>> the pages involved to be redirtied and the write to be retried by the VM at
>> a future time.
>>
>> However, this is probably not the right thing to do, and, instead, the
>> writes should be discarded so that the system doesn't get blocked (though
>> unmounting will discard the uncommitted writes anyway).
> umm. I'm not sure that throwing away the write is the best answer
> for some of these errors. Our whole story around error handling in
> filesystems, the page cache and the VFS is pretty sad, but I don't think
> that this is the right approach.
>
> Ideally, we'd hold onto the writes in the page cache until (eg for ENOSPC
> / EDQUOT), the user has deleted some files, then retry the writes.

Hi Matthew,

I agree that it would be desirable to avoid discarding user data, but in
practice that is hard to do.  The proposed behavior change is consistent
with other Unix AFS/AuriStorFS cache manager implementations.  There
are many situations that can result in an out-of-quota or out-of-space
error where the end user has absolutely no ability to do anything about it.

An EDQUOT error might occur because the AFS volume has reached its
quota while the writer has only insert privilege and cannot delete
anything.  The user might not even be able to list the contents of the
volume.

An ENOSPC error might be the result of the backing store for AFS vice
partitions filling due to data being written to other AFS volumes that
the writer has no ability to access or manage.

AFS cache managers frequently implement write-on-close semantics and
will flush dirty content to the fileserver only when the file is closed
or the local cache is out of space.  Holding onto dirty data that
cannot be flushed to the server on a multi-user timeshare system can
have a negative impact on the other users of that system.

Another risk is that, if dirty data persists locally, the
EDQUOT/ENOSPC errors will be replaced by EACCES or EPERM errors when the
associated authentication credentials expire.
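
To make the set of errors under discussion concrete, here is a minimal
userspace sketch of the classification described in the quoted patch
description.  The function name is hypothetical and this is not the code
from the patch; it uses positive errno values from <errno.h>, whereas
kernel code would be working with negated values.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true for the errors that the quoted patch
 * description lists as ones that should fail the write instead of having
 * the pages redirtied and retried later.
 */
static bool write_error_should_fail(int err)
{
	switch (err) {
	case ENOSPC:		/* vice partition is full */
	case EDQUOT:		/* volume quota exceeded */
	case ENOKEY:		/* no usable key */
	case EACCES:		/* permission denied by ACL */
	case EPERM:		/* operation not permitted */
	case EKEYREJECTED:	/* key rejected by the server */
	case EKEYREVOKED:	/* key has been revoked */
		return true;
	default:
		return false;	/* anything else is left to other handling */
	}
}
```

Note that the credential-expiry case above lands in the same bucket either
way: EDQUOT and EACCES are both in the list, so delaying the failure only
changes which error the user eventually sees.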

If a back-off strategy is to be implemented in the future, AFS does
provide RPCs that can be used to query the volume's online status, the
maximum quota in 1 KiB blocks, the blocks in use, the available blocks
in the partition, and the maximum number of blocks in the partition.
Querying RXAFS_GetVolumeStatus or RXYFS_GetVolumeStatus first can avoid
the overhead of issuing a StoreData operation that is likely to fail.
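
As a rough illustration of such a pre-check, the sketch below decides
whether a store of a given size looks likely to succeed.  The struct and
function names are hypothetical; the struct is a simplified stand-in for
the values listed above, not the actual RPC reply layout.  It also
assumes that a maximum quota of 0 means "no quota", and it ignores the
fact that overwriting already-allocated blocks consumes no extra space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the volume status values. */
struct vol_status_sketch {
	bool     online;		/* volume's online status */
	uint64_t max_quota_kib;		/* max quota in 1 KiB blocks (0: assumed unlimited) */
	uint64_t blocks_in_use;		/* 1 KiB blocks in use by the volume */
	uint64_t part_blocks_avail;	/* available blocks in the partition */
	uint64_t part_max_blocks;	/* maximum number of blocks in the partition */
};

/*
 * Returns true if writing nbytes of new data looks likely to fit.
 * This is only a hint: the state can change between the query and
 * the store, so the store itself can still fail.
 */
static bool store_likely_to_succeed(const struct vol_status_sketch *vs,
				    uint64_t nbytes)
{
	/* Round the request up to whole 1 KiB blocks. */
	uint64_t need = nbytes / 1024 + (nbytes % 1024 != 0);

	if (!vs->online)
		return false;

	/* Would fail with ENOSPC: the partition cannot hold the data. */
	if (need > vs->part_blocks_avail)
		return false;

	/* Would fail with EDQUOT: the volume quota cannot hold the data. */
	if (vs->max_quota_kib != 0) {
		if (vs->blocks_in_use >= vs->max_quota_kib)
			return false;
		if (need > vs->max_quota_kib - vs->blocks_in_use)
			return false;
	}

	return true;
}
```

A cache manager could poll this cheaply while backing off and only
reissue the store once the check passes.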

Jeffrey Altman

