[PATCH V6 RESEND 0/5] cachefiles: Introduce failover mechanism for on-demand mode

From: Jia Zhu
Date: Sun Nov 19 2023 - 23:15:54 EST


Changes since v5:
In cachefiles_daemon_poll(), replace xa_for_each_marked with xas_for_each_marked.

[Background]
============
In on-demand read mode, if the user daemon unexpectedly closes an on-demand fd
(for example, because the daemon crashed), subsequent read operations and
inflight requests that rely on that fd fail with -EIO.
While this may be tolerable for individual users, it is a serious problem in a
public cloud production environment such as ours. The I/O errors are
propagated to cloud service users, where they can disrupt running jobs and
compromise the overall stability of the service. There is currently no way to
recover from this situation.

[Design]
========
The main idea behind daemon failover is to reopen the objects that inflight
requests depend on, so that the newly started daemon can process those requests
as usual. To achieve this, two requirements need to be met:
1. Storing inflight requests during a daemon crash:
It is necessary to have a mechanism in place to store the inflight
requests while the daemon is offline or during a crash. This ensures
that the requests are not lost and can be processed once the daemon
is up and running again.
2. Holding the handle of /dev/cachefiles:
The handle of /dev/cachefiles should be retained, either by the container
snapshotter or systemd, to facilitate the failover process. This allows
the newly started daemon to access the necessary resources and continue
processing the requests seamlessly.
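
As an illustration of requirement 2, a holder process could hand the retained
fd to a restarted daemon as SCM_RIGHTS ancillary data over a connected Unix
domain socket. This is a minimal userspace sketch, not code from this patchset;
send_fd() and recv_fd() are hypothetical helper names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one int (one fd). */
union fd_cmsg_buf {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/*
 * Hypothetical helper: send one open fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure.
 */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new fd (a fresh reference to the same open file), or -1.
 */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

Because the holder keeps its own reference, the open file behind
/dev/cachefiles stays alive while the daemon is down, and the restarted daemon
gets a new fd for the same open file from recv_fd().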

Note that if the user chooses not to keep the /dev/cachefiles fd, the failover
feature is not enabled. In that case, inflight requests fail with an error,
which is passed on to the container, exactly as happens today.

By implementing these mechanisms, the failover system ensures that inflight requests
are not lost during a daemon crash and that the newly started daemon can resume
its operations smoothly, providing a more robust and reliable service for users.

[Flow Path]
===========
This patchset introduces three states for an ondemand object:
CLOSE: This state represents an object that has either just been allocated or
closed by the user daemon.
OPEN: This state indicates that the object is open and ready for processing.
It signifies that the related OPEN request has been successfully handled
and the object is available for read operations or other interactions.
REOPENING: This state is assigned to an object that has been previously closed
but is now being driven to reopen due to a read request. The REOPENING state
indicates that the object is in the process of being reopened, preparing
for subsequent read operations.
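
The three states can be read as a small state machine. The following is a toy
model of the description above, not the code from this patchset: the enum and
function names are hypothetical, and the handling of a close in any state is a
simplifying assumption.

```c
/* Hypothetical names; the identifiers used in the patches may differ. */
enum ondemand_state {
	ONDEMAND_CLOSE,		/* just allocated, or closed by the user daemon */
	ONDEMAND_OPEN,		/* OPEN request handled; reads may proceed */
	ONDEMAND_REOPENING,	/* was closed; a read request drives a reopen */
};

enum ondemand_event {
	EV_OPEN_HANDLED,	/* daemon successfully handled an OPEN request */
	EV_DAEMON_CLOSE,	/* daemon closed the object */
	EV_READ_REQUEST,	/* a read request arrived for the object */
};

/* Return the state that follows @s when @ev happens. */
static enum ondemand_state ondemand_next(enum ondemand_state s,
					 enum ondemand_event ev)
{
	switch (ev) {
	case EV_OPEN_HANDLED:
		/* Both a first open and a reopen end in OPEN. */
		return ONDEMAND_OPEN;
	case EV_DAEMON_CLOSE:
		/* Assumption: a close always leaves the object in CLOSE. */
		return ONDEMAND_CLOSE;
	case EV_READ_REQUEST:
		/* Only a closed object needs to be driven to reopen. */
		return s == ONDEMAND_CLOSE ? ONDEMAND_REOPENING : s;
	}
	return s;
}
```

In this model a read against a closed object yields REOPENING, and the
following successful OPEN handling yields OPEN, after which the read can be
served.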

1. The daemon uses Unix Domain Sockets (UDS) to send and receive the fd, so that
the reference to "/dev/cachefiles" is kept alive and can be passed on.

2. In the event of a user daemon crash, the daemon is restarted and the reference
to the file descriptor for "/dev/cachefiles" is recovered.

3. The user daemon writes "restore" to the device, triggering the following actions:
3.1. The object's state is reset from CLOSE to REOPENING, indicating that it
is in the process of reopening.
3.2. A work unit is initialized, which reinitializes the object and adds it to
the work queue. This allows the daemon to handle the open request,
transitioning from kernel space to user space.

4. As a result of these recovery mechanisms, the user of the upper filesystem
remains unaware of the daemon crash. The inflight I/O operations are restored
and correctly handled, ensuring that the system operates seamlessly without
any noticeable disruptions.
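
Seen from the restarted daemon, steps 2 and 3 might look like the sketch below:
write the "restore" command to the recovered fd, then wait for it to become
readable so that the re-issued requests can be picked up. The helper names are
hypothetical, and waiting with poll(2) is an assumption about how the daemon is
structured rather than something this patchset requires.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue the "restore" command on a recovered
 * /dev/cachefiles fd. Returns 0 on success, -1 on failure.
 */
static int send_restore(int devfd)
{
	static const char cmd[] = "restore";
	size_t len = sizeof(cmd) - 1;	/* command text without the NUL */
	ssize_t n;

	do {
		n = write(devfd, cmd, len);
	} while (n < 0 && errno == EINTR);

	return n == (ssize_t)len ? 0 : -1;
}

/*
 * Hypothetical helper: wait until @fd reports readable data.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A recovering daemon would call send_restore() once on the fd it got back, then
loop on wait_readable() and read requests as it did before the crash.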

By implementing these steps, the system achieves fault tolerance by recovering and
restoring the necessary references and states, ensuring the smooth functioning of
the user daemon and providing a seamless experience to the users of the upper filesystem.

[GitWeb]
========
https://github.com/userzj/linux/tree/fscache-failover-v6

RFC: https://lore.kernel.org/all/20220818135204.49878-1-zhujia.zj@xxxxxxxxxxxxx/
V1: https://lore.kernel.org/all/20221011131552.23833-1-zhujia.zj@xxxxxxxxxxxxx/
V2: https://lore.kernel.org/all/20221014030745.25748-1-zhujia.zj@xxxxxxxxxxxxx/
V3: https://lore.kernel.org/all/20221014080559.42108-1-zhujia.zj@xxxxxxxxxxxxx/
V4: https://lore.kernel.org/all/20230111052515.53941-1-zhujia.zj@xxxxxxxxxxxxx/
V5: https://lore.kernel.org/all/20230329140155.53272-1-zhujia.zj@xxxxxxxxxxxxx/

[Test]
======
There are testcases for the scenario described above.
A user process reads a file through fscache on-demand reading.
At the same time, we repeatedly kill the daemon.
The expected result is that the data read by the user is identical to the
original, and the user does not notice that the daemon was ever killed.

https://github.com/userzj/demand-read-cachefilesd/commits/failover-test

In addition, this patchset has been carried in our downstream kernel as
out-of-tree patches for almost a year, in real production use.
We therefore hope it can land upstream as well.

Jia Zhu (5):
  cachefiles: introduce object ondemand state
  cachefiles: extract ondemand info field from cachefiles_object
  cachefiles: resend an open request if the read request's object is
    closed
  cachefiles: narrow the scope of triggering EPOLLIN events in ondemand
    mode
  cachefiles: add restore command to recover inflight ondemand read
    requests

 fs/cachefiles/daemon.c    |  15 +++-
 fs/cachefiles/interface.c |   7 +-
 fs/cachefiles/internal.h  |  59 +++++++++++++-
 fs/cachefiles/ondemand.c  | 166 ++++++++++++++++++++++++++++----------
 4 files changed, 201 insertions(+), 46 deletions(-)

--
2.20.1