Re: [PATCH 2/2] ext4: implement cgroup writeback support

From: Artem Bityutskiy
Date: Wed Sep 23 2015 - 15:47:24 EST


On Wed, Sep 23, 2015 at 9:24 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> Artem,
>
> Can you (or someone on the cgroups list, perhaps) give more details
> about how Fedora 22 sets up groups?
>
> Unfortunately apparently no one has gotten an official Fedora image
> for Google Compute Engine so it's a bit of a pain for me to reproduce
> the problem. (I suppose I could use AWS, but all of my test
> infrastructure uses GCE, and I'd really rather not have to install a
> Java Runtime on my laptop. :-)

[ My apologies for top posting and for sending HTML e-mails, which do
not get through vger. I am using the gmail web interface, and just
learned how to send plain text from here. So re-sending my longer
answer. ]

Hi Ted, Chris, Tejun, all,

quick and probably messy reply before I go to sleep...

I can give more information tomorrow. But one note - It would be helpful to get
questions like "send us the output of this command" rather than "what are the
cgroups you are in", because I am not fluent with cgroups. IOW, more specific
questions are welcome.

Some more about my setup. I have a number of testboxes, which are
1/2/4-socket servers. I compile the kernel for them on a separate worker
box. Then I copy the kernel binary to /boot and the modules to
/lib/modules, run 'sync', and then reboot into the new kernel. And very
often many module files are corrupted. They won't load because of
magic/CRC mismatches.

I copy stuff over scp. Well, this is not exactly scp, but rather a
Python 'scp' module, which is based on the 'paramiko' module. But I
think this should not matter.

Anyway, maybe there are some cgroups associated with scp/ssh sessions
or with /lib/modules in Fedora 22?
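
A minimal sketch of how the cgroup membership of a process (for example
the sshd session doing the copy) could be dumped, so that the output can
be pasted into a reply. It assumes the "hierarchy-ID:controller-list:path"
line format of /proc/<pid>/cgroup; the function names parse_cgroup and
cgroups_of are made up for illustration and are not part of any tool.

```python
def parse_cgroup(text):
    """Parse the contents of a /proc/<pid>/cgroup file.

    Each non-empty line is expected to look like
    "hierarchy-ID:controller-list:path". Returns a list of
    (hierarchy_id, [controllers], path) tuples.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only; the path may contain colons.
        hier_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hier_id), names, path))
    return entries


def cgroups_of(pid="self"):
    """Read and parse /proc/<pid>/cgroup (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as f:
        return parse_cgroup(f.read())
```

Calling cgroups_of() with no argument reports the calling process
itself; passing the PID of the ssh session that writes the files would
show which cgroups that writer belongs to.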

Also note, I tried to be careful during bisecting: I used 4 servers in
parallel, and did 5 reboot tests on each of them. With this patch
reverted, all 4 boxes survive 5 reboots just fine. Without this patch
reverted, each box fails 1-3 reboots.

And, by the way, I forgot this detail - I cut AC power off at the end,
then put it back on after a 20-second delay. I mean, this is a clean
reboot, but with a power cut at the end. So the process is this:

1. I run 'sync' on the box remotely over ssh.
2. I run 'reboot' on the box remotely over ssh; the ssh connection gets
   closed at this point.
3. I ping the box, and keep doing this until it stops echoing back.
4. I wait several seconds, and then just cut the AC power off. The wall
   socket power is off.
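
The four steps above, plus the 20-second power-off delay, can be sketched
roughly as follows. This is only an illustration, not the actual test
script: run_remote, ping and set_power are hypothetical callables standing
in for the ssh command runner, the ping check and the power switch, and
the 5-second settle time is an assumed value for "several seconds".

```python
import time


def reboot_cycle(run_remote, ping, set_power, sleep=time.sleep,
                 settle_seconds=5, off_seconds=20, poll_seconds=1,
                 max_polls=600):
    """Run one sync / reboot / power-cut cycle.

    run_remote(cmd): runs a shell command on the box over ssh (hypothetical).
    ping():          returns True while the box still answers (hypothetical).
    set_power(on):   switches the AC power on or off (hypothetical).
    """
    run_remote("sync")      # step 1: flush dirty data to disk
    run_remote("reboot")    # step 2: the ssh connection closes here

    # Step 3: keep pinging until the box stops echoing back.
    polls = 0
    while ping():
        polls += 1
        if polls >= max_polls:
            raise TimeoutError("box still answers ping")
        sleep(poll_seconds)

    # Step 4: wait a little, then cut the AC power.
    sleep(settle_seconds)
    set_power(False)

    # Power comes back after the delay.
    sleep(off_seconds)
    set_power(True)
```

Passing the three actions in as callables keeps the sequence itself easy
to check without touching real hardware.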

So if there was something in, say, SSD cache which was not synced, it is
gone too.

Maybe this patch reveals an existing issue. My setup has been stable
with 4.2 and many previous kernels, and it only fails with 4.3-rcX, and
my bisecting led to this patch.