Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

From: Johannes Weiner
Date: Fri Dec 23 2016 - 02:32:59 EST


On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] ------------[ cut here ]------------
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> >
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> >
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> >
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> >
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> >
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
>
> Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> soon followed by NULL pointer deref in list_lru_isolate, one time when
> I tried out Sunday's git tree. Not seen since, I haven't had time to
> investigate, just set it aside as something to worry about if it happens
> again. But it looks like shadow_lru_isolate() has issues beyond Dave's
> case (I've no XFS and no iscsi), suspect unrelated to his other problems.

This seems consistent with what Dave observed: regular pages are
showing up in radix tree nodes on the shadow LRU, even though that
list should only hold nodes whose slots are filled entirely with
exceptional shadow entries. It could be an issue in the new slot
replacement code or in the node tracking callback.
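To illustrate the invariant involved, here is a minimal user-space
model (a sketch under assumptions, not the actual mm/workingset.c or
radix tree code; the names, the node layout and the shadow-entry tag
bit are all simplified): a node linked on the shadow LRU is expected
to have every populated slot holding a shadow entry, so its
populated-slot count must match its exceptional-entry count, and a
stray regular page in any slot is exactly the condition the
shrinker-side warning catches.

/*
 * Minimal model of the invariant: a node on the shadow LRU should
 * hold nothing but exceptional (shadow) entries, so the number of
 * populated slots must equal the number of exceptional entries.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAP_SIZE	64	/* slots per node, as on 64-bit */
#define SHADOW_TAG	0x2UL	/* low bit marking a shadow entry (assumed encoding) */

struct model_node {
	void *slots[MAP_SIZE];	/* page pointers or tagged shadow entries */
	unsigned int count;	/* populated slots */
	unsigned int exceptional; /* populated slots holding shadow entries */
};

static bool is_shadow_entry(const void *entry)
{
	return ((unsigned long)entry & SHADOW_TAG) != 0;
}

/* The checks an isolate callback on the shadow LRU would warn about. */
static bool shadow_only(const struct model_node *node)
{
	unsigned int i;

	if (node->count != node->exceptional)
		return false;			/* a regular page slipped in */

	for (i = 0; i < MAP_SIZE; i++)
		if (node->slots[i] && !is_shadow_entry(node->slots[i]))
			return false;		/* slot holds a page, not a shadow */

	return true;
}

int main(void)
{
	struct model_node node = { .count = 1, .exceptional = 1 };
	static long fake_page;			/* aligned stand-in for a struct page */

	node.slots[0] = (void *)(0x4UL | SHADOW_TAG);	/* a shadow entry */
	printf("shadow-only node: %d\n", shadow_only(&node));

	/* Simulate the suspected bug: a page present that the counters missed. */
	node.slots[1] = &fake_page;
	node.count++;
	printf("node with stray page: %d\n", shadow_only(&node));

	return 0;
}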

Looking...
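
For reference, the bookkeeping step I have in mind when I say "slot
replacement code and node tracking callback" has roughly this shape.
Again a simplified user-space sketch under assumptions, not the real
radix tree or update-node code: after a slot is replaced, the node's
counters must be fixed up and the node linked to or unlinked from the
shadow LRU depending on whether it now holds only shadow entries. A
path that changes a slot without running this bookkeeping would leave
a page-carrying node on the list, which is what the shrinker then
trips over.

/*
 * Sketch of the tracking step: keep the counters and the node's LRU
 * membership consistent with the slot contents on every replacement.
 */
#include <stdio.h>
#include <stdbool.h>

#define SHADOW_TAG 0x2UL	/* assumed shadow-entry tag bit */

struct model_node {
	void *slots[64];
	unsigned int count;		/* populated slots */
	unsigned int exceptional;	/* slots holding shadow entries */
	bool on_shadow_lru;		/* stand-in for list_lru membership */
};

static bool is_shadow_entry(const void *entry)
{
	return ((unsigned long)entry & SHADOW_TAG) != 0;
}

/* Replace one slot and keep the node's tracking state consistent. */
static void replace_slot(struct model_node *node, unsigned int i, void *new)
{
	void *old = node->slots[i];

	/* Account for what leaves and what enters the slot. */
	if (old) {
		node->count--;
		if (is_shadow_entry(old))
			node->exceptional--;
	}
	if (new) {
		node->count++;
		if (is_shadow_entry(new))
			node->exceptional++;
	}
	node->slots[i] = new;

	/*
	 * The "update node" step: only non-empty nodes made up entirely
	 * of shadow entries belong on the shadow LRU.
	 */
	node->on_shadow_lru = node->count && node->count == node->exceptional;
}

int main(void)
{
	struct model_node node = { 0 };
	static long fake_page;		/* aligned stand-in for a struct page */

	replace_slot(&node, 0, (void *)(0x4UL | SHADOW_TAG));
	printf("after shadow insert, on LRU: %d\n", node.on_shadow_lru);

	/* Faulting a page in over the shadow entry must unlink the node. */
	replace_slot(&node, 0, &fake_page);
	printf("after page replaces shadow, on LRU: %d\n", node.on_shadow_lru);

	return 0;
}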