Re: [PATCH v5 3/4] zram: zram memory size limitation
From: Dan Streetman
Date: Wed Aug 27 2014 - 15:04:59 EST
On Wed, Aug 27, 2014 at 12:59 PM, David Horner <ds2horner@xxxxxxxxx> wrote:
> On Wed, Aug 27, 2014 at 12:29 PM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
>> On Wed, Aug 27, 2014 at 11:35 AM, David Horner <ds2horner@xxxxxxxxx> wrote:
>>> On Wed, Aug 27, 2014 at 11:14 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
>>>> On Wed, Aug 27, 2014 at 10:44 AM, David Horner <ds2horner@xxxxxxxxx> wrote:
>>>>> On Wed, Aug 27, 2014 at 10:03 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
>>>>>> On Tue, Aug 26, 2014 at 10:51 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>>>>>> Hey Joonsoo,
>>>>>>>
>>>>>>> On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote:
>>>>>>>> Hello, Minchan and David.
>>>>>>>>
>>>>>>>> On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote:
>>>>>>>> > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
>>>>>>>> > > Hey Joonsoo,
>>>>>>>> > >
>>>>>>>> > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote:
>>>>>>>> > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote:
>>>>>>>> > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>>>>>>>> > >> > ret = -ENOMEM;
>>>>>>>> > >> > goto out;
>>>>>>>> > >> > }
>>>>>>>> > >> > +
>>>>>>>> > >> > + if (zram->limit_pages &&
>>>>>>>> > >> > + zs_get_total_pages(meta->mem_pool) > zram->limit_pages) {
>>>>>>>> > >> > + zs_free(meta->mem_pool, handle);
>>>>>>>> > >> > + ret = -ENOMEM;
>>>>>>>> > >> > + goto out;
>>>>>>>> > >> > + }
>>>>>>>> > >> > +
>>>>>>>> > >> > cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
>>>>>>>> > >>
>>>>>>>> > >> Hello,
>>>>>>>> > >>
>>>>>>>> > >> I don't follow up previous discussion, so I could be wrong.
>>>>>>>> > >> Why this enforcement should be here?
>>>>>>>> > >>
>>>>>>>> > >> I think that this has two problems.
>>>>>>>> > >> 1) alloc/free happens unnecessarilly if we have used memory over the
>>>>>>>> > >> limitation.
>>>>>>>> > >
>>>>>>>> > > True but firstly, I implemented the logic in zsmalloc, not zram but
>>>>>>>> > > as I described in cover-letter, it's not a requirement of zsmalloc
>>>>>>>> > > but zram so it should be in there. If every user want it in future,
>>>>>>>> > > then we could move the function into zsmalloc. That's what we
>>>>>>>> > > concluded in previous discussion.
>>>>>>>>
>>>>>>>> Hmm...
>>>>>>>> Problem is that we can't avoid these unnecessary overhead in this
>>>>>>>> implementation. If we can implement this feature in zram efficiently,
>>>>>>>> it's okay. But, I think that current form isn't.
>>>>>>>
>>>>>>>
>>>>>>> If we can add it in zsmalloc, it would be more clean and efficient
>>>>>>> for zram but as I said, at the moment, I didn't want to put zram's
>>>>>>> requirement into zsmalloc because to me, it's weird to enforce max
>>>>>>> limit to allocator. It's client's role, I think.
>>>>>>>
>>>>>>> If current implementation is expensive and rather hard to follow,
>>>>>>> It would be one reason to move the feature into zsmalloc but
>>>>>>> I don't think it makes critical trobule in zram usecase.
>>>>>>> See below.
>>>>>>>
>>>>>>> But I still open and will wait others's opinion.
>>>>>>> If other guys think zsmalloc is better place, I am willing to move
>>>>>>> it into zsmalloc.
>>>>>>
>>>>>> Moving it into zsmalloc would allow rejecting new zsmallocs before
>>>>>> actually crossing the limit, since it can calculate that internally.
>>>>>> However, with the current patches the limit will only be briefly
>>>>>> crossed, and it should not be crossed by a large amount. Now, if this
>>>>>> is happening repeatedly and quickly during extreme memory pressure,
>>>>>> the constant alloc/free will clearly be worse than a simple internal
>>>>>> calculation and failure. But would it ever happen repeatedly once the
>>>>>> zram limit is reached?
>>>>>>
>>>>>> Now that I'm thinking about the limit from the perspective of the zram
>>>>>> user, I wonder what really will happen. If zram is being used for
>>>>>> swap space, then when swap starts getting errors trying to write
>>>>>> pages, how damaging will that be to the system? I haven't checked
>>>>>> what swap does when it encounters disk errors. Of course, with no
>>>>>> zram limit, continually writing to zram until memory is totally
>>>>>> consumed isn't good either. But in any case, I would hope that swap
>>>>>> would not repeatedly hammer on a disk when it's getting write failures
>>>>>> from it.
>>>>>>
>>>>>> Alternately, if zram was being used as a compressed ram disk for
>>>>>> regular file storage, it's entirely up to the application to handle
>>>>>> write failures, so it may continue to try to write to a full zram
>>>>>> disk.
>>>>>>
>>>>>> As far as what the zsmalloc api would look like with the limit added,
>>>>>> it would need a setter and getter function (adding it as a param to
>>>>>> the create function would be optional i think). But more importantly,
>>>>>> it would need to handle multiple ways of specifying the limit. In our
>>>>>> specific current use cases, zram and zswap, each handles their
>>>>>> internal limit differently - zswap currently uses a % of total ram as
>>>>>> its limit (defaulting to 20), while with these patches zram will use a
>>>>>> specific number of bytes as its limit (defaulting to no limit). If
>>>>>> the limiting mechanism is moved into zsmalloc (and possibly zbud),
>>>>>> then either both users need to use the same units (bytes or %ram), or
>>>>>> zsmalloc/zbud need to be able to set their limit in either units. It
>>>>>> seems to me like keeping the limit in zram/zswap is currently
>>>>>> preferable, at least without both using the same limit units.
>>>>>>
>>>>>
>>>>> zswap knows what 20% (or whatever % it currently uses , and perhaps it too
>>>>> will become a tuning knob) of memory is in bytes.
>>>>>
>>>>> So, if the interface to establish a limit for a pool (or pool set, or whatever
>>>>> zsmalloc sets up for its allocation mechanism) is stipulated in bytes
>>>>> (to actually use pages internally, of visa-versa) , then both can use
>>>>> that interface.
>>>>> zram with its native page stipulation, and zswap with calculated % of memory).
>>>>
>>>> No, unless zswap monitors memory hotplug and updates the limit on each
>>>> hotplug event, 20% of the *current* total ram at zswap initialization
>>>> is not equal to an actual 20% of ram limit. zswap checks its size
>>>> against totalram_pages for each new allocation. I don't think we would
>>>> prefer adding memory hotplug monitoring to zswap just to update the
>>>> zpool size limit.
>>>>
>>>
>>> OK - I see the need to retain the limits where they are in the using
>>> components so that
>>> zsmalloc is not unnecessarily complicated (keeping track of 2 limit methods).
>>>
>>> So, zswap has the same race conditions and possible transient over-allocations?
>>> It looks like I will have to check on how zswap implements it.
>>> But perhaps you can answer the question that is not in the code:
>>> Have there been reported thrashing behaviour around the 20% limit for zswap?
>>
>> zswap does a simple over-allocation check before allocating anything.
>> So during page store, it checks if (total_ram * 0.20) < used. This
>> actually places the effective limit higher than the specified limit,
>> but only by a single allocation. This approach could be taken with
>> zram as well.
>>
>> The amount of over-allocation (past the specified limit) would vary
>> between zsmalloc and zbud. Since zbud increases itself in page
>> increments, any over-allocation past the zswap limit would be by only
>> 1 page. However, zsmalloc is variable in its allocation increments,
>> as it depends on which class needs to be grown; zsmalloc is divided
>> into many "classes", each of contains some number of "zspages" which
>> try to precisely contain some number of N-sized areas; e.g. one class
>> might use zspages that are 2 pages to store 3 separate areas which are
>> each 2/3 of a page number of bytes; if that class needed to be grown,
>> it would add one zspage that is 2 pages. The max number of actual
>> pages per zspage is defined by ZS_MAX_PAGES_PER_ZSPAGE which is
>> currently set to 1<<2, so 4.
>>
>> So with zswap, it will over-allocate memory past its specified limit,
>> up to 1 page (with zbud) or up to 4 pages (with zsmalloc). zram could
>> do the same, simply check if its size > limit before each write, and
>> fail if so; that would remove the alloc/free issue, and would only
>> over-allocate by at most 4 pages (with the current zsmalloc settings).
>> Alternately, zram could check if its (current_size + 4pages > limit),
>> which would then stop it short of the limit by up to 4 pages. Really
>> though, 4 pages either above or under the limit probably doesn't
>> matter.
>
> I agree that it isn't the over allocation of a few pages, per se, but if
> there is potential for concurrent allocation through either zram of zswap
> there is the possibility of high contention thrashing as the processes
> jocky for those final pages. Perhaps not a thundering herd, but
> it could readily become a hot spot as the pressure is on and there is
> no explicit mechanism to throttle back retries and new efforts.
>
> I agree that a precheck before zsmalloc request in zram may be a valuable
> optimization, but only if there is a real possibility of threshold thrashing.
> I suggested before that it might be best to wait and see....
> Do you have any suggestions on a stress test approach to empirically assess
> the risk?
Minchan is the expert here so he may have suggestions and/or existing
tests, but I think just creating a zram device with a limit and any
filesystem and mounting it, then filling up the device, would be a
good basic test. Maybe once it gets close to the limit, try writing
to it with multiple threads concurrently. Also, while near/at the
limit, try concurrently writing to it and deleting from it - probably
writing many small files as well as deleting many small files would be
best to keep it right at its limit. However, I think
zram_bvec_write() is already protected inside the init_lock semphore
(via zram_make_request), so it seems like there will never be any
concurrency issue.
Since the threads writing to files at the limit will be getting
(expected) errors during write, the test is really looking at is
whether there is any significant performance difference between a
alloc/free-on-full approach or a pre-check/fail-on-full approach.
That might not become apparent until there is significant memory
pressure, so you might set the zram limit high enough, or you could
otherwise artificially use up system memory (e.g. stress -m).
Also, as a general test of the new zram limit, it would be interesting
to see how various filesystem drivers handle unexpected write
failures...I'm no fs expert, but I would expect that this is different
than a typical ENOSPC condition, which I assume comes from the
filesystem, not the block layer. This error would be more like a
hardware error.
I'd also be interested in what happens when a zram-backed swap device
reaches its limit...I would hope swap device write errors would lead
to a normal OOM condition and invoke the oom-killer, but from
page_io.c end_swap_bio_write:
* We failed to write the page out to swap-space.
* Re-dirty the page in order to avoid it being reclaimed.
* Also print a dire warning that things will go BAD (tm)
* very quickly.
>
> Thanks again.
>
>
>>
>>>
>>> thanks.
>>>
>>>>>
>>>>> Both would need a mechanism to change the max as need change,
>>>>> so the API has to handle this.
>>>>>
>>>>>
>>>>> Or am I way off base?
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> > >
>>>>>>>> > > Another idea is we could call zs_get_total_pages right before zs_malloc
>>>>>>>> > > but the problem is we cannot know how many of pages are allocated
>>>>>>>> > > by zsmalloc in advance.
>>>>>>>> > > IOW, zram should be blind on zsmalloc's internal.
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> > We did however suggest that we could check before hand to see if
>>>>>>>> > max was already exceeded as an optimization.
>>>>>>>> > (possibly with a guess on usage but at least using the minimum of 1 page)
>>>>>>>> > In the contested case, the max may already be exceeded transiently and
>>>>>>>> > therefore we know this one _could_ fail (it could also pass, but odds
>>>>>>>> > aren't good).
>>>>>>>> > As Minchan mentions this was discussed before - but not into great detail.
>>>>>>>> > Testing should be done to determine possible benefit. And as he also
>>>>>>>> > mentions, the better place for it may be in zsmalloc, but that
>>>>>>>> > requires an ABI change.
>>>>>>>>
>>>>>>>> Why we hesitate to change zsmalloc API? It is in-kernel API and there
>>>>>>>> are just two users now, zswap and zram. We can change it easily.
>>>>>>>> I think that we just need following simple API change in zsmalloc.c.
>>>>>>>>
>>>>>>>> zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op)
>>>>>>>> =>
>>>>>>>> zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops
>>>>>>>> *zpool_op)
>>>>>>>>
>>>>>>>> It's pool allocator so there is no obstacle for us to limit maximum
>>>>>>>> memory usage in zsmalloc. It's a natural idea to limit memory usage
>>>>>>>> for pool allocator.
>>>>>>>>
>>>>>>>> > Certainly a detailed suggestion could happen on this thread and I'm
>>>>>>>> > also interested
>>>>>>>> > in your thoughts, but this patchset should be able to go in as is.
>>>>>>>> > Memory exhaustion avoidance probably trumps the possible thrashing at
>>>>>>>> > threshold.
>>>>>>>> >
>>>>>>>> > > About alloc/free cost once if it is over the limit,
>>>>>>>> > > I don't think it's important to consider.
>>>>>>>> > > Do you have any scenario in your mind to consider alloc/free cost
>>>>>>>> > > when the limit is over?
>>>>>>>> > >
>>>>>>>> > >> 2) Even if this request doesn't do new allocation, it could be failed
>>>>>>>> > >> due to other's allocation. There is time gap between allocation and
>>>>>>>> > >> free, so legimate user who want to use preallocated zsmalloc memory
>>>>>>>> > >> could also see this condition true and then he will be failed.
>>>>>>>> > >
>>>>>>>> > > Yeb, we already discussed that. :)
>>>>>>>> > > Such false positive shouldn't be a severe problem if we can keep a
>>>>>>>> > > promise that zram user cannot exceed mem_limit.
>>>>>>>> > >
>>>>>>>>
>>>>>>>> If we can keep such a promise, why we need to limit memory usage?
>>>>>>>> I guess that this limit feature is useful for user who can't keep such promise.
>>>>>>>> So, we should assume that this false positive happens frequently.
>>>>>>>
>>>>>>>
>>>>>>> The goal is to limit memory usage within some threshold.
>>>>>>> so false positive shouldn't be harmful unless it exceeds the threshold.
>>>>>>> In addition, If such false positive happens frequently, it means
>>>>>>> zram is very trobule so that user would see lots of write fail
>>>>>>> message, sometime really slow system if zram is used for swap.
>>>>>>> If we protect just one write from the race, how much does it help
>>>>>>> this situation? I don't think it's critical problem.
>>>>>>>
>>>>>>>>
>>>>>>>> > And we cannot avoid the race, nor can we avoid in a low overhead competitive
>>>>>>>> > concurrent process transient inconsistent states.
>>>>>>>> > Different views for different observers.
>>>>>>>> > They are a consequence of the theory of "Special Computational Relativity".
>>>>>>>> > I am working on a String Unification Theory of Quantum and General CR in LISP.
>>>>>>>> > ;-)
>>>>>>>>
>>>>>>>> If we move limit logic to zsmalloc, we can avoid the race by commiting
>>>>>>>> needed memory size before actual allocation attempt. This commiting makes
>>>>>>>> concurrent process serialized so there is no race here. There is
>>>>>>>> possibilty to fail to allocate, but I think this is better than alloc
>>>>>>>> and free blindlessly depending on inconsistent states.
>>>>>>>
>>>>>>> Normally, zsmalloc/zsfree allocates object from existing pool so
>>>>>>> it's not big overhead and if someone continue to try writing once limit is
>>>>>>> full, another overhead (vfs, fs, block) would be bigger than zsmalloc
>>>>>>> so it's not a problem, I think.
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>>>>> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
>>>>>>>> see: http://www.linux-mm.org/ .
>>>>>>>> Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
>>>>>>>
>>>>>>> --
>>>>>>> Kind regards,
>>>>>>> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/