Re: [PATCH 18/43] workqueue: reimplement workqueue flushing using color coded works

From: Tejun Heo
Date: Mon Mar 01 2010 - 12:24:07 EST


Hello, Oleg.

On 03/01/2010 05:31 AM, Oleg Nesterov wrote:
> On 02/26, Tejun Heo wrote:
>>
>> +	/*
>> +	 * The last color is no color used for works which don't
>> +	 * participate in workqueue flushing.
>> +	 */
>> +	WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS) - 1,
>> +	WORK_NO_COLOR		= WORK_NR_COLORS,
>
> Not sure this makes sense, but instead of a special NO_COLOR reserved
> for barriers we could change insert_wq_barrier() to use cwq == NULL
> to mark this work as no-color. Note that the barriers don't need a
> valid get_wq_data() or even the _PENDING bit set (apart from the
> BUG_ON() check in run_workqueue).

Later, when all the worklists on a CPU are unified into a single
per-cpu worklist, work->data is the only way to tell which cwq and wq
a work belongs to, and process_one_work() depends on it. Looking at
how it's used, I suspect process_one_work() could survive without
knowing the cwq and wq of a work, but it's gonna require a number of
condition checks throughout the queueing, execution and debugfs code
(or we could set current_cwq to a bogus barrier wq).
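
For reference, the dependency looks roughly like this (a simplified
sketch; the helper and mask names follow the current code):

	/*
	 * work->data packs the owning cwq pointer together with the
	 * flag and color bits, so both the cwq and, through it, the
	 * wq can be recovered from the work itself.
	 */
	static inline struct cpu_workqueue_struct *
	get_wq_data(struct work_struct *work)
	{
		return (void *)(atomic_long_read(&work->data) &
				WORK_STRUCT_WQ_DATA_MASK);
	}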

Yeah, it's definitely something to think about, but I think 15 flush
colors should be more than enough, especially given that each color
will serve more flushers as pressure builds up, so I'm a bit
reluctant to break the invariant just to gain one more color at this
point.
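
To spell the arithmetic out (sketch, matching the enum quoted above
with WORK_STRUCT_COLOR_BITS == 4 as in the patch):

	enum {
		WORK_STRUCT_COLOR_BITS	= 4,

		/* 2^4 == 16 values: 15 usable flush colors plus the
		 * last value reserved as the no-color marker */
		WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS) - 1,
		WORK_NO_COLOR		= WORK_NR_COLORS,
	};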

>> +static int work_flags_to_color(unsigned int flags)
>> +{
>> +	return (flags >> WORK_STRUCT_COLOR_SHIFT) &
>> +		((1 << WORK_STRUCT_COLOR_BITS) - 1);
>> +}
>
> perhaps work_to_color(work) makes more sense; every caller needs to
> read work->data anyway.

I think it better matches work_color_to_flags(), but yeah, doing
*work_data_bits() in every caller is ugly. I'll change it to
work_color().
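
IOW, something like the following (sketch):

	static int work_color(struct work_struct *work)
	{
		return (*work_data_bits(work) >> WORK_STRUCT_COLOR_SHIFT) &
			((1 << WORK_STRUCT_COLOR_BITS) - 1);
	}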

>> +static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
>> +{
>> +	/* ignore uncolored works */
>> +	if (color == WORK_NO_COLOR)
>> +		return;
>> +
>> +	cwq->nr_in_flight[color]--;
>> +
>> +	/* is flush in progress and are we at the flushing tip? */
>> +	if (likely(cwq->flush_color != color))
>> +		return;
>> +
>> +	/* are there still in-flight works? */
>> +	if (cwq->nr_in_flight[color])
>> +		return;
>> +
>> +	/* this cwq is done, clear flush_color */
>> +	cwq->flush_color = -1;
>
> Well, I am not smart enough to understand this patch quickly ;) I
> will try to read it more tomorrow, but...

Sadly, this is one complex piece of code. I tried to make it simpler
but couldn't come up with anything significantly better, and it takes
me some time to get my head around it even though I wrote it and have
spent quite some time with it. If you can suggest something simpler,
that would be great.

> Very naive question: why do we need cwq->nr_in_flight[] at all? Can't
> cwq_dec_nr_in_flight(color) do
>
> 	if (color == WORK_NO_COLOR)
> 		return;
>
> 	if (cwq->flush_color == -1)
> 		return;
>
> 	BUG_ON(cwq->flush_color != color);
> 	if (next_work && work_to_color(next_work) == color)
> 		return;
>
> 	cwq->flush_color = -1;
> 	if (atomic_dec_and_test(nr_cwqs_to_flush))
> 		complete(first_flusher);
>
> ?

Issues I can think of off the top of my head are...

1. Works are removed from the worklist before the callback is called
   and are inaccessible once the callback starts running, so completion
   order can't be tracked with the work structures themselves.

2. Even if it could be tracked that way, with the later unified
   worklist it would require walking the worklist looking for works
   with a matching cwq.

3. Transfer of works to the work->scheduled list and the linked works;
   this ends up fragmenting the worklist before execution starts.
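
That's why each cwq carries per-color counters instead: queueing
accounts the work against the current work color and
cwq_dec_nr_in_flight() drops the count on completion, so flushing
never has to inspect the worklist at all. Roughly, on the queueing
side (sketch):

	/* in __queue_work(), under cwq->lock: account the work
	 * against its color before putting it on the worklist */
	cwq->nr_in_flight[cwq->work_color]++;
	insert_work(cwq, work, &cwq->worklist,
		    work_color_to_flags(cwq->work_color));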

> And I can't understand the ->nr_cwqs_to_flush logic.
>
> Say, we have 2 CPUs and both have a single pending work with color == 0.
>
> flush_workqueue()->flush_workqueue_prep_cwqs() does for_each_possible_cpu()
> and each iteration does atomic_inc(nr_cwqs_to_flush).
>
> What if the first CPU flushes its work between these 2 iterations?
>
> Afaics, in this case this first CPU will complete(first_flusher->done),
> then the flusher will increment nr_cwqs_to_flush again during the
> second iteration.
>
> Then the flusher drops ->flush_mutex and wait_for_completion() succeeds
> before the second CPU flushes its work.
>
> No?

Yes, thanks for spotting it. It should start with 1 and do
dec_and_test after the iterations are complete. Will fix.
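
IOW, something like the following, so the completion can't fire until
every cwq has been accounted for (sketch; get_cwq() is the per-cpu
cwq lookup from the series, locking omitted):

	/* hold an extra reference across the iteration */
	atomic_set(&wq->nr_cwqs_to_flush, 1);

	for_each_possible_cpu(cpu) {
		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);

		if (cwq->nr_in_flight[flush_color]) {
			cwq->flush_color = flush_color;
			atomic_inc(&wq->nr_cwqs_to_flush);
		}
	}

	/* drop the extra reference; this completes only if all
	 * cwqs have already finished flushing */
	if (atomic_dec_and_test(&wq->nr_cwqs_to_flush))
		complete(&wq->first_flusher->done);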

>> void flush_workqueue(struct workqueue_struct *wq)
>> {
>> ...
>> +	 * First flushers are responsible for cascading flushes and
>> +	 * handling overflow.  Non-first flushers can simply return.
>> +	 */
>> +	if (wq->first_flusher != &this_flusher)
>> +		return;
>> +
>> +	mutex_lock(&wq->flush_mutex);
>> +
>> +	wq->first_flusher = NULL;
>> +
>> +	BUG_ON(!list_empty(&this_flusher.list));
>> +	BUG_ON(wq->flush_color != this_flusher.flush_color);
>> +
>> +	while (true) {
>> +		struct wq_flusher *next, *tmp;
>> +
>> +		/* complete all the flushers sharing the current flush color */
>> +		list_for_each_entry_safe(next, tmp, &wq->flusher_queue, list) {
>> +			if (next->flush_color != wq->flush_color)
>> +				break;
>> +			list_del_init(&next->list);
>> +			complete(&next->done);
>
> Is it really possible that "next" can ever have the same flush_color?

Yeap, when all the colors are consumed, flushers wait on the overflow
queue. On the next flush completion, all flushers on the overflow
list get the same color, so that flush colors gain throughput as
pressure increases instead of building up a long queue of flushers.
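
So when a color frees up, the cascading code merges the whole
overflow list onto it, roughly (sketch; work_next_color() is the
color-increment helper from the patch):

	if (!list_empty(&wq->flusher_overflow)) {
		struct wq_flusher *tmp;

		/* assign the newly freed work color to every
		 * overflowed flusher so they complete together */
		list_for_each_entry(tmp, &wq->flusher_overflow, list)
			tmp->flush_color = wq->work_color;

		wq->work_color = work_next_color(wq->work_color);
		list_splice_tail_init(&wq->flusher_overflow,
				      &wq->flusher_queue);
	}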

Thanks.

--
tejun