Re: [PATCH 5/8] pstore: Fix long-term implicit conversions in the compression routines

From: Ard Biesheuvel
Date: Sat Oct 08 2022 - 13:53:11 EST


On Sat, 8 Oct 2022 at 18:04, Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx> wrote:
>
> On 08/10/2022 12:53, Ard Biesheuvel wrote:
> > [...]
> > So one thing I don't understand about these changes is why we need
> > them in the first place.
> >
> > The zbufsize routines are all worst case routines, which means each
> > one of those will return a value that exceeds the size parameter.
> >
> > We only use compression for dmesg, which compresses quite well. If it
> > doesn't compress well, there is probably something wrong with the
> > input data, so preserving it may not be as critical. And if
> > compressing the data makes it bigger, can't we just omit the
> > compression for that particular record?
> >
> > In summary, while adding zbufsize to the crypto API seems a reasonable
> > thing to do, I don't see why we'd want to make use of it in pstore -
> > can't we just use the decompressed size as the worst case compressed
> > size for all algorithms, and skip the compression if it doesn't fit?
> >
> > Or am I missing something here?
>
> In a way (and if I understand you correctly - please let me know if not)
> you are making lot of sense: why not just use the maximum size (which is
> the full decompressed size + header) as the worst case in pstore and
> skip these highly specialized routines that calculate the worst case for
> each algorithm, right?
>

Yes

> This is exactly what 842 (sw compress) is doing now. If that's
> interesting and Kees agrees, and if nobody else plans on doing that, I
> could work on it.
>
> Extra question (maybe silly on my side?): is it possible that
> _compressed_ data is bigger than the original one? Isn't there any
> "protection" on the compress APIs for that? In that case, it'd purely
> waste of time / CPU cycles heheh
>

No, this is the whole point of those helper routines, as far as I can
tell. Basically, if you put data that cannot be compressed losslessly
(e.g., a H264 video) through a lossless compression routine, the
resulting data will be bigger due to the overhead of the compression
metadata.

However, we are compressing ASCII text here, so using the uncompressed
size as an upper bound for the compressed size is reasonable for any
compression algorithm. And if dmesg output is not compressible, there
must be something seriously wrong with it.

So we could either just drop such input, or simply not bother
compressing it if doing so would only take up more space. Given the
low likelihood that we will ever hit this case, I'd say we just ignore
those.

Again, please correct me if I am missing something here (Kees?). Are
there cases where we compress data that may be compressed already?