Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers

From: Indu Bhagat
Date: Thu Jan 30 2025 - 16:22:53 EST

Next message: Song Liu: "Re: [PATCH v11 bpf-next 5/7] bpf: Use btf_kfunc_id_set.remap logic for bpf_dynptr_from_skb"
Previous message: Babu Moger: "[PATCH v3 7/7] x86/resctrl: Introduce interface to modify io_alloc Capacity Bit Masks"
In reply to: Indu Bhagat: "Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers"
Next in thread: Indu Bhagat: "Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 1/27/25 5:10 PM, Andrii Nakryiko wrote:

+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+struct sframe_fde {
+ s32 start_addr;
+ u32 func_size;
+ u32 fres_off;
+ u32 fres_num;
+ u8 info;
+ u8 rep_size;
+ u16 padding;
+} __packed;

I couldn't understand from SFrame itself, but why do sframe_header,
sframe_preamble, and sframe_fde have to be marked __packed, if it's
all naturally aligned (intentionally and by design)?..

Right, but the spec says they're all packed. Maybe the point is that
some future sframe version is free to introduce unaligned fields.

SFrame specification aims to keep SFrame header and SFrame FDE members
at aligned boundaries in future versions.

Only SFrame FRE related accesses may have unaligned accesses.

Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative
proposal, maybe we can discuss this for SFrame v3? Let me briefly
outline the idea.

I looked at the idea below. It could work wrt unaligned accesses.

Speaking of unaligned accesses, I will ask away: Is the reason to avoid unaligned accesses performance hit or are there other practical reasons to it ?

So, currently in v2, FREs within FDEs use an array-of-structs layout.
If we use preudo-C type definitions, it would be something like this
for FDE + its FREs:

struct FDE_and_FREs {
struct sframe_func_desc_entry fde_metadata;

union FRE {
struct FRE8 {
u8 sfre_start_address;
u8 sfre_info;
u8|u16|u32 offsets[M];
}
struct FRE16 {
u16 sfre_start_address;
u16 sfre_info;
u8|u16|u32 offsets[M];
}
struct FRE32 {
u32 sfre_start_address;
u32 sfre_info;
u8|u16|u32 offsets[M];
}
} fres[N] __packed;
};

where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
addresses have the same size, but each FRE has potentially different
offsets sizing, so there is no common alignment, and so everything has
to be packed and unaligned.

But what if we take a struct-of-arrays approach and represent it more like:

struct FDE_and_FREs {
struct sframe_func_desc_entry fde_metadata;
u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
u8 sfre_infos[N];
u8 offsets8[M8];
u16 offsets16[M16] __aligned(2);
u32 offsets32[M32] __aligned(4);
/* we can naturally extend to support also u64 offsets */
};

i.e., we split all FRE records into their three constituents: start
addresses, info bytes, and then each FRE can fall into either 8-, 16-,
or 32-bit offsets "bucket". We collect all the offsets, depending on
their size, into these aligned offsets{8,16,32} arrays (with natural
extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
ensure proper alignment everywhere.

Note, at this point we need to decide if we want to make FREs binary
searchable or not.

If not, we don't really need anything extra. As we process each
start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
of how many 8-, 16-, and 32-bit offsets already processed FREs
consumed, and when we find the right one, we know exactly the starting
index within offset{8,16,32}. Done.

But if we were to make FREs binary searchable, we need to basically
have an index of offset pointers to quickly find offsetsX[j] position
corresponding to FRE #i. For that, we can have an extra array right
next to start_addrs, "semantically parallel" to it:

u8|u16|u32 start_addrs[N];
u8|u16|u32 offset_idxs[N];

where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i]
points to the first offset corresponding to FRE #i in offsetX[] array
(depending on FRE's "bitness"). This is a bit more storage for this
offset index, but for FDEs with lots of FREs this might be a
worthwhile tradeoff.

Few points:
a) we can decide this "binary searchability" per-FDE, and for FDEs
with 1-2-3 FREs not bother, while those with more FREs would be
searchable ones with index. So we can combine both fast lookups,
natural alignment of on-disk format, and compactness. The presence of
index is just another bit in FDE metadata.

I have been going back and forth on this one. So there seem to be the following options here:
#1. Make "binary searchability" a per-FDE decision.
#2. Make "binary searchability" a per-section decision (I expect aarch64 to have very low number of FREs per FDE).
#3. Bake "binary searchability" into the SFrame FRE specification. So its always ON for all FDEs. The advantage is that it makes stack tracers simpler to implement with less code.

I do think #2, #3 appear simpler in concept.

b) bitness of offset_idxs[] can be coupled with bitness of
start_addrs (for simplicity), or could be completely independent and
identified by FDE's metadata (2 more bits to define this just like
start_addr bitness is defined). Independent probably would be my
preference, with linker (or whoever will be producing .sframe data)
can pick the smallest bitness that is sufficient to represent
everything.

ATM, GAS does apply special logic to decide the bitness of start_addrs per function, and ld just uses that info. Coupling the bitness of offset_idx with bitness of start_addrs will be easy (or _easier_ I think), but for now, I leave it as "should be doable" :)

Yes, it's a bit more complicated to draw and explain, but everything
will be nicely aligned, extensible to 64 bits, and (optionally at
least) binary searchable. Implementation-wise on the kernel side it
shouldn't be significantly more involved. Maybe the compiler would
need to be a bit smarter when producing FDE data, but it's no rocket
science.

Thoughts?

Combining the requirements from your email and Josh's follow up:
- No unaligned accesses
- Sorted FREs

I would put compaction as a "good to have" requirement. It appears to me that any compaction will mean a sort of post-processing which will interfere with JIT usecase.

Next message: Song Liu: "Re: [PATCH v11 bpf-next 5/7] bpf: Use btf_kfunc_id_set.remap logic for bpf_dynptr_from_skb"
Previous message: Babu Moger: "[PATCH v3 7/7] x86/resctrl: Introduce interface to modify io_alloc Capacity Bit Masks"
In reply to: Indu Bhagat: "Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers"
Next in thread: Indu Bhagat: "Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]