Re: ATA 4 KiB sector issues.
From: Greg Freemyer
Date: Mon Mar 08 2010 - 00:38:17 EST
cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working on the
issue for the last year.
On Sun, Mar 7, 2010 at 10:48 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, guys.
>
> It looks like the transition to ATA 4 KiB drives will be quite painful
> and we aren't really ready, although these drives are already selling
> widely. I've written up a summary document on the issue to clarify
> things, as it's getting more and more confusing, and to develop some
> consensus. It's also on the linux ata wiki.
>
> http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> I've cc'd people whom I can think of off the top of my head but I
> surely have missed some people who would have been interested. Please
> feel free to add cc's or forward the message to other MLs.
> Especially, I don't know much about partitioners so the details there
> are pretty shallow and could be plain wrong. It would be great if
> someone who knows more about this stuff can chime in.
>
> Thanks.
>
> === Document follows ===
>
> ATA 4 KiB sector issues
>
> Background
> ==========
>
> Up until recently, all ATA hard drives have been organized in 512 byte
> sectors. For example, my 500 GB or 466 GiB hard drive is organized into
> 976773168 512 byte sectors numbered from 0 to 976773167. This is how
> a drive communicates with the driver. When the operating system wants
> to read 32 KiB of data at 1 MiB position, the driver asks the drive to
> read 64 sectors from LBA (Logical block address, sector number) 2048.
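
The byte-offset to LBA arithmetic above can be sketched in a few lines of
Python. The function name is made up for illustration and a fixed 512 byte
sector is assumed; this is not how any particular driver does it.

```python
def byte_range_to_lba(offset, length, sector_size=512):
    """Return (first LBA, sector count) covering `length` bytes at `offset`."""
    first = offset // sector_size
    # Round the end up so a partial trailing sector is still included.
    end = (offset + length + sector_size - 1) // sector_size
    return first, end - first

KIB, MIB = 1024, 1024 * 1024
print(byte_range_to_lba(1 * MIB, 32 * KIB))  # (2048, 64)
```
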
>
> Because each sector should be addressable, readable and writable
> individually, the physical medium also is organized in the same sized
> sectors. In addition to the area to store the actual data, each
> sector requires extra space for bookkeeping - inter-sector space to
> enable locating and addressing each sector and ECC data to detect and
> correct inevitable raw data errors.
>
> As the densities and capacities of hard drives keep growing, stronger
> ECC becomes necessary to guarantee an acceptable level of data
> integrity, increasing the space overhead. In addition, in most applications,
> hard drives are now accessed in units of at least 8 sectors or 4096
> bytes and maintaining 512 byte granularity has become somewhat
> meaningless.
>
> This reached a point where enlarging the sector size to 4096 bytes
> would yield measurably more usable space given the same raw data
> storage size and hard drive manufacturers are transitioning to 4 KiB
> sectors.
>
> Anandtech has a good article which illustrates the background and
> issues with pretty diagrams[1].
>
>
> Physical vs. Logical
> ====================
>
> Because the 512 byte sector size has been around for a very long time
> and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
> sector size assumption is scattered across all the layers -
> controllers or bridge chips snooping commands, BIOSs, boot code,
> drivers, partitioners and system utilities - which makes it very
> difficult to change the sector size from 512 bytes without massively
> breaking backward compatibility.
>
> As a workaround, the concept of logical sector size was introduced.
> The physical medium is organized in 4 KiB sectors but the firmware on
> the drive will present it as if the drive is composed of 512 byte
> sectors thus making the drive behave as before, so if the driver asks
> the hard drive to read 64 sectors from LBA 2048, the firmware will
> translate it and read 8 4 KiB sectors from hardware sector 256. As a
> result, the hard drive now has two sector sizes - the physical one
> which the physical media is actually organized in, and the logical one
> which the firmware presents to the outside world.
>
> A straightforward example mapping between physical sector and LBA
> would be
>
> LBA = 8 * phys_sect
>
>
> Alignment problem on 4 KiB physical / 512 logical drives
> =======================================================
>
> This workaround keeps older hardware and software working while
> allowing the drive to use larger sector size internally. However, the
> discrepancy between physical and logical sector sizes creates an
> alignment issue. For example, if the driver wants to read 8 sectors
> from LBA 2047, the firmware has to read hardware sectors 255 and 256
> and trim the leading 7*512 bytes and the trailing 512 bytes.
>
> For reads, this isn't an issue as drives read in larger chunks anyway
> but for writes, the drive has to do read-modify-write to achieve the
> requested action. It has to first read hardware sectors 255 and 256,
> update the requested parts and then write back those sectors, which
> can cause significant performance degradation[2].
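
To make the misalignment concrete, here is a rough Python sketch that maps a
logical request onto physical sectors, assuming the simple LBA = 8 * phys_sect
mapping. The helper names are mine, and the sample request is an 8 sector
(4 KiB) access starting at LBA 2047.

```python
LOGICAL = 512                    # logical sector size in bytes
PHYSICAL = 4096                  # physical sector size in bytes
PER_PHYS = PHYSICAL // LOGICAL   # 8 logical sectors per physical sector

def physical_span(lba, count):
    """Map a logical request onto physical sectors (LBA = 8 * phys_sect).

    Returns (first_phys, last_phys, lead_bytes, tail_bytes), where the
    lead/tail bytes are the untouched parts of the first and last
    physical sectors.
    """
    first = lba // PER_PHYS
    last = (lba + count - 1) // PER_PHYS
    lead = (lba - first * PER_PHYS) * LOGICAL
    tail = ((last + 1) * PER_PHYS - (lba + count)) * LOGICAL
    return first, last, lead, tail

def needs_rmw(lba, count):
    """A write needs read-modify-write if it covers any sector partially."""
    _, _, lead, tail = physical_span(lba, count)
    return lead != 0 or tail != 0

print(physical_span(2047, 8))  # (255, 256, 3584, 512)
print(needs_rmw(2047, 8))      # True
print(needs_rmw(2048, 64))     # False
```
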
>
> The problem is aggravated by the way DOS partitions[3] have been laid
> out traditionally. For reasons dating back more than two decades,
> they are laid out according to something called disk geometry, which
> nowadays consists of arbitrary values with a number of restrictions
> accumulated over the years for backward compatibility. The end result
> is that until recently (most Linux variants and up to Windows XP) the
> first partition starts at sector 63 and later ones on cylinder
> boundaries, where each cylinder usually is composed of 255 * 63
> sectors.
>
> Most modern filesystems generate accesses that are 4 KiB aligned
> relative to the start of the partition they are in. If a drive maps
> 4 KiB physical sectors to 512 byte logical sectors from LBA 0, the
> filesystem in the first partition
> will always be misaligned and filesystems in later partitions are
> likely to be misaligned too.
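
A quick Python check of the traditional layout against 4 KiB boundaries,
assuming the usual 255 * 63 geometry and a drive that maps physical sectors
from LBA 0. The names are illustrative only.

```python
SECTORS_PER_CYLINDER = 255 * 63   # 16065 logical sectors
PER_PHYS = 8                      # 512 byte sectors per 4 KiB physical sector

def is_aligned(lba):
    """True if the LBA falls on a 4 KiB physical sector boundary."""
    return lba % PER_PHYS == 0

print(is_aligned(63))  # False

# Cylinder boundaries which happen to be 4 KiB aligned, among the first 16.
aligned = [n for n in range(1, 17) if is_aligned(n * SECTORS_PER_CYLINDER)]
print(aligned)  # [8, 16]
```

Since 16065 is odd, only every eighth cylinder boundary lands on a physical
sector boundary.
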
>
>
> Solving the alignment problem on 4 KiB physical / 512 logical drives
> ====================================================================
>
> There are multiple approaches to solving the problem.
>
> S-1. Yet another workaround from the firmware - offset-by-one.
>
> Yet another workaround which can be done by the firmware is to
> offset physical to logical mapping by one logical sector such that
> LBA 63 ends up on physical sector boundary, which aligns the first
> partition to physical sectors without requiring any software update.
> The example mapping between phys_sector and LBA becomes
>
> LBA = 8 * phys_sect - 1
>
> The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
> right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to
> LBA 63, making LBA 63 aligned on a hardware sector boundary.
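
The offset-by-one mapping can be inverted with a one-line helper. This is only
a sketch of the example mapping LBA = 8 * phys_sect - 1; the function name is
hypothetical.

```python
PER_PHYS = 8  # 512 byte logical sectors per 4 KiB physical sector

def lba_to_phys(lba):
    """Return (phys_sect, slot) for an LBA under LBA = 8 * phys_sect - 1.

    slot is the 512 byte position inside the physical sector; slot 0 means
    the LBA starts exactly on a physical sector boundary.
    """
    return divmod(lba + 1, PER_PHYS)

print(lba_to_phys(0))     # (0, 1)   slot 0 of phys_sect 0 is unused
print(lba_to_phys(7))     # (1, 0)
print(lba_to_phys(63))    # (8, 0)   aligned
print(lba_to_phys(2048))  # (256, 1) a 1 MiB boundary is no longer aligned
```
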
>
> Although this aligns only the first partition, for many use cases,
> especially the ones involving older software, this workaround was
> deemed useful and some recent drives with 4 KiB physical sectors are
> equipped with a dip switch to turn on or off offset-by-one mapping.
>
> S-2. The proper solution.
>
> Correct alignments for all partitions can't be achieved by the
> firmware alone. The system utilities should be informed about the
> alignment requirements and align partitions accordingly.
>
> The above firmware workaround complicates the situation because the
> two different configurations require different offsets to achieve
> the correct alignments. ATA/ATAPI-8 specifies a way for a drive to
> export the physical and logical sector sizes and the LBA offset
> which is aligned to the physical sectors.
>
> In Linux, these parameters are exported via the following sysfs
> nodes.
>
> physical sector size : /sys/block/sdX/queue/physical_block_size
> logical sector size : /sys/block/sdX/queue/logical_block_size
> alignment offset : /sys/block/sdX/alignment_offset
>
> Let the physical sector size be PSS, logical sector size LSS and
> alignment offset AOFF. The system software should place partitions
> such that the starting LBAs of all partitions are aligned on
>
> (n * PSS + AOFF) / LSS
>
> For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
> and AOFF 3584 and with n of 7 the above becomes,
>
> (7 * 4096 + 3584) / 512 == 63
>
> making sector 63 an aligned LBA where the first partition can be
> put, but without the offset-by-one mapping, AOFF is zero and LBA 63
> is not aligned.
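
The alignment rule above is easy to check numerically. A minimal Python
sketch, with PSS, LSS and AOFF passed in as plain integers (in practice they
would come from the sysfs nodes listed above); both function names are made up
for this example.

```python
def aligned_lba(n, pss, lss, aoff):
    """Return the n-th aligned LBA, i.e. (n * PSS + AOFF) / LSS."""
    lba, rem = divmod(n * pss + aoff, lss)
    if rem:
        raise ValueError("byte position is not a multiple of the logical sector size")
    return lba

def is_aligned(lba, pss, lss, aoff):
    """True if the LBA starts on a physical sector boundary."""
    return (lba * lss - aoff) % pss == 0

# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584.
print(aligned_lba(7, 4096, 512, 3584))  # 63
print(is_aligned(63, 4096, 512, 3584))  # True
# Same drive without the offset-by-one mapping: AOFF 0.
print(is_aligned(63, 4096, 512, 0))     # False
```
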
>
> With the above new alignment requirement in place, it becomes
> difficult to honor the legacy one - first partition on sector 63 and
> all other partitions on cylinder boundary (255 * 63 sectors) - as
> the two alignment requirements contradict each other. This might be
> worked around by adjusting how LBA and CHS addresses are mapped but
> the disk geometry parameters are hard coded everywhere and there is
> no reliable way to communicate custom geometry parameters.
>
>
> Complications
> =============
>
> Unfortunately, there are complications.
>
> C-1. The standard is not and won't be followed as-is.
>
> Some of the existing BIOSs and/or drivers can't cope with drives
> which report 4 KiB physical sector size. To work around this, some
> drive models falsely report a physical sector size of 512 bytes when
> the actual configuration is 4 KiB without offsetting.
>
> This nullifies the provisions for alignment in the ATA standard but
> results in the correct alignment for Windows Vista and 7. OS
> behaviors will be described further later.
>
> For these drives, which are likely to continue to be shipped for the
> foreseeable future, traditional LBA 63 and cylinder based aligning
> results in misalignment.
>
> C-2. Windows XP depends on the traditional partition layout.
>
> Windows XP makes use of the CHS start/end addresses in the partition
> table and gets confused if partitions are not laid out
> traditionally. This means that XP can't be installed into a
> partition prepared by later versions of Windows[4]. This isn't a
> big problem for Windows because in most cases the later version is
> replacing the older one, not the other way around.
>
> Unfortunately, the situation is more complex for Linux because Linux
> is often co-installed with various versions of Windows and XP is
> still quite popular. This means that when a Linux partitioner is
> used to prepare a partition which may be used by Windows, the
> partitioner might have to consider which version of Windows is going
> to be used and whether to align the partitions for the correct
> alignment or compatibility with older versions of Windows.
>
> C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.
>
> The DOS partition format uses 32 bits each for the starting LBA and
> the number of sectors and, reportedly, 32 bit Windows XP shares the
> limitation. With 32 bit addressing and 512 byte logical sector
> size, the maximum addressable sector + 1 is at
>
> 2^32 * 2^9 == 2^41 == 2 TiB
>
> The DOS partition format allows a partition to reach beyond 2 TiB as
> long as the starting LBA is under 2 TiB; however, both Windows XP
> and the Linux kernel (at least up to v2.6.33) refuse such
> partition configurations.
>
> With the right combination of host controller, BIOS and driver, this
> barrier can be overcome by enlarging the logical sector size to 4
> KiB, which will push the barrier out to 16 TiB. On the right
> configuration, Windows XP is reportedly able to address beyond the 2
> TiB barrier with a DOS partition and 4 KiB logical sector size.
> The Linux kernel up to v2.6.33 doesn't work under such configurations but
> a patch to make it work is pending[5].
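
The two barriers follow directly from the 32 bit sector fields. A tiny Python
sketch of the arithmetic (the function name is illustrative):

```python
TIB = 1 << 40

def dos_limit_bytes(logical_sector_size, lba_bits=32):
    """Bytes addressable with lba_bits of sector address."""
    return (1 << lba_bits) * logical_sector_size

print(dos_limit_bytes(512) // TIB)   # 2
print(dos_limit_bytes(4096) // TIB)  # 16
```
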
>
> This might also be beneficial for operating systems which don't
> suffer from this limitation. A different partition format - GPT[6]
> - should be used beyond 2^32 sectors, which could harm compatibility
> with older BIOSs or other operating systems which don't recognize
> the new format.
>
> As mentioned previously, the 512 byte sector assumption has been there
> for a very long time and changing it is likely to cause various
> compatibility problems at many different layers from hardware up to
> the system utilities.
>
>
> Windows
> =======
>
> As hard drive vendors aim for performance and compatibility in modern
> Windows environments, it is worthwhile to investigate how Windows
> partitions drives with different alignment requirements. Up to and
> including Windows XP, it followed the traditional layout - the first
> partition on LBA 63 and the others on cylinder boundaries where a
> cylinder is defined as 255 tracks with 63 sectors each.
>
> Windows Vista and 7 align partitions differently. As the two behave
> similarly, only 7's behavior is shown here. These partition tables
> are created by the Windows 7 RC installer on blank disks.
>
> W-1. 512 byte physical and logical sector drive.
>
> ST FIRST T LAST LBA NBLKS
> 80 202100 07 df130c 00080000 00200300
> 00 df140c 07 feffff 00280300 00689e12
> 00 000000 00 000000 00000000 00000000
> 00 000000 00 000000 00000000 00000000
>
> Part0: FIRST C 0 H 32 S 33 : 2048 (63 sec/trk)
> LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
> LBA 2048 + 204800 = 206848
>
> Part1: FIRST C 12 H 223 S 20 : 206848
> LAST C 1023 H 254 S 63 : E
> LBA 206848 + 312371200 = 312578048
>
> Both aligned at (2048 * n). Part 1 not aligned to cylinder.
>
> W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.
>
> ST FIRST T LAST LBA NBLKS
> 80 202100 07 df130c 00080000 00200300
> 00 df140c 07 feffff 00280300 00b83f25
> 00 000000 00 000000 00000000 00000000
> 00 000000 00 000000 00000000 00000000
>
> Part0: FIRST C 0 H 32 S 33 : 2048 (63 sec/trk)
> LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
> LBA 2048 + 204800 = 206848
>
> Part1: FIRST C 12 H 223 S 20 : 206848
> LAST C 1023 H 254 S 63 : E
> LBA 206848 + 624932864 = 625139712
>
> Both aligned at (2048 * n). Part 1 not aligned to cylinder.
>
> W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.
>
> ST FIRST T LAST LBA NBLKS
> 80 202800 07 df130c 07080000 f91f0300
> 00 df1b0c 07 feffff 07280300 f9376d74
> 00 000000 00 000000 00000000 00000000
> 00 000000 00 000000 00000000 00000000
>
> Part0: FIRST C 0 H 32 S 40 : 2055 (63 sec/trk)
> LAST C 12 H 223 S 19 : 206847 (255 heads/cyl)
> LBA 2055 + 204793 = 206848
>
> Part1: FIRST C 12 H 223 S 27 : 206855
> LAST C 1023 H 254 S 63 : E
> LBA 206855 + 1953314809 = 1953521664
>
> Both aligned at (2048 * n + 7). Part 1 not aligned to cylinder.
>
> The partitioner seems to be using 1 MiB as the basic alignment unit and
> offsetting from there only if explicitly requested by the drive. There
> is no difference between the handling of 512 byte and 4 KiB drives, which
> explains why C-1 works for hard drive vendors.
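
For anyone who wants to re-check the tables, the hex rows can be decoded with
a short Python script. This is a sketch that assumes the standard 16 byte MBR
entry layout (status, CHS first, type, CHS last, little-endian LBA and sector
count) and the 255/63 geometry used in the dumps; the function names are mine.

```python
import struct

HEADS, SECTORS = 255, 63  # geometry used by the decoded tables

def decode_chs(raw):
    """Decode a 3 byte CHS field into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]
    return cylinder, head, sector

def chs_to_lba(cylinder, head, sector):
    """Convert CHS to LBA under the assumed 255/63 geometry."""
    return (cylinder * HEADS + head) * SECTORS + (sector - 1)

def decode_entry(hex_row):
    """Return (first_chs, last_chs, start_lba, nblks) for one table row."""
    raw = bytes.fromhex(hex_row.replace(" ", ""))
    _status, first, _ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return decode_chs(first), decode_chs(last), lba, nblks

# First row of the W-3 table.
entry = decode_entry("80 202800 07 df130c 07080000 f91f0300")
print(entry)                 # ((0, 32, 40), (12, 223, 19), 2055, 204793)
print(chs_to_lba(*entry[0])) # 2055
print((entry[2] - 7) % 2048) # 0, i.e. aligned at 2048 * n + 7
```
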
>
> In all cases, the partitioner ignores both traditional requirements -
> the first partition on LBA 63 and the others on cylinder boundaries -
> while still using the same 255*63 cylinder size. Also, note that in
> W-3, both part 0 and 1 end up with an odd number of sectors. It seems
> that they simply decided to break away completely from the traditional
> layout, which is understandable given that there really isn't one good
> solution which can cover all the cases and that the larger default
> alignment benefits earlier SSDs.
>
> Windows Vista basically shows the same behavior. Vista was tested by
> creating two partitions using the management tool. Test data is
> available at [7].
>
> *-alignment_offset : alignment_offset reported by Linux kernel
> *-fdisk : fdisk -l output
> *-fdisk-u : fdisk -lu output
> *-hdparm : hdparm -I output
> *-mbr : dump of mbr
> *-part : decoded partition table from mbr
>
> Please note that hdparm is misreporting the alignment offset. It
> should be reporting 512 instead of 256 for offset-by-one drives.
>
>
> So, what now for Linux?
> =======================
>
> The situation is not easy. Considering all the factors, the only
> workable solution looks like doing what Windows is doing. Hard drive
> and SSD vendors are focusing on compatibility and performance on
> recent Windows releases and are happy to do things which break the
> standard-defined mechanism as shown by C-1, so departing from what
> Windows does would be unnecessarily painful.
>
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that. There will be many installations where
> modern Linux distros share a hard drive with older releases of
> Windows. At this point, I can't see a silver bullet solution.
>
> Perhaps partitioners should align only the partitions which will be
> used by Linux and default to the traditional layout for others, while
> allowing explicit override. I think Windows XP wouldn't have a problem
> with differently aligned partitions as long as it doesn't actually use
> them, but I haven't tested it.
>
> Reportedly, commonly used partitioners aren't ready to handle drives
> larger than 2 TiB in any configuration and alignment isn't done
> properly for drives with 4 KiB physical sectors. 4 KiB logical sector
> support is broken in both the kernel and partitioners. (need more
> details and probably a whole section on partitioner behaviors)
>
> Unfortunately, the transition to 4 KiB sector size, physical only or
> logical too, is looking fairly ugly. Hopefully, a reasonable solution
> can be reached in the not too distant future, but even with all the
> software side updated, it looks like it's gonna cause a significant
> amount of confusion and frustration.
>
>
> [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
> [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
> [3] http://en.wikipedia.org/wiki/Master_boot_record
> [4] http://support.microsoft.com/kb/931760
> [5] http://thread.gmane.org/gmane.linux.kernel/953981
> [6] http://en.wikipedia.org/wiki/GUID_Partition_Table
> [7] http://userweb.kernel.org/~tj/partalign/
>
> * Mar 04 2010
> Initial draft, Tejun Heo <tj@xxxxxxxxxx>
> * Mar 08 2010
> Updated according to comments from Daniel Taylor
> <Daniel.Taylor@xxxxxxx>. Other minor updates.
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer