- How can I do disk surface scanning, and fix/reallocate bad sectors in Linux from the command line?
- 2 Answers
- Disk self-testing with smartctl
- The sg_verify and sg_reassign utilities
- Reallocating sectors on SATA disks
- An MHDD analogue for Linux
- How to Fix Bad Sectors on Linux
- How to Fix Bad Sectors on Linux?
- What are bad sectors?
- e2fsck command
- Prepare a Bootable DVD
How can I do disk surface scanning, and fix/reallocate bad sectors in Linux from the command line?
I bought a disk with some bad sectors, planning to fix them and then use it as part of RAID 6 cluster. I can do bad sector fixing under Windows, there are very good bad block fixing tools, but under Windows, the process is very slow, one sector fix takes 15 minutes. In my experience, Linux is better at dealing with devices that don’t respond in time and this results in a far faster process under Linux. However, I checked the fsck manual, but did not find any useful option for surface & bad block scanning or bad block reallocation. How can I scan the surface of my hard disk and fix/reallocate bad sectors in Linux from the command line?
For a long time now (decades), disks have fixed their own bad sectors unprompted. Operating systems don’t even know the physical mapping of sectors on disk, because the drive firmware remaps sectors when it repairs them. Given that, should it not be sufficient to either (1) just let the drive notice the bad sectors and fix them when they get written, or (2) if you really want to, force every sector to be rewritten once with dd if=/dev/zero so that it happens sooner? So long as you are using the disk for new storage, not trying to read existing data on it that might be inaccessible, you should be fine.
2 Answers
This answer is about magnetic disks. SSDs are different. Also, this assumes a disk with no data (or no data you care to preserve) on it; see my answer to “Can I fix bad blocks on my hard disk with a single command” for what to do if you have important data on the disk.
Disks made since at least the late 90s manage bad blocks themselves. In brief, a disk handles a bad block by transparently replacing it with a spare sector. It does so if (a) while reading, it discovers the block is “weak”, but ECC is enough to recover the data; (b) while writing, it discovers the sector header is bad; or (c) while writing, a previous read had detected the sector as bad and the data was not recoverable.
The disk firmware typically lets you monitor this process (the counts at least) via SMART attributes. Typically there will be at least a count of reallocated sectors and two counts of pending sectors (discovered bad on read, ECC failed, not yet written to).
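As a rough sketch of how those counters can be watched, the function below filters the attribute table printed by smartctl -A. The attribute names and the raw value sitting in the tenth column are assumptions that hold for many ATA drives but vary by vendor, and the name smart_counts is made up for illustration.

```shell
# Print the raw values of the sector-health attributes from `smartctl -A` output.
# Assumption: ATA-style attribute table where column 2 is the attribute name
# and column 10 is the raw value; names differ between drive vendors.
smart_counts() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2 "=" $10
    }'
}

# Usage (needs root and a real device, so it is left commented out):
#   smartctl -A /dev/sdX | smart_counts
```

Running it before and after a surface scan makes it easy to see which counters moved.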
There are two ways to get the disk to notice bad sectors:
- Use smartctl -t offline /dev/sdX to tell the disk firmware to do an offline surface scan. You then just leave the disk alone (completely idle will be fastest) until it’s done (check the “Offline data collection status” in smartctl -c /dev/sdX). This will typically update the “offline uncorrectable” count in SMART. (Note: drives can be configured to automatically run an offline check routinely.)
- Have Linux read the entire disk, e.g., badblocks -b 4096 -c 1024 -s /dev/sdX. This will typically update the “current pending sector” count in SMART.
Either of the above may also increase the reallocated sector count; this is case (a), where ECC recovered the data.
Now, to recover the sectors you just need to write to them. Normally, that’d be a simple pv -pterba /dev/zero > /dev/sdX (or just plain cat, or dd), but you plan to make these part of a RAID array. The RAID init will write to the entire disk anyway, so that’s pointless. The only exception is the beginning and end of the disk: it’s possible a few tens of megabytes will be missed (due to alignment, headers, etc.). So:
disk=/dev/sdX
end=$(echo "$(/sbin/blockdev --getsize64 "$disk")/4096-32768" | bc)
dd if=/dev/zero bs=4096 count=32768 of="$disk" # first 128 MiB
dd if=/dev/zero bs=4096 seek="$end" count=32768 of="$disk" # last 128 MiB
I think I managed to avoid the all-too-easy fencepost error [1] above, so that should blank the first and last 128 MiB of the disk. Then let the mdadm RAID init write the rest. It’s harmless (except for trivial wear, and wasting hours of time) to zero the whole disk if you’d like to, though.
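The offset arithmetic can be checked without touching a disk. This is a minimal sketch using shell arithmetic instead of bc; tail_seek is a hypothetical helper, and the 1 GiB size in the comments is a made-up example rather than a real device.

```shell
# Compute the dd seek offset (in blocks) for wiping the tail of a disk.
# tail_seek SIZE_BYTES [BLOCK_SIZE] [COUNT]
tail_seek() {
    size=$1; bs=${2:-4096}; count=${3:-32768}
    total=$((size / bs))          # whole blocks on the device
    echo $((total - count))       # first block of the tail region
}

# Worked example with a made-up 1 GiB device (1073741824 bytes):
#   total blocks = 262144, seek = 229376, and 229376 + 32768 = 262144,
#   so the last block written is block 262143, the final one on the disk.
# Real use: tail_seek "$(blockdev --getsize64 /dev/sdX)"
```

Because seek plus count equals the total block count, the write ends exactly at the last block and neither overshoots nor leaves a block untouched.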
Another thing to do, if your disks support it: smartctl -l scterc,40,100 /dev/sdX (or whatever numbers) to tell the disk that you want it to give up on correcting errors sooner; the values are in tenths of a second, so 40 would be 4 seconds. The two numbers are the timeouts for read errors and write errors. mdraid will easily correct read errors via parity (and write the failed sector back to the disk to let it reallocate). Write errors, though, will fail the disk out of the array.
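Since the scterc values are given in tenths of a second, a small helper can keep the units straight. This is only a sketch: scterc_cmd is a hypothetical name, it accepts whole seconds only, and it prints the command instead of running it.

```shell
# Build the smartctl command that sets the SCT Error Recovery Control timeouts.
# scterc takes its two values in units of 100 ms, read timeout first, then
# write timeout. This helper only prints the command; it does not run it.
scterc_cmd() {
    dev=$1; read_s=$2; write_s=$3
    echo "smartctl -l scterc,$((read_s * 10)),$((write_s * 10)) $dev"
}

# Example: 4 s for reads, 10 s for writes
#   scterc_cmd /dev/sdX 4 10
```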
PS: Make sure to keep an eye on the reallocated sectors count. That attribute going to failed is bad news. And if it’s continuously increasing, that’s bad news too.
PPS: Make sure your RAID arrays are scrubbed (every sector read and all the parity verified) routinely. Many distros already ship a script that does this monthly. This will detect and repair any new bad blocks; otherwise, seldom-read bad blocks can linger and ultimately cause a rebuild failure.
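For arrays managed by the Linux md driver, a scrub can also be started by hand through sysfs. The sketch below assumes the usual /sys/block/<array>/md/sync_action file; the SYSFS_ROOT variable and the start_scrub name are additions made here so the function can be tried against a scratch directory, and md0 is a placeholder.

```shell
# Ask the md driver to scrub an array by writing "check" to its sync_action
# file. SYSFS_ROOT is an assumption added here so the function can be pointed
# at a test directory; on a real system it defaults to /sys.
start_scrub() {
    md=$1
    action="${SYSFS_ROOT:-/sys}/block/$md/md/sync_action"
    [ -w "$action" ] || { echo "cannot write $action" >&2; return 1; }
    echo check > "$action"
}

# Usage as root, with md0 as a placeholder array name:
#   start_scrub md0
#   cat /sys/block/md0/md/sync_action    # reads "check" while the scrub runs
#   cat /sys/block/md0/md/mismatch_cnt   # mismatches found by the scrub
```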
[1] Fencepost error: a type of off-by-one error from failing to count one of the ends. The name comes from the question: if you have a fence post every 3 ft, how many fence posts are in a 9 ft freestanding fence? The correct answer is 4; the fencepost error is 3, and comes from not counting the post at the beginning or at the end.
Disk self-testing with smartctl
There are two kinds of tests: foreground (where the disk answers every command with a CHECK CONDITION status) and background (where the disk remains usable).
Tests come in short and long variants.
Short: time-limited, and tests only part of the disk.
Long: similar to the final test at the factory, not time-limited, and tests the whole disk.
Start a “long” test, which can be run while the system is in use:
sudo smartctl -t long /dev/sdb
To find out how long a given test will take, run:
sudo smartctl -c /dev/sdc
...
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 1211) seconds.
...
Once the test finishes, view the result with smartctl -l selftest /dev/sdb:
sudo smartctl -l selftest /dev/sdb
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%            28243  1048784
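When scripting around the self-test log, the failing LBA can be pulled out of the last column. This is a minimal sketch that assumes the ATA log layout shown above, where entries start with "#" and a passed test shows "-" in the LBA column; first_error_lba is a made-up name.

```shell
# Print the LBA of the first failed entry in `smartctl -l selftest` output.
# Log entries start with "#"; the last column is the LBA of the first error,
# or "-" when the test completed without error.
first_error_lba() {
    awk '/^#/ && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Usage (left commented out because it needs a real device):
#   smartctl -l selftest /dev/sdb | first_error_lba
```

The number it prints can then serve as the target for a selective test or a sector rewrite.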
You can also test just part of the disk. The command:
sudo smartctl -t select,10-20 /dev/sdc
tests sectors 10 through 20 inclusive.
In addition, several ranges can be tested with a single command:
sudo smartctl -t select,0-10 -t select,5-15 -t select,10-20 /dev/sdc
A foreground test can be started by adding the -C option:
sudo smartctl -t long -C /dev/sdc
A short offline test, which forces the SMART attribute values to be updated, can be started like this:
sudo smartctl -t offline /dev/sda
The sg_verify and sg_reassign utilities
The sg3-utils package is intended for low-level work with disks that support the full SCSI command set. At present, that means disks with a SAS interface (but not SATA).
An unreadable sector can be found with, for example, the dd command:
sudo dd if=/dev/sdb of=/dev/null bs=4096
Then, when dd reports an error, sg_verify can be used to confirm that the problem really is in that sector:
sudo sg_verify --lba=1193046 /dev/sdb
verify (10): Fixed format, current; Sense key: Medium Error
Additional sense: Unrecovered read error
Info fld=0x123456 [1193046]
Field replaceable unit code: 228
Actual retry count: 0x008b
medium or hardware error, reported lba=0x123456
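The hexadecimal value in the sense data and the decimal LBA passed to --lba are the same number, and a dd run with bs=4096 counts 4096-byte blocks, not sectors. The sketch below shows both conversions; the helper names are hypothetical, 512-byte logical sectors are an assumption, and block 149130 is derived from the sector in the example, not taken from a real dd run.

```shell
# Convert a hexadecimal LBA, as printed by sg_verify, to decimal.
lba_to_dec() {
    printf '%d\n' "$1"
}

# Convert a dd block index to the first logical sector inside that block.
# Assumption: 512-byte logical sectors; pass the real size as the third
# argument if the drive reports 4096-byte logical sectors.
block_to_lba() {
    block=$1; bs=$2; sector=${3:-512}
    echo $((block * bs / sector))
}

# lba_to_dec 0x123456        prints 1193046
# block_to_lba 149130 4096   prints 1193040, so block 149130 covers
#                            sectors 1193040 to 1193047, including 1193046
```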
Next, check that the GLIST (grown defect list) is not too large and still has room:
sudo sg_reassign --grown /dev/sdb
>> Elements in grown defect list: 0
Now reassign the sector and check that the GLIST size has changed:
sudo sg_reassign --address=1193046 /dev/sdb
sudo sg_reassign --grown /dev/sdb
>> Elements in grown defect list: 1
The GLIST size grew by 1, as expected. As a result, the reassigned sector will contain a manufacturer-defined pattern (or, if the sector could still be read after all, its contents will be unchanged).
sg_reassign can also reassign a group of sectors.
The command sets for working with reassigned sectors differ between SAS and SATA disks. A SAS disk (regardless of whether it is attached to a SATA or a SAS controller) can unconditionally reassign a group of sectors to the spare area, whether or not the sectors are readable. Using sg_reassign on a SATA disk will most likely have no effect.
Reallocating sectors on SATA disks
To reallocate sectors on SATA disks, you can simply issue a write to the sector, or use hdparm:
The fundamental difference between SAS and SATA disks is that with SAS you can force even a good sector to be reassigned, whereas with SATA the drive’s own controller decides whether to reallocate on write. That said, this approach works for both interfaces:
sudo hdparm --yes-i-know-what-i-am-doing --repair-sector 1000 /dev/sda
for i in ; do sudo hdparm --yes-i-know-what-i-am-doing --repair-sector "$i" /dev/sda; done
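Because --repair-sector overwrites the sector with zeros, it can help to print the commands for review before running them. This is a sketch only: repair_cmds is a hypothetical helper, the sector numbers in the example are placeholders, and the --yes-i-know-what-i-am-doing flag is included on the assumption that the installed hdparm refuses sector writes without it.

```shell
# Print (not run) one hdparm command per sector so the list can be reviewed.
# The first argument is the device; the remaining arguments are sector numbers.
repair_cmds() {
    dev=$1; shift
    for sector in "$@"; do
        echo "hdparm --yes-i-know-what-i-am-doing --repair-sector $sector $dev"
    done
}

# Example with placeholder sector numbers:
#   repair_cmds /dev/sdX 1000 1001 1002
```

Once the list looks right, the printed lines can be run as root one at a time.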
An MHDD analogue for Linux
I was surprised to find that Linux has something of an analogue (visually, at least) of MHDD, the popular DOS program.
It is whdd, a utility included in SystemRescueCd.
How to Fix Bad Sectors on Linux
We’re here to talk about how to fix bad sectors on Linux and keep your hard disk in good shape. Read on to find out more!
Bad sectors are among the most annoying problems an end user can face. Most operating systems ship utilities for dealing with them; Microsoft Windows, for instance, has chkdsk, which checks disks for errors. Note that a bad sector is a physical problem, and the usual solution is to mark it so that nothing is written there in the future.
One of the ways to check for bad sectors is to boot the system with some other drive and then scan your drive for bad sectors. If an external drive is not available, you can also boot the system in single-user mode.
Linux has a tool called e2fsck that checks an ext2, ext3, or ext4 file system and can call badblocks to scan the drive for bad sectors. This article discusses the e2fsck command and its options in detail, and gives a step-by-step procedure for scanning your system for errors and fixing them.
How to Fix Bad Sectors on Linux?
Before we get to the meat of it, you might want to understand what a bad sector is and how it can adversely affect the system. After that, we will look at the ways to fix the problem on Linux.
What are bad sectors?
A hard disk stores data in sectors. Sectors are arranged in concentric tracks on each platter, and the file system groups adjacent sectors into clusters (blocks). Bad sectors are sectors that have been corrupted or damaged, so data cannot be stored on them reliably.
There can be multiple reasons for bad sectors, such as power failure, physical damage, or the age of the disk. Even when a sector is corrupted, the operating system may treat it as normal and attempt to write to it, which can lead to data loss. Such sectors must therefore be marked so that the operating system avoids writing to them.
Tip: One should scan their drives regularly to eliminate issues with a system, such as slow read/ write, data loss, etc.
e2fsck command
The e2fsck command scans the file system on your drive for errors. It accepts several options: -c runs badblocks to find bad sectors and adds them to the bad block list, -f forces a check even if the file system seems clean, -p automatically repairs problems that can be fixed safely without asking questions, and -v turns on verbose output.
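e2fsck should only be run on a file system that is not mounted, so a quick check beforehand is worthwhile. The sketch below looks for the device in the mounts table; is_mounted is a made-up helper, /dev/sdX1 is a placeholder, and the check is only approximate because a device can be listed under another name (for example a device-mapper path).

```shell
# Return success if DEVICE appears as a mount source in the mounts table.
# The optional second argument exists only so the check can be run against
# a sample file instead of /proc/mounts.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage, with /dev/sdX1 as a placeholder for an ext2/3/4 partition:
#   is_mounted /dev/sdX1 || sudo e2fsck -cfpv /dev/sdX1
```

The example invocation combines the four options described above; with -c the run includes a read-only badblocks scan of the partition, which can take hours on a large disk.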
Now, we will discuss the basic steps to fix bad sectors in a system.
Prepare a Bootable DVD
First, you need to have Ubuntu burned onto a DVD. Alternatively, make sure that you have another drive with Linux (such as Ubuntu) installed that can access the hard drive to be checked for bad sectors.