Monitoring and Testing the Health of SSDs in Linux

Date: 2020-03-05 15:26:10  Source: igfitidea

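What is S.M.A.R.T.?

S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a technology embedded in storage devices such as hard disk drives and SSDs. Its goal is to monitor the health of the device.

In practice, during normal drive operation S.M.A.R.T. monitors several disk parameters, such as the number of read errors, the drive start-up time, and even environmental conditions. In addition, S.M.A.R.T. can run on-demand tests on the drive.

Ideally, S.M.A.R.T. would make it possible to anticipate predictable failures, such as those caused by mechanical wear or degradation of the disk surface, as well as unpredictable failures caused by an unexpected defect. Since drives usually do not fail abruptly, S.M.A.R.T. gives the operating system or the system administrator a chance to identify a drive that is about to fail, so that it can be replaced before any data is lost.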

What S.M.A.R.T. is not

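All of that sounds wonderful. However, S.M.A.R.T. is not a crystal ball. It cannot predict failures with 100% accuracy, nor can it guarantee that a drive will not fail without any prior warning. At best, S.M.A.R.T. should be used to estimate the likelihood of a failure.

Given the statistical nature of failure prediction, S.M.A.R.T. is of particular interest to organizations that operate large numbers of storage units, and field studies have been carried out to evaluate how well S.M.A.R.T.-reported issues anticipate disk replacement needs in data centers and server farms.

In 2015, Microsoft and Pennsylvania State University conducted a study focused on SSDs.

According to that study, some S.M.A.R.T. attributes appear to be good indicators of imminent failure. The paper specifically mentions: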

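Reallocated (Realloc) sector count:

Although the underlying technology is radically different, this indicator seems to be as significant for SSDs as it is for hard drives. It is worth mentioning that, because of the wear-leveling algorithms used in SSDs, once several blocks start to fail, chances are that many more will fail soon.

Program/Erase (P/E) fail count:

This is a symptom of a problem with the underlying flash hardware, where the drive was unable to clear a block or store data in it. Because of imperfections in the manufacturing process, a few such errors can be expected. However, flash memory supports only a limited number of clear/write cycles. So, once again, a sudden increase in the number of events may indicate that the drive has reached its end of life, and we can expect many more memory cells to fail soon.

CRC and uncorrectable errors (“Data Error”):

These events can be caused either by storage errors or by problems with the drive's internal communication links. This indicator takes into account both corrected errors (for which no problem is reported to the host system) and uncorrected errors (blocks the drive has reported to the host system as unreadable). In other words, correctable errors are invisible to the host operating system, but they still affect the drive's performance, since the drive firmware has to correct the data and a sector relocation may occur.

SATA downshift count:

Because of temporary disturbances, problems with the communication link between the drive and the host, or internal drive issues, the SATA interface can switch to a lower signaling rate. Downgrading the link below its nominal rate has an obvious impact on the observed drive performance. Selecting a lower signaling rate is not uncommon, especially on older drives. So this indicator is most significant when it is correlated with one or more of the preceding ones.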

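According to the study, 62% of the failed SSDs showed at least one of the above symptoms. However, if you turn that statement around, it also means that 38% of the studied SSDs failed without showing any of the above symptoms. The study does not say whether the failed drives exhibited any other S.M.A.R.T.-reported failure. So this figure cannot be directly compared with the 36% of hard drive failures without prior notice mentioned in the Google paper.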

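The Microsoft/Penn State paper does not disclose the exact drive models studied, but according to the authors, most of the drives came from the same vendor and spanned several generations.

The study found significant differences in reliability between models. For example, the “worst” model studied showed a 20% failure rate nine months after the first reallocation error, and up to a 36% failure rate nine months after the first occurrence of data errors. The “worst” model also happens to be the older drive generation studied in the paper.

On the other hand, for the same symptoms, the drives belonging to the youngest generation of devices showed failure rates of only 3% and 20% respectively. It is hard to tell whether those figures are explained by improvements in drive design and manufacturing, or whether they are simply an effect of drive aging.

Most interestingly, and I gave some possible reasons earlier, the paper mentions that it is not the raw value but a sudden increase in the number of reported errors that should be considered an alarming indicator: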

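“There is a higher likelihood of the symptoms preceding SSD failures, with an intense manifestation and rapid progression preventing their survivability beyond a few months.”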

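In other words, an occasional S.M.A.R.T.-reported error should probably not be taken as a sign of imminent failure. However, when a healthy SSD starts reporting more and more errors, a short- to mid-term failure has to be anticipated.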
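
The "sudden increase" idea can be sketched as a comparison between two successive readings of the same raw error counter. This is only an illustration: the function name and the alert threshold are hypothetical, and the study does not prescribe any particular value.

```sh
#!/bin/sh
# Compare two successive raw readings of one error counter and flag a jump.
# $1 = previous reading, $2 = current reading,
# $3 = largest increase still considered unremarkable (hypothetical,
#      site-specific choice; not a value taken from the study).
error_trend() {
    previous=$1
    current=$2
    max_increase=$3
    increase=$((current - previous))
    if [ "$increase" -gt "$max_increase" ]; then
        echo "ALERT: +$increase since last reading"
    elif [ "$increase" -gt 0 ]; then
        echo "watch: +$increase since last reading"
    else
        echo "stable"
    fi
}
```

For instance, `error_trend 3 40 5` reports an alert, while `error_trend 3 3 5` reports a stable counter.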

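But how do you know whether your hard drive or SSD is healthy? Whether to satisfy your curiosity or because you want to start monitoring your drives closely, it is now time to introduce the smartctl monitoring tool:

Using smartctl to monitor SSD status in Linux

There are several ways to list disks in Linux, but to monitor the S.M.A.R.T. status of a disk, I suggest the “smartctl” tool, which is part of the “smartmontools” package (at least on Debian/Ubuntu).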

sudo apt install smartmontools

smartctl is a command-line tool, but that is perfect, especially if you want to automate data collection on your servers.
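
As a rough sketch of what such automated collection might look like, the loop below asks each drive for its overall health verdict with the -H option and prints one line per device. The function name, the SMARTCTL override variable, and the "UNKNOWN" fallback are assumptions made for this illustration. It only recognizes the ATA-style "overall-health" line shown later in this article.

```sh
#!/bin/sh
# Command used to query a drive; overridable (assumption of this sketch)
# so the loop can be exercised without root privileges.
SMARTCTL=${SMARTCTL:-smartctl}

# Print "<device> <verdict>" for every device given as an argument.
# The verdict is the last word of the "overall-health" line printed by -H,
# or "UNKNOWN" when that line is missing.
collect_health() {
    for dev in "$@"; do
        verdict=$($SMARTCTL -H "$dev" 2>/dev/null |
            awk '/overall-health self-assessment test result:/ { print $NF }')
        echo "$dev ${verdict:-UNKNOWN}"
    done
}
```

Run as root, `collect_health /dev/sda /dev/sdb` would print one verdict per drive, which is easy to append to a log from a cron job.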

The first step when using smartctl is to check whether your disk has S.M.A.R.T. enabled and is supported by the tool:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:54:43 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

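As we can see, my laptop's internal hard drive does have S.M.A.R.T. capability, and S.M.A.R.T. support is enabled. So, what about its S.M.A.R.T. status? Are there any errors recorded?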
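
For scripted use, the "available and enabled" check can be done by reading the two "SMART support is:" lines of the -i output. The helper below is a minimal sketch with a hypothetical name; it assumes the line format shown in the transcript above.

```sh
#!/bin/sh
# Read `smartctl -i` output on standard input and succeed (exit status 0)
# only if S.M.A.R.T. is reported as both available and enabled.
smart_is_enabled() {
    awk '
        /^SMART support is:[ \t]+Available/ { available = 1 }
        /^SMART support is:[ \t]+Enabled/   { enabled = 1 }
        END { exit !(available && enabled) }
    '
}
```

It could be used as `sudo smartctl -i /dev/sdb | smart_is_enabled && echo "S.M.A.R.T. is on"`.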

Reporting “all SMART information about the disk” is the job of the -a option:

sh$ sudo smartctl -i -a /dev/sdb
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:56:58 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 110) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       29694249
  3 Spin_Up_Time            0x0003   100   098   085    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5413
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   096   037   020    Old_age   Always       -       4836
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   072   072   000    Old_age   Always       -       28
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       4295033738
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   042   045    Old_age   Always   In_the_past 44 (Min/Max 21/44 #22)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       184
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       104
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       395415
194 Temperature_Celsius     0x0022   044   058   000    Old_age   Always       -       44 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   050   045   000    Old_age   Always       -       29694249
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       25131 (246 202 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3028413736
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1613088055
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- -
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  -------------------
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.579  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.571  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00      00:45:12.543  READ FPDMA QUEUED
Error 2 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- -
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  -------------------
  60 00 00 ff ff ff 4f 00      00:45:09.456  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:09.451  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:45:09.450  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.878  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.856  READ FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 21131 hours (880 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- -
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  -------------------
  60 00 00 ff ff ff 4f 00      05:52:18.809  READ FPDMA QUEUED
  61 00 00 7e fb 31 45 00      05:52:18.806  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:52:18.571  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      05:52:18.529  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      05:52:18.527  WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10904         
# 2  Short offline       Completed without error       00%        12         
# 3  Short offline       Completed without error       00%         0         
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

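Understanding the output of the smartctl command

That is a lot of information, and the data is not always easy to interpret. The most interesting part is probably the one labeled “Vendor Specific SMART Attributes with Thresholds”. It reports the various statistics gathered by the S.M.A.R.T. device and lets us compare those values (current or all-time worst) with a vendor-defined threshold.

For example, here is how my disk reports reallocated sectors: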

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3

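We can see that this is a “Pre-fail” attribute. That just means the attribute corresponds to anomalies. So, if that attribute reaches its threshold, it could be an indicator of imminent failure. The other category is “Old_age”, for attributes that correspond to normal wear.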

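The last field (here “3”) is the raw value for that attribute as reported by the drive. Usually, this number has a physical meaning. Here, it is the actual number of reallocated sectors. For other attributes, however, it could be a temperature in degrees Celsius, a time in hours or minutes, or the number of times the drive has encountered a specific condition.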
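
To pull a single raw value out of the attribute table in a script, one option is to select the row by its ID and print the tenth column. The helper below is a sketch with a hypothetical name; it assumes the ten-column layout shown above, as printed for example by `smartctl -A`, which limits the output to the attribute table.

```sh
#!/bin/sh
# Print the RAW_VALUE column (10th field) of the attribute whose ID is $1.
# The attribute table is read from standard input.
# Requiring a 0x.... FLAG in the third column avoids matching other lines
# that happen to start with the same number.
raw_value() {
    awk -v id="$1" '$1 == id && $3 ~ /^0x[0-9a-f]+$/ && NF >= 10 {
        print $10
        exit
    }'
}
```

For example, `sudo smartctl -A /dev/sdb | raw_value 5` would print the number of reallocated sectors. For attributes whose raw value carries extra annotations, such as a temperature with its minimum and maximum, only the leading number is printed.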

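In addition to the raw value, a S.M.A.R.T.-enabled drive must report “normalized” values (the VALUE, WORST, and THRESH fields). Those values are normalized to the range 1-254 (0-255 for the threshold). The disk firmware performs that normalization using some internal algorithm. Moreover, different manufacturers may normalize the same attribute differently. Most values are reported as a percentage, with higher being better, but this is not mandatory. When a parameter is lower than or equal to the manufacturer-provided threshold, the disk is said to have failed for that attribute. With all the reservations mentioned in the first part of this article, when a “Pre-fail” attribute has failed, a disk failure is presumably imminent.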
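
The "lower than or equal to the threshold" rule can be applied mechanically to the attribute table. The sketch below uses a hypothetical function name and assumes the ten-column layout shown in this article. It compares VALUE and WORST with THRESH and reuses the wording smartctl itself prints in the WHEN_FAILED column.

```sh
#!/bin/sh
# Read an attribute table on standard input and print, for each attribute
# at or below its threshold: name, status, and type (Pre-fail or Old_age).
#   FAILING_NOW : current VALUE <= THRESH
#   In_the_past : only the WORST value ever recorded is <= THRESH
failed_attributes() {
    awk '$3 ~ /^0x[0-9a-f]+$/ && NF >= 10 {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (value <= thresh)      print $2, "FAILING_NOW", $7
        else if (worst <= thresh) print $2, "In_the_past", $7
    }'
}
```

Fed with the hard drive table above, this would list only Airflow_Temperature_Cel, whose worst value (042) once dropped below its threshold (045). A “Pre-fail” attribute reported as FAILING_NOW is the case that deserves immediate attention.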

As a second example, let's examine the “Seek_Error_Rate”:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327

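Actually, and this is a problem with S.M.A.R.T. reporting, the exact meaning of each value is vendor-specific. In my case, Seagate uses a logarithmic scale to normalize the value. So “71” means roughly one error per 10 million seeks (10 to the power of 7.1). Amusingly, the all-time worst was one error per million seeks (10 to the power of 6.0). If I interpret that correctly, it means my disk heads are positioned more accurately now than they were in the past. I did not follow that disk closely, so this analysis should be taken with caution. Maybe the drive just needed a running-in period when it was first put into service? Or is it a consequence of mechanical parts wearing and therefore producing less friction today? In any case, and whatever the reason, this value is more of a performance indicator than an early warning of failure. So it does not bother me much.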
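
The arithmetic behind that reading can be checked with a one-line helper. It relies on the interpretation given above (seeks per error = 10 to the power of VALUE/10), which is this article's reading of Seagate's scale rather than a documented formula. The function name is hypothetical.

```sh
#!/bin/sh
# Convert a normalized Seek_Error_Rate into "seeks per error", ASSUMING the
# logarithmic interpretation used in the text: 10 ^ (value / 10).
seeks_per_error() {
    awk -v v="$1" 'BEGIN { printf "%.0f\n", 10 ^ (v / 10) }'
}
```

`seeks_per_error 71` prints 12589254, i.e. on the order of 10 million seeks per error, and `seeks_per_error 60` prints 1000000.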

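Apart from that, and from three suspicious errors recorded about six months ago, the drive appears to be in good condition (according to S.M.A.R.T.) for a laptop drive that has been powered on for more than 1100 days (26423 hours):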

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423
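
The hours-to-days conversion is plain integer arithmetic, which the shell can do directly. The helper name below is hypothetical, and its output format mimics the "882 days + 3 hours" notation smartctl uses in its error log.

```sh
#!/bin/sh
# Convert a Power_On_Hours raw value into whole days plus remaining hours.
hours_to_days() {
    hours=$1
    echo "$((hours / 24)) days + $((hours % 24)) hours"
}
```

`hours_to_days 26423` prints "1100 days + 23 hours", and `hours_to_days 21171` gives "882 days + 3 hours", matching the lifetime shown for the logged errors.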

Out of curiosity, I ran the same test on a much more recent laptop equipped with an SSD:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.5 2015-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA THNSNK256GVN8
Serial Number:    17FS131LTNLV
LU WWN Device Id: 5 00080d 9109b2ceb
Firmware Version: K8XA4103
User Capacity:    256 060 514 304 bytes [256 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 13 01:03:23 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

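The first thing to notice is that, even though the device is S.M.A.R.T.-enabled, it is not in the smartctl database. That does not prevent the tool from gathering data from the SSD, but it will not be able to report the exact meaning of the various vendor-specific attributes: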

sh$ sudo smartctl -a /dev/sdb
smartctl 6.5 2015-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  11) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
  8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       171
 10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       105
166 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       100
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       0
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0023   063   032   020    Pre-fail  Always       -       37 (Min/Max 11/68)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

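This is typically the output we can expect from a brand new SSD, even though many attributes are reported as “Unknown_SSD_Attribute” for lack of normalization or meta-information about the vendor-specific data. I can only hope that future versions of smartctl will include data for this particular drive model in the tool's database, so that I can identify possible issues more accurately.

Testing SSDs in Linux with smartctl

Until now, we have examined the data collected by the drive during its normal operation. However, the S.M.A.R.T. protocol also supports several “self-test” commands to launch diagnostics on demand.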

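Unless explicitly requested otherwise, self-tests can run during normal disk operation. Since the test and the host I/O requests compete for the drive, disk performance will degrade while the test is running. The S.M.A.R.T. specification defines several kinds of self-test. The most important are:

Short self-test (-t short)

This test checks the electrical and mechanical performance of the drive as well as its read performance. The short self-test typically takes only a few minutes to complete (usually 2 to 10).

Extended self-test (-t long)

This test takes one or two orders of magnitude longer to complete. Usually, it is a more in-depth version of the short self-test. In addition, it scans the entire disk surface for data errors with no time limit. The test duration is proportional to the disk size.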

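Conveyance self-test (-t conveyance)

This test suite is designed as a relatively quick way to check for damage that may have occurred while the device was being transported.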

Here are examples taken from the same disks as above. I will let you guess which is which:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.5 2015-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2016
Use smartctl -X to abort test.

The test has now been started. Let's wait until it completes to display the result:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.5 2015-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       171         
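
Rather than hard-coding the 120-second sleep, a script could read the waiting time from the "Please wait N minutes" line that smartctl prints when the test starts. The helper below is a sketch with a hypothetical name, and it assumes that line keeps the wording shown above.

```sh
#!/bin/sh
# Read the output of `smartctl -t ...` on standard input and print the
# number of minutes announced in "Please wait N minutes for test to complete."
# Prints nothing if no such line is present.
selftest_wait_minutes() {
    awk '/^Please wait [0-9]+ minutes? for test to complete/ { print $3 }'
}

# Possible use (requires root and a real drive, so left as a comment):
#   minutes=$(smartctl -t short /dev/sdb | selftest_wait_minutes)
#   sleep "$((minutes * 60))" && smartctl -l selftest /dev/sdb
```

The same helper works for the extended test, where the announced delay is much longer.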

And now the same test on my other disk:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 21:59:39 2016
Use smartctl -X to abort test.

Once again, sleep for two minutes and then display the test result:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     26429         
# 2  Short offline       Completed without error       00%     10904         
# 3  Short offline       Completed without error       00%        12         
# 4  Short offline       Completed without error       00%         0         

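Interestingly, in that case, it appears that both the drive manufacturer and the computer manufacturer ran some quick tests on the disk (at lifetime 0h and 12h). I was definitely much less concerned about monitoring the drive's health myself. So, since I am running some self-tests for this article, let's start an extended test to see how it goes: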

sh$ sudo smartctl -t long /dev/sdb
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 110 minutes for test to complete.
Test will complete after Tue Mar 13 00:09:08 2016
Use smartctl -X to abort test.

Apparently, this time we will have to wait much longer than for the short test. So let's do it:

sh$ sudo bash -c 'sleep $((110*60)) && smartctl -l selftest /dev/sdb'
[sudo] password for sylvain:
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       20%     26430         810665229
# 2  Short offline       Completed without error       00%     26429         
# 3  Short offline       Completed without error       00%     10904         
# 4  Short offline       Completed without error       00%        12         
# 5  Short offline       Completed without error       00%         0         

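In that latter case, pay special attention to the different outcomes of the short and extended tests, even though they were run one right after the other. Well, maybe that disk is not so healthy after all! An important thing to notice is that the test stops after the first read error. So if you want an exhaustive diagnosis of all read errors, you have to continue the test after each error. I encourage you to take a look at the very well written smartctl(8) manual page for more information about the -t select,N-max and -t select,cont options: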

sh$ sudo smartctl -t select,810665230-max /dev/sdb
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN         STARTING_LBA           ENDING_LBA
   0            810665230            976773167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.
smartctl 6.6 2015-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed without error       00%     26432         
# 2  Extended offline    Completed: read failure       20%     26430         810665229
# 3  Short offline       Completed without error       00%     26429         
# 4  Short offline       Completed without error       00%     10904         
# 5  Short offline       Completed without error       00%        12         
# 6  Short offline       Completed without error       00%         0
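
The manual step used above, taking the LBA of the first error and restarting a selective test one sector further, can be sketched as a small filter over the self-test log. The function name is hypothetical. It assumes that the most recent entries come first in the log, as in the listings above, and that the failing LBA is the last field of the "read failure" line.

```sh
#!/bin/sh
# Read a self-test log (`smartctl -l selftest`) on standard input and print
# the argument for -t that resumes scanning right after the most recent
# read failure. Prints nothing when no read failure with an LBA is logged.
resume_after_read_failure() {
    awk '/^# *[0-9]+ / && /read failure/ && $NF ~ /^[0-9]+$/ {
        # %.0f instead of %d: LBAs on large disks can exceed 32 bits.
        printf "select,%.0f-max\n", $NF + 1
        exit
    }'
}
```

With the log shown after the extended test, it prints select,810665230-max, which is the argument passed to smartctl -t in the last command above.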