Slide 1: Before Terabytes Fall
Disk reliability in Windows Vista and beyond
Frank Shu
Program Manager
WDEG-Storage
Microsoft Corporation
Matthew Kerner
Program Manager
Windows Diagnosis
Microsoft Corporation
Slide 2: Windows Storage Devices
Strategic pillars
Timely, comprehensive, quality platform support for optical devices
Optimized platform features enabling your Windows experience, here and now
Leading platform enabling storage fabric adoption
Preferred platform for developing, deploying, and using storage devices
Slide 3: Session Outline
Introduction (Frank Shu)
Windows Vista Disk Diagnostics (Matthew Kerner)
Future Technology (Frank Shu)
Demo (Microsoft and Samsung)
Slide 4: What Matters Most To Our Users?
A consumer bought a new computer, and it works great at work and at home. She couldn’t do her everyday tasks without it. What matters most to her?
CPU power
Network connection
Battery life
Something else…
Slide 6: Protecting Data: Windows Vista Disk Diagnostics
Matthew Kerner
Slide 7: Quantifying Disk Failures
Catastrophic disk failures
~200 disks replaced per week at Microsoft in 2003
Top driver of Microsoft support’s hardware-related support calls in both client and server
Based on Microsoft figures, disk failures cost many millions of dollars per year in enterprises
Localized failures (bad blocks)
Kernel and user-mode crashes
1.7% of customer-reported Microsoft Online Crash Analysis crashes are due to disk errors
Application hangs during read recovery
Slide 8: Disk Failure Mitigations
Prevention
Hybrid hard disks (mobile systems)
RAID
Catastrophic failure recovery
Data backup
Disk replacement
Localized failure recovery
Repair from redundant copy
Restore from backup
Slide 9: Windows Vista Disk Diagnostics
Purpose: Save user data before catastrophic disk failure
Client SKUs
Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) polling triggers the diagnostic
Uses S.M.A.R.T. trip status – no threshold/attribute comparison
Warns user of impending failure and walks them through backup and replacement
Windows Vista backup improvements
Slide 10: Disk Diagnostics Details
Disk class driver polls S.M.A.R.T. status hourly, as it has done since Windows 2000
Based on industry feedback, no use of Disk Self-Test or attribute comparison
Failure triggers user-mode code
Filter out duplicate failures
Log SMART READ LOG details to OS event log
Device error count from summary error log sector
Life timestamp from most recent error log entry
Trigger user-context interactive resolution
Customizable by Group Policy
Print instructions, walk user through backup
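The user-mode steps listed on this slide (filter duplicates, log details, trigger the interactive resolution) can be sketched roughly as follows. This is an illustrative sketch only, not the Windows Vista implementation: the class names, the use of a disk identifier as the duplicate key, and the `interactive_enabled` flag standing in for the Group Policy setting are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SmartFailure:
    """Hypothetical record of one S.M.A.R.T. trip reported by the driver."""
    disk_id: str
    device_error_count: int      # from the summary error log sector
    life_timestamp_hours: int    # from the most recent error log entry


@dataclass
class DiskDiagnostic:
    """Hypothetical user-mode handler for S.M.A.R.T. failure notifications."""
    interactive_enabled: bool = True          # stand-in for a Group Policy setting
    event_log: list = field(default_factory=list)
    seen_disks: set = field(default_factory=set)

    def on_failure(self, failure: SmartFailure) -> str:
        # Step 1: filter out duplicate failures for the same disk.
        if failure.disk_id in self.seen_disks:
            return "duplicate"
        self.seen_disks.add(failure.disk_id)

        # Step 2: record the error-log details in the event log.
        self.event_log.append(
            {
                "disk": failure.disk_id,
                "device_error_count": failure.device_error_count,
                "life_timestamp_hours": failure.life_timestamp_hours,
            }
        )

        # Step 3: start the interactive resolution if policy allows it.
        return "prompt_backup" if self.interactive_enabled else "logged_only"
```

In this sketch, a second notification for the same disk returns `"duplicate"` and adds nothing to the log, which mirrors the "filter out duplicate failures" step.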
Slide 11: Startup Repair/Windows Recovery Environment
Purpose: Recover from non-bootable states, including those caused by disk failures
Automatic failover on boot failure to recovery partition
Optionally deployed by OEM
Available on installation media
Hands-free diagnosis and repair of top non-boot issues
Slide 12: Corrupted File Recovery
Purpose: Turn repeat user-mode crashes caused by corrupted system binaries into a one-time crash with silent repair from cache
Windows Error Reporting crash handler triggers diagnostic on inpage error crashes due to bad blocks
Diagnoses corrupted system files
Silent repair from System File Cache
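The diagnose-then-repair flow above can be illustrated with a small sketch. It assumes, purely for illustration, that corruption is detected by comparing a SHA-256 digest against a known-good value and that the cache is an ordinary file copy; the slide does not say how Windows Vista detects corruption or how the System File Cache is organized, and the function names here are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_if_corrupted(target: Path, cached_copy: Path, expected_sha256: str) -> bool:
    """Replace target from cached_copy if target is unreadable or corrupted.

    Returns True if a repair was performed, False if target was already good.
    """
    try:
        if sha256_of(target) == expected_sha256:
            return False  # file is intact, nothing to do
    except OSError:
        # An unreadable file (for example, one sitting on bad blocks)
        # is treated the same as a corrupted one.
        pass

    # Never repair from a cache entry that is itself damaged.
    if sha256_of(cached_copy) != expected_sha256:
        raise ValueError("cached copy does not match the expected digest")

    shutil.copyfile(cached_copy, target)
    return True
```

The repair is silent in the sense that the caller only learns whether a replacement happened; any user-facing behavior would sit outside this function.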
Slide 13: Windows Vista Disk Diagnostics
Matthew Kerner
Program Manager
Windows Diagnosis
Slide 14: Opportunities For Future Technology
Proactive failure prevention
Reduce scenario pain by enabling resolutions other than just data recovery
Requires finer-grained failure description to help host choose the best resolution
Increase warning time before failures to allow users to save data
Slide 15: Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively
Frank Shu
Slide 16: What Is PRCS?
Proactive Reporting and Correcting Safeguard (PRCS) enables a device and host to correct failure conditions proactively
Device can report hostile conditions before damage or failure occurs
Host reacts to a device event in real time based on policy and user preference
A proposal for the PRCS protocol has been submitted to T13
Slide 17: Why Is PRCS Important?
User’s digital data is more valuable than ever before
Disk drive capacity continues to increase
Not every PC user can afford RAID
Deliver on opportunities for improvements beyond S.M.A.R.T.
Slide 18: Goals Of PRCS
Proactively protect user data
Improve the user experience when data is at risk
Reduce OEM’s customer support costs
Reduce warranty costs for disk drive vendors
Slide 19: PRCS Features
Device monitors its own conditions in real time
Reduces the performance impact of host monitoring
Device sends meaningful PRCS events to the host for correction of hostile conditions and data protection
No translations or guesses required
Host acts on device’s PRCS event proactively according to policy and user preference
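The host side of this model, acting on a device event according to policy and user preference, might look something like the sketch below. The condition names, the action strings, and the precedence order (user preference over administrative policy over built-in defaults) are all assumptions chosen for illustration; none of them are taken from the PRCS proposal submitted to T13.

```python
from enum import Enum


class Condition(Enum):
    """Hypothetical hostile conditions a device could report."""
    TEMPERATURE = "temperature"
    VIBRATION = "vibration"
    SHOCK = "shock"


# Hypothetical built-in responses, used when nothing overrides them.
DEFAULT_POLICY = {
    Condition.TEMPERATURE: "throttle_io_and_warn",
    Condition.VIBRATION: "warn_user",
    Condition.SHOCK: "back_up_now",
}


def choose_action(condition, policy=None, user_preference=None):
    """Pick the host's response to a reported condition.

    Precedence (an assumption of this sketch): user preference first,
    then administrative policy, then the built-in default.
    """
    merged = dict(DEFAULT_POLICY)
    merged.update(policy or {})
    merged.update(user_preference or {})
    return merged[condition]
```

For example, `choose_action(Condition.SHOCK)` falls back to the default, while passing `user_preference={Condition.SHOCK: "warn_user"}` lets the user soften the response for that one condition.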
Slide 20: PRCS Advantages
PRCS is proactive
Takes corrective action before errors occur
Protects data when it is at risk
PRCS is designed for end users, not just computer experts
No need to understand a cryptic message to benefit from PRCS. For example: “The previous self-test completed having the electrical element of the test failed”
PRCS enables transparent mitigation of a hostile condition or a recovery process
Users do not need to configure a self-test mode or reporting method
Users control policy as desired
Slide 21: Proactive Disk Diagnostics
Debasis Baral
Vice President of Engineering
Samsung
Slide 22: HDD Reliability 101
HDD reliability and performance are negatively impacted by extremes in the following operating conditions
Temperature Demo
Vibration Demo
Shock Demo
Duty cycle
Altitude
Humidity
A combination of the above conditions
A history of the above combinations
Slide 23: Reliability Versus Temperature
HDD life decreases with temperature
Failure rates increase exponentially with temperature for all HDD suppliers
Environmental temperature increase from 25 °C to 100 °C could translate into 10 to 50x shorter life
Samsung HDD Lab Engineering Sample Data
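To get a feel for what the 10 to 50x figure implies at smaller temperature steps, the sketch below interpolates it with a single exponential. This is a back-of-the-envelope assumption, not Samsung's model: it assumes life shortens by the same factor for every degree across the whole 75 °C span, which the slide does not state, and the function names are made up for the example.

```python
import math

SPAN_C = 75.0  # 100 °C - 25 °C, the range quoted on the slide


def shortening_factor(delta_t_c: float, total_factor: float) -> float:
    """Life-shortening factor for a rise of delta_t_c degrees.

    Assumes a constant exponential rate, calibrated so that a rise of
    SPAN_C degrees shortens life by total_factor (10 to 50 per the slide).
    """
    return total_factor ** (delta_t_c / SPAN_C)


def halving_interval_c(total_factor: float) -> float:
    """Temperature rise that halves expected life under the same assumption."""
    return SPAN_C * math.log(2) / math.log(total_factor)


if __name__ == "__main__":
    for total in (10.0, 50.0):
        print(
            f"total {total:.0f}x over {SPAN_C:.0f} °C: "
            f"+10 °C shortens life by {shortening_factor(10, total):.2f}x, "
            f"life halves every {halving_interval_c(total):.1f} °C"
        )
```

Under this assumption, expected life halves roughly every 13 to 23 °C, depending on which end of the 10 to 50x range applies.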
Slide 24: Performance Versus Vibration
Data throughput or drive performance can be significantly affected in the presence of vibration
Effect of vibration is reversible
Cumulative effects of vibration on long-term drive reliability are a subject of ongoing research
Samsung HDD Lab Engineering Sample Data
Slide 25: Reliability Versus Shock
Excessive shock is the major cause of failure in both PC and consumer electronics environments
[Figure: Shock modeling, courtesy of E. Jayson and Frank Talke, UC San Diego. Panels: operating shock damage (scratches caused by the corners, leading edge, and side edges of the slider) and non-operating shock damage.]
Slide 26: Reliability Design Guidelines
Failure modes and failure rates of disk drives depend on their operating environments
Temperature and handling (shock and vibration) are major factors impacting HDD reliability
HDD reliability will be enhanced if the OS detects and manages reliability risks and stress events intelligently (PRCS)
Users can improve HDD data reliability by correctly responding to PRCS events
Slide 27: PRCS
Kai Chen
Microsoft Corporation
Debasis Baral
Samsung
Slide 28: Call To Action
Test your drives with Windows Vista Disk Diagnostics and send feedback
Ensure your drives comply with ATA-7 specs to surface device error count and life timestamp
Engage with the Startup Repair team to build a plan for Startup Repair in OEM factory processes
Participate in T13 discussions on PRCS
Plan your device designs in line with PRCS guidelines
Slide 29: Additional Resources
Whitepapers
Windows Recovery Environment/Startup Repair/Built-in Diagnostics: http://www.microsoft.com/technet/windowsvista/evaluate/feat/relperf.mspx
Feedback/Questions
Windows Vista Disk Diagnosis: Dfdfeed@microsoft.com
Corrupt File Recovery: Dfdfeed@microsoft.com
Windows Recovery Environment/Startup Repair: Recovery@microsoft.com
PRCS: Prcsdisc@microsoft.com
Slide 30: © 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.