LVC21F-207 Firmware first error handling on Arm Neoverse Reference design platforms.

Session Abstract

Level: Intermediate Reliability, Availability and Serviceability (RAS) is a measure that defines the robustness of the system. A RAS enabled platform ensures that the system produces correct outputs, is always operational and is easily maintainable. RAS reduces the systems downtime by detecting the hardware errors and correcting them when possible. This talk is about firmware first error handling and proposed framework for dynamic generation of HEST ACPI table which helps enabling it. The HEST and SDEI ACPI tables respectively, publishes the platform error sources and enables error notification mechanism for them. At boot OSPM uses HEST ACPI table to obtain the list of hardware error sources present on the platform. All platform errors sources are added to HEST table as error source descriptors. This design uses GHESv2 type error source descriptor to represent all error sources. On occurrence of a platform hardware error, firmware populates the Common Platform Error Record (CPER) and notifies OSPM about the presence of a hardware error using an SDEI event. The number of hardware error sources supported by platform vary and are not fixed. This calls for a need of having a framework to generate HEST table dynamically. The proposed framework helps dynamic generation and installation of HEST ACPI table. It is achieved by providing interfaces that help dynamically create, append error source descriptors and install HEST ACPI table. This talk delves into the details of this. The firmware-first error handling is demonstrated using a reference implementation for handling DMC's 1-bit DRAM errors.

Session Speakers

Omkar Kulkarni

Software Engineer at ARM (Arm)

I work at ARM in the Open Source Software division. My area of work is predominantly on enabling end-to-end RAS support on Arm Neoverse reference design platforms.

Level: Intermediate Reliability, Availability and Serviceability (RAS) is a measure that defines the robustness of the system. A RAS enabled platform ensures that the system produces correct outputs, is always operational and is easily maintainable. RAS reduces the systems downtime by detecting the hardware errors and correcting them when possible.

This talk is about firmware first error handling and proposed framework for dynamic generation of HEST ACPI table which helps enabling it. The HEST and SDEI ACPI tables respectively, publishes the platform error sources and enables error notification mechanism for them. At boot OSPM uses HEST ACPI table to obtain the list of hardware error sources present on the platform. All platform errors sources are added to HEST table as error source descriptors. This design uses GHESv2 type error source descriptor to represent all error sources. On occurrence of a platform hardware error, firmware populates the Common Platform Error Record (CPER) and notifies OSPM about the presence of a hardware error using an SDEI event.

The number of hardware error sources supported by platform vary and are not fixed. This calls for a need of having a framework to generate HEST table dynamically. The proposed framework helps dynamic generation and installation of HEST ACPI table. It is achieved by providing interfaces that help dynamically create, append error source descriptors and install HEST ACPI table.

This talk delves into the details of this. The firmware-first error handling is demonstrated using a reference implementation for handling DMC’s 1-bit DRAM errors.

comments powered by Disqus

Other Posts

Sign up. Receive Updates. Stay informed.

Sign up to our mailing list to receive updates on the latest Linaro Connect news!