«mce: [Hardware Error]: Machine check events logged» appears in syslog. What should I do?
I have installed the latest version of OSSEC (2.8.1) and enabled email notifications. I am now getting loads of notifications like the one below, saying that there is a hardware error and mentioning mce:
OSSEC HIDS Notification.
2015 Apr 04 20:09:22
Received From: Bath-Towel->/var/log/syslog
Rule: 1002 fired (level 2) -> "Unknown problem somewhere in the system."
Portion of the log(s):
Apr 4 20:09:21 Bath-Towel kernel: [ 1873.680872] mce: [Hardware Error]: Machine check events logged
--END OF NOTIFICATION
So what exactly does this mean? What does mce stand for? And is this apparent hardware error anything that I should worry about?
OS Information:
Description: Ubuntu 14.10
Release: 14.10
You will need to do a bit of reading on ossec, see the rules — ossec-docs.readthedocs.org/en/latest/manual/rules-decoders . The web interface helps as it has a number of explanations — ossec.net/wiki/index.php/OSSECWUI:Install
This is not about OSSEC at all. You got that notification because OSSEC found the word «error» in syslog. Although I don’t think it is off-topic, you’ll probably get more help from Unix & Linux or Server Fault.
@bodhi.zazen All it has to do to be on-topic is run on Ubuntu. Now that doesn’t mean you’ll get an answer of course.
1 Answer
A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer’s central processing unit detects a hardware problem.
Your computer experienced a hardware error and the kernel logged an event in a buffer. You can use mcelog to log and view the machine check events. From the mcelog manpage:
X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an integrated memory controller, data transfer errors on the front side bus or CPU interconnect or other internal errors. Possible causes can be cosmic radiation, instable power supplies, cooling problems, broken hardware, running systems out of specification, or bad luck.
Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected errors cause machine check exceptions which may kill processes or panic the machine. A small number of corrected errors is usually not a cause for worry, but a large number can indicate future failure.
When a corrected or recovered error happens the x86 kernel writes a record describing the MCE into a internal ring buffer available through the /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes them into a human readable format and prints them on the standard output or optionally into the system log.
If you didn’t notice any crash, probably the error was successfully corrected. Still, I advise you to install mcelog to keep track of such events:
sudo apt-get install mcelog
The events will be logged to /var/log/mcelog. You can also run:
sudo mcelog --client
to query the mcelog daemon for errors.
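To judge whether this is «a small number» of events or a steady stream, a minimal sketch along these lines could tally the kernel's notice per day. The function name `count_mce_per_day` is made up for illustration, the second and third sample lines are invented, and the pattern assumes the traditional syslog timestamp format seen in the notification above.

```python
import re
from collections import Counter

# Traditional syslog prefix ("Apr  4 20:09:21 host ...") followed by the
# MCE notice quoted in the question. Other timestamp formats are not handled.
MCE_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r".*mce: \[Hardware Error\]: Machine check events logged"
)


def count_mce_per_day(lines):
    """Return a Counter mapping 'Mon D' to the number of MCE notices."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.match(line)
        if match:
            day = f"{match.group('month')} {int(match.group('day'))}"
            counts[day] += 1
    return counts


# The first line is taken from the question; the other two are invented samples.
sample = [
    "Apr  4 20:09:21 Bath-Towel kernel: [ 1873.680872] mce: [Hardware Error]: Machine check events logged",
    "Apr  4 20:14:21 Bath-Towel kernel: [ 2173.680872] mce: [Hardware Error]: Machine check events logged",
    "Apr  5 08:00:01 Bath-Towel CRON[1234]: (root) CMD (true)",
]
print(dict(count_mce_per_day(sample)))
```

On a real system the same function can be fed an open file, for example `count_mce_per_day(open("/var/log/syslog"))`, assuming the log is readable by the current user.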
How to diagnose and fix Kernel Panic Fatal Machine Check error?
I have got a new Samsung Series 7 laptop with dual boot setup for Windows 8 and Ubuntu 12.10. A fine machine comparable to a Macbook Pro. The Ubuntu installation was quite a hassle, but with the help of Boot Repair finally it seemed to work. Or so I thought. Windows 8 starts fine, but if I want to start Ubuntu regularly the following Machine Check Exception error occurs, quite similar to this one
[Hardware Error] CPU 1: Machine Check Exception: 5 Bank 6
[Hardware Error] RIP !inexact! 33
[Hardware Error] TSC 95b623464c ADDR fe400 MISC 3880000086
.. [similar messages for CPU 2,3 and 0] ..
[Hardware Error] Machine Check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine Check
Rebooting in 30 seconds
Kernel panic does not sound good. Then it starts to reboot, and the second boot attempt often works. Is it a kernel or driver problem? The laptop has an Intel Core i7 processor.
I already deactivated Hyperthreading in the BIOS, but it does not seem to help 🙁 I also disabled the Execute Disable Bit (EDB) flag in the BIOS. EDB is an Intel hardware-based security feature that can help reduce system exposure to viruses and malicious code. Since I disabled it, the error has occurred less frequently, but it still appears occasionally 🙁
It seems to be the same error as described here and here. Maybe a Samsung-specific kernel problem? A similar error also happens on a Samsung Ultrabook Series 9 (which seems to be kernel bugs 49161 and 47121). On my Samsung Series 7, it still occurs, for instance, when booting on battery after «Checking battery state». Perhaps someone else has an idea? These kernel panic errors are really annoying.
Machine Check Error occurs when installing Ubuntu (include image of log)
If any more information is helpful, please let me know. I really appreciate your help!
Edit: Just to be clear, my computer does not have any OS installed yet. I am building it from scratch. I encountered this problem when I was trying to install Ubuntu. Later, I made a Windows USB stick, but it didn’t work either. After the Windows logo was displayed for 5 seconds, the screen went black and nothing happened.
2 Answers
The first step to decoding Machine Check Exception errors is to install mcelog and run that:
sudo apt-get install mcelog
sudo mcelog --ascii
Maybe that will provide something more human readable.
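When mcelog is not at hand, the header line of a panic message can be picked apart by hand. The sketch below assumes the usual kernel print format, in which the hexadecimal number after «Machine Check Exception:» is the global MCG_STATUS value with the architectural bits RIPV (bit 0), EIPV (bit 1) and MCIP (bit 2). The helper name `parse_mce_header` is hypothetical, and the sample line is the kind of header a fatal MCE prints.

```python
import re

# Architectural bits of the global machine-check status register (MCG_STATUS).
MCG_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

# Assumed kernel format: "CPU <n>: Machine Check Exception: <mcgstatus hex> Bank <n>"
HEADER = re.compile(r"CPU (\d+): Machine Check Exception: ([0-9a-fA-F]+) Bank (\d+)")


def parse_mce_header(line):
    """Extract CPU, bank and decoded MCG_STATUS flags from a panic header line."""
    match = HEADER.search(line)
    if match is None:
        return None
    mcg_status = int(match.group(2), 16)
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(3)),
        "mcg_flags": [name for bit, name in MCG_BITS.items() if (mcg_status >> bit) & 1],
    }


print(parse_mce_header("[Hardware Error] CPU 1: Machine Check Exception: 5 Bank 6"))
```

A value of 5 sets MCIP and RIPV but not EIPV, which is consistent with a kernel reporting the instruction pointer as inexact. This only covers the header; the per-bank status word still needs mcelog or the vendor documentation.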
MCE errors are usually caused by hardware issues. However, on Haswell, Broadwell and Skylake processors, they can also be caused by outdated firmware that lacks the microcode needed to work around processor errata/defects. The Xeon E5-v3 processor does have several MCE-generating errata, so it requires reasonably up-to-date firmware to get a microcode capable of supporting Linux.
The procedures to deal with possible hardware defects are well known, and you will find lots of information and guides on the net if you search for them. I will answer from the microcode/firmware angle, which is a lot less well known.
Assuming you are not doing anything as idiotic as insisting on trying to overclock/undervolt/underclock a system that is reporting MCE errors (i.e. ensure every overclock feature of the mainboard is inactive):
- Install the latest firmware (BIOS/UEFI) update from the system vendor, or chances are you will not even manage to install a Linux distro because it will crash before the end of the install (or corrupt the installed image).
If you installed that Xeon on a desktop board (which appears to be the case, since EVGA is not known to make server/workstation-class hardware), well, you may have to pester the motherboard vendor for a new BIOS version with the latest microcode and memory controller firmware from Intel, or hack that BIOS yourself to update its built-in microcode with the latest available from Intel — search for BIOS modding forums for help, but do try talking to EVGA first, an official BIOS is much better.
- Install the intel-microcode package («CPU microcode driver») when prompted for it. As long as the firmware has new enough microcode to finish installing Ubuntu and boot the system without crashing, the intel-microcode package can be used to fix most remaining microcode issues.
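To see which microcode revision the CPU is actually running after a firmware or package update, one option is to read the `microcode` field that x86 Linux kernels expose in /proc/cpuinfo. The following is only a sketch: the function name `microcode_revisions` is made up, and the sample text with revision 0x38 is invented for illustration, not a recommended or known-good value.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct 'microcode' values found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The field is printed in hex, e.g. "0x38".
            revisions.add(int(value.strip(), 16))
    return revisions


# Invented sample resembling two logical CPUs in /proc/cpuinfo.
sample_cpuinfo = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "microcode\t: 0x38\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
    "microcode\t: 0x38\n"
)
print([hex(rev) for rev in sorted(microcode_revisions(sample_cpuinfo))])
```

On a live system the input would be `open("/proc/cpuinfo").read()`. More than one value in the result would suggest that not all cores received the same update.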
machine check error
My laptop has started crashing periodically, complaining about the error in the subject line. So far mcelog has a single entry:
MCE 0
CPU 0 BANK 0
ADDR 7f6d13f93000
TIME 1359834885 Sat Feb 2 23:54:45 2013
STATUS f600000000010015 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 20 Model 1
If this is a server, you need to look at the IPMI logs.
I had MCEs because cheap Kingston memory had been installed and it started glitching.
It's a laptop. Two passes of memtest86 found no errors. What other diagnostics are worth running?
Check whether the memory is overclocked.
By the way, memtest found nothing for me in 15 passes,
whereas simply running Linux hung the machine within 15-20 minutes.
The heatsink looks clean, and it doesn't look like overheating: it crashes at 56 degrees, while at 63 the fan roars but it keeps working. I swapped the memory and that didn't help.
If it's not the memory, then it's either the GPU overheating or something on the motherboard. The motherboard could have capacitor problems or some kind of flexing, which, by the way, could also affect the video chip.
Also try pulling out, cleaning and firmly reseating any miniPCI or USB daughterboards and the like, for example the Wi-Fi card.
Change the kernel, or boot Windows and see whether it crashes there too.
Two passes of memtest86 found no errors. What other diagnostics are worth running?
memtest86 does anything but detect errors. I say this with authority, having tested it myself.
Test with S&M. I had a memory stick that made the computer glitch. memtest86 found nothing in a full day of running, while S&M found a fault in 2-3 minutes.
According to «BKDG for AMD Family 14h Models 00h-0Fh Processors (PUB)», this is Error code type: TLB, Transaction Type: Data, Cache level: L1. In other words, a virtual-to-physical address translation error occurred while working with the level-1 data cache.
Given that on a TLB miss the processor goes to memory as many as 4 times, my bet is on faulty memory or unstable power.
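The reading of the STATUS value from the mcelog record above can be reproduced with a few bit operations. This is a sketch assuming the architectural x86 MCi_STATUS layout (flag bits 63 to 57, MCA error code in the low 16 bits, TLB errors encoded as 0000 0000 0001 TTLL). The function name `decode_status` is made up, only the TLB error class is handled, and the BKDG remains the authority for family-specific details.

```python
# Architectural flag bits in the upper part of an MCi_STATUS value.
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register is valid
    58: "ADDRV",  # the ADDR register is valid
    57: "PCC",    # processor context corrupt
}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}


def decode_status(status):
    """Decode the flag bits and, for TLB errors, the TT/LL fields of a status word."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code
    decoded = {"flags": flags, "error_code": code}
    # TLB errors are encoded as 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        decoded["type"] = "TLB"
        decoded["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "Reserved")
        decoded["level"] = LEVEL[code & 0x3]
    return decoded


print(decode_status(0xF600000000010015))
```

For f600000000010015 this yields a TLB error with a Data transaction at level L1, matching the BKDG lookup. It also shows the UC and PCC flags set, meaning the error was uncorrected and the processor context was corrupt, which fits a machine that crashes rather than one that merely logs the event.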