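This Oracle server was deployed by another developer and landed on me with no handover.
The OS is CentOS 7 with GNOME installed.
A while back it kept failing. I checked IPMI, found it was a hardware problem, and had the vendor replace the hardware.
But when I came back this week I found it had gone down again on Friday night: a hard hang, much the same state as before. After a warm reboot I found that
as soon as I log into the oracle account and run dbstart, the system goes into the crash flow, saves a dump file, and reboots.
I checked the log files and found only a large number of desktop- and GNOME-related errors. I didn't look at the dump file, and wouldn't know how to read it anyway.
So I figured I'd just uninstall GNOME and see, but every yum remove run crashed and rebooted the machine.
I changed tack: I set the system to boot to the console by default, the uninstall succeeded, and starting the database from the oracle account worked normally. It has been 2 hours now and the earlier failure hasn't come back so far.
Because I can't read the dump file, I have no way to dig into the root cause, so I'm asking the experts here on V2: can GNOME crash the whole system? This is an isolated internal network, and the machine has been running all along without updates, so why did it suddenly blow up?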
1
barathrum 2020-09-21 12:48:16 +08:00
The crash dump comes with a dmesg that has the basic panic log. Take a look at that first; it's not necessarily a GNOME problem.
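A rough sketch of that first look, assuming kdump left a text log next to the vmcore (on CentOS 7 this is usually `vmcore-dmesg.txt` under `/var/crash/`). The function name `first_look` and the pattern list are made up for illustration and are not exhaustive:

```python
import re

# Strings that usually mark the interesting lines in a kernel log.
# This list is an assumption for illustration, not a complete set.
KEY_PATTERNS = re.compile(r"Kernel panic|Hardware Error|BUG:|Machine check|Oops")


def first_look(dmesg_text):
    """Return the log lines that match any key pattern, in original order."""
    return [line for line in dmesg_text.splitlines() if KEY_PATTERNS.search(line)]


# Example use (the path is hypothetical):
#   with open("/var/crash/<dir>/vmcore-dmesg.txt", errors="replace") as fh:
#       print("\n".join(first_look(fh.read())))
```

Anything tagged `[Hardware Error]` in the result points away from the desktop stack and toward the machine itself.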
|
2
CallMeReznov OP @barathrum #1
``
[255089.021729] swap_info_get: Bad swap offset entry 3ffffffffffa7
[255089.021737] BUG: Bad page map in process crond pte:0000b000 pmd:355de7a067
[255089.021744] addr:00007f68efc7b000 vm_flags:08000070 anon_vma: (null) mapping:ffff88b2fafd9550 index:18d
[255089.021811] vma->vm_ops->fault: xfs_filemap_fault+0x0/0x30 [xfs]
[255089.021850] vma->vm_file->f_op->mmap: xfs_file_mmap+0x0/0x80 [xfs]
[255089.021856] CPU: 28 PID: 84674 Comm: crond Kdump: loaded Not tainted 3.10.0-1062.el7.x86_64 #1
[255089.021860] Hardware name: Inspur NF5280M5/Curry, BIOS 4.1.13 01/16/2020
[255089.021863] Call Trace:
[255089.021875] [<ffffffffb1379262>] dump_stack+0x19/0x1b
[255089.021884] [<ffffffffb0dea641>] print_bad_pte+0x1f1/0x290
[255089.021890] [<ffffffffb0ded22b>] unmap_page_range+0x87b/0xc80
[255089.021897] [<ffffffffb0ded6b1>] unmap_single_vma+0x81/0xf0
[255089.021903] [<ffffffffb0dee929>] unmap_vmas+0x49/0x90
[255089.021909] [<ffffffffb0df90dc>] exit_mmap+0xac/0x1a0
[255089.021918] [<ffffffffb0e50335>] ? flush_old_exec+0x3b5/0x940
[255089.021927] [<ffffffffb0c971f7>] mmput+0x67/0xf0
[255089.021932] [<ffffffffb0e50470>] flush_old_exec+0x4f0/0x940
[255089.021941] [<ffffffffb0eae0ac>] load_elf_binary+0x33c/0xd90
[255089.021951] [<ffffffffc04c6064>] ? load_misc_binary+0x64/0x460 [binfmt_misc]
[255089.021958] [<ffffffffb0f27543>] ? ima_get_action+0x23/0x30
[255089.021963] [<ffffffffb0f26a5e>] ? process_measurement+0x8e/0x250
[255089.021968] [<ffffffffb0f26f19>] ? ima_bprm_check+0x49/0x50
[255089.021974] [<ffffffffb0e4faca>] search_binary_handler+0x9a/0x1c0
[255089.021980] [<ffffffffb0e511c6>] do_execve_common.isra.24+0x616/0x880
[255089.021986] [<ffffffffb0e516c9>] SyS_execve+0x29/0x30
[255089.021994] [<ffffffffb138c478>] stub_execve+0x48/0x80
[255089.021998] Disabling lock debugging due to kernel taint
[318488.055221] perf: interrupt took too long (2516 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[320413.845970] perf: interrupt took too long (3164 > 3145), lowering kernel.perf_event_max_sample_rate to 63000
[342727.130342] perf: interrupt took too long (3970 > 3955), lowering kernel.perf_event_max_sample_rate to 50000
[354284.998982] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 14: b200000000020405
[354284.999037] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffb13812ab> {native_safe_halt+0xb/0x20}
[354284.999088] mce: [Hardware Error]: TSC 2ea6e1d80d148
[354284.999114] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1600169230 SOCKET 1 APIC 20 microcode 2000065
[354284.999159] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[354285.001478] mce: [Hardware Error]: Machine check: Processor context corrupt
[354285.001512] Kernel panic - not syncing: Fatal machine check
``
|
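For anyone reading the `Bank 14: b200000000020405` value in the log above without `mcelog` at hand, here is a minimal sketch that pulls the architectural flag bits out of a machine-check status word. It assumes the value is an IA32_MCi_STATUS register with the bit layout documented in the Intel SDM. The function name `decode_mci_status` is made up for illustration, and the sketch does not try to interpret the model-specific parts:

```python
# Architectural flag bits of IA32_MCi_STATUS (bit positions per the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds an address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status word into flag booleans and error-code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["mca_error_code"] = status & 0xFFFF              # bits 0-15
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 16-31
    return fields


# Example: decode_mci_status(0xb200000000020405)
```

For the value in the log, this reports UC and PCC set, which lines up with the "Processor context corrupt" message just before the panic.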
3
barathrum 2020-09-21 13:42:43 +08:00 1
Honestly, this still looks like a hardware error. Why not go back to Inspur and have them take another look?
|
4
CallMeReznov OP @barathrum #3 Inspur's people are already on their way. The CPUs don't seem right either, so this time I'll have them replace all the CPUs and see.
|