Oracle 19.8 new feature: "asmcmd credverify" and "asmcmd credfix"

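Platform: Red Hat Linux 7.6

Database version: Oracle 19.10

Problem description: In an experiment that simulated replacing the OCR disks by wiping them with dd, the original OCR disk group was dropped. The ASM spfile and the ASM password file had not been migrated to the new OCR disk group along the way. As a result, on one node of the two-node cluster, CRS could only be started manually.

Note: While simulating the OCR disk failure, we found that the OCR backups were stored in the OCR disk group itself. Once the OCR disk group is damaged, those backups cannot be read during recovery, so the usual restore procedure cannot be used. (Reference: 2717306.1 2256199.1 1929673.1 2264675.1)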

1. Check the cluster state with the following command. The crsd component and ora.storage are abnormal; everything else is normal

crsctl stat res -t -init
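To spot the abnormal components quickly, the output can be filtered. The following is a minimal sketch, not part of the original procedure. It assumes the non-tabular form of the command (without -t), which prints NAME=, TARGET= and STATE= lines per resource. The function name list_not_online is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print every init resource whose STATE does not start with ONLINE.
# Assumption: input is the key=value output of `crsctl stat res -init`
# (no -t), i.e. blocks of NAME=..., TYPE=..., TARGET=..., STATE=... lines.
# list_not_online is a hypothetical helper name, not an Oracle tool.
list_not_online() {
  awk -F= '
    $1 == "NAME"  { name = $2 }                       # remember current resource
    $1 == "STATE" { if ($2 !~ /^ONLINE/) print name ": " $2 }
  '
}

# Usage on the affected node:
#   crsctl stat res -init | list_not_online
```

In the situation described here, one would expect ora.crsd and ora.storage to be the resources reported.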

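2. Check the crsd log to find the problem

This is the first concrete cause of the incident.

3. Inspect the GPnP profile to double-check the spfile location

The profile shows that the recorded file location is correct.

4. Start the ASM instance from an init.ora file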

[grid@bjyctzdb10 ~]$ cat init.ora
+ASM1.__large_pool_size=20971520
+ASM2.__large_pool_size=20971520
+ASM1.__oracle_base='/u02/app/grid'#ORACLE_BASE set from in memory value
+ASM2.__oracle_base='/u02/app/grid'#ORACLE_BASE set from in memory value
+ASM1._asm_max_connected_clients=4
+ASM2._asm_max_connected_clients=4
*.asm_diskgroups='ARCH','DATA','OCR_NEW'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/'
*.asm_power_limit=1
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
SQL> startup pfile='/home/grid/init.ora';
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL>

5. Check the spfile location reported by the instance

# No spfile location is set here. This may be why the spfile cannot be found and the cluster fails to start.

SQL> show parameter spfile
NAME TYPE VALUE
-------------------------------------------
spfile string

Set the spfile location.

SQL> alter system set spfile='+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999';
System altered.
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string +OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999
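The empty VALUE column is easy to overlook, so the check can be scripted. This is a rough sketch that assumes the three-column layout shown in the transcript (NAME, TYPE, VALUE on one line); the helper names spfile_value and spfile_is_set are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: extract the VALUE column of the spfile row from the text that
# `show parameter spfile` prints. Assumes name, type and value sit on one
# line, as in the transcript; very long values may wrap in real sessions.
# spfile_value / spfile_is_set are hypothetical helper names.
spfile_value() {
  awk '$1 == "spfile" && $2 == "string" { print $3 }'
}

# Succeeds (exit 0) only when a non-empty spfile value is present on stdin.
spfile_is_set() {
  [ -n "$(spfile_value)" ]
}

# Usage as the grid owner with the ASM environment set:
#   echo "show parameter spfile" | sqlplus -S / as sysasm | spfile_value
```

An empty result corresponds to the state found in step 5, where the instance was started from a pfile and no spfile was registered.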

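6. Restart the cluster and observe again (the ASM instance has to be restarted for the change to take effect)

Same problem as before.

7. Start the ASM instance from init.ora again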

SQL> startup pfile='/home/grid/init.ora';
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string
SQL> alter system set spfile='+OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999';
System altered.
SQL> create pfile='/home/grid/init.ora' from spfile;
File created.
SQL> shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup
ASM instance started
Total System Global Area 1137173312 bytes
Fixed Size 8905536 bytes
Variable Size 1103101952 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
SQL> show parameter spfile
NAME TYPE VALUE
--------------------------------------------
spfile string +OCR_NEW/bjyctzd-cluster/ASMPARAMETERFILE/registry.253.1072795999
SQL>

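8. After another cluster restart the problem is still there: the ASM instance has to be started manually, and crsd can start only after the ASM instance is up. Check the logs to find the problem

9. Check the ASM password file location

The password file and its backup are both still in the old OCR disk group, which has already been dismounted.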

[grid@bjyctzdb10 ~]$ srvctl config asm
ASM home:
Password file: +OCR/orapwASM
Backup of Password file: +OCR/orapwASM_backup
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb10 ~]$

Change the ASM password file location and the backup location.

[root@bjyctzdb10 ~]# srvctl modify asm -pwfile OCR_NEW
[root@bjyctzdb10 ~]# srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: +OCR/orapwASM_backup
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[root@bjyctzdb10 ~]#
[root@bjyctzdb10 ~]# srvctl modify asm -pwfilebackup ocr_new
[root@bjyctzdb10 ~]# srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[root@bjyctzdb10 ~]#

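10. Stop the cluster and start it again

11. The logs show that the ASM disk group still cannot be opened because the password file cannot be found

12. Take a closer look at the ASM password file

Here we can see that Password file: is followed only by the disk group name, not by an actual password file.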

[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb09 ~]$ exit
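The mistake in this step (a bare disk group name where a file path is expected) can be caught mechanically. The sketch below parses the `srvctl config asm` text shown in this article; pwfile_location and is_full_asm_path are hypothetical helper names, and the rule "starts with + and has a file name after a slash" is an assumption derived from the working value +OCR_NEW/orapwasm seen later in the article.

```shell
#!/usr/bin/env bash
# Sketch: read `srvctl config asm` output on stdin and print the value of
# the "Password file:" line (the "Backup of Password file:" line is skipped
# because it does not start with "Password file:").
# Both function names are hypothetical, not Oracle tools.
pwfile_location() {
  sed -n 's/^Password file:[[:space:]]*//p'
}

# Succeeds only for values shaped like +DISKGROUP/filename.
# Assumption: a usable location starts with "+" and names a file after "/".
is_full_asm_path() {
  case "$1" in
    +*/?*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage:
#   loc="$(srvctl config asm | pwfile_location)"
#   is_full_asm_path "$loc" || echo "Password file location looks incomplete: '$loc'"
```

With the output above, the extracted value is OCR_NEW, which fails the check.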

13. Enter asmcmd and check whether a password file exists

The listing below shows that there is no ASM password file.

ASMCMD> cd ocr_new
ASMCMD> ls
bjyctzd-cluster/

14. Create a new password file

The tool reports that a password file already exists.

[grid@bjyctzdb09 ~]$ orapwd file='+ocr_new/orapwasm' asm=y
Enter password for SYS:
OPW-00010: Could not create the password file. This resource has a Password File.

From this output we can see that the registered password file is OCR_NEW.

[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: OCR_NEW
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb09 ~]$ exit

15. Based on the message above, reset the ASM password file location to empty

[grid@bjyctzdb09 ~]$ srvctl modify asm -pwfile
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file:
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM

16. Create the password file again and recheck its location

Now we can see that the password file is registered on both node 1 and node 2.

[grid@bjyctzdb09 ~]$ orapwd file='+ocr_new/orapwasm' asm=y
Enter password for SYS:
[grid@bjyctzdb09 ~]$ srvctl config asm
ASM home:
Password file: +OCR_NEW/orapwasm
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM
[grid@bjyctzdb10 ~]$ srvctl config asm
ASM home:
Password file: +OCR_NEW/orapwasm
Backup of Password file: ocr_new
ASM listener: LISTENER
ASM instance count: 3
Cluster ASM listener: ASMNET1LSNR_ASM

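At this point the password file is fully back in place.

17. Restart the cluster again and check whether the crsd resource comes up automatically

It fails, for the same reason as before.

18. The specific error messages this time

Errors in alert.log.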

2021-05-17 19:09:52.326 [CRSD(123152)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:52.533 [CRSD(123176)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123176
2021-05-17 19:09:58.341 [CRSD(123176)]CRS-1013: The OCR location in an ASM disk group is inaccessible. Details in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:58.346 [CRSD(123176)]CRS-0804: Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage Storage layer error [Insufficient quorum to open OCR devices] [0]]. Details at (:CRSD00111:) in /u02/app/grid/diag/crs/bjyctzdb09/crs/trace/crsd.trc.
2021-05-17 19:09:58.555 [CRSD(123220)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 123220
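Because crsd keeps restarting, the alert log repeats the same few messages. A small filter can reduce it to the distinct message codes. This is only a sketch; crs_error_codes is a made-up name, and the alert log path is left as a placeholder because the article does not give it.

```shell
#!/usr/bin/env bash
# Sketch: list the distinct CRS-nnnn and PROC-nn codes found on stdin.
# crs_error_codes is a hypothetical helper name.
crs_error_codes() {
  grep -oE '(CRS|PROC)-[0-9]+' | LC_ALL=C sort -u
}

# Usage (replace the placeholder with the real alert log location):
#   tail -n 200 /path/to/alert.log | crs_error_codes
```

For the excerpt above this yields CRS-0804, CRS-1013, CRS-8500 and PROC-26, which puts the OCR access failure ahead of the restart noise.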

Errors in ohasd_orarootagent_root.trc.

2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Error [kgfoAl06] in [kgfokge] at kgfo.c:3180
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] ORA-01017: invalid username/password; logon denied
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] Category: 7
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] DepInfo: 1017
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] -- trace dump end --
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isRunning is reset to false here
2021-05-17 18:58:56.791 : USRTHRD:2394859264: [ INFO] {0:5:3} Thread:kgfoCheckMountExt isFinished set to true
2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask m_cv.timewait returned
2021-05-17 18:58:56.791 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Tasklet::doTask(kgfoCheckMountExt) executed in 3 seconds
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::parsekgforetcodes retcode = 7, kgfoCheckMount(OCR_NEW), flag 2
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (null) category: 7, operation: kgfoAl06, loc: kgfokge, OS error: 1017, other: ORA-01017: invalid username/password; logon denied
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 260 kgfo returncode 1
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] (:CLSN00140:)StorageAgent::check 300 parsekgforretcodes OCR dgName OCR_NEW state 1
2021-05-17 18:58:56.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] Storage::start waiting for check to not return PARTIAL or UNPLANNED_OFFLINE 1
2021-05-17 18:58:57.791 :CLSDYNAM:3566778112: [ora.storage]{0:5:3} [start] StorageAgent::check 160 NODEROLE_HUB getOCRdetails
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread constructor exit this:9c07c820 m_pThnd:0 m_thndMX:9c07c840, m_tintMX:9c07c890 &m_postMX:0x7f0a9c07c870
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets START
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor entry { this:0x7f0a9c03a8e0 m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900, m_tintMX:0x7f0a9c03a950 &m_postMX:0x7f0a9c03a930
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor stop
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop { m_pThnd:0x7f0a9c0851f0 tid:2394859264 running:0 alive:0
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread is not running:0 or alive:0
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread acquire m_pThnd:0x7f0a9c0851f0 m_thndMX:0x7f0a9c03a900
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread stop:thread sltstjn
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt stop release m_thndMX:0x7f0a9c03a900 m_tid:2394859264}
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt thread destructor m_pThnd:(nil) exit }
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} TaskletObjManager::manageTasklets END
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start { acquire thndMX:9c07c840
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawn pThnd:0x7f0a9c04dff0 thndType:1
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start thread spawned tid:2394859264
2021-05-17 18:58:57.793 : USRTHRD:3566778112: [ INFO] {0:5:3} Thread:kgfoCheckMountExt Thread::start spawned release thndMX:9c07c840 }
2021-05-17 18:58:57.799 : CLSNS:2394859264: clsns_SetTraceLevel:trace level set to 1.
2021-05-17 18:58:57.827 : USRTHRD:2394859264: [ INFO] {0:5:3} [ora.storage] 9607 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS
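Most of this trace is thread bookkeeping. The lines that matter are the ORA- errors raised by the ora.storage resource. The sketch below pulls out only those; storage_ora_errors is an invented name, and the trace path is a placeholder because the article names only the file.

```shell
#!/usr/bin/env bash
# Sketch: from an orarootagent trace on stdin, keep lines tagged
# [ora.storage] and print each distinct ORA-nnnnn message once.
# storage_ora_errors is a hypothetical helper name.
storage_ora_errors() {
  grep -F '[ora.storage]' \
    | grep -oE 'ORA-[0-9]{5}: .*' \
    | LC_ALL=C sort -u
}

# Usage (placeholder directory):
#   storage_ora_errors < /path/to/ohasd_orarootagent_root.trc
```

Applied to the excerpt above, the only message left is ORA-01017: invalid username/password; logon denied, which points at credentials rather than at the disks.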

19. Check the user names and privileges in the ASM instance

This reveals the second concrete cause of the incident.

[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE

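For comparison, the users of a healthy ASM instance are as follows (this holds for 12c and later; 11g was not verified).

20. Add the missing users and grant the corresponding privileges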

[root@bjyctzdb09 dbs]# asmcmd orapwusr --add ASMSNMP
Enter password: *******
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba ASMSNMP
[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
[root@bjyctzdb09 dbs]# orapwusr --add CRSUSER__ASM_001
-bash: orapwusr: command not found
[root@bjyctzdb09 dbs]# asmcmd orapwusr --add CRSUSER__ASM_001
Enter password: ******
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysdba CRSUSER__ASM_001
[root@bjyctzdb09 dbs]# asmcmd orapwusr --grant sysasm CRSUSER__ASM_001

The users are now all present, and their privileges look correct.

[root@bjyctzdb09 dbs]# asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
CRSUSER__ASM_001 TRUE FALSE TRUE
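The listing can also be checked by script instead of by eye. The sketch below assumes the four-column layout shown above, with the header on the first line; has_priv is a hypothetical helper name. Note that it confirms only that a user and privilege are listed. It cannot tell whether the password stored for CRSUSER__ASM_001 matches what the cluster holds in the OCR and OLR.

```shell
#!/usr/bin/env bash
# Sketch: has_priv USER PRIV
# Reads `asmcmd lspwusr` output on stdin and succeeds (exit 0) when USER is
# listed with TRUE in the PRIV column (sysdba, sysoper or sysasm).
# Assumes the first line is the header "Username sysdba sysoper sysasm".
# has_priv is a hypothetical helper name.
has_priv() {
  awk -v user="$1" -v priv="$2" '
    NR == 1    { for (i = 1; i <= NF; i++) col[tolower($i)] = i; next }
    $1 == user { found = ($(col[priv]) == "TRUE") }
    END        { exit (found ? 0 : 1) }
  '
}

# Usage:
#   asmcmd lspwusr | has_priv CRSUSER__ASM_001 sysasm && echo "listed with sysasm"
```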

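21. Restart CRS again and check whether it starts automatically

It fails with the same error as before.

22. Further reading of the MOS documents reveals the final cause

Version 19.8, and any later RU, brings a new feature: the "asmcmd credverify" and "asmcmd credfix" commands, which are needed when the ASM password file is recreated.

Run the following command. It checks whether the credentials in the OCR, the OLR, and the password file are consistent, and verifies that credentials for connecting to the cluster exist.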

[grid@bjyctzdb10 ~]$ asmcmd --nocp credverify
credverify: Cannot connect using credentials for CRSUSER__ASM_001, please run 'credfix' to fix the credentials.
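When this check is scripted, the decision can be based on the message text. The article does not show what exit status credverify returns, so this sketch deliberately matches on the wording seen above; needs_credfix is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) when credverify output on stdin asks for credfix.
# Matching on the message text is an assumption made because the exit status
# of `asmcmd credverify` is not shown in this article.
# needs_credfix is a hypothetical helper name.
needs_credfix() {
  grep -Eq "please run .?credfix"
}

# Usage as the grid owner:
#   if asmcmd --nocp credverify 2>&1 | needs_credfix; then
#     echo "credentials are inconsistent; consider: asmcmd --nocp credfix"
#   fi
```

The sketch only reports. Running credfix is left as a manual step, since it rewrites credentials in the OCR and OLR.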

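According to this message, the credentials for CRSUSER__ASM_001 cannot be used, and the following command has to be run to repair them.

The output shows that the user CRSUSER__ASM_001 was deleted, a new user CRSUSER__ASM_002 was created, and the credentials for connecting to the cluster were repaired.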

[grid@bjyctzdb10 ~]$ asmcmd --nocp credfix
credfix: Failed to connect using credentials for CRSUSER__ASM_001.
op=addcrscreds wrap=/tmp/creds0.xml
credfix: Creating new credentials, no valid credentials in OCR.
credfix: New user CRSUSER__ASM_002 created.
credfix: Starting SSH session on node bjyctzdb09.
credfix: OLR for bjyctzdb09 has been fixed if credentials were created incorrectly. Exiting SSH session.
op=credimport wrap=/tmp/creds0.xml olr=true force=true
credfix: OLR for bjyctzdb10 has been fixed if credentials were created incorrectly.
op=delcrscreds crs_user=CRSUSER__ASM_001
credfix: Deleted CRSUSER__ASM_001 from OCR.
credverify: starting SSH session on node bjyctzdb09
credverify: Credentials created correctly on bjyctzdb09. Exiting SSH session.
credverify: Credentials created correctly on bjyctzdb10.
credfix: Credentials have been fixed if they were created incorrectly

23. Check the users in the ASM instance again

[grid@bjyctzdb10 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
SYS TRUE TRUE TRUE
ASMSNMP TRUE FALSE FALSE
CRSUSER__ASM_002 TRUE FALSE TRUE

24. Restart the cluster again and check whether CRS starts automatically

With that, the problem is fully resolved!

Original article on 墨天轮: https://www.modb.pro/db/65395
