Troubleshooting
- Sabine Engel
- Thomas Sikora
- Dominik Karch
Scanner Issues
Detecting a scanner problem is easy because the affected scanners are listed at the top of the scanner homepage in the WebUI. Finding the root cause of the failure can be more difficult. Here are some tips for troubleshooting scanner issues:
Open the WebUI and go to SCANNER.
Critical configurations are listed at the top of the page. There are two buttons for each scan category (Topology, Performance, Event): the left one shows the scan status, the right one the persist status.
Click on a red icon to see the log file for the selected item.
Set severity to "Error" to limit the amount of information.
In most cases, the error can now be narrowed down to either a configuration issue or a product defect:
- IP address or password incorrect?
- insufficient user rights to collect the requested data?
- wrong protocol type?
- access issues due to firewall or other limitations?
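The firewall question in the list above can be checked from the BVQ server with a plain TCP connection test. The following is a minimal sketch, assuming Python is available on that server; the function name `is_port_open` is made up for illustration, and host and port must be replaced with those of the scanned system (the HMC log excerpts later on this page show port 12443 for the REST API).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with a hypothetical host name:
#   is_port_open("hmc.example.com", 12443)
```

A successful connection only shows that the port is reachable. It does not verify credentials, user rights, or the protocol type.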
Known Issues
Compute
HMC responding slowly to REST API calls
Typically, REST API calls to the HMC are answered within milliseconds to a few seconds, depending on the amount of data the command requests. If these calls take much longer, so that performance scan durations grow unacceptably, the problem might be caused by a bug in the IBM HMC.
2023-06-30T12:07:51,354 INFO [performance-scan_Worker-7]: Execution of command [ScanPowerVmPerformanceData] took [1204342]ms [HMC_xxx] [PerfScan] (RawCommandExecutorImpl)
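In the log line above, the command took 1204342 ms, which is about 20 minutes. To find such entries in a larger scanner log, a filter along the following lines may help. This is a sketch only: the function name `slow_commands` and the default threshold of 60 seconds are assumptions chosen for illustration, and the pattern is derived solely from the single log line shown here.

```python
import re

# Matches scanner log lines of the form shown above:
#   ... Execution of command [<name>] took [<millis>]ms ...
DURATION_RE = re.compile(
    r"Execution of command \[(?P<command>[^\]]+)\] took \[(?P<millis>\d+)\]ms"
)


def slow_commands(lines, threshold_ms=60_000):
    """Yield (command, millis) for every logged command slower than threshold_ms."""
    for line in lines:
        match = DURATION_RE.search(line)
        if match and int(match.group("millis")) > threshold_ms:
            yield match.group("command"), int(match.group("millis"))
```

The function accepts any iterable of strings, so an open log file can be passed directly.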
BVQ Version: any
Suggested Action:
Upgrade the HMC to V10R2M1040 + iFix MF71107 or V10R2M1041 or higher.
Important: Upgrading to this level will only resolve the issue if performance collection on the HMC was disabled prior to the upgrade. Otherwise one of the following two methods is required:
- Reinstall the HMC
- Request the pesh password from IBM and reset the Postgres DB on the HMC. All previously collected performance statistics are deleted by this reset procedure. Note: An IBM maintenance contract is required to get the pesh password:
### Postgres DB reset after V10R2M1031 installation

    # Just for a better feeling ;-)
    hmcshutdown -t now -r

    # As hscroot, if possible save PCM data
    saveupgdata -r disksftp -h <IP> -u <USER> -d <Directory> -i perfmon[,netcfg] --migrate

    # As hscpe
    pesh <HMC_Serial>
    # and then enter the "password of the day"
    su -
    # Insert root password; you can change it beforehand as hscroot with: chhmcusr -u root -t passwd

    # yes twice! This resets the postgres DB - all PCM data will be lost
    /opt/hsc/bin/hscSignal 511
    /opt/hsc/bin/hscSignal 511
    exit
    hmcshutdown -t now -r

    # Once again, it's another bug :-(
    pesh <HMC_Serial>
    # and then enter the "password of the day"
    su -
    /opt/hsc/bin/hscSignal 511
    /opt/hsc/bin/hscSignal 511
    exit
    hmcshutdown -t now -r

    # After reboot, enable PCM data collection again

    # Not tested yet, restore PCM data
    rstupgdata -r sftp -h <IP> -u <USER> -d <Directory> --migrate
Error Message: HTTP-Error 500 "Unable to connect to Database" during topo scan
2021-04-08T09:17:37,383 ERROR [topology-scan_Worker-1]: Error executing call to [/rest/api/uom/ManagedSystem/<system_id>/<object_type>], got status [500 INTERNAL_SERVER_ERROR] with message [ <entry xmlns="http://www.w3.org/2005/Atom" xmlns:ns2="http://a9.com/-/spec/opensearch/1.1/" xmlns:ns3="http://www.w3.org/1999/xhtml"> <id>5fc9dc1c-03be-4e80-b8fb-4f9ea9dc0b4a</id> <title>HttpErrorResponse</title> <published>2021-04-08T09:17:46.640+02:00</published> <author> <name>IBM Power Systems Management Console</name> </author> <content type="application/vnd.ibm.powervm.web+xml; type=HttpErrorResponse"> <HttpErrorResponse:HttpErrorResponse xmlns:HttpErrorResponse="http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/" xmlns="http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/" xmlns:ns2="http://www.w3.org/XML/1998/namespace/k2" schemaVersion="V1_0"> <Metadata> <Atom/> </Metadata> <HTTPStatus kb="ROR" kxe="false">500</HTTPStatus> <RequestURI kxe="false" kb="ROR">/rest/api/uom/ManagedSystem/3870ebf9-dd78-31e5-a91e-719c4f86178b/NetworkBridge</RequestURI> <ReasonCode kb="ROR" kxe="false">Unknown internal error.</ReasonCode> <Message kb="ROO" kxe="false">com.ibm.pmc.rest.provider.exceptions.RESTProviderException: Exception While getting SEA:: Unable to connect to Database. 
</Message> <RequestBody kb="ROO" kxe="false"/> <RequestHeaders kxe="false" kb="ROO"> {x-forwarded-server=hmc101.localdomain, x-forwarded-host=172.16.146.192:12443, X-Transaction-ID=XT10011179, host=172.16.146.192:12443, connection=Keep-Alive, x-api-session=9PLm73Uh1Zg84V-wivc1TrsRpNLdrV-VbWeDWSfBEheHxNiuwGEgl_0OWwcLpFlVDaydA2z37DWTvHLMo9McsEwN9X_at8GfE5_ayfeF9Qjzf2EGlzYxJW0G4BzEaznehGmjRR7GfsY3ktUmcte6LT_-JHq5tCqUzfx4nVz6E9wu4dbG2Bo-DhJnkH81b1HLqpjCZyPQlAHf1fqpFOFWr29_7xq-GN_J5tiE1zwSXvY=, x-forwarded-for=172.16.168.84, accept-encoding=gzip,deflate, accept=application/vnd.ibm.powervm.uom+xml;, user-agent=Apache-HttpClient/4.5.12 (Java/11)} </RequestHeaders> </HttpErrorResponse:HttpErrorResponse> </content> </entry> ] (PowerVmClient)
BVQ Version: 2021.H1.3 and above
Suggested Action:
The Postgres DB on the VIO servers, which the HMC queries for various system information (virtual networks, virtual storage, etc.), is broken, or the Postgres service (vio_daemon) is not running. This causes HTTP 500 error messages on both the HMC and the BVQ scanner.
The issue can be fixed using the following commands on the VIOS:
    ssh padmin@${vios}
    $ oem_setup_env
    # stopsrc -s vio_daemon
    # /usr/sbin/slibclean
    # rm -rf /home/ios/CM
    # startsrc -s vio_daemon -a '-d 4'
Error Message: HTTP-Error 500 "Exception while getting SEA" during topo scan
2021-05-18T19:33:27,290 ERROR [topology-scan_Worker-6]: Error executing call to [/rest/api/uom/ManagedSystem/b8f44367-98bd-377d-8227-7db6208f1c4c/NetworkBridge], got status [500 INTERNAL_SERVER_ERROR] with message [ <entry xmlns="http://www.w3.org/2005/Atom" xmlns:ns2="http://a9.com/-/spec/opensearch/1.1/" xmlns:ns3="http://www.w3.org/1999/xhtml"> <id>7af9ea16-352a-4ab1-890f-1e24405102e7</id> <title>HttpErrorResponse</title> <published>2021-05-18T19:33:27.285+02:00</published> <author> <name>IBM Power Systems Management Console</name> </author> <content type="application/vnd.ibm.powervm.web+xml; type=HttpErrorResponse"> <HttpErrorResponse:HttpErrorResponse xmlns:HttpErrorResponse="http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/" xmlns="http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/" xmlns:ns2="http://www.w3.org/XML/1998/namespace/k2" schemaVersion="V1_0"> <Metadata> <Atom/> </Metadata> <HTTPStatus kxe="false" kb="ROR">500</HTTPStatus> <RequestURI kb="ROR" kxe="false">/rest/api/uom/ManagedSystem/b8f44367-98bd-377d-8227-7db6208f1c4c/NetworkBridge</RequestURI> <ReasonCode kb="ROR" kxe="false">Unknown internal error.</ReasonCode> <Message kb="ROO" kxe="false">com.ibm.pmc.rest.provider.exceptions.RESTProviderException: Exception While getting SEA:: The system is currently too busy to complete the specified request. Please retry the operation at a later time. If the operation continues to fail, check the error log to see if the filesystem is full. 
</Message> <RequestBody kb="ROO" kxe="false"/> <RequestHeaders kxe="false" kb="ROO">{x-forwarded-server=hmc3.labwi.sva.de, x-forwarded-host=hmc3.labwi.sva.de:12443, X-Transaction-ID=XT11438532, host=hmc3.labwi.sva.de:12443, connection=Keep-Alive, x-api-session=xnqKwOuPE9APo0rubodXReDeoXN2SnlAXIeEzEu4guge1pd6sg4oCF0WlAE94qpB7NjiX5q8L5xHLJMUuS4LWqRvxopaTucnrOqa6TACGCWhAMYJ4DekkrJtxlpM_s0GkNkoerZ5JSvutYojiYro9N2TNortma44FydeyORKQF260PAUjI2SLytd10mS8PJTpb9uzkbo6h0P0quXOSqRXg==, x-forwarded-for=10.10.120.73, accept-encoding=gzip,deflate, accept=application/vnd.ibm.powervm.uom+xml;, user-agent=Apache-HttpClient/4.5.12 (Java/11)}</RequestHeaders> </HttpErrorResponse:HttpErrorResponse> </content> </entry> ] (PowerVmClient)
BVQ Version: 2021.H1.3 and above
Suggested Action:
Different problems can result in this error. One reason might be a full filesystem (as the error message itself suggests). Another reason might be a faulty SEA adapter. Run the command
entstat -all entX
and check the output for Limbo Packets, which indicate that the SEA has detected that its physical network is not operational.
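Both SEA-related HTTP 500 entries above log the same kind of request and differ only in the text of the Message element, which is easy to miss in the long XML payload. The following sketch extracts the status, request URI, and message from such a log excerpt. The function name `summarize_http_error` is hypothetical, and the namespace is taken from the excerpts on this page, so other HMC levels may need adjustments.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by the HttpErrorResponse payload in the log excerpts above.
MC_NS = "{http://www.ibm.com/xmlns/systems/power/firmware/web/mc/2012_10/}"


def summarize_http_error(log_text: str):
    """Return (status, request_uri, message) from the first Atom <entry>, or None."""
    found = re.search(r"<entry\b.*?</entry>", log_text, flags=re.DOTALL)
    if found is None:
        return None
    root = ET.fromstring(found.group(0))

    def text_of(tag: str) -> str:
        # Look up the element anywhere below the Atom entry.
        element = root.find(f".//{MC_NS}{tag}")
        return (element.text or "").strip() if element is not None else ""

    return text_of("HTTPStatus"), text_of("RequestURI"), text_of("Message")
```

Passing the complete log entry, including the line break inside the Message element, is fine because the search spans multiple lines.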
Error Message: "ObjectNotValidException" during topo persist
2021-07-19T15:43:45,609 ERROR [persistExecutor_3]: Error during command execution: Error during powervm topology persist execution! [TopoPersist] (AbstractCommandExecutor) de.sva.bvq.data.grid.api.exception.ObjectNotValidException: Create operation not valid for object of type [pvm_physical_volume_to_virtual_io_server]! pvm_physical_volume_location_code: [[[pvm_physical_volume_location_code] must not be null!][[pvm_physical_volume_location_code] is primary key (not auto generated) and must be set!]]
BVQ Version: 2021.H1.3 and above
Suggested Action:
The information provided by AIX and the information sent via REST do not match. In this case, an hdisk looked fine from the AIX point of view but reported no location code when queried via the REST API.
The root cause is not understood, but rebooting the VIO servers one after the other probably fixes the issue.
Error Message: "NullPointerException" during topo persist
2022-03-17T11:34:01,746 ERROR [PersistExecutor_1]: Error during command execution [TopoPersist] (BaseJobExecutor) java.lang.NullPointerException: null at de.sva.bvq.persister.powervm.commands.PersistVirtualNetworkBridgeCommand.executeCommand(PersistVirtualNetworkBridgeCommand.java:53) ~[bvq-powervm-persist-2021.H2.9.jar!/:?]
BVQ Version: 2021.H1.3 and above
Suggested Action:
This problem can be caused by a corrupt CMDB on a VIO server. There is a script available from IBM to clean up the CMDB which probably fixes the issue.
Error Message: "DuplicateKeyException" during topo persist
2023-07-18T21:05:26,008 ERROR [PersistExecutor_1]: Error during command execution [HMC-102-202] [TopoPersist] (BaseJobExecutor) org.springframework.dao.DuplicateKeyException: Write operation error on server localhost:27017. Write error: WriteError{code=11000, message='E11000 duplicate key error collection: bvq.dgx_pvm_virtual_network_bridge index: pk dup key: { primaryKey.pvm_hmc_group_id: "hmc-grp-c36a2689-f68c-40e0-a84a-565dc834f2fd", primaryKey.pvm_managed_system_id: "4f310ca5-2cf1-3126-a3ca-90498ac0a16e", primaryKey.pvm_virtual_network_bridge_id: "cd6ab433-6592-3ae5-be8e-83e8ca0b2c2d", validFrom: new Date(1689707111000) }', details={}}.; nested exception is com.mongodb.MongoWriteException: Write operation error on server localhost:27017...
BVQ Version: 2022.H1 and above
Suggested Action:
This problem can be caused by a corrupt CMDB on a VIO server. There is a script available from IBM to clean up the CMDB which probably fixes the issue.
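To see which object the persister tripped over, the collection name and key fields can be pulled out of the E11000 message. The following is a minimal sketch based only on the message format shown in the log excerpt above; the function name `parse_duplicate_key` is made up for illustration.

```python
import re

DUP_KEY_RE = re.compile(
    r"E11000 duplicate key error collection: (?P<collection>\S+) "
    r"index: (?P<index>\S+) dup key: \{(?P<key>.*?)\}"
)
FIELD_RE = re.compile(r'([\w.]+): "([^"]*)"')


def parse_duplicate_key(message: str):
    """Return (collection, {field: value}) for an E11000 message, or None."""
    match = DUP_KEY_RE.search(message)
    if match is None:
        return None
    # Only quoted string fields are collected; non-string values such as
    # validFrom: new Date(...) are skipped.
    fields = dict(FIELD_RE.findall(match.group("key")))
    return match.group("collection"), fields
```

For the excerpt above, this would return the collection bvq.dgx_pvm_virtual_network_bridge together with the HMC group, managed system, and virtual network bridge IDs.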
Network
Error Message: "Max limit for REST sessions reached"
Got error from Brocade Switch with IP 10.10.101.147! (body: { "errors": { "error": [ { "error-type": "application", "error-tag": "operation-failed", "error-app-tag": "Error", "error-path": "/rest/login", "error-message": "Max limit for REST sessions reached", "error-info": { "error-code": 14, "error-module": "auth" }
BVQ Version: 6.2 and above
Suggested Action:
By default, the REST session limit on Brocade switches is set to 3. The number of REST sessions can be increased up to 10 by using the CLI command mgmtapp --config -maxrestsession <1...10>
Brocade scanner fails with "authentication failed"
Due to a bug in FOS versions 8.2.3a and 8.2.3a1, communication via SNMP and REST APIs is broken after an update, causing Brocade REST scanners to fail.
Root cause is a file descriptor leak that occurs in weblinker during LDAP authentication. Once all file descriptors are consumed a verify error is logged. This also results in webtools authentication failing, causing SANnav, BNA or BVQ to be unable to authenticate with the switch.
BVQ Version: any
Suggested Action:
Workaround
Performing an hafailover on a director or an hareboot on a non-director will restore connectivity. Once executed, disabling LDAP will prevent the issue from being hit. If LDAP cannot be disabled, reducing the number of login attempts via HTTP/Webtools will increase the timeframe before the issue is observed again. Downgrading from 8.2.3a or 8.2.3a1 to a lower release will also stop the issue from occurring.
Final Fix
FOS 8.2.3a2
Error Message: "E11000 duplicate key error collection: bvq.dgx_brocade_rule"
There is a known Brocade bug that leads to duplicate Brocade rule entries. See Brocade documentation FOS-845272. This bug is fixed in Brocade FOS 9.1.1.c, 9.2.0 or higher.
BVQ Version: 2023.H1.2
Suggested Action:
There are two ways to resolve this issue:
- Upgrade BVQ to version 2023.H1.3 or higher. This version can cope with such duplicate entries.
- Upgrade the switches to FOS 9.1.1.c or a later 9.1 version. (FOS 9.2.0 is not yet supported by BVQ 2023.H1.3.)
Storage
Error Message: "rbash: line xxx: yyy Killed"
Catched Exception! (SSHConnectionImpl) de.qualicision.bvq.exception.BvqScanException: SSH result contains "rbash: line xxx: yyy Killed"
BVQ Version: all
Suggested Action:
None.
This is an SVC limitation. The error typically disappears the next time the system is scanned.
Error Message: "the svc raised an error -> CMMVC6098"
Catched Exception! (SSHConnectionImpl) de.qualicision.bvq.adapter.ssh.action.CommandErrorException: the svc raised an error -> CMMVC6098
BVQ Version: all
Suggested Action:
None.
The error occurs when the SVC is busy, e.g. copying files between nodes. The error typically disappears the next time the system is scanned.
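Because both storage errors above usually clear up on the next scan, it can help to separate them from errors that need attention when reviewing logs. The following is a small sketch; the function name `is_transient_svc_error` is hypothetical, and the patterns are taken only from the two log excerpts in this section.

```python
import re

# Patterns taken from the two log excerpts in this Storage section.
TRANSIENT_PATTERNS = (
    re.compile(r"rbash: line \S+:\s+\S+\s+Killed"),
    re.compile(r"CMMVC6098"),
)


def is_transient_svc_error(log_line: str) -> bool:
    """Return True if the line matches one of the known transient SVC errors."""
    return any(pattern.search(log_line) for pattern in TRANSIENT_PATTERNS)
```

If one of these errors shows up in every scan rather than occasionally, it should no longer be treated as transient.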