AIX Simplified Remote Restart SRR



Remote Restart is a PowerVM feature that allows an AIX partition to be restarted on a different physical Power System when its host system fails. Such failures include a network failure, a power failure, and the system entering an error state.

The feature was introduced to keep the operating system highly available when the host system fails. A partition can be remote restarted onto a stable target Power System when the source system is in one of these states: Power Off, Power Off In Progress, Error, or No Connection with the partitions powered off.

Requirements for Remote Restart 
HMC - 920 and above
System/ Firmware - Power8 / FW820 and above
VIOS - 2234 and above
SSP-capable remote restart is supported from HMC840 and VIOS 2240 levels onward



View the partition SRR capable flag and status
$ lssyscfg -m system1 -r lpar -Fname,state,simplified_remote_restart_capable,remote_restart_status
partition01,Not Activated,0,Invalid
partition02,Running,1,Remote Restartable

Remote restart states
Invalid                         - Partition SRR capability is set to 0, or the partition has not been activated after setting it to 1
Remote Restartable              - Partition can be restarted
Source Remote Restarting        - Partition is restarting on source system
Destination Remote Restarting   - Partition is restarting on target system
Remote Restarted                - Partition remote restarted
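
The listing and states above can be combined into a quick filter. This is a minimal sketch, assuming the comma-separated field order used in the lssyscfg command shown earlier (name, state, capability flag, status); the function name restartable_lpars is hypothetical, and the sample input is copied from the listing rather than read from a live HMC.

```shell
# Print the names of partitions that are SRR capable and currently
# report the "Remote Restartable" status.
# Input lines: name,state,simplified_remote_restart_capable,remote_restart_status
restartable_lpars() {
    awk -F, '$3 == "1" && $4 == "Remote Restartable" { print $1 }'
}

# Sample rows taken from the listing above; on a real HMC the input
# would be the output of the lssyscfg command.
printf '%s\n' \
    'partition01,Not Activated,0,Invalid' \
    'partition02,Running,1,Remote Restartable' | restartable_lpars
```

With the sample rows, only partition02 passes the filter.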

Enable the Remote Restart capability flag using the CLI
$
$ lssyscfg -m system1 -r lpar -Fname,state,simplified_remote_restart_capable,remote_restart_status
partition01,Not Activated,0,Invalid
$ chsyscfg -r lpar -m system1 -i "name=partition01,simplified_remote_restart_capable=1"
$ echo $?
0
$ lssyscfg -m system1 -r lpar -Fname,state,simplified_remote_restart_capable,remote_restart_status
partition01,Not Activated,1,Remote Restartable
$

Disable the SRR capability by setting it to zero using the CLI
$
$ lssyscfg -m system1 -r lpar -Fname,state,simplified_remote_restart_capable,remote_restart_status
partition01,Not Activated,1,Remote Restartable
$ chsyscfg -r lpar -m system1 -i "name=partition01,simplified_remote_restart_capable=0"
$ echo $?
0
$ lssyscfg -m system1 -r lpar -Fname,state,simplified_remote_restart_capable,remote_restart_status
partition01,Not Activated,0,Invalid
$

Enable SRR using HMC GUI
Log in to the HMC GUI, open the properties of partition01, and check the Simplified Remote Restart box.




Disable SRR using HMC GUI
Log in to the HMC GUI, open the properties of partition01, and uncheck the Simplified Remote Restart box.

Refresh the partition SRR status in HMC
Sometimes the partition's Remote Restart data is displayed as Invalid because of network connection issues between the HMC and the partition. In that case, refresh the SRR details stored on the HMC disk so that they match the actual state of the partition. Use refdev to refresh the partition's device data.
refdev -m system1 -p partition01

Remote Restart Command Line

Validate the remote restart using the validate operation
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o validate; echo $?; date

Remote restart the partition on the target server system2
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o restart; echo $?; date

Abort the remote restart operation
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o cancel; echo $?; date

Recover a failed remote restart operation (--force can be added)
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o recover --force; echo $?; date

Clean up the source system1 with the force option when automatic cleanup has not started
$ date; rrstartlpar -m system1 -p partition01 -o cleanup --force; echo $?; date

The --force option can be specified when a normal recover or cleanup does not work.
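
The validate and restart commands above are usually run in sequence. The following is a minimal sketch of that sequence, assuming rrstartlpar can be called from the shell where the function runs (for example through an ssh wrapper to the HMC); the function name srr_restart is hypothetical and not part of the HMC command set.

```shell
# Validate first, and start the remote restart only if validation
# returns 0. Arguments: source system, target system, partition name.
srr_restart() {
    srr_src=$1
    srr_dst=$2
    srr_lpar=$3

    rrstartlpar -m "$srr_src" -t "$srr_dst" -p "$srr_lpar" -o validate
    srr_rc=$?
    if [ "$srr_rc" -ne 0 ]; then
        echo "validation failed with return code $srr_rc; restart not attempted" >&2
        return "$srr_rc"
    fi

    # Validation passed; the function returns the return code of the restart.
    rrstartlpar -m "$srr_src" -t "$srr_dst" -p "$srr_lpar" -o restart
}
```

A call would look like "srr_restart system1 system2 partition01". Recover and cleanup are deliberately left out, since whether to use --force is a decision for the operator.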

No Connection SRR


This is a new feature added to SRR. When the HMC cannot establish a connection to the Power System FSP, the system is in the No Connection state. A partition that is powered off can then be restarted on the target system using the --noconnection option. To validate and restart, use the commands below
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o validate --noconnection; echo $?; date
$ date; rrstartlpar -m system1 -t system2 -p partition01 -o restart --noconnection; echo $?; date

When the target system2 is not managed by the HMC that manages system1, the IP address of the HMC that manages the target Power System must be provided with the --ip option, as below
--ip 9.x.x.x

where 9.x.x.x is the target HMC's IP address
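
To show where the --ip option sits on the command line, here is a minimal sketch that only prints the command instead of running it. The helper name rr_command is hypothetical, and the sketch adds only the --ip option described above; any other options a remote HMC may need are not covered here.

```shell
# Print an rrstartlpar command line; append --ip only when a target
# HMC address is given.
# Arguments: operation, source system, target system, partition, [target HMC IP]
rr_command() {
    rr_cmd="rrstartlpar -m $2 -t $3 -p $4 -o $1"
    if [ -n "${5:-}" ]; then
        rr_cmd="$rr_cmd --ip $5"
    fi
    printf '%s\n' "$rr_cmd"
}

# Dry run with the placeholder address from the text above.
rr_command validate system1 system2 partition01 9.x.x.x
```

Leaving out the fifth argument prints the same command without --ip, which is the form used when one HMC manages both systems.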

Return Codes of rrstartlpar

The return codes of the rrstartlpar command are described below
0  Remote Restart is successful
1  Remote Restart failed
81 Partition name is already used on target
82 Remote Restart validation failed
83 Remote Restart failed before the point of no return
84 Remote Restart failed after the point of no return
85 Remote Restart recover is not valid
86 Remote Restart recover failed
87 Remote Restart failed before the point of no return and recovery was successful
89 Remote Restart force recover failed
91 Remote Restart recover failed during rollback
92 Remote Restart recover failed during cleanup
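
Since every example in this article prints the return code with echo $?, a small lookup can turn that number into its description. This is a minimal sketch that covers only the codes in the table above; the function name rrstartlpar_rc_text is hypothetical.

```shell
# Map an rrstartlpar return code to the description from the table above.
rrstartlpar_rc_text() {
    case "$1" in
        0)  echo "Remote Restart is successful" ;;
        1)  echo "Remote Restart failed" ;;
        81) echo "Partition name is already used on target" ;;
        82) echo "Remote Restart validation failed" ;;
        83) echo "Remote Restart failed before the point of no return" ;;
        84) echo "Remote Restart failed after the point of no return" ;;
        85) echo "Remote Restart recover is not valid" ;;
        86) echo "Remote Restart recover failed" ;;
        87) echo "Remote Restart failed before the point of no return and recovery was successful" ;;
        89) echo "Remote Restart force recover failed" ;;
        91) echo "Remote Restart recover failed during rollback" ;;
        92) echo "Remote Restart recover failed during cleanup" ;;
        *)  echo "Return code $1 is not in the table" ;;
    esac
}

# Example lookup for a validation failure.
rrstartlpar_rc_text 82
```

Codes that are not in the table fall through to the last branch instead of being guessed.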

Concurrent remote restart info
lsrrstartlpar lists the concurrent remote restart operations that are currently in progress. It can also show the maximum number of concurrent remote restarts that a system supports

List the system-level remote restart details
lsrrstartlpar -r sys -m system1
num_dest_remote_restarts_in_progress=0,max_dest_remote_restarts_supported=32,powervm_partition_simplified_remote_restart_capable=1,powervm_partition_remote_restart_capable=1

List the partition-level remote restart details
lsrrstartlpar -r lpar -m system1
lpar_name=partition01,lpar_id=8,lpar_uuid=23966ACD-59AB-4276-A448-65F209FA8D83,remote_restart_operation_state=Remote Restartable,simplified_remote_restart_capable=1,remote_restart_capable=0
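
The lsrrstartlpar output is a comma-separated list of attribute=value pairs, so a single attribute can be pulled out with awk. This is a minimal sketch, assuming no value contains a comma (true for the sample lines above, but not guaranteed for every HMC attribute); the function name get_field is hypothetical.

```shell
# Print the value of one attribute from attribute=value,attribute=value records.
# Argument: attribute name. Records are read from standard input.
get_field() {
    awk -F, -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == key)
                print substr($i, n + 1)
        }
    }'
}

# Sample record copied from the partition listing above.
echo 'lpar_name=partition01,lpar_id=8,lpar_uuid=23966ACD-59AB-4276-A448-65F209FA8D83,remote_restart_operation_state=Remote Restartable,simplified_remote_restart_capable=1,remote_restart_capable=0' |
    get_field remote_restart_operation_state
```

The same function works on the system-level record, for example to read max_dest_remote_restarts_supported.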

Simulating Source system modes


The Power Off, Power Off In Progress, Error, and No Connection system states can be simulated with the following procedures

Power off / Power off in progress - To reach this state, power off the source system in normal mode and start the remote restart to the destination Power System.

Error state - To reach this state, get the FSP IP address using the HMC command "lssyscfg -r sys -m system1 -Fipaddr", log in to the FSP as the dev user, select "System Dump", and then select "Save settings and initiate dump". The system goes to the error state and later returns to the operating state.

No Connection state - Power off the partition that is to be restarted and block the HMC to FSP network connection using the HMC firewall. Log in as super user and run "iptables -I INPUT -s <CEC IP> -j DROP" to block the connection; run "iptables -D INPUT -s <CEC IP> -j DROP" to unblock it. While the connection is blocked, the system goes to the No Connection state and SRR can be performed.