5. Error Handling
Learning Objectives
In this section we will discuss calculation failures in AiiDA, how they are reported and how to deal with them automatically via error handlers.
As anyone doing computational work is well aware, calculations can fail for a variety of reasons. If you’re only running a few calculations manually, typically you would:
Check the outputs of the calculation to figure out why the calculation failed.
Adapt the inputs of the calculation to try and remedy the problem.
However, when running many calculations in a high-throughput setting, this process needs to be automated as well.
from local_module import load_temp_profile

data = load_temp_profile(
    name="error_handling",
    add_computer=True,
    add_pw_code=True,
    add_sssp=True,
    add_structure_si=True,
)
5.1. Exit codes
Exit codes in AiiDA are used to clearly communicate how a process terminated. They consist of two parts: a non-negative integer, called the exit status, and a message giving more detail, called the exit message. If the exit status is zero, which is the default, the process is said to have terminated nominally and finished successfully. A non-zero exit status communicates that there was some kind of problem during the execution of the process, and in that case the process is said to have failed.
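To make the two parts of an exit code concrete, here is a minimal, self-contained sketch in plain Python. The `ToyExitCode` class is a hypothetical stand-in written for this illustration; it is not AiiDA's own class and only mirrors the status/message pairing described above.

```python
from typing import NamedTuple, Optional


class ToyExitCode(NamedTuple):
    """Hypothetical stand-in for an exit code: an integer status plus a message."""

    status: int = 0
    message: Optional[str] = None

    @property
    def is_ok(self) -> bool:
        # A zero status means the process finished successfully.
        return self.status == 0


# The default exit code (status 0) and a failure that mimics exit status 410.
success = ToyExitCode()
failure = ToyExitCode(410, "The electronic minimization cycle did not reach self-consistency.")

for code in (success, failure):
    label = "finished OK" if code.is_ok else "failed"
    print(f"exit status {code.status}: {label}")
```

Any non-zero status counts as a failure here, which is why an automated workflow can branch on the integer alone and keep the message for the human reader.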
To see this in action, we’ll once again run a Quantum ESPRESSO pw.x calculation, but this time adapt the inputs so the calculation will fail to converge electronically.
from local_module.pw_builder import get_pw_builder
pw_builder = get_pw_builder(data.code, data.structure, 'fast')
To simulate a failed calculation, we’ll reduce the number of electronic steps to only 6 and run the calculation:
pw_builder.parameters['ELECTRONS']['electron_maxstep'] = 6
from aiida.engine import run_get_node
result = run_get_node(pw_builder)
10/04/2022 08:15:43 AM <13124> aiida.parser.PwParser: [ERROR] ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED
10/04/2022 08:15:43 AM <13124> aiida.parser.PwParser: [ERROR] The electronic minimization cycle did not reach self-consistency.
10/04/2022 08:15:43 AM <13124> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] output parser returned exit code<410>: The electronic minimization cycle did not reach self-consistency.
As we can see, running the calculation with this input fails, and the AiiDA parser for pw.x found that the electronic minimization cycle failed to reach self-consistency.
The exit status and message are also stored on the calcjob node:
print(result.node.exit_status, result.node.exit_message)
410 The electronic minimization cycle did not reach self-consistency.
You can see the full list of exit codes that are defined for the PwCalculation using verdi plugin list:
%verdi plugin list aiida.calculations quantumespresso.pw
Description:
`CalcJob` implementation for the pw.x code of Quantum ESPRESSO.
Inputs:
kpoints: required KpointsData kpoint mesh or kpoint path
parameters: required Dict The input parameters that are to be used to construct the input file.
pseudos: required UpfData, UpfData A mapping of `UpfData` nodes onto the kind name to which they should apply.
structure: required StructureData The input structure.
code: optional Code The `Code` to use for this job. This input is required, unless the `remote_ ...
hubbard_file: optional SinglefileData SinglefileData node containing the output Hubbard parameters from a HpCalcu ...
metadata: optional
parallelization: optional Dict Parallelization options. The following flags are allowed:
npool : The numb ...
parent_folder: optional RemoteData An optional working directory of a previously completed calculation to rest ...
remote_folder: optional RemoteData Remote directory containing the results of an already completed calculation ...
settings: optional Dict Optional parameters to affect the way the calculation job and the parsing a ...
vdw_table: optional SinglefileData Optional van der Waals table contained in a `SinglefileData`.
Outputs:
output_parameters: required Dict The `output_parameters` output node of the successful calculation.
remote_folder: required RemoteData Input files necessary to run the process will be stored in this folder node ...
retrieved: required FolderData Files that are retrieved by the daemon will be stored in this node. By defa ...
output_atomic_occupations: optional Dict
output_band: optional BandsData The `output_band` output node of the successful calculation if present.
output_kpoints: optional KpointsData
output_structure: optional StructureData The `output_structure` output node of the successful calculation if present ...
output_trajectory: optional TrajectoryData
remote_stash: optional RemoteStashData Contents of the `stash.source_list` option are stored in this remote folder ...
Exit codes:
1: The process has failed with an unspecified error.
2: The process failed with legacy failure mode.
10: The process returned an invalid output.
11: The process did not register a required output.
100: The process did not have the required `retrieved` output.
110: The job ran out of memory.
120: The job ran out of walltime.
301: The retrieved temporary folder could not be accessed.
302: The retrieved folder did not contain the required stdout output file.
303: The retrieved folder did not contain the required XML file.
304: The retrieved folder contained multiple XML files.
305: Both the stdout and XML output files could not be read or parsed.
310: The stdout output file could not be read.
311: The stdout output file could not be parsed.
312: The stdout output file was incomplete probably because the calculation got interrupted.
320: The XML output file could not be read.
321: The XML output file could not be parsed.
322: The XML output file has an unsupported format.
340: The calculation stopped prematurely because it ran out of walltime but the job was killed by the scheduler before the files were safely written to disk for a potential restart.
350: The parser raised an unexpected exception: {exception}
400: The calculation stopped prematurely because it ran out of walltime.
410: The electronic minimization cycle did not reach self-consistency.
461: The code failed with negative dexx in the exchange calculation.
462: The code failed during the cholesky factorization.
463: Too many bands failed to converge during the diagonalization.
481: The k-point parallelization "npools" is too high, some nodes have no k-points.
500: The ionic minimization cycle did not converge for the given thresholds.
501: Then ionic minimization cycle converged but the thresholds are exceeded in the final SCF.
502: The ionic minimization cycle did not converge after the maximum number of steps.
510: The electronic minimization cycle failed during an ionic minimization cycle.
511: The ionic minimization cycle converged, but electronic convergence was not reached in the final SCF.
520: The ionic minimization cycle terminated prematurely because of two consecutive failures in the BFGS algorithm.
521: The ionic minimization cycle terminated prematurely because of two consecutive failures in the BFGS algorithm and electronic convergence failed in the final SCF.
531: The electronic minimization cycle did not reach self-consistency.
541: The variable cell optimization broke the symmetry of the k-points.
710: The electronic minimization cycle did not reach self-consistency, but `scf_must_converge` is `False` and/or `electron_maxstep` is 0.
5.2. Error handling: BaseRestartWorkChain
Because automatically recovering from errors is such a common use case, aiida-core comes with an abstract base class that implements the required logic for doing so: the BaseRestartWorkChain.
The full logic is shown below:
In short, the BaseRestartWorkChain checks the exit code of the process it is wrapping and, if an error handler is implemented for that exit code, runs it.
In that case, the calculation is restarted, up to a maximum number of times specified by the user.
If no handler is implemented, it still tries to restart the calculation once (e.g. in case of node failures).
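The control flow just described can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the BaseRestartWorkChain implementation: `run_with_restarts`, `fake_calculation` and `reduce_mixing_beta` are hypothetical names, the convergence threshold of 0.35 is invented for the demo, and a handler is looked up by exit status in a simple dictionary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout: this mimics the control flow only.
Handler = Callable[[dict], str]


def run_with_restarts(
    run: Callable[[dict], int],
    inputs: dict,
    handlers: Dict[int, Handler],
    max_iterations: int = 5,
) -> Tuple[int, List[str]]:
    """Run `run(inputs)` until it returns 0 or the restart budget is spent."""
    log: List[str] = []
    unhandled_failures = 0
    status = 0
    for iteration in range(1, max_iterations + 1):
        status = run(inputs)
        if status == 0:
            log.append(f"iteration {iteration}: success")
            return status, log
        handler = handlers.get(status)
        if handler is not None:
            # A handler exists: let it adapt the inputs, then restart.
            unhandled_failures = 0
            log.append(f"iteration {iteration}: exit {status} handled, {handler(inputs)}")
            continue
        # No handler: allow a single plain restart before giving up.
        unhandled_failures += 1
        if unhandled_failures > 1:
            log.append(f"iteration {iteration}: exit {status} unhandled twice, aborting")
            return status, log
        log.append(f"iteration {iteration}: exit {status} unhandled, restarting once")
    log.append("maximum number of iterations reached")
    return status, log


def fake_calculation(inputs: dict) -> int:
    # Invented rule for the demo: "converges" once mixing_beta is below 0.35.
    return 0 if inputs["mixing_beta"] < 0.35 else 410


def reduce_mixing_beta(inputs: dict) -> str:
    inputs["mixing_beta"] *= 0.8
    return "reduced mixing_beta"


final_status, messages = run_with_restarts(
    fake_calculation, {"mixing_beta": 0.4}, {410: reduce_mixing_beta}
)
print(final_status)
for message in messages:
    print(message)
```

With an empty `handlers` dictionary and a task that always fails, the same function restarts once and then aborts, which corresponds to the "restart once" behaviour for unhandled failures.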
The BaseRestartWorkChain of the pw.x calculation is called the PwBaseWorkChain.
Similar to the higher-level PwBandsWorkChain shown in the second section, it comes with a handy method for obtaining a fully populated builder based on a chosen protocol:
from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain

builder = PwBaseWorkChain.get_builder_from_protocol(
    code=data.code,
    structure=data.structure,
    protocol="fast",
)
For the PwBaseWorkChain, the inputs of the pw.x calculation are available in the pw namespace.
Let’s once again choose a very low value for the maximum number of electronic iterations so the calculation fails to converge electronically and run the work chain:
builder.pw.parameters['ELECTRONS']['electron_maxstep'] = 6
from aiida.engine import run_get_node
result = run_get_node(builder)
Report: [101|PwBaseWorkChain|run_process]: launching PwCalculation<106> iteration #1
Error: ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED
Error: The electronic minimization cycle did not reach self-consistency.
Warning: output parser returned exit code<410>: The electronic minimization cycle did not reach self-consistency.
Report: [101|PwBaseWorkChain|report_error_handled]: PwCalculation<106> failed with exit status 410: The electronic minimization cycle did not reach self-consistency.
Report: [101|PwBaseWorkChain|report_error_handled]: Action taken: reduced beta mixing from 0.4 to 0.32000000000000006 and restarting from the last calculation
Report: [101|PwBaseWorkChain|inspect_process]: PwCalculation<106> failed but a handler dealt with the problem, restarting
Report: [101|PwBaseWorkChain|run_process]: launching PwCalculation<114> iteration #2
Report: [101|PwBaseWorkChain|results]: work chain completed after 2 iterations
Report: [101|PwBaseWorkChain|on_terminated]: cleaned remote folders of calculations: 106 114
Bingo!
The PwBaseWorkChain does its intended job: after identifying the exit code of the failure of the pw.x calculation, it adapts its inputs and restarts the calculation.
This is also visible in the hierarchical overview obtained using verdi process status:
%verdi process status {result.node.pk}
PwBaseWorkChain<101> Finished [0] [4:results]
├── create_kpoints_from_distance<102> Finished [0]
├── PwCalculation<106> Finished [410]
└── PwCalculation<114> Finished [0]
In this case, adapting the charge density mixing as reported above is most likely not necessary, but hopefully this gives an idea of how a base restart work chain can help improve the robustness of your calculations.
The PwBaseWorkChain already has a whole set of error handlers implemented.
As an example implementation, below you can see the code of the error handler that was called in our previous test run:
@process_handler(priority=410, exit_codes=[
    PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED,
])
def handle_electronic_convergence_not_reached(self, calculation):
    """Handle `ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED` error.

    Decrease the mixing beta and fully restart from the previous calculation.
    """
    factor = self.defaults.delta_factor_mixing_beta
    mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta)
    mixing_beta_new = mixing_beta * factor

    self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new
    action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation'

    self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
    self.report_error_handled(calculation, action)

    return ProcessHandlerReport(True)
We won’t go too much into the details of the implementation here.
Just note that the handler is implemented as a method on the PwBaseWorkChain, decorated with the process_handler decorator where the exit codes it attempts to fix are specified.
The body of the method adapts the inputs of the calculation and reports that an error has been handled to the user.
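The arithmetic inside the handler can be reproduced outside AiiDA. The sketch below assumes a reduction factor of 0.8, which is inferred from the report in the test run above (0.4 reduced to roughly 0.32) rather than read from the plugin's defaults, and uses a plain dictionary in place of the work chain's input parameters.

```python
# Assumed values, inferred from the report of the test run (0.4 -> ~0.32).
mixing_beta = 0.4  # value used in the first calculation
factor = 0.8       # assumed reduction factor, not read from the plugin defaults

# Plain dictionary standing in for the parameters of the calculation.
parameters = {"ELECTRONS": {"electron_maxstep": 6, "mixing_beta": mixing_beta}}

# Same update the handler applies to its inputs before restarting.
mixing_beta_new = parameters["ELECTRONS"]["mixing_beta"] * factor
parameters["ELECTRONS"]["mixing_beta"] = mixing_beta_new

print(f"reduced beta mixing from {mixing_beta} to {mixing_beta_new}")
```

Because the product is computed in binary floating point, the new value is not exactly 0.32, which explains the long trailing digits in the reported action.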
5.3. Transport and scheduler issues
Next to issues with the calculations, it’s also possible to suffer from transient issues, e.g. related to connecting to a remote resource or submitting to a scheduler. In section 3.8 we described how a calculation is run on a remote resource through AiiDA. If one of these steps fails, AiiDA will not simply give up on the calculation. Instead, it will use an exponential backoff mechanism with a certain number of retries, which is configurable by the user. After this number of attempts, AiiDA will pause the corresponding process:
Note
As these types of issues are transient and require us to submit the process, they are difficult to reproduce in this executable notebook. Hence we simply show some quick examples here.
$ verdi process list
PK Created Process label Process State Process status
---- --------- ---------------- --------------- ------------------------------------------------------------------------------------
1467 21m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 2076
1595 21m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1952
1904 20m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1909
1909 20m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1917
1917 20m ago PwCalculation ⏸ Waiting Pausing after failed transport task: update_calculation failed 5 times consecutively
1952 20m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1957
1957 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: stash_calculation failed 5 times consecutively
2039 19m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2044
2044 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: submit_calculation failed 5 times consecutively
2076 19m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2089
2089 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: upload_calculation failed 5 times consecutively
We can see several processes which have been paused during one of the steps AiiDA performs to run a calculation on the remote computer. Checking the report of one of the calculations indicates that there were issues authenticating to the remote cluster, and that the process was paused after 5 connection attempts:
$ verdi process report 2089
[...]
| paramiko.ssh_exception.AuthenticationException: Authentication failed.
+-> WARNING at 2022-10-03 23:20:41.902383+00:00
| maximum attempts 5 of calling do_upload, exceeded
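The retry-then-pause behaviour seen in this report can be sketched as a generic exponential backoff loop. This is an illustration, not AiiDA's implementation: `retry_with_backoff` and `failing_upload` are hypothetical names, the limit of 5 attempts is taken from the report above, and the 20 second initial interval is an assumption made for the example.

```python
import time
from typing import Callable, List


def retry_with_backoff(
    task: Callable[[], None],
    max_attempts: int = 5,           # matches the 5 attempts in the report
    initial_interval: float = 20.0,  # assumed starting wait, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if `task` succeeds, False once every attempt has failed."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True
        except Exception:
            if attempt == max_attempts:
                break
            # Wait before the next attempt, then double the waiting time.
            sleep(interval)
            interval *= 2
    return False


def failing_upload() -> None:
    # Hypothetical task that always fails, like the authentication error above.
    raise ConnectionError("Authentication failed.")


# Record the waits instead of actually sleeping, so the demo runs instantly.
waits: List[float] = []
succeeded = retry_with_backoff(failing_upload, sleep=waits.append)
paused = not succeeded

print(waits)
print("paused" if paused else "running")
```

Doubling the interval keeps the load on a struggling cluster low while still retrying a few times, and pausing instead of failing means no work is lost once the connection problem is fixed.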
Fortunately, once the connection issue has been resolved, AiiDA allows you to simply “play” the processes and continue the corresponding workflows without issue:
$ verdi process play -a
Success: played Process<2089>
Success: played Process<2044>
Success: played Process<1957>
Success: played Process<1917>
Once the processes are no longer paused, the daemon workers will pick them back up and continue running them:
$ verdi process list
PK Created Process label Process State Process status
---- --------- ---------------- --------------- ---------------------------------------
1467 26m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 2076
1595 26m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1952
1904 24m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1909
1909 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1917
1917 24m ago PwCalculation ⏵ Waiting Monitoring scheduler: job state RUNNING
1952 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1957
1957 24m ago PwCalculation ⏵ Waiting Waiting for transport task: stash
2039 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2044
2044 24m ago PwCalculation ⏵ Waiting Waiting for transport task: submit
2076 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2089
2089 24m ago PwCalculation ⏵ Waiting Waiting for transport task: upload