5. Error Handling

Learning Objectives

In this section we will discuss calculation failures in AiiDA, how they are reported and how to deal with them automatically via error handlers.

As anyone doing computational work is well aware, calculations can fail for a variety of reasons. If you’re only running a few calculations manually, typically you would:

1. Check the outputs of the calculation to figure out why the calculation failed.
2. Adapt the inputs of the calculation to try and remedy the problem.

However, when running many calculations in a high-throughput setting, this process needs to be automated as well.

5.1. Exit codes

Exit codes in AiiDA are used to clearly communicate how a process terminated. They consist of two parts: a non-negative integer, called the exit status, and a message giving more detail, called the exit message. If the exit status is zero, which is the default, the process is said to have terminated nominally and finished successfully. A non-zero exit status communicates that there was some kind of problem during the execution of the process, and in that case the process is said to have failed.
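As a minimal sketch of this idea, the snippet below models an exit code as a status plus a message in plain Python. The `ExitCode` class and the `is_finished_ok` helper are simplified stand-ins written for this illustration; they are not the classes that aiida-core ships.

```python
from typing import NamedTuple


class ExitCode(NamedTuple):
    """Illustrative stand-in for an exit code: an integer status plus a message."""

    status: int = 0
    message: str = ''


def is_finished_ok(exit_code: ExitCode) -> bool:
    """Return True only when the exit status is zero."""
    return exit_code.status == 0


success = ExitCode()  # the default status is zero, i.e. success
failure = ExitCode(410, 'The electronic minimization cycle did not reach self-consistency.')

print(is_finished_ok(success), is_finished_ok(failure))
```

The point of the pair is that the integer is easy to branch on in code, while the message is meant for the person reading the report.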

To see this in action, we’ll once again run a Quantum ESPRESSO pw.x calculation, but this time adapt the inputs so the calculation will fail to converge electronically.

from local_module.pw_builder import get_pw_builder

pw_builder = get_pw_builder(data.code, data.structure, 'fast')

To simulate a failed calculation, we’ll reduce the number of electronic steps to only 6 and run the calculation:

pw_builder.parameters['ELECTRONS']['electron_maxstep'] = 6

from aiida.engine import run_get_node

result = run_get_node(pw_builder)

10/04/2022 08:15:43 AM <13124> aiida.parser.PwParser: [ERROR] ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED
10/04/2022 08:15:43 AM <13124> aiida.parser.PwParser: [ERROR] The electronic minimization cycle did not reach self-consistency.
10/04/2022 08:15:43 AM <13124> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] output parser returned exit code<410>: The electronic minimization cycle did not reach self-consistency.

As we can see, running the calculation with this input fails, and the AiiDA parser for pw.x found that the electronic minimization cycle failed to reach self-consistency. The exit status and message are also stored on the calcjob node:

print(result.node.exit_status, result.node.exit_message)

410 The electronic minimization cycle did not reach self-consistency.
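When many calculations have been run, the stored exit status lends itself to automated bookkeeping. The sketch below tallies exit statuses over a collection of finished calculations. The `finished` list is made-up illustration data, and the two helper functions are hypothetical; any object exposing an `exit_status` attribute could take the place of the `SimpleNamespace` objects.

```python
from collections import Counter
from types import SimpleNamespace

# Made-up stand-ins for finished calculation nodes (illustration only).
finished = [
    SimpleNamespace(pk=1, exit_status=0),
    SimpleNamespace(pk=2, exit_status=410),
    SimpleNamespace(pk=3, exit_status=0),
    SimpleNamespace(pk=4, exit_status=410),
    SimpleNamespace(pk=5, exit_status=120),
]


def tally_exit_statuses(nodes):
    """Count how often each exit status occurs among the given nodes."""
    return Counter(node.exit_status for node in nodes)


def failed_pks(nodes):
    """Return the identifiers of all nodes with a non-zero exit status."""
    return [node.pk for node in nodes if node.exit_status != 0]


counts = tally_exit_statuses(finished)
print(counts.most_common())
print(failed_pks(finished))
```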

You can see the full list of exit codes that are defined for the PwCalculation using verdi plugin list:

%verdi plugin list aiida.calculations quantumespresso.pw


Description:

    `CalcJob` implementation for the pw.x code of Quantum ESPRESSO.

Inputs:
                    kpoints:  required  KpointsData       kpoint mesh or kpoint path
                 parameters:  required  Dict              The input parameters that are to be used to construct the input file.
                    pseudos:  required  UpfData, UpfData  A mapping of `UpfData` nodes onto the kind name to which they should apply.
                  structure:  required  StructureData     The input structure.
                       code:  optional  Code              The `Code` to use for this job. This input is required, unless the `remote_ ...
               hubbard_file:  optional  SinglefileData    SinglefileData node containing the output Hubbard parameters from a HpCalcu ...
                   metadata:  optional                    
            parallelization:  optional  Dict              Parallelization options. The following flags are allowed:
npool  : The numb ...
              parent_folder:  optional  RemoteData        An optional working directory of a previously completed calculation to rest ...
              remote_folder:  optional  RemoteData        Remote directory containing the results of an already completed calculation ...
                   settings:  optional  Dict              Optional parameters to affect the way the calculation job and the parsing a ...
                  vdw_table:  optional  SinglefileData    Optional van der Waals table contained in a `SinglefileData`.
Outputs:
          output_parameters:  required  Dict              The `output_parameters` output node of the successful calculation.
              remote_folder:  required  RemoteData        Input files necessary to run the process will be stored in this folder node ...
                  retrieved:  required  FolderData        Files that are retrieved by the daemon will be stored in this node. By defa ...
  output_atomic_occupations:  optional  Dict              
                output_band:  optional  BandsData         The `output_band` output node of the successful calculation if present.
             output_kpoints:  optional  KpointsData       
           output_structure:  optional  StructureData     The `output_structure` output node of the successful calculation if present ...
          output_trajectory:  optional  TrajectoryData    
               remote_stash:  optional  RemoteStashData   Contents of the `stash.source_list` option are stored in this remote folder ...
Exit codes:
                          1:  The process has failed with an unspecified error.
                          2:  The process failed with legacy failure mode.
                         10:  The process returned an invalid output.
                         11:  The process did not register a required output.
                        100:  The process did not have the required `retrieved` output.
                        110:  The job ran out of memory.
                        120:  The job ran out of walltime.
                        301:  The retrieved temporary folder could not be accessed.
                        302:  The retrieved folder did not contain the required stdout output file.
                        303:  The retrieved folder did not contain the required XML file.
                        304:  The retrieved folder contained multiple XML files.
                        305:  Both the stdout and XML output files could not be read or parsed.
                        310:  The stdout output file could not be read.
                        311:  The stdout output file could not be parsed.
                        312:  The stdout output file was incomplete probably because the calculation got interrupted.
                        320:  The XML output file could not be read.
                        321:  The XML output file could not be parsed.
                        322:  The XML output file has an unsupported format.
                        340:  The calculation stopped prematurely because it ran out of walltime but the job was killed by the scheduler before the files were safely written to disk for a potential restart.
                        350:  The parser raised an unexpected exception: {exception}
                        400:  The calculation stopped prematurely because it ran out of walltime.
                        410:  The electronic minimization cycle did not reach self-consistency.
                        461:  The code failed with negative dexx in the exchange calculation.
                        462:  The code failed during the cholesky factorization.
                        463:  Too many bands failed to converge during the diagonalization.
                        481:  The k-point parallelization "npools" is too high, some nodes have no k-points.
                        500:  The ionic minimization cycle did not converge for the given thresholds.
                        501:  Then ionic minimization cycle converged but the thresholds are exceeded in the final SCF.
                        502:  The ionic minimization cycle did not converge after the maximum number of steps.
                        510:  The electronic minimization cycle failed during an ionic minimization cycle.
                        511:  The ionic minimization cycle converged, but electronic convergence was not reached in the final SCF.
                        520:  The ionic minimization cycle terminated prematurely because of two consecutive failures in the BFGS algorithm.
                        521:  The ionic minimization cycle terminated prematurely because of two consecutive failures in the BFGS algorithm and electronic convergence failed in the final SCF.
                        531:  The electronic minimization cycle did not reach self-consistency.
                        541:  The variable cell optimization broke the symmetry of the k-points.
                        710:  The electronic minimization cycle did not reach self-consistency, but `scf_must_converge` is `False` and/or `electron_maxstep` is 0.

5.2. Error handling: `BaseRestartWorkChain`

Because automatically recovering from errors is such a common use case, aiida-core comes with an abstract base class that implements the required logic for doing so: the BaseRestartWorkChain. The full logic is shown below:

In short, the BaseRestartWorkChain checks the exit code of the process it is wrapping and, if the process failed, runs the corresponding error handler in case one is implemented. If a handler dealt with the problem, the calculation is restarted, up to a maximum number of times specified by the user. If no handler is implemented, it still tries to restart the calculation once (e.g. in case of node failures).
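The restart logic described above can be sketched in plain Python as follows. This is a simplified illustration, not the aiida-core implementation: `run_with_restarts`, `fake_process` and `reduce_mixing` are hypothetical names, and giving up after a second consecutive unhandled failure is this sketch's reading of "restart the calculation once".

```python
def run_with_restarts(run_process, inputs, handlers, max_iterations=5):
    """Run `run_process(inputs)` until it succeeds or the attempts are used up.

    `run_process` returns an integer exit status (0 means success).
    `handlers` maps an exit status to a function that adapts `inputs` in place.
    Returns the last exit status and the number of iterations that were run.
    """
    unhandled_failure = False

    for iteration in range(1, max_iterations + 1):
        status = run_process(inputs)

        if status == 0:
            return status, iteration

        handler = handlers.get(status)
        if handler is not None:
            handler(inputs)  # adapt the inputs, then restart
            unhandled_failure = False
        elif unhandled_failure:
            break  # second unhandled failure in a row: give up
        else:
            unhandled_failure = True  # no handler: retry once with the same inputs

    return status, iteration


def fake_process(inputs):
    """Pretend calculation that converges only for a small enough mixing value."""
    return 0 if inputs['mixing_beta'] < 0.35 else 410


def reduce_mixing(inputs):
    """Hypothetical handler: damp the mixing to help convergence."""
    inputs['mixing_beta'] *= 0.8


inputs = {'mixing_beta': 0.4}
status, iterations = run_with_restarts(fake_process, inputs, {410: reduce_mixing})
print(status, iterations)
```

Keeping the handlers in a mapping from exit status to function keeps the loop itself generic: supporting a new failure mode means adding an entry, not touching the loop.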

The BaseRestartWorkChain of the pw.x calculation is called the PwBaseWorkChain. Similar to the higher-level PwBandsWorkChain shown in the second section, it comes with a handy method for obtaining a fully populated builder based on a chosen protocol:

from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain

builder = PwBaseWorkChain.get_builder_from_protocol(
    code=data.code, 
    structure=data.structure,
    protocol="fast",
)

For the PwBaseWorkChain, the inputs of the pw.x calculation are available in the pw namespace. Let’s once again choose a very low value for the maximum number of electronic iterations, so that the calculation fails to converge electronically, and run the work chain:

builder.pw.parameters['ELECTRONS']['electron_maxstep'] = 6

from aiida.engine import run_get_node
result = run_get_node(builder)

Report: [101|PwBaseWorkChain|run_process]: launching PwCalculation<106> iteration #1
Error: ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED
Error: The electronic minimization cycle did not reach self-consistency.
Warning: output parser returned exit code<410>: The electronic minimization cycle did not reach self-consistency.
Report: [101|PwBaseWorkChain|report_error_handled]: PwCalculation<106> failed with exit status 410: The electronic minimization cycle did not reach self-consistency.
Report: [101|PwBaseWorkChain|report_error_handled]: Action taken: reduced beta mixing from 0.4 to 0.32000000000000006 and restarting from the last calculation
Report: [101|PwBaseWorkChain|inspect_process]: PwCalculation<106> failed but a handler dealt with the problem, restarting
Report: [101|PwBaseWorkChain|run_process]: launching PwCalculation<114> iteration #2
Report: [101|PwBaseWorkChain|results]: work chain completed after 2 iterations
Report: [101|PwBaseWorkChain|on_terminated]: cleaned remote folders of calculations: 106 114

Bingo! The PwBaseWorkChain does its intended job: after identifying the exit code of the failure of the pw.x calculation, it adapts its inputs and restarts the calculation. This is also visible in the hierarchical overview obtained using verdi process status:

%verdi process status {result.node.pk}

PwBaseWorkChain<101> Finished [0] [4:results]
    ├── create_kpoints_from_distance<102> Finished [0]
    ├── PwCalculation<106> Finished [410]
    └── PwCalculation<114> Finished [0]

In this case, adapting the charge density mixing as reported above is most likely not necessary, but hopefully this gives an idea of how a base restart work chain can help improve the robustness of your calculations.

The PwBaseWorkChain already has a whole set of error handlers implemented. As an example implementation, below you can see the code of the error handler that was called in our previous test run:

    @process_handler(priority=410, exit_codes=[
        PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED,
    ])
    def handle_electronic_convergence_not_reached(self, calculation):
        """Handle `ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED` error.
        Decrease the mixing beta and fully restart from the previous calculation.
        """
        factor = self.defaults.delta_factor_mixing_beta
        mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta)
        mixing_beta_new = mixing_beta * factor

        self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new
        action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation'

        self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)
        self.report_error_handled(calculation, action)
        return ProcessHandlerReport(True)

We won’t go too much into the details of the implementation here. Just note that the handler is implemented as a method on the PwBaseWorkChain, decorated with the process_handler decorator where the exit codes it attempts to fix are specified. The body of the method adapts the inputs of the calculation and reports that an error has been handled to the user.
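To illustrate the decorator pattern on its own, the sketch below registers a handler for an exit status in a plain dictionary. It is not the aiida-core `process_handler`: the names `register_handler` and `HANDLERS` are defined locally for this example, and the two constants are assumptions inferred from the report above, where the mixing beta went from 0.4 to 0.32.

```python
HANDLERS = {}  # exit status -> handler function (local registry for this sketch)


def register_handler(*exit_statuses):
    """Decorator that registers a function as the handler for the given exit statuses."""
    def decorator(func):
        for status in exit_statuses:
            HANDLERS[status] = func
        return func
    return decorator


DELTA_FACTOR_MIXING_BETA = 0.8  # assumed: inferred from 0.4 -> 0.32 in the report
DEFAULT_MIXING_BETA = 0.4       # assumed: starting value shown in the report


@register_handler(410)
def handle_electronic_convergence_not_reached(parameters):
    """Reduce the mixing beta and return a description of the action taken."""
    electrons = parameters.setdefault('ELECTRONS', {})
    mixing_beta = electrons.get('mixing_beta', DEFAULT_MIXING_BETA)
    mixing_beta_new = mixing_beta * DELTA_FACTOR_MIXING_BETA
    electrons['mixing_beta'] = mixing_beta_new
    return f'reduced beta mixing from {mixing_beta} to {mixing_beta_new}'


parameters = {'ELECTRONS': {'electron_maxstep': 6}}
action = HANDLERS[410](parameters)
print(action)
```

Declaring the exit statuses next to the handler keeps the "which error does this fix" information in one place, which is the same design idea as in the decorated method shown above.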

5.3. Transport and scheduler issues

In addition to issues with the calculations themselves, it’s also possible to suffer from transient issues, e.g. related to connecting to a remote resource or submitting to a scheduler. In section 3.8 we described how a calculation is run on a remote resource through AiiDA. If one of these steps fails, AiiDA will not simply give up on the calculation. Instead, it will use an exponential backoff mechanism with a certain number of retries, which is configurable by the user. After this number of attempts, AiiDA will pause the corresponding process:

Note

As these types of issues are transient and require us to submit the process, they are difficult to reproduce in this executable notebook. Hence we simply show some quick examples here.

$ verdi process list
  PK  Created    Process label     Process State    Process status
----  ---------  ----------------  ---------------  ------------------------------------------------------------------------------------
21m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 2076
21m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 1952
20m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 1909
20m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 1917
20m ago    PwCalculation     ⏸ Waiting        Pausing after failed transport task: update_calculation failed 5 times consecutively
20m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 1957
19m ago    PwCalculation     ⏸ Waiting        Pausing after failed transport task: stash_calculation failed 5 times consecutively
19m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 2044
19m ago    PwCalculation     ⏸ Waiting        Pausing after failed transport task: submit_calculation failed 5 times consecutively
19m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 2089
19m ago    PwCalculation     ⏸ Waiting        Pausing after failed transport task: upload_calculation failed 5 times consecutively

We can see several processes that have been paused at one of the steps AiiDA performs to run a calculation on the remote computer. Checking the report of one of the calculations indicates that there were issues authenticating to the remote cluster, and that the process was paused after 5 connection attempts:

$ verdi process report 2089
[...]
 | paramiko.ssh_exception.AuthenticationException: Authentication failed.
+-> WARNING at 2022-10-03 23:20:41.902383+00:00
 | maximum attempts 5 of calling do_upload, exceeded

Fortunately, once the connection issue has been resolved, AiiDA allows you to simply “play” the processes and continue the corresponding workflows without issue:

$ verdi process play -a
Success: played Process<2089>
Success: played Process<2044>
Success: played Process<1957>
Success: played Process<1917>

Once the processes are no longer paused, the daemon workers will pick them back up and continue running them:

$ verdi process list 
  PK  Created    Process label     Process State    Process status
----  ---------  ----------------  ---------------  ---------------------------------------
26m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 2076
26m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 1952
24m ago    PwRelaxWorkChain  ⏵ Waiting        Waiting for child processes: 1909
24m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 1917
24m ago    PwCalculation     ⏵ Waiting        Monitoring scheduler: job state RUNNING
24m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 1957
24m ago    PwCalculation     ⏵ Waiting        Waiting for transport task: stash
24m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 2044
24m ago    PwCalculation     ⏵ Waiting        Waiting for transport task: submit
24m ago    PwBaseWorkChain   ⏵ Waiting        Waiting for child processes: 2089
24m ago    PwCalculation     ⏵ Waiting        Waiting for transport task: upload
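The retry behaviour described in this section can be sketched generically as follows. This is an illustration of exponential backoff, not AiiDA's own code: `retry_with_backoff` and `TransportTaskFailed` are hypothetical names, the interval values are arbitrary, and the sketch raises an exception where AiiDA would instead pause the process.

```python
import time


class TransportTaskFailed(Exception):
    """Raised when every attempt of a task has failed (hypothetical name)."""


def retry_with_backoff(task, max_attempts=5, initial_interval=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds, doubling the waiting time after each failure."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exception:
            if attempt == max_attempts:
                raise TransportTaskFailed(
                    f'task failed {max_attempts} times consecutively'
                ) from exception
            sleep(interval)
            interval *= 2


attempts = []
waits = []


def flaky_upload():
    """Fail on the first three calls, then succeed."""
    attempts.append(1)
    if len(attempts) <= 3:
        raise ConnectionError('Authentication failed.')
    return 'uploaded'


# Record the waiting times in a list instead of actually sleeping.
result = retry_with_backoff(flaky_upload, sleep=waits.append)
print(result, waits)
```

Doubling the interval gives a transient problem, such as a brief network outage, progressively more time to clear while keeping the total number of connection attempts small.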
