{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Error Handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{admonition} Learning Objectives\n", ":class: learning-objectives\n", "\n", "In this section we will discuss calculation failures in AiiDA, how they are reported and how to deal with them automatically via error handlers.\n", "\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As anyone doing computational work is well aware, calculations can fail for a variety of reasons.\n", "If you're only running a few calculations manually, typically you would:\n", "\n", "1. Check the outputs of the calculation to figure out why the calculation failed.\n", "2. Adapt the inputs of the calculation to try and remedy the problem.\n", "\n", "However, when running many calculations in high-throughput, this process needs to be automated as well.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "from local_module import load_temp_profile\n", "\n", "data = load_temp_profile(\n", " name=\"error_handling\",\n", " add_computer=True,\n", " add_pw_code=True,\n", " add_sssp=True,\n", " add_structure_si=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exit codes\n", "\n", "Exit codes in AiiDA are used to clearly communicate how a process terminated. They consist of two parts: a positive integer, called the _exit status_, and a message giving more detail, also called the _exit message_.\n", "If the exit status is zero, which is the default, the process is said to have terminated nominally and finished successfully.\n", "A non-zero exit status is often used to communicate that there was some kind of a problem during the execution of the process and in that case it is said to be failed.\n", "\n", "To see this in action, we'll once again run a Quantum ESPRESSO `pw.x` calculation, but this time adapt the inputs so the calculation will fail to converge electronically.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from local_module.pw_builder import get_pw_builder\n", "\n", "pw_builder = get_pw_builder(data.code, data.structure, 'fast')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simulate a failed calculation, we'll reduce the number of electronic steps to only 6 and run the calculation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pw_builder.parameters['ELECTRONS']['electron_maxstep'] = 6\n", "\n", "from aiida.engine import run_get_node\n", "\n", "result = run_get_node(pw_builder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, running the calculation with this input fails, and the AiiDA parser for `pw.x` found that the electronic minimization cycle failed to reach self-consistency.\n", "The exit status and message are also stored on the calcjob node:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(result.node.exit_status, result.node.exit_message)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see the full list of exit codes that are defined for the `PwCalculation` using `verdi plugin list`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "%verdi plugin list aiida.calculations quantumespresso.pw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Error handling: `BaseRestartWorkChain`\n", "\n", "Because automatically recovering from errors is such a common use case, `aiida-core` comes with an abstract base class that implements the required logic for doing so: the `BaseRestartWorkChain`.\n", "The full logic is shown below:\n", "\n", "![base restart flow](_static/aiida/flow_base_restart.png){height=300px align=center}\n", "\n", "In short, the `BaseRestartWorkChain`, checks the exit code of the process it is wrapping and runs a corresponding error handler in case one is implemented.\n", "If so, the calculation is restarted up to a number of times specified by the user.\n", "If no handler is implemented, it still tries to restart the calculation once (e.g. in case of node failures).\n", "\n", "The `BaseRestartWorkChain` of the `pw.x` calculation is called the `PwBaseWorkChain`.\n", "Similar to the higher-level `PwBandsWorkChain` shown in the second section, it comes with a handy method for obtaining a fully populated builder based on a chosen protocol:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain\n", "\n", "builder = PwBaseWorkChain.get_builder_from_protocol(\n", " code=data.code, \n", " structure=data.structure,\n", " protocol=\"fast\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the `PwBaseWorkChain`, the inputs of the `pw.x` calculation are available in the `pw` namespace.\n", "Let's once again choose a very low value for the maximum number of electronic iterations so the calculation fails to converge electronically and run the work chain:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder.pw.parameters['ELECTRONS']['electron_maxstep'] = 6\n", "\n", "from aiida.engine import run_get_node\n", "result = run_get_node(builder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bingo!\n", "The `PwBaseWorkChain` does its intended job: after identifying the exit code of the failure of the `pw.x` calculation, it adapts its inputs and restarts the calculation.\n", "This is also visible in the hierarchical overview obtained using `verdi process status`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%verdi process status {result.node.pk}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, adapting the charge density mixing as is reported above is most likely not necessary, but hopefully this gives an idea of how a base restart work chain can help improve the robustness of your calculations. \n", "\n", "The `PwBaseWorkChain` already has a whole set of error handlers implemented.\n", "As an example implementation, below you can see the code of the error handler that was called in our previous test run:\n", "\n", "```python\n", "\n", " @process_handler(priority=410, exit_codes=[\n", " PwCalculation.exit_codes.ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED,\n", " ])\n", " def handle_electronic_convergence_not_reached(self, calculation):\n", " \"\"\"Handle `ERROR_ELECTRONIC_CONVERGENCE_NOT_REACHED` error.\n", " Decrease the mixing beta and fully restart from the previous calculation.\n", " \"\"\"\n", " factor = self.defaults.delta_factor_mixing_beta\n", " mixing_beta = self.ctx.inputs.parameters.get('ELECTRONS', {}).get('mixing_beta', self.defaults.qe.mixing_beta)\n", " mixing_beta_new = mixing_beta * factor\n", "\n", " self.ctx.inputs.parameters['ELECTRONS']['mixing_beta'] = mixing_beta_new\n", " action = f'reduced beta mixing from {mixing_beta} to {mixing_beta_new} and restarting from the last calculation'\n", "\n", " self.set_restart_type(RestartType.FULL, calculation.outputs.remote_folder)\n", " self.report_error_handled(calculation, action)\n", " return ProcessHandlerReport(True)\n", "```\n", "\n", "We won't go too much into the details of the implementation here.\n", "Just note that the handler is implemented as a method on the `PwBaseWorkChain`, decorated with the `process_handler` decorator where the exit codes it attempts to fix are specified.\n", "The body of the method adapts the inputs of the calculation and reports that an error has been handled to the user." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transport and scheduler issues\n", "\n", "Next to issues with the calculations, it's also possible to suffer from transient issues e.g. related to connecting to a remote resources or submitting to a scheduler.\n", "In section 3.8 we described how a calculation is run on a remote resource through AiiDA.\n", "If one of these steps fails, AiiDA will not simply give up on the calculation.\n", "Instead, it will use an exponential backof mechanism with a certain number of retries, which is configurable by the user.\n", "After this number of attempts, AiiDA will pause the corresponding process:\n", "\n", ":::{note}\n", "\n", "As these types of issues are transient and require us to submit the process, they are difficult to reproduce in this executable notebook.\n", "Hence we simply show some quick examples here.\n", "\n", ":::\n", "\n", "```console\n", "$ verdi process list\n", " PK Created Process label Process State Process status\n", "---- --------- ---------------- --------------- ------------------------------------------------------------------------------------\n", "1467 21m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 2076\n", "1595 21m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1952\n", "1904 20m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1909\n", "1909 20m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1917\n", "1917 20m ago PwCalculation ⏸ Waiting Pausing after failed transport task: update_calculation failed 5 times consecutively\n", "1952 20m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1957\n", "1957 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: stash_calculation failed 5 times consecutively\n", "2039 19m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2044\n", "2044 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: submit_calculation failed 5 times consecutively\n", "2076 19m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2089\n", "2089 19m ago PwCalculation ⏸ Waiting Pausing after failed transport task: upload_calculation failed 5 times consecutively\n", "```\n", "\n", "We can see several processes which have been paused in one of the steps of how AiiDA runs a calculation on the remote computer.\n", "Checking the report of one of the calculations indicates that there were issues authenticating to the remote cluster, and that the process was paused after 5 connection attempts:\n", "\n", "```console\n", "$ verdi process report 2089\n", "[...]\n", " | paramiko.ssh_exception.AuthenticationException: Authentication failed.\n", "+-> WARNING at 2022-10-03 23:20:41.902383+00:00\n", " | maximum attempts 5 of calling do_upload, exceeded\n", "```\n", "\n", "Fortunately, once the connection issue has been resolved, AiiDA allows you to simply \"play\" the processes and continue the corresponding workflows without issue:\n", "\n", "```console\n", "$ verdi process play -a\n", "Success: played Process<2089>\n", "Success: played Process<2044>\n", "Success: played Process<1957>\n", "Success: played Process<1917>\n", "```\n", "\n", "Once the processes are no longer paused, the daemon workers will pick them back up and continue running them:\n", "\n", "```console\n", "$ verdi process list \n", " PK Created Process label Process State Process status\n", "---- --------- ---------------- --------------- ---------------------------------------\n", "1467 26m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 2076\n", "1595 26m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1952\n", "1904 24m ago PwRelaxWorkChain ⏵ Waiting Waiting for child processes: 1909\n", "1909 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1917\n", "1917 24m ago PwCalculation ⏵ Waiting Monitoring scheduler: job state RUNNING\n", "1952 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 1957\n", "1957 24m ago PwCalculation ⏵ Waiting Waiting for transport task: stash\n", "2039 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2044\n", "2044 24m ago PwCalculation ⏵ Waiting Waiting for transport task: submit\n", "2076 24m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 2089\n", "2089 24m ago PwCalculation ⏵ Waiting Waiting for transport task: upload\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 ('aiida-qe-demo')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "20c30adb377910d9d5c8112cf74e9f7ecb37538254a701570d770f074373c53e" } } }, "nbformat": 4, "nbformat_minor": 2 }