Configure ironic to use Ipmitool retries #172

Merged
merged 3 commits into from
Jul 8, 2020


maelk
Member

@maelk maelk commented Jul 8, 2020

Without this configuration, the ipmitool timeout is 1 second. This is too short
for vbmc. This commit uses the ipmitool retry feature and extends the
timeout.

This PR also sets console=ttyS0 in the IPA kernel parameters to gather IPA logs
in metal3-dev-env.

@maelk
Member Author

maelk commented Jul 8, 2020

/test-integration

@metal3-io-bot metal3-io-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 8, 2020
@metal3-io-bot metal3-io-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 8, 2020
@maelk
Member Author

maelk commented Jul 8, 2020

/test-integration

@dtantsur
Member

dtantsur commented Jul 8, 2020

Without this configuration, the ipmitool timeout is 1 second.

Why do you think so? The -R and -N option specify the retry number and the delay between retries. The global ironic retries and timeout still apply (unless we have a bug, obviously).

@maelk
Member Author

maelk commented Jul 8, 2020

If Ironic is handling the retries and timeouts, it sets the -R and -N options to 1 (see https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/ipmitool.py#L483 ). In that case the internal ipmitool timeout is 1 second. I agree that ironic would retry later on, but in metal3-dev-env this 1s timeout is too short for a lot of operations. Letting ipmitool handle the timeout and retries allows us to wait longer for the vbmc answer.
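
To make the two modes concrete, here is a rough sketch of how the -R/-N arguments could be chosen. It is not the code at the link above: `build_retry_args` and its parameters are hypothetical names, the 60s/5s defaults are placeholders, and the way the retry count is derived from the timeout budget is an assumption. Only the "-R 1 -N 1" branch comes from what is described in this comment.

```python
def build_retry_args(use_ipmitool_retries, command_retry_timeout=60,
                     min_command_interval=5):
    """Return illustrative -R/-N arguments for an ipmitool command line.

    Hypothetical helper, not ironic's implementation.
    """
    if use_ipmitool_retries:
        # Assumption: give ipmitool enough retries to cover the whole
        # timeout budget, spaced min_command_interval seconds apart.
        num_tries = max(1, command_retry_timeout // min_command_interval)
        return ["-R", str(num_tries), "-N", str(min_command_interval)]
    # Ironic retries on its own, so ipmitool gets one try with a 1s interval.
    return ["-R", "1", "-N", "1"]


print(build_retry_args(False))
print(build_retry_args(True))
```

With the second form ipmitool itself keeps waiting and retransmitting, instead of exiting after the first second and leaving the retry to ironic.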

maelk added 3 commits July 8, 2020 17:48
Without this configuration, the ipmitool timeout is 1 second. This is too short
for vbmc. This commit uses the ipmitool retry feature and extends the timeout
to prevent errors such as
```
Suspicious activity detected for node...
when attempting to heartbeat. Heartbeat request has been rejected as the
version of ironic-python-agent indicated in the heartbeat operation
should support agent token functionality.
```
@maelk
Member Author

maelk commented Jul 8, 2020

/test-integration

@dtantsur
Member

dtantsur commented Jul 8, 2020

Mmm, I think I understand what you're getting at. The retransmission rate is too fast? Do you have a link to an example where you hit this?

@bfournie FYI, it may be a problem with your patch.

@maelk
Member Author

maelk commented Jul 8, 2020

The issue is rather that ipmitool does not wait long enough before giving up when running with -N 1. When you configure the retries to be handled by ironic, the interval between retries is set properly, but the timeout is forced to 1 second, i.e. you send a request, fail after 1s, and wait 4s before sending the next request. That's the problematic part for metal3-dev-env.
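
A small sketch of why more retries do not help when each attempt gives up too early. The 1s wait and 4s gap are the numbers from this comment; the reply latencies are made-up values standing in for a slow vbmc, and both helper names are hypothetical.

```python
def listening_fraction(reply_wait_s, idle_gap_s):
    """Fraction of each retry cycle spent actually waiting for a reply."""
    return reply_wait_s / (reply_wait_s + idle_gap_s)


def first_successful_attempt(reply_latencies_s, reply_wait_s):
    """Return the 1-based attempt that gets its reply in time, else None."""
    for attempt, latency in enumerate(reply_latencies_s, start=1):
        if latency <= reply_wait_s:
            return attempt
    return None


# Made-up latencies (seconds) for three consecutive requests to a slow vbmc.
slow_vbmc = [2.5, 1.8, 3.0]

print(listening_fraction(reply_wait_s=1, idle_gap_s=4))
print(first_successful_attempt(slow_vbmc, reply_wait_s=1))
print(first_successful_attempt(slow_vbmc, reply_wait_s=5))
```

With a 1s wait every attempt in this made-up sequence times out, while a longer wait lets the first request succeed without any retry.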

@maelk
Member Author

maelk commented Jul 8, 2020

/assign @russellb
/cc @dtantsur
/cc @dhellmann

@maelk
Member Author

maelk commented Jul 8, 2020

This PR fixes the error that is otherwise visible in the ironic logs:

2020-07-08 14:44:57.013 119 DEBUG oslo_concurrency.processutils [req-bbc52505-10b7-42d0-9d29-1b1d7675eecc - - - - -] 'ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6230 -U admin -R 1 -N 1 -f /tmp/tmpwchdt9mc power status' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:455
2020-07-08 14:44:57.015 119 WARNING ironic.drivers.modules.ipmitool [req-bbc52505-10b7-42d0-9d29-1b1d7675eecc - - - - -] IPMI Error encountered, retrying "ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6230 -U admin -R 1 -N 1 -f /tmp/tmpwchdt9mc power status" for node 99970ccd-4deb-4d48-b624-77bc60b26bc0. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6230 -U admin -R 1 -N 1 -f /tmp/tmpwchdt9mc power status
Exit code: 1
Stdout: ''
Stderr: 'Unable to get Chassis Power Status\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

Ironic usually retries, but sometimes all the retries fail. Waiting a bit longer on each attempt removes the need for retries. The ironic logs for an example of a failed CI run can be found here: https://jenkins.nordix.org/view/Metal3/job/airship_master_v1a3_integration_test_ubuntu/243/artifact/logs-jenkins-airship_master_v1a3_integration_test_ubuntu-243.tgz (in the docker folder)
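
When going through logs like these, a quick way to check which -R/-N values ironic actually passed is to parse the logged command line. This is only a convenience sketch: `retry_options` is a hypothetical helper, and the command string is copied from the log excerpt above.

```python
import shlex


def retry_options(command):
    """Extract the -R and -N values from a logged ipmitool command line."""
    tokens = shlex.split(command)
    options = {}
    for flag in ("-R", "-N"):
        if flag in tokens:
            idx = tokens.index(flag)
            # Only read a value if one actually follows the flag.
            if idx + 1 < len(tokens):
                options[flag] = int(tokens[idx + 1])
    return options


cmd = ("ipmitool -I lanplus -H 192.168.111.1 -L ADMINISTRATOR -p 6230 "
       "-U admin -R 1 -N 1 -f /tmp/tmpwchdt9mc power status")
print(retry_options(cmd))
```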

@maelk
Member Author

maelk commented Jul 8, 2020

Member

@dhellmann dhellmann left a comment

The description makes sense. Let's see if these changes make CI more stable.

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, maelk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2020
@dhellmann
Member

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

5 participants