[Auto-Techsupport] Issues related to Multiple Cores crashing handled #1948

Merged
merged 6 commits into from
Dec 6, 2021

@vivekrnv vivekrnv commented Nov 24, 2021

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

What I did

Fixed two issues seen when multiple cores crash in very quick succession:

  1. The rate_limit_interval was not honored. The last created techsupport dump was previously located with the glob pattern sonic_dump_*tar*, which misses dumps that are still being generated, because an in-progress dump does not yet have the .tar.gz extension. get_ts_dumps now searches on the TS_ROOT pattern instead, i.e. sonic_dump_*.
  2. show auto-techsupport history did not show all the created dumps. Previously, the list of techsupport dumps was compared before and after the invocation, and the difference was recorded as the techsupport for that core. This approach is prone to a race condition, because the difference can contain multiple dumps created in the same interval.
    This is avoided by parsing the stdout returned by the show techsupport invocation.
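
The two fixes above can be illustrated with a minimal sketch. The directory entries, the sample stdout, the regular expression `ASSUMED_TS_PTRN`, and the helper names `match_dumps` and `parse_ts_dump_path` are all assumptions made for illustration; they are not copied from the PR, whose actual `TS_PTRN` and `get_ts_dumps` may differ.

```python
import fnmatch
import re

# Hypothetical contents of the dump directory: one finished dump and one
# that is still being collected (no .tar.gz extension yet).
entries = [
    "sonic_dump_host_20211124_101500.tar.gz",
    "sonic_dump_host_20211124_101530",
]

OLD_GLOB = "sonic_dump_*tar*"  # pattern described as used before the change
NEW_GLOB = "sonic_dump_*"      # TS_ROOT-based pattern


def match_dumps(names, pattern):
    """Return the names matching a shell-style pattern, sorted."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


old_matches = match_dumps(entries, OLD_GLOB)  # misses the in-progress dump
new_matches = match_dumps(entries, NEW_GLOB)  # sees both dumps

# Assumed pattern for a dump path printed by the techsupport run.
ASSUMED_TS_PTRN = re.compile(r"/var/dump/sonic_dump_[\w.\-]+\.tar(?:\.gz)?")


def parse_ts_dump_path(ts_stdout):
    """Return the last dump path found in stdout, or "" if none is found."""
    matches = ASSUMED_TS_PTRN.findall(ts_stdout)
    if matches:
        return matches[-1]
    return ""


# Made-up stdout; the real output of show techsupport may look different.
sample_stdout = (
    "collecting data...\n"
    "/var/dump/sonic_dump_host_20211124_101530.tar.gz\n"
)
```

Taking the dump name from the invocation's own stdout ties each core to exactly one dump, whereas a before/after directory diff can pick up dumps started by other concurrent invocations.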

How I did it

How to verify it

  1. Run the unit tests.
  2. Generate core dumps in very quick succession with the default rate limit interval. Only one entry should appear in the techsupport history.
  3. Set the global rate limit interval to 0 and generate cores in quick succession. Several entries should appear in the history.
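
The expected outcome of steps 2 and 3 can be reasoned about with a simplified model of rate limiting. This is only a sketch: `is_rate_limited`, `simulate`, and the placeholder value `ASSUMED_INTERVAL` are hypothetical and do not reproduce the actual logic in the PR.

```python
ASSUMED_INTERVAL = 180  # placeholder value in seconds, for illustration only


def is_rate_limited(last_dump_time, now, rate_limit_interval):
    """Return True if a new techsupport should be skipped."""
    if rate_limit_interval <= 0 or last_dump_time is None:
        return False
    return (now - last_dump_time) < rate_limit_interval


def simulate(core_times, rate_limit_interval):
    """Return the times at which a techsupport would be started."""
    started = []
    for t in core_times:
        last = started[-1] if started else None
        if not is_rate_limited(last, t, rate_limit_interval):
            started.append(t)
    return started


# Three cores arriving one second apart.
with_limit = simulate([0, 1, 2], ASSUMED_INTERVAL)
without_limit = simulate([0, 1, 2], 0)
```

In this model a non-zero interval yields a single invocation for the burst, while an interval of 0 yields one invocation per core, which matches what the verification steps expect to see in the history.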

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@vivekrnv
Contributor Author

@ganglyu @qiluo-msft Please help review

@vivekrnv
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

    matches = re.findall(TS_PTRN, ts_stdout)
    if matches:
        return matches[-1]
    else:
Contributor

else is not necessary here.

    if "show techsupport --since '2 days ago'" in cmd_str:
        patcher.fs.create_file("/var/dump/sonic_dump_random3.tar.gz")
        return 0, "", ""
    print(cmd_str)
Contributor

Is this print used for debugging?

Contributor Author

Yes, will remove.

Contributor

@ganglyu ganglyu left a comment

lgtm

@qiluo-msft qiluo-msft merged commit 6de91af into sonic-net:master Dec 6, 2021
@vivekrnv vivekrnv deleted the event_driv_ts_bug branch December 6, 2021 17:00
judyjoseph pushed a commit that referenced this pull request Jan 9, 2022
…1948)
