-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crashing #283
Conversation
Signed-off-by: Kebo Liu <kebol@nvidia.com>
@prgeor this is critical fix for 202205. can you please approve? |
('supported_min_laser_freq', str(port_info_dict['supported_min_laser_freq']) | ||
if 'supported_min_laser_freq' in port_info_dict else 'N/A') | ||
]) | ||
except (KeyError, TypeError) as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?
This try to fix below crash
syslog.1:Jul 15 10:42:31.594198 r-leopard-56 INFO pmon#supervisord: xcvrd Process Process-2:
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd self.run()
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd self._target(*self._args, **self._kwargs)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1679, in task_worker
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd rc = post_port_sfp_info_to_db(logical_port, self.port_mapping, self.xcvr_table_helper.get_intf_tbl(asic_index), transceiver_dict)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 306, in post_port_sfp_info_to_db
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd fvs = swsscommon.FieldValuePairs(
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 229, in __init__
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd _swsscommon.FieldValuePairs_swiginit(self, _swsscommon.new_FieldValuePairs(*args))
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: Wrong number or type of arguments for overloaded function 'new_FieldValuePairs'.
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd Possible C/C++ prototypes are:
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector()
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > > const &)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type,std::vector< std::pair< std::string,std::string > >::value_type const &)
from this backtrace, I suppose at least one of the port_info_dict[] failed to fetch the value, either because of key error, or because none object of port_info_dict. I add the 'try' ... catch try to see what exactly the exception is, but the crash will disappear and no exception caught, if I remove the try..,catch I can see it again, I am not able to know the exact exception, so I have to add both of the exceptions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DOM table will be update with wrong values and user won't come to know this (i don't see anyone check the log unless any issue). Can we dig more which filed is causing the failure? I guess this is something to do with a specific optics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect type
and vendor_rev
with change ('type', port_info_dict['type'] if type in port_info_dict else 'N/A')
and ('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A')
, seems I am not able to reproduce the crash anymore. So I would suggest catching KeyError
, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect
type
andvendor_rev
with change('type', port_info_dict['type'] if type in port_info_dict else 'N/A')
and('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A')
, seems I am not able to reproduce the crash anymore. So I would suggest catchingKeyError
, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?
@keboliu lets proceed with catching KeyError and also print the error message on keyError to know which field is giving error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keboliu do you mind using dictionary's get() method for each of these fields,? That would make the code more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see some more failures during the test, and need some more time to investigate.
close this PR and open sonic-net/sonic-platform-common#305 to fix the crash |
Signed-off-by: Kebo Liu kebol@nvidia.com
Description
Catch TypeError and KeyError exception to protect xcvrd from crashing
Motivation and Context
This is to fix a crash issue: sonic-net/sonic-buildimage#11525
How Has This Been Tested?
run sonic-mgmt sfp reset test case: platform_tests/sfp/test_sfputil.py::test_check_sfputil_reset to make sure xcvrd will not crash
Additional Information (Optional)