Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CMIS cable] Xcvrd could crash while reseting CMIS cable #11525

Closed
keboliu opened this issue Jul 25, 2022 · 0 comments
Closed

[CMIS cable] Xcvrd could crash while reseting CMIS cable #11525

keboliu opened this issue Jul 25, 2022 · 0 comments
Assignees
Labels
Issue for 202205 Triaged this issue has been triaged

Comments

@keboliu
Copy link
Collaborator

keboliu commented Jul 25, 2022

Description

Recently when I perform community SFP reset(https://github.com/Azure/sonic-mgmt/blob/master/tests/platform_tests/sfp/test_sfputil.py#L90) against CMIS cables(with latest 202205 branch) I see xcvrd can crash at different position due to encountered into “NoneType” object and “KeyErrors”, I listed the traceback as below:

1. NoneType object in sonic_platform_base/sonic_xcvr/api/public/cmis.py

syslog:Jul 15 11:06:10.144401 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog:Jul 15 11:06:10.144481 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
syslog:Jul 15 11:06:10.145927 r-leopard-56 INFO pmon#supervisord: xcvrd     self.run()
syslog:Jul 15 11:06:10.145927 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/threading.py", line 892, in run
syslog:Jul 15 11:06:10.146121 r-leopard-56 INFO pmon#supervisord: xcvrd     self._target(*self._args, **self._kwargs)
syslog:Jul 15 11:06:10.146121 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1438, in task_worker
syslog:Jul 15 11:06:10.146450 r-leopard-56 INFO pmon#supervisord: xcvrd     post_port_dom_info_to_db(logical_port_name, self.port_mapping, self.xcvr_table_helper.get_dom_tbl(asic_index), self.task_stopping_event, dom_info_cache=dom_info_cache)
syslog:Jul 15 11:06:10.146450 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 503, in post_port_dom_info_to_db
syslog:Jul 15 11:06:10.146598 r-leopard-56 INFO pmon#supervisord: xcvrd     dom_info_dict = _wrapper_get_transceiver_dom_info(physical_port)
syslog:Jul 15 11:06:10.146598 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 152, in _wrapper_get_transceiver_dom_info
syslog:Jul 15 11:06:10.146710 r-leopard-56 INFO pmon#supervisord: xcvrd     return platform_chassis.get_sfp(physical_port).get_transceiver_bulk_status()
syslog:Jul 15 11:06:10.146710 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/sfp_optoe_base.py", line 28, in get_transceiver_bulk_status
syslog:Jul 15 11:06:10.146907 r-leopard-56 INFO pmon#supervisord: xcvrd     return api.get_transceiver_bulk_status() if api is not None else None
syslog:Jul 15 11:06:10.146907 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 206, in get_transceiver_bulk_status
syslog:Jul 15 11:06:10.147235 r-leopard-56 INFO pmon#supervisord: xcvrd     bulk_status['laser_temperature'] = laser_temp_dict['monitor value']
syslog:Jul 15 11:06:10.147355 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: 'NoneType' object is not subscriptable 

2. NoneType object in sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py

syslog:Jul 15 11:55:33.306925 r-leopard-56 INFO pmon#supervisord: xcvrd Exception in thread Thread-1:
syslog:Jul 15 11:55:33.307011 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog:Jul 15 11:55:33.307011 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
syslog:Jul 15 11:55:33.307742 r-leopard-56 INFO pmon#supervisord: xcvrd     self.run()
syslog:Jul 15 11:55:33.307742 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/threading.py", line 892, in run
syslog:Jul 15 11:55:33.308319 r-leopard-56 INFO pmon#supervisord: xcvrd     self._target(*self._args, **self._kwargs)
syslog:Jul 15 11:55:33.308319 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1438, in task_worker
syslog:Jul 15 11:55:33.309200 r-leopard-56 INFO pmon#supervisord: xcvrd     post_port_dom_info_to_db(logical_port_name, self.port_mapping, self.xcvr_table_helper.get_dom_tbl(asic_index), self.task_stopping_event, dom_info_cache=dom_info_cache)
syslog:Jul 15 11:55:33.309218 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 503, in post_port_dom_info_to_db
syslog:Jul 15 11:55:33.309688 r-leopard-56 INFO pmon#supervisord: xcvrd     dom_info_dict = _wrapper_get_transceiver_dom_info(physical_port)
syslog:Jul 15 11:55:33.309719 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 152, in _wrapper_get_transceiver_dom_info
syslog:Jul 15 11:55:33.310205 r-leopard-56 INFO pmon#supervisord: xcvrd     return platform_chassis.get_sfp(physical_port).get_transceiver_bulk_status()
syslog:Jul 15 11:55:33.310234 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/sfp_optoe_base.py", line 28, in get_transceiver_bulk_status
syslog:Jul 15 11:55:33.310258 r-leopard-56 INFO pmon#supervisord: xcvrd     return api.get_transceiver_bulk_status() if api is not None else None
syslog:Jul 15 11:55:33.310284 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 210, in get_transceiver_bulk_status
syslog:Jul 15 11:55:33.310293 r-leopard-56 INFO pmon#supervisord: xcvrd     self.vdm_dict = self.get_vdm()
syslog:Jul 15 11:55:33.310753 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 1021, in get_vdm
syslog:Jul 15 11:55:33.311200 r-leopard-56 INFO pmon#supervisord: xcvrd     vdm = self.vdm.get_vdm_allpage() if not self.is_flat_memory() else {}
syslog:Jul 15 11:55:33.311220 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 178, in get_vdm_allpage
syslog:Jul 15 11:55:33.312102 r-leopard-56 INFO pmon#supervisord: xcvrd     vdm_current_page = self.get_vdm_page(page, vdm_flag_page)
syslog:Jul 15 11:55:33.312133 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmisVDM.py", line 55, in get_vdm_page
syslog:Jul 15 11:55:33.312634 r-leopard-56 INFO pmon#supervisord: xcvrd     vdm_typeID = vdm_descriptor[1::2]
syslog:Jul 15 11:55:33.312662 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: 'NoneType' object is not subscriptable

3. Type error in XCVRD probably related to some key is not exist in some dict.

syslog.1:Jul 15 10:42:31.594198 r-leopard-56 INFO pmon#supervisord: xcvrd Process Process-2:
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     self.run()
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     self._target(*self._args, **self._kwargs)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1679, in task_worker
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     rc = post_port_sfp_info_to_db(logical_port, self.port_mapping, self.xcvr_table_helper.get_intf_tbl(asic_index), transceiver_dict)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 306, in post_port_sfp_info_to_db
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     fvs = swsscommon.FieldValuePairs(
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 229, in __init__
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     _swsscommon.FieldValuePairs_swiginit(self, _swsscommon.new_FieldValuePairs(*args))
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: Wrong number or type of arguments for overloaded function 'new_FieldValuePairs'.
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd   Possible C/C++ prototypes are:
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector()
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > > const &)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type,std::vector< std::pair< std::string,std::string > >::value_type const &)

4. After I fixed(WA) above failures I encountered into a new ‘NoneType’ Error:

Jul 21 12:06:19.059260 r-leopard-56 INFO pmon#supervisord: xcvrd Process Process-2:
Jul 21 12:06:19.062242 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
Jul 21 12:06:19.062443 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
Jul 21 12:06:19.062443 r-leopard-56 INFO pmon#supervisord: xcvrd     self.run()
Jul 21 12:06:19.062459 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
Jul 21 12:06:19.062459 r-leopard-56 INFO pmon#supervisord: xcvrd     self._target(*self._args, **self._kwargs)
Jul 21 12:06:19.062459 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1683, in task_worker
Jul 21 12:06:19.062459 r-leopard-56 INFO pmon#supervisord: xcvrd     rc = post_port_sfp_info_to_db(logical_port, self.port_mapping, self.xcvr_table_helper.get_intf_tbl(asic_index), transceiver_dict)
Jul 21 12:06:19.062475 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 300, in post_port_sfp_info_to_db
Jul 21 12:06:19.062475 r-leopard-56 INFO pmon#supervisord: xcvrd     port_info_dict = _wrapper_get_transceiver_info(physical_port)
Jul 21 12:06:19.062485 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 143, in _wrapper_get_transceiver_info
Jul 21 12:06:19.062485 r-leopard-56 INFO pmon#supervisord: xcvrd     return platform_chassis.get_sfp(physical_port).get_transceiver_info()
Jul 21 12:06:19.062509 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/sfp_optoe_base.py", line 24, in get_transceiver_info
Jul 21 12:06:19.062509 r-leopard-56 INFO pmon#supervisord: xcvrd     return api.get_transceiver_info() if api is not None else None
Jul 21 12:06:19.062509 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 149, in get_transceiver_info
Jul 21 12:06:19.062524 r-leopard-56 INFO pmon#supervisord: xcvrd     appl_advt = self.get_application_advertisement()
Jul 21 12:06:19.062524 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/sonic_platform_base/sonic_xcvr/api/public/cmis.py", line 1848, in get_application_advertisement
Jul 21 12:06:19.062524 r-leopard-56 INFO pmon#supervisord: xcvrd     dic.update(self.xcvr_eeprom.read(consts.APPLS_ADVT_FIELD_PAGE01))
Jul 21 12:06:19.062534 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: 'NoneType' object is not iterable

I suspect that after the CMIS cable reseted, the eeprom is not ready yet but XCVRD is keep reading them periodically, so none type object returned, and cause XCVRD crash, I feel that the CMIS cable decode code is not rubust enough.

Steps to reproduce the issue:

  1. perform sfputil reset test on platforms with CMIS cables

Describe the results you received:

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue for 202205 Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

4 participants