Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mellanox] thermal control enhancement for dynamic minimum fan speed and PSU fan speed policy #4403

Merged
merged 7 commits into from
Apr 21, 2020

Conversation

Junchao-Mellanox
Copy link
Collaborator

@Junchao-Mellanox Junchao-Mellanox commented Apr 9, 2020

- What I did

  1. Support dynamic minimum fan speed policy. Before this change, the default fan speed is 60% in Mellanox platform which means that the fan speed will be set to 60% if all fans work well. But 60% might be too high for cold environment which consumes a lot power and makes noise. The idea of dynamic minimum fan speed is to set minimum fan speed dynamically according to thermal zone temperature, air flow direction. The dynamic minimum fan speed policy check air flow direction and thermal zone temperature every 60 seconds and determine a proper minimum fan speed according to a "minimum table". Set fan speed to lower than minimum fan speed is not allowed.

  2. Support PSU fan speed policy. Before this change, the PSU fan speed is a fix value in Mellanox platform. It means that the PSU fan speed won't be adjusted according to the system's thermal zone status. The PSU fan speed policy checks the system cooling level periodically and set PSU fan speed according to it.

  3. Adjust the current "all fan and psu presence" policy. Before this change, when all fan and psu work well, this policy just set fan speed to 60% and enable kernel thermal algorithm. Within this change, the policy will check all thermal zones and make sure all thermal zones return to their normal state before setting fan speed to 60%.

  4. Add a fan fault policy. If any fan in fault state, set other fan speed to 100%.

- How I did it

  1. Add a private policy for dynamic minimum fan speed. The policy contains a new condition to detect that if thermal zone temperature or air flow direction changes. The policy also contains a new action to set the minimum fan speed.

  2. Add a private policy for PSU fan speed. The policy contains a new condition to check that if system cooling level changes. The policy also contains a new action to set PSU fan speed via i2c tool.

- How to verify it

Manual test and regression test. Added two test cases in sonic-net/sonic-mgmt#1552.

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@lgtm-com
Copy link

lgtm-com bot commented Apr 9, 2020

This pull request introduces 1 alert when merging 3b77d71 into 768d18d - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@Junchao-Mellanox
Copy link
Collaborator Author

retest vsimage please

@Junchao-Mellanox
Copy link
Collaborator Author

retest vsimage please

@@ -0,0 +1,134 @@
DEVICE_DATA = {
'ACS-MSN2700': {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like for the different SKU which are different flavour of the same Platform type we need to define the same table over and over. can you please find a better way to do so?
for example: ACS-MSN2700, LS-SN2700, Mellanox-SN2700, Mellanox-SN2700-D48C8, etc.
soon when we will have a dynamic SKU creator we will not be able to support it thus it cannot be based on SKU.
we have the same limitation in sfputil and this is wrong.

}
}
},
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing SN4700 which is also supported.

Copy link
Collaborator Author

@Junchao-Mellanox Junchao-Mellanox Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have minimum table for 4700. We set minimum FAN speed to 60% for 4700 or any other platform who has no minimum table.

@@ -22,17 +23,24 @@

FAN_PATH = "/var/run/hw-management/thermal/"
LED_PATH = "/var/run/hw-management/led/"
CONFIG_PATH = "/var/run/hw-management/config"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we have this config path on different files. is this correct to have it this way instead of having it inherited?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct. The config path here is a directory and I only need it to find some i2c address related to PSU fan. I don't see difference among differenct platforms.

# fan_dir isn't supported on Spectrum 1. It is supported on Spectrum 2 and later switches
FAN_DIR = "/var/run/hw-management/system/fan_dir"
COOLING_STATE_PATH = "/var/run/hw-management/thermal/cooling_cur_state"

# SKUs with unplugable FANs:
# 1. don't have fanX_status and should be treated as always present
hwsku_dict_with_unplugable_fan = ['ACS-MSN2010', 'ACS-MSN2100']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, should not be SKU but rather a system type

if not isinstance(level, int):
raise RuntimeError("Failed to set cooling level, input parameter must be integer")

if level < 1 or level > 10:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to have min, max defines and use it for both if case and prints.

THERMAL_ZONE_TEMPERATURE = "thermal_zone_temp"
THERMAL_ZONE_NORMAL_TEMPERATURE = "temp_trip_norm"

MODULE_COUNTER_PATH = "/var/run/hw-management/config/module_counter"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you are using it?

THERMAL_ZONE_NORMAL_TEMPERATURE = "temp_trip_norm"

MODULE_COUNTER_PATH = "/var/run/hw-management/config/module_counter"
GEARBOX_COUNTER_PATH = "/var/run/hw-management/config/gearbox_counter"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you are using it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.



@classmethod
def _write_generic_file(cls, filename, content):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the purpose of the generic file? can you please add more description for it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated description.

logger.log_info("Fail to write file {} due to {}".format(filename, repr(e)))

@classmethod
def set_thermal_algorithm_status(cls, status, force=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the behaviour expected when disabled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When disable the thermal algorithm, kernel no longer adjust the FAN speed according to thermal zone temperature. We usually disable the thermal algorithm when FAN absence or PSU absence. If a fan unit or PSU is removed from switch, there will be a "air hole" at switch, we need set fan speed to 100% and prevent thermal algorithm adjust it.

@@ -159,6 +185,44 @@ def test_all_fan_presence_condition():
fan_info.collect(chassis)
assert condition.is_match({'fan_info': fan_info})

def test_any_fan_fault_condition():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you consider to have a test file for the tests instead of the general code used for production? not must but i think it is better

@jleveque
Copy link
Contributor

Please fix conflicts

@jleveque jleveque merged commit c730f3e into sonic-net:master Apr 21, 2020
@Junchao-Mellanox Junchao-Mellanox deleted the thermal-enhance branch May 7, 2020 06:10
@abdosi
Copy link
Contributor

abdosi commented May 28, 2020

@Junchao-Mellanox Please create PR for 201911. There is merge conflict.
Removing Request for 201911 Label

@rlhui @liat-grozovik

Junchao-Mellanox added a commit to Junchao-Mellanox/sonic-buildimage that referenced this pull request Jun 1, 2020
…and PSU fan speed policy (sonic-net#4403)

Conflicts:
	platform/mellanox/mlnx-platform-api/sonic_platform/fan.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants