ExaBGP process get stuck with rapid API updates #1141

Closed
newrushbolt opened this issue Feb 27, 2023 · 5 comments

@newrushbolt

Describe the bug

The main ExaBGP process gets stuck if the API process flaps frequently between "announce" and "withdraw".
With a flap interval of 500 ms, it takes about 5 minutes.
Once stuck, the main process never gets past this one syscall:

strace -f -p '870194'
strace: Process 870194 attached
write(5, "neighbor 10.89.1.3 send update s"..., 129

The last log messages look like this:

processes       route added to neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route removed from neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self
processes       route added to neighbor 10.89.1.3 local-ip 10.89.1.2 local-as 65316 peer-as 65318 router-id 10.89.1.2 family-allowed in-open : 172.31.0.89/32 next-hop self

To Reproduce

I uploaded a complete test environment to https://github.com/newrushbolt/exabgp_stuck_example, please have a look.

But the key steps are:

  • Start ExaBGP with a healthcheck-style API process that emits withdraw x.x.x.x or announce x.x.x.x.
  • Make sure that the BGP session using this API process is up and running.
  • Make the API process "flap" between "announce" and "withdraw" once a second.
  • Wait 5 to 15 minutes.
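
For illustration, a minimal sketch of such a flapping API process. It is not the check.py from the test repo: the function name `flap`, its parameters, and the exact command text (`announce route ... next-hop self`, modelled on the prefix in the log above) are assumptions. The important property is that it only writes to stdout and never reads stdin.

```python
import sys
import time


def flap(prefix="172.31.0.89/32", interval=1.0, cycles=None, out=sys.stdout):
    """Alternate announce/withdraw commands on `out` without ever reading stdin.

    `cycles=None` loops forever, which is what the reproduction needs;
    a finite value is only there to make the sketch easy to try out.
    """
    sent = 0
    while cycles is None or sent < cycles:
        # Even iterations announce, odd iterations withdraw.
        verb = "announce" if sent % 2 == 0 else "withdraw"
        out.write(f"{verb} route {prefix} next-hop self\n")
        out.flush()  # ExaBGP reads the command from this process's stdout
        sent += 1
        if cycles is None or sent < cycles:
            time.sleep(interval)
    return sent
```

Calling `flap()` with the defaults at the end of the script flaps once a second indefinitely; `flap(interval=0.5)` would match the 500 ms interval mentioned in the description.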

Expected behavior

The ExaBGP daemon keeps running no matter how many times the healthcheck process flaps between "announce" and "withdraw".

Environment (please complete the following information):

  • OS: CentOS Stream 8 with kernel 4.18.0-408.el8.x86_64
  • Version: 4.2.21 or main (tested on commit 1331435)
  • Python: 3.6, 3.8, 3.9

Additional context

All additional information (config, environment, versions of other packages) can be found in the test repo https://github.com/newrushbolt/exabgp_stuck_example.

@thomas-mangin
Member

Thank you for the lab, I will look into it.

@newrushbolt
Author

newrushbolt commented Mar 7, 2023

Thanks for your response.
You are absolutely correct: exabgp.api.ack was enabled because my env file was misplaced and ignored.

The freeze itself happens because, when the API process doesn't read the acks, the pipe buffer (64 KB by default) fills up and the main process blocks in write().
That is non-trivial to debug; I couldn't find a way to get the used buffer size from outside the processes.
You can verify that a full buffer is the cause of the freeze this way:

  1. Get the PIDs:
# ps afxu
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1 16.0  0.6 261600 54364 ?        Ssl  01:24  86:13 /usr/bin/python3.9 python3.9 /usr/local/bin/exabgp --debug /etc/exabgp/exabgp.conf
root          11  0.1  0.6 249228 51172 ?        Sl   01:24   0:34  \_ /usr/bin/python3.9 /usr/bin/python3.9 /usr/local/bin/exabgp
root          14  0.2  0.3 197832 31068 ?        Sl   01:24   1:19  \_ /bin/python3 /bin/python3 /check.py
  2. Find the pipe behind the child process's STDIN (stdin is always fd 0):
# tree /proc/14/fd
/proc/14/fd
├── 0 -> pipe:[6017184]
├── 1 -> pipe:[6017185]
└── 2 -> pipe:[6019460]
  3. Make sure the main process is connected to this pipe:
# tree /proc/1/fd | grep '6017184'
├── 7 -> pipe:[6017184]
  4. Manually read the child's stdin:
# timeout 2 cat /proc/14/fd/0 > child_stdin.txt

If reading instantly unfreezes the main process, then the pipe buffer is to blame, and the child process is probably not reading its STDIN.
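
The effect can also be seen in isolation, without ExaBGP. The following is a minimal sketch, assuming a Unix-like system; the helper name `pipe_capacity` is made up for this example. It writes into a pipe that nobody reads until the kernel refuses more data. The write end is set to non-blocking so the script stops with an error instead of hanging, whereas a blocking write (as in the strace above) would simply never return.

```python
import os


def pipe_capacity(chunk=4096):
    """Return how many bytes fit into a pipe that is never read."""
    read_fd, write_fd = os.pipe()
    # With a blocking fd the final write would hang forever, which is
    # the state the main process was observed in.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk)
    except BlockingIOError:
        pass  # the pipe buffer is full
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print(pipe_capacity())
```

On Linux with default settings this should report 65536 bytes, the 64 KB mentioned above; other systems may use a different size.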

@thomas-mangin I'd like to dig further into this problem and will probably return later with a pull request that logs an error, or even auto-restarts the child process, if it is not reading the buffer.
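
Until something like that exists, one possible workaround on the API-process side is to keep reading stdin in the background so the pipe can never fill up. This is only a sketch, not part of ExaBGP; the names `drain` and `start_drainer` and the optional `sink` list are hypothetical.

```python
import sys
import threading


def drain(stream, sink=None):
    """Consume lines from `stream` until EOF so the writer never blocks."""
    for line in stream:
        if sink is not None:
            # Optionally keep what was received, e.g. for logging.
            sink.append(line.rstrip("\n"))


def start_drainer(stream=None, sink=None):
    """Start a daemon thread that drains `stream` (stdin by default)."""
    stream = sys.stdin if stream is None else stream
    thread = threading.Thread(target=drain, args=(stream, sink), daemon=True)
    thread.start()
    return thread
```

Calling `start_drainer()` once at startup, before the announce/withdraw loop, should be enough. The thread is a daemon, so it does not keep the process alive once the main loop exits.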

@thomas-mangin
Member

Confirmed as API ack, so closing.

@newrushbolt
Author

Some time later I figured out that the problem was caused not only by ack, but also by api.neighbor-changes generating incoming messages to the API process, even with api.send = {} and ack=false.
This is rather confusing and took a lot of time to debug, so maybe you would consider moving neighbor-changes into the send section.

Anyway, many thanks to you and all the contributors for making this product ❤️

@thomas-mangin
Member

Thank you @newrushbolt - opened an issue to deal with this abnormality.
