Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The process has stuck when reached too many open files #55

Open
Felixoid opened this issue Feb 7, 2020 · 2 comments
Open

The process has stuck when reached too many open files #55

Felixoid opened this issue Feb 7, 2020 · 2 comments

Comments

@Felixoid
Copy link
Collaborator

Felixoid commented Feb 7, 2020

Hello. Yesterday the process has stuck in a strange state without doing anything.

I did check the CH logs, there wasn't any suspicious.

There's a log from the stuck time:

> grep -C15 2020-02-06T20:12:33.379Z TMOP-CCH.log
[2020-02-06T20:12:27.381Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.403Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.617Z] ERROR [upload] handle failed {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.tagged+%28Date%2C+Tag1%2C+Path%2C+Tags%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.235977166}
[2020-02-06T20:12:27.754Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019944197075408.lz4", "time": 0.374817767}
[2020-02-06T20:12:28.129Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4", "time": 0.727357581}
[2020-02-06T20:12:28.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:28.486Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "time": 0.105951873}
[2020-02-06T20:12:30.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.478Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4", "time": 0.098053346}
[2020-02-06T20:12:30.625Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4", "time": 0.245967823}
[2020-02-06T20:12:30.875Z] ERROR [upload] handle failed {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.index+%28Date%2C+Level%2C+Path%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.495387526}
[2020-02-06T20:12:31.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:31.639Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "time": 0.25905138}
[2020-02-06T20:12:33.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019950197305764.lz4"}
[2020-02-06T20:12:45.198Z] INFO [writer] chunk interval changed {"interval": "10s"}
[2020-02-06T20:12:49.234Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:402e::101]:35402", "line": "servers.a"}
[2020-02-06T20:12:50.012Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:4002::a9d]:56246", "line": "metric.name"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
**** the line repeats next 1.3GiB****

Another strange thing - disk space was allocated in df output, but du didn't show anything: (upd: looks like logrotate has removed the file, but carbon-clickhouse hasn't release it by some reason)

> df
Filesystem     1K-blocks    Used Available Use% Mounted on
udev              980608       0    980608   0% /dev
tmpfs             198304   22024    176280  12% /run
/dev/vda         7329792 7329772        20 100% /
# cuted
> sudo du -hscx / | sort -h 
1.7G	/
1.7G	total
> sudo lsof | grep -i del
COMMAND     PID   TID             USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
carbon-cl 19199       carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199  6028 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199 19200 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
*** dozens of line with different TID ***

I'll increase nofiles limit for sure because it was the reason for the outage. But maybe you'd find a way to avoid such deadlocks. Even panic and die would be better behavior IMHO.

@phuntik
Copy link

phuntik commented Sep 6, 2021

bump!
Faced the same errors in log with no upload to clickhouse:

-- Logs begin at Wed 2021-09-01 20:28:48 +05, end at Mon 2021-09-06 20:06:59 +05. --
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.017+0500] ERROR [upload] ReadDir failed {"name": "graphite_tagged", "error": "open /data/carbon-clickhouse/graphite_tagged: too many open f>
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.017+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.019+0500] ERROR [upload] ReadDir failed {"name": "graphite_index", "error": "open /data/carbon-clickhouse/graphite_index: too many open fil>
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.043+0500] ERROR [upload] ReadDir failed {"name": "graphite", "error": "open /data/carbon-clickhouse/graphite: too many open files"}
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.513+0500] ERROR [writer] create failed {"filename": "/data/carbon-clickhouse/default.1630510128513008811.lz4", "error": "open /data/carbon->
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.017+0500] ERROR [upload] ReadDir failed {"name": "graphite_tagged", "error": "open /data/carbon-clickhouse/graphite_tagged: too many open f>
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.017+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.018+0500] ERROR [upload] ReadDir failed {"name": "graphite_index", "error": "open /data/carbon-clickhouse/graphite_index: too many open fil>
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.043+0500] ERROR [upload] ReadDir failed {"name": "graphite", "error": "open /data/carbon-clickhouse/graphite: too many open files"}
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.513+0500] ERROR [writer] create failed {"filename": "/data/carbon-clickhouse/default.1630510129513915235.lz4", "error": "open /data/carbon->
Sep 01 20:28:50 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:50.018+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}

Can't confirm that problem is in logrotate-carbon interaction because this 'strange' thing of @Felixoid doesn't appear to me. But from my side can admit that in times of normal work carbon was consuming about 8% of cpu. Now, at the moment of bad behaviour, it consumes 100% of cpu (one full core).
/data/carbon-clickhouse/graphite(index,tagged) folders contains 540k of broken symlinks in each (5 hosts x 7 days of system and app-specific metrics) and this count increases. Date of creation of oldest symlink in folder corresponds with time when metrics has stopped being uploaded.

[root@Monitoring carbon-clickhouse]# du -sh /data/carbon-clickhouse/*
4.0K	/data/carbon-clickhouse/default.1617361073904867900.lz4
16K	/data/carbon-clickhouse/default.1630397796503127702.lz4
32M	/data/carbon-clickhouse/graphite
32M	/data/carbon-clickhouse/graphite_index
32M	/data/carbon-clickhouse/graphite_tagged

Folders cleaning and restart helped.
carbon-clickhouse -version
0.11.1

Sooo, any updates with this issue, диарс? :)
BTW my humble opinion is same to @Felixoid that it it not good that carbon shows green state in systemctl status when in fact not working properly.

@Felixoid
Copy link
Collaborator Author

The best thing I can see here is putting all ports down and writing an ERROR to log. Does it sound like a good solution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants