[cf-dev] rep fd count keeps increasing until 'too many open files' and cell ends up in bad status
Hi CF developers,

We hit a problem where the number of file descriptors (fds) held by rep
keeps increasing until it fails with 'too many open files'.

Our Cloud Foundry environment was built on a Kubernetes cluster with 3 VMs
under it: 1 for the diego-cell (4 cores, 16 GB) and 2 for everything else.
For the stress test, we used 10+ threads to push/start/stop/.../delete
apps continuously, with 10 s of think time between steps. The test began with
0 errors, but hours later it always ended with the cell in bad status. App
staging failed with 'can't communicate with compatible cells', and 'too many
open files' appeared in rep.stdout.log.

We began to monitor the number of entries under
/proc/<rep-pid>/fd because of the 'too many open files' hint, and noticed
that the count was steady at first, then from some point on it kept increasing,
even after the app push test had been stopped completely. The growing fd
count looks like the cause of 'too many open files', and it would most likely
make the node (VM) unreachable in the end.

Why would the fd count keep increasing? Is
there a leak, or something that cannot be released?
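
For anyone who wants to watch the same thing on their own cell, the
monitoring of /proc/<rep-pid>/fd can be sketched roughly as below. This is
only a minimal sketch, not the exact script we ran: it assumes a Linux /proc
filesystem and permission to read the rep process's fd directory, and the
names classify, fd_summary and watch are made up for illustration. Grouping
the entries by kind (socket, pipe, regular file) may help narrow down what is
not being released.

```python
import os
import time
from collections import Counter


def classify(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_summary(pid, proc_root="/proc"):
    """Count the open fds of a process, grouped by category."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink().
            continue
        counts[classify(target)] += 1
    return counts


def watch(pid, interval=10.0, samples=6):
    """Print a timestamped fd summary every `interval` seconds."""
    for _ in range(samples):
        counts = fd_summary(pid)
        total = sum(counts.values())
        print(time.strftime("%H:%M:%S"), total, dict(counts))
        time.sleep(interval)
```

Calling watch(<rep-pid>) with the real pid of rep prints one line per sample.
If the total keeps climbing while one category dominates, that category is
probably the place to start looking.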