[cf-dev] CF app that helps with self-healing

[cf-dev] CF app that helps with self-healing

Siva Balan
Dear CF community,
We are trying to find a way to restart specific app instances, or a whole app, on demand in response to alerts from our monitoring solution. One option we are considering is to deploy a self-healing app in CF that exposes REST endpoints, which our alert policies can call to perform those actions for us. This self-healing app would essentially have the capabilities of the CF CLI for stopping and starting services and instances. The app would also be protected by UAA.
Before we go off and start developing this app, I wanted to check whether anyone in the CF community has thought about this approach before and has a solution in place, or has any ideas to consider.

Thanks,
Siva Balan

Reminder that all communication on this mailing list is subject to the Cloud Foundry Foundation's code of conduct, which can be found here: https://www.cloudfoundry.org/code-of-conduct/

Re: [cf-dev] CF app that helps with self-healing

Daniel Mikusa
Not sure I totally get what you are asking, but `cf restart-app-instance` will restart an instance, so if you have an alert trigger a script, you could script the restart.

Or you could have the app itself detect when it gets into a bad state (presumably it could, if it's emitting the metrics that indicate this) and exit. When it exits, the platform will restart the crashed instance.

Dan


Re: [cf-dev] CF app that helps with self-healing

Siva Balan
Hi Daniel,
Thanks for your response.
I am aware of all the options you are suggesting. What we are looking for, though, is a way to restart an app instance without human intervention, triggered by an alert policy in our monitoring system. This monitoring system runs outside of CF and does not have access to the CF CLI, but it can call REST endpoints.
For example, the monitoring system detects high CPU utilization on one of the app instances. It raises an alert, which triggers a policy that calls a REST endpoint of the self-healing app. Based on the parameters passed in the request, the self-healing app restarts the requested app instance.
This is required when the app does not know that it is in a bad state, but some metrics we are tracking indicate that the app instance needs to be restarted.
Hope that makes sense.

Thanks
Siva
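
The flow described above (alert, REST call with parameters, restart) could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON field names `app_guid` and `instance_index` are hypothetical, the restart action is passed in as a callable so nothing is assumed about how the platform is reached, and the UAA token check mentioned in the original post is left out.

```python
import json
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class RestartRequest:
    app_guid: str        # hypothetical field name, not from the thread
    instance_index: int  # hypothetical field name, not from the thread


def parse_restart_request(body: bytes) -> RestartRequest:
    """Validate the JSON payload sent by the alert policy."""
    try:
        payload = json.loads(body)
    except ValueError as exc:  # covers invalid JSON and invalid UTF-8
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise ValueError("body must be a JSON object")
    app_guid = payload.get("app_guid")
    index = payload.get("instance_index")
    if not isinstance(app_guid, str) or not app_guid:
        raise ValueError("app_guid must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int) or index < 0:
        raise ValueError("instance_index must be a non-negative integer")
    return RestartRequest(app_guid, index)


def handle_alert(body: bytes,
                 restart: Callable[[RestartRequest], None]) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for one alert callback."""
    try:
        request = parse_restart_request(body)
    except ValueError as exc:
        return 400, str(exc)
    restart(request)  # the actual restart mechanism is injected by the caller
    return 202, f"restart requested for {request.app_guid}/{request.instance_index}"
```

Keeping the restart action behind a callable means the validation logic can be exercised without a foundation to talk to, for example with `handle_alert(body, print)`.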

Re: [cf-dev] CF app that helps with self-healing

Daniel Jones
Hi Siva,

I'm not aware of a similar solution that already exists. A couple of thoughts:
  • Could you use HTTP healthchecks, and have the endpoint return a non-200 status code if the app detects high CPU usage itself?
  • Be mindful of how CPU usage is reported. Whilst current containerisation tech can limit how many CPU shares a process gets, it can't control the system calls that report how much CPU is available. Hence things like `top` will appear inaccurate, and you should ensure the CPU usage statistics come from the metrics that feed into the cpu-entitlement-plugin. If you want to double-check this, there's a blog post (https://www.cloudfoundry.org/blog/better-way-split-cake-cpu-entitlements/) and the folks in the #garden channel are awfully helpful.
  • Having an endpoint that allows remote termination of an app sounds like a bit of a security risk, but I'm sure you'll manage that appropriately.

Regards,
Daniel 'Deejay' Jones - CTO
+44 (0)79 8000 9153
EngineerBetter Ltd - More than cloud platform specialists
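
The first suggestion above, an HTTP health check that returns a non-200 status when the app sees high CPU usage, might look something like this minimal sketch. The `/health` path, the 0.9 threshold, and the `read_cpu` callable are all assumptions made for illustration; where the CPU reading comes from is deliberately left open, given the caveat above about in-container CPU figures.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable

CPU_THRESHOLD = 0.9  # hypothetical threshold, as a fraction of the entitlement


def health_status(cpu_fraction: float, threshold: float = CPU_THRESHOLD) -> int:
    """Map a CPU usage reading to the HTTP status the health check returns."""
    return 503 if cpu_fraction > threshold else 200


def make_handler(read_cpu: Callable[[], float]):
    """Build a handler class whose /health route reports on read_cpu()."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status = health_status(read_cpu())
            body = b"ok" if status == 200 else b"cpu too high"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def make_server(port: int, read_cpu: Callable[[], float]) -> HTTPServer:
    """Create (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer(("0.0.0.0", port), make_handler(read_cpu))
```

On Cloud Foundry, an endpoint like this could be wired up with `health-check-type: http` and `health-check-http-endpoint: /health` in the app manifest, after which the platform restarts instances whose check stops returning 200.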


Re: [cf-dev] CF app that helps with self-healing

Daniel Mikusa
In reply to this post by Siva Balan


On Fri, Jan 24, 2020 at 5:28 PM Siva <[hidden email]> wrote:
Hi Daniel,
Thanks for your response.
I am aware of all the options you are suggesting. But what we are looking for is a process to restart an app instance without human intervention from an alert policy in our monitoring system. This monitoring system is outside of CF and does not have access to CF CLI. But it can access REST endpoints.

The cf CLI is just a glorified REST client. If you can access the Cloud Controller API for your foundation, you can do everything I mentioned without the cf CLI, by making raw REST calls.


+1 to everything Daniel Jones said in his response.

Hope that helps!

Dan
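
As a rough illustration of the "raw REST" route described above: to the best of my knowledge, the v3 Cloud Controller API exposes `DELETE /v3/apps/:guid/processes/:type/instances/:index` to terminate a single instance, which the platform then replaces. The sketch below only assumes that endpoint; the API URL, app GUID, and token are placeholders, and the endpoint should be checked against the API docs for your foundation's version.

```python
import urllib.request


def build_restart_request(api_url: str, app_guid: str, index: int,
                          token: str,
                          process_type: str = "web") -> urllib.request.Request:
    """Build the DELETE request that terminates one instance of an app process."""
    url = (f"{api_url.rstrip('/')}/v3/apps/{app_guid}"
           f"/processes/{process_type}/instances/{index}")
    return urllib.request.Request(
        url,
        method="DELETE",
        # The token is assumed to be a valid UAA access token for a user or
        # client that is allowed to manage the app.
        headers={"Authorization": f"bearer {token}"},
    )


def restart_instance(api_url: str, app_guid: str, index: int, token: str) -> int:
    """Send the request and return the HTTP status code from the API."""
    request = build_restart_request(api_url, app_guid, index, token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

For experimenting by hand, `cf oauth-token` prints a token that can be passed as `token`; an unattended caller would more likely fetch one from UAA with its own client credentials.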

 
Re: [cf-dev] CF app that helps with self-healing

Troy Topnik
In reply to this post by Daniel Jones
Ideally you'd want to trace the application misbehavior to a root cause in the application itself, but I think we've all been in the situation where "turn it off and on again" is an easier solution. :)

I wonder if this could be a feature request for App-AutoScaler? It already has access to the metric types required for the operation, but it would need to be able to take a policy action based on those metrics other than scaling up or down (e.g. "adjustment": "restart").

TT

--
Troy Topnik
Senior Product Manager, 
SUSE Cloud Application Platform 
 
Re: [cf-dev] CF app that helps with self-healing

Siva Balan
Thanks, Daniel J and Daniel M, for your inputs.
Troy - We are also thinking along those lines, to see if we can use the App Autoscaler for the restarts.

-Siva

Re: [cf-dev] CF app that helps with self-healing

Hjortshoj, Julian
To me this seems a lot like a health check.  Is there some reason that you couldn't add a health check endpoint to your app instances (either directly, or as a sidecar) and then let CF take care of restarting the app instances for you?
