[cf-dev] Cloud controller doesn't recover after database downtime

[cf-dev] Cloud controller doesn't recover after database downtime

Holger Oehm
Hi,

Today, during an update of the database instance (which hosts ccdb,
uaadb, locketdb and diegodb) in our production system, we saw an error
while pushing an app.

That was to be expected. What was unexpected is that afterwards
(when the database instance was up and running again) further attempts
to push the same application also kept failing.
From the CF_TRACE output we saw that a PUT to /v2/apps/<guid> got a
response with status code 400, code 100001, description "The app is invalid:
VCAP::CloudController::BuildCreate::StagingInProgress" and error_code
"CF-AppInvalid".
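
For anyone who wants to spot this condition in a script, here is a minimal sketch that checks a response for the values quoted above. It assumes the error body is a JSON object with the fields "code", "description" and "error_code"; the helper name is_staging_in_progress_error and the sample body are made up for illustration and are not part of any CF tooling.

```python
import json

# Error class name quoted in the trace above.
STAGING_IN_PROGRESS = "VCAP::CloudController::BuildCreate::StagingInProgress"


def is_staging_in_progress_error(status_code, body):
    """Return True if the response matches the error described above.

    Hypothetical helper: assumes `body` is the raw JSON text of the
    response and that it carries code, description and error_code.
    """
    if status_code != 400:
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON (or not text at all), so it cannot be this error.
        return False
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("code") == 100001
        and payload.get("error_code") == "CF-AppInvalid"
        and STAGING_IN_PROGRESS in str(payload.get("description", ""))
    )


# Sample body built from the values reported in this message.
SAMPLE_BODY = json.dumps({
    "code": 100001,
    "description": "The app is invalid: " + STAGING_IN_PROGRESS,
    "error_code": "CF-AppInvalid",
})
```

A script could call is_staging_in_progress_error(400, SAMPLE_BODY) style checks on each failed push and alert an operator when it returns True.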

This didn't recover by itself within 20 minutes. After that, an operator
ran cf restage on the application and the problem disappeared.

Everything else worked as expected; the diego-sync job was also running
fine.

My guess is that the database disappeared at an inconvenient point in
time, and that this left an inconsistent state.

What looks strange to me is that a cf push of the same application
kept failing, but a cf restage fixed it. Shouldn't both commands
fix the situation?

Best Regards,
Holger.


Re: [cf-dev] Cloud controller doesn't recover after database downtime

Mike Youngstrom
We have seen the same thing but haven't had time to dig into it. For us it isn't hard to reproduce: simply doing a push in a loop while doing an update reproduces it. You might have enough info here to open an issue in the CC project if nobody from the team looks at this message.

Mike
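
A rough sketch of the "push in a loop" reproduction, for anyone who wants to try it. It assumes the cf CLI is on the PATH and already logged in and targeted; the function name push_in_loop, the app name, the attempt count and the delay are placeholders, not values from this thread.

```python
import subprocess
import time


def push_in_loop(app_name, attempts, delay_seconds=0.0, run=subprocess.run):
    """Run `cf push <app_name>` repeatedly and return the exit codes.

    Hypothetical helper. `run` defaults to subprocess.run and can be
    swapped for a stub so the loop can be exercised without a CF target.
    """
    exit_codes = []
    for _ in range(attempts):
        # A non-zero exit code means this push attempt failed.
        result = run(["cf", "push", app_name])
        exit_codes.append(result.returncode)
        time.sleep(delay_seconds)
    return exit_codes


# Example (placeholder values), started shortly before the update begins:
#   codes = push_in_loop("my-app", attempts=50, delay_seconds=5)
#   print(codes)
```

If pushes keep returning non-zero exit codes after the update has finished, that would match the behaviour described in this thread.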

On Thu, Apr 5, 2018 at 5:25 AM, Holger Oehm <[hidden email]> wrote:
> [...]