[stellar/stellar-core] Clean work abort (#1729)

MonsieurNicolas commented on this pull request.

> @@ -78,6 +80,7 @@ class Work : public WorkParent
virtual void onRun();
virtual void onFailureRetry();
virtual void onFailureRaise();
+ virtual void onAbort();

you need to add documentation on abort semantics

> @@ -249,25 +266,45 @@ Work::advance()
}

CLOG(DEBUG, «Work») << "advancing " << getUniqueName(); - advanceChildren(); - if (allChildrenSuccessful()) + + // If necessary, propagate abort signal before advancing children + // This is to prevent scheduling any children to run if they are about + // to be in WORK_ABORTING state (such children are scheduled to abort there is no `WORK_ABORTING` state > @@ -141,10 +145,22 @@ Work::callComplete()
};
}

+void
+Work::scheduleAbort(CompleteResult result)

this is strange: I would expect `scheduleAbort` to just schedule a call to `abort`

> }
— else if (anyChildRaiseFailure())
+ else if (anyChildAborted())

not sure I understand: why would a child aborting cause parents to abort as well?

> @@ -434,4 +498,43 @@ Work::notify(std::string const& child)
<< " of completed child " << child; advance(); } + +void +Work::abort(CompleteResult result) +{ + // When `abort` signal is issued, pending work is in either + // one of two states: + // 1. It hasn't been scheduled to run yet. If some children are still + // running, this is handled in advance where work is scheduled to abort. + // Otherwise, work is scheduled to abort right away. + // 2. Work is already in IO service queue, but hasn't started running yet. + // This scenario is handled in `run` method, where abort is scheduled + // instead of success. + + assert(getState() == WORK_PENDING); error handling is wrong: I would expect an exception to be thrown if abort is called at the wrong time. That said: I am not sure there should ever be a bad time to call abort, if Work is already complete or aborting, it can safely return (no-op)? > +
+ assert(getState() == WORK_PENDING);
+ mAborting = true;
+ bool allDone = true;
+
+ for (auto const& c : mChildren)
+ {
+ if (!c.second->isDone())
+ {
+ allDone = false;
+ }
+
+ // Only abort when work is pending. Wait if it’s running, as it will be
+ // handled in `advance`. If work has finished with success or fail,
+ // nothing to abort either
+ if (c.second->getState() == Work::WORK_PENDING)

why isn’t the logic simply
«`c++
if (!c.second->isDone())
{
allDone = false;
}
else
{
c.second->abort();
}
«`

> + allDone = false;
+ }
+
+ // Only abort when work is pending. Wait if it’s running, as it will be
+ // handled in `advance`. If work has finished with success or fail,
+ // nothing to abort either
+ if (c.second->getState() == Work::WORK_PENDING)
+ {
+ c.second->abort();
+ }
+ }
+
+ if (allDone)
+ {
+ // Children are ready, schedule abort for work itself.
+ scheduleAbort(result);

It seems it would be better to just `scheduleRun` here

> @@ -61,6 +61,13 @@ WorkManagerImpl::notify(std::string const& child)
mApp.getMetrics().NewMeter({«work», «root», «failure»}, «unit»).Mark();
mChildren.erase(child);
}
+ else if (i->second->getState() == Work::WORK_FAILURE_ABORTED)

Before reviewing this PR, I opened https://github.com/stellar/stellar-core/issues/1755 as I thought semantics were already not super clean and error prone, now that we have abort(ing), we really need to formalize well what is going on, otherwise we’re going to run into very strange bugs.

Also, the semantics implied here from `onAbort` don’t really seem to follow what was described in https://github.com/stellar/stellar-core/issues/1706 (or at least it’s unclear that it can work if `onAbort` only triggers some work).

I would recommend going back to basics: describe a state machine, its transitions and when certain callbacks (`onXYZ`) get called. Right now the `mAborting` flag makes it hard to tell which state transitions are valid vs invalid (and what is supposed to happen).

The two ways to abort are:
* somebody wants to abort some work, this causes transitions to `WORK_ABORTING` to `WORK_ABORTED` ; any work that was in flight is now complete (and was aborted if needed).
* abort in preparation for a retry, something like «decision to retry» -> `WORK_ABORTING_FOR_RETRY` (aborting work) -> `WORK_PENDING` (reset) -> `WORK_RUNNING` (run) …

> @@ -159,6 +159,41 @@ ProcessManagerImpl::shutdown()
}
}

+void
+ProcessManagerImpl::shutdownProcess(std::shared_ptr pev)

this implementation of `shutdownProcess` is not desirable in the typical case (it’s doing `kill -9 …`):
we want «clean» shutdown by default in case processes start sub processes. The process hierarchy is typically something like `stellar-core -> bash -> aws_cli-> aws_cli_children`, a «force kill» of `bash` may leave the `aws_cli` process (and children) running.

You can keep this implementation under a «force» parameter, but the MVP may not need this (ie: the only place where we would need it is if we’re implementing timeout for abort)

> @@ -27,14 +27,14 @@ class ProcessManagerImpl : public ProcessManager
// Subprocesses will be removed asynchronously, hence the lock on
// just this member
std::recursive_mutex mImplsMutex;
— std::map> mImpls;
+ std::map> mImpls;

rename this: it’s not `mImpls` anymore

> @@ -49,11 +49,12 @@ class ProcessManager : public std::enable_shared_from_this,
{
public:
static std::shared_ptr create(Application& app);
— virtual ProcessExitEvent runProcess(std::string const& cmdLine,
— std::string outputFile = «») = 0;
+ virtual std::shared_ptr

if we move to using to `std::shared_ptr` across the board, should we make `ProcessExitEvent` non-copyable/movable?

Добавить комментарий