Re: [cross-project-issues-dev] download.eclipse.org unavailable
Hi
Thank you all for tackling the problems so quickly once you were
engaged. Perhaps this bystander's perspective may help in
understanding the need to communicate better.
I first became aware of the problem after receiving notification
a little after 2:42 EDT 1-Aug that a weekly OCL rebuild had
failed. Investigation of the log pointed a finger at the Git repo,
and eclipsestatus.io indicated that a major outage was in progress
with an 'investigating' tweet. Clearly someone was on the case and
so the bystander effect took over and I didn't raise any reports
or emails to distract.
The 'investigating' status advanced to 'fix-in-progress' after an
hour.
But then there was nothing for a further 5 hours, at which point we
got 'it will take 13 hours'. On Twitter, someone asked when the 13 hours
started; one might have hoped that it would be from the
'fix-in-progress' time. This tweet and an 'ETA?' tweet were never
answered.
17 hours later we got 'most websites' back, which might be true,
but with important services down it was misleading. It took
perhaps a further 4 hours for
https://download.eclipse.org/tools/orbit/downloads/latest-I
to return, and 50 hours before projects-storage.eclipse.org
was back and another couple of hours to get /shared/common/apache-ant-latest/bin/ant
back.
IMHO the outage lasted until at least the
restoration of projects-storage.eclipse.org
at Aug 4 8:50 and so one of the issues to be addressed by the
postmortem must be why the status page still reports no incidents
or outage on the whole of the 3rd Aug when, for committers at
least, there was no useable service all day.
I must thank the team again for their hard work with a very
difficult problem, but must also stress that the communication was
very poor. So much so that at 3:07 EDT on 4th Aug I sent a private
email to Ed Merks speculating that:
The total silence from the team is now way beyond
incompetence/discourtesy/embarrassment; there must be another
reason.
Paranoia sets in.
Is some government / hostile agency intervening to prevent
communication?
Are the team voluntarily maintaining silence to contain a
security issue?
Please ensure that whenever possible
the status updates are much more informative.
Regards
Ed Willink
On 09/08/2021 21:45, Denis Roy wrote:
I very much appreciate the sympathy
and the support. In the end, the Infra team can do better than
this. We'll lick our wounds and go back to the drawing board
to make sure we don't repeat the same mistakes twice.
Postmortem is written, pending review
with my team.
Denis