
Re: [cross-project-issues-dev] download.eclipse.org unavailable

Hi

Thank you all for hitting problems quite quickly once you were engaged. Perhaps this 'bystander's' perspective may help to understand the need to communicate better.

I first became aware of the problem after receiving notification a little after 2:42 EDT 1-Aug that a weekly OCL rebuild had failed. Investigation of the log pointed a finger at the Git repo, and eclipsestatus.io indicated that a major outage was in progress, with an 'investigating' tweet. Clearly someone was on the case, so the bystander effect took over and I didn't raise any reports or emails to distract.

'investigating' status advanced to 'fix-in-progress' after an hour.

But then nothing for a further 5 hours, at which point we got 'it will take 13 hours'. On Twitter someone asked when the 13 hours started; one might have hoped that it would be from the 'fix-in-progress' time. This tweet and an 'ETA?' tweet were never answered.

17 hours later we got 'most websites' back, which might be true, but with important services still down it was misleading. It took perhaps a further 4 hours for https://download.eclipse.org/tools/orbit/downloads/latest-I to return, 50 hours before projects-storage.eclipse.org was back, and another couple of hours to get /shared/common/apache-ant-latest/bin/ant back.

IMHO the outage lasted at least until the restoration of projects-storage.eclipse.org at 8:50 on Aug 4, so one of the issues the postmortem must address is why the status page still reports no incidents or outage for the whole of 3rd Aug when, for committers at least, there was no usable service all day.

I must thank the team again for their hard work with a very difficult problem, but must also stress that the communication was very poor. So much so that at 3:07 EDT on 4th Aug I sent a private email to Ed Merks speculating that:

The total silence from the team is now way beyond incompetence/discourtesy/embarrassment; there must be another reason.

Paranoia sets in.

Is some government / hostile agency intervening to prevent communication?

Are the team voluntarily maintaining silence to contain a security issue?

Please ensure that whenever possible the status updates are much more informative.

    Regards

        Ed Willink


On 09/08/2021 21:45, Denis Roy wrote:

I very much appreciate the sympathy and the support. In the end, the Infra team can do better than this. We'll lick our wounds and go back to the drawing board to make sure we don't repeat the same mistakes twice.

Postmortem is written, pending review with my team.



Denis



