What is going on with all the outages?

This is not the first time we've had a major outage of the AskSSC site. What's going on to prevent these? What's going on to improve communications out from OSQA?

asked Feb 28, 2011 at 11:35 AM in Default

ThomasRushton ♦
33.9k 18 20 44

@ThomasRushton, this appears to have been a DNS-related glitch that impacted the site's availability for a couple of minutes. To characterize this as a "major outage" is not reasonable.
Feb 28, 2011 at 11:51 AM rickross ♦♦
@rickross - it was down for about one hour, at the peak of the US/UK day - that doesn't go unnoticed.
Feb 28, 2011 at 11:59 AM Kev Riley ♦♦
@Kev Riley, DNS changes can persist for varying lengths of time according to how and where things get cached. It is beyond our control. I assure you the site was not down for an hour, even if you could not reach it.
Feb 28, 2011 at 12:03 PM rickross ♦♦
@rickross - sure I appreciate that, and I'm not saying that you guys were to blame, just that it was down, and to us ASK regulars it's a major thing. I'm sure between us here we are all responsible for systems, websites, applications etc. and if they go down for mere minutes, we get our ear chewed off!
Feb 28, 2011 at 12:08 PM Kev Riley ♦♦

I get my ear chewed off if things start running slowly, let alone go down!

@rickross I'm sorry if you think my characterisation of this issue was unreasonable. As @kev riley says, though, it was an outage of reasonable length of time, during the peak working day for US and towards the end of the working day in the UK. This certainly was noticed, and not just by a couple of demented regulars either!
Feb 28, 2011 at 12:14 PM ThomasRushton ♦

5 answers

Having dug further into the issue, we have been able to determine some useful data points, but no clear explanation of what went wrong. In short, the AskSSC site failed to come up normally during a routine restart of the Apache server at 12:27:06 PM EST, and it didn't generate any notifications because the page that DID come up was still returning a normal "200" status code. A subsequent restart at 2:11:26 PM EST brought things back to normal. None of the other sites in the system was affected, and nothing at all was changed in the AskSSC code or configuration data. It simply didn't come up correctly and didn't notify us that anything was wrong.

We're trying to recreate the circumstances on test servers to understand what took place. We have also identified a possible method to ensure that any future virtual host startup failure would reliably lead to the production of proper error codes. At the end of the day, we still don't know why AskSSC didn't restart normally, and the site configuration files have not been changed. DNS was not the cause, but neither were site updates or configuration changes. Something simply failed, and nothing caught the failure. We're on the case to try to prevent this happening in future.
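
To illustrate why a "200" alone was not enough to raise an alarm, here is a minimal sketch of a health check that also looks for text only the real site should contain. This is an illustration under assumptions, not what the OSQA team actually runs: the function names `is_healthy` and `check_site`, the marker string, and any URL you pass in are all hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(status: int, body: str, marker: str) -> bool:
    """Healthy only if the status is 200 AND the expected marker text is present.

    A default or wrong virtual host can answer 200 too, so the status
    code by itself does not prove the right site came up.
    """
    return status == 200 and marker in body


def check_site(url: str, marker: str, timeout: float = 10.0) -> bool:
    """Fetch url and apply is_healthy; any network or HTTP error counts as unhealthy.

    Both arguments are placeholders chosen by whoever sets up the monitor.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, marker)
    except (urllib.error.URLError, OSError):
        return False
```

A monitor built this way would flag a vanilla default page as a failure even though the web server answered normally, provided the marker is something the default page never contains.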

On behalf of all of us on the OSQA team, I apologize to everyone who was upset by the downtime. It shouldn't have happened, and we're working to prevent this type of failure from recurring.

answered Feb 28, 2011 at 01:11 PM

rickross ♦♦
186 1 2 3

@rickross - Thanks for getting back to us with that. I appreciate & understand your frustrations, particularly when it's one of those things that Just Should Not Happen.
Feb 28, 2011 at 01:13 PM ThomasRushton ♦
Hopefully tonight or tomorrow someone at Red Gate will get an answer from OSQA as to what happened (assuming they know). Also I'm sure the point will be stressed that this is the 2nd time this month that we have suffered an outage without any prior knowledge. Not saying that this was avoidable - can't say until we know more - but it certainly isn't helpful. We are taking small steps to try and grow this site, but things like this just push us back sooooo far.

answered Feb 28, 2011 at 11:57 AM

Kev Riley ♦♦
53.2k 47 49 76

One would think updates to this site would be done during non-peak hours, not during the day. I understand it is always daylight hours somewhere, but give me a break. 1:00 Eastern time on a Monday. Not cool.

answered Feb 28, 2011 at 11:43 AM

Tim
36.4k 38 41 139

I would assume that OSQA can generate usage stats for the site so they can see when (historically) the site is not being used as heavily...

Quite right, though - not cool.
Feb 28, 2011 at 11:51 AM ThomasRushton ♦
No update was done to AskSSC. In fact, nothing was done that SHOULD have affected AskSSC in any way, but apparently Apache got its panties in a bunch. Are AskSSC users doing the same?
Feb 28, 2011 at 11:53 AM rickross ♦♦
For me the site was down for close to an hour. DNS or not, it is odd that a DNS error would take us to a clean site where we could create new users, post questions, etc. We had a go at it for a while on there. In my experience, DNS either works or it doesn't; it doesn't send me to a default vanilla home page. Regardless, I am just glad it is back up and resolved itself.
Feb 28, 2011 at 12:14 PM Tim
Last I heard, we were working on getting better communication with the OSQA guys, but it appears more communication is needed. Lots more. I have a sense that these are not enterprise-experienced support people who have to answer to business users, but I could be wrong. In the past they've asked us to talk about OSQA issues on their own Ask site, so going there is probably a good next step for us regulars.

answered Feb 28, 2011 at 11:54 AM

Grant Fritchey ♦♦
101k 19 21 74

We are quick to complain because we care.
Feb 28, 2011 at 12:18 PM Kev Riley ♦♦
@rickross, we do appreciate all that you do for us, and we too feel the pain of supporting production systems. If you can't tell, we are all huge fans of the site and have come to depend on it for stress and comedy relief. Please don't think for a moment that we don't appreciate you guys and all your work supporting the site.
Feb 28, 2011 at 12:22 PM Tim

@Kev Riley, we try to be quick to respond because you care. :) We're still digging to understand what happened here, but I assure you that nothing changed in the AskSSC code or config at all. There should have been NO downtime, yet something evidently went wrong.

Regardless, we apologize to all for the inconvenience, and we certainly do appreciate that you guys are superb members of a superb online community.
Feb 28, 2011 at 12:27 PM rickross ♦♦
@rickross, thanks and it IS appreciated that you respond.
Feb 28, 2011 at 12:33 PM Kev Riley ♦♦
@rickross - I commented elsewhere, but deleted that and will comment here. That apologize word goes a long way, at least with me. I do appreciate that you guys are doing OSQA FoC - hopefully you get back from us some of the more constructive feedback that usually ends up going back via Mr Massey...
Feb 28, 2011 at 12:39 PM Matt Whitfield ♦♦
asked: Feb 28, 2011 at 11:35 AM

Seen: 1294 times

Last Updated: Feb 28, 2011 at 12:49 PM