Having dug further into the issue, we have established some useful data points but still have no clear explanation of what went wrong. In short, the AskSSC site failed to come up normally during a routine restart of the Apache server at 12:27:06 PM EST, and it didn't generate any notifications because the page that DID come up was still returning a normal "200" status code. A subsequent restart at 2:11:26 PM EST brought things back to normal. None of the other sites in the system was affected, and nothing at all was changed in the AskSSC code or configuration data. It simply didn't come up correctly and didn't notify us that anything was wrong. We're trying to recreate the circumstances on test servers to understand what took place. We have also identified a possible method to ensure that any future virtual host startup failure reliably produces a proper error code. At the end of the day, we still don't know why AskSSC didn't restart normally, and the site configuration files have not been changed. DNS was not the cause, and neither were site updates or configuration changes. Something simply failed, and nothing caught the failure. We're on the case to try to prevent this from happening in future. On behalf of all of us on the OSQA team, I apologize to everyone who was upset by the downtime. It shouldn't have happened, and we're working to prevent this type of failure from recurring.
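To illustrate why a status-code-only monitor missed this, here is a minimal sketch of a check that also looks at the page content. It is not the OSQA team's actual monitoring or the method mentioned above; the marker string, function names, and URL are hypothetical and would need to be replaced with something that appears only on the real site's pages.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical marker: text assumed to appear only on the real site's pages,
# never on a default or fallback page served by the web server.
EXPECTED_MARKER = "AskSSC"


def is_healthy(status: int, body: str, marker: str = EXPECTED_MARKER) -> bool:
    """Healthy only if the status is 200 AND the expected content is present."""
    return status == 200 and marker in body


def check_site(url: str, marker: str = EXPECTED_MARKER, timeout: float = 10.0) -> bool:
    """Fetch the URL and apply is_healthy; any network or HTTP error counts as unhealthy."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, marker)
    except (URLError, OSError):
        return False
```

A monitor could call `check_site("https://example.invalid/")` (placeholder URL) on a schedule and alert when it returns `False`. With this kind of check, a wrong page served with a 200 status would be flagged instead of passing silently.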
One would think updates to this site would be done during off-peak hours, not during the day. I understand it is always daylight hours somewhere, but give me a break. 1:00 Eastern time on a Monday. Not cool.
Last I heard, we were working on getting better communication with the OSQA guys, but it appears more communication is needed. Lots more. I get the sense that these are not enterprise-experienced support people who have to answer to business users, but I could be wrong. In the past they've asked us to talk about OSQA issues on their own Ask site, so going there is probably a good next step for us regulars.
Hopefully tonight or tomorrow someone at Red Gate will get an answer from OSQA as to what happened (assuming they know). Also, I'm sure the point will be stressed that this is the second time this month that we have suffered an outage without any prior warning. Not saying that this was avoidable (can't say until we know more), but it certainly isn't helpful. We are taking small steps to try to grow this site, but things like this just push us back sooooo far.