Wow - EQ 2 Servers Still Down?

Guest

Archived from groups: alt.games.everquest

Thomas T. Veldhouse wrote:
> Meaffwin <suka_@cox.net> wrote:
> >
> > /Delurk
> >
> > You admit you don't know much, but then you call them idiots because
> > you assume they can't go back to the pre-patch setup? That's pretty
> > cute.
> >
> > /Relurk
> >
>
> I don't know much as far as details of what happened. What is
> absolutely clear is that they don't have a backout strategy, or they
> would have used it. Is that clear enough?

So you think bringing the servers back up in a previous state (one that
possibly includes the bug), even though they don't know where the bug
that trashed everything was, is a good idea. That way we can go through
this all over again. Gotcha.
 

Shane

On Sat, 18 Dec 2004 16:11:02 GMT, "Lou Vincze" <biglou@ix.netcom.com>
wrote:

>What's it been, 24 hours? Any ETA on this mess?
>
>Lou
>

From EQ2 chat;

GMGrog: ... back. I'll explain what I know about what happened for
those that haven't heard.

Manev: Norpan, there is no ETA about the servers coming up. My
apologies for the inconvenience.

GMGrog: (It's 8:24am PST now, for reference against your own time
zone.) Yesterday at 7:00am the servers went down briefly for the usual
morning reset. They came back up at 7:30am and folks logged in...

GMGrog: Players logging in or after zoning a few times found their
quest journal was empty. (Not everyone but very many.)

GMGrog: The recipe books were empty, spells were missing like Call of
Qeynos/Overlord or Orc Master Strike.

GMGrog: They brought the servers back down at 10:00am. They'd been up
for 2.5 hours.

Manev: Thanks for the information, Duvaries. Our development team is
working on it.

GMGrog: Devs, ops, programmers, all were working on the problem by
then already.

GMGrog: The servers have been down since. They've made progress;
servers are up internally now and being tested. They still don't have a
hard time for when the servers will be back up.

GMGrog: They rolled everything back to 7:00am PST, before the servers
reset. That means that for anyone who was online, everything that happened
during the 2.5 hours of uptime is gone, as if it never happened.

GMGrog: They know how frustrating this very long wait is. They'll be
wiping all exp debt and giving free play time. They haven't explained
that "free time" in detail yet, but will once everything is back on
track.

GMGrog: That's pretty much all I know.

GMGrog: (cut & pasting that to save it because I keep typing all that
out every 30 mins or so...)
 

Meaffwin <suka_@cox.net> wrote:
>
> /Delurk
>
> You admit you don't know much, but then you call them idiots because
> you assume they can't go back to the pre-patch setup? That's pretty
> cute.
>
> /Relurk
>

I don't know much as far as details of what happened. What is
absolutely clear is that they don't have a backout strategy, or they
would have used it. Is that clear enough?

--
Thomas T. Veldhouse
Key Fingerprint: 2DB9 813F F510 82C2 E1AE 34D0 D69D 1EDC D5EC AED1
Spammers please contact me at renegade@veldy.net.
 

On 18 Dec 2004 17:11:59 GMT, "Thomas T. Veldhouse" <veldy71@yahoo.com>
wrote:

>Meaffwin <suka_@cox.net> wrote:
>>
>> /Delurk
>>
>> You admit you don't know much, but then you call them idiots because
>> you assume they can't go back to the pre-patch setup? That's pretty
>> cute.
>>
>> /Relurk
>>
>
>I don't know much as far as details of what happened. What is
>absolutely clear is that they don't have a backout strategy, or they
>would have used it. Is that clear enough?

I don't think the delay is rolling back, I think the delay is finding
which obscure line of code caused things to go to hell and back.

And, you know, any IT guy can tell you that even the best "back out
strategy" can fail.

--
Dark Tyger

Sympathy for the retailer:
http://www.actsofgord.com/index.html
"Door's to your left" -Gord
(I have no association with this site. Just thought it was funny as hell)

Protect free speech: http://stopfcc.com/
 

Dark Tyger wrote:
> On 18 Dec 2004 17:11:59 GMT, "Thomas T. Veldhouse" <veldy71@yahoo.com>
> wrote:
>
>
>>Meaffwin <suka_@cox.net> wrote:
>>
>>>/Delurk
>>>
>>>You admit you don't know much, but then you call them idiots because
>>>you assume they can't go back to the pre-patch setup? That's pretty
>>>cute.
>>>
>>>/Relurk
>>>
>>
>>I don't know much as far as details of what happened. What is
>>absolutely clear is that they don't have a backout strategy, or they
>>would have used it. Is that clear enough?
>
>
> I don't think the delay is rolling back, I think the delay is finding
> which obscure line of code caused things to go to hell and back.
>
> And, you know, any IT guy can tell you that even the best "back out
> strategy" can fail.
>
I gather from what I have read that they just rebooted the machines and
they went fubar within the next few hours. If that's the case, they are
tracking down a very, very serious active bug that just suddenly reared
its head.
 

"Dark Tyger" <darktiger@somewhere.net> wrote in message
news:icp8s0db047rumbbv5kddgm7nfha7s6m3d@4ax.com...
> On 18 Dec 2004 17:11:59 GMT, "Thomas T. Veldhouse" <veldy71@yahoo.com>
> wrote:
>>
>>I don't know much as far as details of what happened. What is
>>absolutely clear is that they don't have a backout strategy, or they
>>would have used it. Is that clear enough?
>
> I don't think the delay is rolling back, I think the delay is finding
> which obscure line of code caused things to go to hell and back.
>
> And, you know, any IT guy can tell you that even the best "back out
> strategy" can fail.
>
> --
> Dark Tyger

Working for one of the 'big boys' in IT myself, I can definitely confirm
that as truth. And I also know from personal experience that when a down
system event occurs, the source of the outage or offending line of code is
not always easily found.


 

In article <41c464df$0$200$8046368a@newsreader.iphouse.net>,
"Thomas T. Veldhouse" <veldy71@yahoo.com> wrote:
> I don't know much as far as details of what happened. What is
> absolutely clear is that they don't have a backout strategy, or they
> would have used it. Is that clear enough?

More likely is that they have a backout strategy, but this was not a
situation in which that helps. For example, here is one possible
sequence of events that would give results like we are seeing:

1. They roll out a patch. People play for a couple hours and serious
problems are found (in this case, quests disappearing).

2. They roll back the patch. That's not sufficient, though...all that
will do is prevent more people from losing quests and whatever else
was zapped; it won't restore what the people who already played lost.

3. They roll back the database to restore the deleted quests. Perhaps
this is not really a roll back, but something more time consuming, such
as restoring the database to an alternate DB server, and then
selectively updating the character tables on the live database from the
restored alternate, to try to restore the quests WITHOUT rolling back XP
or items that people acquired while the bad patch was live.

4. They test with the restored data and the pre-patched servers, and
find out the problem is STILL there--quests are disappearing. Upon
investigating further, they find that the problem was not due to the
patch at all, but rather corruption in the database--it was just a
horrible coincidence that this showed up right after a patch (anyone who
works in software can tell you dozens of stories of coincidences like
that). So, they have to do a full rollback on the database (or, rather,
restore from backup, after reinitializing the database).

5. However, before that, they have to run a thorough check of the
hardware to make sure it wasn't bad hardware that corrupted the DB.
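Step 3 is the fiddly part. As a minimal sketch, assuming a hypothetical `quests` table keyed by `(char_id, quest_id)` and using SQLite purely for illustration (SOE's real schema and database are not known here), a selective restore copies back only the quest rows that are missing from the live database, leaving XP, items, and any quest rows that advanced after the backup untouched:

```python
import sqlite3


def restore_missing_quests(live: sqlite3.Connection, backup_path: str) -> int:
    """Copy quest rows that exist in the backup but are missing from the live DB.

    Hypothetical schema in both databases:
        quests(char_id INTEGER, quest_id INTEGER, step INTEGER,
               PRIMARY KEY (char_id, quest_id))
    Only missing rows are inserted; rows already in the live DB (including any
    that advanced after the backup) and other tables such as XP or items are
    left alone. Returns the number of rows restored.
    """
    live.execute("ATTACH DATABASE ? AS backup", (backup_path,))
    try:
        cur = live.execute(
            "INSERT INTO main.quests (char_id, quest_id, step) "
            "SELECT b.char_id, b.quest_id, b.step FROM backup.quests AS b "
            "WHERE NOT EXISTS ("
            "  SELECT 1 FROM main.quests AS m "
            "  WHERE m.char_id = b.char_id AND m.quest_id = b.quest_id)"
        )
        live.commit()
        return cur.rowcount
    except Exception:
        live.rollback()  # undo a partial restore rather than leave it half-applied
        raise
    finally:
        live.execute("DETACH DATABASE backup")
```

A real restore would also have to decide what to do with rows that exist in both copies but differ, which is exactly where the judgment calls (and the hours) tend to pile up.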

--
--Tim Smith

 

On 19 Dec 2004 15:53:58 GMT, johndoe@example.com wrote:

>I'm a software guy, but I've seen
>hardware guys at work, and usually they can identify issues pretty
>quickly using self-diagnostics and such (assuming you're using "real"
>hardware and have a "real" support contract).

Key word here: -USUALLY-.

--
Dark Tyger

Sympathy for the retailer:
http://www.actsofgord.com/index.html
"Door's to your left" -Gord
(I have no association with this site. Just thought it was funny as hell)

Protect free speech: http://stopfcc.com/
 

johndoe@example.com wrote:
] What else could happen that would require a team of hardware engineers
] 40 hours to figure out the problem? I'm a software guy, but I've seen
] hardware guys at work, and usually they can identify issues pretty
] quickly using self-diagnostics and such (assuming you're using "real"
] hardware and have a "real" support contract).

It depends... I've seen Cray hardware engineers spend a day or two
trying to figure out why a Cray YMP-2 was acting the way it was. So
it's not always obvious.

JimP.
--
http://www.linuxgazette.net/ Linux Gazette
http://blue7green.drivein-jim.net/ December 4, 2004
http://www.drivein-jim.net/ October 24, 2004:
http://crestar.drivein-jim.net/new.html Dec 5, 2004 AD&D
 

In article <41c5a416$0$95329$a1866201@visi.com>, johndoe@example.com
wrote:
> As a professional computer geek, I'm terribly curious what went wrong.
> I'm under the impression EQ1 uses a farm of Linux machines on the
> server side, and assumed SOE would do the same for EQ2. If that's
> true, it's probably not the servers themselves that had the issue, at
> least I wouldn't think so, since they could be easily swapped. I
> don't recall where I heard the rumor SOE uses a farm of Linux machines
> to host EQ1, though, so I may be remembering wrong, and even if it's
> true, who knows what they used for EQ2...

I doubt they are using Linux for EQ1. EQ1 came out early enough that
Linux was probably not even considered. Also, they are known to like
Oracle for databases at SOE, and I am not sure Oracle was available for
Linux when EQ1 came out.

Linux certainly can be used for MMORPGs. DAoC is on Dell servers
running a version of Red Hat customized by Mythic, and using MySQL for
their database. But DAoC came out a couple years after EQ1, when Linux
had advanced quite a bit.

--
--Tim Smith
 

On 19 Dec 2004 15:53:58 GMT, johndoe@example.com wrote:

>What else could happen that would require a team of hardware engineers
>40 hours to figure out the problem? I'm a software guy, but I've seen
>hardware guys at work, and usually they can identify issues pretty
>quickly using self-diagnostics and such (assuming you're using "real"
>hardware and have a "real" support contract).

I think what you need to also consider is the "Try
this..test...nope..." cycle. For some systems, running through the
checklist of fixes takes a few seconds to a few minutes each. Other
times, it takes hours.

A program I worked on would sometimes manifest bugs only under extreme
-- but real world -- conditions. It was a financial simulator, and
clients would set up runs for a weekend and then come back on Monday.
If what they came back to was an error screen, I heard about it. Big
time. In-house testing usually used ten minute to 1 hour runs; for
obvious reasons, testing EVERY feature in a three day run was not
practical. We'd set off one "big run" just to be sure, but since there
were thousands of options, we couldn't test every combination of
settings to be 100% sure one particular blend -- often with only one
type of data -- wasn't going to crash it.

If it takes even a half hour to apply a patch, boot, test, and try
again, even a smallish checklist of "stuff to try" can take a long
time. THEN, you have to apply it on every server, and check them all.
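A rough back-of-the-envelope sketch of that arithmetic, assuming fixes are tried one at a time, the working fix is the last one on the list, and the servers are then patched one after another. The function name and all the numbers are hypothetical, not figures from SOE.

```python
def worst_case_hours(fixes_to_try: int, cycle_minutes: float,
                     servers: int = 0, minutes_per_server: float = 0.0) -> float:
    """Hours to work through a checklist serially, then roll the fix out server by server.

    Worst case: every item on the checklist gets a full patch/boot/test cycle.
    """
    search_minutes = fixes_to_try * cycle_minutes
    rollout_minutes = servers * minutes_per_server
    return (search_minutes + rollout_minutes) / 60


# A "smallish" checklist of 12 ideas at half an hour each:
print(worst_case_hours(12, 30))          # 6.0
# ...then applying and checking the fix on 20 servers at 15 minutes each:
print(worst_case_hours(12, 30, 20, 15))  # 11.0
```

Running several test rigs or patching servers in parallel shrinks these numbers, but the search itself is often serial, since each failed attempt shapes what gets tried next.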

This isn't "Windows crashed, reboot, what's the big deal?" stuff here.
This is uber-complicated. (I know YOU probably know this, but a lot of
people with no experience in complex networked systems think their
methods of dealing with a single-system glitch scale effortlessly to a
server farm running an astoundingly complex program.)

And bugs aren't always evident. A long time ago, I was a 4D programmer
with Peat Marwick. One of our divisions constantly lost data with the
program I was working on. I eventually tracked it down to the fact
they had entered "Aetna insurance" using the "Æ" ligature character
instead of a separate "A" and "E". 4D's indexing for text couldn't handle high-ASCII
in a field. Kaboom. This never showed in any test cases, because we
never used high ASCII in names.
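The ligature (Æ, U+00C6) is a single character, not "A" plus "E"; in Latin-1 or Windows-1252 it is byte 0xC6 (198), above the 7-bit ASCII range. Here is a minimal sketch, in Python rather than 4D, of the kind of input check that would flag such characters before they reach an index that can't handle them. The function name and the choice to report offending characters (rather than reject or transliterate them) are assumptions for illustration; it says nothing about how 4D's indexer actually worked.

```python
def non_ascii_chars(text: str) -> list[tuple[int, str]]:
    """Return (position, character) for every character outside 7-bit ASCII (code point > 127)."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


ligature_name = "\u00c6tna Insurance"  # "Ætna" typed with the single Æ character
plain_name = "Aetna Insurance"         # the same name typed as "A" + "e"

print(non_ascii_chars(ligature_name))  # [(0, 'Æ')]
print(non_ascii_chars(plain_name))     # []
print(ligature_name == plain_name)     # False: an exact-match lookup on one misses the other
print("\u00c6".encode("latin-1"))      # b'\xc6' (byte 198, above the 7-bit range)
```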

People who call themselves "programmers" because they took a Visual
Basic class during summer school very rarely have any grasp of what
coding is like in the real world, and have ridiculous expectations of
both the predictability of bugs and the difficulty of fixing them in a
timely manner.
*----------------------------------------------------*
Evolution doesn't take prisoners:Lizard
"I've heard of this thing men call 'empathy', but I've never
once been afflicted with it, thanks the Gods." Bruno The Bandit
http://www.mrlizard.com
 

Graeme Faelban <RichardRapier@netscape.net> wrote:
>
> It actually sounded like they ran into a hardware issue. They
> specifically mentioned working with their vendors to address some issues.
>

Yes, they did mention that. However, I am not convinced that was the
crux of the problem. First, if it was a routing issue or a central
database issue, then all the servers would have come up at the same
time. However, they brought up machines one at a time over many, many
hours, so it appears that there was an issue with each of the servers.
I HIGHLY doubt they had a hardware issue on each and every machine. If
they did, then there were some very odd circumstances leading up to this
indeed.

--
Thomas T. Veldhouse
Key Fingerprint: 2DB9 813F F510 82C2 E1AE 34D0 D69D 1EDC D5EC AED1
Spammers please contact me at renegade@veldy.net.
 

Thomas T. Veldhouse <veldy71@yahoo.com> wrote:
> Graeme Faelban <RichardRapier@netscape.net> wrote:
>>
>> It actually sounded like they ran into a hardware issue. They
>> specifically mentioned working with their vendors to address some issues.
>>

> Yes, they did mention that. However, I am not convinced that was the
> crux of the problem. First, if it was a routing issue or a central
> database issue, then all the servers would have come up at the same
> time. However, they brought up machines one at a time over many, many
> hours, so it appears that there was an issue with each of the servers.
> I HIGHLY doubt they had a hardware issue on each and every machine. If
> they did, then there were some very odd circumstances leading up to this
> indeed.

Aw, come on. All we can do is guess. Sure, you may be able to make
an educated guess, but it's a guess nonetheless. It could be that the load
balancer went haywire, and that after replacing it and tweaking the
new unit they brought the servers up slowly. But as I said, even this
is a guess. ;)

I doubt SoE would need to invent a reason for such downtime if they
DID have a real one to cope with. They may be oversimplifying the matter,
but the reason should be real.


Hagen
 

Hagen Sienhold <durragon@web.de> wrote:
> Aw, come on. All we can do is guess. Sure, you may be able to make
> an educated guess, but it's a guess nonetheless. It could be that the load
> balancer went haywire, and that after replacing it and tweaking the
> new unit they brought the servers up slowly. But as I said, even this
> is a guess. ;)
>

Load balancing would not have caused people to find that their quest log
was empty, or other such software issues.

> I doubt SoE would need to invent a reason for such downtime if they
> DID have a real one to cope with. They may be oversimplifying the matter,
> but the reason should be real.
>

I don't think they invented the hardware problem. I do believe, though,
that the hardware problem was probably not the sole cause of the outage,
and I might go so far as to say that it was not the primary cause of the
outage. Based on their own notes about missing quests in the quest logs,
etc., it doesn't look like hardware alone was the issue.

--
Thomas T. Veldhouse
Key Fingerprint: 2DB9 813F F510 82C2 E1AE 34D0 D69D 1EDC D5EC AED1
Spammers please contact me at renegade@veldy.net.
 

On 20 Dec 2004 17:09:10 GMT, "Thomas T. Veldhouse" <veldy71@yahoo.com>
wrote:

>I don't think they invented the hardware problem. I do believe, though,
>that the hardware problem was probably not the sole cause of the outage,
>and I might go so far as to say that it was not the primary cause of the
>outage. Based on their own notes about missing quests in the quest logs,
>etc., it doesn't look like hardware alone was the issue.

It's very possible the initial bug wasn't due to hardware, but that a
hardware issue prevented a patch/rollback from working as desired. So
one software bug runs into one hardware bug, and nine months from now,
a lot of EQ widows give birth. :)
*----------------------------------------------------*
Evolution doesn't take prisoners:Lizard
"I've heard of this thing men call 'empathy', but I've never
once been afflicted with it, thanks the Gods." Bruno The Bandit
http://www.mrlizard.com
 

user

> People who call themselves "programmers" because they took a Visual
> Basic class during summer school very rarely have any grasp of what
> coding is like in the real world, and have ridiculous expectations of
> both the predictability of bugs and the difficulty of fixing them in a
> timely manner.

I think many people who call themselves "programmers" because they've
got several years experience doing development in real world situations
are still dismal at designing robust systems.

Not everyone of course... but far far far too many.

Of course, I also happen to think the greatest source of bugs in the
real world is the *deadline*. :p
 

42 <nospam@nospam.com> wrote in news:MPG.1c3109fcd3b1bfdf989942@shawnews:

>> People who call themselves "programmers" because they took a Visual
>> Basic class during summer school very rarely have any grasp of what
>> coding is like in the real world, and have ridiculous expecations of
>> both the predictability of bugs and the difficulty of fixing them in a
>> timely manner.
>
> I think many people who call themselves "programmers" because they've
> got several years experience doing development in real world situations
> are still dismal at designing robust systems.
>
> Not everyone of course... but far far far too many.
>
> Of course, I also happen to think the greatest source of bugs in the
> real world is the *deadline*. :p
>

Yep, 22 years as an embedded systems software developer here, and that is
still the biggest source of bugs. You just end up not having time to
address and test everything.

--
On Erollisi Marr in <Sanctuary of Marr>
Ancient Graeme Faelban, Barbarian Prophet of 69 seasons

On Steamfont
Graeme, 18 Dwarven Shaman, 15 Scholar
 

On Mon, 20 Dec 2004 15:27:05 -0500, Lizard wrote:

>So one software bug runs into one hardware bug, and nine months from now,
>a lot of EQ widows give birth. :)

Darn! I knew there was something I should have done this weekend rather than
trying to login every half hour :)
--
Henrik Dissing
Vork - Dwarf Warrior on Highkeep
Member of Highkeep Ring

(e-mail: hendis AT post DOT tele DOT dk)
 

On Tue, 21 Dec 2004 00:11:21 GMT, 42 <nospam@nospam.com> wrote:

>Of course, I also happen to think the greatest source of bugs in the
>real world is the *deadline*. :p

Well, of course.

If you had infinite time to fix bugs, programs would be released
bug-free.

You don't. And there's a constant pressure, esp. in MMORPGs, to add
new features (which means new bugs) all the time.

If you're waiting for a bug-free game, you'll be waiting a long, long,
time.

I just said in an earlier post: Every product you buy was shipped with
a list of 'known bugs' on some developer's desk. It was determined that
it was 'good enough', and that was that.
*----------------------------------------------------*
Evolution doesn't take prisoners:Lizard
"I've heard of this thing men call 'empathy', but I've never
once been afflicted with it, thanks the Gods." Bruno The Bandit
http://www.mrlizard.com