SGI takes Itanium & Linux to 1024-way

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel

Yousuf Khan wrote:

> A single Linux image running across a 1024 Itanium processor machine.
>
> http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,94564,00.html
>
>

"The users get one memory image they have to deal with," he [Pennington,
the interim director of NCSA] said. "This makes programming much easier,
and we expect it to give better performance as well."
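
To make Pennington's point concrete, here is a toy one-dimensional
relaxation sweep the way it looks when the whole grid lives in one
address space (a made-up kernel, not anything NCSA actually runs). The
parallelization is a single OpenMP pragma; the message-passing version
of the same sweep has to carve the grid into per-node blocks, keep
ghost cells at every block edge, and exchange them explicitly before
each iteration.

/* Toy 1-D Jacobi sweep in a single address space: every point of the
 * grid is just memory, so any processor can read its neighbours with
 * an ordinary load. */
#include <stdio.h>
#include <stdlib.h>

#define N     1000000
#define STEPS 100

int main(void)
{
    double *u    = malloc(N * sizeof *u);
    double *unew = malloc(N * sizeof *unew);
    if (!u || !unew)
        return 1;

    for (int i = 0; i < N; i++)
        u[i] = (i == 0) ? 1.0 : 0.0;   /* hot left boundary */
    unew[0] = u[0];
    unew[N - 1] = u[N - 1];

    for (int step = 0; step < STEPS; step++) {
        /* in a shared address space this pragma is the whole
         * parallelization; it is ignored if compiled without OpenMP */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        double *tmp = u; u = unew; unew = tmp;
    }

    printf("u[1] after %d sweeps: %g\n", STEPS, u[1]);
    free(u);
    free(unew);
    return 0;
}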

Too early to call it a trend, but I'm encouraged to see the godfather of
the "Top" 500 list talking some sense as well:

callysto.hpcc.unical.it/hpc2004/talks/dongarra-survey.ppt

slides 37 and 38.

A single system image is no simple cure. It may not be a cure at all.
But it's encouraging that somebody is taking it seriously enough to
build a kilonode machine with a single address space.

"Scalability" being a challenge for such installations (you can't just
order more boxes and more cable and take another rural county out of
agricultural production to move "up" the "Top" 500 list), the premium is
on processors with high single-thread throughput.

RM
 

Robert Myers wrote:

[SNIP]

> A single system image is no simple cure. It may not be a cure at all.
> But it's encouraging that somebody is taking it seriously enough to
> build a kilonode machine with a single address space.

Hats off to SGI, kilonode ssi is a neat trick. :)

Let's say you write code that makes use of a large single system
image machine. Let's say SGI falls behind the curve and you need
answers faster: where can you go for another large single system
image machine?

I see that kind of awfully clever machine as vendor lock-in waiting
to happen. If you want to avoid lock-in you end up writing your
code to the lowest common denominator, and in this case that will
probably remove any advantage gained by SSI (application depending
of course).

Cheers,
Rupert
 

Rupert Pigott wrote:

>
> Let's say you write code that makes use of a large single system
> image machine. Let's say SGI fall behind the curve and you need
> answers faster : Where can you go for another large single system
> image machine ?
>

What curve are we keeping up with these days?

The difference in scalability between the Altix and Blue Gene is
interesting mostly if you're trying to hit arbitrarily defined
milestones in a Gantt chart.

For hydro, a factor of ten in machine size is a 78% increase in the
number of grid points available to resolve a given scale:
whoop-de-ding. Maybe
there's something different about actinide-lanthanide decay series
that's worth understanding. I'll get around to it some time--even
though I strongly suspect I'm being led on a wild goose chase. The real
justification for the milestones on the Gantt chart of the last of the
big spenders is that a petaflop is a nice big round number for a goal.
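
The 78%, for anybody wondering, is just 10^(1/4), which is how I get
the number: for a time-accurate three-dimensional calculation the cost
grows roughly as N^4 (three space dimensions plus a time step that has
to shrink with the mesh), so a factor of ten in machine buys about
1.78 times the grid points per dimension. A two-line check, nothing
more:

/* back-of-the-envelope: per-dimension resolution gain from a 10x
 * bigger machine, assuming cost ~ N^4 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double gain = pow(10.0, 0.25);                 /* ~1.778 */
    printf("per-dimension gain: %.3f (~%.0f%%)\n",
           gain, (gain - 1.0) * 100.0);            /* prints ~78 */
    return 0;
}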

> I see that kind of awfully clever machine as vendor lock-in waiting
> to happen. If you want to avoid lock-in you end up writing your
> code to the lowest common denominator, and in this case that will
> probably remove any advantage gained by SSI (application depending
> of course).
>

Blue Gene is now not awfully clever? :).

Commodity chip, flat address space. That sounds pretty vanilla to me.
How do you get more common than that? You can get an Itanium box with a
flat address space to your own personal work area much more readily than
you can get a Blue Gene.

There is no way not to leave you with the idea that I think single-image
machines are the way to go. I don't know that, and I'm not even certain
what course of investigation I would undertake to decide whether they are
the way to go or not. What I like about the single address space is that it
would appear to make the minimum architectural imposition on problem
formulation.

RM
 

Robert Myers wrote:

> How do you get more common than that? You can get an Itanium box with a
> flat address space to your own personal work area much more readily than
> you can get a Blue Gene.

Extend that argument further and you are buying Xeons.

The point is that 1000-node machines with shared address spaces don't
fall out of trees. Who said anything about BlueGene anyway?

> There is no way not to leave you with the idea that I think single-image
> machines are the way to go. I don't know that, and I'm not even certain

Over the long run I think it will be very hard to justify the
extra engineering and purchase cost over message passing gear.

> what course of investigation I would undertake to decide whether they are
> the way to go or not. What I like about the single address space is that it
> would appear to make the minimum architectural imposition on problem
> formulation.

People made a similar argument for CISC machines too. VAX
polynomial instructions come to mind. :)

Cheers,
Rupert
 

Rupert Pigott wrote:

> Robert Myers wrote:
>
>> How do you get more common than that? You can get an Itanium box with
>> a flat address space to your own personal work area much more readily
>> than you can get a Blue Gene.
>
> Extend that argument further and you are buying Xeons.
>

There is a fair question that could be asked about almost any application
these days: why not IA-32 (probably with 64-bit extensions)? When
you've got superlinear interconnect costs, you want each node to be as
capable as possible. The application of that argument to Itanium in
this particular case is wobbly, since the actual usefulness of
Itanium may be just as theoretical as the usefulness of the clusters
I've been worrying about.

> The point is 1000 node machines with shared address spaces don't
> fall out of trees. Who said anything about BlueGene anyways ?
>

I did. Blue Gene was the best contrast I could think of to a single
image Itanium machine in terms of cost, energy efficiency, and
scalability. There is no fundamental reason why BlueGene couldn't
become widely used and accepted, but it probably won't be because it
won't show up in the workspace of your average graduate student or
postdoc.

Your question is what do we do when we need more than 1000 nodes. It's
a fair question, but not the only one you could ask. My questions are:
where does the software that runs on the big machine come from, in what
environment was it developed, at what cost, and with what opportunities
for continued development?

>> There is no way not to leave you with the idea that I think
>> single-image machines are the way to go. I don't know that, and I'm
>> not even certain
>
>
> Over the long run I think it will be very hard to justify the
> extra engineering and purchase cost over message passing gear.
>

Hardware is cheap, software is expensive. If we've run out of
interesting things to do with making processors astonishingly powerful
and inexpensive, we certainly haven't run out of interesting things to
do in making interconnect astonishingly powerful and inexpensive.

>> what course of investigation I would undertake to decide whether they are
>> the way to go or not. What I like about the single address space is that
>> it would appear to make the minimum architectural imposition on
>> problem formulation.
>
>
> People made a similar argument for CISC machines too. VAX
> polynomial instructions come to mind. :)
>

The RISC/CISC argument went away when microprocessors were developed
that could hide RISC execution behind a CISC programming model. The
neat hardware insight (RISC) did not, in the end, impose itself on
applications. No more should a particular hardware reality about
multi-processor machines impose itself on applications.

RM
 

"Yousuf Khan" <bbbl67@ezrs.com> wrote in message
news:e6kKc.237$S5k.21@news04.bloor.is.net.cable.rogers.com...
>A single Linux image running across a 1024 Itanium processor machine.
>
> http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,94564,00.html
>


Nobody has said it yet, so guess I'll have to say it:
"Imagine a Beowulf cluster of ..."

--

... Hank

http://horedson.home.att.net
http://w0rli.home.att.net
 

Rupert Pigott wrote:

> Robert Myers wrote:
>
>> Rupert Pigott wrote:
>>

<snip>

>>
>> You are apparently arguing for the desirability of folding the
>> artificial computational boundaries of clusters into software. If
>
>
> That happens with SSI systems too. There is a load of information that
> has been published about scaling on SGI's Origin machines over the
> years. IIRC Altix is based on the same Origin 3000 design. You may
> remember that I quizzed Rob Warnock on this, he said that there were
> in practice little gotchas that tend to crop up at particular #'s of
> procs. He even noted that the gotcha processor counts tended to change
> with the particular generation of Origin.
>
>> that's a necessity of life, I can learn to live with it, but I'm
>> having a hard time seeing it as desirable. We are so fortunate as to
>> live in a universe that presents itself to us in midtower-sized
>> chunks? I'm worried. ;-).
>
>
> In my mind it's a question of fitting our computing effort to reality
> as opposed to living in an Ivory Tower. Some goals, while worthy,
> desirable, or even partially achievable, are basically impossible to
> achieve in reality. A genuinely *flat* address space is impossible
> right here and now. That SSI Altix box will *not* have a *flat* address
> space in terms of time. It is a NUMA machine. :)
>

Well, yes, it is. The spread in latencies is more like half a
microsecond, as opposed to five microseconds for the latest and greatest
of the DoE build-to-order specials.

On the question of Ivory Towers vs. reality, I believe that I am on the
side of the angels, naturally. If you believe the right question really
is: "What's the least expensive way we can get a high Linpack score?",
then clusters are a slam dunk, but I don't think that anybody worth
talking to on the subject really thinks that's the right question to be
asking.

As to access to 1000-node and even bigger machines, I don't need them.
What I need is to know what kind of machine a code is likely to run on
when somebody decides an NCSA-type installation is required.

How you will _ever_ scale _anything_ to the kinds of memory and
compute requirements needed to do even some very pedestrian problems
properly is my real concern, and, from that point of view, no
architecture currently on the table, short of specialized hardware, is
even in the right universe.

Given that _nothing_ currently available can really do the physics
right--with the possible exception of things like the Cell-like chips
the Columbia QCD people are using--and that nothing currently available
really scales in a way that I can imagine, I'm inclined to give heavy
emphasis to usability.

>>> It's a
>>> matter of choice over the long run... If you use the unique features
>>> of a kilonode Itanium box then you're basically locked-in. Clearly
>>> this is not an issue for some establishments, Cray customers are a
>>> good example. :p
>>>
>>
>> Can you give an example of something that you think would happen?
>
>
> Depends on the app. Stuff like memory mapping one large file for read
> and occasional write could cause some fantastic locking + latency
> issues when it comes to porting. :)
>

I understand just enough about operating systems to know that building a
1000-node image that runs on realizable hardware is a real
tour-de-force. I also understand that you can take off-the-shelf copies
of, say, RedHat Linux, and some easily-obtainable clustering software
and (probably) get a thousand beige boxes to run like a kilonode
cluster. Someone else (Linus, SGI, et al) wrote the Altix OS. Someone
else (Linus, RedHat, et al) wrote the OS for the cluster nodes. I don't
want to fiddle with either one. You want me to believe that I am better
off synchronizing processes and exchanging data across Infiniband stacks,
through trips in and out of kernel and user space, with heaven only
knows how many control handoffs for each exchange, than I am reading
and writing to my own user space under the control of a single OS, and I
just don't.
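
To put Rupert's memory-mapping example in concrete terms, here is
roughly what "map one big file, read a lot, write occasionally" looks
like under a single OS image (a toy sketch, not anybody's production
code). Every process that maps the file sees the occasional write
through ordinary loads and stores; port the same thing to a cluster
and each of those writes turns into explicit messages or a distributed
lock, which is exactly his point.

/* Toy sketch: map a file of doubles, read it all, write back one
 * value.  Under one OS image the write is just a store; the kernel
 * and the coherence hardware do the sharing. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s datafile\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    size_t n = (size_t)st.st_size / sizeof(double);
    if (n == 0) { fprintf(stderr, "file too small\n"); return 1; }

    double *data = mmap(NULL, (size_t)st.st_size,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* mostly reads ... */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];

    /* ... and the occasional write, visible to every sharer at once */
    data[0] = sum / (double)n;

    printf("mean = %g over %zu values\n", sum / (double)n, n);
    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}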

<snip>

>
> I mentioned Opteron, if HT really does suffer from crash+burn on
> comms failure then it is holding itself back. If that ain't the
> case I'd have figured that a tiny form factor Opteron + DRAM +
> router cards would be a reasonable component for high-density
> clusters and beige SSI machines. You'd need some facility for
> driving some links for longer distances than HT currently allows
> too ($$$). The next thing holding you back is tuning the OS + Apps
> to a myriad of possible configurations... :(

I'm guessing that, the promise of Opteron for HPC notwithstanding, HT is
going to be marginalized by PCI Express/Infiniband.

> [SNIP]
>
>> The optimistic view is that the chaos we currently see is the HPC
>> equivalent of the pre-Cambrian explosion and that natural selection
>> will eventually give us a mature and widely-adopted architecture. My
>> purpose in starting this discussion was simply to opine that single
>> image architectures have some features that make them seem promising
>> as a survivor--not a widely-held view, I think.
>
>
> I'm sure they'll have their place. But in the long run I think that
> PetaFLOP pressure will tend to push people towards message passing
> style machines. Consider this, though: the Internet is becoming more and
> more prominent in daily life. The Spooks must have a fair old time
> keeping up with the sheer volume of data flowing around the globe.
> Distributed processing is a natural fit here, SSI machines just would
> not make sense. More and more governments and their civil servants
> will want to make use of this surveillance resource too, check out
> the rate at which legislation is legitimising their intrusion on the
> individual's privacy. The War on Terror has added more fuel to that
> growth market too. :)
>
Nothing that _I_ say about distributed processing is going to slow it
down, that's for sure, and that isn't my intent. If you've got a
google-type task, you should use google-type hardware. Computational
physics is not a google-type task.

RM
 

Robert Myers wrote:
> Rupert Pigott wrote:

[SNIP]

> As always, though, the complexity has to go somewhere. What I can see

Yes. I am painfully aware of Mashey's concerns about pushing
complexity from one place to another.

> of IEEE 1355 looks like an open source project to me. With open source,

LOL, not at all. It was a write-up of the T9000's VCP. Bits and pieces
of that technology have made their way into proprietary solutions.

[SNIP]

> You still wind up with many of the same problems, though: software
> encrusted with everybody's favorite feature and interfaces that get
> broken by changes that are made at a level you have no control over
> (like the kernel) and that ripple through everything, for example.

Of course, but it's easier to change a kernel than it is to respin
silicon, or replace several thousand busted boards, right? A lot of
MPP machines seem to give the customer access to the kernel source,
which makes it easier for the desperados to fix the problems. :)

[SNIP]

> I wonder if part of what you object to with systems like Altix is that
> it seems like movement away from open systems and back to the bad old
> days. Could a bunch of geeks with a little money from, say, DARPA, do
> better? Maybe. I think it's been tried at least once. ;-).

I don't have a problem with Altix at all. I have a *concern* that
the SSI feature is rather like putting chrome on a Porsche 917K if
you are really interested in getting good perf out of it on an
arbitrary problem + dataset. Data locality is still a key issue.

I don't deny that it will make some apps easier, but in those cases
you are wide open to vendor lock-in IMO. There are worse vendors
than SGI of course, and I don't think they would be quite as evil
as IBM were reputed to be.

For those two reasons I question the long term viability of SSI
MPP machines.

Cheers,
Rupert
 

In article <1090171989.132815@teapot.planet.gong>,
Rupert Pigott <roo@try-removing-this.darkboong.demon.co.uk> wrote:

> Robert Myers wrote:
>
> [SNIP]
>
> > A single system image is no simple cure. It may not be a cure at all.
> > But it's encouraging that somebody is taking it seriously enough to
> > build a kilonode machine with a single address space.
>
> Hats off to SGI, kilonode ssi is a neat trick. :)
>
> Let's say you write code that makes use of a large single system
> image machine. Let's say SGI falls behind the curve and you need
> answers faster: where can you go for another large single system
> image machine?
>
> I see that kind of awfully clever machine as vendor lock-in waiting
> to happen. If you want to avoid lock-in you end up writing your
> code to the lowest common denominator, and in this case that will
> probably remove any advantage gained by SSI (application depending
> of course).

Let's say, instead, that one has an application that seems to require a
256 node machine, but that need might grow in the next couple of years.
SGI's announcement takes the risk out of choosing SGI for that
application.

And after a few more years, a then current 256 node machine will be able
to take the place of a current 1024 node monster, if the application
doesn't grow too much and one is only worried about the machine or SGI
wearing out.
____________________________________________________________________
TonyN.:' tonynlsn@shore.net
'
 

Tony Nelson wrote:

[SNIP]

> Let's say, instead, that one has an application that seems to require a
> 256 node machine, but that need might grow in the next couple of years.
> SGI's announcement takes the risk out of choosing SGI for that
> application.

Regardless, you are still effectively locked in if you become dependent
on the SSI feature.

There are also some other factors to take into account, such as: does
your application scale to 1024 on that mythical machine? If it does
not, who do you turn to if you are committed to SSI?

> And after a few more years, a then current 256 node machine will be able
> to take the place of a current 1024 node monster, if the application
> doesn't grow too much and one is only worried about the machine or SGI
> wearing out.

Assuming clock rate cranking continues to pay off and the compilers
improve significantly. I figure it'll come down to how much cache Intel
can cram onto an IA-64 die, and that is a diminishing returns game.

BTW: If you read through the immense amount of opinionated stuff I
posted, you will see that I actually give SGI some credit. The question
I raise, though, is: Is SSI really that useful, given the lock-in factor?

Cheers,
Rupert