
Multi-core Processors: Interplay Of Hardware And Programming Models

Source: Tom's Hardware US


Multicore architecture has created a little frenzy among programming language designers and software developers to hone their concurrent and parallel programming skills. But is this push one-way? Is it just the hardware driving software and language design in the multi-core era? Rajesh Karmani continues his article series on TG Daily, which focuses on a dramatic shift in software development techniques to help developers exploit the horsepower of multi-core processors.

We know that applications have driven chip design in the past (ASICs being the extreme example). But performance has been the key factor in hardware design, and it has mostly dictated what software should look like, right from von Neumann machines (sequential model) to caches (locality-awareness) to out-of-order execution (weaker memory consistency models). In the multi-core era, with programmability such a big challenge, chip designers seem to be more accommodating towards language designers.

Previously published in this series:
An inside view: The $20 million Intel/Microsoft multicore programming initiative
Concurrent Programming: A solution for the multi-core programming era?

I previously discussed how concurrent programming has traditionally been viewed and done. As pointed out, in a shared-memory model with many threads trying to access the shared state concurrently, managing correctness and consistency becomes a difficult task. It is not hard to argue that such a model will not scale as the number of cores, and hence the number of threads, increases. No wonder it is sometimes referred to as "wild concurrency", and programming language researchers and practitioners are working hard to tame it. I discussed one such proposal called Transactional Memory (TM). It has generated enough interest to convince Sun to integrate TM support in its upcoming Rock processor.

In this article, I'll discuss an alternative programming model, the message-passing model of concurrency, and present my thoughts on the possible impact it can have on processor architecture. It has been widely used in academia and research labs for a few years. These are the domains that had the demand (scientific computation, physical simulations and numerical computation are CPU-intensive) and the resources (money, people, grids, clusters) for parallel programming. Not to mention that the main driver behind parallelism is the huge amount of data these applications process. On the other hand, it is only through the impact of multi-core chips that parallel programming is beginning to push itself into the mainstream.

Message Passing Interface (MPI)

One of the prime reasons for its wide adoption is the standardization of the message-passing model in the early 1990s in the shape of MPI (Message Passing Interface), a specification subsequently implemented by different groups. A list can be found here. Coincidentally, two of the pioneering scientists behind MPI, Dr William Gropp and Dr Marc Snir, are faculty members at UIUC. Dr Marc Snir is also the co-director of UPCRC at the University of Illinois.

The basic goal of MPI is to obtain high performance, scalability and portability in these domains. Although it defines a large number of functions in its specification, there are six basic calls. Point-to-point communication includes both synchronous and asynchronous calls. Apart from point-to-point communication, MPI also defines so-called collective communication patterns; among others, these include broadcast, scatter, gather, reduce, scan, all-reduce and all-gather. MPI started out assuming no shared memory, but later versions incorporated distributed shared memory architectures. A good introduction to MPI is available here. There are plenty of other references available online.
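To make the six-call core concrete, here is a toy sketch in Python (not real MPI, and not a real MPI binding; all helper names below are invented for illustration) that mimics the shape of an MPI program. Each "rank" runs in its own thread with a private mailbox; queue `put`/`get` stand in for `MPI_Send`/`MPI_Recv`, thread start/join play the role of `MPI_Init`/`MPI_Finalize`, and the `rank`/`size` arguments mirror `MPI_Comm_rank`/`MPI_Comm_size`:

```python
import threading
import queue

# Loose mapping to MPI's six basic calls (for illustration only):
#   MPI_Init / MPI_Finalize   -> thread start / join
#   MPI_Comm_rank / Comm_size -> the `rank` and `size` arguments below
#   MPI_Send / MPI_Recv       -> mailbox.put / mailbox.get

def worker(rank, size, mailboxes, results):
    if rank == 0:
        # Rank 0 sends a payload to every other rank (point-to-point sends)...
        for dest in range(1, size):
            mailboxes[dest].put((0, dest * dest))
        # ...then gathers one reply from each of them.
        replies = [mailboxes[0].get() for _ in range(size - 1)]
        results[0] = sorted(replies)
    else:
        src, payload = mailboxes[rank].get()   # blocking receive
        mailboxes[src].put(payload + rank)     # reply back to the sender

size = 4
mailboxes = [queue.Queue() for _ in range(size)]  # one inbound mailbox per rank
results = {}
threads = [threading.Thread(target=worker, args=(r, size, mailboxes, results))
           for r in range(size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # [2, 6, 12]
```

A real MPI program would run the ranks as separate processes, possibly on separate machines; the threads-and-queues version only captures the programming-model shape, not the performance characteristics.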

MPI has been quite successful in achieving its objectives and has hence been widely used to write parallel programs. Although based on a message-passing model, it is a library and not a full-blown programming language. Therein lies one of its strengths: programming-language independence and compatibility with legacy languages.

Problems with MPI

Although I don’t claim first-hand experience with MPI, it involves a fair amount of hand-tuning, including partitioning of code and data, and placement and scheduling across multiprocessor architectures. If the goal is to obtain the last iota of performance and the stakes are high, this makes good sense. In fact, projects involving MPI have computer scientists working with physicists, astronomers and other domain scientists to deliver the performance. MPI has been deemed so low-level that it has been called the "assembly language" of parallel programming. With such a high entry barrier, MPI needs to raise its abstraction level with elegant constructs to define the different communication and coordination patterns for concurrent and parallel programming. Professor Kale from UIUC has been working on an adaptive implementation of MPI and a dynamic run-time to support placement and scheduling on multiprocessor machines. This is a big step forward from the low-level mechanisms in MPI.

Also, it is prone to deadlocks due to synchronous communication, but because it has been in the hands of expert programmers (scientists) so far, the problem has been masked. One can imagine how error-prone and tricky it can get for mainstream programming.
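To sketch why synchronous sends deadlock so easily: if two processes each execute a blocking send before their receive, both wait forever for a receiver that never arrives. The classic fix is to break the symmetry, for example by rank parity (even ranks send first, odd ranks receive first). The Python sketch below uses an invented rendezvous channel, `SyncChannel`, as a stand-in for a synchronous-mode send; it shows only the fixed ordering, since the naive both-send-first version would simply hang:

```python
import threading
import queue

class SyncChannel:
    """Rendezvous channel: send() blocks until the receiver has taken the
    message, roughly approximating a synchronous-mode (blocking) send."""
    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, msg):
        self._data.put(msg)
        self._ack.get()          # block until the receiver acknowledges

    def recv(self):
        msg = self._data.get()
        self._ack.put(None)      # release the blocked sender
        return msg

# Two "ranks" that each want to exchange a value. If both called send()
# first, both would block forever; ordering by rank parity avoids this.
def worker(rank, out_ch, in_ch, results):
    if rank % 2 == 0:
        out_ch.send(f"hello from {rank}")
        results[rank] = in_ch.recv()
    else:
        results[rank] = in_ch.recv()
        out_ch.send(f"hello from {rank}")

ch01 = SyncChannel()  # rank 0 -> rank 1
ch10 = SyncChannel()  # rank 1 -> rank 0
results = {}
t0 = threading.Thread(target=worker, args=(0, ch01, ch10, results))
t1 = threading.Thread(target=worker, args=(1, ch10, ch01, results))
t0.start(); t1.start()
t0.join(); t1.join()
print(results[0])  # hello from 1
print(results[1])  # hello from 0
```

The point is that correctness here depends on a global ordering convention spread across both workers, which is exactly the kind of whole-program reasoning that trips up non-expert programmers.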

But MPI is not the end of message-passing models. A more abstract model based on asynchronous message-passing is the Actor model of programming. Although originally proposed around 30 years ago, it has been receiving a lot more attention lately due to the problem of multicore programming. I briefly discussed the Actor model in a previous article.

Message-passing model revisited

The message-passing model (with no shared state) has some nice properties for concurrent programming. The only non-determinism is the arrival order of messages. This can be resolved locally at each site, leading to local reasoning about the correctness of programs. Compare this to the multi-threaded model, where shared state can be accessed from any point in any thread, requiring global reasoning about the correctness and consistency of programs. Moreover, the message-passing model is more amenable to visualization of the program’s flow, as local state is abstracted away. Messages represent the explicit data flow in the application, and an easily comprehensible picture of the program emerges. Visualization tools go towards solving one of the problems that I see with any large-scale programming task: the short attention span of the programmer in space (code files) and in time.
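These properties can be sketched with a hypothetical `CounterActor` (an invented name, built on a plain thread and queue rather than any actor library): state is strictly local to the actor, messages are the only way in or out, and concurrent senders need no locks because the mailbox serializes everything:

```python
import threading
import queue

class CounterActor:
    """Minimal actor: private state plus a mailbox. The only
    non-determinism is the order messages arrive in the queue."""
    def __init__(self):
        self._count = 0                 # local state, never shared directly
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg, reply = self._mailbox.get()  # process one message at a time
            if msg == "incr":
                self._count += 1
            elif msg == "get":
                reply.put(self._count)        # state leaves only via messages
            elif msg == "stop":
                break

    def send(self, msg):                      # asynchronous, fire-and-forget
        self._mailbox.put((msg, None))

    def ask(self, msg):                       # send and wait for a reply
        reply = queue.Queue()
        self._mailbox.put((msg, reply))
        return reply.get()

actor = CounterActor()
# Four threads send messages concurrently; no locks are needed because
# the actor drains its mailbox sequentially.
senders = [threading.Thread(target=lambda: [actor.send("incr") for _ in range(100)])
           for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()
final = actor.ask("get")
print(final)  # 400
actor.send("stop")
```

Reasoning about correctness here is local: to convince yourself `_count` is right, you only need to read `_run`, not every thread in the program.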

So what does the message-passing model mean for chip architecture? With their low inter-core latency and high bandwidth, multi-core chips are better suited to message-passing than the grids and clusters which have been the traditional havens for this model. Tilera’s TILE64 chip has a 5-layered mesh interconnect and a development environment supporting inter-core message-passing. But with small on-chip caches, the message-passing model has to fall back on shared memory (logically, the message-passing model can be mapped onto a physically shared-memory architecture, and vice versa). Off-chip shared memory is an order of magnitude slower to access than on-chip core-cache or core-core communication. The bottom line: if the message-passing model enables writing correct parallel programs, there is an incentive for chip makers to provide optimizing facilities for it in hardware.

Having talked about the two prominent models, there is a belief, and a few good reasons for it, that different programming models and a set of languages will co-exist for some time to come. Some of these reasons are the investment in research and learning, and domain-specific requirements. There have been efforts to catalog patterns of problems and solutions in parallel programming, just like the design patterns for object-oriented programming. One such effort is PPP. Basically, the goal is to evaluate proposed programming models in the light of these patterns. There are some exciting times to look forward to.


About the author: Rajesh Karmani is a graduate student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is a recipient of the Sohaib and Sarah Abbasi Fellowship. His current area of interest is programming languages and software engineering. Previously, he has worked in wireless sensor networks and multi-agent systems.

Disclaimer: The views expressed in the article are the author’s personal views and do not represent those of TG Daily, the University of Illinois, or the UPCRC at the University of Illinois.

Comments
This thread is closed for comments
  • DXRick, May 22, 2008 4:30 AM
    I don't get the hoopla about multi-processing. Every Windows programming book I have read on Win32, MFC, and .NET has covered the basic concepts of creating threads and thread management. There are also books devoted to the concept, for those ready to go beyond the beginner stage.

    Any developer that does not learn the concepts would be in dire need of a career change.

    As for this article, it is too vague to make much sense to a Windows application programmer. Maybe it would make more sense to the system programmers that wrote Windows Server 2008. Microsoft has built powerful tools into .NET that are easy to understand and use for multi-threading.
  • martel80, May 22, 2008 8:36 AM
    A distributed-memory system-on-a-chip, a cluster-on-a-chip? Sounds interesting... :)
  • navvara, May 22, 2008 1:10 PM
    Multicore processing needs to be handled at the OS level in such a manner that having 1, 2, 4 or 128 cores is transparent to whatever application is running on the machine. I do hope the next version of Windows can do this, as in my opinion this is the biggest change that must be implemented in today's operating systems.

    Just my 2C.
  • navvara, May 22, 2008 1:24 PM
    Could not find the edit button....

    What I meant by the above is to have something similar to DirectX but for general-purpose code, not only for rendering. With DirectX you don't need to parallelize anything, yet the rendering is performed in parallel across hundreds of vector processors. This is what I think should happen with general-purpose CPUs. This will be harder to achieve than parallelizing GPUs, as vectors are very easy to manipulate, but it can be done. And if anyone can do it, then it's Microsoft.
  • DXRick, May 22, 2008 7:37 PM
    In DirectX 10, multi-threading support is ON by default. This means that DirectX must lock memory being accessed by the application. This has a significant impact on performance.

    The hardware (Intel, AMD, & Nvidia) and software (Microsoft) folks need to work together to improve the tools used by programmers to lock/unlock memory that can be updated by more than 1 thread (concurrency issues).

    I presume that this is what Intel and MS are working on.