In my opinion, AMD has done the wrong thing with its 64-bit architecture. Because they introduced extra registers, all code gets longer, and thus slower, and we have to switch modes to run 32-bit code. I could be wrong, but doesn't this make Hyper-Threading a lot harder, or even impossible, to implement? It looks like they dug a hole for themselves...

But I won't question the usefulness of 64-bit addressing. The way I see it, this could be done by making the MMX registers available for address referencing. So no extra registers are needed and no mode switching, just a plain extension to 32-bit x86. Only when an MMX register is used in a reference would they need to add a prefix byte indicating whether the base or index register is an MMX register.

If I'm totally wrong here and my idea is even worse than AMD's, please enlighten me...
  1. AMD chips don't need Hyper-Threading. They need speed increases. They need to take their POS 1.4 GHz and make it 2.4 GHz and 3.4 GHz. Otherwise AMD is done, because by next year Intel will have 4 GHz CPUs, and a 1.4 GHz or a 2.0 GHz part is not going to compete well against a 4 GHz+ CPU.

    "Bread makes me poop!" - Special Ed

    <A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
  2. Plus, I think Prescott will go to a 1066 FSB sometime next year.
  3. The extra register array won't necessarily increase code size that much: one more bit at most for register addressing, and only for x86-64 code. As for data sizes for 64-bit code, again, you can switch to 32-bit mode (which automatically happens when you run a 32-bit OS), so it's not a problem. It's a good evolutionary move, the same as Intel has made with every extension of x86.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  4. Every AA-64 instruction needs an extra REX prefix byte. The most significant nibble is always 4, to distinguish it from the other prefix bytes. The next bit indicates 32- or 64-bit operands, and the last three bits extend the reg, index and base fields.
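The REX layout described above can be decoded with a quick sketch (Python, purely illustrative):

```python
# Decode a REX prefix byte: fixed 0100 high nibble, then the W bit
# (operand size) and the R, X, B register-extension bits.
def decode_rex(byte):
    assert byte >> 4 == 0b0100, "not a REX prefix"
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModRM r/m or SIB base field
    }

# 0x48 sets only W, i.e. "use 64-bit operands".
print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```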

    The conclusion is that the average instruction length is now 4 bytes instead of 3. So all your 64-bit programs are 33% longer, you need more cache and you need more bandwidth.

    With my idea, only the instructions that need 64-bit addressing would need a prefix byte. And they could add short instructions to facilitate operations between general purpose and MMX registers. So the code is smaller and you don't need to switch modes, which presumably is better for Hyper-Threading.

    I also have some other thoughts. With the current addressing modes, you could have a total address space of 40 GB instead of 4 GB. How? Well, the general addressing format is:

    [base + index * scale + displacement]

    Base can be 4 giga, index can be 4 giga, scale can be 8 and displacement can be 4 giga, so this is a total of 40 giga! I don't know if this is supported in the CPU, and Windows is currently limited to 2 GB, but I think this could make 32-bit programming live a bit longer...
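The arithmetic behind that 40 giga figure, sketched in Python. Note this is only the nominal span of the address components; as far as I know, the CPU actually wraps the computed effective address to 32 bits, so it doesn't buy a larger real address space:

```python
# Nominal ranges of the components of [base + index * scale + displacement].
GIGA = 1 << 30

base_range = 4 * GIGA   # 32-bit base register
index_range = 4 * GIGA  # 32-bit index register
scale_max = 8
disp_range = 4 * GIGA   # 32-bit displacement

total = base_range + index_range * scale_max + disp_range
print(total // GIGA)  # 40 "giga" -- but hardware truncates the sum to 32 bits
```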
  5. As I recall, the prefix is only necessary when switching to 64-bit mode. Once the processor is in 64-bit mode, instructions no longer require the prefix. As for instruction length, I don't think that's such a big deal. Worrying about instruction length is what caused the horror that is x86 to begin with.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  6. Uhm, ok, then explain to me how they can extend the register set without the REX byte?

    If x86 didn't have a variable instruction length, it wouldn't support any complex addressing modes, so it would be much more RISC-like. Suppose it had a fixed instruction length plus complex addressing: then we'd need at least twice the memory bandwidth of the newest processor, and there wouldn't be any bandwidth left for processing data. So instruction length does count for a CISC architecture. Furthermore, variable instruction length allows lots of extensions, which is the main reason why x86 still exists...
  7. On the one hand:

    1.2.2 64-Bit Mode 64-bit mode—a submode of long mode—supports the full range
    of 64-bit virtual-addressing and register-extension features.
    This mode is enabled by the operating system on an individual
    code-segment basis. Because 64-bit mode supports a 64-bit
    virtual-address space, it requires a new 64-bit operating system
    and tool chain. <b>Existing application binaries can run without
    recompilation in compatibility mode, under an operating
    system that runs in 64-bit mode</b>, or the applications can also be
    recompiled to run in 64-bit mode.

    On the other hand:

    Register Extensions. 64-bit mode implements register extensions
    through a new group of instruction prefixes, called REX

    Both quotes originate from <A HREF="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf" target="_new">this PDF - "AMD64 Architecture
    Programmer’s Manual Volume 1: Application Programming"</A>

    I browsed through the document a little further, and on page 63 of the PDF, page 34 of the document, I found these examples:

    Example 1: 64-bit Add:
    Before: RAX =0002_0001_8000_2201
    RBX =0002_0002_0123_3301

    <b>48 01C3</b> ADD RBX,RAX ;48 is a REX prefix for size.

    Result: RBX = 0004_0003_8123_5502

    Example 2: 32-bit Add:
    Before: RAX = 0002_0001_8000_2201
    RBX = 0002_0002_0123_3301

    <b>01C3</b> ADD EBX,EAX ;32-bit add

    Result: RBX = 0000_0000_8123_5502
    (32-bit result is zero extended)
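Those two manual examples can be checked with a quick sketch (Python, just mimicking the register arithmetic):

```python
# Values from the AMD manual examples quoted above.
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

rax = 0x0002_0001_8000_2201
rbx = 0x0002_0002_0123_3301

# Example 1: ADD RBX, RAX with the REX.W prefix (0x48) -> full 64-bit add.
add64 = (rbx + rax) & MASK64

# Example 2: ADD EBX, EAX -> 32-bit add, result zero-extended into RBX,
# which clears the upper 32 bits of the destination register.
add32 = ((rbx & MASK32) + (rax & MASK32)) & MASK32

print(f"{add64:016X}")  # 0004000381235502
print(f"{add32:016X}")  # 0000000081235502
```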

    So it seems to me, C0D1F13D, that only the specific instructions working on 64-bit operands require the REX byte. Also, if you say the bandwidth requirements would rise tremendously, consider the fact that the data have to be loaded as well: in regular x86 this would require 2 x 8 bytes (64 bits) of data plus a 2-byte instruction, 18 bytes in total, while x86-64 would require only one byte more. This is an increase of about 6%, not really what I'd call a huge increment... OK, if you do some instructions on data solely in the registers, then it would mean a 50% increase in instruction size, but I still think only about a 10% increase is to be expected on average.

    But I'll talk to you later, after you return from classes ... :wink:


    <i>Then again, that's just my opinion</i>
  8. Thanks Glenn :wink: ,

    Seems like AMD's solution is much closer to my idea than I thought. I should have read those docs a bit more carefully...

    Anyway, I still think that extending the register set this way and introducing a new mode was not the best way to go. But that's from my assembly and machine code programming point of view of course. I just hope Intel will come up with a better 64-bit x86 architecture. I'm afraid that for AMD it will turn out similarly to the 3DNow! vs. SSE battle.
  9. Intel creating a better 64-bit architecture is dumb, why would they scrap or even rehash the Itanium architecture when they already spent billions upon billions of dollars creating and researching it
  10. Just to clarify a few things.

    In 64-bit long mode, the default operand size is 32 bits and no REX prefix is needed to use the existing 8 registers. For the most part, existing code could execute in 64-bit long mode if it weren't for the difference in stack operation and a few other instructions, mostly involving the byte AH and AL register operands. There are some caveats with the use of the carry and overflow flags due to the sign extension of 32-bit operands. I can't imagine an alteration to the x86 architecture for 64-bit operation that is at once so drastic and so subtle. You sound like a fanboy when you look to Intel to reinvent a wheel that is already round.

    edit: 7 registers -> 8 registers

    Dichromatic for your viewing pleasure...
  11. I admit preferring Intel processors. But that's only because I haven't seen any really interesting processors from AMD yet (and if I do, I certainly won't stick to Intel just because). Not only does Intel have higher quality in my opinion, their instruction encoding rules are a lot cleaner. I've written my own assembler, <A HREF="http://softwire.sourceforge.net" target="_new">SoftWire</A>, and had nothing but trouble with AMD-specific instructions, which use weird things like the same prefix twice and suffixes in the immediate field. Sure, 3DNow! was cool for a while, and from a higher level you don't see any of the machine code oddities, but re-using the MMX registers makes it totally unusable for 3D rendering. SSE is just totally superior: <A HREF="http://sw-shader.sourceforge.net" target="_new">swShader</A>. Sooner or later I also expect Intel to use AMD's opcodes, in which case AMD can't keep its compatibility. It may even have to deprecate some of its specific instructions, like the 64-bit instructions, when Intel goes another route.

    They could have left a lot more doors open, or just gone for a totally new architecture like Intel did with Itanium, and focused on the server market. I know it's unfair, with Intel's monopoly position giving it the most money for research and all, but I'm not an economist or politician. In my opinion Intel just makes better processors, and all AMD can do is try to beat them in ILP. For this I wish AMD all the luck...
  12. That’s unique about your assembler. Huh, I never had the need to do any non specific assembly.

    Anyways, I checked out your code, neat stuff. You’re using a RHS? How long have you been working on this?

    Dichromatic for your viewing pleasure...
  13. Duh. I just found your comment, 4 Intense Months, not bad.

    Dichromatic for your viewing pleasure...
  14. What's a RHS?
  15. Sorry, RHS = Right-Handed System 3D coordinates. When your final transformation is made, your view volume ends up on the negative z in a right-handed system; in a left-handed system (LHS) you end up on the positive z.

    Dichromatic for your viewing pleasure...
  16. While what you say is true, you have to consider that additional registers also reduce code size, by saving pushes and moves to memory when the available GPRs are not sufficient for the operation at hand.
    Actually, I think adding more general purpose registers was the smartest thing to do, because registers are already never enough in 32-bit.
    Also, today, code size doesn't really matter. That's why nobody is coding in assembly anymore. The speed increase of 64-bit over 32-bit would probably be significant enough to justify even double the overhead in size.
    Hyper-Threading has nothing to do with the number of registers, I believe, though I don't know a lot about CPU architecture.
    AMD has a patent on its own Hyper-Threading implementation, but I think a multi-core configuration may be more likely, especially given the fact that the 64-bit CPUs will come with high-speed interconnects.
  17. Quote:
    That's why nobody is coding in assembly nomore.

    Wrong. C0d1f1ed still is. I know, he's a freak :tongue: ...

    Seriously, though. Quite a lot of people still code assembly, I think. OSes, video card drivers, game cores, scientific apps, ... I think they all require some serious manual machine-code programming. Never say 'no' when you mean 'very little' ...


    <i>Then again, that's just my opinion</i>
  18. I know. But I think even video drivers are not coded in pure assembly anymore.
    It's sad, because I myself think it's a pure, easy language with way less overhead than C++ or C# when using MASM. If you're interested, make sure you take a look at this site:
    and that you drop by here:
  19. The only thing that's still done mainly in asm today is BIOSes. Practically everything else is done in some higher-level language like C. As for code size: as the difference between processor speed and memory speed becomes greater, code size is actually becoming more and more of an issue. However, I don't think variable instruction length is the answer. More complex (but fixed-size) instructions would offer a better alternative, I think.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  20. Why does this matter? Behind the Renderer class everything is implementation dependent. One of the features I'm planning to add is zero-cost z-buffer clearing. This is possible by inverting the z coordinate and the compare mode. This can happen totally transparently to the application programmer, and technically it would switch from LHS to RHS and back every frame...
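A minimal sketch of that flip-flop idea (hypothetical state tracking in Python, not actual renderer code):

```python
# Alternate the depth convention every frame instead of clearing the buffer:
# store 1-z and compare GREATER on odd frames, z and LESS on even frames.
class DepthState:
    def __init__(self):
        self.flipped = False

    def new_frame(self):
        self.flipped = not self.flipped  # "clears" the buffer for free

    def transform_z(self, z):
        # z assumed in [0, 1]
        return 1.0 - z if self.flipped else z

    def depth_test(self, new_z, stored_z):
        # LESS compare on normal frames, GREATER on flipped frames.
        return new_z > stored_z if self.flipped else new_z < stored_z
```

Note that this relies on every z-pixel being written each frame; otherwise stale depth values from two frames ago leak through.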
  21. Doesn't matter at all, it's darn easy to switch between the two. I was just looking at your projection matrices and wasn't sure. D3D internally uses an LHS; OpenGL, I believe, uses an RHS. Although it is an unnatural transformation, the LHS seems more natural to me.

    That's a unique idea, flip-flopping the view system. However, it would only work if you have 100% z draw (i.e. all z-pixels are filled every frame).

    Dichromatic for your viewing pleasure...
  22. Who uses general purpose registers nowadays? :wink: No, really, every optimized loop for multimedia and the like uses MMX. So you have eight general purpose registers plus eight MMX registers. By splitting operations into separate loops you don't need that many registers, and you can even do software pipelining.

    My renderer takes this even further by using all eight SSE registers. It uses automatic register allocation, so it makes optimal use of the available registers and spilling is rare. I recently released <A HREF="http://softwire.sourceforge.net" target="_new">SoftWire 4.0.0</A>, which features automatic register allocation for SSE registers as well as MMX and general purpose registers (excluding ebp and esp).

    Could you please explain the "speed increase of 64bit over 32bit"? It's not like they process twice as much data, it can just handle bigger numbers. It's a bit comparable to using 32-bit or 64-bit floating point numbers. There's no speed increase when using 64-bit floating point numbers. On the contrary. You have more cache misses, need more memory bandwidth, the processor needs more interconnections, has higher port latencies, etc.

    Hyper-threading indeed has nothing to do with the number of registers, but they would have to be able to switch modes for every micro-instruction and I'm quite sure this would take more time than what we win with Hyper-threading.
  23. Isn't the bus width doubled? This would effectively nullify the performance hit caused by double-width instructions, I'd think...

    Athlon XP 1600+, MSI K7T PRO2 RU (POS), 2x256 MB CRUCIAL PC2100 CL2.5 memory, Asus V6800 DDR Delux (GF 256) video card, 6.4GB+27GB WD HD, 40GB IBM HD (all 7200RPM). My computer is an acronym
  24. Yes and no. You have to consider caching and the cache space it would take up.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  25. Recycling one of my previous posts…

    You guys must not forget that 64-bit means 64 logical operations in one instruction. One of the key features of x86 is its ability to test and retrieve bits from a register, alter, and report on their position, as found in the instructions BSF, BSR, BT, BTC, BTR, and BTS, (Bit Scan Forward, Bit Scan Reverse, Bit Test, Bit Test and Complement, Bit Test and Reset, Bit Test and Set, respectively). Math in integer is a total waste of time, as logic in floating point is just as useless. There are exceptions to these rules but haven't we all at one point in our engineering endeavors used a wrench as a hammer (no x86-64 reference intended).


    MMX/SSE(2)/3DNOW are all good at crunching a lot of data but when it comes down to parsing the results of that crunched data, they fall short. I know of no operations that allow you to find a single bit in a large bit field.
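For reference, here is roughly what BSF and BSR compute, sketched in Python (the hardware instructions also set ZF when the source is zero, which this sketch skips):

```python
# BSF: index of the lowest set bit; BSR: index of the highest set bit.
def bsf(x):
    assert x != 0, "result undefined for zero, like the hardware instruction"
    return (x & -x).bit_length() - 1  # isolate lowest set bit, take its index

def bsr(x):
    assert x != 0
    return x.bit_length() - 1

field = 0x0000_8000_0000_0100  # 64-bit bit field with bits 8 and 47 set
print(bsf(field), bsr(field))  # 8 47
```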

    Extra registers have their advantages and disadvantages. You can pass information in them, thus removing the need for stack access. However, when you switch tasks, you have more information to save.

    Dichromatic for your viewing pleasure...
  26. By "speed increase of 64-bit over 32-bit" I meant that you will be able to handle bigger numbers with less memory access, or save on floating-point initializations. Also, many FPU instructions take many cycles and don't pair well.
    By the way, SoftWire looks pretty nice. As you implemented your own assembler, does it also run source code from other assemblers?

    Also, if you claim that neither the additional registers nor the 64-bit extension deliver a speed increase, then I wonder what does, as basically all Opteron reviews show it's 20-30% faster running the same software in 64-bit than in 32-bit?
  27. For most P4/Athlon processors, properly pipelined non-MMX/SSE(2)/3DNow! instructions achieve:

    Integer add: 1/2 to 1 cycle.
    Integer multiply: 7-11 cycles.

    Floating-point add: 1 cycle.
    Floating-point multiply: 1 cycle.
    Theoretically the Athlon can do one add and one multiply per cycle; the P4 can do either an add or a multiply per cycle.

    Divides and transcendentals: 30+ cycles for both integer and floating point, with floating point considerably faster.

    Dichromatic for your viewing pleasure...
  28. Bit scans and other bit operations are quite rare, so using 64-bit instead of 32-bit isn't going to speed up overall performance much.

    Math in integer is totally not a waste of time. All MMX code, and this includes your MP3 player, AVI/JPEG player and many 2D graphics routines, is based on integer math. You don't need much floating-point arithmetic for these applications.

    I don't understand what you mean by "parsing the results of that crunched data". SIMD instructions are targeted at doing the same operation on big arrays, not a whole algorithm on a single element. Structure of Arrays data organisation is preferred over Array of Structures.
  29. Quote:
    AMD chips don't need Hyper-Threading. They need speed increases. They need to take their POS 1.4 GHz and make it 2.4 GHz and 3.4 GHz. Otherwise AMD is done, because by next year Intel will have 4 GHz CPUs, and a 1.4 GHz or a 2.0 GHz part is not going to compete well against a 4 GHz+ CPU.

    The thing is, they just don't have a team to make an implementation and test it. All the big CPU makers have been testing SMT solutions for years: Intel, IBM, Alpha, HP, Sun. That covers about 99% of CPU revenue. Intel and Alpha are the two for which you can argue about who invented the theory: Intel in the early '90s but with papers only years later, or Alpha, with papers I have from '93-'94.

    According to Alpha, a single 100-million-transistor SMT CPU should be faster than 2 x 50MT or 2 x 75MT CPUs, on single-threaded as well as multi-threaded workloads, in every discipline.

    [-peep-] french
  30. I'd say the K7/K8 design would benefit a lot from SMT. There are a lot of execution resources on the Athlon, and most of the time they probably remain idle. Having the ability to process instructions from multiple threads may actually max out its decoding resources and finally create a bottleneck at the 9-way, 9-issue execution back-end.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  31. Maybe you're right... How easy/hard is it to implement SMT, after all? Just out of curiosity... I don't know exactly.
  32. Depends on the type of implementation you want. Certainly it'd be easy enough to double certain resources like the program counter and just split the caches/queues in half. However, dynamically sharing resources through the use of tags on each instruction/data... that's more complicated.

    "We are Microsoft, resistance is futile." - Bill Gates, 2015.
  33. I wonder if AMD would be capable of spending the resources to develop such a technology, with all their R&D focused on x86-64... It's a big bet, mind you... I don't think AMD would include SMT in their plans right now, but I don't exactly know how complicated that is. That's what I meant...