OK, you've all heard of the quantum computer simulation I've been writing in MATLAB. MATLAB is nice and good, but once you get into complex calculations it takes a looooong time, whereas writing the program in C++ is like a thousand times more efficient. Lucky for me, MATLAB has this wonderful command, mcc -p m-file_name_here, that converts the high-level code you write in the MATLAB language into a C++ executable! This is very nice; however, just as with MATLAB itself, it doesn't let me go to a 16-bit system, because I get the following error:
"Exception! File: handler.cpp, Line:73
Product of dimensions is greater than maximum integer."
OK, obviously this can be fixed by using a long integer or whatever (I'm learning C++ right now). How can I change this? Because handler.cpp line 73 is:
"_PNH __cdecl _query_new_handler ("
What if you had admin rights to life?
He he... umm... so what's wrong? Mere child's play.
__________________________________________________
<b>Speed kills!!!
Drive a Honduh
</b>
Hmmm. I've got an idea
10 PRINT "I am a programming GOD!"
20 GOTO 10
30 END
<b><font color=red>He who bargains with a dragon is either a fool or a corpse.</font color=red></b>
In C++ the long integer is just called <b> long </b>.
In a world without <font color=red>walls </font color=red>or <font color=green>fences </font color=green>, what use have we for <font color=red>Windows </font color=red>or <font color=green>Gates.</font color=green>
The only problem is a god who uses BASIC is not much of a god.
LOL
You're obviously thinking of visual basic.
My basic is better.
Besides, I just realised my program contains 33% bloat!!! Line 30 isn't needed at all! hehehe
But would you happen to know where exactly handler references this? Because, as I said, that line of code calls something; just what is my question.
I'll have to dig around a little, but I should be able to find it.
Cool, thanks. Basically I'm looking to change the max size allotted for the integer.
I am not positive about this, so back up any files you change. In the header io.h there are multiple declarations for int that set it to max 64 bit. If you were to change every instance of that to however many bits you need, it will hypothetically work.
Hope it works for you.
It looks to me like you're declaring some sort of runtime memory that is huge. The _set_new_handler/_query_new_handler functions are used to override exceptions caused by the new operator. I don't know what the requirements of your algorithm are, but I would look for something trying to allocate large amounts of memory. Make sure you are setting all your variables correctly. Have you tried stepping through the code with the debugger, or tracing the stack back from the exception to see the offending line of code?
Complicated proofs are proofs of confusion.
On the x86 platform there is no difference between long and int; they both have the same size: 32 bits.
Try to use a double. You will lose some accuracy. There are probably better solutions, but this is a fast one.
Obs: If I were you I would not change the system header "io.h".
Razvan
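To illustrate the point about 32-bit int and long: a 2^16 x 2^16 matrix has 2^32 elements, which is more than a signed 32-bit integer can hold, and that alone would be enough to trigger a "product of dimensions" complaint. The sketch below is only an illustration; the helper names are made up here and nothing in it is taken from the MATLAB runtime or handler.cpp.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: number of elements in a rows x cols matrix,
// computed in 64 bits so the product itself cannot wrap for the
// sizes discussed in this thread.
std::int64_t element_count(std::int64_t rows, std::int64_t cols) {
    return rows * cols;
}

// Hypothetical helper: true if the element count still fits in a
// signed 32-bit integer (the size of both int and long on 32-bit x86).
bool product_fits_in_int32(std::int64_t rows, std::int64_t cols) {
    return element_count(rows, cols) <=
           std::numeric_limits<std::int32_t>::max();
}
```

With these, product_fits_in_int32(8192, 8192) holds (2^26 elements), while product_fits_in_int32(65536, 65536) does not (2^32 elements).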
Why don't you recommend changing it? It may work. Is there something I'm overlooking?
Changing io.h is not going to fix your problem and it has necessary definitions that a lot of other system files use. I'm telling you it is a failure in a memory allocation.
Interesting you say that actually, because I let my comp run a 13 bit system (matrix sizes of 2^16x2^16) and it gave just that error after bunch of processing- failure to allocate memory, and pointed to the same line. So, what can be done about that?
class HelloWorld{
public static void main (String[] args) {
System.out.println("Hello, World!" );
}
}
there you go, in java
<font color=blue>Unofficial Forum Cop</font color=blue>
2^16 x 2^16 = 65536 x 65536 = 4,294,967,296 = 2^32
and if you're using integers multiply that by 4 bytes an integer and you get
17,179,869,184 = 2^34
which is a huge amount of memory.
Even bitwise operations of 2^16 x 2^16 is going to create 2^29 = 536,870,912 bytes of memory.
I would say you would have to redesign your algorithm. Why do you need such monstrous matrices?
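The arithmetic above can be checked with a couple of lines of C++. This is just a sketch; the function names are invented for illustration, and the 8-byte case assumes the matrix holds doubles.

```cpp
#include <cstdint>

// Bytes needed for a dense square matrix with 2^log2_dim rows and columns.
// Hypothetical helper; assumes log2_dim <= 28 so the result fits in 64 bits
// for element sizes up to 8 bytes.
std::uint64_t dense_matrix_bytes(unsigned log2_dim,
                                 std::uint64_t bytes_per_element) {
    std::uint64_t elements = std::uint64_t{1} << (2 * log2_dim);
    return elements * bytes_per_element;
}

// Same matrix packed at one bit per element (assumes log2_dim >= 2).
std::uint64_t bit_matrix_bytes(unsigned log2_dim) {
    return (std::uint64_t{1} << (2 * log2_dim)) / 8;
}
```

dense_matrix_bytes(16, 4) gives 17,179,869,184 and bit_matrix_bytes(16) gives 536,870,912, matching the figures in the post. For comparison, dense_matrix_bytes(13, 8) is 536,870,912 bytes, so a 2^13 square matrix of doubles is 512 MB.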
| Quote : I let my comp run a 13 bit system (matrix sizes of 2^16x2^16) and it gave just that error after bunch of processing- failure to allocate memory |
ROTFLOL!
Think about it a little... basically, a matrix like that has over <i>4 billion</i> elements; 2^32 to be exact. To my understanding, Matlab uses floating point, so the C equivalent of such a matrix would be:
double foo[65536][65536];
Since one double float is 8 bytes, you're trying to allocate 2^3 * 2^32 = 2^5 * 2^30 = 32GB of memory per matrix. You sure your swapfile is up to the task?
<font color=red><b><i>You want WHAT on the [-peep-] CEILING?!</i></b></font color=red> -Michelangelo
The algorithm can't be made more efficient, sadly; I have taken shortcuts in certain areas just to allow for faster computation, but what happens (and this is why a Turing machine is NOT a universal computer) is that in trying to simulate a quantum computer, all possible states MUST be calculated, so a 13-bit system has 2^13 of them. Then matrix transformations need to be done, and those are 2^13x2^13. So that's the issue. My comp can handle it; I'll prolly be getting 1-2GB RAM soon, and then it can also use the HDD. The issue is that I need it to realize it's allowed to do that. I know the matrix sizes are insane, but that's how it is, no way out of it.
Oh yeah, and I'm no longer using matlab for it, I converted it to an exe in C++.
If you really have to run matrices of such a size, you have to think of it in a computer-friendly way. Basically you have to chunk your memory. Let's do a 2^16 x 2^16 matrix. Instead of thinking of it as a chunk of 2^32 bytes of memory, think of it as 2^16 pointers to chunks of 2^16 bytes of memory. So instead of declaring
char matrix[65536][65536];
you would declare
char *matrix[65536]; /* an array of pointers to chunks of memory */
then initialize your arrays like so
for (int j = 0; j < 65536; j++)
{
    matrix[j] = new char[65536]; /* a doable size; ints would be 4 times as large */
}
Do some work with matrix[j][k]. Remember that jumping between rows (changing j) will be slow, while walking along the columns of one row (changing k) will be relatively fast. When you're done, free each chunk:
for (int j = 0; j < 65536; j++)
{
    delete [] matrix[j];
}
You could write class wrappers around the whole thing but the effect would be the same. Just make sure you have an extra 34gigs on your drive and a whole lot of time.
Darn markup language won't let you write "bracket i bracket" it thinks it means italic.
Edit: added "chunks of"
Edited by Schmide on 10/10/02 07:51 PM.
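For anyone who wants the class wrapper mentioned above, here is one possible shape for it. It is only a sketch: the name ChunkedMatrix and its members are invented for illustration, it needs C++14 for std::make_unique, and it keeps the same one-block-per-row layout as the raw-pointer version. The smart pointers simply take over the delete [] loop.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical wrapper: each row is its own heap block, so no single
// allocation is larger than one row.
class ChunkedMatrix {
public:
    ChunkedMatrix(std::size_t rows, std::size_t cols) : cols_(cols) {
        rows_.reserve(rows);
        for (std::size_t j = 0; j < rows; ++j) {
            // make_unique<char[]> value-initialises, so elements start at 0.
            rows_.push_back(std::make_unique<char[]>(cols));
        }
    }

    // Pointer to the start of row j, for fast walks along one row.
    char* row(std::size_t j) { return rows_[j].get(); }

    // Element access equivalent to matrix[j][k] in the raw version.
    char& at(std::size_t j, std::size_t k) { return rows_[j][k]; }

    std::size_t rows() const { return rows_.size(); }
    std::size_t cols() const { return cols_; }

private:
    std::size_t cols_;
    std::vector<std::unique_ptr<char[]>> rows_;  // freed automatically
};
```

Usage would look like ChunkedMatrix m(65536, 65536); m.at(j, k) = 1; with the same memory cost as before. Try it with small dimensions first.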
Oh yeah, major optimization.
When doing column access on a row, use a temporary pointer like so:
char *temprow=matrix[j];
then access temprow[k] for all elements in row j. This will avoid the recalculation and access of pointer arrays and could double your speed.
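A small sketch of the temporary-pointer idea, using a made-up function sum_row that adds up one row of a matrix stored as an array of row pointers. How much it helps is the poster's estimate; an optimizing compiler may already hoist the row lookup out of the loop by itself.

```cpp
#include <cstddef>

// Hypothetical example: sum every element of row j.
// The row pointer is fetched once, outside the inner loop, instead of
// re-reading matrix[j] for every k.
long sum_row(char* const* matrix, std::size_t j, std::size_t cols) {
    const char* temprow = matrix[j];
    long total = 0;
    for (std::size_t k = 0; k < cols; ++k) {
        total += temprow[k];
    }
    return total;
}
```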
are you trying to program in C++ now?
that's fine but a bit difficult, don't you think?
i've plugged my fingers into ?
? (or/&) <b>under</b> <b>?</b>
<b>?</b> that Greek agora ... dunno what happen ... that works?!?
Still, it's a huge job even for the fastest PC. Even if there weren't any available RAM constraints, I shudder to think about e.g. multiplying two 2^16 square matrices. Correct me if I'm wrong, but you'd have to read 2 * 2^16 = 2^17 source elements <i>per result element</i>. Since there are 2^32 result elements, that's 2^49 total elements read. If one element is 2^3 bytes, there's 2^52 bytes to be read from memory. That's 2^22 GB.
Assuming that the memory has 3.2GB/s effective throughput, one simple matrix multiplication would take about (2^22 / 3.2)s = 15 days. If the formula is hideously complicated and has so many matrices that virtual memory really kicks in, better get hot-swappable RAID0+1 too. You'll probably go through several broken HD's before the computation is complete.
Then again, it's not nearly that bad if you take the time to write L2 cache-friendly code. But that'll be a tough job if just one row or column is 2^3 * 2^16 = 2^19 bytes = 512kB.
2^13 square matrices are way easier, it's only 2^3 * 2^13 = 2^16 = 64kB per row or column. Still, it would be a good idea to try to make sure you compile for 3DNow or SSE. Suitably placed L2 cache line prefetches and proper usage of SIMD extensions could make a huge difference.
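The bandwidth-limited estimate above can be reproduced with the sketch below. The function names are invented for illustration, and like the post it treats 1 GB as 2^30 bytes and assumes a naive multiplication with no cache reuse at all.

```cpp
#include <cmath>

// Bytes read by a naive n x n multiply with n = 2^log2_n and elements of
// 2^log2_bytes_per_element bytes: 2n source elements per result element,
// times n^2 result elements.
double naive_multiply_bytes_read(int log2_n, int log2_bytes_per_element) {
    return std::ldexp(1.0, 1 + 3 * log2_n + log2_bytes_per_element);
}

// Seconds needed to move that many bytes at a given throughput,
// with 1 GB taken as 2^30 bytes as in the thread.
double seconds_at_bandwidth(double bytes, double gb_per_second) {
    return bytes / (gb_per_second * std::ldexp(1.0, 30));
}
```

For log2_n = 16 and 8-byte elements this gives 2^52 bytes, and at 3.2 GB/s about 1,310,720 seconds, a bit over 15 days. For log2_n = 13 it is 2^43 bytes, roughly 43 minutes at the same rate.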
In response to:
_______________________________________________________
Changing io.h is not going to fix your problem and it
has necessary definitions that a lot of other system
files use. I'm telling you it is a failure in a memory
allocation.
_______________________________________________________
Thanks for catching that. I was a bit low on sleep when I posted about editing io.h.
Yeah, your math seems correct. Ideally you would transpose your 2nd matrix and your result matrix so that the work is dominated by row access, not column access. Even if we assume memory and drives are not a factor, you're talking 2^16-1 adds + 2^16 multiplies =~ 2^17 flops per element, or 2^33 (~8 billion) flops per row, and we have 2^16 rows. Assuming at best we can do 2 flops a tick, we're talking (2^33*2^16)/2 = 2^49/2 = 2^48 ticks on a processor. I love nice powers of 2. Assume a 2 GHz processor (roughly 2^31 Hz). So we have 2^48/2^31 = 2^17 = 131,072 seconds of processing. That's 2,184.53 minutes or 36.4 hours, or a day and a half. I think your numbers are closer to the truth and still generous.
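A minimal sketch of the transpose trick, assuming small in-memory matrices stored as vectors of rows. The names Matrix, transpose and multiply_with_transposed are made up for this example, and only the second matrix is transposed here, not the result.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Returns the transpose of m (assumes all rows have the same length).
Matrix transpose(const Matrix& m) {
    if (m.empty()) return {};
    Matrix t(m[0].size(), std::vector<double>(m.size()));
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            t[j][i] = m[i][j];
    return t;
}

// Computes A * B given A and the transpose of B, so the inner loop
// walks along one row of each operand instead of down a column of B.
Matrix multiply_with_transposed(const Matrix& a, const Matrix& b_transposed) {
    Matrix c(a.size(), std::vector<double>(b_transposed.size(), 0.0));
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::vector<double>& row_a = a[i];
        for (std::size_t j = 0; j < b_transposed.size(); ++j) {
            const std::vector<double>& row_bt = b_transposed[j];
            double sum = 0.0;
            for (std::size_t k = 0; k < row_a.size(); ++k)
                sum += row_a[k] * row_bt[k];
            c[i][j] = sum;
        }
    }
    return c;
}
```

The flop count is unchanged (still about 2n per result element); the gain, if any, comes purely from friendlier memory access.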
Well, your estimate probably serves as the ultimate ideal case and mine as an example of what could be accomplished with ideally prearranged datasets and über smart cache-friendly programming.
Assuming that 1/8 of the 2^16 row and the 2^16 transposed column fit nicely to 128kB, theoretically the CPU could be operating solely from the L1 and L2 cache for ~2^14 flops. That would be 2^13 ticks with 2 flops per tick assumption. Unrolling the loops so that the next parts of the row and column to be processed are prefetched more or less in parallel to the other half of the 256kB L2 cache, there's a small chance of keeping the CPU fully utilized.
Then again, the required external bandwidth for a 2GHz (2^31 Hz) processor would be 2^(-3)MB * 2^(31-13) Hz = 1MB * 2^15 Hz = 1GB * 2^5 Hz = 32GB/s. Unfortunately, that kind of memory bandwidth simply isn't available.
Moral of the story in this 2^16 square matrix multiplication example is the fact that when you have 256kB/512kB/whatever L2 cache and 32GB datasets, the best the cache can do is to keep the main memory operating at peak bandwidth. Conveniently forgetting the need for swapping for a while, IMO main memory bandwidth is <b>the</b> bottleneck. Or the FSB in case of Athlons. I'm out of touch with CPUs, but I presume it's still 2x8x133MHz = 2.1GB/s with them?
32GB dataset (2^16 square matrix at double precision) spells swapping. If you want performance, probably the best bet is to use temporary files for datasets and break the datasets into manageable chunks so that actual swapping doesn't occur. That way, you get to control the what, when and how of file/disk operations.
Astonishingly enough, in this particular matrix multiplication case disk speed isn't the bottleneck. Assuming average 32MB/s read rate and 16MB/s write rate for large sequential file accesses, "swapping" a 32GB dataset to/from files in manageable chunks would only take (32768MB : 16MB/s) + (32768MB : 32MB/s) = 2048s (write) + 1024s (read) = 3072s. That's less than an hour. In our hypothetical cases, multiplying two 2^16 square matrices in memory only would've taken either 1.5 days (2GHz CPU limited) or 15 days (3.2GB/s memory bandwidth limited).
Factor in some less-than-optimal coding/compiling, and you're easily talking a month per multiplication. What's an hour spent on temp file operations there? Think files and think NTFS, since FAT32 doesn't allow >4GB files.
Now let's calculate for a 200 MHz processor w/ EDO
Count me out. Do it yourself if you're interested.
Assuming that you have a Pentium-class processor and 60 ns EDO RAM, you will get about a 60 MB/s integer / 80 MB/s float transfer rate.
So by Napoleon's equation... 2^22 GB / 0.08 GB/s = 52,428,800 seconds = 873,813.3 minutes = 14,563.5 hours = 606.8 days = 1.66 years.
By my equation it is a bit easier. Since I used a 2 GHz processor as my baseline, you simply divide by 10 to get a 200 MHz processor. So... 2^48/2^21 = 2^27 = 134,217,728 seconds = 2,236,962.1 minutes = 37,282.7 hours = 1,553.4 days = 4.25 years.
So your bottleneck by these equations is your processor. Not like I'd wait more than a year for either estimate.
And those years are assuming you do nothing else with your computer in that time and that the computer doesn't crash.
Just think only two minutes left to go and the power goes out, "Who unplugged my UPS?!?!?!" Sister, "I needed to use it for my hair dryer."
Oh I forgot, Pentium class processors can only perform 1 fp operation a tick max. So my equation would actually yield 8.5 years.
Umm. doesn't quite jibe. If you downgraded from 2000MHz to 200MHz and went from 2 flops to 1 flop per tick in the process, wouldn't it be:
1.5days * 2 * 10 = 30days = 1 month?
Schmide, that 1.5 days was from your initial calculation for a 2GHz processor with 2 flops per tick. Anyway, didn't remember that EDO was that slooow (80MB/s). Seems to me that "serious" scientific calculations have been limited mostly by the memory bandwidth available at the time, ever since the i386DX.
*Giggle* the very first of them operated at 16MHz if I remember correctly.
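The CPU-bound side of this back-and-forth boils down to flops divided by (clock rate x flops per tick). Here is a sketch for re-running the numbers; the function names are invented, and like the thread it approximates 2 GHz as 2^31 Hz.

```cpp
#include <cmath>

// Approximate flop count for a naive n x n multiply with n = 2^log2_n:
// about 2n flops per result element, times n^2 result elements.
double multiply_flops(int log2_n) {
    return std::ldexp(1.0, 3 * log2_n + 1);
}

// Seconds of pure computation at a given clock rate and flops per tick,
// ignoring memory and disk entirely.
double compute_seconds(double flops, double clock_hz, double flops_per_tick) {
    return flops / (clock_hz * flops_per_tick);
}
```

compute_seconds(multiply_flops(16), std::ldexp(1.0, 31), 2.0) gives 131,072 seconds (about 1.5 days). Dividing the clock by 10 and dropping to 1 flop per tick gives about 2,621,440 seconds, roughly 30 days, which agrees with the 1.5 days * 2 * 10 check above.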
Oops, that would be a 2 MHz processor. Way back in the '70s.
Just to straighten out my calculations...
It depends on your processor and chipset. For floating-point transfers, the Pentium Pro made it up to 106 MB/s with 60 ns EDO RAM. The 440LX with PC66 hit 209 MB/s. The 440BX with PC100 was 333 MB/s. PC133 ranged from 400-475 MB/s, DDR brands run 600-800+ MB/s and Rambus runs 1.4 GB/s+.
The exact calculations equal out to 2,684,354.56 seconds = 44,739.2 minutes = 745.7 hours = 31 days... ideally.
This is why I'm getting 8 proc hammer 64-bit system w/ QDR (hoping w/ dual channel to get 72GB/s).
I wish I had the money for a dual processor system. But 8 that is insane.
72GB/s memory bandwidth? Please supply links.
Since he hasn't responded, I'm guessing he is assuming 400 MHz quad data rate dual-channel controllers on each Hammer. So QDR at 400 MHz is going to put out 4.8 GB/s x dual channel = 9.6 GB/s * 8 = 76.8 ~ 72. It's all crap, since the sum of all links does not equal your total throughput anyway. Your throughput is only equal to your weakest link.
No no (didn't respond earlier because I was out). Actually, what happens is that they developed QDR, and then using HyperTransport and something or other which escapes me right now, you can attain 36GB/s; since that can be used with dual channel, it'll be 72.
You're wrong. First of all, there is no QDR RAM, and even if there was (the infamous DDR II has yet to be out in 2004 with TWICE the bandwidth), it'd be quite rare. The only thing close to it is QBM DDR, which is quite nice, though I wonder if it'll be successful. VIA will need to push a chipset soon for that, then SiS, so that QBM DDR becomes mainstream to be purchased. QBM is indeed like dual-channel DDR, only that it uses plain DDR sticks with clock shifting inside them (90°). I had a topic a while ago in the CPU forum and RAM forum; search for it or go on Anandtech and find it.
Besides that, IIRC and if I know what I am saying, the Hammers use NUMA, and that supposedly is one processor one memory channel, they don't all necessarily share it to total the theoretical 72GB which already isn't true, but rather 43.2GB/sec (PC2700 Dual Channel*8).
Again, I am not sure if this is how NUMA works, and whether you really can't use all the bandwidth at once. But I agree with Schmide, you definitely won't get that theoretical figure without a LOT of variable elimination.
--
What made you choose your THG Community username/nickname? <A HREF="http://forumz.tomshardware.com/community/modules.php?name=Forums&file=viewtopic&p=19957#19957" target="_new">Tell here!</A>
Why do RAM bandwidth tests always show much less bandwidth than the actual real one? RDRAM RIMM4200 shows about 3.8GB, or even less, far from 4.2GB.
For one, DRAM needs to be refreshed periodically, which might have an influence. Don't know how big, though. There are probably lots of other factors too.
Another thing is, the benchmark runs on top of some OS. There's always some background activity going on (timer/RTC interrupts etc). I suppose they "steal" some of the available bandwidth.
Edited by Napoleon on 10/12/02 08:16 AM.
I'm no expert on RAM. However, benchmarks are designed to be real-world benchmarks. If you think about it, when memory access is random, it has to go through a CAS to RAS cycle every time it loads a cache line. So every time access is to a random area, on a CAS 2 module you waste 2 cycles. If you waste 2 cycles every time you move 18 cycles, you get 90% efficiency. So at 90% efficiency of 4.2 GB/s you get 3.8 GB/s. It's a tough one, kind of like the megahertz myth.
I've read these <A HREF="http://arstechnica.com" target="_new">arstechnica</A> articles twice and I still don't fully understand everything.
<A HREF="http://arstechnica.com/paedia/r/ram_guide/ram_guide.part1-1.html" target="_new">Part I: RAM and SRAM Basics</A>
<A HREF="http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html" target="_new">Part II:Asynchronous and Synchronous DRAM</A>
<A HREF="http://arstechnica.com/paedia/r/ram_guide/ram_guide.part3-1.html" target="_new">Part III: DDR DRAM and RAMBUS</A>
Heheh, that is the exact article I was reading on a couple months ago, which I hadn't finished reading. At first it was graspable, but in the third part after CAS timing, it becomes mixed up, I get mixed with banks and such. It's rare to get such because Ars is one of the most comprehensive tech websites to learn about chip architectures EVER, IMO.
In any case, do you think I have something right of my explanation to flame about the Hammer's memory system?! Anytime now I could get slapped with a reality check, might as well tell me now if you know!
Ya know, now that I think of it, perhaps this relates to the CAS setting, so that when you use CAS 2 rather than 2.5, your bandwidth in these tests rises closer to the theoretical one?
Someone should test my theory. (kinda like reducing branch mispredicts, therefore less bandwidth needed, this case is reverse though)
OK. I re-reread the first part of the article. Dang, I hate white-on-black text; after the first 30 minutes you start to see stripes on the walls. Anyway, with regular SDRAM you have whole-number CAS latencies (e.g. 1, 2, 3). With DDR you can have half latencies like 2.5. As I understand it, SDRAM generally bursts data after the row and column have been programmed, and this row and column programming time is the CAS latency. The burst count can be 1, 2, 4, 8, or full page. A full page can be 512 or 1024 ticks of memory transfers depending on chip type. So if you're moving a full page, the CAS latency is hardly a factor and it should run at nearly full speed: ~99%+ efficiency (efficiency = n/(n+CAS) for n = 256, 512, ...). With random access the best you can hope for is 80% efficiency (e.g. CAS 2 transferring 8 ticks: 8/(8+2) = 80%). The lower the burst, the lower the efficiency: 4 is 4/(4+2) = 67%, 2 is 2/(2+2) = 50% and 1 is 1/(1+2) = 33%. DDR efficiency is worse because your transfer rate is twice your clock. At full page you still get 99% efficiency. However, you still have the same tick latency with twice the transfer rate. So on an 8-transfer burst you have 8/(8+4) or 4/(4+2) = 67% efficiency; 4 is 2/(2+2) = 50%; 2 and 1 are 1/(1+2) = 33%. Surprisingly, with DDR CAS 2.5 you only lose a half cycle, and because you start on a down tick and end on a down tick you can start your next CAS programming on the next cycle. So on repeated transfer bursts you can still get the same efficiency as CAS 2. However, the CPU still receives its data a half tick later.
I still can't explain RAMBUS and I have certain doubts that I got SDRAM right in the first place.
Oh man, I'm going to read those articles some time and I'm sure they will make me completely mad... maybe if they were in Dutch it would be easier to understand. But it sounds interesting, so I'm gonna try it if I have the time.
*Advertisement*
<b><A HREF="http://www.angelfire.com/dbz/dewrede/enterstrips.html" target="_new">Geert's Comics</A></b>
excuse me dunno much bout wat u'rv been writing about but i was wondering if ya cud help!
i keep getting runtime error and i dont know why
it says
line:0
line:480 when i click on a field
and ive googled it but cant find any answers i had google toolbar got rid of that but still dont work
any help wud be appreciated thanx
Are you freakin mad resurrecting this 4 year stale thread? This thread is so dead, even Wingy won't touch it anymore.

