Storage/PC solution for scientific research


We have large data sets, organized in many small files (~1-5 MB each).
The files are organized that way to provide flexibility in analysing the data: depending on the requirements, the files are read one by one, and a certain calculation is applied to the loaded data (for example, an FFT in MATLAB).

We wanted to have a couple of strong computers connected to a NAS so the data and calculations would be accessible over the network.
We tested the Synology DS1513+ and were disappointed: the read performance was about 20% of the server's declared capability (over a 1G network). We compared this to reading directly from a USB3 disk and got about the same results.

We are now thinking about buying a strong PC with a RAID controller, hoping for better performance. What do you think?
Will the RAID controller consume too much CPU power? Will this solution provide better read performance?
Is there a better solution for us?
The budget is limited, and we need about 15 TB for the data. Performance is a priority over backup/parity...

Thanks a lot in advance!
  1. Some of this depends upon the type of software that you are utilizing and whether or not the data has to be read or accessed on the server itself, or if basically the data is copied from the server storage onto the local workstation for the actual computation. If any computation work is being done on the server side, then yes, you need to be looking into an actual server solution which will offer you CPU horsepower.

    But even if that is not the case, I doubt that a simple NAS is going to give you the throughput that you are talking about. I'm not an expert with scientific computer systems, but this is pretty similar to a database system which is very disk and CPU intensive. Utilizing high-speed enterprise drives (SAS or better yet SSDs depending upon your budget and amount of overall data) in a high-performance RAID (such as RAID 10) is going to give you the greatest throughput to work with. However, you're quickly going to be limited by the throughput of your network (1 gig-E) so this makes the situation a little more complicated.

    Basically, the maximum throughput you are going to get through a gigabit network is approximately 100 MB/s, which is also about the average throughput of a standard SATA hard drive today anyway. If up to 100 MB/s is plenty of network throughput but you're still seeing limitations, then you may be running into bottlenecks with your CPU or your memory, depending again on how your software and data work. If you need more than 100 MB/s of throughput from your central data share to your end workstations, then you're going to have to look into something better in your actual network infrastructure, such as link aggregation or 10 Gigabit Ethernet.
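    To put rough numbers on that: a gigabit link is 125 MB/s before protocol overhead, and TCP/SMB/NFS overhead typically leaves somewhere around 100-115 MB/s usable. A quick back-of-the-envelope calculation (the 110 MB/s figure is an assumed mid-range estimate; the 1-5 MB file size comes from your description):

```python
# Back-of-the-envelope gigabit Ethernet throughput numbers.
link_bits_per_s = 1_000_000_000            # 1 Gb/s raw line rate
raw_mb_per_s = link_bits_per_s / 8 / 1e6   # 125 MB/s before overhead
usable_mb_per_s = 110                      # assumed: typical after TCP/SMB overhead

file_mb = 3                                # mid-range of the 1-5 MB files
files_per_s = usable_mb_per_s / file_mb
print(f"raw: {raw_mb_per_s:.0f} MB/s, roughly {files_per_s:.0f} files/s at {file_mb} MB each")
```

    In other words, even a perfectly tuned gigabit network caps you at a few dozen of these files per second.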
  2. Hi choucove,

    Thanks for your detailed answer!
    As I said before, we have many small files (~1-5 MB each), which are read one after the other; each time a file is opened, some calculation is done on the data (using MATLAB; the files are numeric, saved as *.mat files).

    You talk about 100 MB/s; however, we found that in all cases (reading from the NAS, a local disk, or a USB3 external drive) the speed is less than 50 MB/s most of the time.
    So are you saying this is due to the CPU/memory being the bottleneck?

    So if we buy a strong PC, set up RAID 5 (is RAID 10 much faster?) with SATA3 drives, and perform these calculations locally, will we get at least those 100 MB/s?

    If it is not too much to ask for your opinion about these specs for such a computer:

    MB: Asus Z87 Deluxe
    CPU: Intel I7-4770 3.40GHz CPU
    Memory: KINGSTON DDR3 1600MHZ 4GB X4
    HD1: Sandisk 128GB SATA3 SSD
    HD2: Seagate 4TB NAS class hdd X6 (RAID5)
    PSU: ANTEC TPQ-850

    I understood that link aggregation is an expensive solution, because we'd need an expensive router...

    Thanks for helping, it is much appreciated!
  3. Best answer
    Again, I don't know the exact specifics of your software, or what operations and workload are being pushed to which devices. Before I could really conclude what the issue is, and whether the upgrade is worth it, I'd have to know exactly what hard drives and what sort of RAID array your NAS is using, and you'd have to watch the resource utilization monitors on the NAS while performing your operations to see if they make a big hit on the unit's CPU and memory. Of course a dedicated server system will offer greater overall CPU and memory performance compared to a NAS, but whether it's actually worth the added expense is the question.
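    One quick sanity check, independent of MATLAB, is to time a raw sequential read of a large file from whatever storage you're testing. A rough sketch in Python (the path and file size here are placeholders; for a realistic number, point it at a big, *cold* file on the share, since a freshly written file will be served from the OS cache):

```python
import time

path = "testfile.bin"   # placeholder: use a large file on the storage under test

# Create a ~256 MB local test file so this sketch is self-contained.
with open(path, "wb") as f:
    f.write(b"\0" * (256 * 1024 * 1024))

start = time.perf_counter()
total = 0
with open(path, "rb") as f:
    while chunk := f.read(8 * 1024 * 1024):   # read in 8 MB chunks
        total += len(chunk)
elapsed = time.perf_counter() - start
print(f"{total / 1e6 / elapsed:.0f} MB/s over {total / 1e6:.0f} MB")
```

    If this number is also stuck around 50 MB/s, the bottleneck is in the storage or network path, not in MATLAB.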

    As for some details: RAID 5 has some speed advantages over a single hard drive, but it can also have a performance disadvantage. With very large drives, it's also quite likely that you will eventually experience a drive failure, and the rebuild of the replacement drive takes so long that you risk losing the whole array to an unrecoverable read error or even a second drive failure during the rebuild. RAID 5 is not really that great for large-capacity arrays. RAID 10 offers greater overall throughput and performance, and doesn't suffer from a terribly long rebuild time. Yes, you don't get as much storage capacity, which is the trade-off, but it is much more recommended for database-like workloads due to its improved reliability and throughput.
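    To put a number on that rebuild risk (a rough estimate, assuming the commonly quoted one-error-per-10^14-bits unrecoverable read error rate for consumer drives; enterprise drives are usually rated an order of magnitude better, and real-world rates vary):

```python
# Rebuilding a RAID 5 of six 4 TB drives means reading all five
# surviving drives (~20 TB) without a single unrecoverable read error.
ure_per_bit = 1e-14                  # assumed consumer-class URE rate
bits_to_read = 5 * 4e12 * 8          # five surviving 4 TB drives, in bits
p_clean = (1 - ure_per_bit) ** bits_to_read
print(f"chance of a clean rebuild: {p_clean:.0%}")
```

    Under that (pessimistic) assumption, a clean rebuild is far from guaranteed, which is exactly why RAID 5 gets discouraged for arrays this size.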

    If you are going to set up a new system, are you designing it to be the server ONLY, still connecting to it from other workstations, or is this new system going to store all the data AND run the operations locally? I do not usually recommend a custom-built server system, especially if you need something rock solid, because ensuring 100% compatibility between what you are doing and the hardware you choose, and supporting that entire system, falls entirely on your shoulders.

    Link aggregation or 10 GbE would require switches that support the feature, plus NICs on your end computers that also support it, but shouldn't require a different router. However, if this is a work environment with many other workstations, departments, or workers operating on a shared infrastructure, then that might not be the best solution.
  4. Thanks again,

    I'm not sure what you mean by "exact specifics of your software"; as I said, we use MATLAB, and the discussion is relevant even for the simplest kind of operation (summation).
    It goes like this: initialize a variable, open a file which contains a numeric matrix with dimensions 288x144x42, average along the first dimension, add the result to the initialized variable, and repeat for the rest of the files...
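    In (roughly) equivalent Python/NumPy terms, the loop looks like this (synthetic arrays stand in for the real *.mat files here; in MATLAB we simply load each file in turn):

```python
import numpy as np

# Synthetic stand-ins for the real *.mat files: each holds a
# 288x144x42 numeric matrix, as in the description above.
files = [np.random.rand(288, 144, 42) for _ in range(10)]

acc = np.zeros((144, 42))       # initialize the accumulator variable
for data in files:              # in reality: load one *.mat file per step
    acc += data.mean(axis=0)    # average along the first dim, accumulate
print(acc.shape)
```

    So per file it's one read of a few MB followed by a trivial reduction; the arithmetic itself is negligible.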

    We don't have a NAS yet, but we tested how one performs before purchasing, and those were the results. We tested the Synology DS1512+, in a RAID 5 config; I'm not sure which HDs it had, but it is supposed to perform much better than ~40 MB/s. I didn't look at its CPU monitor, though. The specs say it has a dual-core 2.13 GHz CPU with floating point, and 1 GB RAM. Does it sound like that might have been the bottleneck?

    I think the big question is whether the heavy I/O made the CPU the bottleneck when we tested the NAS, or whether it's something else (because the operations on the data are basic...).

    Thank you!