
Speeding file read and pattern search

October 24, 2009 11:28:35 AM

hi,

I have very big files named outfile1 to outfile4 and I want to grep "test" in all of them. I wrote the script "program1.pl" below and it works fine, but the run time is very high (10.7 s for the 4 files). If I run it over more files (>100) it runs for a long time, so I want to speed it up. Can anyone suggest how to improve the speed?

######### program1.pl ##########
#!/usr/bin/perl
use strict;
use warnings;

# Collect all matching lines into "grepfile".
open(my $grep_fh, '>', 'grepfile') or die "Cannot open grepfile: $!";
for my $i (1 .. 4) {
    my $outfile = "outfile$i";    # outfile1 .. outfile4
    open(my $in_fh, '<', $outfile) or die "Cannot open $outfile: $!";
    while (<$in_fh>) {
        if (/test/) {
            print $_;
            print $grep_fh "line: $_";
        }
    }
    close($in_fh);
}
close($grep_fh);
#########################################

Zenthar
October 27, 2009 12:57:08 AM

First question that comes to mind is "why not just use grep?".

That aside, maybe you can try reading the whole file into an array in a single shot. It requires much more memory, but it capitalizes on sequential HDD reads instead of the much slower random access.

  my $data_file = "wrestledata.cgi";
  open(my $dat, '<', $data_file) or die("Could not open file: $!");
  my @raw_data = <$dat>;    # slurp every line in one sequential read
  close($dat);
  my @matches = grep { /test/ } @raw_data;
October 31, 2009 5:42:11 AM

Hi Zenthar,
Thank you for your post.
Actually, the time is spent opening the files. The time for the pattern match itself, whether line by line or against a slurped array, is about the same (no big difference). I need help avoiding the cost of opening the files. I even used split to break the large files into smaller ones, used fork to process them individually, and grepped multiple files in parallel, but splitting the files took more time than processing each big file directly. Is there a way to get a reference to a line number in a file, and use that reference for the pattern match instead of opening the files?
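For reference, the fork-per-file approach described above can be sketched roughly like this. This is only a minimal illustration, not the original poster's script: the four sample files are created inside the sketch so it runs standalone, and the `.hits` output names are made up.

```perl
#!/usr/bin/perl
# Hypothetical sketch: one child process per file, each grepping
# independently and writing its matches to <file>.hits.
use strict;
use warnings;

# Create four tiny sample files so the sketch runs standalone.
for my $i (1 .. 4) {
    open(my $mk, '>', "outfile$i") or die "Cannot create outfile$i: $!";
    print $mk "a test line in outfile$i\n", "nothing here\n";
    close($mk);
}

my @pids;
for my $i (1 .. 4) {
    my $file = "outfile$i";
    my $pid  = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {    # child: grep one file, then exit
        open(my $in,  '<', $file)        or exit 1;
        open(my $out, '>', "$file.hits") or exit 1;
        while (<$in>) {
            print $out $_ if /test/;
        }
        exit 0;
    }
    push @pids, $pid;   # parent: keep launching children
}

# Parent waits for every child before using the .hits files.
waitpid($_, 0) for @pids;
```

Note that forking only helps when the disk (not the CPU) can keep all the children fed; with one spindle, parallel readers can actually make the seek pattern worse.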
Zenthar
October 31, 2009 11:00:02 AM

Have you tried putting some time counters in your code to identify which part takes the most time? The only other suggestion I have is to try a producer/consumer pattern using threads. In one thread you would read the file into a queue, and in another you would do the pattern match and output (the output could even be done in a third thread). However, this is mostly useful if the reader thread isn't the bottleneck.
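Those time counters can be added with the core Time::HiRes module. A minimal sketch (the sample file is created in the sketch itself; "outfile1" is just an example name):

```perl
#!/usr/bin/perl
# Minimal timing sketch using the core Time::HiRes module.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Create a small sample file so the sketch runs standalone.
open(my $mk, '>', 'outfile1') or die "Cannot create outfile1: $!";
print $mk "a test line\n", "nothing here\n";
close($mk);

my $t0 = [gettimeofday];
open(my $fh, '<', 'outfile1') or die "Cannot open outfile1: $!";
my $t_open = tv_interval($t0);    # seconds spent in open()

my $matches = 0;
while (<$fh>) {
    $matches++ if /test/;
}
close($fh);
my $t_total = tv_interval($t0);   # open + scan, in seconds

printf "open: %.6fs  open+scan: %.6fs  matches: %d\n",
       $t_open, $t_total, $matches;
```

Comparing `$t_open` against the total will show whether opening really dominates, as claimed above.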

You can find information on perl threading here and an example of a perl producer/consumer implementation here (3rd example: prodcons.cygperl).
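A bare-bones version of that producer/consumer idea, using the standard threads and Thread::Queue modules (requires a threads-enabled perl build; this is a sketch of the pattern, not the prodcons.cygperl example itself, and the sample file is created inline):

```perl
#!/usr/bin/perl
# Producer/consumer sketch with threads and Thread::Queue.
# Requires a perl built with thread support.
use strict;
use warnings;
use threads;
use Thread::Queue;

# Create a small sample file so the sketch runs standalone.
open(my $mk, '>', 'outfile1') or die "Cannot create outfile1: $!";
print $mk "first test line\n", "no match here\n", "second test line\n";
close($mk);

my $queue = Thread::Queue->new;

# Producer: read lines into the queue, then push undef as a sentinel.
my $producer = threads->create(sub {
    open(my $fh, '<', 'outfile1') or die "Cannot open outfile1: $!";
    $queue->enqueue($_) while <$fh>;
    close($fh);
    $queue->enqueue(undef);    # tells the consumer to stop
});

# Consumer: pattern-match lines as they arrive.
my $consumer = threads->create(sub {
    my $count = 0;
    while (defined(my $line = $queue->dequeue)) {
        $count++ if $line =~ /test/;
    }
    return $count;
});

$producer->join;
my $matches = $consumer->join;
print "matched $matches lines\n";
```

As noted above, this only pays off when the consumer does enough work per line that the reader thread isn't the bottleneck.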