Speeding up file reads and pattern search

sgravi

Hi,

I have very big files named outfile1 to outfile4 and I want to grep for "test" in all of them. I wrote the script below ("program1.pl") and it works fine, but the run time is very high (10.7 seconds for 4 files). If I run it on more (>100) big files it runs for a long time, so I want to speed it up. Can anyone suggest how to improve the speed?

######### program1.pl ##########
#!/usr/bin/perl
use strict;
use warnings;

open(my $grep_out, '>', 'grepfile') or die "Cannot open grepfile: $!";
for my $i (1 .. 4) {
    my $outfile = "outfile$i";    # outfile1 .. outfile4
    open(my $in, '<', $outfile) or die "Cannot open $outfile: $!";
    while (<$in>) {
        if (/test/) {
            print $_;                    # $_ already ends with a newline
            print $grep_out "line: $_";
        }
    }
    close($in);
}
close($grep_out);
#########################################


Zenthar

First question that comes to mind is "why not just use grep?".
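
For example, something like this from the shell would cover the same job (grep's -n flag records the line number as well):

Code:
grep -n "test" outfile1 outfile2 outfile3 outfile4 > grepfile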

That aside, maybe you can try reading the whole file into an array in a single shot. It would require much more memory, but it would also capitalize on sequential HDD reads instead of the much slower random access.

Code:
my $data_file = "wrestledata.cgi";
open(my $dat, '<', $data_file) or die "Could not open file: $!";
my @raw_data = <$dat>;    # slurps every line into memory in one shot
close($dat);

sgravi

Hi Zenthar,
Thank you for your post. Actually, the time is spent opening the files; whether I pattern match line by line or against a slurped array makes little difference. I need help avoiding the cost of opening the files. I even used the split command to break the large files into smaller ones and used fork to process them individually, grepping multiple files in parallel, but splitting the files took more time than just processing each big file. Is there a way to get a reference to a line number in a file, so that the reference can be used for the pattern match instead of opening the file?
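
This is roughly the fork approach I tried (a sketch; each child writes its own grepfile$i so the outputs don't interleave):

Code:
#!/usr/bin/perl
use strict;
use warnings;

# One child per file; each child scans a whole file directly,
# so there is no need to split the big files first.
my @pids;
for my $i (1 .. 4) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {    # child process
        open(my $in,  '<', "outfile$i")  or die "Cannot open outfile$i: $!";
        open(my $out, '>', "grepfile$i") or die "Cannot open grepfile$i: $!";
        while (<$in>) {
            print $out "line: $_" if /test/;
        }
        close($in);
        close($out);
        exit 0;
    }
    push @pids, $pid;    # parent: remember the child
}
waitpid($_, 0) for @pids;    # wait for every child to finish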

Zenthar

Have you tried putting some time counters in your code to identify which part takes the most time? The only other suggestion I have is to maybe try a producer/consumer pattern using threads: one thread reads the file into a queue, and another does the pattern matching and output (the output could even be done in a 3rd thread). However, this will mostly help if the reader thread isn't the bottleneck.
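
For the timing, something like this (a rough sketch using the core Time::HiRes module, on one of your files):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Time the open and the read/match phases separately to find the bottleneck.
my $t0 = [gettimeofday];
open(my $in, '<', 'outfile1') or die "Cannot open outfile1: $!";
printf "open:       %.6f s\n", tv_interval($t0);

$t0 = [gettimeofday];
my $hits = 0;
while (<$in>) {
    $hits++ if /test/;
}
close($in);
printf "read+match: %.6f s (%d matches)\n", tv_interval($t0), $hits;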

You can find information on Perl threading here and an example of a Perl producer/consumer implementation here (3rd example: prodcons.cygperl).
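
A minimal sketch of that idea with the core threads and Thread::Queue modules (your file names assumed; prodcons.cygperl from the link will differ in the details):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new();

# Producer: read the files sequentially and feed every line to the queue.
my $producer = threads->create(sub {
    for my $i (1 .. 4) {
        open(my $in, '<', "outfile$i") or die "Cannot open outfile$i: $!";
        while (my $line = <$in>) {
            $queue->enqueue($line);
        }
        close($in);
    }
    $queue->enqueue(undef);    # sentinel: tells the consumer we are done
});

# Consumer: match the pattern and write the hits, overlapping with the reads.
my $consumer = threads->create(sub {
    open(my $out, '>', 'grepfile') or die "Cannot open grepfile: $!";
    while (defined(my $line = $queue->dequeue())) {
        print $out "line: $line" if $line =~ /test/;
    }
    close($out);
});

$producer->join();
$consumer->join();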