Calculate the largest field in a big data file

Problem

I wanted to upload a delimited file to a MySQL db, but hit the problem that the file contained nearly 3000 rows and I had no schema telling me the required size of each field.

Therefore I needed to traverse the file and calculate the length of each field, then at the end print the largest field length found for each column.



Solution

Perl to the rescue! Pretty easy in Perl, after scratching my head attempting it with awk. 🙂

Anyway, the code is under the example tab. I’m using the pipe symbol “|” as the delimiter – just substitute your own delimiter, cat your file and pipe it through this script.

In the example I also show how you can manipulate a file to produce pipe-delimited fields in the first place.



Example


This is how to run the script. Basically you just need to pipe your output through find_largest.pl.

cat yourfile | ./find_largest.pl

Here is the code.

#!/usr/bin/perl

# Track the longest field length seen so far in each column.
@highest = ();

while (<STDIN>) {
    chomp;                      # drop the trailing newline so it isn't counted in the last field
    @thisline = split(/\|/);    # "|" must be escaped - it's a regex metacharacter

    for ($i = 0; $i <= $#thisline; $i++) {
        $thislength = length($thisline[$i]);
        if ($thislength > $highest[$i]) { $highest[$i] = $thislength; }
    }
}

print(join("|", @highest) . "\n");

exit(0);

__END__

So for example, if I want to find the largest fields in one of my web logs - for crunching into a db:

  • First off I only want lines starting with a space, then a number.
  • Next I need to replace all multiple spaces between fields, with 1 space.
  • Then I replace the spaces between each field, with a pipe.
  • Lastly I pump it through find_largest.pl – which gives me the largest size for each field


/var/log/httpd% grep "^ [0-9]" access_log.tools | sed -e 's/  */ /g' -e 's/ /|/g' | ~/Perl_Bin/find_largest.pl
|14|1|1|21|6|5|103|9|3|5
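The grep-and-sed stage can be sketched on a tiny synthetic log (the two sample lines are made up): squeeze each run of spaces down to one, then turn every remaining space into a pipe.

```shell
printf ' 10  GET /index.html  200\n 9 POST /form 404\n' \
  | grep '^ [0-9]' \
  | sed -e 's/  */ /g' -e 's/ /|/g'
# prints:
# |10|GET|/index.html|200
# |9|POST|/form|404
```

Note the leading pipe: the line starts with a space, so the first field comes out empty – which is why the final output above also begins with “|”.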





