Calculate the largest field in a big data file

[problem]

I wanted to upload a delimited file to a MySQL database, but hit the problem that the file contained nearly 3000 rows and there was no schema telling me the required size of each field.

Therefore I needed to traverse the file, calculate the length of every field, and at the end print the largest length found for each column.

[/problem]

[solution]

Perl to the rescue! This is pretty easy in Perl, after scratching my head attempting it with awk. 🙂

Anyway, the code is under the example tab. I’m using the pipe symbol “|” as the delimiter, so just substitute your own delimiter, cat your file and pipe it through the script.

In the example I also show how you can manipulate a file to produce pipe-delimited fields in the first place.
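If your file uses a different delimiter, the only line that needs to change is the split in the script below. As a rough sketch (assuming simple fields, with no quoting or embedded delimiters), for a tab-separated file that line would become:

my @thisline = split /\t/;

and for a plain comma-separated file:

my @thisline = split /,/;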

[/solution]

[example]

This is how to run the script. Basically you just need to pipe your output through find_largest.pl (make the script executable with chmod +x find_largest.pl first).

cat yourfile | ./find_largest.pl

Here is the code.

#!/usr/bin/perl
use strict;
use warnings;

# Longest value seen so far for each column, indexed by column number.
my @highest = ();

while (<STDIN>) {
    chomp;

    # Split on the pipe delimiter. The backslash is needed because | is a
    # regex metacharacter: an unescaped /|/ would split on every character.
    my @thisline = split /\|/;

    for my $i (0 .. $#thisline) {
        my $thislength = length $thisline[$i];
        if ($thislength > ($highest[$i] // 0)) {
            $highest[$i] = $thislength;
        }
    }
}

# Print the maximum length found for each column, pipe separated.
# Columns that were always empty are left blank.
print join("|", map { $_ // '' } @highest), "\n";

exit(0);

__END__
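As a quick sanity check, here is a hypothetical two-line input where the answer is easy to verify by eye: the fields are 1, 2 and 3 characters long on the first line and 4, 1 and 2 on the second, so the script should report 4, 2 and 3.

printf 'a|bb|ccc\ndddd|e|ff\n' | ./find_largest.pl
4|2|3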


So, for example, here is how I find the largest fields in one of my web logs before crunching it into a database:

  • First off, I only want lines starting with a space and then a number.
  • Next, I replace every run of multiple spaces between fields with a single space.
  • Then I replace each remaining space between fields with a pipe.
  • Lastly, I pump it through find_largest.pl, which gives me the largest size seen for each field.


[marcus@bree]/var/log/httpd% grep "^ [0-9]" access_log.tools | sed -e 's/  */ /g' -e 's/ /|/g' | ~/Perl_Bin/find_largest.pl
|14|1|1|21|6|5|103|9|3|5

The leading empty field in the output simply corresponds to the space that each matched log line starts with.
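Those maximum sizes tell me how big each database column needs to be; the 103-character field, for instance, would need at least a VARCHAR(103). And if you would rather keep the whole pipeline in Perl, the two sed expressions can be collapsed into a single substitution; this sketch is equivalent to the command above:

[marcus@bree]/var/log/httpd% grep "^ [0-9]" access_log.tools | perl -pe 's/ +/|/g' | ~/Perl_Bin/find_largest.pl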

[/example]

[reference]

[tags]Perl, Data Analysis, Perl Coding School[/tags]

[/reference]

