Calculate the largest field in a big data file

Problem

I wanted to upload a delimited file to a MySQL db, but hit the problem that the file contained nearly 3000 rows and I had no schema telling me the required size of each field.

Therefore I needed to traverse the file and calculate the length of each field, then at the end print the largest field length found for each column.



Solution

Perl to the rescue! Pretty easy in Perl, after scratching my head attempting it with awk. 🙂

Anyway, the code is under the example tab. I’m using the pipe symbol “|” as the delimiter – just substitute your own delimiter, cat your file and pipe it through this script.

In the example I also show how you can manipulate a file to produce pipe-delimited fields in the first place.



Example


This is how to run the script. Basically you just need to pipe your output through find_largest.pl.

cat yourfile | ./find_largest.pl

Here is the code.

#!/usr/bin/perl

# Track the longest field length seen so far in each column.
@highest = ();

while (<STDIN>) {
    chomp;                      # drop the trailing newline so it isn't counted in the last field
    @thisline = split(/\|/);    # "|" must be escaped - it's a regex metacharacter

    for ($i = 0; $i <= $#thisline; $i++) {
        $thislength = length($thisline[$i]);
        if ($thislength > $highest[$i]) { $highest[$i] = $thislength; }
    }
}

print(join("|", @highest) . "\n");

exit(0);

__END__

So for example, if I want to find the largest fields in one of my web logs - for crunching into a db:

  • First off I only want lines starting with a space, then a number.
  • Next I need to replace all multiple spaces between fields, with 1 space.
  • Then I replace the spaces between each field, with a pipe.
  • Lastly I pump it through find_largest.pl – which gives me the largest size for each field


/var/log/httpd% grep "^ [0-9]" access_log.tools | sed -e 's/  */ /g' -e 's/ /|/g' | ~/Perl_Bin/find_largest.pl
|14|1|1|21|6|5|103|9|3|5
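The grep-and-sed stage can be sketched on a tiny synthetic log (the two sample lines are made up): squeeze each run of spaces down to one, then turn every remaining space into a pipe.

```shell
printf ' 10  GET /index.html  200\n 9 POST /form 404\n' \
  | grep '^ [0-9]' \
  | sed -e 's/  */ /g' -e 's/ /|/g'
# prints:
# |10|GET|/index.html|200
# |9|POST|/form|404
```

Note the leading pipe: the line starts with a space, so the first field comes out empty – which is why the final output above also begins with “|”.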





