13 June 2015

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 3)

Part 3 - Usecases

You have read part 1 and part 2 of this series and wonder whether you will be able to actually apply all those new and complicated commands for anything practical? Wonder no more: In the third and last post I compiled a list of real world use cases to nudge your inspiration.

Automation of continuous reports
Problem: Revenue report for team needs to be generated and emailed every week.
Solution: Use mysql, mail and a cron job
$ vim revenue_cron.sh # create bash script that cron job can call
echo "select * from RevenueStats_TeamFoo;" | mysql some_database -B > ~/$(date +%y%m%d_rev_report.tsv) #pull data from mysql and call the tsv current_date_rev_report
mv ~/$(date +%y%m%d_rev_report.tsv) '~/jvdh/Google Drive/rev_report/' #move report into Google Drive folder where it will be uploaded automatically
echo "Hi Team,
The latest revenue reporting tsv is ready for download at
Google Drive.
Greetings,
jvdh" | mail -s "marketview report is ready" team@foobar.com" #exit vim
$ crontab -e
0 3 * * 4 ~/code/revenue_cron.sh #every Thursday at 3am, assuming you saved the script in folder ~/code/revenue_cron.sh


Client Reportings

Problem: We need to quickly generate a opportunity reporting for every client in our portfolio
Solution: Use a for loop and copy the mysql output to the Google Drive folder
$ vim opp_packs.sh
#!/bin/bash
for i in 'client a' 'client b'  'client c'  'client ...'
do
 client=$(echo $i | sed 's/\s//g')  #drive file name must not have spaces
 echo $client
 echo "select
    division_name as client,
    customer_id,
    account_name,
    campaign_name,
    sum(rev_potential_usd/1.2696) as revenue_potential_eur
    from Rev_Opportunities
    where division_name = '${i}'
    group by 1,2;" | mysql some_database -B > '~/jvdh/Google Drive/rev_opps/${i}.csv'done
$ chmod 740 opp_packs.sh #set permissions for file $ ./opp_packs.sh #execute script
Data Cleaning
Problem: We need to clean a csv (consolidated.csv) file with keywords (> 2mio lines)

csv looks like this:

kw, primary-category, secondary-category
foobar, /Business & Industrial/Business Management/Technology Consulting, /Apparel
Haftbefehl, /Arts & Entertainment/Music & Audio, /Law & Government
pottery, /Hobbies & Leisure/Antiques & Collectibles/Collectible Bric-A-Brac, /Hobbies & Leisure/Antiques & Collectibles/Antique Ceramics & Pottery
...
Solution: Use a combination of sed, cut and split
$ sed /\Adult*/d consolidated.csv > clean.csv #remove adult category
$ cut -f1,5 clean.csv > pre_final.csv # should be only primary category and keyword
$ grep $'\t' pre_final.csv > final.csv # remove keywords without category; use $ because bash only knows \t is tab when we type “$” before argument
$ sort -t$'\t' -k2 final.csv | less # sort csv

Those are only a few useful appliances of command line scripting for data analysis. Due to my current job, the examples are probably pretty reporting &  automation heavy, but rest assured that there are also many applications for much deeper analytical undertakings: See some examples in the excellent blog posts of Greg Rega or Bernd Zuther.