04 June 2015

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 2)

Part 2 - Useful tools

While you learned how to survive in a command line environment in the last post, the second part of this series will get you up and running with the tools you need for data analysis. After reading this post you will not only be able to view & manipulate big data in the command line, but also know how to automate repetitive task such as generating reports or sending notification emails!

LESS
Less was developed in the 1980's as an advancement of more, an earlier file viewer. Let's deal with the obvious right in the beginning:
  • less > more
  • less is more, more or less
  • cat is a simple file reader, no more no less
$ wget https://www.gutenberg.org/files/11/11-h/11-h.htm ; html2text 11-h.htm > alice.txt ; less alice.txt #download alice in wonderland, convert from html to text and display content
/Alice -> searches for Alice in viewed file
-> searches forward for next occurrence of Alice
-> searches backward for previous occurrence of Alice

shift + g -> scroll to bottom
gg -> scroll to top
:n -> goto linenumber n
-> exits less
cat -vet alice.txt | less #show special characters like carriage returns and tabstops


MAIL
To send a simple email from the command line, use the mail command:
$ echo "content of email" | mail -s "subject line" recipient@host.com #send email
For attachments:
$ mutt -s 'Attachments' -a a1.zip < message.txt
This allows you to share small reports directly by email and notify your team when a big report is ready for download from a website or Google Drive. In Linux this normally works out of the box. On OSX you probably need to specify a SMPT server (learn how to do this here).

FILTERS

Filters are commands that manipulate data and are thus an important part of every analyst's toolbox.
grep
A classic filter example is grep, that allows you to search for strings in files. Example:
$ grep Alice alice.txt | less #print every line that includes "Alice"
$ grep -i -c "white rabbit" alice.txt #count number of lines that include "white rabbit" case insensitive
$ grep -i -o "\S" alice.txt | sort -f | uniq -i -c | sort -r -g #grep matches everything that is not a space, pipes it to sort, which pipes it through a unique filter that also gives a frequency count which then is sorted again.
This gives us the frequency distribution of letters in alice in wonderland.
output:
15442e
12239t
10112a
9496o
8675i
8179h
8069n
7548s
6648r
5756d
and so on (compare to frequency of letters in English language - no surprises here).
$ grep -i -o "\b\S*\b" alice.txt | sort -f | uniq -ic | sort -r -g | less
#display frequency distribution of words in alice in wonderland
yields:
1824the
944and
809to
695a
632of
610it
553she
545I
481you
462said
432in
404Alice
You can now build your own ngram viewer ;-)
Going forward, maybe more business relevant:
Let's imagine our company organizes its revenue data in sql tables and we want to find out revenues by company and (media) agency.

$ echo "select third_party,company,sum(revenue_usd) from RevenueStats_X where third_party_type = 'agency' and quarter = 4 group by 1,2;" | mysql some_database -B > top_companies.tsv #pull data from mysql and redirect output to file "top_companies.tsv"

The resulting csv could look like this:

Third_party company sum(revenue_usd)
WPP Group adidas 84
WPP Group foobar 42
Publicis Group foobar 21
...


$ grep -i "wpp group" top_companies.tsv | sort -r -t, -g -k 3 | less #shows top companies of agency wpp in descending order 
$ sort -r -t , -g -k 3 top_companies.tsv | head -n 100 | grep -i -c "wpp group" #shows how often wpp is agency for top 100 companies
awk and sed

awk, like bash, is an interpreted programming language. It was designed for text processing and runs commands against streams of data. To be precise it actually runs commands against tables, because every input file is spliced into sequences of records (lines) and every record is separated into fields.
An awk program is usually structured like this:
awk '{pattern}' input_file
The pattern consists of a BEGIN statement (optional), instructions what awk should do with the records that are being streamed from the file and an END statement (optional). In the command below we set the FS and OFS variable to change the input delimiter (field separator) to "\t" and the output delimiter  to "," and then print the 1st and 3rd field of every record.
$ awk 'BEGIN {FS ="\t" ; OFS=","} ; {print $1,$3}' top_companies.tsv | less
#displays only agency and revenue
As you see, we can select fields by $fieldnumber and change the field separator to ",". We can also roll up revenue ($3) to agency ($1) by using a for-loop in awk. Note that we are using printf to be able to display numbers in a nice format (default seems to be scientific notation):
$ awk 'BEGIN {FS ="\t" ; OFS=","; rev[""]=0;} ; {agency=$1; revenue=$3; rev[agency]+=revenue}; END {for (i in rev) {printf "%s,%d\n",i,rev[i]}}' top_companies.tsv | sort -k2 -r -t, -g > top_agencies.csv #roll up revenue to agency and sort output

The resulting csv could look like this:

Third_party, company, sum(revenue_usd)
WPP Group, 126
Publicis Group, 23
...


sed, formally awk's predecessor, also works with data streams from text-files and is mainly used for substitution. Let's say we want to change the name "Publicis Group" to "Publicis":
$ sed 's/Publicis Group/Publicis/' top_agencies.csv | less
We could also delete the record of Publicis Group altogether:
$ sed '/Publicis Group/d' top_agencies.csv | less
By the way, sed also accepts regex:
$ sed 's/[aeiou]/jvdh/g' top_agencies.csv | less #replace every vowel with "jvdh" (because we can)
$ sed -r 's/(.*),(.*)/\2,\1/g' top_agencies.csv | less #reverse order of columns
Other useful filters
  • tr -> transliterates or deletes. Example:
$ cat top_agencies.csv | tr 'aeiou' 'uoiea' | less #replaces every a in top_agencies.csv with u, every e with o and so on…
$ cat alice.txt | tr A-Za-z N-ZA-Mn-za-m | less #rot13 - to "decrypt", just use same tr command again
$ cat top_agencies.csv | tr -d 'aeiou' | less # deletes every vowel from top_agencies.csv
  • sort -> Does exactly what it says on the tin
$ sort -r -t, -g -k 3 top_companies.csv | less #shows top companies in descending order
  • cut -> cuts a column from your file
$ cut -d, -f1 top_agencies.csv | less #cuts first column from csv with delimiter ","
  • uniq -> like unique formula in Excel
$ cut -d, -f1 top_companies.csv | uniq | less #eliminates duplicates from column with agencies
  • split -> splits a file in smaller files
$ split -n 5 top_companies.csv # splits top_divisions.csv in 5 equal-sized files
  • cat -> concatenates 2 or more files
$ cat top_companies.csv top_agencies.csv | less
  • head / tail -> view first 10 / last 10 lines of a file
SCHEDULING JOBS
The beauty (and flaw) of Unix is that, initially, it has been created by programmers mostly with programmer's needs in mind. As programmers have an aversion for repetitive work, automation tools belong to most basic set of Unix utilities.
The name cron comes from the Greek word for time, χρόνος (chronos) and is used in Unix to signify a daemon (background process) that sources its orders from a crontab file. By using the crontab file you can order this daemon to execute scripts or commands for you at regular intervals.
$ crontab -e #edit crontab. On first opening it will prompt you to specify an
editor. I will assume that you use vim
i -> insert mode
0 3 * * * command #execute command every day at 3:00 AM in the morning
ESC -> command mode
:wq -> save and quit
Crontab configurations explained:
 * * * * *  command to execute
 | | | | |
 | | | | |
 | | | | .----- day of week (0 - 6) (0 to 6 are Sunday to Saturday, or use names; 7 is Sunday, the same as 0)
 | | | .---------- month (1 - 12)
 | | .--------------- day of month (1 - 31)
 | .-------------------- hour (0 - 23)
 .------------------------- min (0 - 59)
at
If you are scared of daemons because you confuse them with demons (like the people of Athens who succumbed to the same fallacy when they prosecuted Socrates) you can use command at:
$ at teatime
at> echo "boil water" | mail -s "make tea" email@example.com #sends mail to email@example.com at 4pm today or on the next day if it is after 4pm
$ at 2015-05-09 22:00
at> cvlc "http://mp3-live.swr.de/swr2_m.m3u" --sout file/mp3:/usr/local/foobar/home/jvdh/Music/keith_jarrett_swr2.mp3 --run-time=14400 --stop-time=14400 vlc://quit #schedule recording of mp3 stream from 22pm to 2am by using command line version of vlc player
You can also say at 15:45 or 15:45 + 3 days and so on, but unlike cron, at cannot schedule jobs in constant intervals. Therefore, cron is handy for recurring tasks like weekly reports and at is useful for one-time tasks.

I hope this new knowledge about filters and daemons allows you to automate a lot of monkey work and enables you to run deeper analysis than you could before! If you would like to internalize those new concepts, you can do so by solving this problem:

Using the mail command send a rot13-encoded message to a friend, asking him to reply in rot13 as well. If you get an encoded message back, try to decode it. Finally, compare your solution with my solution here. Also check out part 3 which will discuss a few real world use cases of command line tools!