31 May 2015

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 1)

Part 1 - Unix Basics

Imagine your goal is to view or manipulate large quantities of data. Google Spreadsheets is crashing, and so is Excel. What do you do? A demon from an ancient world might offer itself as your sword to slash the Gordian knot of big data: the Unix utilities. They are versatile, fast, and handle far larger files than Excel. The best thing? They are not only free, but (if you are running Linux or OS X) already installed on your computer.

However, the price you pay is the time spent learning how to use these initially unintuitive and badly documented tools. The first part of this blog post series aims to help you take the first step of a journey through the Unix utilities (which you may never want to end). Afterwards, we are going to explore powerful data manipulation tools in part 2 and a few use cases in part 3. If you already know your way around the command line and only want to check out the data science applications, I suggest you skip to part 2 or part 3 right away.

Let's start with a quick history:

Unix was developed by Ken Thompson at Bell Labs starting in 1969, partly because he wanted an operating system that allowed him to play "Space Travel" on a cast-off PDP-7. It has since evolved and now constitutes the foundation of most operating systems, the prominent exception being Windows. Both Mac OS and Linux are based on the Unix philosophy and include many Unix utilities.

Now, let's move on to something practical:

FILESYSTEM BASICS


To access the command line, open a terminal on your Linux or OS X machine. As the command line does not provide us with a graphic interface where we can use the mouse, we have to rely on the "shell", which is a program that accepts commands as text input and converts commands to appropriate operating system functions. Here is an overview of terminal commands that get you started:
ls
lists content of current directory
ls -l
a detailed listing of the content of the current directory
pwd
print path of current working directory
cd (or cd ~)
go to your home directory
cd parent_dir/child_dir
enter the directory "child_dir" inside "parent_dir", by relative path from the current directory
cd /foo/data/rw/users/jo/jvdh
enter a directory by absolute path from the root (note the leading slash)
cd ..
go to parent directory of current directory
mkdir new_dir_to_be_created
create a new directory "new_dir_to_be_created"
rm file_to_be_deleted
delete file "file_to_be_deleted"
rm -R dir_to_be_deleted
remove "dir_to_be_deleted" and its content
mv file_old_name new_path/file_new_name
move file to new path and/or rename it
cat file_to_be_viewed
prints content of "file_to_be_viewed"
less large_file_to_be_viewed
shows content of "large_file_to_be_viewed". Unlike cat, it doesn't print everything right away, but lets you scroll through and search the content
vim file_to_be_edited
opens up vim, a very popular and useful text editor for command line interfaces
chmod 740 file_to_be_executed
changes the permissions of "file_to_be_executed" to read, write & execute for you, read for your group, and no permissions for other users
cp file_to_be_copied location_of_copy
copies file "file_to_be_copied" to location "location_of_copy"
man any_command
shows manual page of "any_command"
echo "foo"
prints foo
# comment
starts a comment. Everything after "#" is ignored by bash
foo ; bar
executes command foo and then command bar
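
A short session that combines several of the commands above (the directory and file names are made up for illustration):

```shell
# Create a working directory, put a file in it, inspect it, then clean up.
mkdir demo_dir                       # create a new directory
cd demo_dir                          # enter it
echo "hello unix" > greeting.txt     # write a line into a new file
cat greeting.txt                     # print the file's contents
ls -l                                # detailed listing of the directory
cd ..                                # go back to the parent directory
rm -R demo_dir                       # remove the directory and its content
```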

BASH


As described above, we need a command line shell that interprets commands like the ones above and sends them to the operating system. In our case, this is bash, one of the most popular shells out there, which can also be used as a scripting language. Other shells in use include zsh or PowerShell (on Windows). Bash has many nice features (that may or may not be exclusive to it):
Tab completion
If you're completely new to Unix and bash, tab completion makes your life much less miserable. Tab completion knows your environment and can locate and complete names for you: you not only type less, but can also find directories and files whose names are ambiguous or too long to type. Let's start with the example of a really long file path:
/foo/data/rw/users/jo/jvdh/www/index.html
It would be really painful to have to type the entire file path every time you want to access this file. Instead, you type only the first few characters into bash and press TAB to auto complete the rest. As seen below, tab completion works once what you have typed is unambiguous: bash then knows what you're after and completes the rest. If it isn't unambiguous, pressing TAB twice simply prints a list of the possibilities, as shown in the last example. Once you have typed enough to make the match unique, a single TAB completes the rest.
vim /fo[TAB]/da[TAB]ta/rw/[TAB]users/jo/[TAB]jvdh/www/in[TAB]dex.html
Moreover, you can use the up and down arrow keys to move through your command history.
Good to know
  • to abort a job (e.g. rm some_dir), press ctrl + c. To suspend it, press ctrl + z (you can then resume it in the background with the "bg" command). To quit a program (e.g. the python interpreter), press ctrl + d. There are exceptions, though: To exit vim, you have to press escape to go into command mode and then enter ":wq" (to save and quit) or ":q!" (to quit without saving). Other programs may require other commands
  • most commands and programs take flags, written as "-f" or "--foobar". They allow you to modify the behavior of the command/program or get help (via the --help or -h flag). You can usually see all available flags by typing "man command" or "command --help"
  • You can save a sequence of bash commands in a script with an ".sh" extension. After you have changed the permissions to read, write & execute for you (chmod 700 script.sh) you can run the script by typing /foo/bar/script.sh or ./script.sh if your current directory is bar.
  • To use the output of a bash command as the argument of another command, you capture it with command substitution, either via a variable:
$ today=$(date +%F) #save current date in YYYY-MM-DD format to variable "today"
$ mv file_name $today"_file_name" #rename filename to include date
or like this:
$ mv file_name $(date +%F_file_name.csv) #command substitution directly in the argument
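
Putting the pieces above together, here is a sketch of a tiny script (the name backup.sh and its argument are made-up placeholders) that copies a file and stamps the copy with today's date:

```shell
#!/bin/bash
# backup.sh (hypothetical name): copy the file given as the first argument,
# appending today's date (YYYY-MM-DD) to the copy's name
cp "$1" "$1_$(date +%F)"
```

After chmod 700 backup.sh, running ./backup.sh notes.txt leaves a copy named like notes.txt_2015-05-31 next to the original.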

REDIRECTION AND PIPES


These are two of the most powerful Unix features, and they go a long way toward explaining its early success (as stated in this video from 1982). Redirection reroutes a command's input or output: the symbol > redirects the output of a command into a file, and the symbol < feeds the contents of a file into a command's input. For example
$ mysql some_database -B < my_sql_script.sql > my_csv.tsv #redirects my_sql_script.sql to mysql and the output of mysql to the text-file my_csv.tsv
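You don't need a database to try this pattern; any command that reads standard input works the same way. A toy example with sort (the file names are made up):

```shell
printf "banana\napple\ncherry\n" > fruits.txt  # create a small test file
sort < fruits.txt > fruits_sorted.txt          # read input from fruits.txt, write sorted output to fruits_sorted.txt
cat fruits_sorted.txt                          # prints apple, banana, cherry (one per line)
```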
Piping is another powerful Unix method. A pipe (|) takes the output of one command and sends it as the input of another. For example
$ echo "select * from NorthAmerica_RevenueStats_All limit 100;" | mysql some_database -B > my_csv.tsv #pipes the output of echo "..." to mysql and redirects mysql output to my_csv.tsv
You can imagine that there are lots of really long and/or powerful bash one-liners that pipe and redirect the input and output of many commands.
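
For instance, a classic three-stage pipeline counts how often each line occurs in a file, most frequent first (words.txt is a made-up example file):

```shell
printf "cat\ndog\ncat\nbird\ncat\ndog\n" > words.txt  # a small test file
sort words.txt | uniq -c | sort -rn                   # group identical lines, count them, sort by count
```

Note that uniq -c only counts adjacent duplicates, which is why the first sort is needed; the final sort -rn orders the counts numerically, largest first.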

If you have made it this far, congratulations! You can now 'survive' in a command line environment. As they say, practice makes perfect, so go ahead and solve these exercises. Solutions can be found here.

The next part in this series will deal with useful command line tools for data analysis. Stay tuned.