Intro to data manipulation through GNU Utilities (part 1)
9th June 2006
This is a quick tutorial on how a few basic commands can save you hours of work, whether that work is writing code or manual labor. If you do not have a version of *nix installed, I would suggest installing Cygwin. Cygwin is a free Linux-like environment that installs Bash, cut, cat, etc. by default.
You should only use these commands on ASCII text files. Running them on binary files tends not to go well. So if you have a Word document, first save it as ASCII (plain) text. The goal here is to work with the content itself, not to worry about the formatting of the text. Secondly, you should run the dos2unix command on the document. This will get rid of those crazy ^M characters, which are the carriage returns that Windows puts in front of every newline.
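To make the ^M cleanup concrete, here is a minimal sketch. It uses tr to strip the carriage returns, which for a plain text file gives roughly the same result as dos2unix; I swapped it in only so the example runs even where dos2unix is not installed. The file names sample_dos.txt and sample_unix.txt are made up for the illustration.

```shell
# Build a small sample file with Windows-style (CR+LF) line endings.
# Both file names are hypothetical, chosen just for this sketch.
printf 'Joe\r\nBob\r\n' > sample_dos.txt

# Delete every carriage return and write the result to a new file.
# If dos2unix is installed, "dos2unix sample_dos.txt" converts the file in place instead.
tr -d '\r' < sample_dos.txt > sample_unix.txt

# Compare sizes in bytes: the cleaned file is two bytes smaller,
# one for each carriage return that was removed.
wc -c < sample_dos.txt
wc -c < sample_unix.txt
```

If you open sample_dos.txt in an editor such as vi that shows control characters, the ^M at the end of each line is what tr removed.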
Here are the common commands that you will use often for data manipulation:
- cat
- cut
- grep
- sed/awk
The last item, sed/awk, is a bit more difficult. I'll cover it whenever I find time to do a small writeup on regular expressions.
The cat command is going to be the basis for all of the commands. You can actually use both the cut and grep commands without cat, but I find the commands much easier to use this way, and it also helps show how powerful the *nix environment really is.
The cut command is very useful for extracting a section of text from each line of a file. This is very powerful if you are dealing with a lot of repetitive information, one record per line. For instance, each line contains someone's name, phone number, and social security number separated by some sort of delimiter.
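Here is a small sketch of the two styles side by side. The file fruits.txt and its contents are invented for the example; the point is only that piping cat into a command and handing the command the file name directly give the same result.

```shell
# Create a tiny sample file (hypothetical name and contents).
printf 'apple\nbanana\ncherry\n' > fruits.txt

# Style used in this tutorial: cat feeds the file into the next command.
cat fruits.txt | grep 'an'

# Same search without cat: grep reads the file itself.
grep 'an' fruits.txt
```

Both commands print banana. The piped form is a little more typing, but it reads left to right and makes it easy to tack more commands onto the end.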
For example:
Joe|555-555-5555|123-45-6789
Bob|555-123-5555|123-13-2345
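As a rough sketch of how cut handles those two lines: the -d option sets the delimiter and -f picks which field(s) to keep. The file name people.txt is just something I made up to hold the sample data.

```shell
# Save the two sample records to a file (people.txt is a hypothetical name).
printf 'Joe|555-555-5555|123-45-6789\nBob|555-123-5555|123-13-2345\n' > people.txt

# Field 2 is the phone number; the delimiter is quoted so the shell
# does not treat | as a pipe.
cat people.txt | cut -d'|' -f2

# Fields 1 and 2 together: name and phone number, with the SSN dropped.
cat people.txt | cut -d'|' -f1,2
```

The first command prints just the two phone numbers, one per line. The second prints Joe|555-555-5555 and Bob|555-123-5555, keeping the delimiter between the fields it selected.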
The last command covered here, grep, is used to search for lines containing a given pattern. It is most often used to pull out the lines you care about when they are buried in a lot of unrelated junk data. Grep is great for log files, config files, and reports.
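A quick sketch of grep on a log file follows. The file app.log and its four lines are invented for the example; a real log will look different.

```shell
# Create a small fake log (hypothetical name and contents).
printf '%s\n' \
  'INFO service started' \
  'ERROR disk full' \
  'INFO heartbeat ok' \
  'ERROR timeout talking to db' > app.log

# Print only the lines that contain the pattern ERROR.
cat app.log | grep 'ERROR'

# -c prints how many lines matched instead of the lines themselves.
cat app.log | grep -c 'ERROR'
```

The first grep prints the two ERROR lines and skips the INFO ones; the second prints 2.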