===== 4. Texting and Driving =====

==== Using regular expressions ====

  * An explanation of regular expressions: http://www.linuxforu.com/2011/04/sed-explained-part-1/
  * Visualize a regular expression with: http://www.regexper.com

==== Searching and mining a text inside a file with grep ====

  * Usage of grep: <code># searching for lines containing a pattern:
$ grep "pattern" filename
this is the line containing pattern

# read from stdin:
$ echo -e "this is a word\nnext line" | grep word 
this is a word

# search in multiple files:
$ grep "match_text" file1 file2 file3 ... 

# highlight the word in the line:
$ grep word filename --color=auto
this is the line containing word

# use full set of regex:
$ grep -E "[a-z]+" filename
#or
$ egrep "[a-z]+" filename

# output only the matching portion of the line:
$ echo this is a line. | egrep -o "[a-z]+\."
line.

# Print all lines except the ones containing match_pattern:
$ grep -v match_pattern file # -v inverts the matches.

# Count the number of matching lines:
$ grep -c "text" filename
10

# count the number of matching items (could be many per line):
$ echo -e "1 2 3 4\nhello\n5 6" | egrep -o "[0-9]" | wc -l
6

# print the line number of matching strings:
$ cat sample1.txt
gnu is not unix
linux is fun
bash is art

$ cat sample2.txt
planetlinux

$ grep linux -n sample1.txt
2:linux is fun

# or

$ cat sample1.txt | grep linux -n

# print the byte offset of each match, counted from the start of the input:
$ echo gnu is not unix | grep -b -o "not" # with -o, -b reports the offset of the match instead of the offset of the line
7:not
</code>

  * Recursively search in all text files in a directory: <code>$ grep "text" . -R -n
# for instance:
$ cd src_dir
$ grep "test_function()" . -R -n
./miscutils/test.c:16:test_function();
</code>

  * Ignore case of pattern: <code>$ echo hello world | grep -i "HELLO"
hello world</code>

  * grep by matching multiple patterns:<code>$ echo this is a line of text | grep -e "this" -e "line" -o
this
line

# or we could use a pattern file:
$ cat pat_file
hello
cool

$ echo hello this is cool | grep -f pat_file
hello this is cool
</code>

  * Including and excluding files in a grep search:<code>$ grep "main()" . -r --include="*.c" --include="*.cpp"

# or
$ grep "main()" . -r --exclude "README" </code>

  * Using grep with xargs:<code>$ echo "test" > file1
$ echo "cool" > file2
$ echo "test" > file3

$ grep "test" file* -lZ | xargs -0 rm</code>

  * Silent output for grep (when we only want to know if there was a match or not): <code>#!/bin/bash 
#Filename: silent_grep.sh
#Desc: Testing whether a file contains a text or not
if [ $# -ne 2 ]; then
  echo "Usage: $0 match_text filename"
  exit 1
fi
match_text=$1 
filename=$2
grep -q "$match_text" "$filename"
if [ $? -eq 0 ]; then
  echo "The text exists in the file"
else
  echo "Text does not exist in the file"
fi</code>

  * Print lines before and after a match:<code># In order to print three lines after a match, use the -A option:
$ seq 10 | grep 5 -A 3
5
6
7
8

# In order to print three lines before the match, use the -B option:
$ seq 10 | grep 5 -B 3
2
3
4
5

# In order to print three lines both before and after the match, use the -C option:
$ seq 10 | grep 5 -C 3
2
3
4
5
6
7
8
</code>
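
The silent-grep script above checks `$?` after running grep. Since `if` tests the exit status of a command directly, the same check can be folded into a small function. This is only a sketch: `contains_text` and the sample file are hypothetical names used for illustration.

```shell
#!/bin/sh
# contains_text PATTERN FILE: succeed (exit 0) if FILE has a line matching PATTERN.
# The name contains_text is made up for this sketch.
contains_text() {
  # -q suppresses output; -- ends option parsing so a pattern starting with - still works
  grep -q -- "$1" "$2"
}

sample=$(mktemp)
printf 'gnu is not unix\nlinux is fun\nbash is art\n' > "$sample"

if contains_text "linux" "$sample"; then
  echo "The text exists in the file"
else
  echo "Text does not exist in the file"
fi

rm -f "$sample"
```

With the sample data shown, the first branch is taken because the second line contains "linux".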

==== Cutting a file column-wise with cut ====

  * Usage of cut:<code>#Prototype:
cut -f FIELD_LIST filename

# Example (cut splits fields on tabs by default, so the columns below must be tab-separated):
$ cat student_data.txt 
No  Name  Mark  Percent
1  Sarath  45  90
2  Alex  49  98
3  Anu  45  90

$ cut -f1 student_data.txt
No 
1 
2 
3 

$ cut -f2,4 student_data.txt
Name     Percent
Sarath   90
Alex     98
Anu       90

# print every column except the selected ones:
$ cut -f3 --complement student_data.txt
No  Name    Percent 
1   Sarath  90
2   Alex    98
3   Anu     90
</code>

  * The delimiter character can be specified with -d:<code>$ cut -f2 -d";" delimited_data.txt</code>
  * Besides fields (-f), we can also select ranges of characters (-c) or bytes (-b).
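
A short sketch of the range syntax mentioned above. The sample data is made up for illustration and is built with printf so that the fields are really tab-separated:

```shell
#!/bin/sh
# Build a small tab-delimited sample (cut splits on tabs by default).
data=$(printf 'No\tName\tMark\n1\tSarath\t45\n2\tAlex\t49\n')

# Fields 2 through the last one:
printf '%s\n' "$data" | cut -f2-

# Characters 1 to 5 of each line:
echo "abcdefghijklmnopqrstuvwxyz" | cut -c1-5      # abcde

# From character 24 to the end, and from the start up to character 3:
echo "abcdefghijklmnopqrstuvwxyz" | cut -c24-      # xyz
echo "abcdefghijklmnopqrstuvwxyz" | cut -c-3       # abc

# Several ranges at once; cut prints them in input order:
echo "abcdefghijklmnopqrstuvwxyz" | cut -c1-3,6-8  # abcfgh
```

An open-ended range such as `N-` or `-M` works the same way with -f, -c and -b.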

==== Using sed to perform text replacement ====

  * sed usage:<code># Prototype:
$ sed 's/pattern/replace_string/' file
# or:
$ cat file | sed 's/pattern/replace_string/'</code>

  * To save the changes in the source file we use the -i flag:<code>$ sed -i 's/text/replace/' file</code>
  * Additional usage: <code># for global replacement:
$ sed 's/pattern/replace_string/g' file

# could also start replacing from the Nth occurrence onwards (the Ng flag is a GNU sed extension):
$ echo thisthisthisthis | sed 's/this/THIS/2g' 
thisTHISTHISTHIS
$ echo thisthisthisthis | sed 's/this/THIS/3g' 
thisthisTHISTHIS
$ echo thisthisthisthis | sed 's/this/THIS/4g' 
thisthisthisTHIS

# we can use any delimiter in sed:
sed 's:text:replace:g'
sed 's|text|replace|g'

# need to escape delimiter if applicable:
sed 's|te\|xt|replace|g'

# remove blank lines:
$ sed '/^$/d' file

# Use the match string:
$ echo this is an example | sed 's/\w\+/[&]/g'
[this] [is] [an] [example]

# use the substring matches:
$ echo this is digit 7 in a number | sed 's/digit \([0-9]\)/\1/'
this is 7 in a number

$ echo seven EIGHT | sed 's/\([a-z]\+\) \([A-Z]\+\)/\2 \1/'
EIGHT seven

# Combination of expressions:
$ sed 'expression' | sed 'expression'
$ sed 'expression; expression'
$ sed -e 'expression' -e 'expression'

# supporting string evaluation (with double quotes)
$ text=hello
$ echo hello world | sed "s/$text/HELLO/" 
HELLO world 
</code>
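
The difference between a plain number flag and a number followed by g is easy to miss, so here is a small sketch of both, plus an in-place edit that keeps a backup. It assumes GNU sed for the `2g` form; the temporary file is created only for the demonstration.

```shell
#!/bin/sh
# Replace only the 2nd occurrence on each line (a plain number flag):
echo thisthisthisthis | sed 's/this/THIS/2'     # thisTHISthisthis

# Replace the 2nd occurrence and every later one (number plus g, GNU sed):
echo thisthisthisthis | sed 's/this/THIS/2g'    # thisTHISTHISTHIS

# Edit a file in place but keep the original under the given suffix:
tmp=$(mktemp)
echo "some text" > "$tmp"
sed -i.bak 's/text/replace/' "$tmp"
cat "$tmp"        # some replace
cat "$tmp.bak"    # some text
rm -f "$tmp" "$tmp.bak"
```

Keeping a backup suffix is a cheap safety net when trying out a new expression with -i.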


==== Using awk for advanced text processing ====

  * Structure of awk script: <code>awk ' BEGIN{  print "start" } pattern { commands } END{ print "end" }' file

# for example:
$ awk 'BEGIN { i=0 } { i++ } END{ print i}' filename
</code>

  * When the arguments of print are comma separated, they are printed with a space delimiter: <code>$ echo | awk '{ var1="v1"; var2="v2"; var3="v3"; \
print var1,var2,var3 ; }'
v1 v2 v3

# otherwise we could do:
$ echo | awk '{ var1="v1"; var2="v2"; var3="v3"; \
print var1 "-" var2 "-" var3 ; }'
v1-v2-v3
</code>

  * Special variables in awk: <code>
NR: current record number(eg. current line when lines are used as records)
NF: number of fields in the current record (fields are separated by whitespace by default)
$0: text content of current line
$1: text of first field
$2: text of second field.

# for instance:
$ echo -e "line1 f2 f3\nline2 f4 f5\nline3 f6 f7" | \
awk '{
print "Line no:"NR",No of fields:"NF, "$0="$0, "$1="$1,"$2="$2,"$3="$3 
}' 
Line no:1,No of fields:3 $0=line1 f2 f3 $1=line1 $2=f2 $3=f3 
Line no:2,No of fields:3 $0=line2 f4 f5 $1=line2 $2=f4 $3=f5 
Line no:3,No of fields:3 $0=line3 f6 f7 $1=line3 $2=f6 $3=f7

# print the last field with:
print $NF

# and the second-to-last field with:
print $(NF-1)
</code>

  * Perform summation: <code>$ seq 5 | awk 'BEGIN{ sum=0; print "Summation:" } 
{ print $1"+"; sum+=$1 } END { print "=="; print sum }' 
Summation: 
1+ 
2+ 
3+ 
4+ 
5+ 
==
15</code>

  * Passing variable to awk: <code>$ VAR=10000
$ echo | awk -v VARIABLE=$VAR '{ print VARIABLE }'
10000

# Or:
$ var1="Variable1" ; var2="Variable2"
$ echo | awk '{ print v1,v2 }' v1=$var1 v2=$var2
Variable1 Variable2

# When using a file input:
$ awk '{ print v1,v2 }' v1=$var1 v2=$var2 filename</code>

  * Explicitly read a line: <code>$ seq 5 | awk 'BEGIN { getline; print "Read ahead first line", $0 } { 
print $0 }'
Read ahead first line 1
2
3
4
5</code>

  * Specify conditions for line processing: <code>$ awk 'NR < 5' # first four lines
$ awk 'NR==1,NR==4' #First four lines
$ awk '/linux/' # Lines containing the pattern linux (we can specify a regex)
$ awk '!/linux/' # Lines not containing the pattern linux
</code>

  * We can set the delimiter with -F: <code>$ awk -F: '{ print $NF }' /etc/passwd

# or
$ awk 'BEGIN { FS=":" } { print $NF }' /etc/passwd

# We can set the output fields separator by setting OFS="delimiter" in the BEGIN block.</code>

  * Read output of command from awk:<code>$ echo | awk '{ "grep root /etc/passwd" | getline cmdout ; print cmdout }'
root:x:0:0:root:/root:/bin/bash</code>

  * Using for loop in awk:<code># Prototype:
for(i=0;i<10;i++) { print $i ; }
# or:
for(i in array) { print array[i]; }</code>

  * String manipulation in awk: <code>length(string): This returns the string length.
index(string, search_string): This returns the position at which search_string is found in the string.
split(string, array, delimiter): This stores the list of strings generated by using the delimiter in the array.
substr(string, start-position, length): This returns the substring of the given length that starts at start-position (positions are counted from 1).
sub(regex, replacement_str, string): This replaces the first occurring regular expression match from the string with replacement_str.
gsub(regex, replacement_str, string): This is similar to sub(), but it replaces every regular expression match.
match(string, regex): This returns the result of whether a regular expression (regex) match is found in the string
or not. It returns the position of the match if one is found, otherwise it returns zero. Two special variables are 
associated with match(). They are RSTART and RLENGTH. The RSTART variable contains the position at which the 
regular expression match starts. The RLENGTH variable contains the length of the string matched by the regular 
expression.
</code>
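
The string functions listed above can be tried on a single line of input. The following is a minimal sketch; the variable names `parts`, `n` and `s` are chosen only for this example.

```shell
#!/bin/sh
echo "hello world" | awk '{
  print length($0)                 # 11
  print index($0, "world")         # 7
  print substr($0, 1, 5)           # hello (start position 1, length 5)

  n = split("a:b:c", parts, ":")   # n is 3, parts[1] is "a", parts[3] is "c"
  print n, parts[2]                # 3 b

  s = $0
  sub(/o/, "0", s)                 # first match only
  print s                          # hell0 world
  gsub(/o/, "0", s)                # every remaining match
  print s                          # hell0 w0rld

  if (match($0, /wor/)) {          # string first, then regex
    print RSTART, RLENGTH          # 7 3
  }
}'
```

Note that sub() and gsub() modify their third argument in place and return the number of replacements, which is why the result is read back from `s`.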


==== Finding the frequency of words used in a given file ====

  * Script to use: <code>#!/bin/bash
#Name: word_freq.sh
#Desc: Find out frequency of words in a file
if [ $# -ne 1 ];
then
  echo "Usage: $0 filename";
  exit 1
fi
filename=$1
egrep -o "\b[[:alpha:]]+\b" "$filename" | \
awk '{ count[$0]++ }
END{ printf("%-14s%s\n","Word","Count") ;
for(ind in count)
{  printf("%-14s%d\n",ind,count[ind]);  }
}'</code>
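
The awk loop `for(ind in count)` visits the words in no particular order, so the report is easier to read once it is sorted. A possible variation, reusing the same grep and awk pipeline and adding a sort by count; the function name `word_freq_sorted` is hypothetical:

```shell
#!/bin/sh
# word_freq_sorted FILE: list words with their counts, most frequent first.
# Ties are broken alphabetically by the second sort key.
word_freq_sorted() {
  grep -oE '\b[[:alpha:]]+\b' "$1" |
    awk '{ count[$0]++ } END { for (w in count) printf("%s %d\n", w, count[w]) }' |
    sort -k2,2nr -k1,1
}

sample=$(mktemp)
printf 'the cat and the dog and the bird\n' > "$sample"
word_freq_sorted "$sample"
# the 3
# and 2
# bird 1
# cat 1
# dog 1
rm -f "$sample"
```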

==== Compressing or decompressing JavaScript ====

  * Could use a script such as: <code>$ cat sample.js |  \
tr -d '\n\t' |  tr -s ' ' \
| sed 's:/\*.*\*/::g' \
| sed 's/ \?\([{}();,:]\) \?/\1/g' </code>

  * For decompression: <code>$ cat obfuscated.txt | sed 's/;/;\n/g; s/{/{\n\n/g; s/}/\n\n}/g' </code>

==== Merging multiple files as columns ====

  * paste can be used to do column wise concatenation: <code>$ cat file1.txt
1
2
3
4
5

$ cat file2.txt
slynux
gnu
bash
hack

$ paste file1.txt file2.txt -d ","
1,slynux
2,gnu
3,bash
4,hack
5,</code>
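
paste can also merge the lines of a single stream. The sketch below uses the -s option (serial mode) and the `-` operand, which stands for standard input; the input values are arbitrary.

```shell
#!/bin/sh
# Join all lines of the input into one comma-separated line:
printf '1\n2\n3\n4\n5\n' | paste -s -d "," -      # 1,2,3,4,5

# Fold one stream into two columns by reading stdin twice per output row:
printf 'a\nb\nc\nd\n' | paste -d " " - -
# a b
# c d
```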

==== Printing the nth word or column in a file or line ====

  * Using awk: <code>$ awk '{ print $5 }' filename

# or (the filename column depends on the date format that ls -l prints; it is often $9 rather than $8):
$ ls -l | awk '{ print $1 " :  " $8 }'
-rw-r--r-- :  delimited_data.txt
-rw-r--r-- :  obfuscated.txt
-rw-r--r-- :  paste1.txt
-rw-r--r-- :  paste2.txt</code>
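
When the column number is not known in advance, it can be passed in with -v instead of being hard-coded. A minimal sketch, where `nth_field` is a hypothetical helper name:

```shell
#!/bin/sh
# nth_field N [FILE...]: print the Nth whitespace-separated field of each line.
# Reads standard input when no file is given.
nth_field() {
  n=$1
  shift
  # -v passes the shell value into awk; $n then selects that field
  awk -v n="$n" '{ print $n }' "$@"
}

echo "gnu is not unix" | nth_field 3      # not
```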

==== Printing text between line numbers or patterns ====

  * Print a range of lines with awk: <code># To print the lines of a text in a range of line numbers, M to N, use the following syntax:
$ awk 'NR==M, NR==N' filename

# Or using stdin:
$ cat filename | awk 'NR==M, NR==N'

# to print lines in a section starting with start_pattern and ending with end_pattern, we use:
$ awk '/start_pattern/, /end_pattern/' filename

# for instance:
$ cat section.txt 
line with pattern1 
line with pattern2 
line with pattern3 
line end with pattern4 
line with pattern5 

$ awk '/pa.*3/, /end/' section.txt 
line with pattern3 
line end with pattern4
</code>
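
M and N in the syntax above are placeholders. To take them from shell variables, one option is to pass them with -v, as in this sketch (`print_lines` is a hypothetical name):

```shell
#!/bin/sh
# print_lines M N [FILE...]: print lines M through N inclusive.
print_lines() {
  m=$1
  n=$2
  shift 2
  # exit once line N has been printed, so the rest of the input is not read
  awk -v m="$m" -v n="$n" 'NR >= m && NR <= n { print } NR == n { exit }' "$@"
}

seq 10 | print_lines 3 5
# 3
# 4
# 5
```

The explicit `exit` is a small optimisation for large files; the range form `NR==M, NR==N` keeps reading until the end of the input.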

==== Printing lines in the reverse order ====

  * we can use tac instead of cat:<code>tac file1 file2 ...

# for instance:
$ seq 5 | tac
5 
4 
3 
2 
1

# separator can be specified with -s "separator"</code>

  * Same thing with awk: <code>$ seq 9 | \
awk '{ lifo[NR]=$0 } 
END{ for(lno=NR;lno>0;lno--){ print lifo[lno]; } 
}'

# Note that in shell \ is used to break a single line command into multiple lines.</code>
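
tac is part of GNU coreutils and may be missing on other systems. A possible wrapper that prefers tac and falls back to the awk approach is sketched here; `reverse_lines` is a hypothetical name.

```shell
#!/bin/sh
# reverse_lines [FILE...]: print lines in reverse order.
reverse_lines() {
  if command -v tac >/dev/null 2>&1; then
    tac "$@"
  else
    # store every line, then walk the array backwards from NR down to 1
    awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }' "$@"
  fi
}

seq 3 | reverse_lines
# 3
# 2
# 1
```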

==== Parsing e-mail addresses and URLs from text ====

  * For email, the regex to use is: <code>[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}

# for instance:
$ egrep -o '[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}'  url_email.txt
slynux@slynux.com 
test@yahoo.com 
cool.hacks@gmail.com
</code>

  * For HTTP URL the regex pattern is:<code>http://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}</code>
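
A sketch of the URL pattern in use. Note that the pattern stops after the top-level domain, so only the scheme and host are captured, not the path. Inside a bracket expression the dot and hyphen need no backslashes, so the class is written as `[a-zA-Z0-9.-]` here; `extract_urls` is a hypothetical name.

```shell
#!/bin/sh
# extract_urls [FILE...]: print each distinct http://host found in the input.
extract_urls() {
  # -o prints only the match, -h omits file names when several files are given
  grep -ohE 'http://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}' "$@" | sort -u
}

printf 'see http://www.google.com and http://code.google.com/p today\n' | extract_urls
# http://code.google.com
# http://www.google.com
```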

==== Removing a sentence in a file containing a word ====

  * Using sed for instance: <code>$ sed 's/ [^.]*mobile phones[^.]*\.//g' sentence.txt</code>
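
The pattern reads as: a space, any run of non-period characters, the phrase, another run of non-period characters, then the closing period. A small demonstration on made-up text:

```shell
#!/bin/sh
text="Linux is fun. I never use mobile phones while driving. Bash is art."
echo "$text" | sed 's/ [^.]*mobile phones[^.]*\.//g'
# Linux is fun. Bash is art.
```

Because the pattern starts with a space, a matching sentence at the very beginning of a line is left untouched by this expression.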

==== Replacing a pattern with text in all the files in a directory ====

  * With find and sed: <code>$ find . -name "*.cpp" -print0 |  xargs -I{} -0 sed -i 's/Copyright/Copyleft/g' {}</code>
  
  * Or we can use the exec form: <code>$ find . -name "*.cpp" -exec sed -i 's/Copyright/Copyleft/g' \{\} \;

# or:
$ find . -name "*.cpp" -exec sed -i 's/Copyright/Copyleft/g' \{\} \+
# This second form passes as many filenames as possible to each sed invocation instead of running sed once per file.
</code>
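
Before running a bulk replacement on real sources, it can be tried in a throwaway directory. The sketch below assumes GNU sed (BSD and macOS sed expect `-i ''` instead of a bare `-i`); the file names are invented for the test.

```shell
#!/bin/sh
# Work in a temporary directory so nothing real is modified.
dir=$(mktemp -d)
echo "// Copyright 2011" > "$dir/a.cpp"
echo "// Copyright 2011" > "$dir/b.txt"

# Quote the glob so the shell passes it to find unexpanded.
find "$dir" -name "*.cpp" -exec sed -i 's/Copyright/Copyleft/g' {} +

cat "$dir/a.cpp"    # // Copyleft 2011
cat "$dir/b.txt"    # // Copyright 2011 (not a .cpp file, so untouched)

rm -rf "$dir"
```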

==== Text slicing and parameter operations ====

  * Replacing text inside a variable: <code>$ var="This is a line of text"
$ echo ${var/line/REPLACED} 
This is a REPLACED of text</code>
  
  * Produce a substring: <code>${variable_name:start_position:length}

# for instance:
$ string=abcdefghijklmnopqrstuvwxyz
$ echo ${string:4}
efghijklmnopqrstuvwxyz

$ echo ${string:4:8}
efghijkl

# We can also specify counting from the end of the string:
$ echo ${string:(-1)}
z
$ echo ${string:(-2):2}
yz
</code>