===== 5. Tangled Web? Not At All! =====

==== Downloading from a web page ====

  * Using wget:
<code>
$ wget URL

# Specify the output file with -O and a log file (instead of stdout) with -o:
$ wget ftp://example_domain.com/somefile.img -O dloaded_file.img -o log

# Specify the number of retries with -t:
$ wget -t 5 URL
$ wget -t 0 URL    # retry indefinitely

# Restrict the download speed (k for kilobyte, m for megabyte):
$ wget --limit-rate 20k http://example.com/file.iso

# Resume a download:
$ wget -c URL

# Copy a complete website:
$ wget --mirror --convert-links exampledomain.com

# Or limit the depth of the copy:
$ wget -r -N -l DEPTH -k URL

# Access pages that require authentication:
$ wget --user username --password pass URL
</code>

==== Downloading a web page as plain text ====

  * Usage of lynx:
<code>
$ lynx URL -dump > webpage_as_text.txt
</code>
  * The -nolist option removes the numbered link references that lynx adds.

==== A primer on cURL ====

  * Prevent curl from displaying progress information with the --silent option.
  * curl usage:
<code>
# -O writes the output to a file named after the remote filename:
$ curl URL --silent -O

# Show a progress bar:
$ curl http://slynux.org -o index.html --progress-bar

# Resume a download:
$ curl -C - URL

# Specify the referer string:
$ curl --referer Referer_URL target_URL

# Specify cookies:
$ curl http://example.com --cookie "user=slynux;pass=hack"

# Set the user agent:
$ curl URL --user-agent "Mozilla/5.0"

# Pass additional headers:
$ curl -H "Host: www.slynux.org" -H "Accept-language: en" URL

# Specify a speed limit:
$ curl URL --limit-rate 20k

# Authenticate with curl:
$ curl -u user:pass http://test_auth.com
# or with a password prompt:
$ curl -u user http://test_auth.com
</code>
  * Use the -I or --head option with curl to dump only the HTTP headers, without downloading the remote file. For example:
<code>
$ curl -I http://slynux.org
</code>

==== Accessing Gmail e-mails from the command line ====

  * Could use a script such as:
<code>
#!/bin/bash
#Desc: Fetch gmail tool

username='PUT_USERNAME_HERE'
password='PUT_PASSWORD_HERE'

SHOW_COUNT=5 # No of recent unread mails to be shown

echo
curl -u $username:$password --silent "https://mail.google.com/mail/feed/atom" | \
  tr -d '\n' | sed 's:</entry>:\n:g' | \
  sed -n 's/.*<title>\(.*\)<\/title.*<author><name>\([^<]*\)<\/name><email>\([^<]*\).*/From: \2 [\3] \nSubject: \1\n/p' | \
  head -n $(( $SHOW_COUNT * 3 ))
</code>

==== Parsing data from a website ====

  * Parsing content is usually done with sed and awk:
<code>
$ lynx -dump -nolist http://www.johntorres.net/BoxOfficefemaleList.html | \
    grep -o "Rank-.*" | \
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' | \
    sort -nk 1 > actresslist.txt
</code>

==== Image crawler and downloader ====

  * Could use a script such as:
<code>
#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit -1
fi

for i in {1..4}
do
  case $1 in
    -d) shift; directory=$1; shift ;;
     *) url=${url:-$1}; shift ;;
  esac
done

mkdir -p $directory;
baseurl=$(echo $url | egrep -o "https?://[a-z.]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" | \
  sed 's/<img src=\"\([^"]*\).*/\1/g' > /tmp/$$.list

sed -i "s|^/|$baseurl/|" /tmp/$$.list

cd $directory;

while read filename;
do
  echo Downloading $filename
  curl -s -O "$filename" --silent
done < /tmp/$$.list
</code>
  * Usage example:
<code>
$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images
</code>
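  * As a rough alternative (a hedged sketch, not from the book), wget's recursive mode can often pull the images from a single page on its own; the URL below is only a placeholder:
<code>
# -r: recurse, -l 1: only one level deep, -nd: no directory hierarchy,
# -A: keep only files with these extensions, -P: save into the images/ directory
$ wget -r -l 1 -nd -A jpg,jpeg,png,gif -P images "http://example.com/gallery"
</code>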
"Creating album..." mkdir -p thumbs cat <<EOF1 > index.html <html> <head> <style> body { width:470px; margin:auto; border: 1px dashed grey; padding:10px; } img { margin:5px; border: 1px solid black; } </style> </head> <body> <center><h1> #Album title </h1></center> <p> EOF1 for img in *.jpg; do convert "$img" -resize "100x" "thumbs/$img" echo "<a href=\"$img\" ><img src=\"thumbs/$img\" title=\"$img\" /></ a>" >> index.html done cat <<EOF2 >> index.html </p> </body> </html> EOF2 echo Album generated to index.html </code> ==== Twitter command-line client ==== - We need to download the bash-oauth library from https://github.com/livibetter/bash-oauth/archive/master.zip - Then install from the sub dir bash-oauth-master with: <code># make install-all</code> - Go to https://dev.twitter.com/apps/new and register a new app. - Provide read/write access to the new app. - Retrieve the consumer key and the consumer secret - Then use the following script: <code>#!/bin/bash #Filename: twitter.sh #Description: Basic twitter client oauth_consumer_key=YOUR_CONSUMER_KEY oauth_consumer_secret=YOUR_CONSUMER_SECRET config_file=~/.$oauth_consumer_key-$oauth_consumer_secret-rc if [[ "$1" != "read" ]] && [[ "$1" != "tweet" ]]; then echo -e "Usage: $0 tweet status_message\n OR\n $0 read\n" exit -1; fi source TwitterOAuth.sh TO_init if [ ! -e $config_file ]; then TO_access_token_helper if (( $? == 0 )); then echo oauth_token=${TO_ret[0]} > $config_file echo oauth_token_secret=${TO_ret[1]} >> $config_file fi fi source $config_file if [[ "$1" = "read" ]]; then TO_statuses_home_timeline '' 'shantanutushar' '10' echo $TO_ret | sed 's/<\([a-z]\)/\n<\1/g' | \ grep -e '^<text>' -e '^<name>' | sed 's/<name>/\ - by /g' | \ sed 's$</*[a-z]*>$$g' elif [[ "$1" = "tweet" ]]; then shift TO_statuses_update '' "$@" echo 'Tweeted :)' fi </code> * Then to use the script: <code>$ ./twitter.sh read Please go to the following link to get the PIN: https://api.twitter.com/ oauth/authorize?oauth_token=GaZcfsdnhMO4HiBQuUTdeLJAzeaUamnOljWGnU PIN: 4727143 Now you can create, edit and present Slides offline. - by A Googler $ ./twitter.sh tweet "I am reading Packt Shell Scripting Cookbook" Tweeted :) $ ./twitter.sh read | head -2 I am reading Packt Shell Scripting Cookbook - by Shantanu Tushar Jha </code> ==== Creating a "define" utility by using the Web backend ==== * Register for an account on a dictionary website. 
==== Creating a "define" utility by using the Web backend ====

  * Register for an account on a dictionary website (the script below uses dictionaryapi.com).
  * Then use a script such as:
<code>
#!/bin/bash
#Filename: define.sh
#Desc: A script to fetch definitions from dictionaryapi.com

apikey=YOUR_API_KEY_HERE

if [ $# -ne 2 ];
then
  echo -e "Usage: $0 WORD NUMBER"
  exit -1;
fi

curl --silent http://www.dictionaryapi.com/api/v1/references/learners/xml/$1?key=$apikey | \
  grep -o \<dt\>.*\</dt\> | \
  sed 's$</*[a-z]*>$$g' | \
  head -n $2 | nl
</code>

==== Finding broken links in a website ====

  * lynx and curl can be used to find broken links:
<code>
#!/bin/bash
#Filename: find_broken.sh
#Desc: Find broken links in a website

if [ $# -ne 1 ];
then
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

echo Broken links:

mkdir /tmp/$$.lynx
cd /tmp/$$.lynx

lynx -traversal $1 > /dev/null
count=0;

sort -u reject.dat > links.txt

while read link;
do
  output=`curl -I $link -s | grep "HTTP/.*OK"`;
  if [[ -z $output ]]; then
    echo $link;
    let count++
  fi
done < links.txt

[ $count -eq 0 ] && echo No broken links found.
</code>

==== Tracking changes to a website ====

  * We use curl and diff to do this:
<code>
#!/bin/bash
#Filename: change_track.sh
#Desc: Script to track changes to a webpage

if [ $# -ne 1 ];
then
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

first_time=0
# Not first time

if [ ! -e "last.html" ];
then
  first_time=1
  # Set it is the first time run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ];
then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ];
  then
    echo -e "Changes:\n"
    echo "$changes"
  else
    echo -e "\nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."
fi

cp recent.html last.html
</code>

==== Posting to a web page and reading the response ====

  * Automating a POST request with curl:
<code>
$ curl URL -d "postvar=postdata2&postvar2=postdata2"

# For instance:
$ curl http://book.sarathlakshman.com/lsc/mlogs/submit.php -d "host=test-host&user=slynux"
<html>
You have entered :
<p>HOST : test-host</p>
<p>USER : slynux</p>
<html>
</code>
  * With wget we can post with the --post-data argument:
<code>
$ wget http://book.sarathlakshman.com/lsc/mlogs/submit.php --post-data "host=test-host&user=slynux" -O output.html

$ cat output.html
<html>
You have entered :
<p>HOST : test-host</p>
<p>USER : slynux</p>
<html>
</code>
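  * If a form value contains spaces or other special characters, curl can URL-encode each field with --data-urlencode (a hedged addition; the endpoint is the same example server used above, so the exact response is not guaranteed):
<code>
$ curl http://book.sarathlakshman.com/lsc/mlogs/submit.php \
    --data-urlencode "host=test host" \
    --data-urlencode "user=slynux"
</code>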