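Since I’m also one of those who read books from the DLI, and I don’t particularly enjoy fetching the pages one by one in TIF format, I’ve been tinkering with this script for about a year, and I think it’s fairly decent by now. It expects a book link copied from the search results as its first argument, and it reads the title, author, page count and path from that link. Give ‘ocr’ as the second argument to also run ocrodjvu on the finished book. It needs wget, bc, and the cjb2 and djvm tools to be installed.
Perhaps it will be useful to someone else. If you try it, please let me know how it fared.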
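To give an idea of what kind of link is meant, here is a small sketch of how the four fields can be pulled out of such a URL. The sample URL is made up (host, path and values are placeholders); only the parameter names title1, author1, pages and url come from the script. The helper names get_field and decode are likewise just for illustration, and the decoding covers only the four % codes the script handles.

```bash
#!/bin/bash

# Hypothetical helper: print the value of query parameter $1 in URL $2.
# If the parameter is missing, sed prints the URL unchanged.
get_field() {
    printf '%s\n' "$2" | sed -E "s/.*[?&]$1=([^&]*).*/\1/"
}

# Hypothetical helper: decode the four % codes the script knows about.
decode() {
    printf '%s\n' "$1" |
        sed -e 's/%20/ /g' -e "s/%27/'/g" -e 's/%28/(/g' -e 's/%29/)/g'
}

# Made-up link; a real one comes from the search results page.
sample='http://example.invalid/search?title1=A%20Made-Up%20Title%20%28Vol.%201%29&author1=Nobody&pages=250&url=/data3/upload/0001/002'

echo "Author: $(decode "$(get_field author1 "$sample")")"
echo "Title:  $(decode "$(get_field title1 "$sample")")"
echo "Pages:  $(get_field pages "$sample")"
echo "Path:   $(get_field url "$sample")"
```

Remember to put the link in quotes when you pass it on the command line, otherwise the shell will cut it at the first `&`.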
```bash
#!/bin/bash

# test if anything was given to the script
if [[ -z $1 ]]; then
    echo "You must give the URL to be processed."
    echo "Enter 'ocr' as second argument to use ocrodjvu."
    exit 1
fi

url=$1

# converts % codes
url=$(echo "$url" | sed -e "s/%20/ /g" -e "s/%27/'/g" -e "s/%28/(/g" -e "s/%29/)/g")

# gets the variables
title=$(echo "$url" | sed -r "s/.*title1=([^\&]*)\&.*/\1/g")
author=$(echo "$url" | sed -r "s/.*author1=([^\&]*)\&.*/\1/g")
pages=$(echo "$url" | sed -r "s/.*pages=([^\&]*)\&.*/\1/g")
path=$(echo "$url" | sed -r "s/.*url=([^\&]*)/\1/g")

# shows them
echo ""
echo -e "Author:\t$author"
echo -e "Title:\t$title"
echo -e "Pages:\t$pages"
echo -e "Path:\t$path"

# assembles the filename
filename="$author - $title"

# tests if the directory named $filename already exists;
# if not, it's created; then changes to its path
if [ -d "$filename" ]; then
    echo -n "The directory '$filename' already exists. "
else
    mkdir "$filename"
fi
cd "$filename" || exit 1

# creates directories to hold the .tif and .djvu files
mkdir -p tif djvu
cd tif || exit 1

# if there is a 'last' file, makes the script continue
# from the page after it; otherwise, starts from 1
if [ -f last ]; then
    firstpage=$(cat last)
    echo "Resuming after page $firstpage..."
    firstpage=$(echo "$firstpage+1" | bc)
else
    firstpage=1
fi

echo ""
tput sc

# the exact path is a hack that happens to work...
# paths containing data1, data2 or data3 are fetched from new1,
# everything else from new
if [[ $path == *data1* || $path == *data2* || $path == *data3* ]]; then
    host=www.new1.dli.ernet.in
else
    host=www.new.dli.ernet.in
fi

# iterates the download for each file
# it avoids downloading files again by checking the
# timestamp of each file, that is the "-N" option
for i in $(seq "$firstpage" "$pages"); do
    page=$(printf "%08d" "$i")
    echo -n "Page $page..."
    if wget -N -q --random-wait "http://$host/$path/PTIFF/$page.tif"; then
        echo -n " done."
        echo "$page" > last
        # converts to djvu
        cjb2 "$page.tif" "../djvu/$page.djvu" > /dev/null 2>&1
        tput el1
        tput rc
    else
        echo "error!"
    fi
done

cd ..

# assembles the djvu pages in one bundle, next to the book's directory
djvm -c "../$filename.djvu" djvu/*.djvu

# ocr (the bundle was written one level up, so both paths need ../)
if [ "$2" = "ocr" ]; then
    ocrodjvu -o "../$filename (ocr).djvu" "../$filename.djvu"
fi
```
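A note on the resume step: the `last` file holds a zero-padded number such as 00000008, and the script hands it to `bc` to add one. Plain bash arithmetic would choke on that, because a leading zero makes it read the number as octal. If you'd rather not depend on `bc`, a possible alternative is to strip the zeros first, as in this sketch (the function names next_page and page_name are made up for the example, they aren't part of the script):

```bash
#!/bin/bash

# Hypothetical helper: page that follows a zero-padded page number.
next_page() {
    # strip leading zeros so the value is never taken for octal
    # (in bash alone, $((10#$1 + 1)) would do the same job)
    n=$(printf '%s\n' "$1" | sed 's/^0*//')
    echo $(( ${n:-0} + 1 ))
}

# Hypothetical helper: file name of a page, padded the way the script does it.
page_name() {
    printf '%08d.tif\n' "$1"
}

last=00000008    # what the 'last' file would contain after page 8
next=$(next_page "$last")
echo "resume at page $next -> $(page_name "$next")"
# prints: resume at page 9 -> 00000009.tif
```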
