Co[de]mmunications

Programming and ramblings on software engineering.

Shell-Scripting for Fun and Profit: Finding, Downloading, and Merging Course Slides to a Single PDF

Motivation

Professors often provide a page with links to lecture slides in PDF format. The main benefit of joining them into a single document is that a reader’s search function can then find topics and keywords across all lectures at once, which is faster than searching each document separately.

The Script

Set the URL variable to the page that links to the lecture slides, then edit the grep and pdfjoin lines so that their patterns match the filenames of those slides.

#! /bin/sh

TARGET_FILE=all-lectures.pdf
URL="http://www.ida.liu.se/~TDDC88/theory/lectures.shtml"

# Abort early if a required tool is missing.
for prog in wget lynx pdfjoin; do
    if ! command -v "$prog" >/dev/null 2>&1; then
        echo "$prog needed but not found." >&2
        exit 1
    fi
done

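# List every link on the page, skip the first three lines of the listing,
# keep only the URL column, then filter by filename.
PDF_URLS=$(
    lynx -listonly -dump -hiddenlinks=merge "$URL" \
    | tail -n +4 \
    | awk '{print $2}' \
    | grep 'lecture-.*-pps6\.pdf'
    )

if [ -z "$PDF_URLS" ]; then
    echo "No matching PDF links found at $URL." >&2
    exit 1
fi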

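TEMPDIR=$(mktemp -d) || exit 1
cd "$TEMPDIR" || exit 1
echo "Fetching PDFs..."
# $PDF_URLS is left unquoted on purpose: each URL must become its own argument.
wget $PDF_URLS
echo "Joining documents..."
# seq emits one glob per lecture number so the files are joined in numeric order.
pdfjoin $(seq -f 'lecture-%g-*-pps6.pdf' 1 $(echo $PDF_URLS | wc -w))
cd - >/dev/null
mv "$TEMPDIR"/*-joined.pdf "$TARGET_FILE"
rm -r "$TEMPDIR"
echo "PDF $TARGET_FILE was generated."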
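
The pdfjoin line builds its argument list with seq instead of a single glob. A likely reason is ordering: a plain glob is sorted as strings, so lecture-10 would not come last. The sketch below illustrates the difference on empty placeholder files. The helper names (make_demo_files, plain_glob_order, numbered_glob_order) and the "topic" filenames are made up for the illustration and are not part of the script above.

```shell
#!/bin/sh
# Sketch: compare plain glob order with one-glob-per-number order.
# All helper names and file names here are hypothetical.

make_demo_files() {
    # Create ten empty placeholder files named like the lecture slides.
    for i in $(seq 1 10); do
        : > "$1/lecture-$i-topic-pps6.pdf"
    done
}

plain_glob_order() {
    # A single glob expands in the shell's string-sorted order, not numeric order.
    (cd "$1" && echo lecture-*-pps6.pdf)
}

numbered_glob_order() {
    # seq prints one pattern per lecture number; the unquoted substitution
    # lets the shell expand each pattern in turn, preserving numeric order.
    (cd "$1" && echo $(seq -f 'lecture-%g-*-pps6.pdf' 1 10))
}

demo_dir=$(mktemp -d) || exit 1
make_demo_files "$demo_dir"
echo "plain glob:    $(plain_glob_order "$demo_dir")"
echo "numbered glob: $(numbered_glob_order "$demo_dir")"
rm -r "$demo_dir"
```

One caveat of the numbered approach: if the lecture numbers have a gap, the pattern for the missing number matches nothing and is passed on unexpanded, so the seq range has to fit the files that were actually downloaded.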

Future Improvements

  • Get rid of the lynx dependency
  • Download all found documents simultaneously (in background jobs, perhaps)
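
Both improvements could be approached roughly as follows. This is a minimal sketch, not a tested replacement for the script: extract_pdf_links and fetch_all are hypothetical helper names, the link extraction assumes lowercase, double-quoted href attributes, and grep -o is a common extension rather than a POSIX feature.

```shell
#!/bin/sh
# Sketch of the two improvements; helper names are hypothetical.

# Read HTML on stdin and print one absolute PDF URL per line.
# $1 is the URL of the page, used to resolve relative links.
# Assumes links look like href="...pdf" (lowercase, double quotes).
extract_pdf_links() {
    base=$1
    grep -o 'href="[^"]*\.pdf"' \
    | sed -e 's/^href="//' -e 's/"$//' \
    | while IFS= read -r link; do
        case $link in
            http://*|https://*)
                # Already absolute.
                printf '%s\n' "$link" ;;
            /*)
                # Root-relative: keep only scheme and host of the page URL.
                host=$(printf '%s\n' "$base" | sed 's|^\([a-z]*://[^/]*\).*|\1|')
                printf '%s%s\n' "$host" "$link" ;;
            *)
                # Relative to the directory of the page.
                printf '%s/%s\n' "${base%/*}" "$link" ;;
        esac
    done
}

# Start one wget per URL in the background, then wait for all of them.
# Failed downloads are not detected in this sketch.
fetch_all() {
    for url in "$@"; do
        wget -q "$url" &
    done
    wait
}

# Possible use in the script, replacing the lynx pipeline and the wget line:
#   PDF_URLS=$(wget -q -O - "$URL" | extract_pdf_links "$URL" \
#              | grep 'lecture-.*-pps6\.pdf')
#   fetch_all $PDF_URLS
```

Starting every download at once is simple but puts the full load on the server in one go; for long link lists, limiting the number of concurrent jobs would be kinder.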
