Amazon’s Glacier storage service provides very cheap storage.  As of this writing, it costs $0.011 per GB per month.  However, some transfer fees apply, so make sure you know what it costs to transfer a large number of files.  Glacier is designed around offline storage, and a restore request can take 3 to 5 hours to process.  It is perfect for backups that you expect to touch rarely.

I wanted to use this but I had 3 main requirements:

  • I wanted it encrypted with open standards and I wanted nobody but me to have the keys: Trust No One!
  • I wanted the ability to do incremental backups because I have many GB of data I want to upload
  • The solution must work in Linux since that is where my files are stored

Incremental backups with Glacier can be difficult: to back up incrementally you have to know what is already backed up, and you can’t see what is in Glacier without waiting 3 to 5 hours.  And if the backed-up files are encrypted, you can’t easily tell whether they are the same files or not.  Alternatively, you can record what you have backed up on the local computer in a flat file or database.  My backup script uses a flat file to record what has been backed up, along with each file’s modify time and size.  The script is pasted at the bottom of this post.

The script uploads your files to S3, and you can then set the Lifecycle on your bucket to archive them to Glacier after 1 day.  If you need to restore from Glacier to S3 so you can download to your computer, right-click the object in the web interface and choose Initiate Restore.  I was not able to do that on a folder, but I found a Windows program called S3 Browser that lets you initiate a restore on a folder.  Since restoring will be a rare occurrence, that will work for now.  I still need to test whether it works in Wine.
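The Lifecycle rule can also be written down as a small XML policy instead of being clicked together in the web interface.  What follows is only a sketch: the rule ID is a made-up placeholder, and the commented-out s3cmd line assumes your s3cmd release includes the setlifecycle command, which older releases may not.

```shell
#!/bin/bash
# Sketch only: the rule ID and the temp file are placeholders for illustration.

policy=$(mktemp)

# Lifecycle rule: move every object in the bucket to Glacier one day after upload.
cat > "$policy" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<LifecycleConfiguration>
  <Rule>
    <ID>archive-to-glacier</ID>
    <Prefix></Prefix>
    <Status>Enabled</Status>
    <Transition>
      <Days>1</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
EOF

echo "Lifecycle policy written to $policy"

# Assumption: your s3cmd has setlifecycle. If so, the policy could be applied
# with something like:
#   s3cmd -c "$s3cfg" setlifecycle "$policy" "s3://$bucket"
```

The empty Prefix makes the rule apply to the whole bucket.  Check the result in the web interface before trusting it with real backups.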

The only thing you should need to install is s3cmd, from the s3tools project.  It is in the repositories of most major distros.  Run “s3cmd --configure” to generate a .s3cfg file that stores your Amazon keys as well as your encryption key.  You can move the .s3cfg file to a safe place if you want to protect your encryption key; you will then have to specify its location in the script.  Also keep a copy of your encryption key somewhere else.  It can’t be recovered for you, so don’t lose it!
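Since the config file holds both your Amazon keys and your encryption key, it may be worth a quick sanity check before each run.  This is a rough sketch and not part of my script: check_s3cfg is a name I made up, and it assumes GNU stat and the gpg_passphrase setting that s3cmd keeps in .s3cfg.

```shell
#!/bin/bash
# Sketch only: check_s3cfg is a hypothetical helper, not something s3cmd provides.

# Succeed only if the config is readable, closed to group/other,
# and sets a non-empty gpg_passphrase.
check_s3cfg() {
    local cfg=$1 mode
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }
    mode=$(stat -c %a -- "$cfg")        # GNU stat: octal permissions, e.g. 600
    case $mode in
        ?00) ;;                         # owner-only access is fine
        *) echo "$cfg has mode $mode; consider chmod 600" >&2; return 1 ;;
    esac
    grep -Eq '^gpg_passphrase[[:space:]]*=[[:space:]]*[^[:space:]]' "$cfg" \
        || { echo "no gpg_passphrase set in $cfg" >&2; return 1; }
}

# Usage: fall back to the default location if s3cfg is not set.
s3cfg=${s3cfg:-$HOME/.s3cfg}
if check_s3cfg "$s3cfg"; then
    echo "config looks usable"
else
    echo "fix the config before running the backup"
fi
```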

List the directories you want backed up in the SOURCE array.  The find command then goes through each directory and pulls out the files.  A SHA-1 hash is made of each file’s path, modification time and size.  Once the file is uploaded to S3, that info is written to the log file.  The log file must always stay on the computer because it is the only record of what has been backed up.  The script searches the log for the hash rather than the file name because I didn’t want false matches between similar names such as /path/to/file and /another/path/to/file.  I suppose two files could collide on the hash, but that would be extremely rare.
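Pulled out of the main loop, the fingerprint-and-lookup idea looks roughly like this.  It is a simplified sketch: the function names file_fingerprint and already_backed_up are placeholders of mine, it assumes GNU stat, and the demo runs on a throwaway temp directory instead of real backup data.

```shell
#!/bin/bash
# Sketch only: file_fingerprint and already_backed_up are hypothetical helper names.

# Print a SHA-1 fingerprint of "path size mtime" for one file.
file_fingerprint() {
    local path=$1 size mtime
    size=$(stat -c %s -- "$path") || return 1
    mtime=$(stat -c %Y -- "$path") || return 1
    # sha1sum prints "<hash>  -" when reading stdin; keep only the hash field.
    printf '%s %s %s\n' "$path" "$size" "$mtime" | sha1sum | awk '{print $1}'
}

# Succeed if the fingerprint already appears in the manifest.
already_backed_up() {
    local hash=$1 manifest=$2
    grep -qF -- "$hash" "$manifest"
}

# Demo on throwaway files.
workdir=$(mktemp -d)
manifest="$workdir/manifest.log"
touch "$manifest"
printf 'hello\n' > "$workdir/photo.jpg"

hash=$(file_fingerprint "$workdir/photo.jpg")
if ! already_backed_up "$hash" "$manifest"; then
    echo "would upload photo.jpg"
    echo "$workdir/photo.jpg :///: $hash" >> "$manifest"
fi
already_backed_up "$hash" "$manifest" && echo "photo.jpg is now recorded"
```

Because the size and modify time are part of the hash, editing a file gives it a new fingerprint, so it gets uploaded again on the next run.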

Besides the SOURCE variable, you will also need to specify $logFile which is your flat file with your uploaded file info.  $bucket is your bucket name in S3 and $s3cfg is the location of your config file for s3cmd.

The s3cmd I used does not give a usable exit status, so scripting around it is difficult.  I capture anything written to stderr in a variable.  If the variable is empty, the upload is considered successful.  If it is not empty, the output may be only a warning, so the script checks whether the file exists in S3 with a time stamp within two minutes of the current time (an allowance for differences between remote and local clocks).  If so, the upload is considered successful and the event is recorded in the error log.  If there is no file, or its time stamp is not within the last two minutes, stderr is sent to the error log and the upload is considered failed.
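The two tricks in that paragraph, grabbing only stderr and comparing time stamps with a tolerance, can be tried without touching S3.  In this sketch capture_stderr, within_window and noisy_upload are names I invented, and noisy_upload merely stands in for an s3cmd call that succeeds but prints a warning.

```shell
#!/bin/bash
# Sketch only: all three function names are hypothetical.

# Run a command, discard its stdout, and print only what it wrote to stderr.
# Order matters: 2>&1 points stderr at the captured stream first,
# then >/dev/null sends stdout away.
capture_stderr() {
    "$@" 2>&1 >/dev/null
}

# Succeed if two epoch time stamps differ by at most $3 seconds (default 120).
within_window() {
    local remote=$1 now=$2 window=${3:-120}
    local diff=$(( now - remote ))
    (( diff >= -window && diff <= window ))
}

# Stand-in for an upload that works but emits a warning.
noisy_upload() {
    echo "upload: done"
    echo "WARNING: something odd" >&2
}

err=$(capture_stderr noisy_upload)
remote_time=$(date +%s)   # in the real script this comes from "s3cmd ls"
if [ -z "$err" ]; then
    echo "clean upload"
elif within_window "$remote_time" "$(date +%s)"; then
    echo "warning only, treating as uploaded: $err"
else
    echo "failed: $err"
fi
```

The window is symmetric because the remote clock may run slightly ahead of the local one as well as behind it.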

s3cmd uses the -e switch to encrypt your files.  It uses gpg’s (or pgp’s) symmetric encryption with CAST5 (CAST-128), which is specified in RFC 2144.  If for some reason s3cmd will not work on my system, I am still using an open standard for encryption, so I know I can use other programs to decrypt my wedding and baby photos.  I have tested this.

The script also uploads the backup log to S3 for safekeeping, along with the backup script itself in case it is not in your list of files to back up.  These two files are uploaded unencrypted.

Here is the script.  If you have any questions or suggestions on how to make it better email me.

Update: the month after I was charged for the file transfer into Glacier, my 83GB of data cost me 85 cents to host.

#!/bin/bash
#
 
# Note, to pull a file from s3 use "s3cmd get s3://bucket/file destinationfile"
# You must have the proper .s3cfg file in place to decrypt the file.
 
# You may also use "gpg encryptedfile" and supply the encryption code if you download
# from the web interface. Good luck.
 
# The bucket should be set to transfer to Glacier. To retrieve, you need to initiate a
# retrieval request from the s3 web interface. To retrieve an entire folder, there is a
# Windows program called S3 Browser that can transfer entire folders out of Glacier.
 
# Define the folders of files to be backed up in SOURCE
SOURCE=(
"/home/owner/Documents"
"/home/owner/Pictures"
"/mnt/files/Photographs"
"/mnt/files/Documents"
"/mnt/files/Home Movies"
)
 
IFS=$(echo -en "\n\b")
logFile=/mnt/files/scripts/backupmanifest.log
bucket=MyBucket
s3cfg=/home/owner/.s3cfg
touch $logFile
echo Finding files and performing backup: Please wait...
 
# for loop to go through each item of array SOURCE which should contain the
# directories to be backed up
 
for i in "${SOURCE[@]}"
do
 
# nested for loop to run find command on each directory in SOURCE
 
for x in $(find "$i")
do
# x is each file or dir found by 'find'. if statement determines if it is a regular file
 
if [ -f "$x" ]
then
# create a hash to mark the time and date of the file being backed up to compare later for
# incremental backups
 
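fileSize=$(stat -c %s "$x")
modTime=$(stat -c %Y "$x")
# sha1sum prints "<hash>  -" for stdin; awk keeps only the hash itself
myHash=$(echo "$x" "$fileSize" "$modTime" | sha1sum | awk '{print $1}')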
 
# If statement to see if the hash is found in log, meaning it is already backed up.
# If not found proceed to backup
 
if ! grep -qF -- "$myHash" "$logFile"
then
echo Currently uploading $x
 
# s3cmd command to put an encrypted file in the s3 bucket
# s3out var should capture anything in stderr in case of file transfer error or some other
# problem. If s3out is blank, the transfer occurred without incident. if an error occurs
# no output is written to the log file but output is written to an error log and s3out is
# written to the screen.
 
s3out=$(s3cmd -c "$s3cfg" -e put "$x" "s3://$bucket/$HOSTNAME$x" 2>&1 > /dev/null)
if [ "$s3out" = "" ]
then
echo $x :///: $fileSize :///: $modTime :///: $myHash >> $logFile
else
# s3out had content, but was possibly a warning and not an error.. Checking to see if
# there exist an upload file within the last 2 minutes. If so, the file will be considered
# uploaded. Two minutes is to account for variance between local and remote time signatures.
 
date1=$(date --date="$(s3cmd -c "$s3cfg" ls "s3://$bucket/$HOSTNAME$x" | awk '{print $1 " " $2 " +0000"}')" +%s)
date2=$(date +%s)
 
datediff=$(($date2-$date1))
 
if [[ $datediff -ge -120 ]] && [[ $datediff -le 120 ]]
then
echo There was a possible error but the time of the uploaded file was written within
echo the last 2 minutes. File will be considered uploaded and recorded as such.
echo $x :///: $fileSize :///: $modTime :///: $myHash >> $logFile
echo `date`: $x had warnings but seemed to be successfully uploaded and was logged to main log file >> $logFile.err
else
echo $s3out
echo `date`: $s3out >> $logFile.err
fi
echo ------------------------------------------------------------------------------------
fi
fi
fi
 
done
done
 
# processed all files in SOURCE. Now upload actual script and file list. They are not encrypted.
 
echo Uploading $logFile
s3cmd -c "$s3cfg" put "$logFile" "s3://$bucket/" > /dev/null
echo Uploading $0
s3cmd -c "$s3cfg" put "$0" "s3://$bucket/" > /dev/null
 
echo
echo Backup to S3 has been completed. You may proceed with life.