Digitization
March 7 2019
This is the story of how I digitized all of my family's photos, home movies and files. I discuss the challenges I faced along the way as well as how I dealt with them.
Over the 2017 holidays, I decided to collect and digitize all of my family’s home media. This included printed photos, tapes and various already-digital storage media. My goal was to make it easier to access our memories and preserve them by saving them to the cloud. I was particularly concerned by the deterioration of the discs and tapes, but making all of the content available from anywhere was also appealing.
The first stage of this process was to collect all of the media and whatever I needed to copy it onto a computer.
All of the scripts below were written for Bash 4 and run on a Mac. I relied on shellharden to improve them. I hope they can be useful to others, but it should go without saying: use them at your own risk.
Discs
We had created many computer backups over the years as we upgraded our PCs. This resulted in a large stack of CDs and DVDs that I would need to copy to my Mac and dig through. Getting the data off was not challenging, just tedious. I wrote the script below to automatically copy all data off of a disc and eject it once finished, so that I could “be productive” (watch TV) while spending the least effort setting up and finishing each copy.
#!/usr/bin/env bash
# Assumes macOS
# Get /dev/disk* of external CD/DVD drive
dev="$(drutil status | grep -m1 -o '/dev/disk.*')"
if [ -z "$dev" ]; then
  echo "Error: no disk inserted"
  exit 1
fi
# Get /Volumes/* of external CD/DVD, assumes mounted
mount="$(mount | grep "${dev}\w*" | grep -o '/Volumes/.*')"
in="${mount%% (*}"
out="$PWD/${in##*/}"/
mkdir -p "$out"
cp -Rv "$in" "$out"
drutil eject
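Before ejecting, it may be worth confirming that the copy matches the disc. This is a minimal sketch, assuming a recursive `diff` is a good enough check; `verify_copy` is a hypothetical helper and not part of the script above.

```bash
#!/usr/bin/env bash
# verify_copy SRC DST: succeed only if both directory trees have identical contents.
# Hypothetical helper; diff -rq compares recursively and only reports which files differ.
verify_copy() {
  diff -rq "$1" "$2" > /dev/null
}
```

It would be called with the disc's mount point and the directory that cp created, and the eject could be made conditional on it succeeding.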
I did find a few discs that macOS could not read. I assumed these were corrupt, and tried dd, ddrescue, sleuthkit and foremost with no luck. A disc repair tool didn’t make the discs readable either. I ended up trying a Windows PC, which was able to get most of the data off. The discs had apparently been created with Windows Live OneCare and had never finished burning.
Each time that we moved my mom to a new laptop, we would copy the entirety of her Desktop, Photos and other user folders to her new Desktop. This would also copy over old backups, making a 3 or 4 level deep hierarchy of Desktop and backup folders on the latest backups. These backups held a lot of junk files that I didn’t care about either. This, in combination with the many different backup discs, resulted in a lot of duplicated files, many of which I had no desire to keep. At this point, I used old fashioned manual labor to sort through the contents.
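A pass for exact duplicates could shorten this kind of manual sorting. The sketch below is an illustration rather than part of the original workflow. It assumes Bash 4, uses `cksum` as a cheap lookup key and `cmp` to confirm that two files really are byte-identical. `find_dupes` is a hypothetical name.

```bash
#!/usr/bin/env bash
# find_dupes DIR: print "copy == first-seen" for every byte-identical file under DIR.
# Hypothetical helper; requires Bash 4 for associative arrays.
find_dupes() {
  local file key
  local -A seen=()
  while IFS= read -r -d '' file; do
    # cksum prints "CRC SIZE" for stdin; join the two fields into one lookup key
    key="$(cksum < "$file")"
    key="${key/ /:}"
    if [[ -n "${seen[$key]:-}" ]] && cmp -s "$file" "${seen[$key]}"; then
      printf '%s == %s\n' "$file" "${seen[$key]}"
    else
      seen[$key]="$file"
    fi
  done < <(find "$1" -type f -print0 | sort -z)
}
```

The output is only a list of candidates to review; nothing is deleted.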
Photos
Digital Photos
Having sorted out all of the photos copied from discs and computer backups, I had ended up with a huge number of duplicates. findimagedupes worked great to eliminate most of the duplicates. While a few copies remained, most still had their EXIF data allowing me to select the ones with larger file sizes (and presumably resolution) by hand. ExifTool allowed me to sort all of my photos safely, and find any images with identical EXIF data. I used the following script to organize all of the photos into a consistent format, which I could then organize by event or location later.
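#!/usr/bin/env bash
# $1: input dir
# Later assignments win: datetimeoriginal is preferred, then createdate, then filemodifydate
exiftool -r "$1" -d '%y%m%d%H%M%S%%-c.%%e' '-filename<filemodifydate' '-filename<createdate' '-filename<datetimeoriginal'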
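Since the command above renames files in place, a dry run is a cheap safeguard. The sketch below writes to ExifTool's TestName pseudo-tag instead of FileName, so it only prints the names that would be assigned. `preview_names` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# preview_names DIR: show the names exiftool would assign, without renaming anything.
# Writing to TestName instead of FileName turns the rename into a dry run.
preview_names() {
  exiftool -r "$1" -d '%y%m%d%H%M%S%%-c.%%e' \
    '-testname<filemodifydate' '-testname<createdate' '-testname<datetimeoriginal'
}
```

Once the printed names look right, the same arguments with `-filename` perform the actual rename.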
Film Photos
As of this point, I have still not digitized our film photos. My mom shot film from way before I was born until well into the 2000s. We also have a bunch of prints from my grandparents and other relatives. This leaves me with a mountain of photos to digitize, on top of the original film negatives and even some slides.
At least it seems that all of my mom’s film photos were printed so we can just scan them into the computer; I even bought a specialty roller scanner just for 4”x5” prints, as otherwise it would probably take an eternity on our cheap multifunction printer/scanner. For large prints from my grandparents we will have to use either the bed scanner or a camera mounted to a tripod.
Videos
Tapes
For our home movies I had to digitize VHS, Video8 and a single VHS-C. I processed all of these by recording them being played back through a USB capture card. This was a relatively lengthy process, as the videos had to be played at normal speed, and I couldn’t automate the capture. I used our VHS player and a camcorder I bought off eBay to play back our tapes. I bought a VHS-C to VHS adapter just for that single tape which, of course, turned out to already have been copied to VHS. The first Video8 camcorder I bought broke after only a couple of recordings. The second time around, I bought a high-end model made much more recently with some image correction and an S-Video output.
Digital Videos
I gathered our digital videos from disc backups and from my parents’ current computers. These were saved in a bunch of different formats such as .MOD, .VOB and .wmv, so I opted to convert them all to .mp4 with FFmpeg.
#!/usr/bin/env bash
# Find and convert all MOD, wmv and DVD videos to mp4
# $1: input dir
shopt -s globstar nullglob # requires bash 4
for file in "$1"/**/*.{MOD,VOB,wmv}; do
  out="$(echo "$file" | sed 's/VIDEO_TS\///')"
  ffmpeg -i "$file" "${out%.*}".mp4
done
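The output path logic in the loop above is easy to misread, so here it is isolated as a pure-Bash sketch: it drops the first `VIDEO_TS/` component and swaps the extension for `.mp4`. `mp4_path` is a hypothetical helper that produces the same result as the `sed` call without spawning a process.

```bash
#!/usr/bin/env bash
# mp4_path FILE: print the output path used for FILE.
# Removes the first "VIDEO_TS/" component, then replaces the extension with .mp4.
mp4_path() {
  local out="${1/VIDEO_TS\//}"
  printf '%s\n' "${out%.*}.mp4"
}
```

For example, `mp4_path 'dvd1/VIDEO_TS/VTS_01_1.VOB'` prints `dvd1/VTS_01_1.mp4`, so converted DVD files land next to the `VIDEO_TS` folder rather than inside it.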
DVDs split their video across several .VOB files. In some cases, I wanted these to be re-merged into a single video for playback.
#!/usr/bin/env bash
# Merge all .mp4's under $1/ into $1.mp4 in default order (for DVD scenes)
# $1: input dir
in="$(
  for file in "$1"/*.mp4; do
    echo "file '$PWD/$file'"
  done)"
out="$(basename "$1")".mp4
ffmpeg -f concat -safe 0 -i <(echo "$in") -c copy "$out"
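The list fed to the concat demuxer breaks if a filename contains a single quote or if the input directory is given as an absolute path. The sketch below is one way to build a more robust list, assuming the directory exists. `concat_list` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# concat_list DIR: print one "file '...'" line per .mp4 in DIR, in glob order.
# Paths are made absolute, and each single quote is written as '\'' so the
# concat demuxer reads it as a literal quote.
concat_list() {
  local dir file escaped
  dir="$(cd "$1" && pwd)" || return 1
  for file in "$dir"/*.mp4; do
    [[ -e "$file" ]] || continue   # skip the literal pattern when nothing matches
    escaped="$(printf '%s' "$file" | sed "s/'/'\\\\''/g")"
    printf "file '%s'\n" "$escaped"
  done
}
```

It could replace the `in=` loop, with the merge becoming `ffmpeg -f concat -safe 0 -i <(concat_list "$1") -c copy "$out"`.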
On Mac I used QuickTime Player (Edit > Trim …) to manually trim the start and end of each video after it was digitized and converted to .mp4. There is a good opportunity here for some sort of automatic trimming based on video & audio analysis.
Audio Cassettes
There were a couple of Compact Cassettes (audio tapes) my parents wanted digitized. I recorded these into the computer using the same capture card as for video. The software did not have a mono, audio-only mode, so I had to do this conversion myself.
ffmpeg’s ffprobe tool confirmed the audio was in AAC format, so I chose to save the audio stream as a .m4a (AAC) audio file. The pan filter let me keep only the left audio channel; because a filter is applied, ffmpeg re-encodes the audio rather than copying it.
#!/usr/bin/env bash
# Remove video stream from cassette recordings and convert to mono
# $1: input dir
# $2: output dir
shopt -s globstar # requires bash 4
for file in "$1"/**/*.mp4; do
  out="$2/${file#"$1"}"
  mkdir -p "$(dirname "$out")"
  ffmpeg -i "$file" -vn -af "pan=mono|c0=c0" "${out%%.mp4}.m4a"
done
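The codec check mentioned above can be done with a single ffprobe call. This is a minimal sketch that prints only the codec name of the first audio stream. `audio_codec` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# audio_codec FILE: print the codec name of FILE's first audio stream (e.g. "aac").
audio_codec() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}
```

If no filter were needed, a matching codec would allow `-c:a copy` to extract the audio without re-encoding; with the pan filter, ffmpeg has to re-encode either way.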
As with videos, I used QuickTime Player to trim the audio recordings after they were processed.
Schoolwork
I found a couple of old school assignments whose digital originals I had lost, or which had never been on the computer in the first place. My parents’ multifunction printer/scanner was the easiest way for me to scan the documents, but the output was a separate .jpg file per page. ImageMagick’s convert tool allowed me to combine these into a single .pdf.
#!/usr/bin/env bash
# Merges all .jpg's in $1 into $1.pdf, based on default order
# $1: input dir
convert "$1"/*.jpg "$(basename "$1")".pdf
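One caveat with the default glob order: it is lexical, so if the scanner does not zero-pad page numbers, page10.jpg is merged before page2.jpg. The sketch below is one way around that, assuming a `sort` that supports `-V` (version sort) and filenames without newlines. `pages_in_order` and `merge_pages` are hypothetical names.

```bash
#!/usr/bin/env bash
# pages_in_order DIR: print DIR's .jpg files one per line in natural order,
# so that page2.jpg sorts before page10.jpg.
pages_in_order() {
  local file
  for file in "$1"/*.jpg; do
    [[ -e "$file" ]] && printf '%s\n' "$file"
  done | sort -V
}

# merge_pages DIR: combine the ordered pages into DIR.pdf in the current directory.
merge_pages() {
  local -a pages
  mapfile -t pages < <(pages_in_order "$1")   # mapfile requires Bash 4
  convert "${pages[@]}" "$(basename "$1")".pdf
}
```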
I also have a bunch of art I made back in high school. Smaller pieces were scanned using the flatbed. Large artwork was photographed with my DSLR on a tripod, with lighting as even as possible. I then cropped the photos down to fit the art.
Emails
I was able to track down a couple of old email accounts using the data off of the backup discs. While my old Hotmail had long since had its contents deleted, I was able to download all of my data from Gmail using Google Takeout. This also happened to include a lot of high school assignments and photos. Afterwards, I closed down all of these accounts, as well as any others I had regained access to.
While I didn’t care about saving the emails outside of just holding onto the .mbox, I wanted the attachments of old photos and schoolwork. The format Takeout gives emails in is .mbox, which is supposed to be at least standard-ish. I had no luck extracting the email attachments by command-line with ripmime, so I used Thunderbird with ImportExportTools. To extract the .mbox in a way I could work with:
- Local Folders (right click) > ImportExportTools > Import mbox file.
- Local Folders (right click) > ImportExportTools > Export all messages in the folder > HTML format (with attachments).
From here, I wrote a script to organize all of the attachments so that I could quickly find the ones I wanted.
#!/usr/bin/env bash
# Extract attachments from gmail mbox exported from Thunderbird
# $1: input dir
# $2: output dir
mkdir -p "$2"
for file in "$1"/messages/*/*; do
  out="${file##*messages/}"
  cp -nv "$file" "$2/${out/\//-}"
done
Conclusion
After all of this, I copied all of the data to a USB hard drive as well as the cloud. I am currently going through and cleaning up the files by hand as I find time. I’d eventually like a neatly organized cloud folder of videos, pictures, etc., but at least the critical issue of the media degrading is taken care of.