As always, measure first, then optimize. I turned on S3 access logging, waited 24h, then ran these commands to collect and aggregate the logs to see who’s downloading these files:
```sh
aws --profile personal s3 sync s3://huffduff-video/logs .
grep REST.GET.OBJECT 2015-* | grep ' 200 ' | cut -d' ' -f8,20- \
  | sort | uniq -c | sort -n -r > user_agents
```
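That leaves a `user_agents` file of per-user-agent download counts. For the bot vs. non-bot split below, a rough follow-up pass might look like this sketch; the bot-name pattern is my own guess, not part of the original pipeline:

```sh
# Sum the uniq -c counts by whether the user agent looks like a bot.
# The pattern list is illustrative, not exhaustive.
awk '/[Bb]ot|Slurp|FlipboardProxy|libwww-perl/ {bots += $1; next}
     {humans += $1}
     END {print "bots:", bots, "| everything else:", humans}' user_agents
```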
This gave me some useful baseline numbers. Over a 24h period, there were 482 downloads, 318 of which came from bots. (That’s 2/3!) Looking at the top user agents by downloads, five out of six were bots. The one exception was the Overcast podcast app.
- FlipboardProxy (142 downloads)
- Googlebot (67)
- Overcast (47)
- Twitterbot (39)
- Yahoo! Slurp (36)
- Googlebot-Video (34)
(Side note: Googlebot-Video is polite and includes ETag or If-Modified-Since when it refetches files. It sent 68 requests, but exactly half of those resulted in an empty 304 response. Thanks, Googlebot-Video!)
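For the curious, here's roughly what that conditional refetch looks like with curl. The object key `example.mp3` is made up; the mechanics (a first request capturing the ETag, a second sending If-None-Match) are standard HTTP:

```sh
# First request: capture the ETag header (quotes included, per HTTP).
etag=$(curl -sI https://huffduff-video.s3.amazonaws.com/example.mp3 \
       | awk -F': ' 'tolower($1) == "etag" {print $2}' | tr -d '\r')

# Conditional refetch: S3 answers with an empty 304 if the object is unchanged.
curl -sI -H "If-None-Match: $etag" \
  https://huffduff-video.s3.amazonaws.com/example.mp3 | head -1
```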
I switched huffduff-video to use S3 URLs on the huffduff-video.s3.amazonaws.com virtual host, added a robots.txt file that blocks all bots, waited 24h, and then measured again.
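I haven't reproduced my exact robots.txt here, but a file that blocks all compliant crawlers amounts to the standard two-line stanza:

```
User-agent: *
Disallow: /
```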
The vast majority of huffduff-video links on Huffduffer are still on the s3.amazonaws.com domain, which doesn't serve my robots.txt, so I didn't expect a big difference…but I was wrong. Twitterbot had roughly the same number, but the other bots were way down:
- Overcast (76)
- Twitterbot (36)
- FlipboardProxy (33)
- iTunes (OS X) (21)
- Yahoo! Slurp (20)
- libwww-perl (18)
- Googlebot (14)
(Googlebot-Video was way farther down the chart with just 4 downloads.)
This may have been partly because my first measurement was Wed-Thurs and the second was Fri-Sat, which are slower days for social media and link sharing. Still, I'm hoping some of the drop was due to robots.txt. Fingers crossed the bots will eventually go away altogether!
As of mid-2019, this is up to ~$100/mo, largely due to organic growth. I'm OK with that; consider it one of my donations to the open web.