As always, measure first, then optimize. I turned on S3 access logging, waited 24h, then ran these commands to collect and aggregate the logs to see who’s downloading these files:
```sh
aws --profile personal s3 sync s3://huffduff-video/logs .
grep REST.GET.OBJECT 2015-* | grep ' 200 ' | cut -d' ' -f8,20- \
  | sort | uniq -c | sort -n -r > user_agents
```
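That leaves a `user_agents` file of per-user-agent download counts. For the bot vs. non-bot split below, a rough follow-up pass might look like this sketch; the bot-name pattern is my own guess, not part of the original pipeline:

```sh
# Sum the uniq -c counts by whether the user agent looks like a bot.
# The pattern list is illustrative, not exhaustive.
awk '/[Bb]ot|Slurp|FlipboardProxy|libwww-perl/ {bots += $1; next}
     {humans += $1}
     END {print "bots:", bots, "| everything else:", humans}' user_agents
```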
This gave me some useful baseline numbers. Over a 24h period, there were 482 downloads, 318 of which came from bots. (That’s 2/3!) Looking at the top user agents by downloads, five out of six were bots. The one exception was the Overcast podcast app.
- FlipboardProxy (142 downloads)
- Googlebot (67)
- Overcast (47)
- Twitterbot (39)
- Yahoo! Slurp (36)
- Googlebot-Video (34)
(Side note: Googlebot-Video is polite and includes ETag or If-Modified-Since when it refetches files. It sent 68 requests, but exactly half of those resulted in an empty 304 response. Thanks, Googlebot-Video!)
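For the curious, here's roughly what that conditional refetch looks like with curl. The object key `example.mp3` is made up; the mechanics (a first request capturing the ETag, a second sending If-None-Match) are standard HTTP:

```sh
# First request: capture the ETag header (quotes included, per HTTP).
etag=$(curl -sI https://huffduff-video.s3.amazonaws.com/example.mp3 \
       | awk -F': ' 'tolower($1) == "etag" {print $2}' | tr -d '\r')

# Conditional refetch: S3 answers with an empty 304 if the object is unchanged.
curl -sI -H "If-None-Match: $etag" \
  https://huffduff-video.s3.amazonaws.com/example.mp3 | head -1
```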
I switched huffduff-video to use S3 URLs on the huffduff-video.s3.amazonaws.com virtual host, added a robots.txt file that blocks all bots, waited 24h, and then measured again.
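I haven't reproduced my exact robots.txt here, but a file that blocks all compliant crawlers amounts to the standard two-line stanza:

```
User-agent: *
Disallow: /
```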
The vast majority of huffduff-video links on Huffduffer are still on the s3.amazonaws.com domain, which doesn't serve my robots.txt, so I didn't expect a big difference…but I was wrong. Twitterbot had roughly the same number, but the other bots were way down:
- Overcast (76)
- Twitterbot (36)
- FlipboardProxy (33)
- iTunes (OS X) (21)
- Yahoo! Slurp (20)
- libwww-perl (18)
- Googlebot (14)
(Googlebot-Video was way farther down the chart with just 4 downloads.)
This may have been partly because my first measurement was Wed-Thurs and the second was Fri-Sat, which are slower days for social media and link sharing. Still, I'm hoping some of the drop was due to robots.txt. Fingers crossed the bots will eventually go away altogether!
As of mid-2019, this is up to ~$100/mo, largely due to organic growth. I'm OK with that; consider it one of my donations to the open web.