As always, measure first, then optimize. I turned on S3 access logging, waited 24h, then ran these commands to collect and aggregate the logs to see who’s downloading these files:
# Download the day's access logs.
aws --profile personal s3 sync s3://huffduff-video/logs .
# Keep successful (200) GETs, extract the operation and user-agent fields,
# then count and rank the unique combinations.
grep REST.GET.OBJECT 2015-* | grep ' 200 ' | cut -d' ' -f8,20- \
  | sort | uniq -c | sort -n -r > user_agents
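To turn the resulting user_agents file into totals, you can sum the leading counts that uniq -c produced. A minimal sketch; the bot patterns below are my illustrative assumption, not an exhaustive list:

# total successful downloads across all user agents
awk '{sum += $1} END {print sum}' user_agents
# downloads from bot-looking user agents only
grep -Ei 'bot|slurp|flipboard|libwww' user_agents | awk '{sum += $1} END {print sum}'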
This gave me some useful baseline numbers. Over a 24h period, there were 482 downloads, 318 of which came from bots. (That’s 2/3!) Looking at the top user agents by downloads, five out of six were bots. The one exception was the Overcast podcast app.
- FlipboardProxy (142 downloads)
- Googlebot (67)
- Overcast (47)
- Twitterbot (39)
- Yahoo! Slurp (36)
- Googlebot-Video (34)
(Side note: Googlebot-Video is polite and includes an ETag (via If-None-Match) or If-Modified-Since when it refetches files. It sent 68 requests, but exactly half of those resulted in an empty 304 response. Thanks, Googlebot-Video!)
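For context, a conditional refetch looks roughly like this (hypothetical key and header values; the point is that a 304 carries no body, so those 34 requests cost almost nothing in bandwidth):

GET /huffduff-video/example.mp3 HTTP/1.1
Host: s3.amazonaws.com
If-None-Match: "abc123def456"
If-Modified-Since: Wed, 18 Feb 2015 06:00:00 GMT

HTTP/1.1 304 Not Modified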
I switched huffduff-video to use S3 URLs on the huffduff-video.s3.amazonaws.com virtual host, added a robots.txt file that blocks all bots, waited 24h, and then measured again.
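The robots.txt is the standard blanket block; a minimal sketch (well-behaved crawlers honor this, and it only applies to the virtual-host domain, since S3 serves it from the bucket):

User-agent: *
Disallow: /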
The vast majority of huffduff-video links on Huffduffer are still on the s3.amazonaws.com domain, which doesn't serve my robots.txt; for those path-style URLs, crawlers look for https://s3.amazonaws.com/robots.txt at the domain root, which Amazon controls, not me. So I didn't expect a big difference…but I was wrong. Twitterbot had roughly the same number, but the rest were way down:
- Overcast (76)
- Twitterbot (36)
- FlipboardProxy (33)
- iTunes (OS X) (21)
- Yahoo! Slurp (20)
- libwww-perl (18)
- Googlebot (14)
(Googlebot-Video was way farther down the chart with just 4 downloads.)
This may have been because my first measurement was Wed-Thurs and the second was Fri-Sat, which are slower days for social media and link sharing. Still, I'm hoping some of it was due to robots.txt. Fingers crossed the bots will eventually go away altogether!
Update, mid-2019: this is up to ~$100/mo, largely due to organic growth. I'm OK with that. Consider it one of my donations to the open web.