When I have to decide whether to implement a feature in Bridgy, or how to prioritize tasks, I often make assumptions like “most indie web sites have an h-card” or “PSCs and PSLs never got much traction.” I know they’re based on anecdotal evidence, not actual data, but it’s all I have, so I run with it.
Clearly not ideal. I’d love to use real data instead! Here’s a project idea: crawl indieweb sites and generate usage stats for microformats2 classes and other indieweb features.
Tantek and others have proposed a similar Indie ThinkUp idea for more non-technical statistics, e.g. frequency of each post type (post vs reply vs like, etc.), how often you thank people, how often you curse, etc.
Straw man design proposal:
- Seed from IRC_People and maybe all domains that have ever logged into IndieAuth. Don’t even bother spidering, at least to start; just crawl those domains.
- Try to identify the server software (Known, WordPress, etc.).
- Parse every h-entry on the front page and every h-feed linked from the front page.
- Count all instances of mf2 classes. Identify them by the mf2 prefixes: h-, p-, u-, dt-, and e-.
- Aggregate per page and per domain so we can answer questions like “what fraction of posts are photo posts?” and “how many people use syndication links?”
- Generate a static html report with simple graphs using D3 or Google Charts or whatever.
- Set up a cron job to do all this once a day or so.
- Dump the entire dataset as CSV so people can pull it into Excel (or R, Wolfram Alpha, etc.) and do their own analyses.
- Store and report dates for each mf2 class: first seen, last seen, etc., both global and per domain.
- Crawl and analyze features in silo posts too, e.g. PSCs/PSLs.
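
To make the counting, aggregation, and CSV steps above a bit more concrete, here is a rough sketch in Python using only the standard library. It skips crawling entirely and takes already-fetched HTML as input. All of the names (`count_mf2_classes`, `aggregate_by_domain`, `to_csv`) are made up for illustration. The regex is my own assumption about what counts as an mf2 class name: one of the five prefixes followed by lowercase words joined by hyphens.

```python
import csv
import io
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: an mf2 class is one of the five prefixes followed by
# lowercase a-z words joined by hyphens, e.g. h-entry or dt-published.
MF2_CLASS = re.compile(r"^(h|p|u|dt|e)-[a-z]+(-[a-z]+)*$")


class Mf2ClassCounter(HTMLParser):
    """Counts every class name that looks like an mf2 class."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if MF2_CLASS.match(cls):
                        self.counts[cls] += 1


def count_mf2_classes(html):
    """Return a Counter of mf2-looking class names in one HTML page."""
    parser = Mf2ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def aggregate_by_domain(pages):
    """pages is an iterable of (domain, html) pairs; returns {domain: Counter}."""
    per_domain = {}
    for domain, html in pages:
        per_domain.setdefault(domain, Counter()).update(count_mf2_classes(html))
    return per_domain


def to_csv(per_domain):
    """Dump the per-domain counts as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["domain", "class", "count"])
    for domain in sorted(per_domain):
        for cls, n in sorted(per_domain[domain].items()):
            writer.writerow([domain, cls, n])
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical input; a real run would fetch each seed domain's front page.
    pages = [
        ("example.com", '<div class="h-entry"><a class="u-url p-name" href="/1">hi</a></div>'),
        ("example.com", '<div class="h-entry"><img class="u-photo" src="a.jpg"></div>'),
    ]
    print(to_csv(aggregate_by_domain(pages)))
```

Matching on prefixes alone will produce some false positives, since utility CSS class names like `h-full` fit the same pattern. A real version would probably want a proper mf2 parser and only count classes nested under an `h-*` root.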