Indie stats

Bear and Ben have taken this and run with it! See the wiki page and GitHub project.

When I have to decide whether to implement a feature in Bridgy, or how to prioritize tasks, I often make assumptions like "most indie web sites have an h-card" or "PSCs and PSLs (permashortcitations and permashortlinks) never got much traction." I know these are based on anecdotal evidence, not actual data, but it's all I have, so I run with it.

Clearly not ideal. I’d love to use real data instead! Here’s a project idea: crawl indieweb sites and generate usage stats for microformats2 classes and other indieweb features.

Tantek and others have proposed a similar Indie ThinkUp idea for more non-technical statistics, e.g. frequency of each post type (post vs reply vs like, etc.), how often you thank people, how often you curse, etc.

Straw man design proposal:

  • Seed from IRC_People and maybe all domains that have ever logged into IndieAuth. Don’t even bother spidering, at least to start; just crawl those domains.
  • Try to identify the server software (Known, WordPress, etc.).
  • Parse every h-entry on the front page and every h-feed linked from the front page.
  • Count all instances of mf2 classes. Identify them by the mf2 prefixes: h-, p-, u-, dt-, and e-.
  • Aggregate per page and per domain so we can answer questions like "what fraction of posts are photo posts?" and "how many people use syndication links?"
  • Generate a static HTML report with simple graphs using D3 or Google Charts or whatever.
  • Set up a cron job to do all this once a day or so.
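
A rough sketch of the counting and aggregation steps, using only the Python standard library. The names here (`Mf2ClassCounter`, `count_mf2_classes`, `aggregate_by_domain`) are made up for illustration, and the regex is a simplified approximation of mf2 class names rather than the full parsing spec. A real crawler would probably lean on a proper mf2 parser instead.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Approximation (an assumption, not the full mf2 spec): one of the five mf2
# prefixes followed by lowercase letters/digits, optionally hyphen-separated.
MF2_CLASS = re.compile(r"^(h|p|u|dt|e)-[a-z0-9]+(-[a-z0-9]+)*$")


class Mf2ClassCounter(HTMLParser):
    """Counts class names that look like mf2 classes in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if MF2_CLASS.match(cls):
                        self.counts[cls] += 1


def count_mf2_classes(html):
    """Return a Counter of mf2-looking class names for a single page."""
    parser = Mf2ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def aggregate_by_domain(pages):
    """Sum per-page counts into per-domain counts.

    pages is an iterable of (domain, html) pairs; fetching them is out of
    scope for this sketch.
    """
    per_domain = {}
    for domain, html in pages:
        per_domain.setdefault(domain, Counter()).update(count_mf2_classes(html))
    return per_domain
```

Matching on prefix alone will pick up some false positives (any unrelated class that happens to start with `e-` or `p-`), so the report should probably treat rare classes with some suspicion.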

Stretch goals:

  • Dump the entire dataset as CSV so people can pull it into Excel (or R, Wolfram Alpha, etc.) and do their own analyses.
  • Store and report dates for each mf2 class: first seen, last seen, etc., both global and per domain.
  • Crawl and analyze features in silo posts too, e.g. PSCs/PSLs.
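
The first two stretch goals could be sketched roughly like this, again in plain Python. The function names, the in-memory `seen` dict, and the CSV column names are all assumptions for illustration; a real version would need to persist this state between daily runs.

```python
import csv
from datetime import date


def update_seen(seen, domain, classes, day):
    """Record that each class was observed on `domain` on `day`.

    seen maps (domain, mf2_class) -> (first_seen, last_seen) dates and is
    updated in place.
    """
    for cls in classes:
        key = (domain, cls)
        first, last = seen.get(key, (day, day))
        seen[key] = (min(first, day), max(last, day))
    return seen


def dump_csv(seen, out):
    """Write one row per (domain, class) pair to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["domain", "mf2_class", "first_seen", "last_seen"])
    for (domain, cls), (first, last) in sorted(seen.items()):
        writer.writerow([domain, cls, first.isoformat(), last.isoformat()])
```

Global first seen and last seen dates fall out of the same data by taking the min and max across domains for each class.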

Also on IndieNews.
