extract compacted PyBlosxom comments

If you use PyBlosxom, and you use compact_comments.sh to compact your comment files, and you ever need to reverse the process to get back to one file per comment – for example, if you’re importing them into WordPress – here’s how.

Following these instructions, all comments in *-all.cmt files will be extracted to *-[TIMESTAMP].cmt files, where [TIMESTAMP] is the comment’s UNIX timestamp in seconds since the epoch.

First, run these commands:

# escape a few characters: ` and $
sed -i.bak 's/`/\`/g; s/\\$/\$/g;' *-all.cmt

# convert and insert cat > EOFs for each individual comment
sed -i '/^<?xml version="1.0" encoding="utf-8"?>$/d' *-all.cmt
sed -i '/^<items>$/d' *-all.cmt
sed -i '/^<\/items>$/d' *-all.cmt
sed -i 's/^<\/item>$/<\/item>\nEOF\n/' *-all.cmt
sed -i 's/^<item>$/<?xml version="1.0" encoding="utf-8"?>\n<item>/' *-all.cmt

Now, we need to do a multi-line regexp replace across files. Unfortunately, sed can’t do this, so I resorted to Emacs dired-mode’s query-replace-regexp. Other suggestions are welcome! In any case, replace this regexp:

<\?xml version="1.0" encoding="utf-8"\?>
\(.+
\)*<pubDate>\(.+\)</pubDate>

with this:

cat > \2.cmt <<EOF
\&

Finally, evaluate the cat > EOF commands injected into the -all.cmt files:

#!/bin/bash
for file in *-all.cmt; do
  base=`basename "$file" -all.cmt`
  sed -i "s/^cat > /cat > ${base}-/" "$file"
  source "$file"
done

Enjoy!

One known bug: if there are multiple comments with the exact same timestamp for a post, this will clobber all but the last one.

Leave a Reply

Your email address will not be published. Required fields are marked *