I have been ripping DVDs I own to MP4 files so I can watch them conveniently on a tablet. Sometimes I need additional information about the episodes. For example, I wanted to name the episodes of the Granada Sherlock Holmes TV series after the title of each tale, to make them easy to select: 17-The_Musgrave_Ritual.mp4 is much preferred to 17.mp4.
On Wikipedia, the episodes of many TV series are tabulated. You can highlight the contents of the table and paste them into a LibreOffice spreadsheet. This can then be exported as a CSV file for further processing, e.g. with a Python program that generates a shell script to rename the files the desired way.
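As a rough illustration of that last step, here is a minimal sketch of a Python program that turns CSV rows into mv commands. The function name rename_commands, the column positions (episode number in column 0, title in column 1) and the sample rows are assumptions made for this example, not the layout of any particular Wikipedia table.

```python
import csv
import io
import shlex

def rename_commands(csv_text, number_col=0, title_col=1):
    """Yield one mv command per CSV row.

    The column positions are assumptions; adjust them to match
    the spreadsheet that was actually exported.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        number = row[number_col].strip()
        # Use underscores so the file name contains no spaces.
        title = row[title_col].strip().replace(" ", "_")
        old_name = f"{number}.mp4"
        new_name = f"{number}-{title}.mp4"
        # shlex.quote protects any characters the shell would interpret.
        yield f"mv -- {shlex.quote(old_name)} {shlex.quote(new_name)}"

# Hypothetical sample data standing in for the exported CSV file.
sample = "17,The Musgrave Ritual\n18,The Abbey Grange\n"
for command in rename_commands(sample):
    print(command)
```

Redirecting the printed lines to a file gives a shell script that can be reviewed before it is run.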
The point of this blog post is that copying a table from a web page also captures the underlying characters in the table, including non-ASCII characters. No surprise that this includes the non-breaking space (&nbsp; in HTML): code point U+00A0, which is the single byte 0xA0 in 8-bit Latin-1 encoding and the two bytes 0xC2 0xA0 in UTF-8 encoding. So when processing the CSV file, this needs to be converted to an ordinary space or your shell scripts won't work. Here's an example of the conversion needed.
datestring = row[5].replace(u"\xa0", " ")
I needed this to generate a touch -d 'date' episode.mp4 command. touch kept telling me the date format was invalid until I inspected the date string and found a non-breaking space in it.
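To make the problem concrete, here is a small Python 3 sketch that shows how the non-breaking space is encoded and builds the touch command from a cleaned date field. The helper names clean and touch_command and the sample date are made up for this example.

```python
import shlex

NBSP = "\u00a0"  # non-breaking space, code point U+00A0

def clean(field):
    """Replace every non-breaking space with an ordinary space."""
    return field.replace(NBSP, " ")

def touch_command(date_field, filename):
    """Build a touch -d command from a date field that may contain NBSPs."""
    return f"touch -d {shlex.quote(clean(date_field))} {shlex.quote(filename)}"

# The same character in two encodings.
print(NBSP.encode("latin-1"))  # b'\xa0'
print(NBSP.encode("utf-8"))    # b'\xc2\xa0'

# Made-up date with non-breaking spaces, as it might arrive from a pasted table.
raw_date = "1\u00a0January\u00a02000"
print(touch_command(raw_date, "episode.mp4"))
```

Printing repr(raw_date) is a quick way to spot the character, since it looks exactly like a normal space when printed directly.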