Wednesday, 19 January 2022

Scp cannot handle file times before the Unix epoch

Native Linux filesystems have been able to store file timestamps before the Unix epoch, 1 Jan 1970, for some time now, because time_t is a signed type (and these days 64 bits wide), so earlier times are simply negative values.
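As a minimal sketch of what a pre-epoch timestamp looks like to a program: the date is just a negative number of seconds, and os.utime() will accept it where the filesystem supports it. The function names here are my own for illustration, and I assume midnight UTC for the date.

```python
import os
from datetime import datetime, timezone


def epoch_seconds(year, month, day):
    """Seconds since the Unix epoch for midnight UTC on the given date.

    The result is negative for dates before 1 Jan 1970.
    """
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def stamp_and_read(path, seconds):
    """Set atime and mtime of path to `seconds`, then return the stored mtime.

    Whether a negative value survives depends on the filesystem.
    """
    os.utime(path, (seconds, seconds))
    return os.stat(path).st_mtime
```

Calling stamp_and_read("/tmp/somefile", epoch_seconds(1946, 7, 31)) on a native Linux filesystem should hand back the same negative number, which ls then displays as a 1946 date.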

Today I discovered that scp cannot transfer timestamps before the epoch. Here is the result of using scp to copy a file containing a scan of an old photo, which I had timestamped with the day it was sent:

-rw-r--r-- 1 me users 245768 Jan  1  1970 /tmp/1946-07-31-myrtle.pdf

By comparison, rsync does the right thing:

-rw-r--r-- 1 me users 245768 Jul 31  1946 /tmp/1946-07-31-myrtle.pdf

I suppose I could look into the scp protocol to discover why this is.

Watch out for non-breaking spaces on screen scrapes

I have been ripping DVDs I own to MP4 files for convenience of viewing on a tablet. Sometimes I need to get additional information about the episodes. For example, I wanted to name the episodes of the Granada Sherlock Holmes TV series with the title of the tale for easy selection: 17-The_Musgrave_Ritual.mp4 is much preferable to 17.mp4.

On Wikipedia, episodes of many TV series are tabulated. You can highlight the contents of the table and paste them into a LibreOffice spreadsheet. This can then be exported as a CSV file for further processing, e.g. with a Python program to generate a shell script that will rename the files the desired way.

This blog post is to point out that screen scraping will also capture the underlying characters in the tables, including non-ASCII characters in UTF-8 encoding. No surprise that this includes the non-breaking space: &nbsp; in HTML, 0xA0 in 8-bit (Latin-1) encoding, U+00A0 in Unicode, or the two bytes 0xC2 0xA0 in UTF-8 encoding. So when processing the CSV file, this needs to be converted to an ordinary space or your shell scripts won't work. Here's an example of the conversion needed.

datestring = row[5].replace(u"\xa0", " ")

This was to generate a touch -d 'date' episode.mp4 command. Touch kept telling me the date format was invalid until I investigated the date string and found a non-breaking space in it.
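For what it's worth, here is a rough sketch of how the whole step might look, cleaning each cell and emitting one touch command per row. Only the date being in column 5 comes from my actual script; the helper names, the file name column, and the sample row in the usage note are made up for illustration.

```python
import csv
import io
import shlex

NBSP = "\u00a0"  # non-breaking space, U+00A0


def clean(cell):
    """Replace non-breaking spaces with ordinary spaces and trim the ends."""
    return cell.replace(NBSP, " ").strip()


def touch_commands(csv_text, name_col=0, date_col=5):
    """Yield one `touch -d` shell command per CSV row.

    name_col is an assumed position for the episode number;
    date_col=5 matches the row[5] used above.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= max(name_col, date_col):
            continue  # skip blank or short rows
        date = clean(row[date_col])
        filename = clean(row[name_col]) + ".mp4"
        # shlex.quote keeps spaces in the date from splitting the argument
        yield "touch -d {} {}".format(shlex.quote(date), shlex.quote(filename))
```

Feeding it a made-up row such as 17,a,b,c,d,"1\u00a0January\u00a02000" gives touch -d '1 January 2000' 17.mp4, with the non-breaking spaces gone. Quoting with shlex.quote rather than hand-written quote marks also guards against apostrophes in titles.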