Moving the blog images from Flickr to Digital Ocean Spaces

Cyberhades Digital Ocean Spaces

On this blog we have used Flickr as our main image repository since 2008. We even paid for a Pro account for a couple of years, back in 2015 and 2016, although I can’t recall the benefits of a Pro account over the free one.

Our experience with Flickr has always been very positive, and we never had an issue with them. However, after SmugMug acquired Flickr, the policies changed: free accounts will be limited to 1,000 images, and anything beyond that will be removed. At Cyberhades we have exactly 3,997 images on Flickr, so keeping all of them would mean paying for a Pro account, which costs around $50 a year or $6 a month. A Pro account offers more than an unlimited number of images, and if you are a photographer you may benefit from those other perks, but in our case we get no benefit besides the CDN.

The purpose of this post is not to discuss the fact that we moved away from Flickr, but how we did it.

The first thing we needed to decide was where to migrate. After looking at several cloud providers, we decided to go with Digital Ocean’s (DO) Spaces. One thing I like about DO is their fixed-price policy. We also moved our blog infrastructure to DO about 3 years ago, and both the service and the experience have been excellent.

DO offers a service called Spaces. It is compatible with AWS S3, which means you can use any tool that works with AWS S3 to interact with it. It also has a CDN, which is pretty much all we need for the blog.

Once we had decided where to migrate, it was time to get our hands dirty. The first thing we needed was to download our pictures from Flickr. Luckily, Flickr allows you to download all the data they have about you, including all the files (images, videos, etc.). You can do this from your account settings page, where there is an option to request your data. After you request it, it can take a while to be ready, depending on how many files you have there.

When your data is ready, you will see something like this:

Flickr Data

Each of these zip archives contains 500 files. After downloading the archives and extracting their contents, we faced our first problem: the filenames are not the same as when we uploaded them. Flickr adds an id number plus “_o” at the end of the filename, before the extension. For instance, if you upload an image to Flickr with the name libro-microhistorias-informatica--nuevo0xword.jpg, Flickr will store it as something like libro-microhistorias-informatica--nuevo0xword_8768892888_o.jpg. That 8768892888 is a unique identifier.

The next problem we had to solve was how to know which picture corresponded to which link in our current blog posts. Some of the links to Flickr looked like this: https://farm4.staticflickr.com/3803/8768892888_8932423465.jpg. As you can see, there is no picture name, although we do have the picture id. So we needed a way to map each picture id to its filename in order to later replace all these links.

To do that I wrote a small Python script that reads all the images I downloaded from Flickr, extracts the id from each filename, and stores it in a dictionary with the id as the key and the filename as the value. Here is a little snippet:

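from os import listdir
from os.path import isfile, join

def loadFilenames(picspath):
    # Map each Flickr picture id to the filename that contains it.
    filenames_by_id = {}

    onlyfiles = [f for f in listdir(picspath) if isfile(join(picspath, f))]
    for entry in onlyfiles:
        tokens = entry.split("_")
        # The id is expected third or fourth from the end of the name.
        if len(tokens) >= 3 and tokens[-3].isdigit():
            filenames_by_id[tokens[-3]] = entry
        if len(tokens) > 3 and tokens[-4].isdigit():
            filenames_by_id[tokens[-4]] = entry

    return filenames_by_id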
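
As a cross-check on the token positions, here is a minimal sketch of an alternative that pulls the id out with a regular expression instead. It assumes the `<name>_<id>_o` naming described earlier, optionally followed by another suffix before the extension; the function name `extract_flickr_id` is made up for illustration.

```python
import re

# Assumed naming pattern: <original-name>_<numeric id>_o[_<suffix>].<ext>
# e.g. libro-microhistorias-informatica--nuevo0xword_8768892888_o.jpg
_ID_PATTERN = re.compile(r"_(\d+)_o(?:_[^.]*)?\.[^.]+$")


def extract_flickr_id(filename):
    """Return the Flickr id embedded in filename, or None if there is none."""
    match = _ID_PATTERN.search(filename)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_flickr_id(
        "libro-microhistorias-informatica--nuevo0xword_8768892888_o.jpg"))
```

A regular expression anchored on the `_o` marker does not depend on how many underscores the original name contains, which is the main reason to consider it.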

The next step was to find all the links to Flickr in our more than 6,000 blog posts. We found out that the links weren’t consistent and came in slightly different formats. We identified 3 different groups, and to match them we came up with 3 regular expressions:

import re

# c holds the content of one blog post; raw strings keep the backslashes intact.
matches = re.findall(r"http[s]?://farm?.\.static\.*flickr\.com/\d+/\d+_\w+\.[a-z]{3,4}", c)
matches = matches + re.findall(r"http[s]?://www\.flickr\.com/photos/cyberhades/\d+/*", c)
matches = matches + re.findall(r"http[s]?://c?.\.staticflickr\.com/\d+/\d+/\d+_\w+\.[a-z]{3,4}", c)
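
To see what each expression catches, the sketch below runs all three against a short sample. Only the first URL comes from this post; the other two are made-up examples of the second and third link shapes, and `find_flickr_links` is a hypothetical helper name.

```python
import re

# The three link shapes found in the posts, written as raw strings.
FLICKR_PATTERNS = [
    r"http[s]?://farm?.\.static\.*flickr\.com/\d+/\d+_\w+\.[a-z]{3,4}",
    r"http[s]?://www\.flickr\.com/photos/cyberhades/\d+/*",
    r"http[s]?://c?.\.staticflickr\.com/\d+/\d+/\d+_\w+\.[a-z]{3,4}",
]


def find_flickr_links(content):
    """Return every Flickr link in content, grouped in pattern order."""
    matches = []
    for pattern in FLICKR_PATTERNS:
        matches += re.findall(pattern, content)
    return matches


if __name__ == "__main__":
    sample = (
        # This first URL appears in the post.
        '<img src="https://farm4.staticflickr.com/3803/8768892888_8932423465.jpg">\n'
        # The next two are hypothetical examples of the other link shapes.
        '<a href="https://www.flickr.com/photos/cyberhades/8768892888/">photo</a>\n'
        '<img src="https://c1.staticflickr.com/4/3803/8768892888_8932423465_z.jpg">\n'
    )
    for link in find_flickr_links(sample):
        print(link)
```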

After this, we needed to go through each blog post, find every link that matched any of these regular expressions, and replace it with the corresponding link from DO Spaces. But to do this, we first needed our images in DO.

Spaces costs $5 a month, and you get 250 GB of storage. When we created our Space, we activated the CDN option. Once the Space was created, we were ready to start uploading our images. You can use the web interface, a third-party S3-compatible client, or write your own code. As a passionate developer, I took the last option :) and wrote a small Go application for that. One thing to keep in mind is that you need to make the images public and set the right Content-Type. You will also need to create a token from the DO website to access your Space. Here is a little snippet:

func GetFileContentType(out *os.File) (string, error) {

  // Only the first 512 bytes are used to sniff the content type.
  buffer := make([]byte, 512)

  n, err := out.Read(buffer)
  if err != nil {
    return "", err
  }
  // Pass only the bytes actually read, so short files are not padded with zeros.
  contentType := http.DetectContentType(buffer[:n])

  return contentType, nil
}

...
// Make the file public
userMetaData := map[string]string{"x-amz-acl": "public-read"}

// Upload the file with FPutObject
n, err := client.FPutObject(spaceName, objectName+strings.Replace(path, dirPath, "", 1), path, minio.PutObjectOptions{ContentType: contentType, CacheControl: cacheControl, UserMetadata: userMetaData})
if err != nil {
  log.Fatalln(err)
}
...
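
For comparison, a rough Python counterpart of the Content-Type step is sketched below. Note that it swaps the technique: the standard `mimetypes` module guesses from the file extension rather than sniffing the first bytes as `http.DetectContentType` does. The helper name and the `application/octet-stream` fallback are assumptions for illustration.

```python
import mimetypes


def guess_content_type(path, default="application/octet-stream"):
    """Guess a Content-Type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type or default


if __name__ == "__main__":
    for name in ("photo_opt.jpg", "diagram_opt.png", "no_extension"):
        print(name, "->", guess_content_type(name))
```

Guessing by extension is cheaper than opening every file, but it trusts the filename, so sniffing remains the safer choice when extensions may be wrong.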

There was one more thing we needed to do before uploading all the images to our Space. The images we downloaded from Flickr are the originals, which means most of them are pretty large and not suitable for a blog post. When you upload an image, Flickr creates several sizes from the original, ideal for blogs and other uses. So before uploading our images, we needed to downscale them to a width of 600px. We also didn’t want to upscale any image that was smaller and, lastly, we wanted to keep the aspect ratio. To do this we used the magnificent ffmpeg:

for i in *; do ffmpeg -i "$i" -vf "scale='min(600,iw)':-1" "${i%.*}_opt.${i##*.}"; done

With the scale='min(600,iw)' option, we are telling ffmpeg to scale down only those images whose width is larger than 600px. With :-1 we are telling it to set the height to the number of pixels that keeps the aspect ratio. This generates another set of files with the same names but with “_opt” added at the end (before the extension), so we preserve the original images.
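
The arithmetic behind that filter, and the “_opt” naming done by the shell loop, can be sketched in Python. This is only an illustration: ffmpeg's exact rounding of the computed height may differ, and the function names are hypothetical.

```python
import os


def scaled_size(width, height, max_width=600):
    """Cap the width at max_width and keep the aspect ratio."""
    new_width = min(max_width, width)  # never upscale a smaller image
    new_height = round(height * new_width / width)  # rounding is an assumption
    return new_width, new_height


def optimized_name(filename):
    """Insert _opt before the extension, as the shell loop does."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_opt{ext}"


if __name__ == "__main__":
    print(scaled_size(1200, 800))  # a wide image is reduced
    print(scaled_size(400, 300))   # a small image is left alone
    print(optimized_name("foto_8768892888_o.jpg"))
```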

Now it is time to upload our optimized images to DO.

Cyberhades Images

Once the images are uploaded, we can see that there are two links available to access them: Origin and Edge. The link we are interested in is the Edge link, which uses the CDN to deliver our images.

Finally, all we need to do is to replace all the Flickr links in our blog posts with the new ones. Here is a snippet of the code that takes care of that part:

...
for match in matches:
    if '_' in match:
        k = match.split('/')[-1].split('_')[0]
    else:
        tokens = match.split('/')
        if tokens[-1].isdigit():
            k = tokens[-1]
        else:
            k = tokens[-2]

    # dict maps picture ids to filenames (see loadFilenames above).
    if k in dict:
        c = c.replace(match, "https://cyberhades.ams3.cdn.digitaloceanspaces.com/imagenes/" + dict[k])

print(entry)
# Write the updated post back to disk.
with open(entry, "w", encoding="ISO-8859-1") as o:
    o.write(c)
...
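
Pulling the pieces together, here is a self-contained sketch of the replacement step as a single function. It is an illustration rather than the script itself: the names `photo_id`, `rewrite_links` and `names_by_id` and the sample data are assumptions, and the id extraction is a simplified version of the logic in the snippet above.

```python
BASE_URL = "https://cyberhades.ams3.cdn.digitaloceanspaces.com/imagenes/"


def photo_id(link):
    """Extract the numeric photo id from a Flickr link."""
    last = link.rstrip("/").split("/")[-1]
    # Image links end in <id>_<rest>.<ext>; page links end in the id itself.
    return last.split("_")[0]


def rewrite_links(content, links, names_by_id, base_url=BASE_URL):
    """Replace each Flickr link whose id is known with its Spaces URL."""
    for link in links:
        name = names_by_id.get(photo_id(link))
        if name is not None:
            content = content.replace(link, base_url + name)
    return content


if __name__ == "__main__":
    link = "https://farm4.staticflickr.com/3803/8768892888_8932423465.jpg"
    # Hypothetical mapping entry for the example picture.
    names = {"8768892888": "libro-microhistorias-informatica--nuevo0xword_8768892888_o_opt.jpg"}
    print(rewrite_links(f'<img src="{link}">', [link], names))
```

Links whose id is missing from the mapping are left untouched, so the pass can be re-run safely after more images are uploaded.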

The complete Python script can be found here.

To wrap up this post: DO Spaces is not cheaper than paying for a Flickr Pro account (if you pay yearly), but with Spaces we have more control over our pictures, and all the links are now consistent, which means that if we need to do another migration or make any other change, it will make our life way easier.