Speed improvements in FlickrMetadataSynchr by exploiting parallelism

Yesterday I worked on a new version of my FlickrMetadataSynchr tool and published the 1.3.0.0 version on CodePlex. I wasn’t really planning on creating a new version, but I was annoyed by the old version in a new usage scenario. When you have an itch you have to scratch it! And it is always good to catch up on some programming if recent assignments at work don’t include any coding. So what caused this itch?

About two weeks ago I got back from my holiday in China with about 1,500 pictures on my 16 GB memory card. I always make a first selection immediately after taking a picture, so initially there were lots more. After selection and stitching panoramic photos, I managed to get this down to about 1,200 pictures. Still a lot of pictures. But storage is cheap, tagging makes search easy, so why throw away any more? One of the perks of a Pro account on Flickr is that I have unlimited storage, so I uploaded 1,173 pictures (5.54 GB). This took over 12 hours because Flickr has limited uploading bandwidth.

Adding metadata doesn’t stop at tagging pictures. You can add a title, description and geolocation to a picture. Sometimes this is easier to do on your local pictures, and sometimes I prefer to do it on Flickr. The FlickrMetadataSynchr tool that I wrote is a solution to keeping this metadata in sync. You should always try to stay in control of your data, so I keep backups of my e-mail stored in the “cloud” and I store all metadata in the original picture files on my hard drive. Of course I back up those files too. Even offsite, by storing an external hard drive outside my house.

Back to the problem. Syncing the metadata for 1,173 pictures took an annoyingly long time. The Flickr API has some batch operations, but for my tool I have to fetch metadata and update metadata for pictures one-by-one. So each fetch and each update uses one HTTP call. Each operation is not unreasonably slow, but once you add network latency to the mix, it adds up to slow performance if you do it sequentially.

Imperative programming languages like C# promote a sequential way of doing things. It is really hard to exploit multiple processor cores by splitting up work so that it can run in parallel. You run into things like concurrent access to shared memory, coordinating results and exceptions, making operations cancellable, etc. Even with a single processor core, my app would benefit from exploiting parallelism because the processor spends most of its time waiting on the result of the HTTP call. This time can be utilized by creating additional calls or processing results of other calls. Microsoft has realized that this is hard work for a programmer and great new additions are coming in .NET Framework 4.0 and Visual Studio 2010, such as the Task Parallel Library and easier debugging of parallel applications.

However, these improvements are still in the beta stage and not usable yet for production software like my tool. I am not the only user of my application and “xcopy deployability” remains a very important goal to me. For example, the tool does not use .NET 3.5 features and only depends on .NET 3.0. This is because Windows Vista comes with .NET 3.0 out of the box and .NET 3.5 requires an additional hefty install. I might make the transition to .NET 3.5 SP1 soon, because it is now pushed out to all users of .NET 2.0 and higher through Windows Update.

So I added parallelism the old-fashioned way, by manually spinning up threads, locking shared data structures appropriately, propagating exception information through callbacks, making asynchronous processes cancellable, waiting on all worker threads to finish using WaitHandles, etc. I don’t use the standard .NET threadpool for queuing work because it is tuned for CPU bound operations. I want to have fine grained control over the number of HTTP connections that I open to Flickr. A reasonable number is a maximum of 10 concurrent connections. This gives me almost ten times the original speed for the Flickr fetch and update steps in the sync process. Going any higher puts me at risk of being seen as launching a denial-of-service attack against the Flickr web services.
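To give an idea of what this looks like, here is a minimal sketch of the pattern. It is not the tool's actual code: the class BoundedWorkers and its ProcessAll method are names made up for this illustration, and for brevity it waits with Thread.Join instead of WaitHandles and leaves out cancellation. A fixed number of threads pull items from a shared queue under a lock, and exceptions are collected and handed back to the caller instead of being lost on a worker thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper for illustration only; not part of FlickrMetadataSynchr.
public static class BoundedWorkers
{
    // Runs 'action' once per item on at most 'maxWorkers' manually created threads.
    // Returns the exceptions thrown by 'action' (an empty list if everything succeeded).
    public static IList<Exception> ProcessAll<T>(IList<T> items, int maxWorkers, Action<T> action)
    {
        Queue<T> pending = new Queue<T>(items);          // shared work queue
        List<Exception> errors = new List<Exception>();  // shared error list
        object sync = new object();                      // guards 'pending' and 'errors'
        List<Thread> workers = new List<Thread>();

        int workerCount = Math.Min(maxWorkers, items.Count);
        for (int i = 0; i < workerCount; i++)
        {
            Thread worker = new Thread(delegate()
            {
                while (true)
                {
                    T item;
                    lock (sync)
                    {
                        if (pending.Count == 0) return;  // no work left, thread ends
                        item = pending.Dequeue();
                    }

                    // The slow call (for example an HTTP request) runs outside the lock.
                    try { action(item); }
                    catch (Exception ex) { lock (sync) { errors.Add(ex); } }
                }
            });
            worker.IsBackground = true;
            worker.Start();
            workers.Add(worker);
        }

        // Block the calling (background) thread until every worker has finished.
        foreach (Thread started in workers) started.Join();
        return errors;
    }
}
```

Calling BoundedWorkers.ProcessAll(photos, 10, fetchOnePhoto) would then cap the number of simultaneous calls at ten, where photos and fetchOnePhoto are placeholders for your own list and per-item method.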

If you want to take a look at my source code, you can find it at CodePlex. The app was already nicely factored, so I didn’t have to rearchitect it to add parallelism. The sync process was already done on a background thread (albeit sequentially) in a helper class, because you should never block the UI thread in WinForms or WPF applications. The app already contained quite a bit of thread synchronization stuff. The new machinery is contained in the abstract generic class AsyncFlickrWorker<TIn, TOut>. Its declaration is:

/// <summary>
/// Abstract class that implements the machinery to asynchronously process metadata on Flickr. This can either be fetching metadata
/// or updating metadata.
/// </summary>
/// <typeparam name="TIn">The type of metadata that is processed.</typeparam>
/// <typeparam name="TOut">The type of metadata that is the result of the processing.</typeparam>
internal abstract class AsyncFlickrWorker<TIn, TOut>

It has the following public method:

/// <summary>
/// Starts the async process. This method should not be called when the asynchronous process is already in progress.
/// </summary>
/// <param name="metadataList">The list with <typeparamref name="TIn"/> instances of metadata that should
/// be processed on Flickr.</param>
/// <param name="resultCallback">A callback that receives the result. Is not allowed to be null.</param>
/// <typeparam name="TIn">The type of metadata that is processed.</typeparam>
/// <typeparam name="TOut">The type of metadata that is the result of the processing.</typeparam>
/// <returns>Returns a <see cref="WaitHandle"/> that can be used for synchronization purposes. It will be signaled when
/// the async process is done.</returns>
public WaitHandle BeginWork(IList<TIn> metadataList, EventHandler<AsyncFlickrWorkerEventArgs<TOut>> resultCallback)

It uses the generic class AsyncFlickrWorkerEventArgs<TOut> to report the results:

/// <summary>
/// Class with event arguments for reporting the results of asynchronously processing metadata on Flickr.
/// </summary>
/// <typeparam name="TOut">The "out" metadata type that is the result of the asynchronous processing.</typeparam>
public class AsyncFlickrWorkerEventArgs<TOut> : EventArgs

The subclass AsyncPhotoInfoFetcher is one of its implementations.

/// <summary>
/// Class that asynchronously fetches photo information from Flickr.
/// </summary>
internal sealed class AsyncPhotoInfoFetcher: AsyncFlickrWorker<Photo, PhotoInfo>

These async workers are used by the FlickrHelper class (BTW: this class has grown a bit too big, so it is a likely candidate for future refactoring). Its method that calls async workers is generic and has this signature:

/// <summary>
/// Processes a list of photos with multiple async workers and returns the result.
/// </summary>
/// <param name="metadataInList">The list with metadata of photos that should be processed.</param>
/// <param name="progressCallback">A callback to receive progress information.</param>
/// <param name="workerFactoryMethod">A factory method that can be used to create a worker instance.</param>
/// <typeparam name="TIn">The "in" metadata type for the worker.</typeparam>
/// <typeparam name="TOut">The "out" metadata type for the worker.</typeparam>
/// <returns>A list with the metadata result of processing <paramref name="metadataInList"/>.</returns>
private IList<TOut> ProcessMetadataWithMultipleWorkers<TIn, TOut>(
    IList<TIn> metadataInList,
    EventHandler<PictureProgressEventArgs> progressCallback,
    CreateAsyncFlickrWorker<TIn, TOut> workerFactoryMethod)

This method contains an anonymous delegate that acts as the result callback for the async workers. Generics and anonymous delegates make multithreaded life bearable in C# 2.0. Anonymous delegates allow you to use local variables and fields of the containing method and class in the callback method and thus easily access and change those to store the result of the worker thread. Of course, make sure you lock access to shared data appropriately, because multiple threads might call back simultaneously to report their results.
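Here is a stripped-down sketch of that callback pattern. Everything in it is a hypothetical stand-in rather than the tool's real code: ResultEventArgs<TOut> plays the role of the event arguments class, BeginSquares is a toy worker that squares numbers instead of talking to Flickr, and CollectSquares plays the role of the method that owns the anonymous delegate. The point is the captured local list and the lock around it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the event arguments class; illustration only.
public class ResultEventArgs<TOut> : EventArgs
{
    private readonly TOut result;
    public ResultEventArgs(TOut result) { this.result = result; }
    public TOut Result { get { return result; } }
}

public static class CallbackSketch
{
    // Toy worker: reports the square of each input through the callback on its own
    // thread and signals the returned WaitHandle when it is done.
    public static WaitHandle BeginSquares(IList<int> inputs, EventHandler<ResultEventArgs<int>> callback)
    {
        ManualResetEvent done = new ManualResetEvent(false);
        Thread worker = new Thread(delegate()
        {
            try
            {
                foreach (int n in inputs)
                {
                    callback(null, new ResultEventArgs<int>(n * n));
                }
            }
            finally { done.Set(); }
        });
        worker.IsBackground = true;
        worker.Start();
        return done;
    }

    // Starts two workers and gathers their results in one list.
    public static List<int> CollectSquares(IList<int> first, IList<int> second)
    {
        List<int> results = new List<int>();  // local variable captured by the delegate
        object sync = new object();           // guards 'results'

        // Both worker threads may invoke this delegate at the same time, hence the lock.
        EventHandler<ResultEventArgs<int>> onResult =
            delegate(object sender, ResultEventArgs<int> e)
            {
                lock (sync) { results.Add(e.Result); }
            };

        WaitHandle[] handles = new WaitHandle[]
        {
            BeginSquares(first, onResult),
            BeginSquares(second, onResult)
        };
        foreach (WaitHandle handle in handles) handle.WaitOne();

        results.Sort();  // arrival order depends on thread scheduling, so sort for a stable result
        return results;
    }
}
```

Without the lock, two threads calling List<T>.Add at the same moment can corrupt the list or silently drop a result, which is exactly the kind of bug that only shows up once in a while.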

And somewhere in 2010 when .NET 4.0 is released, I could potentially remove all this manual threading stuff and just exploit Parallel.For 😉
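For comparison, this is roughly what that could look like, as a sketch that assumes .NET 4.0 or later. The class ParallelSketch and its ProcessAll method are names invented for this example, and the process delegate stands in for whatever does the actual Flickr call. ParallelOptions.MaxDegreeOfParallelism sets an upper bound on the number of concurrent iterations; the runtime may decide to use fewer.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; requires .NET 4.0 or later.
public static class ParallelSketch
{
    // Applies 'process' to every input with at most 'maxConcurrency' iterations running at once.
    public static TOut[] ProcessAll<TIn, TOut>(TIn[] inputs, int maxConcurrency, Func<TIn, TOut> process)
    {
        TOut[] results = new TOut[inputs.Length];

        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = maxConcurrency;

        // Each iteration writes only to its own slot in 'results', so no lock is needed.
        Parallel.For(0, inputs.Length, options, i =>
        {
            results[i] = process(inputs[i]);
        });

        return results;
    }
}
```

Exceptions thrown inside the loop body are rethrown to the caller wrapped in an AggregateException, so the manual exception plumbing goes away as well.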
