Summary

We are going to implement an optimized image duplicate detection algorithm in Python using one of Python's parallel modules (PyCUDA, PyParallel). The goal of this project is to find, within a large image corpus, sets of images that are similar to one another. This is a significant problem in computer vision and has applications in many areas such as photo management.


Background

Our solution involves two steps: (1) a parallelizable hash algorithm that converts images into short binary hashes, and (2) a parallelizable indexing and searching algorithm that enables fast lookups of duplicates.


(1) Hash Algorithms

To enable efficient processing of images, we compute with compact image hashes instead of the full image data. We are going to use a locality-sensitive hashing (LSH) algorithm to hash images. In contrast to cryptographic hash functions such as MD5, which are designed to turn small changes in input into drastic differences in output, LSH algorithms ensure that similar images have hashes that are close in Hamming distance. There are a few candidate LSH algorithms. The most naive method is Average Hash, which divides an image into 64 blocks for a 64-bit hash and assigns each block a value of 0 or 1 depending on whether the average brightness of pixels in the block is above or below the average brightness of all blocks. Our parallel version will compute the average brightness of all blocks in parallel. We have also looked at several more robust LSH algorithms such as pHash, which uses the Discrete Cosine Transform (DCT). More information about hashing can be found here.
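As a concrete reference point, here is a minimal serial sketch of Average Hash, assuming Pillow and NumPy are available; shrinking the image to an 8x8 grayscale thumbnail plays the role of the 64 blocks described above, and the function name is our own placeholder.

    import numpy as np
    from PIL import Image

    def average_hash(path, hash_size=8):
        # Convert to grayscale and shrink so each pixel stands in for one block.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        blocks = np.asarray(img, dtype=np.float32)
        # A block's bit is 1 if it is brighter than the mean brightness of all blocks.
        bits = (blocks > blocks.mean()).flatten()
        # Pack the 64 bits into a single integer hash.
        return sum(1 << i for i, bit in enumerate(bits) if bit)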


(2) Search Algorithms

After generating LSH binary hashes for all images, we need a method to find similar hashes quickly. We consider two images to be similar if their hashes are close in Hamming distance, the Hamming distance being the number of bit positions at which the two hashes differ. Since we are concerned with finding similar rather than identical images, we are interested in images whose hashes lie within a Hamming ball of radius r around the target image's hash. If we store image information in one large hash table indexed by the n-bit hashes, each lookup has time complexity $O(\sum_{k=0}^{r} \binom{n}{k})$, since we must probe every hash within distance r of the target. This does not scale well, as the number of probes grows combinatorially with r. The alternative, multi-index hashing, is much more efficient and parallelizable. The big idea is that on insertion, each hash is divided into contiguous shorter substrings, which are inserted into separate hash tables; on lookup, the target hash is divided into the same corresponding substrings and each hash table is probed separately. This offers much opportunity for parallelism, as the insertion into and lookup of each hash table can be done in parallel. More information about hash indexing can be found in this paper.
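The following is a simplified sketch of this multi-index scheme for 64-bit hashes, assuming the hash is split into m contiguous substrings and, for simplicity, that only exact substring matches are probed (which suffices when r < m); the class and variable names are illustrative, not final.

    from collections import defaultdict

    class MultiIndexHashTable:
        def __init__(self, n_bits=64, m=4):
            self.m = m
            self.sub_bits = n_bits // m                  # bits per substring
            self.tables = [defaultdict(set) for _ in range(m)]
            self.full_hashes = {}                        # image_id -> full hash

        def _substrings(self, h):
            # Split h into m contiguous chunks of sub_bits bits each.
            mask = (1 << self.sub_bits) - 1
            return [(h >> (i * self.sub_bits)) & mask for i in range(self.m)]

        def insert(self, image_id, h):
            # The m substring insertions are independent and can run in parallel.
            self.full_hashes[image_id] = h
            for table, sub in zip(self.tables, self._substrings(h)):
                table[sub].add(image_id)

        def lookup(self, h, r):
            # Pigeonhole: if two hashes differ in at most r bits, at least one of
            # the m substrings differs in at most r // m bits.  With r < m, that
            # substring matches exactly, so probing each table once is enough.
            candidates = set()
            for table, sub in zip(self.tables, self._substrings(h)):
                candidates |= table.get(sub, set())
            # Verify each candidate against the full Hamming distance.
            return [i for i in candidates
                    if bin(self.full_hashes[i] ^ h).count("1") <= r]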


Parallelism occurs not just at the algorithmic level described above, but also at the task level. Since images are loaded in bulk, duplicate detection can proceed in parallel across all images once their hashes have been computed and inserted into the appropriate hash tables.
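As a rough illustration of this task-level parallelism, the hashing stage could be driven by a CPU-side worker pool along these lines; in our implementation the per-image work would be offloaded through PyCUDA rather than done in pure Python, and average_hash refers to the sketch above.

    from multiprocessing import Pool

    def hash_corpus(image_paths):
        # One task per image; every image is hashed independently, in parallel.
        with Pool() as pool:
            return dict(zip(image_paths, pool.map(average_hash, image_paths)))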


Challenges

One of the bottlenecks we foresee for this project is file I/O. In the preliminary hashing stage, we will compute hashes for all images in parallel. However, computing image hashes involves reading image files stored on disk, which makes the stage I/O-bound. This can severely limit the amount of parallelism we can achieve in the hashing stage, as the disk and memory bus cannot cope with a large number of simultaneous file I/O requests. To build a scalable parallel image duplicate detection system, we have come up with a potential solution to this problem: we can store the decoded image information (the RGBA values of each pixel) in our database (possibly as a Python list) as soon as an image is uploaded, so that the hashing stage only has to query these values instead of loading the image from disk. Since it is difficult to achieve any degree of parallelism with I/O-bound operations, we have essentially tried to separate the I/O-intensive and computation-intensive parts of the hashing stage in order to fully exploit the potential for parallelism.
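A sketch of this separation, with a plain in-memory dictionary standing in for the database we have not yet chosen:

    import numpy as np
    from PIL import Image

    pixel_store = {}   # image_id -> decoded RGBA pixel array (stand-in for the database)

    def on_upload(image_id, path):
        # The disk read and decode happen once, at upload time.
        pixel_store[image_id] = np.asarray(Image.open(path).convert("RGBA"))

    def pixels_for_hashing(image_id):
        # The hashing stage only queries the store; no file I/O is involved.
        return pixel_store[image_id]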


Another challenge is deciding how to store the hashes. We are currently debating between two options: a MySQL database and Python dictionaries. With MySQL we expect optimized lookups and concurrency-safe operations, but insertions might be slow due to its internal locking overhead. Python dictionaries might be more efficient on small datasets but are unlikely to scale. This is a problem we are still working on.


We do not expect high communication traffic, as the nature of this problem is such that each image upload and duplicate search (for a given target image) can happen independently, without the need to exchange information with other processes.


Resources

We have not found any codebase that matches our requirements, so we are likely to be starting from scratch. We have found this article on hashing algorithms and this paper on multi-index hashing.


We will be using the PyCUDA module for our implementation, so we need this module installed on the GHC5205 machines, which have the GTX 480 GPUs that we need to run performance tests on. We chose the GTX 480 GPUs because we have ready access to machines with these chips. Having written a circle renderer for this architecture, we are also familiar with writing programs optimized for it.
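A minimal PyCUDA sanity check of the kind we plan to start with, just to confirm the module and GPU are usable on these machines (the array contents are arbitrary):

    import numpy as np
    import pycuda.autoinit                        # creates a CUDA context on import
    import pycuda.gpuarray as gpuarray

    brightness = np.random.rand(1024).astype(np.float32)
    brightness_gpu = gpuarray.to_gpu(brightness)  # host -> device copy
    total = gpuarray.sum(brightness_gpu).get()    # reduction runs on the GPU
    print(total, brightness.sum())                # should agree up to float32 rounding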


Goals/Deliverables


Definite Goals

We want to implement two versions of the image duplicate detection system: a serial version and a parallel version. We expect our parallel implementation to achieve roughly 50x speedup over the serial implementation, since each individual image computation is independent and the system is therefore highly parallelizable. We also expect the hashing stage to yield less speedup than the searching stage, as hashing requires many writes to the database, which have higher latency than the reads that make up most of the searching stage.


Reach Goals

Should time permit, we will create a web application with a front end that allows users to upload images and a back end that processes the images and finds duplicates. This would be a nice addition to our demo.


Demo

We will prepopulate our database with images, run the duplicate search algorithm, and display the image duplicates found, with metrics on accuracy and precision.
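One reasonable way to report these metrics is precision and recall over duplicate pairs. A sketch, assuming both the ground-truth and detected duplicates are represented as sets of image-id pairs (the function name is ours):

    def precision_recall(found_pairs, true_pairs):
        # found_pairs / true_pairs: sets of (image_id, image_id) duplicate pairs.
        true_positives = len(found_pairs & true_pairs)
        precision = true_positives / len(found_pairs) if found_pairs else 1.0
        recall = true_positives / len(true_pairs) if true_pairs else 1.0
        return precision, recall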


Platform Choice

We are using Python as both of us are very familiar with the language. Also, considering that we are building the system from scratch, we wanted to use a language that allows us to create a complex system relatively quickly.


Schedule

Week 1: 5 April to 11 April
  • Read PyCUDA documentation and get it working on small examples
  • Image uploading module
  • Naive image hashing implementation

Week 2: 12 April to 18 April
  • Multi-index hashing implementation (have a working system)
  • Generate image test set (generate artificial image duplicates)
  • Store image duplicate truth (store actual image duplication information for comparison)
  • Store image duplicate search results
  • Compare results with truth and generate accuracy/precision metrics
  • Write checkpoint report

Week 3: 19 April to 25 April
  • Introduce parallelism

Week 4: 26 April to 2 May
  • Debug
  • (Optional) Create web application

Week 5: 3 May to 9 May
  • Debug
  • Write report
  • Prepare presentation