CNK's Blog

Image Audit

I am upgrading a very old Wagtail project - initially built on Wagtail 1.3, long before Wagtail introduced the ReferenceIndex or started using file_hash to check for duplicate images during upload. After upgrading to Wagtail 6.3, and as part of moving the site to new hosting, I decided to clean up some of the duplicates.

Duplicate Images

Here is the basic query for duplicate images:

    SELECT id, file, file_hash FROM core_customimage
    WHERE file_hash IN (
        SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;
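
The same grouping logic is easy to sketch in plain Python, which is handy for poking at the results in a shell. Here the function name find_duplicate_groups is mine, and the (id, file_hash) pairs could come from CustomImage.objects.values_list("id", "file_hash"):

```python
from collections import defaultdict

def find_duplicate_groups(rows):
    """Group (id, file_hash) pairs and keep only the hashes that occur
    more than once - the same GROUP BY ... HAVING count(*) > 1 as above."""
    groups = defaultdict(list)
    for image_id, file_hash in rows:
        groups[file_hash].append(image_id)
    return {h: ids for h, ids in groups.items() if len(ids) > 1}
```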

Is it safe to delete all but one of the duplicates? We can't tell from that query alone; we need to find out which of these copies (if any) are in use. To do that, we join out to the reference index (on a site upgraded from an old version, make sure it is populated first with ./manage.py rebuild_references_index). In my install, my custom image model has a content type id of 37; you can find yours in a Django shell with ContentType.objects.get_for_model(CustomImage).id. Any rows that have NULL for the reference index id column are NOT referenced anywhere else on the site. Those can safely be deleted.

    SELECT core_customimage.id, file, file_hash, wagtailcore_referenceindex.id
    FROM core_customimage
    LEFT OUTER JOIN wagtailcore_referenceindex
         ON wagtailcore_referenceindex.to_object_id = core_customimage.id
         AND to_content_type_id = 37
    WHERE file_hash IN (
      SELECT file_hash FROM core_customimage GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Once I had deleted all the duplicate images that were not used anywhere, I was left with a few where both copies of an image were in use. Since there was just a handful, I used the Wagtail admin UI to locate where the images were being used and edited those pages to use only one of the two copies. Then I could safely delete the other, now unused, copies.
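
That keep-or-delete decision can be written down as a small helper - a sketch with names of my own choosing, where referenced_ids would come from the reference index query above:

```python
def duplicates_to_delete(dup_groups, referenced_ids):
    """Return image ids that are safe to delete.

    dup_groups maps a file_hash to the list of image ids sharing it;
    referenced_ids is the set of ids that have a row in
    wagtailcore_referenceindex. Referenced copies are always kept, and if
    no copy is referenced, the first one is kept so the image survives.
    """
    referenced = set(referenced_ids)
    to_delete = []
    for ids in dup_groups.values():
        unreferenced = [i for i in ids if i not in referenced]
        if len(unreferenced) == len(ids):  # no copy is in use; keep one anyway
            unreferenced = unreferenced[1:]
        to_delete.extend(unreferenced)
    return to_delete
```

The resulting ids could then be deleted in a shell with CustomImage.objects.filter(id__in=...).delete().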

Missing Image Files

I also had some situations where I thought the original image might be missing or corrupt. In a previous project, I had used code like this to check that each image was still in S3, where my database thinks it should be:

    from django.core.files.storage import default_storage

    # Adjust this import to wherever your image model lives;
    # the table name core_customimage suggests an app named "core".
    from core.models import CustomImage

    for image in CustomImage.objects.all():
        if not default_storage.exists(image.file.name):
            print(f"Image {image.title} (id: {image.id}, collection: {image.collection_id}) is missing {image.file}")

However, because of the way I have my S3 bucket configured, exists was returning False for all images - even those I could see in the browser. This appears to have something to do with the details of how boto3 makes HEAD requests - and perhaps I didn't have my credentials configured correctly in the shell I was using for testing. In any case, since my image URLs are public, instead of fighting with exists, I used the Python requests library to check that the images exist and are publicly available.

    import requests

    bucket_name = "my-bucket"  # replace with your S3 bucket name

    for img in CustomImage.objects.all():
        url = f"https://{bucket_name}.s3.amazonaws.com/{img.file.name}"
        response = requests.head(url, timeout=10)
        if response.status_code == 200:
            # print(f"found {url}")
            continue
        print(f"File check failed: {img.id}, {response.status_code}, {img.file.name}")
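
If you want to dry-run that loop without hitting S3, the status check can be separated from the HTTP call. This is a sketch (audit_urls is my name for it); head is any callable that maps a URL to a status code, such as lambda u: requests.head(u, timeout=10).status_code:

```python
def audit_urls(urls, head):
    """Return (url, status) pairs for every URL whose HEAD check is not 200.

    `head` is any callable taking a URL and returning an HTTP status code.
    Passing it in lets you exercise the loop without touching the network.
    """
    failures = []
    for url in urls:
        status = head(url)
        if status != 200:
            failures.append((url, status))
    return failures
```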

Document Checks

We can run the same checks to identify duplicate documents. Again I hard-coded the content type id; you will need to figure out what this should be for your installation - in a Django shell, ContentType.objects.get_for_model(Document).id will tell you.

    SELECT id, file, file_hash FROM wagtaildocs_document
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

    SELECT wagtaildocs_document.id, file, file_hash, wagtailcore_referenceindex.id
    FROM wagtaildocs_document
    LEFT OUTER JOIN wagtailcore_referenceindex
            ON wagtailcore_referenceindex.to_object_id = wagtaildocs_document.id
            AND to_content_type_id = 5
    WHERE file_hash IN (
        SELECT file_hash FROM wagtaildocs_document GROUP BY file_hash HAVING count(*) > 1
    )
    ORDER BY file_hash;

Because the documents are private and live in the default storage, we can use the exists check that I had trouble with for images.

    from django.core.files.storage import default_storage
    from wagtail.documents.models import Document  # or wagtail.documents.get_document_model()

    for doc in Document.objects.all():
        file_path = doc.file.name
        if not default_storage.exists(file_path):
            print(f"The file does NOT exist in S3: {file_path}")