A man in a hat

The art of image description

8th March 2023

By Eleanor Margolies

‘All photographs – not only those that are so called “documentary”… can be fortified by words.’ – Dorothea Lange

Standing next to photographs, captions can reflect on them both intimately and indirectly. For example, Dorothea Lange sometimes juxtaposed her photographs with two or three sentences spoken by the subjects themselves, adding information that isn’t visible in the image such as ‘locations, forms of labor, routes of travel, racial and ethnic differences’ (in An American Exodus: A Record of Human Erosion, with Paul Schuster Taylor, 1939).

Often, however, words are simply used to point at images, whether in a passing reference (‘As the graph below shows…’) or an identifying caption (‘Lord Kitchener’). The assumption is that readers can see the image for themselves, and that all the readers see the same thing. But the dialogue between image and text is taking on a new significance in digital publication, going beyond literal description to analyse and question visual culture.

When you take a digital photo, metadata about the image – the location, date and time when it was taken, the size of the file, the type of device used – is automatically stored alongside the picture. When you send the picture on to a friend, this metadata travels with it. If you open the photo up in editing software, or post it on Twitter or Instagram, you have the option to add another field of metadata, a short account of the image in words. This sentence or two of description, categorised as ‘alt text’, is crucial for blind and partially sighted people who use screen reading software to access websites. Screen readers work their way across webpages reading aloud all the text they find, including the mechanics of online navigation such as links, menus and buttons.

If a screen reader comes to an image with embedded alt text, it reads it aloud: ‘A man in a black hat’. If there’s no alt text, there’s a hole in the meaning.
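
For readers who want to see the mechanics, here is a minimal Python sketch of the behaviour described above. It is an illustration under stated assumptions, not a model of any real screen reader: the class `ImageAnnouncer`, the function `announce_images`, the sample page and the bracketed placeholder phrases are all invented for this example. Real screen readers vary, and some fall back to reading out a file name when alt text is missing.

```python
from html.parser import HTMLParser


class ImageAnnouncer(HTMLParser):
    """Collects what a simplified, hypothetical screen reader might say per image."""

    def __init__(self):
        super().__init__()
        self.announcements = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if "alt" not in attributes:
            # No alt attribute at all: the listener is left with a gap.
            self.announcements.append("[image with no description]")
            return
        # A bare `alt` with no value is parsed as None; treat it as empty.
        alt = (attributes["alt"] or "").strip()
        if alt == "":
            # Empty alt text conventionally marks a decorative image.
            self.announcements.append("[decorative image, skipped]")
        else:
            self.announcements.append(f"Image: {alt}")


def announce_images(html):
    """Return one announcement per <img> tag, in document order."""
    parser = ImageAnnouncer()
    parser.feed(html)
    parser.close()
    return parser.announcements


# Hypothetical sample page: file names are made up for illustration.
sample_page = """
<p>Two stills from Westerns:</p>
<img src="robbery.jpg" alt="A man in a black hat">
<img src="divider.png" alt="">
<img src="harder.jpg">
"""

for line in announce_images(sample_page):
    print(line)
```

Run as written, the sketch announces the first image's alt text, passes over the second (an empty `alt` is the usual way to mark an image as purely decorative), and has nothing to say about the third: the hole in the meaning.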

A photo of a cowboy pointing a pistol straight at the viewer. He has curly hair and a moustache and is wearing a ten-gallon hat tipped back on his head. The image is a still from the 1903 film The Great Train Robbery, and has been lightly colourised. — The Great Train Robbery, 1903 (Photo: www.moma.org/learn/moma_learning/edwin-s-porter-the-great-train-robbery-1903)

You can leave it to artificial intelligence to generate alt text automatically, but the results are lifeless lists of generic objects or activities. Microsoft Word describes the image above as ‘a person with a mustache’, and the image below as a ‘picture containing person, sky, outdoor, standing’.

A still from the 2021 Western film The Harder They Fall showing three people walking towards us. The man in the centre is black, with a greying beard, and wears a ten-gallon black hat, a white long-sleeved vest and striped prison trousers, with the striped jacket of the uniform over his arm. To the right, another black man in a cowboy hat and long coat; to the left, a black woman in leather jacket and black hat. — The Harder They Fall, 2021

Algorithms for image recognition are fed banks of sample material to ‘learn from’. When you assert that ‘I’m not a robot’ and undertake ‘verification’ by clicking on ‘all the pictures containing traffic lights’ (or ambulances or motorcycles), you are teaching the machine to do better at identifying objects in photos, mainly to improve the navigation systems of self-driving cars. The material fed to the algorithms determines what they can recognise: gaps in the source material or the biases of the programmers will shape what the software is able to ‘see’. Notoriously, the Google algorithms that had been fed a restricted diet of ‘white people on the internet’ failed to recognise Black people in photos, perpetuating historical erasure and absorbing racist slurs. In this context, why wouldn’t any writer want to take the tool of image description back into their own hands?

Shannon Finnegan and Bojana Coklyat, disabled artists and activists, make a powerful case for treating the writing of alt text as a creative challenge, a chance to play within constraints, as with translation or poetry. In ‘Alt text as poetry’ [https://alt-text-as-poetry.net/] they quote a short description by Madison Zalopany – ‘A screenshot of me being very impressed by my nephew Harry’s new hat. The hat is a plastic green roof taken from a doll’s house’ – to illustrate how writers can play with word choice, sequencing and comic timing within this short form.

Meanwhile, images on Instagram and other social media are increasingly accompanied by longer ‘image descriptions’. These have an access function, making the visual material available to blind or partially sighted audiences, and also provide a space to expand the convention of the ‘photo credit’, providing more information than the usual name of photographer and subjects. Who brought into existence the moody scene posted on Instagram by a theatre company? The people responsible might include the designers of set, costume, projection and lighting, the artists responsible for props, puppets, wigs, hair and make-up, the movement directors and choreographers. The campaign to #CreditTheCreatives has taken off in digital spaces, where image descriptions can also become polemical, using hashtags to locate imagery in debates or discourse that may be unexpected or challenging.

The practice of describing images takes away the ‘obviousness’ of seeing. Georgina Kleege, a professor of English at the University of California, Berkeley, who has written on her own experience of blindness in Sight Unseen (1999) and on representations of blindness in art and access to visual art for blind people in More Than Meets the Eye (2018), has described the pedagogical value of ‘mutual description’. When students describe images for each other, they engage more deeply with visual meaning. For Kleege, the practice began in a fiction writing workshop, when a student commented with surprise on a back cover photograph of Raymond Carver. She asked what it was that surprised him. She goes on: ‘He was unable to explain and showed the photo to another student who agreed that Carver’s photo did not match his expectations. Soon, other students weighed in, pointing out specific aspects of the photograph, the author’s expression, posture and clothing, set against what they had come to believe about the author from his literary aesthetic. There then followed a lengthy discussion of the photograph with what turned out to be very rich analysis of the visual rhetoric of the image, and the convention of authors’ photographs in general.’

Georgina Kleege and Scott Wallin, ‘Audio Description as a Pedagogical Tool’ in Disability Studies Quarterly, Vol. 35 No. 2 (2015).

Similarly, research by Rachel Hutchinson and Alison Eardley on the experience of museum visitors shows that those who listen to audio descriptions of works preserve richer memories of the works, both in detail and affect, than visitors who haven’t experienced the description. They suggest that this is not because the guided listeners spent a longer time looking, but might be attributed to the multi-sensory character of descriptions and the way they construct narrative. Examples of this kind of guided looking in action are found in the regular ‘Slow Looking’ events at The Photographers’ Gallery, designed to be accessible to blind and visually impaired visitors while open to all. Four or five photographs are described, with the discussions that follow ranging over expressions and gestures, photographic composition, details of architecture and clothing, personal associations and shared cultural references.

A translation from image to words, like a translation between two languages, always involves choices: it is subjective, creative. As Finnegan and Coklyat point out, ‘what we see and how we name it is political’. They ask: ‘When and how do we describe race, gender, disability status, age, height, weight, etc? How do we acknowledge visual cues about the expression of identity without making assumptions about how a person identifies? How do we decide what information about a person is important to understanding the image? How do we respond to the fact that many people have made intentional and specific choices about language related to their identity, but we may not know them or the choices that they’ve made?’

https://alt-text-as-poetry.net/

When writing an image description, I find myself crawling over the surface of the image. I look at who is pictured, but also how their bodies are arranged in space and the relationships between them. Who is looking at who? I go back and forth between foreground and background, wondering what to mention first, since unlike the (apparent) simultaneity of visual impressions, words always have to follow one after another. What was it that first caught my eye? Those shoes, with the pale blue ankle socks, must be a deliberate reference to The Wizard of Oz. Will everyone get the idea if I call them ‘ruby slippers’? And does it matter if not everyone gets the same idea?

Dutifully listing the contents of a photograph, I might succeed in capturing its literal subject, but miss the point, or what Roland Barthes in Camera Lucida (1980) calls the ‘punctum’, the detail that catches at the heart: ‘The punctum of a photograph is that accident which pricks me (but also bruises me, is poignant to me).’ Paying attention to the punctum allows me to notice what I don’t understand – the photograph seems to be telling me that this detail is important, but I don’t know why. I become aware of the limitations of my experience, my own range of cultural references. If I can, I ask for help from the creator of the image, or someone who knows the world it portrays.

The alt text suggested by Microsoft Word for the book cover below is ‘A person wearing a hat’ – a description ‘automatically generated’ with ‘medium confidence’.

The description would remain inadequate for access purposes even if we were to add the information that the person portrayed is Aldwyn Roberts (stage name ‘Lord Kitchener’). Machine-generated descriptions can give us some useful information about things. But they can’t find human meaning in images: they don’t tell us about atmosphere, relationships, context or history; they never tell us why the hat.