My thoughts on algorithms for image aesthetics

I have worked on image processing algorithms on and off for nearly 30 years. I don’t have much to show for it, because in reality I found it hard to build on algorithms that already existed. What am I talking about – don’t all techniques evolve? Well, yes and no. What I have learned over the years is that although it is possible to create unique, automated algorithms to process images, in most cases it is very hard to make those algorithms generic, i.e. to apply the algorithm to all images and get aesthetically pleasing results. And I am talking about image processing here, i.e. improving or changing the aesthetic appeal of images, not image analysis, whereby the information in an image is extracted in some manner – there are some good analysis algorithms out there, especially in machine vision, but predominantly for tasks that involve repetition in controlled environments, such as food production/processing lines.

The number one thing to understand about the aesthetics of an image is that they are completely subjective. In fact image processing would be better termed image aesthetics, or aesthetic processing. Developing algorithms for sharpening an image is all well and good, but it has to actually make a difference to an image from the perspective of human perception. Take unsharp masking for example – it is the classic means of applying sharpening to an image. I have worked on enhanced algorithms for sharpening, involving morphological shapes that can be tailored to the detail in an image, and while they work better, for the average user there may not be any perceivable difference. This is especially true of images obtained using modern sharp optics.
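
For reference, here is a minimal sketch of classic unsharp masking in Python – blur the image, take the difference between the original and the blur, and add a scaled copy of that detail back. The radius and amount values are illustrative assumptions, not settings from any particular tool, and the sketch assumes a single-channel 8-bit image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(image, radius=2.0, amount=1.0):
    """Classic unsharp masking on a grayscale 8-bit image:
    sharpened = original + amount * (original - blurred)."""
    img = image.astype(float)
    blurred = gaussian_filter(img, sigma=radius)   # the "unsharp" copy
    sharpened = img + amount * (img - blurred)     # add the detail back
    return np.clip(sharpened, 0, 255).astype(np.uint8)
```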

How does an algorithm perceive this image? How does an algorithm know exactly what needs sharpening? Does an algorithm understand the aesthetics underlying the use of bokeh in this image?

Part of the process of developing these algorithms is understanding the art of photography, and how simple things like lenses, and the various methods of taking a photo, affect the outcome. If you ignore all of that and deal only with the mathematical side of things, you will never develop a worthy algorithm. Or possibly you will, but it will be too complicated for a user to understand, let alone use. As for algorithms that supposedly quantify aesthetics in some manner – they will never be able to aesthetically interpret an image in the same way as a human.

Finally, improving the aesthetic appeal of an image can never be completely given over to an automated process, although the algorithms provided in many apps these days are good. Aesthetic manipulation is still a very fluid, dynamic, subjective process, accomplished best through the use of tools in an app, making subtle changes until you are satisfied with the outcome. The problem with many academically motivated algorithms is that they are driven more from a mathematical stance than an aesthetic one.

Japanese Are-Bure-Boke style photography

Artistic movements don’t arise out of a void, and many factors contributed to the changes in Japanese society. Following World War II, Japan was occupied by the United States, leading to the introduction of Western popular culture and consumerism, aptly termed Americanization. The blend of modernity and tradition was bound to make waves, magnified by the turbulent changes occurring in Western society in the late 1960s, e.g. the demonstrations against the Vietnam War. In the late 1960s, Japan’s rapid economic growth began to falter, exposing a fundamental opposition to Japan’s postwar political, economic and cultural structure, which led to a storm of protests by the likes of students and farmers.

This turmoil had a long-term effect on photography, forcing a rethink of how it was perceived. In November 1968 a small magazine called Provoke was published, conceived by art critic Koji Taki (1928-2011) and photographer Takuma Nakahira, with poet Takahiko Okada (1939-1997) and photographer Yutaka Takanashi as dojin members. Daido Moriyama joined for the second and third issues, bringing with him his early influences of Cartier-Bresson. The subtitle of the magazine was “Provocative Materials for Thought”, and each issue was composed of photographs, essays and poems. The magazine had a lifespan of three issues, the Provoke members disbanding due to a lack of cohesion in their ideals.

The ambitious mission of Provoke to create a new photographic language that could transcend the limitations of the written word was declared with the launch of the magazine’s first issue. The year was 1968 and Japan, like America, was undergoing sweeping changes in its social structure.

Russet Lederman, 2012

The aim of Provoke was to rethink the relationship between word and image, in essence to create a new language. It was to challenge the traditional view of the beauty of photographs, and their function as narrative, pictorial entities. The photographs were fragmented images that rethought the established aesthetic of photography. The photographs they published were a collection of “coarse, blurred and out-of-focus” images, characterized by the phrase Are‑Bure‑Boke (pronounced ah-reh bu-reh bo-keh), which roughly translates to “rough, blurred and out-of-focus”, i.e. grainy (are), blurry (bure) and out-of-focus (boke).

An example of Daido Moriyama’s work.

They tried random triggering, they shot into the light, they prized miss-shots and even no-finder shots (in which no reference is made to the viewfinder). This represented not just a new attitude towards the medium, but a fundamentally new outlook on reality itself. That is not to say that every photograph had the same characteristics, because there are many different ways of taking a picture. The unifying characteristic was a willingness to push beyond the static boundaries of traditional photographic aesthetics. Provoke provided an alternative understanding of the post-war years, one that had traditionally been quite Western-centric.

How do we perceive photographs?

Pictures are flat objects that contain pigment (either colour or monochrome), and are very different from the objects and scenes they represent. Of course pictures must be something like the objects they depict, otherwise they could not adequately represent them. Let’s consider depth in a picture. In a picture, it is often easy to find cues relating to the depth of a scene. Depth-of-field often manifests itself as a region of increasing blur away from the object that is in focus. Other possibilities are parallel lines that converge in the distance, e.g. railway tracks, or objects that are partially blocked by closer objects. Real scenes do not always offer such depth cues, as we perceive “everything” in focus, and railway tracks do not converge to a point! In this sense, pictures are very dissimilar to the real world.

If you move while taking a picture, the scene will change. Objects that are near move more in the field-of-view than those that are far away. As the photographer moves, so too does the scene as a whole. Take a picture from a moving vehicle, and the near scene will be blurred, the far scene less so, regardless of the speed (motion parallax). This, then, is an example of a picture for which there is no real-world scene.

A photograph is all about how it is interpreted

Photography, then, is not about capturing “reality”, but rather capturing our perception, our interpretation of the world around us. It is still a visual representation of a “moment in time”, but not one that necessarily represents the world around us accurately. All perceptions of the world are unique, as humans are individual beings, with their own quirks and interpretations of the world. There are also things that we can’t perceive. Humans experience sight through the visible spectrum, but UV light exists, and some animals, such as reindeer, are believed to be able to see in UV.

So what do we perceive in a photograph?

Every photograph, no matter how painstaking the observation of the photographer or how long the actual exposure, is essentially a snapshot; it is an attempt to penetrate and capture the unique esthetic moment that singles itself out of the thousands of chance compositions, uncrystallized and insignificant, that occur in the course of a day.

Lewis Mumford, Technics and Civilization (1934)

How do we perceive depth from flat pictures?

Hang a large, scenic panorama on a wall, and the picture of the scene looks like the scene itself. Photographs are mere imitations of life, albeit flat renditions. Yet although they represent different realities, there are cues on the flat surface of a photograph which help us perceive the scene in depth. We perceive depth in photographs (or even paintings) because the same type of information reaches our visual system from photographs of scenes as from the scenes themselves.

Consider the following Photochrom print (from the Library of Congress) of the Kapellbrücke in the Swiss city of Lucerne, circa 1890-1900. There is no difficulty perceiving the scene as it relates to depth. It is possible to identify buildings and objects in the scene, and to obtain an understanding of the relative distances of objects from one another. These things help define its “3D”-ness. The picture can also be seen from another perspective. The buildings on the far side of the river get progressively smaller from left to right along the river, and the roof of the bridge is much larger in the foreground than it is in the distance. There is no motion parallax – the relative movement of near and far objects that we would see were we physically moving around the scene. These things work together to define our perception of the print’s flatness.

Kapellbrücke in Lucerne
Fig. 1: Flatness – The Kapellbrücke in Lucerne

Our perception of the 3D nature of a flat photograph comes from the similarity between the information reaching the human visual system from an actual 3D scene and that from a photograph of the same scene.

What depth cues exist in an image?

  • Occlusion – i.e. overlapping or superimposition. If object A overlaps object B, then it is presumed object A is closer than object B. The water tower in Fig. 1 hides the buildings on the hill behind it, hence it is closer.
  • Converging lines – As parallel lines recede into the distance, they appear to become closer together. The bridge’s roofline in Fig. 1 narrows as it moves higher in the picture.
  • Relative size – Objects that are larger in an image are perceived to be closer than those which are further away. For example, the houses along the far riverbank in Fig. 1 are roughly the same height, but become smaller as they progress from the left of the picture towards the centre.
  • Lighting and shading – Lighting is what brings out the form of a subject/object. The scene in Fig. 1 is lit more strongly from one side, which shows effectively in the water tower: it has a light side and a side in shadow. This provides information about the shape of the tower.
  • Contrast – For scenes where there is a large distance between objects, those further away will have a lower contrast, and may appear blurrier.
  • Texture gradient – The amount of detail on an object helps us understand depth. Objects that are closer appear to have more detail, and as detail is lost, those areas are perceived to be further away.
  • Height in the plane – An object closer to the horizon is perceived as being more distant than objects above or below it.

Examples of some of these depth cues are explained visually below.

Examples of depth cues in pictures

What is motion parallax?

Motion parallax is one of those perceptual things you notice most when looking out the window of a fast-moving vehicle, like a train. It refers to the fact that objects moving at a constant speed across the frame will appear to move a greater amount if they are nearer to the observer (or camera) than if they were at a great distance (parallax = change in position). This holds whether (i) the observer/camera is moving relative to the object, or (ii) the object itself is moving relative to the observer/camera. The rationale for this effect has to do with the distance the object moves relative to the percentage of the camera’s field of view that it crosses. This provides perceptual cues about differences in distance and motion, and is associated with depth perception.

Consider the example below simulating taking a photograph out of a moving vehicle. The tree that is 300 m away will move 20 m in a particular direction (opposite the direction of the vehicle), but only traverse 25% of the field-of-view. The closer tree, which is only 100 m away, will move out of the frame completely with the same 20 m displacement.
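
The geometry behind those numbers can be sketched in a few lines of Python. The 15° horizontal field of view below is a hypothetical value chosen to match the example’s figures, not something stated in the original:

```python
import math

def fov_fraction(lateral_shift_m, distance_m, fov_deg=15.0):
    """Fraction of the horizontal field of view an object crosses for a
    given sideways displacement at a given distance from the camera."""
    angle_deg = math.degrees(math.atan2(lateral_shift_m, distance_m))
    return angle_deg / fov_deg

for distance in (300, 100):
    # the same 20 m displacement, at two different distances
    print(f"{distance} m away: {fov_fraction(20, distance):.0%} of the frame")
```

With these assumptions the far tree crosses roughly a quarter of the frame, while the near tree crosses about three quarters – enough, depending on where it starts, to carry it out of the frame entirely.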

Motion parallax is an attribute of perception, so it exists in real scenes, but not when one views a photograph. Can a photograph contain artifacts of motion parallax? Yes, and it is easy – just take a photograph from a moving vehicle (trains are best), using a relatively slow shutter speed. The picture below was taken on the VIA train to Montreal, using my iPhone pressed up against the glass, with the focus plane approximately in the middle of the window.

A ballad of the senses

Memories made when you’re an infant aren’t really that accessible when you get older. That’s because humans generally experience something scientists term infantile amnesia – rapid neuron growth disrupts the brain circuitry that stores old memories, making them inaccessible (they are not lost, just tucked away). Of course you don’t want to remember everything that happens in life… that would clog our brains with a bunch of nothingness. But we all have selective memories from infancy which we can visualize when they are triggered. For me there are but a couple, and they are usually triggered by an associative sense.

The first is the earthy smell of a cellar, which triggers fleeting memories of childhood times at my grandmother’s house in Switzerland. The second is of the same time and place – the deep smell of wild raspberries. These memories are triggered by the olfactory senses, making the visual, however latent, emerge even if only for a brief moment. It is no different to the other associations we make between vision, smell, and taste. Dragonfruit is a beautiful-looking tropical fruit, but it can have a bitter/tart taste. Some of these associations have helped us survive over the millennia.

Raspberries on a bush.
Mmmm… raspberries… but you can’t smell them, or taste the ethyl formate (the chemical partially responsible for their flavour)

It makes you wonder, then, whether these sense-experiences allow us to better retain memories. If you travel to somewhere like Iceland and take a picture of a geyser, you may also smell faint wisps of sulphur. There is now an association between a photograph of a geyser and physically experiencing it. The same could be said of the salty Atlantic air of Iles de la Madeleine, or the resinous smell of walking through a pine forest. Memory associations. Or maybe an Instagram of a delicious ice cream from Bang Bang ice-cream. Again, an association. But how many of the photos we view lack context because we have no association between the visual and information gathered from our other senses? You can view a picture of the ice cream on Instagram, but you won’t know what it tastes or smells like, and therefore the picture only provides half the experience.

When visual data becomes a dull noise

There was a time when photographs had meaning, held our attention, embedded something inside our minds. Photographs like The Terror of War, taken by Nick Ut in 1972 during the Vietnam War. But the digital age has changed the way we consume photographs. Every day we are bombarded with visual content, and due to the sheer volume, most of it makes little if any lasting impact.

Eventually, the visual data around us becomes an amalgam of blurriness and noise, limiting the amount of information we gain from it.

The human visual system is extremely adept at processing visual information. It can process something like 70 images per second [1,2], and identify images seen for as little as 13 milliseconds. But it was never really designed to see the variety of visual data now thrust at it. When we evolved, vision was used purely to interpret the world directly surrounding us, primarily from a perspective of survival, and the visual data it provided was really quite simple. It was never really designed for looking at screens, or reading books. There was no need for Palaeolithic humans to view something as small as the text in a book. Over time, visual processing systems evolved as human life evolved.

The greatest change in visual perception likely occurred when the first civilizations appeared. Living in communities meant that the scope and type of visual information changed. The world became a busier place, more cluttered from a sensory perspective. People no longer had to use their vision as much for hunting and gathering, but adapted to life in a community setting and an agricultural way of life. There was likely very little change over thousands of years, perhaps even until the advent of the Industrial Revolution. Society became much more fast-paced, and again our vision had to adapt. Now, in addition to the world around us, people were viewing static images called photographs, often of far-flung exotic places. In the ensuing century, visual information would play an increasing role in people’s lives. Then came the 21st century, and the digital age.

The transient nature of digital information has likely changed the way we perceive the visual world around us. There was a time when viewing a photograph may have been more of an ethereal experience. It can still be a magical experience, but few people likely realize this. We are so bombarded with images that they fill every niche of our lives, and many people take them for granted. Our visual world has become super-saturated. How many Instagram photographs do we view every day? How many of these really make an impact on our lives? It may be that too much visual information has effectively morphed what we perceive on a daily basis into a dull noise. It’s like living next to a busy rail line – what seems noisy at first gets filtered out over time. But what are we losing in the process?

[1] Potter, M., “Meaning in visual search”, Science, 187(4180), pp.965–966 (1975)
[2] Thorpe, S., Fize, D., & Marlot, C., “Speed of processing in the human visual system”, Nature, 381(6582), pp.520–522 (1996)

Every colour photograph is a manipulation of the truth

Previous discussions have focused on the quasi-untruths the camera produces. What is the greatest of them? The freezing or blurring of movement? The distortion of perspective? Or maybe the manipulation of colour? When it comes to colour, where does the truth lie? Colour is interpreted differently by each person, and even by the camera itself. No one may truly understand the complexities of how colour is actually perceived. Most people see a blue sky, but what shade of blue? Consider the following photograph taken at Point Pleasant Park, in Halifax (Nova Scotia). The sky seems over-saturated, but no processing was done. Is it natural, or an effect of being in the right place at the right time?

Prince of Wales Tower, Point Pleasant Park, Halifax

Colours in a digital photograph are the result of many differing processes – light passes through the various glass optics of the lens and is absorbed by the sensor, which converts the photons into a digital signal. This does not mean that the colours which exist in a scene will be properly interpreted. The pure “light” of white can be used to manipulate the colours of a photograph, something called white balancing. Scroll through the available choices, and the colour temperature of the photograph will change. Sometimes we manipulate colours through white balancing, other times through manipulation of the colour histogram, all to make the contents of the photograph seem more akin to our perception of realism. Sometimes we add colour to add a sense of non-realism. Sometimes we saturate the colours to make them seem bright, and other times we mute them.
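
As an illustration of automatic white balancing, here is a minimal sketch of the gray-world approach – scale each channel so the image averages out to neutral gray. This is one textbook assumption among many, not what any particular camera actually does:

```python
import numpy as np

def gray_world_balance(rgb):
    """Gray-world white balance on an 8-bit RGB image: scale each channel
    so its mean matches the overall mean, assuming the scene as a whole
    averages out to neutral gray."""
    img = rgb.astype(float)
    means = img.reshape(-1, 3).mean(axis=0)    # per-channel means
    balanced = img * (means.mean() / means)    # per-channel gain factors
    return np.clip(balanced, 0, 255).astype(np.uint8)
```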

Take a photograph of something. Look at the colours in the scene, and try to remember what they looked like. Maybe take the same photo with different cameras. It is hard to reproduce the exact colours… so in many ways the photograph the camera produces is something of a generic interpretation, to be manipulated by a human toward some visual aesthetic. Which brings us to the question: what is the truth? Is there any real truth to a photograph?

Nothing has a true colour – it is all varying perceptions of the interaction of light, colour pigments, and the human eye. We apply filters in Instagram to make things seem more vivid and hyper-real, or desaturated and contemplative. There is no right or wrong way of understanding colour, although our experiences are influenced by the other senses, such as smell. As far as wavelengths go, the Earth’s sky is really more of a bluish violet, but because of the human visual system we perceive it as pale blue. So maybe our own eyes are manipulating the truth?

The perception of enhanced colour images

Image processing becomes more difficult when colour images are involved, primarily because there is more data. With monochrome images there is really only intensity. With colour images comes chromaticity – and the possibility of modifying the intrinsic colours within an image whilst performing some form of enhancement. Image enhancement in colour images is often challenging because the impact of the enhancement is very subjective.
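
One common way to handle the extra data – a sketch assuming the standard BT.601 YCbCr transform – is to separate intensity (luma) from chromaticity (chroma), so that an enhancement can adjust brightness and contrast while leaving the intrinsic colours alone:

```python
import numpy as np

def split_luma_chroma(rgb):
    """Split an 8-bit RGB image into luma (intensity) and two chroma
    channels using the full-range BT.601 YCbCr transform."""
    img = rgb.astype(float)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b              # intensity
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return y, cb, cr
```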

Consider this image of Schynige Platte in Switzerland. It is very colourful, and seems quite vibrant.

The sky, however, seems too aquamarine. The whole picture looks as if some sort of “antique photo filter” has been applied to it. How do we enhance it, and what do we want to enhance? Do we want to make the colours more vibrant? Do we want to improve the contrast?

In the first instance, we merely stretch the histogram to reduce the gray tonality of the image. Everything becomes much brighter, and there is a slight improvement in contrast. There are parts of the image that do seem too yellow, but it is hard to know whether this is an artifact of the original scene or of the photograph (likely an artifact of dispersing yellow flower petals).
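
A minimal sketch of such a stretch, applied per channel – the percentile cutoffs here are illustrative assumptions, not the values used on the image above:

```python
import numpy as np

def stretch_channel(channel, low_pct=1.0, high_pct=99.0):
    """Linear contrast stretch of one 8-bit channel: map the chosen
    percentiles onto the full 0-255 range, clipping the outliers."""
    lo, hi = np.percentile(channel, (low_pct, high_pct))
    stretched = (channel.astype(float) - lo) * 255.0 / (hi - lo)
    return np.clip(stretched, 0, 255).astype(np.uint8)
```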

Alternatively, we can improve the image’s contrast. In this case, this is achieved by applying a Retinex filter to the image and then taking the average of the filter result and the original image. The resulting image is not as “bright”, but shows more contrast, especially in the meadows.
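
The text above does not say which Retinex variant was applied; the sketch below assumes single-scale Retinex (the log of the image minus the log of a heavily blurred copy), with the surround scale sigma as an illustrative assumption, followed by the averaging step just described:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(channel, sigma=80.0):
    """Single-scale Retinex on one 8-bit channel: log(image) minus
    log(Gaussian surround), rescaled to the 0-255 range."""
    img = channel.astype(float) + 1.0   # avoid log(0)
    retinex = np.log(img) - np.log(gaussian_filter(img, sigma))
    lo, hi = retinex.min(), retinex.max()
    return (retinex - lo) / (hi - lo) * 255.0

def retinex_blend(channel, sigma=80.0):
    """Average the Retinex result with the original channel – the
    blending step described in the text."""
    blended = (single_scale_retinex(channel, sigma) + channel.astype(float)) / 2.0
    return np.clip(blended, 0, 255).astype(np.uint8)
```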

Is either of these enhanced images better? The answer, of course, is in the eye of the beholder. All three images have certain qualities which are appealing. At the end of the day, improving the aesthetic appeal of a colour image is not an easy task, and there is no “best” algorithm.