Image resolution and human perception

Sometimes we view a poster or picture from afar and are amazed at the level of detail, or the crispness of the features, yet viewed from up close this just isn’t the case. Is this a trick of the eye? It has to do with the resolving power of the eye.

Images, whether they are analog photographs, digital prints, or paintings, can contain many different things. There are geometric patterns, shapes, colours – everything needed in order to perceive the contents of the image (or in the case of some abstract art, not perceive it). Now as we have mentioned before, the sharpest resolution in the human eye occurs in the fovea, which represents about 1% of the eye's visual field – not exactly a lot. The rest of the visual field, out to the periphery, has progressively less ability to discern sharpness. Of course the human visual system does form a complete picture, because the brain is able to use visual memory to build a mental model of the world as you move around.

Fig.1: A photograph of a photograph stitched together (photographed at The Rooms, St. John’s, NFLD).

Image resolution plays a role in our perception of images. The human eye can only resolve a certain amount of detail, and how much depends on viewing distance. A normal human eye (i.e. 20-20 vision) can distinguish patterns of alternating black and white lines with a feature size as small as one minute of arc, i.e. 1/60 of a degree, or π/(60×180) ≈ 0.000291 radians. From this there is a simple equation for the finest resolution the eye can resolve: resolution (PPI) = 2 / (0.000291 × distance in inches).

So if a poster were viewed from a distance of 6 feet, the finest resolution capable of being resolved by the eye is about 95 PPI. That’s why the poster in Fig.1, comprised of various separate photographs stitched together (digitally) to form a large image, appears crisp from that distance. It could be printed at 100 DPI, and still look good from that distance. Up close though it is a different story, as many of the edge features are quite soft, and lack the sharpness expected from the “distant” viewing. The reality is that the poster could be printed at 300 DPI, but viewed from the same distance of 6 feet, it is unlikely the human eye could discern any more detail. It would only be useful if the viewer came closer, however coming closer then means you may not be able to view the entire scene. Billboards offer another good example. Billboards are viewed from anywhere between 500 and 2500 feet away. At 573 ft, the human eye can discern 1.0 PPI; at 2500 ft it would be 0.23 PPI (it would take roughly 19 in² to represent 1 pixel). So the images used for billboards don’t need to have a very high resolution.
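To make the arithmetic concrete, here is a minimal Python sketch (the function name ppi_resolvable is mine) that evaluates the formula above for the viewing distances mentioned:

```python
import math

ARCMIN_RAD = math.pi / (60 * 180)   # one minute of arc ≈ 0.000291 radians

def ppi_resolvable(distance_inches):
    """Finest resolution (pixels per inch) a 20-20 eye can resolve
    at a given viewing distance: PPI = 2 / (0.000291 * distance)."""
    return 2.0 / (ARCMIN_RAD * distance_inches)

for label, feet in [("6 ft poster", 6), ("573 ft billboard", 573), ("2500 ft billboard", 2500)]:
    print(f"{label}: {ppi_resolvable(feet * 12):.2f} PPI")
# 6 ft poster: 95.49 PPI
# 573 ft billboard: 1.00 PPI
# 2500 ft billboard: 0.23 PPI
```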

Fig.2: Blurry details up close

Human perception is then linked to the resolving power of the eye. Resolving power is the ability of the eye to distinguish between very small objects that are very close together. To illustrate this further, consider the images shown in Fig.3. They have been extracted from a digital scan of a vintage brochure at various enlargement scales. When viewing the brochure itself it is impossible to see the dots associated with the printing process, because they are too small to discern (and that’s the point). The original, viewed on the screen, is shown in Fig.3D. Even in Fig.3C it is challenging to see the dot pattern that makes up the print. In both Fig.3A and 3B, the dot pattern can be identified. It is no different with any picture: looking at it close up, the perception is one of a blocky dot matrix, not the continuous image seen when viewed from afar.

Fig.3: Resolving detail

Note that this is an exaggerated example, as the human eye does not have the discerning power to view the dots of the printing process without assistance. If the image were blown up to poster size however, a viewer would be able to discern the printing pattern. Many vintage photographs, such as the vacation pictures sold in sets of 10-12 photos, work on the same principle. When provided as a 9cm×6cm black-and-white photograph, they seem to show good detail when viewed from 16-24 inches away. However, when viewed through a magnifying glass, or enlarged after digitization, they lack the sharpness perceived from afar.

Note that 20-20 vision is based on the 20ft distance from the patient to the acuity chart when taking an eye exam. Outside of North America, the distance is normally 6 metres, and so 20-20 = 6-6.

How do we perceive photographs?

Pictures are flat objects that contain pigment (either colour, or monochrome), and are very different from the objects and scenes they represent. Of course pictures must be something like the objects they depict, otherwise they could not adequately represent them. Let’s consider depth in a picture. In a picture, it is often easy to find cues relating to the depth of a scene. The depth-of-field often manifests itself as a region of increasing blur away from the object which is in focus. Other possibilities are parallel lines that converge in the distance, e.g. railway tracks, or objects that are blocked by closer objects. Real scenes do not always offer such depth cues, as we perceive “everything” in focus, and railway tracks do not converge to a point! In this sense, pictures are very dissimilar to the real world.

If you move while taking a picture, the scene will change. Objects that are near move more in the field-of-view than those that are far away. As the photographer moves, so too does the scene, as a whole. Take a picture from a moving vehicle, and the near scene will be blurred, the far not as much, regardless of the speed (motion parallax). This then is an example of a picture for which there is no real world scene.

A photograph is all about how it is interpreted

Photography, then, is not about capturing “reality”, but rather capturing our perception, our interpretation of the world around us. It is still a visual representation of a “moment in time”, but not one that necessarily represents the world around us accurately. All perceptions of the world are unique, as humans are individual beings, with their own quirks and interpretations of the world. There are also things that we can’t perceive. Humans experience sight through the visible spectrum, but UV light exists, and some animals, such as reindeer, are believed to be able to see in UV.

So what do we perceive in a photograph?

Every photograph, no matter how painstaking the observation of the photographer or how long the actual exposure, is essentially a snapshot; it is an attempt to penetrate and capture the unique esthetic moment that singles itself out of the thousands of chance compositions, uncrystallized and insignificant, that occur in the course of a day.

Lewis Mumford, Technics and Civilization (1934)

How do we perceive depth from flat pictures?

Hang a large, scenic panorama on a wall, and the picture of the scene looks like the scene itself. Photographs are mere imitations of life, albeit flat renditions. Yet although they represent different realities, there are cues on the flat surface of a photograph which help us perceive the scene in depth. We perceive depth in photographs (or even paintings) because the same type of information reaches our visual system from photographs of scenes as from the scenes themselves.

Consider the following Photochrom print (from the Library of Congress) of the Kapellbrücke in the Swiss city of Lucerne, circa 1890-1900. There is no difficulty perceiving the scene as it relates to depth. It is possible to identify buildings and objects in the scene, and obtain an understanding of the relative distances of objects in the scene from one another. These things help define its “3D-ness”. The picture can be seen from another perspective as well. The buildings on the far side of the river get progressively smaller as they progress along the river from left to right, and the roof of the bridge is much larger in the foreground than it is in the distance. There is no motion parallax, which would be the relative movement of near and far objects if we were physically moving around the scene. These things work together to define our perception of the print’s flatness.

Fig. 1: Flatness – The Kapellbrücke in Lucerne

Our perception of the 3D nature of a flat photograph comes from the similarity of information reaching the human visual system from an actual 3D scene, and one described in a photograph of the same scene.

What depth cues exist in an image?

  • Occlusion – i.e. overlapping or superimposition. If object A overlaps object B, then it is presumed object A is closer than object B. The water tower in Fig.1 hides the buildings on the hill behind it, hence it is closer.
  • Converging lines – As parallel lines recede into the distance, they appear to converge. The bridge’s roofline in Fig.1 narrows as it recedes into the picture.
  • Relative size – Of two similar objects, the one that appears larger in the image is perceived to be closer. For example, the houses along the far riverbank in Fig. 1 are roughly the same height, but become smaller as they progress from the left of the picture towards the centre.
  • Lighting and shading – Lighting is what brings out the form of a subject or object. The picture in Fig. 1 is lighter on one side and darker on the other; this is effectively shown in the water tower, which has a lit side and a side in shadow. This provides information about the shape of the tower.
  • Contrast – For scenes where there is a large distance between objects, those further away will have a lower contrast, and may appear blurrier.
  • Texture gradient – The amount of detail on an object helps us understand depth. Objects that are closer appear to show more detail, and as detail diminishes those areas are perceived to be further away.
  • Height in the plane – An object closer to the horizon is perceived as being more distant than objects above or below it.

Examples of some of these depth cues are explained visually below.

Examples of depth cues in pictures

A ballad of the senses

Memories made when you’re an infant aren’t really that accessible when you get older. That’s because humans generally suffer from something scientists term infantile amnesia. It has something to do with rapid neuron growth disrupting the brain circuitry that stores old memories, making them inaccessible (they are not lost, but tucked away). Of course you don’t want to remember everything that happens in life… that would clog our brains with a bunch of nothingness. But we all have selective memories from infancy which we can visualize when they are triggered. For me there are but a couple, and they are usually triggered by an associative sense.

The first is the earthy smell of a cellar, which triggers fleeting memories of childhood times at my grandmother’s house in Switzerland. The second is also of the same time and place – the deep smell of wild raspberries. These memories are triggered by olfactory senses, making the visual, however latent, emerge even if for a brief moment. It is no different from the other associations we make between vision, smell, and taste. Dragonfruit is a beautiful looking tropical fruit, but it can have a bitter/tart taste. Some of these associations have helped us survive over the millennia.

Raspberries on a bush.
Mmmm… raspberries… but you can’t smell them, or taste the ethyl formate (the chemical partially responsible for their flavour)

It makes you wonder then whether these sense-experiences don’t allow us to better retain memories. If you travel to somewhere like Iceland, and take a picture of a geyser, you may also smell faint wisps of sulphur. There is now an association between a photograph of a geyser, and physically experiencing it. The same could be said of the salty Atlantic air of Iles de la Madeleine, or the resinous smell of walking through a pine forest. Memory associations. Or maybe an Instagram of a delicious ice cream from Bang Bang ice-cream. Again an association. But how many of the photos we view lack context because we have no association between the visual and information gathered from our other senses? You can view a picture of the ice cream on Instagram, but you won’t know what it tastes or smells like, and therefore the picture only provides half the experience.

When visual data becomes a dull noise

There was a time when photographs had meaning, and held our attention, embedded something inside our minds. Photographs like The Terror of War taken by Nick Ut in 1972 during the Vietnam War.  But the digital age has changed the way we consume photographs. Every day we are bombarded with visual content, and due to the sheer volume, most of it makes little if any lasting impact.

Eventually, the visual data around us becomes an amalgam of blurriness and noise, limiting the amount of information we gain from it.

The human visual system is extremely adept at processing visual information. It can process something like 70 images per second [1,2], and identify images in as little as 13 milliseconds. But it was never really designed to see the variety of visual data now thrust at it. For most of our evolution, vision was used purely to interpret the world directly surrounding us, primarily from a perspective of survival, and the visual data it provided was really quite simple. It was never really designed to look at screens, or read books. There was no real need for Palaeolithic humans to view something as small as text in a book. Over time visual processing systems evolved as human life evolved.

The greatest change in visual perception likely occurred when the first civilizations appeared. Living in communities meant that the scope and type of visual information changed. The world became a busier place, more cluttered from a sensory perspective. People no longer had to use their vision as much for hunting and gathering, but adapted to live in a community setting, and an agricultural way of life. There was likely very little change over thousands of years, maybe even until the advent of the Industrial Revolution. Society became much more fast paced, and again our vision had to adapt. Now in addition to the world around us, people were viewing static images called photographs, often of far-flung exotic places. In the ensuing century, visual information would play an increasing role in people’s lives. Then came the 21st century, and the digital age.

The transient nature of digital information has likely changed the way we perceive the visual world around us. There was a time when viewing a photograph may have been more of an ethereal experience. It can still be a magical experience, but few people likely realize this. We are so bombarded with images that they fill every niche of our lives, and many people likely take them for granted. Our visual world has become super-saturated. How many Instagram photographs do we view every day? How many of these really make an impact on our lives? It may be that too much visual information has effectively morphed what we perceive on a daily basis into a dull noise. It’s like living next to a busy rail line – what seems noisy at first gets filtered out over time. But what are we losing in the process?

[1] Potter, M., “Meaning in visual search”, Science, 187(4180), pp.965–966 (1975)
[2] Thorpe, S., Fize, D., & Marlot, C., “Speed of processing in the human visual system”, Nature, 381(6582), pp.520–522 (1996)

More on Mach bands

Consider the following photograph, taken on a drizzly day in Norway with a cloudy sky, and the mountains somewhat obscured by mist and clouds.

Now let’s look at the intensity image (the colour image has been converted to 8-bit monochrome):
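The conversion itself is only a couple of lines in Python with Pillow; a minimal sketch, with a hypothetical filename:

```python
from PIL import Image

img = Image.open("norway_mist.jpg")   # hypothetical filename for the photograph above
gray = img.convert("L")               # 8-bit monochrome (luma-weighted conversion)
gray.save("norway_mist_gray.png")
```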

If we look at a region near the top of the mountain, and extract a circular region, there are three distinct regions along a line. To the human eye, these appear as quite uniform regions, which transition along a crisp border. In the profile of a line through these regions though, there are two “cliffs” (A and B) that mark the shift from one region to the next. Human eyes don’t perceive these “cliffs”.

Mach bands are an illusion that suggests edges in an image where in fact the intensity is changing in a smooth manner.

The downside to Mach bands is that they are an artificial phenomenon produced by the human visual system. As such, they might actually interfere with visual inspection when judging the sharpness contained in an image.

Mach bands and the perception of images

Photographs, and the results obtained through image processing, are at the mercy of the human visual system. A machine cannot interpret how visually appealing an image is, because aesthetic perception is different for everyone. Image sharpening takes advantage of one of the tricks of our visual system. Human eyes see what are termed “Mach bands” at the edges of sharp transitions, which affect how we perceive images. This optical illusion was first explained by Austrian physicist and philosopher Ernst Mach (1838–1916) in 1865. Mach discovered how our eyes leverage contrast to compensate for their inability to resolve fine detail. Consider the image below containing ten squares of differing levels of gray.
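A test image like this is easy to synthesize; the sketch below builds a strip of ten uniform gray squares (the levels are arbitrary, stepping from dark to light) so you can observe the scalloping for yourself:

```python
import numpy as np
from PIL import Image

n, size = 10, 100                                    # ten squares, 100x100 pixels each
levels = np.linspace(40, 220, n).astype(np.uint8)    # arbitrary dark-to-light gray levels
row = np.repeat(levels, size)                        # one scanline: each level repeated 100 times
strip = np.tile(row, (size, 1))                      # stack scanlines into a 100-pixel tall strip
Image.fromarray(strip, mode="L").save("gray_steps.png")
```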

Notice how the gray squares appear to scallop, with a lighter band on the left, and a darker band on the right of each square? This is an optical illusion; in fact the gray squares are all uniform in intensity. To compensate for the eye’s limited ability to resolve detail, incoming light is processed in such a manner that the contrast between two different tones is exaggerated. This gives the perception of more detail. The dark and light bands seen on either side of each gradation are the Mach bands. Here is an example of what human eyes see:

What does this have to do with manipulation techniques such as image sharpening? The human brain perceives exaggerated intensity changes near edges – so image sharpening uses this notion to introduce faux Mach bands by amplifying intensity edges. Consider as an example the following image, which basically shows two mountainsides, one behind the other. Without looking too closely you can see the Mach bands.

Taking a profile perpendicular to the mountain sides provides an indication of the intensity values along the profile, and shows the edges.
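Such a profile is just the pixel values read along a line through the grayscale image; a minimal numpy sketch, assuming the profile runs horizontally and the filename is hypothetical:

```python
import numpy as np
from PIL import Image

gray = np.asarray(Image.open("mountainsides.jpg").convert("L"))  # hypothetical filename
y = gray.shape[0] // 2          # a row crossing both mountainside edges (chosen by eye)
profile = gray[y, :]            # intensity values along that row
print(profile[::25])            # sample every 25th pixel: three plateaus, two cliffs
```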

The profile shows three plateaus, and two cliffs (the cliffs are ignored by the human eyes). The first plateau is the foreground mountainside, the middle plateau is the mountainside behind that, and the uppermost plateau is some cloud cover. Now we apply an unsharp masking filter (radius=10, mask weight=0.4) to sharpen the image.
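Unsharp masking subtracts a weighted, Gaussian-blurred copy of the image from the original and rescales the result. A sketch of a common formulation (the one used, for example, by ImageJ), treating the radius as the Gaussian sigma and reusing the hypothetical filename from above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from PIL import Image

def unsharp_mask(gray, radius=10, weight=0.4):
    """Sharpen by subtracting a weighted Gaussian blur:
    result = (original - weight * blurred) / (1 - weight)."""
    img = gray.astype(np.float64)
    blurred = gaussian_filter(img, sigma=radius)     # radius treated as the Gaussian sigma
    sharpened = (img - weight * blurred) / (1.0 - weight)
    return np.clip(sharpened, 0, 255).astype(np.uint8)

gray = np.asarray(Image.open("mountainsides.jpg").convert("L"))  # hypothetical filename
Image.fromarray(unsharp_mask(gray)).save("mountainsides_um.png")
```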

Notice how the UM filter has the effect of adding a Mach band to each of the cliff regions.

How many colours are in a photograph?

The number of colours in a 24-bit colour image is 256³ or 16,777,216 colours. So how many colours are there in an 8 MP photo? Consider the following beautiful photograph:

Picture of a flower on a Japanese quince tree.
A picture of a flower from a Japanese quince

In this image there are 515,562 unique colours. Here’s what it looks like as a 3D RGB histogram:
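Counting unique colours is straightforward with numpy: reshape the image into a list of RGB triples and count the distinct rows. A minimal sketch (filename hypothetical):

```python
import numpy as np
from PIL import Image

rgb = np.asarray(Image.open("quince_flower.jpg").convert("RGB"))  # hypothetical filename
pixels = rgb.reshape(-1, 3)                    # one RGB triple per pixel
unique_colours = np.unique(pixels, axis=0)
print(len(unique_colours), "unique colours in", len(pixels), "pixels")
```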

Most photographs will not contain 16 million colours (obviously an image with fewer than 16.8 million pixels cannot contain them all). If you want to check out some images that do, try allrgb.com. Here is another image with more colours: 1,357,892 to be exact. In reality, very few everyday photographs contain that much colour variety.

Stained glass window at Metro Charlevoix in Montreal

Now as the number of colours humans can perceive is only around 10 million, having 16 million colours in an image is likely overkill.

Why human eyes are so great

Human eyes are made of gel-like material. It is interesting then, that together with a 3-pound brain composed predominantly of fat and water, we are capable of the feat of vision. Yes, we don’t have super-vision, and aren’t capable of zooming in on objects in the distance, but our eyes are magical. Eyes are able to refocus almost instantaneously, on objects as close as 10cm and as far away as infinity. They also automatically adjust for various lighting conditions. Our vision system is quickly able to decide what an object is and perceive 3D scenes.

Computer vision algorithms have made a lot of progress in the past 40 years, but they are by no means perfect, and in reality can be easily fooled. Here is an image of a refrigerator section in a grocery store in Oslo. The context of the content within the image is easily discernible. If we load this image into “Google Reverse Image Search” (GRIS), the program says that it is a picture of a supermarket – which is correct.

Now what happens if we blur the image somewhat? Let’s say a Gaussian blur with a radius of 51 pixels. This is what the resulting image looks like:
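The blur itself is a one-liner in Pillow; a sketch, with a hypothetical filename:

```python
from PIL import Image, ImageFilter

img = Image.open("oslo_refrigerator.jpg")                  # hypothetical filename
blurred = img.filter(ImageFilter.GaussianBlur(radius=51))  # heavy Gaussian blur
blurred.save("oslo_refrigerator_blur51.png")
```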

The human eye is still able to decipher the content in this image, at least enough to determine it is a series of supermarket shelves. Judging by the shape of the blurry items, one might go so far as to say it is a refrigerated shelf. So how does the computer compare? The best it could come up with was “close-up”, because it had nothing to compare against. The Wolfram Language “Image Identification Program” (IIP) does a better job, identifying the scene as “store”. Generic, but not a total loss. Let’s try a second example. This photo was taken in the train station in Bergen, Norway.

GRIS identifies similar images, and guesses the image is “Bergen”. Now this is true, however the context of the image is more related to railway rolling stock and the Bergen station, than Bergen itself. IIP identifies it as “locomotive engine”, which is right on target. If we add a Gaussian blur with radius = 11, then we get the following blurred image:

Now GRIS thinks this scene is “metro”, identifying similar images containing cars. It is two trains, so this is not a terrible guess. IIP identifies it as a subway train, which is a good result. Now let’s try the original with a Gaussian blur with a radius of 21.

Now GRIS identifies the scene as “rolling stock”, which is true, however the images it considers similar involve cars doing burn-out or stuck in the snow (or in one case a rockhopper penguin). IIP on the other hand fails this image, identifying it as a “measuring device”.

So as the image gets blurrier, it becomes harder for computer vision systems to identify, whereas the human eye does not have these problems. Even in a worst case scenario, where the Gaussian blur filter has a radius of 51, the human eye is still able to decipher its content. But GRIS thinks it’s a “photograph” (which *is* true, I guess), and IIP says it’s a person.

30-odd shades of gray – the importance of gray in vision

Gray (or grey) means a colour “without colour”… and it is a colour. But in terms of image processing we more commonly use gray as a term synonymous with monochromatic (although monochrome strictly means single colour). Now grayscale images can potentially come with limitless levels of gray, but while this is practical for a machine, it’s not useful for humans. Why? Because the human eye is built around a system for conveying colour information. This allows humans to distinguish between approximately 10 million colours, but only about 30 shades of gray.

The human eye has two core forms of photoreceptor cells: rods and cones. Cones deal with colour vision, while rods allow us to see in grayscale in low-light conditions, e.g. at night. The human eye has three types of cones, most sensitive to short (bluish), medium (greenish), and long (yellowish-red) wavelengths. Each type of cone reacts to an overlapping interval of wavelengths rather than a single colour. However, of all the possible wavelengths of light, our eyes detect only a small band, typically in the range of 380-720 nanometres, what we know as the visible spectrum. The brain then combines signals from the receptors to give us the impression of colour. So every person will perceive colours slightly differently, and this might also differ depending on location, or even culture.

After the light is absorbed by the cones, the responses are transformed into three signals: a black-white (achromatic) signal and two colour-difference signals, a red-green and a blue-yellow. This theory was put forward by German physiologist Ewald Hering in the late 19th century. It is important that blacks, grays, and whites are reproduced properly in an image. Deviations from these norms are usually very noticeable, and even a small amount of hue can produce a noticeable defect. Consider the following image, which contains a number of regions that are white, gray, and black.

A fjord in Norway

Now consider the photograph with a slight blue colour cast. The whites, grays, *and* blacks have taken on the cast (giving the photograph a very cold feel to it).

Photograph of a fjord in Norway with a cast added.
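A cast like this can be simulated by scaling the colour channels; a rough sketch that cools the image by boosting blue and slightly suppressing red (the filename and scaling factors are arbitrary):

```python
import numpy as np
from PIL import Image

rgb = np.asarray(Image.open("fjord.jpg").convert("RGB")).astype(np.float64)
rgb[..., 2] *= 1.25                        # boost the blue channel
rgb[..., 0] *= 0.95                        # slightly suppress red to cool the image further
cast = np.clip(rgb, 0, 255).astype(np.uint8)
Image.fromarray(cast).save("fjord_blue_cast.png")
```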

The grayscale portion of our vision also provides contrast, without which images would have very little depth. We can demonstrate this by removing the intensity portion of an image. Consider the following image of some rail snowblowers on the Oslo-Bergen railway in Norway.

Rail snowblowers on the Oslo-Bergen railway in Norway.

Now, let’s take away the intensity component (by converting it to HSB, and replacing the B component with white, i.e. 255). This is what you get:

Rail snowblowers on the Oslo-Bergen railway in Norway. Photo has intensity component removed.

The image shows the hue and saturation components, but no contrast, making it appear extremely flat. The other issue is that sharpness depends much more on the luminance than the chrominance component of images (as you will also notice in the example above). It does make a nice art filter though.
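If you want to try this “art filter” yourself, here is a minimal sketch of the operation described above: convert to HSV (Pillow’s equivalent of HSB), replace the brightness component with 255, and convert back (filename hypothetical):

```python
import numpy as np
from PIL import Image

hsv = np.asarray(Image.open("snowblowers.jpg").convert("HSV")).copy()  # hypothetical filename
hsv[..., 2] = 255                                # replace the brightness component with white
flat = Image.fromarray(hsv, mode="HSV").convert("RGB")
flat.save("snowblowers_flat.png")
```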