Production Expert

More People Are Using Subtitles - Are Sound Mixers To Blame?

Martin Lewis, a well-known consumer champion in the UK, posted a poll on his social media channels asking about watching English-language TV with subtitles turned on. What caught Mike Thornton’s eye were the comments on both Twitter and Facebook from normal TV viewers and listeners about the state of TV sound. In this article, Mike takes a closer look at the comments and what we can do about the issues that users are having, which force them to turn the subtitles on.

Martin Lewis Poll

We are going to look at the data first. Here is Martin’s poll. He asked…

“Today's Twitter Poll: Do you usually/often watch English language TV with the English subtitles on?”

There were 49,631 votes. How do the results of Martin Lewis’s poll compare with research already undertaken?

In November 2021, to coincide with Caption Awareness Week, UK-based Stagetext, a deaf-led charity making the arts a more welcoming and accessible place, commissioned the research agency Sapio. In October 2021, Sapio interviewed 2,003 people, and the results were weighted to represent Great Britain’s general population. Two-thirds (67%) of the respondents did not describe themselves as deaf, deafened, or hard of hearing.

Take a look at the results…

[Chart in the original post: subtitle use by age group, from the Stagetext/Sapio survey]

Usage seems to be almost the inverse of what you might expect: four out of five people aged 18 to 25 use subtitles all or part of the time, whereas fewer than a quarter of people aged 56 to 75 said they do, even though twice as many people in that older group describe themselves as deaf, deafened or hard of hearing. Melanie Sharp, Stagetext’s chief executive, said…

"I think there's far more acceptance of subtitles by young people because it's the norm, whereas with an older age group, it isn't necessarily the norm."

Interestingly, there were two comments on Martin’s poll that speak to this…

  • “Students I teach in high school say they do it after growing up with YouTube and sound quality often not being completely clear.”

  • “Hearing damage in the younger generations from headphone use, mark my words.”

That is the UK, but what about the US? In May 2022, the language tutoring platform Preply surveyed 1,265 Americans on their use and opinions of subtitles in entertainment. 49% identified as men, 48% identified as women and 3% identified as non-binary or preferred not to indicate their gender. Of the respondents, 16% were Baby Boomers (58-76), 22% were Generation X (42-57), 46% were Millennials (26-41), and 16% were Generation Z (10-25).

[Chart in the original post: subtitle use by generation, from the Preply survey]

It is interesting to see the correlation between the Preply survey and the Sapio survey of Brits. However, the differences across the age ranges were not as marked in the US survey as they were in the UK survey.

Before we return to the comments on Martin Lewis’s poll on Twitter and Facebook, it would be helpful to analyse the reasons why people use subtitles. There are a number of valid reasons for using subtitles that have nothing to do with poor sound.

Why Do People Use Subtitles?

The US survey asked two questions on this subject.

[Chart in the original post: Preply survey results on why Americans use subtitles]

Preply’s results show that 53% of Americans are using subtitles more often than they used to, which implies that things have deteriorated over time. That could be because as people get older, their hearing naturally deteriorates, but I suspect that is not the main reason. What this survey makes very clear is that there are major problems with dialogue intelligibility, and they are getting worse, not better. So why is this?

Looking at the comments on Martin Lewis’s poll of May 14th 2024, there are clearly several reasons why people use subtitles that are not technical or mix-related.

  1. People watching TV who don’t want to disturb others, whether that is neighbours, young children sleeping or other family members.

  2. People whose hearing has deteriorated with age, and I would have to include myself in this category. However, I would say that it is our responsibility to make sure our mixes make it as easy as possible for consumers to enjoy the content we mix.

  3. People whose hearing loss isn’t related to age, which can include conditions like tinnitus as well as other conditions that have a negative impact on their hearing.

  4. People who are neurodivergent. This was a new one for me, but some of the reasons given included children with Developmental Language Disorders, autism, MS, Dyspraxia, and ADHD. For example, HW commented… “Watching with subtitles can really support adults with ADHD and Autism to pick up on social communication cues and reduce distraction. It can help develop reading skills for other viewers. I wish this had been available when I was growing up.”

  5. Strong accents, mainly in dramas, which some viewers struggle to follow. We can extend this to people whose first language isn’t English and who find subtitles helpful for understanding what is being said.

  6. People who use subtitles to help them learn a new language.

What Are The Issues That We Can Do Something About?

Having covered these reasons, we will now look at the issues that are technology-related, like the change in the design of televisions and speakers, intelligibility, dynamic range and other ‘mix’ issues. For each area, I will list a selection of comments and then explore the problem and any solutions. I have anonymised the comments.

Please remember these are comments from real people watching and listening to the content we are creating.

Intelligibility

  • “I often can't make out what's being said due to overly loud background music or mumbled speech.”

  • “There's so much background noise with "mood music" and so many actors seem to mumble or talk quickly!”

  • “I have subtitles on approx 90% of the time. It's my default setting on my system. I have difficulty discriminating speech when there is a lot of background noise so for most films, dramas etc, subtitles are a must. Not so much an issue with documentaries.”

  • “My hearings fine, in many shows and movies the actors mumble or the background music/sounds is so loud you can’t hear the dialogue properly.”

  • “Find it hard to make out dialogue over "background" music a lot of the time so the subtitles are always on.”

  • “I do if the dialogue isn't very clear, which it often isn't on American programmes. I don't have any trouble with the BBC so it must be a quality thing.”

  • “Need it sometimes as the actors in particular talk muffled and can't understand what they are saying or dialect.”

  • “The clarity of speech and background noise/music make subtitles a necessity for me.”

  • “So many people don’t speak clearly or speak up when being interviewed and the constant background noise and music, mostly too loud, makes it impossible to hear speech over the top. All programmes seem to have issues with promoting a good level of speech at the expense of background noise.”

  • “Background music does it for me, as then so can’t hear the dialogue.”

The first point I want to make is that there is no one single reason why intelligibility has deteriorated in broadcast and OTT content. As with so many things, it is a combination of factors that ultimately result in normal-hearing people having to resort to turning on the subtitles to follow the narrative, and we will explore those later in this article.

Before we do that, let’s take a closer look at intelligibility. The dictionary definition of intelligibility is…

“the quality or condition of being intelligible - capable of being understood; comprehensible; clear enough to be understood.”

In non-tonal (Western) languages, consonants are really important. The consonants (k, p, s, t, etc.) are predominantly found in the frequency range above 500 Hz, more specifically in the 2 kHz-4 kHz range. However, take a look at the diagram below: there is very little correlation between the amount of energy the consonants carry in the 2-4 kHz band and how important that band is to intelligibility.

Image courtesy of the DPA Microphone University

What doesn’t help is that it is very difficult to make the consonants louder. Try it for yourself; it is very difficult. When you project or shout, you make the vowels louder, but the consonants stay at pretty much the same level. The lack of energy in the consonants also makes them much easier to mask or drown out with other sounds like sound effects, foley or music.
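
To make the energy imbalance concrete, here is a minimal Python sketch (the file name dialogue.wav is a hypothetical stand-in for any mono dialogue recording) comparing the energy below 500 Hz with the energy in the consonant-critical 2-4 kHz band:

```python
# Minimal sketch: compare vowel-band vs consonant-band energy in a
# dialogue recording. "dialogue.wav" is a hypothetical file name.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

rate, data = wavfile.read("dialogue.wav")
data = data.astype(np.float64)
if data.ndim > 1:
    data = data.mean(axis=1)  # fold multichannel audio to mono

freqs, psd = welch(data, fs=rate, nperseg=4096)  # power spectral density

def band_energy(lo_hz, hi_hz):
    mask = (freqs >= lo_hz) & (freqs < hi_hz)
    return np.trapz(psd[mask], freqs[mask])

vowels = band_energy(100, 500)        # vowel-dominated region
consonants = band_energy(2000, 4000)  # consonant-critical region

print(f"Vowel band is {10 * np.log10(vowels / consonants):.1f} dB "
      f"louder than the consonant band")
```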

If you would like to know more about the science behind intelligibility then check out our article Speech Intelligibility - The Facts That Affect How We Hear Dialog for all the scientific detail.

Back to this article. Next, we will look at the reasons that have come together to produce this ridiculous situation in which normal-hearing people are using subtitles.

Intelligibility Meter

An intelligibility meter would help by at least giving a quantitative measure of intelligibility. There are already two options.

Public Address systems, especially those used for safety announcements, are required to have a measurable Speech Transmission Index. The Speech Transmission Index reflects how a transmission path affects speech intelligibility; it does not take listeners and talkers into account but just measures the transmission channel, which means that factors such as hearing loss, poor articulation and other (human) limitations are excluded. If you want to know more about this, then start by reading the white paper entitled Speech intelligibility measurements in practice.
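
To give a flavour of what STI measures, here is a heavily simplified toy in Python - not the real IEC 60268-16 computation, which combines multiple octave bands and modulation frequencies - showing how a noisy channel reduces the depth of the slow intensity modulations that carry speech information:

```python
# Toy illustration of the idea behind STI: how well do slow intensity
# modulations survive the transmission path?
import numpy as np

rate = 16000
t = np.arange(rate * 2) / rate
rng = np.random.default_rng(1)
carrier = rng.standard_normal(t.size)
clean = carrier * (1 + 0.9 * np.sin(2 * np.pi * 4 * t))  # 4 Hz speech-like envelope

# A crude "transmission path": additive noise washes out the modulation.
received = clean + 1.5 * rng.standard_normal(t.size)

def modulation_index(x, mod_freq=4.0):
    intensity = x ** 2
    spec = np.abs(np.fft.rfft(intensity))
    freqs = np.fft.rfftfreq(intensity.size, 1 / rate)
    bin_m = np.argmin(np.abs(freqs - mod_freq))
    return 2 * spec[bin_m] / spec[0]  # modulation depth relative to DC

print(f"clean    m = {modulation_index(clean):.2f}")
print(f"received m = {modulation_index(received):.2f}")  # lower = less intelligible
```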

Back to media and broadcasting: even though it is impossible to measure the full transmission path, as there are far too many variables (most of which we cover in this article), intelligibility metering is an issue engaging developers.

Back in 2018, iZotope recognised this to be an important factor and included an Intelligibility Meter in their audio visualisation and metering software, Insight 2. Not for the first time, iZotope was the first to take a concept and turn it into a product: in this case, an intelligibility meter for our industry, which is a first in audio metering.

Being the first brings its own challenges, in that you have to set the style and the standard, and iZotope has risen to that challenge with the Intelligibility meter in Insight 2. The top meter, which doesn’t have a scale but rather a target to aim for, is very intuitive, and interestingly, the target moves when you change the expected environment the consumer might be in when listening to the content. However, the bottom two meters, with a scale in phons, are a throwback to the intelligibility measurements for sound reinforcement and emergency announcement systems, and in the context of broadcasting and OTT, their precise meaning is not clear. For me, there is still some work to do, but when you don’t have anything to go on, you have to start somewhere, especially when trailblazing.

It is always going to be difficult, but that has never stopped iZotope’s ingenuity before. I hope they will continue to improve this new concept, which will be so helpful by giving us a quantitative measurement of the intelligibility of our mixes. This is especially true as more and more of the content we mix is being consumed in noisy and challenging locations, with playback systems that do not offer the best sound quality.

In December 2020, Steinberg included an Intelligibility Meter in Nuendo 11. It turns out that Steinberg’s Intelligibility Meter was developed by the very clever people at Fraunhofer.

The new feature analyses incoming audio signals via a speech intelligibility model with automatic speech recognition technology and calculates how much effort the listener must put into understanding the spoken words within the mix. Dr Jan Rennies-Hochmuth, Head of Personalized Hearing Systems at Fraunhofer IDMT, explains…

“The tool Intelligibility Meter measures objective speech intelligibility in media production in real time, controlled by artificial intelligence. It is a result of several years of hearing research in Oldenburg.”

You can learn much more about this intelligibility meter in our article Fraunhofer Intelligibility Meter Used In Nuendo 11.

The Desire For Realism Resulting In ‘Mumbling Actors’

Back to the comments from Martin Lewis’s poll…

  • “I wonder if modern audio mixes have are a factor. Christopher Nolan films are famous for atrocious mixes: he refuses to do any post fixes to audio and thinks b'ground sound is as important as dialogue. He won't even do alternate audio tracks for hard of hearing. Easy enough these days.”

  • “There is so much mumbling and background music plus dark screens on tv dramas these days the subtitles are required viewing!”

  • “Mumbling so we have the subtitles on to hear the parts of the sentences that drop off.”

  • “Poor elocution and poor production, especially in films.”

  • “Older programmes/films are easier to understand with clear diction. Nowadays actors just mumble and speak too quickly. Impossible to follow the plot sometimes without subtitles, whether or not you are hard of hearing.”

  • “I use it on more modern productions as some of the diction is shocking.”

  • “The diction is usually so bad that subtitles are required.”

  • “My brother and I watched Oppenheimer together and decided quarter of the way through we needed subtitles.”

  • “I cannot understand a show without subtitles it’s just mumbling. I’m 32.”

  • “So difficult to hear some of the mumbling dialogue in films these days - that’s when I use subtitles.”

  • “Too much mumbling on some programmes makes it difficult to understand what they are saying without subtitles.”

  • “The actors seem to mumble a lot and I also struggle with some accents so it’s either turn the tv up really loud or out on subtitles.”

  • “It's a mixing issue. The dialogue and the backing music are so often at odds with each other. I had to watch Oppenheimer with my remote in my hand because having the volume high enough to make out what people were saying was loud enough that the music was deafening.”

There continues to be a growing trend towards more realism. Actors can, and should, explore different techniques to portray their characters. However, when this involves “realism” - delivering dialogue in a “realistic” way rather than in a way that can be heard at the back of a theatre - then, as we have seen from the UK and US research, it isn’t ending well, and directors need to understand this. The problem is that going for the realistic approach means the dialogue is unlikely to make it all the way through the system in a state where end users can still understand what is being said.

After all, there is nothing realistic about TV productions, whether documentaries or drama, studio-based or on location. So why consider taking the realistic approach?

Believable, absolutely! Realistic, definitely not!

When it comes to mixing, I do not believe you can mix TV shows with natural dynamics in the dialogue. If you do, it makes the dialog harder to hear and understand. How you choose to reduce the dialogue dynamic range is up to you. It can be with the faders, clip level or compression. However, restricting the dynamics is essential, especially for content to be consumed at home or on the move.
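
As a toy illustration of the compression route, here is a minimal static gain computer in Python; the threshold and ratio are arbitrary, and a real dialogue compressor would add attack and release smoothing:

```python
# Toy dialogue compressor: static gain reduction above a threshold.
# No attack/release smoothing, so a teaching sketch only.
import numpy as np

def compress(x, threshold_db=-24.0, ratio=4.0):
    eps = 1e-12                                # avoid log of zero
    level_db = 20 * np.log10(np.abs(x) + eps)  # instantaneous level
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)   # 4:1 above threshold
    return x * 10 ** (gain_db / 20.0)
```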

Realism is just not possible, and I feel that the push towards ‘realism’ is flawed on so many levels. Consider the lighting, the way it is shot, how the story is put together: none of it is real, so why apply realism to the sound? It is bonkers!

A Lack Of Light

An extension of this push to ‘realism’ is scenes shot in the dark, which makes it very hard for people to lip-read. A couple of comments from Martin Lewis’s poll picked this up…

  • “I don't have trouble hearing most programmes. I'll turn them on if it's really badly mixed. What I would like is a facility to turn up the lights in modern dramas. It seems to be a fashion that dramas are filmed in the dark. Natural lighting makes it difficult to see them.”

  • “We’re both over 30 and subtitles are always on if available. Crap audio mixes, not having the greatest TV and mumbling actors. While we’re on this kind of thing can the editors please MAKE THE SCENE BRIGHTER!”

For example, in the UK a few years ago, in the Second World War drama series SS-GB, there were a number of scenes shot in the dark where the narrative needed to convey that people had to whisper so they weren’t overheard, and were hiding in shadows after dark so they weren’t seen. The problem with this realistic approach is that intelligibility suffers when people cannot see the speaker’s lips moving. There is an art to speaking quietly and still being heard clearly. Back in the day, it was called a ‘stage whisper,’ but these techniques don’t seem to be taught in drama school anymore, to the extent that we have at least one generation of actors who no longer have this skill.

This push for realism means directors don’t feel such techniques are necessary, but it is a problem, and Preply’s research confirms this, with 44% of respondents highlighting the impact that low lighting has on intelligibility.

TV Drama Is Not A Feature Film

  • “When the Directors favour cinematic sound over home TV audience then the only outcome will be subtitles on your telly. You just can’t hear half the story otherwise.”

I also believe that mixing TV drama like a feature film is daft. For example, at night, the consumer will almost certainly have the TV volume much quieter, especially if they have young children, so all that quieter stuff won’t be heard. If that includes quiet dialog, the narrative can get lost, and they end up turning on the subtitles to be able to follow the story.

Pre-existing Knowledge - Those Involved In The Production All Know What Is Being Said

  • “The current trend for mumbled speech in films and dramas, because directors think it adds atmosphere, has made me wonder about doing it. The directors know the script so they know what's being mumbled in the dark and whether it's important or just an aside.”

Another big issue affecting whether a particular line is intelligible is that everyone involved in the production knows what is being said; they have lived with it through pre-production, script editing, shooting, and post-production. This means they probably know the script as well as the actors do, if not better!

What this familiarity with the script means is that they can hear the words even when they are not clearly intelligible. For example, this can happen when the drama is being shot: the director knows what is being said, so even if the sound team asks for a retake, the request is likely to be received with a hard stare and “I can hear it, what’s your problem?” When we get to the dub and the director comes to sign off on a scene, again, they know what is being said, and so may well ask for the FX and/or music to be lifted, to increase the sense of drama in the scene, to a much higher level than they would if they were new to the production and hearing it for the first time.

Changes In Production Techniques - More Multi-camera, Less Use Of Boom Mics

  • “Since actors don't speak to boom mics anymore, they tend to mumble, and bad mixing with loud music or effects make a good part of the speech unintelligible.”

Shooting a scene using more than one camera means that your use of a boom mic is compromised at best, as at least one of the cameras tends to be on a wide shot, meaning the boom mic cannot get in close enough to pick up a clean sound. Consequently, location sound teams end up relying on personal radio mics. As we learned in our article Speech Intelligibility - The Facts That Affect How We Hear Dialog, the spectrum of speech recorded on a person's chest normally lacks frequencies in the important 2-4 kHz range, where the consonants are, resulting in reduced speech intelligibility.

In the same article, we also learnt that just above the head, where the boom mic would normally be, is a great position for getting the best speech intelligibility. All of this means that the growth of multi-camera shoots results in a double whammy: we lose the use of a boom mic and replace it with personal radio mics, often in the chest area, which don’t pick up the consonants as well as the boom mic does. As we learnt, speech intelligibility is all about the consonants.

Loudness Range Too High

  • “I use subtitles but only because audio designers have gotten terrible at mixing levels. Dialogue is quiet while everything else is loud. Cant turn it up for the talking without everything else also going up.”

  • “I find that I have to keep turning the volume up and down depending on whether it's more talking or more music and action.”

  • “Over 30, I find sound is mixed so that speech is very quiet and if I turn it up loud enough to hear then music and effects are deafening. So I'd rather have it quieter and put subtitles on.”

  • “Weirdly the volume go up and down during films and tv so I don’t always catch what’s said. It can be distracting reading but it works and kids seem to enjoy it too.”

  • “Everything has deafening sound effects and barely audible dialogue these days and I’m fed up with having to sit with the remote constantly turning the volume up or down!”

  • “The sound techs/show creators have messed with the levels so much that I often have to rewind, put subs on, read it, just to know what’s being said sometimes! Why must we have this awful mouse level dialogue?”

  • “There have been a couple of movies where the music/effects and the speech had wildly different volume levels, so I needed the subtitles then.”

  • “The music and effects are so loud compared to the dialogue I have to turn the volume down to a level where I can’t catch what they’re saying.”

This issue is directly connected to the increased use of subtitles. TV drama is becoming increasingly cinematic in style. From a sound perspective, a cinematic style does not translate to a domestic situation where neither the playback system nor the background noise of the room can be controlled, unlike in a cinema theatre, where there is complete end-to-end control.

In addition, a domestic environment is a much smaller room and smaller rooms cannot handle louder sounds as well as larger rooms. We must always remember to consider how the content we create is going to be consumed and in what environment.

The Law Of Averages

Roll back to before loudness normalisation was introduced: although there were issues with loudness jumps, under peak-level normalisation our dialog would often be close to or at peak level. The outcome of this style was mixes with dialog close to headroom, which meant that not much could go higher than the speech. With content normalised to loudness, and the additional headroom we have with BS.1770-based delivery specs, there seems to have been an excessive move towards making more of the mix louder than the dialog. This has two outcomes. First, as more and more of the mix is louder than the dialog, the loudness of the dialog is pushed down relative to the Integrated Loudness of the complete mix, simply because there is more content in the mix that is louder than the dialog. It has to be that way; it’s the law of averages!

Second, because there is more content that is louder than the anchor point, usually the dialog, the Loudness Range increases, which is bad for content consumed in a domestic environment. Content with a larger Loudness Range will have a wider spread of louder and quieter sounds. Because the dialog loudness gets pushed down relative to the Integrated Loudness, people will set their TV volume so that the loud material (often the music) is at a comfortable listening level, but then, because of the excessive Loudness Range, the dialog is not loud enough to follow. So, rather than constantly adjusting the volume up and down, the consumer turns on the subtitles.
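
To see the law of averages in numbers, here is a minimal sketch using the pyloudnorm library; the noise signals are arbitrary synthetic stand-ins for quiet dialogue and louder music:

```python
# Sketch: the more of a mix that sits louder than the dialogue, the
# further the dialogue falls below the BS.1770 integrated loudness.
import numpy as np
import pyloudnorm as pyln

rate = 48000
rng = np.random.default_rng(0)
dialog = 0.05 * rng.standard_normal(rate * 20)  # 20 s of quiet "dialogue"
music = 0.40 * rng.standard_normal(rate * 20)   # 20 s of louder "music"

meter = pyln.Meter(rate)  # BS.1770 meter
print(f"Dialogue alone: {meter.integrated_loudness(dialog):.1f} LUFS")
print(f"Full mix:       {meter.integrated_loudness(np.concatenate([dialog, music])):.1f} LUFS")
# Normalising the full mix to a -23 LUFS target drags the dialogue down
# with it, because the louder music dominates the integrated average.
```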

Reduce The Loudness Range

In my article Has Netflix Turned The Clock Back 10 Years Or Is Their New Loudness Delivery Spec A Stroke Of Genius? I investigated both Integrated Loudness and Dialog Loudness using the Nugen Audio Dolby Dialog Intelligence gating algorithm on four different programmes: Amazon Prime’s The Grand Tour, the BBC’s Planet Earth 2, and two programmes I mixed myself. The first of these was Cow Dust Time, a documentary for BBC Radio 3, the UK's public service classical music channel, where the house style permits a wider dynamic range than normal. It was made for Between The Ears, a strand whose brief positively encourages soundscapes and more sound design than most radio documentaries. The second of my own mixes was Doctor’s Dementia, a more conventional documentary for BBC Radio 4, the public service speech channel here in the UK.

[Table in the original post: Integrated and dialog-gated loudness measurements for the four programmes]

What is interesting is that for both The Grand Tour and Planet Earth 2, the Dialog Intelligence measurement correctly reflected the lower level dialog that I picked up in my earlier article Are TV Mixes Becoming Too Cinematic? and produced a normalised dialog-gated loudness of -26.1 LKFS for Planet Earth 2 and -26.3 LKFS for The Grand Tour Ep 2, compared to the R128 full mix measurements of 0 LU (-23 LUFS). Looking at my two speech-dominated documentaries, the Dialog Gated measurement for Cow Dust Time and Doctor’s Dementia were much closer to the R128 full mix measurement of 0 LU (-23 LUFS).

[Table in the original post: dialog loudness before and after LRA reduction]

As part of this experiment, I also investigated what would happen to the dialogue level if I reduced the LRA. As I could not remix some of the programmes in this experiment, I ran all the mixes through LM-Correct 2 from Nugen Audio, which is designed to repurpose content for different platforms. The aim was to reprocess Planet Earth 2 and The Grand Tour Ep 2 down to an LRA of around 10, and then again to an LRA of around 8, and see how that affected the dialogue level, using the Dialog Detection option in VisLM 2 from Nugen Audio. Here are the results…

As you can see, in both cases, reducing the LRA of the mix increased the dialog level. Planet Earth 2 started from a much larger LRA, more akin to a Netflix kind of mix, and by bringing the LRA down from 16.5 to 9.5, the dialog level came up from -26.1 to -23.5, making it a much more pleasant listen and one where you wouldn’t need to reach for the remote control.

Clearly, I am not the only one who thinks that LRA matters. In the UK, The Digital Production Partnership (DPP) updated its unified UK delivery specs for all UK broadcasters and added this guidance on Loudness Range…

  • Loudness Range - This describes the perceptual dynamic range measured over the duration of the programme. Programmes should aim for an LRA of no more than 18 LU.

  • Loudness Range of Dialogue - Dialogue must be acquired and mixed to be clear and easy to understand. Speech content in factual programmes should aim for an LRA of no more than 6 LU. A minimum separation of 4 LU between dialogue and background is recommended.

In Canada, the CBC and Radio Canada both now require that the LRA be less than 8 or 10 LU. They also go further and specify that the Integrated Loudness of the complete programme AND the Integrated Loudness of the dialogue stem must BOTH be -24 LKFS. Lastly, the momentary loudness must not exceed 10 LU above the target loudness; in other words, while maintaining a -24 LKFS target, the momentary loudness must always remain below -14 LKFS.

Moving on to OTT providers, Netflix, in their Netflix Audio Mix Specifications & Best Practices v1.0, provide LRA recommendations. They say…

The following loudness range (LRA) values will play best on the service:

  • 5.1 program LRA between 4 and 20 LU

  • 2.0 program LRA between 4 and 18 LU

  • Dialog LRA of 7 LU or less

  • Difference between FX content and Dialog of 4 LU
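
As an illustration of how such figures might be checked at the QC stage, here is a hypothetical sketch; the function and its thresholds are mine, built from the Netflix numbers quoted above, not an official tool:

```python
# Hypothetical delivery check against the LRA guidance quoted above.
def check_lra(program_lra_lu: float, dialog_lra_lu: float,
              fmt: str = "5.1") -> list[str]:
    issues = []
    lo, hi = (4, 20) if fmt == "5.1" else (4, 18)  # 2.0 tops out at 18 LU
    if not lo <= program_lra_lu <= hi:
        issues.append(f"Program LRA {program_lra_lu} LU outside {lo}-{hi} LU")
    if dialog_lra_lu > 7:
        issues.append(f"Dialog LRA {dialog_lra_lu} LU exceeds 7 LU")
    return issues

# Example: figures in the ballpark of the Planet Earth 2 measurements above.
print(check_lra(16.5, 9.0))
```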

The Delivery System

As part of the delivery system, whether it is satellite, digital terrestrial, or OTT, both the sound and picture get heavily data-compressed using "lossy" algorithms - most commonly H.264 for the video and a variant of AAC for the sound.

A lossy audio codec reduces the data bandwidth needed by literally throwing away things it thinks you can't hear, and once it’s gone, it’s gone. As we learnt, consonants are much quieter than the vowel sounds, so there is a greater chance that key information about the consonants could be thrown away as part of the lossy codec process. But intelligibility is not just about the sound: as we learnt with the McGurk Effect, intelligibility can also be affected by what we see, or don't see.
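
Returning to the codec itself, if you want to hear what the lossy stage does, a quick way is to encode the same master at two AAC bitrates and compare them by ear. A minimal sketch, assuming ffmpeg is installed and master.wav is a hypothetical stand-in for your mix:

```python
# Encode the same master at a low and a high AAC bitrate for comparison.
import subprocess

for bitrate in ("128k", "640k"):
    subprocess.run(
        ["ffmpeg", "-y", "-i", "master.wav",  # hypothetical source file
         "-c:a", "aac", "-b:a", bitrate,
         f"check_{bitrate}.m4a"],
        check=True,
    )
```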

As we covered in our article Netflix Announce 'Studio Quality' Sound To Their Streaming Service - Find Out More Now, not long after Scott Kramer joined Netflix as Manager, Sound Technology, Creative Technologies & Infrastructure, they reviewed ‘Stranger Things 2’ with the Duffer brothers in a living room environment, as the brothers like to check how viewers will experience their work. At one point in the first episode, there was a car chase scene that they found didn’t sound as crisp as it had done on the mixing stage.

Even though Scott was new in post at the time, he was reported as saying, “A lot of it was mushy,” and words like “mushy” and “smeared” are ones that Scott and his team found themselves using when describing audio that just isn’t quite as crisp as it should be.

Stranger Things is a very popular series on Netflix, and Scott very quickly realised this was something that needed to be “made right”. Netflix pulled in their engineering teams, determined to fix it, no matter how much effort it was going to take. The solution was to deliver a higher bitrate for the audio on Stranger Things 2, but rather than just fix this one series, they have been working hard to roll out improved audio more broadly.

It was an interesting example of the Netflix culture at work, doing what was needed to support their creative partners. Watch this video to hear the story from the perspective of the Netflix staff, including Scott Kramer…

Netflix told us that most TV devices that support 5.1 or Dolby Atmos are capable of receiving better sound. Depending on your device and bandwidth capabilities, the bitrate you receive may vary:

  • 5.1: From 192 kbps up to 640 kbps

  • Dolby Atmos: From 448 kbps up to 768 kbps for subscribers to their Premium plan

You can get the full story by reading our article Netflix Announce 'Studio Quality' Sound To Their Streaming Service - Find Out More Now. There is no doubt that Netflix would not have spent the time and money increasing the delivery bandwidth if the difference was insignificant.

The Speakers In The TV

  • “TV speakers are rubbish since flat screen tvs.”

  • “It’s as much to do with audio quality (sound mix)/crappy speakers in flat screens, as it has to do with multitasking, phones, distractions…”

  • “The advent of sound bars doesn’t fix the problem of crap sound on flat screen TVs, or maybe it’s just our one that’s rubbish so yes, I use subtitles where available. Habit I got into trying to watch The Wire …”

  • “It depends on the program, i don’t think sound quality is always good. Sometimes speech seems muffled or rushed. My hearing is excellent. Perhaps i need a soundbar?”

This is another area that has come in for a lot of criticism in the press and even at the governmental level. Tom Harper, who directed War and Peace, has said that, while he respects the views of sound recordists, in his opinion and experience, if there are audibility problems…

“They arise at the broadcast and TV reception point, as the soundtrack is played out on reduced bandwidth to two tiny speakers.”

As flat-screen plasma and LED screens have become the norm, there is less and less room for the loudspeakers in the consumer's TV. Back in the good old days of CRT TVs, there was a good-sized cabinet with room for a reasonably sized speaker that could produce reasonable sound, with a good chance it was also forward-facing.

With flat-screen TVs and the desire for smaller and smaller bezels, there isn’t anywhere to put the speakers on the front, so they are often tucked away around the back with very small drivers, and then we wonder why we get intelligibility complaints. In our article Speech Intelligibility - The Facts That Affect How We Hear Dialog, we learned that the optimum position for intelligibility is one metre from the person speaking, with the speaker and the listener facing each other. If one is not facing the other, the intelligibility drops off. Similarly, with these slim TVs with speakers around the back, the speakers no longer face the viewer, so intelligibility is further compromised.

As a result, we have seen a significant growth in forward-facing soundbars to effectively replace the crappy, badly positioned speakers in most LED screen-based TVs.

Downmixing

  • “Audio mix engineers seem to favour background noise (ambience) and muzak over dialogue. Especially in movies mixed to 5.1 where it is assumed you have a centre channel. If your TV / AV does not correctly blend the channels, it can be really hard to hear what is being said.”

  • “I use a Pro Logic II surround receiver and I get a really great centre channel separation for dialogue.”

Whilst we are on the consumer’s tech, another contributing factor to intelligibility is downmixing. The delivery specs typically require that the centre channel is reduced by 3 dB in the downmix. While that may be technically correct, I wonder if it is sonically the best thing to do, as there is a distinct acoustic difference between the discrete centre channel in 5.1 and the phantom mono centre of a stereo pair of speakers. When you mix in 5.1, do you monitor the stereo downmix, then go back and check and maybe slightly tweak the mix in 5.1? After all, perhaps more than 90% of viewers will be listening in stereo, which makes it important for us to check the downmix, whether we are required to deliver an LoRo or LtRt stereo mix, or whether it is going to be derived in the consumer's equipment.
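
For reference, here is a minimal sketch of the conventional LoRo downmix being described, with the centre channel attenuated by 3 dB (a factor of about 0.707) into each side, per typical ITU-R BS.775 practice; the function name is mine:

```python
# LoRo downmix sketch: centre and surrounds folded in at -3 dB.
import numpy as np

def loro_downmix(L, R, C, Ls, Rs, centre_db=-3.0, surround_db=-3.0):
    c = 10 ** (centre_db / 20.0)    # ~0.707 for -3 dB
    s = 10 ** (surround_db / 20.0)
    Lo = L + c * C + s * Ls
    Ro = R + c * C + s * Rs
    return Lo, Ro
```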

Phantom Centre

Research has shown that a stereo system with a phantom centre channel will also compromise intelligibility. This effect results from acoustical crosstalk that occurs when two identical signals arrive at the ear, with one slightly delayed compared to the other. The resultant comb-filtering effect cancels out some frequencies in the audio. Other research has shown a small but measurable improvement in intelligibility by utilising a central loudspeaker for speech instead of a phantom centre.
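
A quick back-of-envelope sketch of that comb filter: summing a signal with a copy delayed by the ear-to-ear path difference (around 0.2 ms, an assumed figure) puts the first cancellation notch right in the consonant-critical range:

```python
# Comb-filter nulls for a phantom centre: frequencies where the direct
# and crosstalk arrivals are out of phase cancel.
import numpy as np

delay_s = 0.0002                          # assumed ~0.2 ms crosstalk delay
k = np.arange(4)
null_freqs = (2 * k + 1) / (2 * delay_s)  # odd multiples of 1/(2*delay)
print(null_freqs)                         # 2500, 7500, 12500, 17500 Hz
```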

The Solution

My advice to consumers is to go for a soundbar. This way, the sound is anchored to the TV. With a 5.1 system, there are six speakers, any number of which could be in the wrong place. I remember going to someone’s house to find they had a domestic 5.1 system with the left and right speakers on each side of the TV and the centre and surround speakers propped up along the back of the sofa against the wall, which meant all the dialog was coming from behind you!

What Can We Do About This?

How can we resolve the situation where normal-hearing people feel it necessary to use subtitles to understand what is being said? Just as the problems are manifold, there isn’t one quick fix.

The simple answer to the question ‘What can we do about this?’ is to improve everything. A fuller answer will take longer, but one thing is for sure: I don’t believe this is going to be something a plug-in can fix.

A Better Understanding Of The Many Issues

The single biggest improvement would be a better understanding and appreciation of all the issues, especially among those who have the influence: those in control of the creative side, like the directors, and those who hold the purse strings, like the producers and the people commissioning the content. This is especially relevant to the push towards realism, but also to the choice of who mixes the content and where, how the content is acquired, and the script and location choices.

What About Mixing In Correctly Sized Rooms?

I can understand why Netflix might want a delivery spec with a wider dynamic range because a lot of their content was made for the big screen rather than the small screen. However, it seems they have been transferring the same production values to the content they commission for the small screen, which is, in my view, flawed. Commenting on our article Loudness and Dialog Intelligibility in TV Mixes - What Can We Do About TV Mixes That Are Too Cinematic? Reid Caulfield, referring to the new Netflix specs, said...

“Mixes meant for the "At-Home" environment MUST be mixed - or remixed, if it was originally done in a large theatre - in a near-field environment at 79dB. NOT a large theatre at 85 because someone needed to fit 40 people in the room. And, it cannot be mixed in that large environment simply with the large speaker arrays turned off and the near fields turned on. It needs to be mixed in a much smaller TV-oriented room.”

He then suggests how this could be policed...

He suggests specifying that all elements be delivered in a Dolby Atmos-At-Home "wrapper". Even if the show has not been mixed as an Atmos presentation, specifying delivery as an ADM file guarantees that the source room's size and speaker layout are included in the metadata that travels with the data file and programme content.

I couldn't agree more about the need to remix cinema content, or to mix content commissioned for "At-Home" consumption in smaller spaces at a more appropriate monitor level, like 79 dB. I like his idea of using the Dolby Atmos-At-Home wrapper, as it will include the metadata of the room it was mixed in, making it much easier to police.

But until the powers-that-be take up Reid's suggestion, what can we do about mixes that have an excessive LRA for domestic consumption?

Is It Time To Make A Maximum LRA A Required Part Of The Spec?

As I have demonstrated, an LRA of anything above 10 LU is too high for content created for domestic consumption. In my view, a maximum of 18 to 20 LU is way too high, so maybe it is time to add an LRA figure to the BS.1770 standard. At the very least, a maximum LRA should be a requirement in broadcasters’ delivery specs rather than advisory.

What About Using Object Based Audio?

Another option which definitely helps is the use of Object Based Audio and MPEG-H codecs. Check out these two articles, in which we show how object-based audio can deliver content to end users with user-friendly controls to adjust what the consumer hears.

In the article Object Based Audio Can Do So Much More Than Just Dolby Atmos? We Explore, we looked at the work of Lauren Ward, a postgraduate audio engineering researcher at Salford University with a passion for broadcast accessibility. Lauren’s research has been looking at a methodology whereby the different audio objects in a piece of content are scored for how important each one is to the narrative. Objects essential to the story, like the dialog or a door opening, are scored as essential. Other sounds that add to the narrative but aren’t needed to follow the story, like ambiences and music, are scored progressively lower.

Then there is a single control that you can adjust from a full normal mix through to an essential-only mix for the very hard of hearing. I have had a chance to try this out on a visit to Salford University and found it very simple and intuitive, and the process of scoring the objects would be very easy to do during the production process.

The single control interface is much simpler than other personalised options, which present multiple-level controls for each object, such as commentary, FXs, home crowd, away crowd, etc.

Since we published this article, Lauren’s research has moved on, with a public beta experiment here in the UK. The experiment took a recent episode of the BBC One medical drama ‘Casualty’ and presented a version of it on the BBC website that includes a slider in addition to the volume control. Keeping this additional slider on the right-hand side retains the standard audio mix; moving it to the left progressively reduces background noise, including music, making the dialogue crisper. The experiment attracted the attention of the UK national press, including this article in The Times. It was a time-limited demo that has since been extended. Here is a small section of the experiment to show you how simple and effective this is…

Although Casualty is based in an Accident and Emergency department of a large UK hospital, A&E, in this context, stands for ‘Accessible and Enhanced’ audio. In this BBC project, they are trialling a new feature that allows the consumer to change the audio mix of the episode to best suit their own needs and preferences.

Although the project is aimed at the 11 million Britons with hearing loss and any others who struggle to make out what actors are saying, the UK press spotted that commuters who stream shows on noisy trains and buses could also benefit.

As we showed in our article Is This The Answer To TV Audio Critics? Object Based Audio Case Studies Presented At The AES 146th Convention In Dublin, this technology can be built into consumer TVs, and as the BBC Casualty experiment shows, web-based and streaming services could easily build it into players hosted on smart TVs. Then everyone, both normal-hearing and hearing-impaired, could benefit from this excellent system.

I do not believe this will be difficult to implement. There are, of course, two parts: the implementation of the slider at the consumer’s end, and the ranking of the content during production. As Lauren explains…

“Our technology adds two things to the process of making and watching a TV programme. The first occurs after filming, when the audio mixing takes place. At this point each sound, or group of sounds, has an importance level attached to it (stored in metadata) by the dubbing mixer or producer.”

You could have a rating system like Avid has in Pro Tools for rating clips. It would be very easy to have a narrative importance rating system in the production process and then for that metadata to be embedded into the delivery stream. Lauren explains…

“Some non-speech sounds, such as the flatlining beep of a heart monitor in Casualty, are crucial to a show’s narrative. The technology allows these noises to stay prominent while non-essential sounds are turned down.”
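
To show how simple the consumer-side logic could be, here is a hypothetical sketch; the class, the gain law and the importance scores are all mine, illustrating the narrative-importance idea rather than the actual BBC/Salford implementation:

```python
# Hypothetical narrative-importance renderer: one slider, many objects.
from dataclasses import dataclass

@dataclass
class AudioObject:
    name: str
    importance: float  # 1.0 = essential (dialogue), 0.0 = purely decorative

def object_gain(obj: AudioObject, slider: float) -> float:
    """slider = 1.0 keeps the full mix; slider = 0.0 keeps only essentials."""
    if obj.importance >= 1.0:
        return 1.0  # essential objects always pass at full level
    # Less important objects fade out earlier as the slider moves left.
    return min(1.0, slider / max(1.0 - obj.importance, 1e-6))

mix = [AudioObject("dialogue", 1.0), AudioObject("heart monitor beep", 0.9),
       AudioObject("music", 0.4), AudioObject("street ambience", 0.2)]
for obj in mix:
    print(f"{obj.name:18s} gain = {object_gain(obj, slider=0.5):.2f}")
```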

Object-based audio offers the consumer a lot more control. It also provides content providers with the technology to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content.

In Summary

As we said at the beginning of this article, the reasons normal-hearing people resort to using subtitles are many and varied, and they often compound to make things even worse.

As an audio post-production editor and mixer, I feel we have failed if consumers with normal hearing have to turn on subtitles to follow the narrative. It is incumbent on us, and on those with influence and control over the budgets and creative choices, to understand the issues and to work together to resolve this failure for the consumers we serve.

That’s what I think. What do you think? Please do share your thoughts and experiences in the comments below…
