Closed captions are becoming a regular, visible feature of video content. For instance, Now This produces short round-up videos that play music over images and clips, with closed captions transcribing any speech. TikTok creators also caption their videos, from simple vlogs and updates to memes and skits. Quality varies from platform to platform: automatic captions on some YouTube videos are, in places, nonsense. As a result, many creators write their captions themselves, sometimes enlisting speakers of other languages to translate for a wider audience.
Other AI-backed closed captions are far more impressive, though, and the technology will continue to develop and hit higher standards as its use extends beyond entertainment into the B2B and education sectors. So how exactly does this technology work? Here's how.
Recognising Speech and Audio
To begin, the AI program must be able to differentiate between speech and other audio. AI is a capable learner: fed enough data, whether numerical, lexical, visual, or audio, it begins to spot patterns. The tank-spotting urban legend is a useful cautionary tale here. Military leaders wanted to spot enemy tanks in dense forest more quickly and reliably than the human eye could, so scientists began training an algorithm to do it. Results were promising at first, and then things went sideways: the algorithm had not learned to recognise tanks at all, only to tell which photos were taken in daytime and which at night. The moral: an algorithm learns only as well as it is taught. Things have moved on since then; AI can now, for example, help spot wildfires in their earliest stages.
For speech and audio, the AI is fed appropriate material and taught what to listen for. It can detect speech across different accents, dialects, and vocal registers, and can tell singing from speaking. For closed captions to be as useful as possible, the system can also learn who is talking as it transcribes. This lets an audience follow changes of speaker even when they can't see visual cues. Auto-captioning companies like Verbit offer their services to Zoom users and advertise 99% accuracy along with speaker identification.
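To make that first step concrete, here is a minimal sketch of speech detection using the open-source webrtcvad package. The frame length and aggressiveness setting are illustrative choices, not anything a captioning vendor has published; real systems pair this kind of voice-activity detection with a full speech recogniser.

```python
# A minimal voice-activity-detection pass: label each 30 ms frame of
# 16 kHz, 16-bit mono PCM audio as speech or non-speech.
import webrtcvad

def speech_frames(pcm_bytes: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield (timestamp_seconds, is_speech) for each frame of raw PCM audio."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is an illustrative middle setting
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    for offset in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[offset:offset + frame_bytes]
        yield offset / 2 / sample_rate, vad.is_speech(frame, sample_rate)
```

Frames flagged as speech would then be handed to the recogniser for transcription. Working out who is speaking (speaker diarisation) is a separate step, typically handled by a model that clusters voices; pyannote.audio is one open-source option.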
It can also identify non-speech sounds. Everything from the buzz of a chainsaw to birds chirping, a chair being dragged across a floor, or a phone ringing: any sound that matters to the content should be distinguishable and described in the captions.
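As a sketch of how that kind of sound tagging could work, the snippet below uses YAMNet, a publicly available audio event classifier trained on Google's AudioSet labels. It assumes the audio has already been resampled to mono 16 kHz float32 in [-1, 1]; this is one way to tag sounds, not any particular captioning vendor's pipeline.

```python
# Tag non-speech sounds in a clip with YAMNet, a public audio event
# classifier covering labels like "Chainsaw", "Bird", and "Telephone".
import csv
import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# Map class indices to human-readable names from the model's bundled CSV.
with open(model.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

def top_sound_events(waveform: np.ndarray, top_k: int = 3):
    """waveform: mono float32 audio at 16 kHz, scaled to [-1.0, 1.0]."""
    scores, _embeddings, _spectrogram = model(waveform)
    mean_scores = scores.numpy().mean(axis=0)  # average scores over time frames
    best = mean_scores.argsort()[-top_k:][::-1]
    return [(class_names[i], float(mean_scores[i])) for i in best]
```

A captioning system would turn the top-scoring labels into bracketed descriptions like [chainsaw buzzing] or [phone ringing], filtered by how relevant they are to the content.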
Vocabulary and Language
There are many words in a language that sound like other words in the same language, let alone words that sound like those in a foreign language. AI learns to contend with pairs like 'insight' and 'incite,' and to recognise that a French speaker saying 'selfie' is using an English loanword rather than switching languages. Language is a complicated thing, and that is without getting into the philosophical difficulties of language raised by Derrida, Rousseau, and others.
The technology will know as much as it's shown. For video-conferencing software, for instance, it is exposed to a wide vocabulary so it can cope with colloquialisms. It can also use contextual information to distinguish between words like 'bear' and 'bare,' so no unfortunate confusions arise.
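One common way to use context like this is to let a language model score the candidate transcriptions and keep the most plausible one. The sketch below uses the publicly available GPT-2 model from Hugging Face's transformers library; captioning companies train their own models, but the rescoring principle is the same.

```python
# Rank homophone candidates by how plausible a language model finds them.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_loss(text: str) -> float:
    """Average per-token negative log-likelihood; lower means more plausible."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

candidates = ["Please bear with me.", "Please bare with me."]
print(min(candidates, key=sentence_loss))  # the model should prefer "bear with me"
```

In a real recogniser this rescoring happens over many candidate word sequences at once, weighted against how well each matches the audio.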
As mentioned, this technology will only improve as it is implemented in a wider range of environments and as demand for it increases. That demand is vocal, and it has to be: this technology cannot read our thoughts.