Thoughts on machine transcription

As a journalist and business/technology writer, I regularly record interviews. I also listen to recordings of other people’s interviews. Following each recording, I embark on the slog of transcribing what interview subjects say.

I regularly deal with different accents, so I can (gleefully) make out every word said in the following video.

There’s a point to my including this video here beyond the gratuitous fun of it, but first I need to explain a few other things.

How I transcribe notes today

I’ve gradually improved the efficiency of this process to what I do today:

  • During interviews, I both record and take notes.
  • Should an interview subject say something I want to come back to, I note the time index on my recording device (like QuickTime on my Mac) in my notes next to the thing said.
  • When I “transcribe” I go to those time indices to get the quotes I know I want.
  • I might review whole recordings afterward (at higher speeds, to save time).
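For anyone who keeps notes the same way, the "go to those time indices" step can be sketched in a few lines of Python. This is only an illustration, not a tool I use: it assumes each note of interest starts with a time index written as MM:SS or H:MM:SS, and the names `to_seconds` and `quotes_to_fetch` are made up for the example.

```python
import re

# Assumed note format: a line that starts with "MM:SS" or "H:MM:SS",
# followed by whatever was jotted down about the quote.
TIME_INDEX = re.compile(r"^\s*((?:\d+:)?\d{1,2}:\d{2})\s+(.*)$")


def to_seconds(stamp):
    """Convert 'MM:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def quotes_to_fetch(notes):
    """Return (seconds, note) pairs for lines that begin with a time index."""
    found = []
    for line in notes.splitlines():
        match = TIME_INDEX.match(line)
        if match:
            found.append((to_seconds(match.group(1)), match.group(2).strip()))
    # Sort by position in the recording so the quotes can be fetched in one pass.
    return sorted(found)
```

Feeding it the text of an interview's notes returns the positions to jump to, in playback order; lines without a time index are ignored.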

This process helps me to cut down on transcription time. It’s still kludgy, but it works. Outside of hiring other people to transcribe interviews, this is about the most effective way to put a person’s spoken words in a text document. (I’m open to suggestions, by the way.)

The promise of transcription software

Transcription software has been around for a long time. It has suffered from the inability to accurately transcribe a person’s voice unless the person spent time (about 15 minutes or so, in some cases – maybe more for Scots 😉) “training” the software. Initial training wasn’t enough; ongoing refinement of the software’s understanding of the person’s voice was also needed.

Such systems are useless to me. I want the words of other people to be transcribed, and I was never going to ask interviewees to spend time training my transcription software.

More recently, voice recognition and dictation software built into mobile devices and computer operating systems heralded the imminent arrival of the holy grail of transcription: understanding multiple voices WITHOUT the owners of those voices having to train the software.

I’m sure this is happening, since I regularly use dictation to create messages on my iPod and iPad, and have also done this on my Mac, with great results.

I recently tested a promising system called Trint. The beta of Trint offered one hour of recording for free, but the good people at Trint gave me more uploads to do more testing (or as they call it, “Trinting”) to help me transcribe articles I’m working on.

Where does the name “Trint” come from?

Jeff Kofman, Trint’s CEO & co-founder, tells me he invented the word by combining “transcription + interview” (or “transformation + interview”). (Our email correspondence started because I asked Trint for more transcription time, and Kofman hopped onto that email thread.)

He has high expectations for his stab at nomenclature:

We wanted something easy to say and easy to spell that could be turned into a noun or verb. When investors ask me where the company will be in five years my response: The Dictionary.

How Trint works

It’s pretty easy to get started on Trint.

  • Create an account
  • Upload an audio file
  • Choose the accent of the speaker (more on this later)
  • Give Trint a few minutes to transcribe. Transcription can take as long as the recording itself, although it can take less time.
  • Once the transcription is done, Trint emails you a link so you can dig right in.

Kofman tells me that for a little while, it might be difficult to get a Trint account. Here’s what he wrote:

the response to the release of our free open beta was staggering. We went viral very quickly and felt we had to impose limits while we move to finish the payment system. We expect to be offering a paid service in March. Current account holders will have access first, then those on the waiting list (which is growing very quickly.) As a startup we want to make sure that we don’t feel pressured to scale so quickly that we demand overwhelms us.

My “Trinting” experience

The recorded words appear in Courier (or a Courier-like font – I can’t tell) in an online document that’s part of a simple interface that’s easy on the eyes.

In parts, the text is spot-on, which is surprising given the two voices and the background noise in each recording I tested. But it is far from fully accurate throughout entire recordings.

Trint offers a choice of accent for the whole recording. That can be an issue if the people on the recording speak using different accents. Strike that – it’s a large issue, since accounting for every accent used to speak English around the world could take about as long as calculating the last digit of pi.

The video towards the beginning of this post is just one example. Have three minutes to spare? Watch this video too.

Aside from accents, fluctuating volume and speed during conversations can play havoc with Trint’s transcription efforts.

At the less serious end of such issues, Trint inserts a period whenever people pause. It’s easy enough to get rid of. The text isn’t fixed. You can edit it in the browser. The playback sometimes pauses as you edit text. (I didn’t see consistent behaviour here.)

Play/pause and rewind 5s buttons stay to the right of the text as you scroll or play through it. It would be handy if Trint could work with the dedicated media control buttons on my Mac keyboard, but it doesn’t. (This is a personal preference – I find I work faster when I keep my hands on the keyboard and don’t need to move the mouse pointer.)

Kofman sent me a screenshot of these keyboard shortcuts from the Help menu. (Yes, it’s a little shameful for a technical writer to admit he didn’t look at the instructions.)

Trint Keyboard Shortcuts

“Clearly we need to signpost the keyboard shortcuts much better!” Kofman admits. (Meanwhile, I’m directing the old “RTM” acronym at myself, since I know better.) Kofman has been so responsive to my questions that I’m going to quote him a little more.

You can track progress in a few ways.

  • Text changes from faint to bold once it’s “spoken.”
  • A waveform of the conversation lines the bottom of the screen and a solid vertical line travels from left to right along this waveform as you progress through the recording.
  • Trint helpfully inserts time indices to separate the speech into blocks ranging from several seconds to a minute and a half. I like how this orients the reader and how it mimics my own time-stamp habit.

I wanted to separate the introductory part of an interview from the “meat” of that interview, but there’s no button to insert a bookmark in the recording. The workaround is to type a marker into the text itself.

To find bookmarks, or other things, a built-in text search or scan tool would help. I simply used the one built into my browser, and coupled with the ability to edit text, I could separate the transcribed text into sections.

I frequently speed up or slow down playback speed. Sometimes the talk is just tangential background stuff that I can understand at faster speeds but don’t need for my work. At other times, speech may be unclear and slowing playback can help me decipher what was said. Fortunately, a control below the text lets me play back recordings at speeds ranging from half to twice the original speed.

I’d like to see the screen scroll itself to keep the currently spoken text in the middle of the screen. Software like Spritz Reader handles this elegantly, albeit in a different way.

Other features in Trint include the ability to highlight and strike parts of transcriptions. You can limit playback to just the parts you’ve highlighted, or you can skip the parts you strike.

The current deal-breaker

Many of the shortcomings mentioned to this point are minor annoyances. Here’s the one that sticks out for me: when I selected text and hit the keyboard shortcut for “copy,” Trint left a “c” in place of the text I wanted to copy. To compound this issue, I couldn’t paste the text back in place. I’ve had to retype text passages, then select and copy them using the mouse popup menu to place into my article outlines.

If I can properly copy and paste within this WordPress post as I write, I don’t see why Trint can’t let me do the same thing.

I exported Trinted interviews (the site uses “Trint” as a verb) to a Microsoft Word document, where this peculiar behaviour doesn’t happen, but I then needed to go back and forth between Trint (to control the recording) and Word.

That’s why I played back my original recordings using QuickTime (and my keyboard shortcuts) and edited text in Word instead of continuing to work in Trint. I suspect this goes against Trint’s wishes, since it might be learning from people’s corrections entered in the Trint window to sharpen its transcription skills, but I was in a rush (as I often am) to get quotes from interviews for articles clients were waiting for.

Here’s Kofman again to fill in the blanks:

Customers will have the option of giving us their corrected returns. When an unknown word is corrected the same way by several users it can be entered into the corpus. We won’t see the full transcript, just the corrected words, but we will let users opt in and opt-out as they choose. Several news organizations have told us they would like us to do this, so that names, places and terms in the news that are transcribed incorrectly today can be correct tomorrow.
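To make the idea concrete, here is a rough Python sketch of the rule Kofman describes: accept a correction only once several different users have made it the same way. It is my own guess at the logic, not Trint’s code. The function name `words_to_add`, the tuple layout and the threshold of three users are all assumptions, since Kofman only says “several.”

```python
from collections import defaultdict


def words_to_add(corrections, min_users=3):
    """Pick out corrections that enough distinct users made the same way.

    corrections: iterable of (user_id, heard, corrected) tuples, where
    'heard' is what the software transcribed and 'corrected' is the fix.
    min_users is an assumed threshold standing in for "several users".
    """
    users = defaultdict(set)
    for user_id, heard, corrected in corrections:
        # A set means one user repeating the same fix is counted only once.
        users[(heard.lower(), corrected)].add(user_id)

    accepted = {}
    for (heard, corrected), who in users.items():
        if len(who) >= min_users:
            # If two different fixes for the same word both qualify,
            # the later one wins; a real system would need a tie-break.
            accepted[heard] = corrected
    return accepted
```

Only the corrected words pass through this function, which matches Kofman’s point that the full transcript is never needed.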

Conclusions

  • Trint is certainly a minimum viable product, but it has some distance to travel before it becomes a part of my business toolkit.
  • Don’t look to Trint for exact transcriptions. That still calls for a human ear.
  • Trint seemed to lessen the time needed to transcribe and fetch the quotes I wanted, even if I exported the transcribed text to Word and used my original recording instead of the Trint interface.
  • A lack of “standard features” like keyboard shortcut support (copy, paste, audio playback/pause/forward/rewind) makes the Trint interface a non-starter for me.
  • I’m curious to find out what Trint decides to charge when they get their payment systems working.
  • I want Trint to improve its offering, since there’s a lot of promise in this technology.

Even if I’m bearish on Trint today, I recognize the beta nature of this tool. (“Beta” in the software world means a product that still requires testing and polish before it’s ready for prime time.)

I hope Trint improves. Given how often I transcribe recorded interviews containing two or more voices, I could see myself subscribing to Trint, or other tools like it, if they make transcription easier for me.

4 Comments
  1. Hey Luigi! Thanks for taking the time to review Trint. The team is really happy to hear about the positive points you’ve raised – it’s also great to see how Trint can grow to help fellow Trinters simplify their workflows. The suggestions you make to improve the user experience are definitely valid, in fact, many are in development right now.

    The copy and paste functionality is a bit more complex: text seen on screen includes invisible code and metadata to control its behavior, function and format. The dev team is investigating an appropriate way to “clean” the pasted text of its metadata before it is added to the Trint Editor, to avoid disrupting functionality. As you can imagine, this requires a fair bit of testing before it goes live!

    As a general note, while we believe Trint gives you the best automated transcription available, it’s not offering to provide 100% accurate transcription (Trint is magic… but it can’t perform miracles!). Trint is designed to do the heavy lifting of transcription, so it’s easy for Trinters to polish and get to perfect using significantly less time, money and effort.

    If you know anyone else who’d be interested in taking Trint for a spin, they can upload an hour of audio or video completely free – the team would welcome their feedback too! It’s easy to sign up for the next wave of accounts on the Trint homepage: trint.com

  2. I see the hour of free time is now only 30 minutes.

    I’m a transcriptionist by trade and was shown the Trint site by an employer who wanted my thoughts. I gave it a spin on a clear audio of two people speaking, one at the mic, one on the phone. I was less than impressed.

    I realized I would have to edit the document, so I was prepared for that.

    Even clear audio from the speaker near the phone was mangled. And since it was mangled so badly and I had to edit the document so closely (the period. at. every. hesitation. drives. me. nuts.) I can’t say that I really trust I’m going to end up with an accurate document to return to my client. Since I pride myself on returning as close to 100% accuracy on documents as possible, this isn’t going to work for me. Especially at $.25 per audio minute that Trint charges.

    The only possible scenario I can imagine is a lecture given in pristine conditions by someone who never pauses except at the ends of sentences, has an American English Midwestern accent (my clear speaker was from Seattle…and Trint consistently misheard the same words as something else) and where there is no background noise. Possibly students would find this useful, where inaccuracies wouldn’t matter since they know the content.

    It was an interesting 30 minutes, but I’m highly doubtful I’d ever use Trint in my day-to-day work.

    • Thanks Carol. Sounds like Trint is still where it was when I tested it about 6 months ago. I still hope they can improve the technology, but from what I understand it’s a daunting task.

  3. Interesting review and comments. Does Trint perform any better than Google’s voice-to-text service?

    Simon
