How we turn hundreds of research papers into a podcast every day

Keeping up with all the newest AI research isn’t humanly possible: The volume of new papers published every single day is absolutely crushing. But it’s really important for us to keep track of the latest developments—so we went looking for some superhuman help.

We built ourselves a simple tool called Verso Smart to turn academic AI papers into brief podcast episodes. The project has been a fascinating experiment in stringing together a bunch of different AI agents to complete a complex task.

Since Verso Smart is an experiment, we haven’t made it public. But we think it’s a useful template for designing systems that expand an organization’s newsgathering capabilities, while keeping us humans in control the whole way through.

Here’s a recent episode:

As you’ll hear, our synthetic podcast hosts are (somewhat cornily) discussing several recent AI papers. But there’s a big realization hiding inside this audio clip: We can now use language models to turn nearly infinite forms of information into a huge range of outputs.

Do you want a weekly email about the latest restaurant permit applications in your neighborhood? A breathless TikTok-style video about each meeting of the city council subcommittee on transportation? A brief Spanish-language summary of the 30 cooking newsletters you get every morning? All of this is possible!

Here’s how we built our prototype.

Think like an editor; build like a developer

We approached our podcast generator as an editorial problem first.

You can think of the journalistic process in five basic stages: research, selection, reporting, production, and distribution. Breaking it down this way helped us see where automation made sense and—crucially—where we wanted to maintain human control.

Here’s one way to think about the journalistic process, whether it includes AI or not

To put this process in a traditional newsroom context:

  1. Research is when a reporter goes looking for good stories by reading widely and calling up sources;

  2. Selection is the pitching and approval process;

  3. Reporting is the information-gathering and synthesis step;

  4. Production is medium-dependent—it might be writing and editing an article for print or digital, recording and producing a video for YouTube or TikTok, or preparing a podcast or radio segment;

  5. Finally, distribution is delivering the editorial team’s hard work to the public, whether by posting it on a news site, a social platform, or sending it to the good old printing press.

How Verso Smart works

We can use the journalistic process above to explain how our system works, too.

  1. Research: First, Verso Smart searches arXiv, a widely used repository of technical research, for AI-related papers published in the last three days. This step doesn’t require AI.

  2. Selection: Next, the system uses OpenAI’s gpt-4o-mini, an inexpensive language model, to sift through the hundreds of recent papers and pick the eight that are most relevant to our interests. Verso Smart then asks us, the users, to pick our three favorites from the stack.

  3. Reporting: The system passes the papers we chose to gpt-4o, a more powerful but pricier AI model, to write a conversational podcast script featuring two hosts. We’ve prompted the model to write a punchy script that explains complex ideas and draws out connections between the research paper and the media industry.

  4. Production: Verso Smart sends this script off to OpenAI’s tts-1-hd, a text-to-speech model, which generates clips of the two hosts’ lines read by two different synthetic voices. Then, the system stitches those clips into one audio file.

  5. Distribution: Finally, Verso Smart writes a newsletter using gpt-4o-mini that links to the original papers and summarizes each paper in one sentence, attaches the podcast file, and sends it to us.
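To make the production step more concrete, here's a rough sketch of how a two-host script might be split into per-voice clips before stitching. This is not the actual Verso Smart code. The HOST_A/HOST_B labels, the voice assignments, and the synthesize callable (a stand-in for whatever wraps the text-to-speech request) are all assumptions made for illustration.

```python
import re

# Assumed mapping from script labels to synthetic voices (illustrative only).
VOICES = {"HOST_A": "alloy", "HOST_B": "onyx"}
TURN = re.compile(r"^(HOST_A|HOST_B):\s*(.+)$")


def parse_turns(script):
    """Split a script into (voice, text) pairs, one per speaker turn."""
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        match = TURN.match(line)
        if match:
            turns.append((VOICES[match.group(1)], match.group(2)))
        elif turns and line:
            # An unlabeled line continues the previous speaker's turn.
            voice, text = turns[-1]
            turns[-1] = (voice, f"{text} {line}")
    return turns


def render_episode(script, synthesize):
    """Call synthesize(voice, text) once per turn and join the audio bytes.

    Joining raw bytes is the naive approach; a real build would probably
    use an audio library to handle gaps and encoding cleanly.
    """
    return b"".join(synthesize(voice, text) for voice, text in parse_turns(script))
```

Keeping the parsing separate from the synthesis call means the script format can be checked without spending anything on audio generation.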

Here’s a sample Verso Smart newsletter

How we built it

After thinking through the design, we leaned on language models to help us code the system. We primarily used Claude (via Cursor, an AI-infused developer app), because we find it’s the most competent co-programmer.

We deliberately mixed AI models based on their strengths: gpt-4o for complex tasks like understanding academic papers and writing natural dialogue, and the cheaper gpt-4o-mini for simpler jobs like selecting papers we might care about based on their abstracts. In the future, we might swap out gpt-4o for Claude’s new Sonnet model, which often has a more pleasant writing style.
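One way to keep that kind of model mixing easy to change is a small lookup table from task to model. The sketch below is an assumption about how this could be organized, not a description of the real code; the task names are hypothetical, while the model names come from the steps described above.

```python
# Hypothetical task names mapped to the language models named in this post.
MODEL_FOR_TASK = {
    "select_papers": "gpt-4o-mini",    # cheap: skim hundreds of abstracts
    "write_script": "gpt-4o",          # pricier: read papers, write dialogue
    "write_newsletter": "gpt-4o-mini", # cheap: one-sentence summaries
}


def model_for(task):
    """Return the model configured for a task, failing loudly on typos."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"No model configured for task {task!r}") from None
```

With a table like this, trying a different model for script writing would be a one-line change.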

The system is built in Python. Each of the five process steps is a script; main.py orchestrates the whole operation:

verso_smart/
├── main.py
├── arxiv_fetcher.py
├── paper_selector.py
├── script_generator.py
├── audio_generator.py
└── email_sender.py
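For a sense of what the orchestration could look like, here's a minimal sketch under stated assumptions. The function names and signatures are invented for illustration; only the five-stage order, the shortlist of eight, and the human pick of three come from the description above. Each stage is passed in as a plain callable, so the stubs can be replaced by whatever the individual scripts expose.

```python
def choose_in_terminal(candidates, picks, ask=input):
    """Human-in-the-loop step: show the shortlist and read the user's picks."""
    for number, paper in enumerate(candidates, start=1):
        print(f"{number}. {paper['title']}")
    raw = ask(f"Pick up to {picks} papers (comma-separated numbers): ")
    numbers = [int(tok) for tok in raw.split(",") if tok.strip().isdigit()]
    valid = [n for n in numbers if 1 <= n <= len(candidates)]
    return [candidates[n - 1] for n in valid][:picks]


def run_pipeline(fetch, shortlist, choose, write_script, make_audio, send,
                 shortlist_size=8, picks=3):
    """Run research, selection, reporting, production, and distribution in order."""
    papers = fetch()                              # research
    candidates = shortlist(papers, shortlist_size)  # selection (model)
    chosen = choose(candidates, picks)            # selection (human)
    if not chosen:
        return None                               # nothing picked, no episode
    script = write_script(chosen)                 # reporting
    audio = make_audio(script)                    # production
    send(chosen, audio)                           # distribution
    return audio
```

The pipeline stops early if nobody picks a paper, which keeps the human decision as a hard gate rather than a suggestion.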

Room to grow

This is the first version of our experiment. There are a million ways we could make this work better or expand it to new uses.

For example, we could replace arXiv papers with any number of other programmatically accessible data sources:

  • Academic databases (Elsevier, Springer, PLOS)
  • Government data portals (datasf.org, data.gov)
  • Meeting transcripts and public records
  • RSS feeds and newsletters
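Swapping sources is simpler if every fetcher returns items in the same shape. As a reference point, here's a rough sketch of what an arXiv fetcher might look like, using arXiv's public query API, which returns an Atom feed. The cs.AI category, the result limit, and the dictionary keys are assumptions for illustration, and the actual download is left as a comment so the parsing can be tried offline.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(category="cs.AI", max_results=200):
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_recent(feed_xml, now, days=3):
    """Return feed entries published within the last `days` days."""
    cutoff = now - timedelta(days=days)
    items = []
    for entry in ET.fromstring(feed_xml).iter(f"{ATOM}entry"):
        stamp = entry.findtext(f"{ATOM}published").replace("Z", "+00:00")
        published = datetime.fromisoformat(stamp)
        if published >= cutoff:
            items.append({
                # Collapse the line breaks arXiv leaves inside titles.
                "title": " ".join(entry.findtext(f"{ATOM}title").split()),
                "summary": " ".join(entry.findtext(f"{ATOM}summary").split()),
                "url": entry.findtext(f"{ATOM}id"),
                "published": published,
            })
    return items

# Fetching would look something like:
#   feed_xml = urllib.request.urlopen(build_query_url()).read()
#   items = parse_recent(feed_xml, datetime.now(timezone.utc))
```

Any other source that can produce the same title, summary, url, and published fields could then slot into the rest of the pipeline unchanged.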

Since we have a fixed audience, we hard-coded our preferences into the system. But if you were making this for a diverse set of listeners, you might allow them to adjust their preferred tone, style, and level of expertise, select different synthetic voices, or even choose the podcast language.
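As a sketch of what adjustable preferences might look like, the settings could live in a small data class that feeds the script-writing prompt. Everything here is hypothetical: the field names, the defaults, and the prompt wording are assumptions, and the voice names are only example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListenerPreferences:
    """Per-listener settings; defaults are illustrative placeholders."""
    tone: str = "punchy"
    expertise: str = "a general audience"
    language: str = "English"
    voices: tuple = ("alloy", "onyx")  # one synthetic voice per host


def script_instructions(prefs):
    """Turn a listener's preferences into prompt instructions for the script."""
    return (
        f"Write a conversational two-host podcast script in {prefs.language}. "
        f"Keep the tone {prefs.tone}. "
        f"Pitch the explanations at {prefs.expertise}."
    )
```

For example, script_instructions(ListenerPreferences(language="Spanish")) would ask for a Spanish-language script while leaving the other defaults alone.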

It’s surprisingly cheap to generate a podcast, at about 15 cents per episode. This matters because it means newsrooms can experiment with similar tools without breaking the bank.

What we learned

  1. Think in processes. Chunking up the journalistic process helped us identify the steps AI can accelerate, and where human judgment is essential. We kept ourselves in the selection loop because that’s where editorial judgment matters most.

  2. Risk tolerance varies. Since this tool is just for us, not for publication, we can stomach some uncertainty in the output. The stakes are relatively low: If our synthetic hosts misinterpret a paper, we’re not going to publish an error-ridden article because of it—at worst, we’ll sound a little silly if we bring it up in conversation.

  3. Just do it! Building this wasn’t particularly complex or expensive. That’s kind of the point: It’s a blueprint for how newsrooms can experiment with AI tools quickly and cheaply. Start small, solve a specific problem, and learn by doing. (If you’re not comfortable with code, check out user-friendly AI-building tools like Mind Studio. These let you string together AI agents without having to write any code at all.)

  4. Did you just kill podcasting?? I don’t think so! First off, just listen to the podcast: It’s useful, but it’s not emotionally gratifying or narratively interesting. We could make it much better with some tinkering, but deep storytelling remains squarely in the human domain. For example, our synthetic hosts couldn’t pull off a dynamic one-on-one interview, adjusting to the tenor and direction of the conversation on the fly. There’s still an enormous amount of safe space for thoughtful, deep audio reporting.


Want to hear another episode? Drop us a line and we’ll send you tomorrow’s.

Note: This post is adapted from a workshop Kaveh delivered at an October Hacks/Hackers event in Berkeley.


Tags
tools

Date
December 4, 2024