All Articles

Create a podcast with LLMs

Photo by Matt Botsford on Unsplash

Overview

LLMs made available by companies such as OpenAI, Anthropic, Google, and others can be useful for learning new things. I enjoy listening to podcasts and audiobooks, but there are some topics that may not have a lot of coverage. To bridge this gap, I set out to create an automated workflow where an LLM can generate the content for podcasts and a follow-up pipeline to convert this to audio files and ultimately a combined mp3.

Tools Used

To build this out, a few important tools are worth going over:

  1. Anthropic Claude: Specifically Claude Sonnet via the API, although there are many options for different models, both through Anthropic and other LLM providers.
  2. LangChain and LangGraph: Useful for building the agent workflows and interacting with the Anthropic API.
  3. ElevenLabs: Used to create mp3 files from text with a natural sounding and customizable voice.
  4. FFmpeg and pydub: Used when working with and combining mp3 files.

LLM Graph Setup

Rather than making a single LLM call to generate a podcast transcript in one shot, dividing the work into multiple subtasks can lead to a better result. This way, each agent has a more narrowly defined job that can be customized with its own system prompt and workflow.
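To make the idea concrete, here is a minimal, framework-free sketch of the kind of node-to-node routing that LangGraph formalizes. The node names, state keys, and hard-coded subtopics are illustrative placeholders, not the actual project code:

```python
# Minimal sketch of graph-style routing: each node takes the shared state,
# mutates it, and returns the name of the next node; a loop dispatches
# until a node signals the end. All names here are illustrative.

def subtopic_node(state):
    state["subtopics"] = ["Basics", "Control Flow"]  # stand-in for an LLM call
    return "router"

def router_node(state):
    if state["index"] < len(state["subtopics"]):
        return "generator"
    return None  # all subtopics handled: end the run

def generator_node(state):
    topic = state["subtopics"][state["index"]]
    state["scripts"].append(f"Script for: {topic}")  # stand-in for an LLM call
    state["index"] += 1
    return "router"

NODES = {"subtopics": subtopic_node, "router": router_node, "generator": generator_node}

def run(state, start="subtopics"):
    node = start
    while node is not None:
        node = NODES[node](state)
    return state

result = run({"index": 0, "scripts": []})
print(result["scripts"])  # one generated script per subtopic
```

LangGraph adds typed state, persistence, and structured routing on top of this pattern, but the control flow is essentially the same loop.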

LangGraph Setup

LangGraph Flow

Given a user request, the data flows through the following agents:

Subtopic Agent

For a given topic and optional reference document, this agent generates a list of subtopics. For example, to create a podcast for beginners learning Python, we might prompt something like this:

I'm looking for a podcast that would help me learn Python programming. Start with the basics as if someone who is new to programming.

This particular agent is given the user’s request along with a system prompt to help formulate the subtopics with specific guidelines such as making it friendly for an audio format. The response to the above was:

  • ‘Why Python? The Friendly Programming Language for Everyday Problems’
  • ‘Getting Started: Setting Up Your Python Environment and Writing Your First Program’
  • ‘Python Fundamentals: Variables, Data Types, and Basic Operations’
  • ‘Control Flow: Making Decisions and Repeating Actions in Python’
  • ‘Functions and Modules: Building Blocks for Organized Code’
  • ‘Working with Data: Lists, Dictionaries, and File Handling’
  • ‘Real-World Python: Solving Practical Problems Step by Step’
  • ‘Common Beginner Mistakes and How to Avoid Them’
  • ‘Next Steps: Projects and Resources to Continue Your Python Journey’

I won’t go through the code in detail for every agent in the graph (GitHub repo linked at the end), but here is the first one:

from typing import Literal

from langchain_core.messages import SystemMessage
from langgraph.types import Command


def subtopic_agent(state) -> Command[Literal['subtopic_router_agent']]:

    model = get_model()

    # Format the reference document section
    reference_doc = state.get('reference_document', '')
    if reference_doc:
        reference_section = f"Use the following reference document as a guide for content style, examples, and factual information:\n\n{reference_doc}"
    else:
        reference_section = "No reference document provided."

    agent_system_prompt = SystemMessage(TOPIC_GENERATING_SYSTEM_PROMPT.format(
        podcast_topic=state['topic'],
        reference_document_section=reference_section
    ))

    # Constrain the model to return the SubtopicOutput schema
    model = model.with_structured_output(SubtopicOutput)

    # Invoke with custom system prompt
    messages = [
        agent_system_prompt,
        *state['messages']
    ]
    output = model.invoke(messages)

    # Update state with the structured output and route to the next node
    return Command(
        goto='subtopic_router_agent',
        update={
            'subtopics': output.subtopic_list
        }
    )

This agent is one node in the larger graph: it populates the system prompt with the topic and reference document, then invokes the model with the user’s message appended. The structured output is simply a list of strings:

from pydantic import BaseModel, Field

class SubtopicOutput(BaseModel):
    """Subtopic output"""
    subtopic_list: list[str] = Field(description='A list of subtopics')

Finally, once the subtopics are finalized the graph routes to the next node for topic generation routing.

Subtopic Router Agent

This piece of the graph loops through each subtopic and calls a generator agent. It’s not quite that simple, though: to maintain context across generations (e.g. the ‘Python Fundamentals’ section might need to know what was said in the ‘Getting Started’ section), the router keeps track of what was previously covered in the form of a list of summaries and provides that as well. This agent is essentially an orchestrator: it tracks progress and all of the input components, then sends tasks off to the generator agents.
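The context-carrying behavior can be sketched in plain Python. The generate and summarize functions below are illustrative stand-ins for LLM calls, not the project’s actual code:

```python
# Sketch of how the router threads context between generations: each
# generated section is summarized, and the running list of summaries is
# handed to the next generation. Both helpers are stand-ins for LLM calls.

def generate_section(subtopic, prior_summaries):
    context = "; ".join(prior_summaries) if prior_summaries else "none"
    # A real implementation would include this context in the LLM prompt.
    return f"[{subtopic}] (previously covered: {context})"

def summarize(script):
    # Stand-in for an LLM summarization call.
    return script.split("]")[0] + "]"

def route_subtopics(subtopics):
    summaries, scripts = [], []
    for subtopic in subtopics:
        script = generate_section(subtopic, summaries)
        scripts.append(script)
        summaries.append(summarize(script))
    return scripts

scripts = route_subtopics(["Getting Started", "Python Fundamentals"])
```

The key design point is that each generation sees summaries, not full prior scripts, which keeps the prompt size bounded as the episode grows.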

Subtopic Generator Agent

This agent is called from the router and is responsible for generating the text for a single subtopic. It’s given other context, such as past summaries and the reference document, but is able to focus on generating just one specific component of the podcast.

Filewriter Agent

Last in the graph, this node doesn’t make any LLM calls, but it is defined in the overall graph workflow for ease of use. It writes out a txt file for each of the generated subtopic scripts.
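The filewriter step amounts to a few lines of standard library code. The naming scheme and directory layout below are illustrative assumptions, not necessarily what the repo uses:

```python
import tempfile
from pathlib import Path

# Sketch of the filewriter step: one .txt file per generated subtopic script.
# The file naming scheme here is an assumption for illustration.

def write_scripts(scripts, out_dir):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, script in enumerate(scripts, 1):
        path = out_dir / f"subtopic_{i:02d}.txt"
        path.write_text(script, encoding="utf-8")
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as tmp:
    paths = write_scripts(["Intro script", "Basics script"], tmp)
    print([p.name for p in paths])  # ['subtopic_01.txt', 'subtopic_02.txt']
```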

Audio Conversion and Combination

This piece is a separate process, which allows for review of the txt files before converting them to audio. Given a txt file, ElevenLabs is used to generate an mp3 audio file. There are some file operations before and after, but the bulk of the code is this logic:

from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs(api_key=api_key)

audio = elevenlabs.text_to_speech.convert(
    text=text,
    voice_id="bIHbv24MWmeRgasZH58o",
    model_id="eleven_turbo_v2_5", # or for better quality model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
    seed=42
)

The API is straightforward: pass in the text, the voice you want to use (specified by ID), the model, and the output format. ElevenLabs returns audio chunks that can be written to a file and played as an mp3.
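Writing those chunks to disk looks roughly like this. The chunk iterable is faked here so the snippet runs without an API key:

```python
import os
import tempfile

# ElevenLabs returns the audio as an iterable of byte chunks; saving the
# mp3 is just streaming those chunks into a file. `fake_chunks` stands in
# for the API's real stream so this runs offline.

def save_audio(chunks, path):
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip any empty keep-alive chunks
                f.write(chunk)

fake_chunks = iter([b"ID3", b"\x00\x00"])  # stand-in for the API's chunk stream
out_path = os.path.join(tempfile.gettempdir(), "episode_part.mp3")
save_audio(fake_chunks, out_path)
```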

At this point, after looping through the txt files, we have one mp3 per subtopic. Using pydub, these separate mp3 files can be combined into one large file representing the podcast episode. At a high level, the code to combine a list of mp3 paths looks like this:

from pydub import AudioSegment

# Load the first audio file
combined = AudioSegment.from_mp3(input_paths[0])
print(f"Loaded: {input_paths[0]}")

# Append each subsequent audio file
for file_path in input_paths[1:]:
    audio = AudioSegment.from_mp3(file_path)
    combined += audio
    print(f"Added: {file_path}")

combined.export(output_path, format="mp3")

Wrapping up in CLI

The above process is wrapped up in a command line interface. More information on setup is included in the repo readme file, but a simple example might look something like this:

python main.py generate "Python Tips" "Create a podcast about Python best practices with 3 subtopics"
python main.py convert
python main.py combine
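A minimal argparse skeleton for these three subcommands might look like the following. The argument names are assumptions for illustration, and the handlers for the real generate/convert/combine logic are left out:

```python
import argparse

# Sketch of a three-subcommand CLI matching the workflow above; the real
# repo's argument names may differ.

def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="generate podcast scripts with the LLM graph")
    gen.add_argument("title")
    gen.add_argument("prompt")

    sub.add_parser("convert", help="convert txt scripts to mp3 via ElevenLabs")
    sub.add_parser("combine", help="combine per-subtopic mp3s into one episode")
    return parser

args = build_parser().parse_args(["generate", "Python Tips", "Create a podcast..."])
print(args.command, args.title)
```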

Summary

Using graph frameworks such as LangGraph allows us to break something complex (generating a podcast from a simple prompt) into something more reliable and repeatable using subagents.

Caveat: use these tools for good, such as learning or entertainment; don’t spam out a bunch of generated podcast episodes!

All examples and files are available on GitHub.