From PDF to long audiobook: A journey through text extraction, AI optimization, and speech synthesis
I started this journey with several different services, each seemingly fancier and easier to use than the last. I thought they'd be the magic bullet for my text-to-audio needs. In reality, I ended up with Google Cloud services for every step. Why? Much higher limits, instant availability, surprisingly low costs (yes, low!), and a generally faster path to a working result. It turns out that sometimes the not-so-flashy choice is the one that gets the job done, especially when working with large documents and producing long audio outputs.
In recent years, the process of converting documents into spoken words has evolved dramatically, thanks to innovations in AI and machine learning. Let’s dive into the latest advancements and challenges that have emerged in text extraction, language models, and speech synthesis, highlighting how different technologies compare to each other.
Challenge 1: Limited output context size in LLMs
One of the significant challenges with large language models (LLMs) is their limited output context size: most LLMs can only generate a certain amount of text in a single response before hitting their limit. Gemini 1.5 Pro is among the few with an extended output window, supporting up to 8,192 output tokens.
This larger context size allows for more cohesive and detailed responses, which is particularly useful for generating lengthy or complex content, such as entire chapters of books, detailed technical guides, or even legal documents where continuity and detail are crucial. For example, in legal document processing, maintaining the full context without splitting it into smaller sections ensures the legal language and meaning remain intact. Similarly, in script writing, having a larger output context means the model can generate more extensive scenes or dialogues without losing coherence.
In comparison, many other LLMs, like GPT-3.5, are restricted to approximately 4,000 tokens. This means that while other models may require users to split content into multiple parts, Gemini 1.5 Pro can handle longer narratives more seamlessly, reducing the need for human intervention and allowing for a more fluid content generation process.
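A minimal sketch of requesting the full output window with the Vertex AI Python SDK follows; the project ID, location, and prompt are placeholders, not values from this project:
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Assumes a GCP project with the Vertex AI API enabled; replace the placeholders
vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Rewrite the following chapter as smooth narration: ...",
    # Request the model's full output budget
    generation_config=GenerationConfig(max_output_tokens=8192),
)
print(response.text)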
Challenge 2: Reliable service for getting text out of PDFs
Initially, I tried using Python libraries like PyPDF2 to extract text from PDFs, but the output quality for non-English text was pretty low. I didn't want to spend time fine-tuning these tools; I simply wanted high-quality output as soon as possible. This led me to the Google Cloud Vision API, a powerful tool for extracting text from complex visual documents.
The Cloud Vision API offers advanced document text detection, making it ideal for processing PDFs, scanned documents, and other image-based formats. With it, you can efficiently extract textual content from images and feed it into further processing and transformation steps.
The Vision API is especially useful when dealing with mixed-content documents, such as reports that include both textual information and images. It ensures that text extraction is accurate and comprehensive, allowing downstream processes like text optimization and audio synthesis to work with high-quality content.
Code example:
from google.cloud import vision

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """Start async document text detection on a PDF stored in GCS."""
    client = vision.ImageAnnotatorClient()
    # Set up the feature for document text detection (OCR)
    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
    # Configure the input source (PDF in GCS)
    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(gcs_source=gcs_source, mime_type='application/pdf')
    # Configure the output destination in GCS (one JSON result per page)
    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(gcs_destination=gcs_destination, batch_size=1)
    # Create the async request
    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config, output_config=output_config)
    # Send the async batch request
    operation = client.async_batch_annotate_files(requests=[async_request])
    print('Waiting for the operation to finish.')
    # Wait for the operation to complete (timeout after 7 minutes)
    operation.result(timeout=420)
    print('The output is written to GCS with prefix: {}'.format(gcs_destination_uri))
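The async call writes JSON result files to the destination bucket rather than returning text directly. A small helper for collecting the extracted text could look like this; the function name and the bucket/prefix arguments are illustrative, not part of the Vision API:
import json
from google.cloud import storage

def read_vision_output(bucket_name, prefix):
    """Concatenate the fullTextAnnotation text from each per-page JSON result."""
    client = storage.Client()
    full_text = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        response = json.loads(blob.download_as_bytes())
        for page in response.get('responses', []):
            full_text.append(page.get('fullTextAnnotation', {}).get('text', ''))
    return '\n'.join(full_text)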
Challenge 3: Input context limitations
Input context size is another crucial factor for effective document processing and conversion. Gemini 1.5 Pro leads the industry with an unprecedented 2 million tokens for input context, setting it far apart from competitors.
The ability to handle such a vast amount of input context is transformative for applications involving large documents, such as research papers or entire books. For example, analyzing a complex, multi-section technical manual becomes significantly easier with Gemini 1.5 Pro, as it can process the entire manual in one go without needing to split it into smaller sections. This ensures that all cross-references and dependencies between sections are preserved, leading to more accurate analysis and summaries.
Most other models offer far smaller input windows: GPT-4 launched with 8,000 to 32,000 tokens, GPT-4 Turbo extends that to 128,000, and Claude 3 reaches 200,000, all an order of magnitude below Gemini 1.5 Pro's 2 million. This makes Gemini 1.5 Pro an ideal choice for applications requiring comprehensive document analysis without splitting the data into chunks; it minimizes loss of context and improves accuracy in processing and summarization.
Code example:
# A sketch using the Vertex AI SDK; project ID and location are placeholders
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

# large_input_text holds the full extracted document, e.g. an entire book
large_input_text = "Chapter 1: Introduction..."  # [millions of words follow]

# The whole document fits into a single prompt, no chunking required
response = model.generate_content(
    "Summarize the following document:\n" + large_input_text)
print(response.text)
This example shows how Gemini 1.5 Pro can process very large documents, making it suitable for scenarios like book analysis or research paper summaries.
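Before submitting a huge document, it is worth checking that it actually fits; the SDK's count_tokens call reports the token footprint. This snippet continues the model and large_input_text from the example above:
# Count tokens before submitting the full document to the model
token_info = model.count_tokens(large_input_text)
print(f"Document uses {token_info.total_tokens} of the 2,000,000-token input window")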
Challenge 4: Long synthesis in Text to Speech (TTS)
Another major challenge is the ability of Text-to-Speech (TTS) systems to synthesize long stretches of speech. Most TTS services on the market can only generate audio for roughly 10 to 15 minutes of text in one request. Google Cloud Text-to-Speech, however, provides a dedicated Long Audio API capable of much longer synthesis sessions, addressing a key limitation faced by many users.
The ability to produce longer speech synthesis without the need for breaking content into smaller segments makes Vertex TTS highly efficient for applications like audiobook generation and lengthy podcasts. For example, creating an entire audiobook in one seamless synthesis session significantly reduces the manual effort required to combine multiple audio files, resulting in a more cohesive listening experience. Similarly, for podcast creators who want to narrate long scripts or multi-part stories, Vertex TTS enables the generation of smooth, uninterrupted audio, avoiding the disruptions that typically occur with shorter synthesis limits.
Competing services often face limitations where longer texts need to be split, leading to potential disruptions in the audio flow or added effort to manually stitch together multiple outputs. Vertex’s capabilities streamline the process, providing a smoother and more natural experience for listeners, without the interruptions or unnatural pauses that come from multiple segmented outputs.
Code example:
from google.cloud import texttospeech

async def step3_text_to_speech(input_file, output_gcs_uri):
    # The Long Audio API writes its result to Cloud Storage, so the output is
    # a GCS URI (e.g. 'gs://your-bucket/output_audio.wav'), not a local file
    client = texttospeech.TextToSpeechLongAudioSynthesizeAsyncClient()
    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()
    input_config = texttospeech.SynthesisInput(ssml=text)  # file contains SSML
    voice = texttospeech.VoiceSelectionParams(
        language_code="cs-CZ", name="cs-CZ-Wavenet-A")
    # Long audio synthesis currently supports only LINEAR16 encoding
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16)
    request = texttospeech.SynthesizeLongAudioRequest(
        parent="projects/your-project-id/locations/us-central1",  # placeholders
        input=input_config, voice=voice, audio_config=audio_config,
        output_gcs_uri=output_gcs_uri)
    operation = await client.synthesize_long_audio(request=request)
    await operation.result(timeout=1800)  # long documents can take a while
This code demonstrates how to use the Long Audio API to synthesize an entire document in a single request; the resulting LINEAR16 audio is written to the Cloud Storage location you specify, ready to be downloaded as one continuous file.
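Invoking the coroutine then takes a single call; the file name and bucket below are placeholders:
import asyncio

asyncio.run(step3_text_to_speech(
    'optimized_text.ssml', 'gs://your-bucket/audiobook.wav'))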
Conclusion
The advancements brought by Gemini 1.5 Pro, the Cloud Vision API, and long audio synthesis in Google Cloud Text-to-Speech push the boundaries of what's achievable in AI-driven document processing and speech synthesis. Extended context sizes, both for input and output, enable more coherent and less fragmented content handling. The Vision API ensures accurate text extraction, especially from complex documents, providing a reliable starting point for downstream tasks. Meanwhile, long synthesis capabilities open new possibilities for creating uninterrupted audio experiences, improving user satisfaction and reducing manual intervention.
Key benefits:
- Gemini 1.5 Pro:
  - Extended output context (up to 8,192 tokens) for more cohesive and detailed responses.
  - Unprecedented input context (2 million tokens), ideal for analyzing large documents without splitting.
- Cloud Vision API:
  - Advanced text extraction, particularly for complex or mixed-content documents.
  - Reliable and accurate OCR, reducing the need for manual fine-tuning.
- Cloud Text-to-Speech (Long Audio API):
  - Long synthesis capabilities for seamless audiobook and podcast creation.
  - Less manual effort, avoiding the disruptions of segmented outputs.
These technologies represent significant steps forward, addressing core challenges that have long hindered the seamless transformation of text into spoken words.