reelikklemind

PDF-Translation



Easily extract text from PDFs while preserving formatting, translate the content accurately, and convert it back into a well-structured, professional PDF 📄


GitHub Link: https://github.com/edisedis777/PDF-Translation

Features

This program provides the following functionalities:

  • Extracts text from PDFs while attempting to preserve formatting
  • Detects and formats headers
  • Preserves bullet points
  • Maintains paragraph breaks
  • Splits the content into files of 100 pages each
  • Adds page separators between pages
  • Includes error handling and progress feedback
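The extraction flow listed above could look roughly like the sketch below. It is not the repository's actual script: the separator style, the `part_001.md` file naming, and the helper names `chunk_pages`, `join_pages`, and `extract_pdf` are assumptions made for illustration. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls.

```python
from pathlib import Path

PAGES_PER_FILE = 100  # chunk size stated in the feature list
PAGE_SEPARATOR = "\n\n" + "-" * 40 + "\n\n"  # assumed separator style


def chunk_pages(page_texts, size=PAGES_PER_FILE):
    """Group per-page text into lists of at most `size` pages."""
    return [page_texts[i:i + size] for i in range(0, len(page_texts), size)]


def join_pages(pages):
    """Join page texts with a visible separator between pages."""
    return PAGE_SEPARATOR.join(pages)


def extract_pdf(pdf_path, output_dir):
    """Extract text page by page and write one file per 100 pages."""
    import pdfplumber  # imported lazily so the helpers above work without it

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    try:
        with pdfplumber.open(pdf_path) as pdf:
            texts = []
            for number, page in enumerate(pdf.pages, start=1):
                # extract_text() can return None for pages without text
                texts.append(page.extract_text() or "")
                print(f"Extracted page {number}/{len(pdf.pages)}")
    except Exception as exc:  # broad on purpose: report, then re-raise
        print(f"Failed to read {pdf_path}: {exc}")
        raise

    written = []
    for index, chunk in enumerate(chunk_pages(texts), start=1):
        target = out / f"part_{index:03d}.md"  # hypothetical naming scheme
        target.write_text(join_pages(chunk), encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_pdf("input.pdf", "output")` on a 250-page document would, under these assumptions, write three files holding 100, 100, and 50 pages.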

PDF Extraction

Requirements

To use this program, install the required library:
sh
pip install pdfplumber

Before running, set the pdf_path and output_dir variables in the script to your input PDF and your output directory.

Key Improvements

  • Original indentation is now preserved using leading spaces
  • Empty lines are kept exactly as they appear in the source
  • Headers are detected more accurately based on positioning and style
  • List formatting is preserved while maintaining original indentation
  • Page breaks are more clearly marked with separator lines
  • Added a parameter to improve column detection
  • Enhanced exception handling for better debugging
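To illustrate how position and style can drive header, list, and indentation handling, here is a minimal sketch. It assumes each extracted line comes with its left edge (`x0`, in PDF points) and font size, which pdfplumber can supply through `extract_words(extra_attrs=["size"])`. The 1.2× font-size threshold, the 6-point character width, and the function names are illustrative assumptions, not the script's actual rules.

```python
import re

# Bullet characters recognised at the start of a line (assumed set).
BULLET_RE = re.compile(r"^[•\-\*]\s+")


def indent_for(x0, left_margin, char_width=6.0):
    """Convert a horizontal offset in PDF points into leading spaces."""
    return " " * max(0, round((x0 - left_margin) / char_width))


def format_line(text, x0, size, body_size, left_margin):
    """Render one extracted line as markdown-like text."""
    stripped = text.strip()
    if not stripped:
        return ""  # keep empty lines so paragraph breaks survive
    if BULLET_RE.match(stripped):
        body = BULLET_RE.sub("", stripped, count=1)
        # Normalise the bullet but keep the original indentation
        return f"{indent_for(x0, left_margin)}- {body}"
    # Assumed rule: clearly larger font plus short text means a header
    if size >= body_size * 1.2 and len(stripped) <= 80:
        return f"## {stripped}"
    return f"{indent_for(x0, left_margin)}{stripped}"
```

For example, a 16 pt line on a page whose body text is 11 pt is emitted as a `##` header, while a bullet starting 12 points right of the margin keeps two leading spaces.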

Translation

Requirements

Install the required translation library:
sh
pip install deep-translator

Key Features

  • Preserves all markdown formatting, including:
    • Headers
    • Code blocks
    • Inline code
    • Links
    • Line breaks
  • Handles large files by splitting them into chunks
  • Includes a retry mechanism with exponential backoff
  • Preserves original file names with an "_english" suffix
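One common way to keep markdown intact through translation is to swap protected spans for placeholders, translate, and then restore them. The sketch below shows that idea together with chunk splitting and the "_english" file name. The placeholder format `@@N@@`, the 4500-character limit, and all function names are assumptions. It covers only code blocks, inline code, and link targets, and a translation service may still alter placeholders.

```python
import re
from pathlib import Path

# Spans that should pass through translation untouched.
PROTECTED_RE = re.compile(
    r"```.*?```"             # fenced code blocks
    r"|`[^`\n]+`"            # inline code
    r"|(?<=\])\([^)\s]+\)",  # link targets: the (url) after [text]
    re.DOTALL,
)


def protect(text):
    """Swap protected spans for placeholders; return text and the spans."""
    spans = []

    def stash(match):
        spans.append(match.group(0))
        return f"@@{len(spans) - 1}@@"

    return PROTECTED_RE.sub(stash, text), spans


def restore(text, spans):
    """Put the original spans back in place of their placeholders."""
    return re.sub(r"@@(\d+)@@", lambda m: spans[int(m.group(1))], text)


def split_chunks(text, limit=4500):
    """Split on line boundaries so chunks stay within `limit` characters.

    A single line longer than `limit` is kept whole in this sketch.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def english_path(path):
    """Map report.md to report_english.md in the same directory."""
    p = Path(path)
    return p.with_name(f"{p.stem}_english{p.suffix}")
```

Running `protect` before `split_chunks` keeps a long code block from being cut in half, since each block has already shrunk to a short placeholder.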

Note: The script uses the free deep_translator library, which is more stable and does not require async handling.
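As a rough illustration of a synchronous deep_translator call wrapped in retry with exponential backoff, consider the sketch below. `GoogleTranslator(source=..., target=...).translate(...)` is the library's real interface. The helper names, the four attempts, and the one-second base delay are assumptions, not values taken from the script.

```python
import time


def with_retry(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def translate_chunk(text, target="en"):
    """Translate one chunk with deep_translator, retrying on errors."""
    from deep_translator import GoogleTranslator  # lazy import

    translator = GoogleTranslator(source="auto", target=target)
    # Plain blocking call: no event loop or await needed
    return with_retry(lambda: translator.translate(text))
```

The `sleep` parameter exists only so the backoff schedule can be checked without actually waiting.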


Markdown to PDF Conversion

Requirements

To convert translated markdown text into a professional-looking PDF, install:
sh
pip install markdown weasyprint PyPDF2
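With these packages installed, a conversion step might look like the following sketch. It uses `markdown.markdown` and WeasyPrint's `HTML(...).write_pdf(...)`, both real calls. The function names, the choice of the `tables` and `fenced_code` extensions, and the HTML wrapper are assumptions. How the script uses PyPDF2 is not stated here, so it is left out.

```python
from pathlib import Path


def wrap_html(body, lang="en", title="Document"):
    """Wrap an HTML fragment in a minimal document.

    The lang attribute matters: WeasyPrint needs it to hyphenate text.
    """
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{lang}">\n'
        f'<head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body>\n{body}\n</body>\n</html>\n"
    )


def markdown_to_pdf(md_path, pdf_path, css_text=""):
    """Render a markdown file to PDF by way of HTML."""
    import markdown  # lazy imports: wrap_html works without them
    from weasyprint import CSS, HTML

    text = Path(md_path).read_text(encoding="utf-8")
    body = markdown.markdown(text, extensions=["tables", "fenced_code"])
    stylesheets = [CSS(string=css_text)] if css_text else None
    HTML(string=wrap_html(body)).write_pdf(pdf_path, stylesheets=stylesheets)
```

A call such as `markdown_to_pdf("report_english.md", "report_english.pdf")` would then produce an unstyled PDF, and passing `css_text` applies a custom stylesheet.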

WeasyPrint Dependencies

Depending on your operating system, additional dependencies may be required:

  • macOS:
    sh
    brew install cairo pango
  • Ubuntu/Debian:
    sh
    sudo apt-get install build-essential python3-dev python3-pip python3-setuptools python3-wheel python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info
  • Windows:
    Follow WeasyPrint's installation guide for Windows.

Key Enhancements in PDF Formatting

Document Structure

  • Professional margins and page size
  • Page numbers centered at the bottom
  • Proper spacing between elements
  • Automatic page breaks before new sections
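In WeasyPrint, page-level layout like this is typically expressed with a CSS `@page` rule. The sketch below builds such a stylesheet as a Python string. The A4 size, 2.5 cm margin, and the choice to break before every `h1` are assumed values, not the project's actual settings.

```python
def page_css(size="A4", margin="2.5cm"):
    """Return CSS for page size, margins, page numbers, and section breaks."""
    return f"""
@page {{
    size: {size};
    margin: {margin};
    @bottom-center {{
        content: counter(page);  /* page number, centered at the bottom */
        font-size: 9pt;
    }}
}}
/* Start each top-level section on a new page, except the first */
h1 {{ break-before: page; }}
h1:first-of-type {{ break-before: auto; }}
/* Even vertical spacing between block elements */
p, ul, ol, table, pre {{ margin: 0 0 0.8em 0; }}
"""
```

The returned string can be handed to WeasyPrint through `CSS(string=page_css())`.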

Typography

  • Body text: Times New Roman (optimized for readability)
  • Headings: Arial (better contrast)
  • Optimized font sizes and line heights
  • Justified text alignment with hyphenation

Headings

  • Clear visual hierarchy
  • Proper spacing before and after
  • No page breaks immediately after headers

Lists

  • Proper indentation
  • Different bullet styles for nested levels
  • Consistent spacing
  • Improved line height for readability

Tables

  • Full-width design
  • Subtle borders
  • Proper cell padding
  • No page breaks within tables

Code Blocks

  • Monospace font
  • Light background
  • Rounded corners
  • Proper padding
  • Overflow handling
  • No page breaks within blocks

Additional Features

  • Styled blockquotes
  • Proper link colors
  • Responsive images
  • Horizontal rules

Credits

  • pdfplumber library
  • deep-translator library
  • WeasyPrint library
  • PyPDF2 library

License

  • Distributed under the GNU Affero General Public License v3.0. See LICENSE for more information.
