PDF-Translation
Easily extract text from PDFs while preserving formatting, translate the content accurately, and convert it back into a well-structured, professional PDF 📄
GitHub Link: https://github.com/edisedis777/PDF-Translation
Features
This program provides the following functionalities:
- Extracts text from PDFs while attempting to preserve formatting
- Detects and formats headers
- Preserves bullet points
- Maintains paragraph breaks
- Splits the content into files of 100 pages each
- Adds page separators between pages
- Includes error handling and progress feedback
PDF Extraction
Requirements
To use this program, install the required library:
sh
pip install pdfplumber
Modify the pdf_path
and output_dir
variables in the script to match your needs.
Key Improvements
- Original indentation is now preserved using leading spaces
- Empty lines are kept exactly as they appear in the source
- Headers are detected more accurately based on positioning and style
- List formatting is preserved while maintaining original indentation
- Page breaks are more clearly marked with separator lines
- Added parameter to better handle column detection
- Improved header detection
- Better handling of list items and indentation
- Enhanced exception handling for better debugging
Translation
Requirements
Install the required translation library:
sh
pip install deep-translator
Key Features
- Preserves all markdown formatting, including:
- Headers
- Code blocks
- Inline code
- Links
- Line breaks
- Handles large files by splitting them into chunks
- Includes a retry mechanism with exponential backoff
- Preserves original file names with an "_english" suffix
Note: The script uses the free deep_translator
library, which is more stable and does not require async handling.
Markdown to PDF Conversion
Requirements
To convert translated markdown text into a professional-looking PDF, install:
sh
pip install markdown weasyprint PyPDF2
WeasyPrint Dependencies
Depending on your operating system, additional dependencies may be required:
- macOS:
sh
brew install cairo pango - Ubuntu/Debian:
sh
sudo apt-get install build-essential python3-dev python3-pip python3-setuptools python3-wheel python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info - Windows:
Follow WeasyPrint's installation guide for Windows.
Key Enhancements in PDF Formatting
Document Structure
- Professional margins and page size
- Page numbers centered at the bottom
- Proper spacing between elements
- Automatic page breaks before new sections
Typography
- Body text: Times New Roman (optimized for readability)
- Headings: Arial (better contrast)
- Optimized font sizes and line heights
- Justified text alignment with hyphenation
Headings
- Clear visual hierarchy
- Proper spacing before and after
- No page breaks immediately after headers
Lists
- Proper indentation
- Different bullet styles for nested levels
- Consistent spacing
- Improved line height for readability
Tables
- Full-width design
- Subtle borders
- Proper cell padding
- No page breaks within tables
Code Blocks
- Monospace font
- Light background
- Rounded corners
- Proper padding
- Overflow handling
- No page breaks within blocks
Additional Features
- Styled blockquotes
- Proper link colors
- Responsive images
- Horizontal rules
Credits
- pdfplumber Library
- deep-translator library
- weasyprint PyPDF2 library
License
- Distributed under the GNU Affero General Public License v3.0 License. See
LICENSE
for more information.
<div align="right">
Back To Top ⬆️
</div>