Text Summarizer
A text summarization tool powered by BART (Bidirectional and Auto-Regressive Transformer) from Facebook, enhanced with advanced features such as batch processing, quantization, and rich console output.
Features
- Advanced Summarization: Uses BART-large-CNN model for high-quality summaries
- Text Preprocessing: Cleans and normalizes input text (see the sketch after this list)
- Long Text Handling: Splits and summarizes lengthy documents
- Batch Processing: Summarize multiple texts efficiently with async support
- Quantization: Optional model quantization for faster inference
- Rich Output: Beautiful console formatting using Rich library
- Configurable: Flexible parameters via configuration class
- Error Handling: Robust validation and logging
- Memory Management: Context manager support and cache clearing
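As a rough illustration of the Text Preprocessing step, a minimal cleaning helper might simply normalize whitespace, as below. This is a generic sketch, not the project's actual cleaning logic:
import re

def clean_text(text: str) -> str:
    # Collapse runs of whitespace (including newlines) and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  An   example\nwith  messy   spacing.  "))
# -> "An example with messy spacing."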
Installation
- Clone the repository:
git clone https://github.com/yourusername/advanced-text-summarizer.git
cd advanced-text-summarizer
- Install dependencies:
pip install -r requirements.txt
Requirements
- Python 3.8+
- torch
- transformers
- rich
- See requirements.txt for the full list
Usage
Command Line
Summarize a single text:
python summarizer.py --text "Your long text here that needs summarization"
Summarize multiple texts:
python summarizer.py --batch "Text 1" "Text 2" "Text 3"
Python API
- The AdvancedTextSummarizer class provides a programmatic interface for text summarization, allowing you to integrate it into your own Python scripts or applications.
- Below are examples of how to use it.
Basic Usage
Summarize a single text with default settings:
from summarizer import AdvancedTextSummarizer
# Initialize the summarizer
summarizer = AdvancedTextSummarizer()
# Text to summarize
text = """
The development of artificial intelligence (AI) has significantly impacted various industries worldwide.
From healthcare to finance, AI-powered applications have streamlined operations, improved accuracy,
and unlocked new possibilities.
"""
# Generate and print summary
summary_data = summarizer.summarize(text)
summarizer.print_summary(summary_data)
# Access the summary directly
print("Summary:", summary_data["summary"])
Custom Configuration
Use SummarizerConfig to customize the summarizer:
from summarizer import AdvancedTextSummarizer, SummarizerConfig
# Custom configuration
config = SummarizerConfig(
    model_name="facebook/bart-large-cnn",
    quantize=True,            # Enable quantization for speed
    max_length=100,           # Maximum summary length
    min_length=30,            # Minimum summary length
    repetition_penalty=2.0    # Stronger repetition penalty
)
# Use context manager for resource management
with AdvancedTextSummarizer(config) as summarizer:
    text = "Your long text here..."
    summary_data = summarizer.summarize(text)
    summarizer.print_summary(summary_data)
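Setting quantize=True enables the optional quantization feature for faster inference. As a rough sketch of what dynamic quantization does in PyTorch (a generic illustration, not necessarily this project's internal implementation):
import torch
from transformers import BartForConditionalGeneration

# Load the same BART checkpoint the summarizer uses
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Dynamic quantization stores Linear-layer weights as int8, which usually
# speeds up CPU inference with a small accuracy trade-off
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)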
Batch Processing
Summarize multiple texts asynchronously:
import asyncio
from summarizer import AdvancedTextSummarizer
async def main():
    summarizer = AdvancedTextSummarizer()
    texts = [
        "AI is revolutionizing healthcare with better diagnostics.",
        "Self-driving cars use machine learning to navigate.",
    ]
    # Summarize multiple texts
    summaries = await summarizer.summarize_batch_async(texts)
    for text, summary in zip(texts, summaries):
        print(f"Original: {text}")
        print(f"Summary: {summary}\n")
# Run the async function
asyncio.run(main())
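For larger workloads, the batch_size keyword documented under Key Methods below (and the batch_size field of SummarizerConfig) can be raised. A hedged variant of the example above, assuming SummarizerConfig's other fields keep their defaults and using an illustrative list of texts:
import asyncio
from summarizer import AdvancedTextSummarizer, SummarizerConfig

async def summarize_many(texts):
    # Context manager (see Notes) plus a larger batch size for bigger jobs
    with AdvancedTextSummarizer(SummarizerConfig(batch_size=4)) as summarizer:
        return await summarizer.summarize_batch_async(texts, batch_size=4)

texts = ["First article ...", "Second article ...", "Third article ...", "Fourth article ..."]
summaries = asyncio.run(summarize_many(texts))
print(summaries)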
Key Methods
- summarize(text, max_length=150, min_length=50, ...): Summarizes a single text. Returns a dictionary with original_text, cleaned_text, and summary.
- summarize_batch_async(texts, batch_size=2, ...): Asynchronously summarizes multiple texts. Returns a list of summaries.
- print_summary(summary_data): Displays a formatted summary using the Rich library.
- clear_cache(): Clears the internal cache to free memory.
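The per-call length overrides from the summarize() signature above can be combined with clear_cache() when working through many documents; the document list below is illustrative:
from summarizer import AdvancedTextSummarizer

summarizer = AdvancedTextSummarizer()
documents = ["First long document ...", "Second long document ..."]

for doc in documents:
    # Override the defaults shown in summarize(text, max_length=150, min_length=50, ...)
    summary_data = summarizer.summarize(doc, max_length=100, min_length=30)
    print(summary_data["summary"])

# Free memory held by the internal cache
summarizer.clear_cache()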
Notes
- Ensure dependencies are installed first: pip install -r requirements.txt
- Requires Python 3.8+ for asyncio support
- Use the context manager (with statement) for proper resource cleanup
Sample Text from Wikipedia
"Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Q-learning at its simplest stores data in tables. This approach becomes infeasible as the number of states/actions increases (e.g., if the state space or action space were continuous), as the probability of the agent visiting a particular state and performing a particular action diminishes.
Reinforcement learning differs from supervised learning in not needing labelled input-output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) with the goal of maximizing the cumulative reward (the feedback of which might be incomplete or delayed).[1] The search for this balance is known as the exploration–exploitation dilemma.
The environment is typically stated in the form of a Markov decision process (MDP), as many reinforcement learning algorithms use dynamic programming techniques.[2] The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process, and they target large MDPs where exact methods become infeasible.[3]"
Example Output
╭────────────────────── Text Summary ──────────────────────╮
│ │
│ Original Text Your input text goes here... │
│ Summary [bold green]A concise summary...[/] │
│ │
╰──────────── Generated with BART ─────────────────────────╯
Configuration Options
| Parameter | Description | Default |
| --- | --- | --- |
| model_name | Pre-trained model to use | "facebook/bart-large-cnn" |
| quantize | Enable model quantization | False |
| max_length | Maximum summary length | 150 |
| min_length | Minimum summary length | 50 |
| length_penalty | Penalty for longer summaries | 1.0 |
| repetition_penalty | Penalty for repeated tokens | 1.5 |
| num_beams | Number of beams for search | 4 |
| batch_size | Batch size for processing | 2 |
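Assuming SummarizerConfig accepts every parameter in the table as a keyword argument, a fully explicit configuration (using the default values from the table) would look like this:
from summarizer import AdvancedTextSummarizer, SummarizerConfig

config = SummarizerConfig(
    model_name="facebook/bart-large-cnn",  # pre-trained model to use
    quantize=False,                        # enable model quantization
    max_length=150,                        # maximum summary length
    min_length=50,                         # minimum summary length
    length_penalty=1.0,                    # penalty for longer summaries
    repetition_penalty=1.5,                # penalty for repeated tokens
    num_beams=4,                           # number of beams for search
    batch_size=2,                          # batch size for processing
)
summarizer = AdvancedTextSummarizer(config)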
Development
Running Tests
python summarizer.py # Uncomment run_tests() in main
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -am 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Create a Pull Request
Acknowledgments
- Built with Transformers by Hugging Face
- Enhanced display with Rich
- Sample text from Wikipedia
- Inspired by Text Summarization with DistillBart Model
License
- Distributed under the GNU Affero General Public License v3.0. See LICENSE for more information.