Choose Your Experience
Select the option that best describes your use case
For Developers
Technical interface with API details, tokenization output, and advanced matching parameters. Suited to researchers, developers, and other technical users who want to understand the underlying mechanics.
- ✓ View tokenization process
- ✓ See technical match details
- ✓ Document IDs and positions
- ✓ API response format
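To make "tokenization", "document IDs", and "positions" concrete, here is a minimal sketch of token-level exact matching. It is an illustration only, not this tool's implementation: the tokenizer, the Match structure, the find_matches function, and the sample corpus are all hypothetical, and a real index over a corpus of this size would not scan documents linearly.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    doc_id: str    # hypothetical document identifier
    position: int  # index of the first matching token within the document


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (illustrative only)."""
    return re.findall(r"\w+", text.lower())


def find_matches(query: str, corpus: dict[str, str]) -> list[Match]:
    """Return every place where the query's token sequence occurs in the corpus."""
    needle = tokenize(query)
    matches: list[Match] = []
    if not needle:
        return matches
    for doc_id, text in corpus.items():
        haystack = tokenize(text)
        # Slide a window of len(needle) tokens across the document.
        for i in range(len(haystack) - len(needle) + 1):
            if haystack[i:i + len(needle)] == needle:
                matches.append(Match(doc_id, i))
    return matches


# Made-up sample documents, used only to exercise the sketch.
corpus = {
    "doc-001": "It was the best of times, it was the worst of times.",
    "doc-002": "Call me Ishmael.",
}
print(find_matches("it was the", corpus))
```

Each result pairs a document ID with the token offset where the match begins, which is the kind of detail the developer view surfaces.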
For Authors & Writers
Streamlined interface designed for authors and content creators who want to check whether their work appears in The Pile training dataset. No technical jargon, just clear results.
- ✓ Simple, clean interface
- ✓ Longer text support
- ✓ Clear match highlights
- ✓ Easy-to-understand results
About The Pile
The Pile is an 825 GiB English text corpus created by EleutherAI for training large language models. It contains text from books, articles, websites, and other sources collected before 2020. Many popular AI models have been trained on this dataset or on similar collections.
This tool helps you determine whether specific text appears in The Pile, which can be useful for assessing potential training data overlap, investigating copyright concerns, or supporting research.