Pile Detector On-Line Demo

Check for Training Data Matches in The Pile.
Need ideas? Try Project Gutenberg or a random Wikipedia page.

Conceptually, we use an on-disk index of hashes of 10-grams (10 contiguous tokens). We only index hashes that appear in fewer than 4,096 documents. All other hashes are marked as "high frequency" and considered matched by default.
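The indexing idea above can be sketched in a few lines. This is only an illustration under assumptions: tokens are plain strings, the hash is an 8-byte BLAKE2b digest, and the index lives in memory rather than on disk. The names ngram_hashes and build_index are hypothetical and not part of the Pile Detector code.

```python
import hashlib
from collections import defaultdict

N = 10            # n-gram length, as stated in the text
MAX_DOCS = 4096   # hashes seen in this many documents or more are "high frequency"

def ngram_hashes(tokens, n=N):
    """Yield (start_position, hash) for each run of n contiguous tokens."""
    for i in range(len(tokens) - n + 1):
        # Join with a separator that is unlikely to occur inside a token.
        gram = "\x1f".join(tokens[i:i + n]).encode("utf-8")
        yield i, hashlib.blake2b(gram, digest_size=8).hexdigest()

def build_index(docs, n=N, max_docs=MAX_DOCS):
    """Map each n-gram hash to its document numbers.

    Hashes found in max_docs documents or more are not indexed; they are
    returned separately as the high-frequency set.
    """
    postings = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for _, h in ngram_hashes(tokens, n):
            postings[h].add(doc_id)
    index = {h: sorted(ids) for h, ids in postings.items() if len(ids) < max_docs}
    high_freq = {h for h, ids in postings.items() if len(ids) >= max_docs}
    return index, high_freq
```

At query time, under the same assumptions, each 10-gram hash of the input text would be looked up first in the high-frequency set and then in the index.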

Our API returns:

the positions of the matched tokens (shown in light blue),
the positions of high frequency n-grams (shown in brown),
the document number of the matched documents, and,
the sections of The Pile where they appear.
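
One possible way to model this response on the client side is shown below. The class and field names are assumptions chosen for illustration, not the actual API schema. The helper to_spans collapses token positions into contiguous runs, which is convenient for highlighting.

```python
from dataclasses import dataclass

@dataclass
class MatchResponse:
    """Hypothetical container for the four items the API returns."""
    matched_positions: list[int]     # token positions that matched
    high_freq_positions: list[int]   # token positions inside high-frequency n-grams
    doc_ids: list[int]               # document numbers of the matched documents
    sections: list[str]              # sections of The Pile where they appear

def to_spans(positions):
    """Collapse token positions into inclusive (start, end) runs."""
    spans = []
    for p in sorted(set(positions)):
        if spans and p == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], p)   # extend the current run
        else:
            spans.append((p, p))            # start a new run
    return spans
```

For example, to_spans([3, 4, 5, 9, 10]) gives [(3, 5), (9, 10)], that is, two runs to highlight.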

For PileDetector.com, we report a full match when a sequence of 30 or more tokens has at most 20% unmatched tokens. (As mentioned, tokens inside high-frequency n-grams are considered matched by default.)
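
The rule can be read as a check over every run of at least 30 tokens. The sketch below is one interpretation, not the site's implementation: it assumes a list of booleans with one entry per token, where tokens in high-frequency n-grams have already been set to True. The function name has_full_match and its parameters are hypothetical.

```python
def has_full_match(matched, min_len=30, max_unmatched_pct=20):
    """Return True if some run of at least min_len tokens has at most
    max_unmatched_pct percent unmatched tokens."""
    n = len(matched)
    # prefix[i] = number of unmatched tokens among the first i tokens
    prefix = [0]
    for m in matched:
        prefix.append(prefix[-1] + (0 if m else 1))
    # Try every run [start, end) of length >= min_len. Longer runs must be
    # checked too: a run can pass even when none of its 30-token windows does.
    for start in range(n - min_len + 1):
        for end in range(start + min_len, n + 1):
            unmatched = prefix[end] - prefix[start]
            # Integer comparison avoids floating-point rounding at the boundary.
            if unmatched * 100 <= max_unmatched_pct * (end - start):
                return True
    return False
```

The double loop is quadratic in the number of tokens, which is acceptable for short inputs; a production system would likely use something faster.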

Pile Detector © 2023-2025 Textualization Software Ltd.