Check for Training Data Matches in The Pile.
Need ideas? Try a passage from Project Gutenberg or a random Wikipedia page.

Conceptually, we use an on-disk index of hashes of 10-grams (sequences of 10 contiguous tokens). We only index hashes that appear in fewer than 4,096 documents. All other hashes are marked as "high frequency" and considered matched by default.
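
To make the indexing idea concrete, here is a minimal in-memory sketch in Python. It is not the PileDetector.com implementation: the hash function (BLAKE2b truncated to 64 bits), the in-memory dictionaries standing in for the on-disk index, and the names ngram_hashes, build_index and lookup are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

N = 10           # n-gram length in tokens (from the description above)
MAX_DOCS = 4096  # hashes seen in this many documents or more are "high frequency"


def ngram_hashes(tokens, n=N):
    """Yield (start_position, hash) for every window of n contiguous tokens."""
    for i in range(len(tokens) - n + 1):
        window = ",".join(str(t) for t in tokens[i:i + n]).encode("utf-8")
        # Assumption: any stable 64-bit hash would do; BLAKE2b is used here.
        digest = hashlib.blake2b(window, digest_size=8).digest()
        yield i, int.from_bytes(digest, "big")


def build_index(docs, n=N, max_docs=MAX_DOCS):
    """Return (index, high_frequency) for a list of tokenized documents.

    index maps a hash to the sorted document numbers containing it;
    high_frequency is the set of hashes that are too common to index.
    """
    postings = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for _, h in ngram_hashes(tokens, n):
            postings[h].add(doc_id)

    index, high_frequency = {}, set()
    for h, doc_ids in postings.items():
        if len(doc_ids) < max_docs:
            index[h] = sorted(doc_ids)
        else:
            high_frequency.add(h)
    return index, high_frequency


def lookup(tokens, index, high_frequency, n=N):
    """Flag each query token as matched and/or high frequency, and collect
    the document numbers of all indexed n-grams that were hit."""
    matched = [False] * len(tokens)
    high = [False] * len(tokens)
    doc_ids = set()
    for i, h in ngram_hashes(tokens, n):
        if h in high_frequency:
            for j in range(i, i + n):
                high[j] = True
        elif h in index:
            for j in range(i, i + n):
                matched[j] = True
            doc_ids.update(index[h])
    return matched, high, sorted(doc_ids)
```

A real index over The Pile would not fit in a Python dictionary; the sketch only shows how the 10-gram windows, the document-count threshold, and the "high frequency" set relate to one another.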

Our API returns:

  • the positions of the matched tokens (shown in light blue),
  • the positions of high-frequency n-grams (shown in brown),
  • the document numbers of the matched documents, and
  • the sections of The Pile where they appear.

For PileDetector.com, we report a full match when a sequence of 30 or more tokens contains at most 20% unmatched tokens. (As mentioned, tokens in high-frequency n-grams are considered matched by default.)
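
The full-match rule can be sketched as a search over per-token flags. The Python below is an illustration under stated assumptions, not the site's actual code: it takes a list of booleans (True where a token is matched or lies in a high-frequency n-gram) and returns the longest span of at least 30 tokens in which at most 20% are unmatched. The function name longest_full_match, the choice to return the longest span, and the brute-force search are all hypothetical.

```python
from itertools import accumulate

MIN_SPAN = 30           # minimum span length in tokens
MAX_UNMATCHED_PCT = 20  # at most this percentage of the span may be unmatched


def longest_full_match(covered, min_span=MIN_SPAN,
                       max_unmatched_pct=MAX_UNMATCHED_PCT):
    """Return (start, end) of the longest qualifying span, or None.

    covered[i] is True when token i is matched or belongs to a
    high-frequency n-gram. The span is half-open: tokens start..end-1.
    """
    # prefix[k] = number of unmatched tokens among the first k tokens
    prefix = [0] + list(accumulate(0 if c else 1 for c in covered))
    n = len(covered)
    best = None
    for start in range(n):
        for end in range(start + min_span, n + 1):
            length = end - start
            unmatched = prefix[end] - prefix[start]
            # Integer arithmetic avoids floating-point edge cases at exactly 20%.
            within_limit = unmatched * 100 <= max_unmatched_pct * length
            if within_limit and (best is None or length > best[1] - best[0]):
                best = (start, end)
    return best
```

For example, 24 covered tokens followed by 6 unmatched ones form a 30-token span with exactly 20% unmatched, so the span qualifies; with 7 unmatched tokens out of 30 it does not. The quadratic search is fine for short pasted passages but would need a smarter scan for long inputs.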
