So just words? Not phrases, sentences, or paragraphs, only unique words, regardless of written language?

According to Google, they have a collection of over 13 million unigrams (words). Assuming 5.1 characters per word and one byte per character, that comes to about 66 MB (more like 80 MB with metadata). I believe it's limited to English, though. Assuming around 5000 written languages, each with a word list of similar size, perhaps 323 GB (391 GB with metadata) is a good upper-limit figure. This assumes zero compression; compression would be interesting on such a unique dataset. There's also the matter of delimiters, and potentially of the character sets used (metadata).
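For anyone who wants to rerun the arithmetic with different assumptions, here is a rough Python sketch. The function names and the `delimiter_bytes` parameter are my own, not anything from Google's dataset. The one-byte-per-character and one-delimiter-per-word choices are assumptions too; multi-byte encodings would push the totals higher. The GB figures above appear to come from dividing megabytes by 1024, so the sketch shows that conversion next to a plain decimal one.

```python
def estimate_bytes(words, avg_chars, bytes_per_char=1, delimiter_bytes=0):
    """Uncompressed size of a word list, in bytes.

    bytes_per_char and delimiter_bytes are illustrative assumptions:
    1 byte per character fits ASCII text, and 1 delimiter byte would
    stand for a newline or space between words.
    """
    return words * (avg_chars * bytes_per_char + delimiter_bytes)


def mb_to_gb_1024(megabytes):
    """Convert MB to GB by dividing by 1024 (the conversion the post seems to use)."""
    return megabytes / 1024


WORDS = 13_000_000   # unigram count quoted in the post
AVG_CHARS = 5.1      # average word length assumed in the post
LANGUAGES = 5000     # rough number of written languages assumed in the post

# One language, no delimiters: roughly 66 MB.
one_lang_mb = estimate_bytes(WORDS, AVG_CHARS) / 1e6

# One language with one delimiter byte per word: close to the 80 MB figure.
one_lang_delim_mb = estimate_bytes(WORDS, AVG_CHARS, delimiter_bytes=1) / 1e6

# All languages, assuming each has a word list as large as the English one.
all_langs_gb_decimal = one_lang_mb * LANGUAGES / 1000
all_langs_gb_1024 = mb_to_gb_1024(one_lang_mb * LANGUAGES)
all_langs_meta_gb_1024 = mb_to_gb_1024(80 * LANGUAGES)

print(f"one language:            {one_lang_mb:.1f} MB")
print(f"with delimiters:         {one_lang_delim_mb:.1f} MB")
print(f"all languages (decimal): {all_langs_gb_decimal:.1f} GB")
print(f"all languages (/1024):   {all_langs_gb_1024:.1f} GB")
print(f"with metadata (/1024):   {all_langs_meta_gb_1024:.1f} GB")
```

Changing `bytes_per_char` to 2 or 3 gives a feel for how much a non-Latin script could add, which is why the totals above are only a ballpark.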