Originally posted by FireXtol:
So just words? Not phrases, sentences, paragraphs.... Only unique words regardless of written language?

According to Google, they have a nice collection of over 13 million unigrams (words). Assuming 5.1 characters per word, that is about 66 megabytes (more like 80 MB with metadata). I believe it's limited to English, though. Assuming around 5000 written languages, each with a vocabulary of similar size, perhaps 323 GB (391 GB with metadata) is a good upper-limit figure. This assumes zero compression. Compression would be interesting on such a unique dataset. There's also the matter of delimiters, and potentially the character sets used (metadata).
I tend to agree. You could likely store every unique word ever written in all of human history in half a terabyte. Further advances in compression could shrink that somewhat, but I am not sure there is much payoff left in that. Growth in mass storage capacity and communication speeds has outpaced that development, the same way that the Internet has all but crushed the compact disc and the floppy disk.
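
For anyone who wants to check the arithmetic in the quoted estimate, here is a rough Python sketch. The exact unigram count (13,588,391) is an assumption on my part, since the quote only says "over 13 million"; it happens to reproduce the 66 MB and 323 GB figures when sizes are counted in binary units (1 MB = 1024 * 1024 bytes). It also assumes one byte per character and that every language has a vocabulary the size of the English collection, which is almost certainly an overestimate.

```python
# Back-of-envelope check of the storage figures in the quoted post.
# ASSUMPTION: the exact unigram count is illustrative; the post only
# says "over 13 million".
UNIGRAMS = 13_588_391
AVG_CHARS = 5.1          # characters per word (from the post), 1 byte each assumed
LANGUAGES = 5000         # written languages (from the post)
MIB_WITH_METADATA = 80   # per-language size including metadata (from the post)


def estimate_storage(unigrams=UNIGRAMS, avg_chars=AVG_CHARS,
                     languages=LANGUAGES, mib_with_metadata=MIB_WITH_METADATA):
    """Return (MiB per language, raw total GiB, total GiB with metadata)."""
    per_language_mib = unigrams * avg_chars / 1024**2
    raw_total_gib = per_language_mib * languages / 1024
    metadata_total_gib = mib_with_metadata * languages / 1024
    return per_language_mib, raw_total_gib, metadata_total_gib


per_lang, raw_total, meta_total = estimate_storage()
print(f"{per_lang:.1f} MiB per language")
print(f"{raw_total:.0f} GiB raw, {meta_total:.0f} GiB with metadata")
```

Both totals come in under 512 GiB, which is where the "half a terabyte" figure comes from. Delimiters and multi-byte character sets would push the raw number up; compression would pull it back down.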