-
Mass Storage Question
Here's an interesting one--how many bytes of storage are required on a hard drive to store every single word ever written by humans in any language that humans have ever used for communication?
Can you provide a reasonably close estimate? Please advise. :ehh:
-
Re: Mass Storage Question
Just one byte, a really really big bite! :D
-
Re: Mass Storage Question
Quote:
Originally Posted by
RobDog888
Just one byte, a really really big bite! :D
Just one big byte?
Gigabyte, Terabyte, Petabyte, or a larger byte? ;)
-
Re: Mass Storage Question
Quote:
Originally Posted by
Code Doc
Here's an interesting one--how many bytes of storage are required on a hard drive to store every single word ever written by humans in any language that humans have ever used for communication?
Can you provide a reasonably close estimate? Please advise. :ehh:
You aren't providing enough information. Are you including merely published works, or are you also including all my email, my grocery list in my pocket, etc?
Plus, there are over 4000 currently spoken languages, but not all of them even have a written form, and some share one. Mandarin and Cantonese, for example, use the same written text. Do we include dialects here? You get my drift. Some estimating must be done. I seem to remember someone once computing that an encyclopedia would fit on a floppy disk, so I am pretty sure a terabyte would be plenty big.
-
Re: Mass Storage Question
Quote:
Originally Posted by
Code Doc
Can you provide a reasonably close estimate?
No..
-
Re: Mass Storage Question
Do we count only published (printed) material or any write/scribble in general? Do we count copies (in case of printed material do we count one book just once or once for every copy printed)?
-
Re: Mass Storage Question
you're gonna need a hell load of space and time to fill it on a hard drive.
-
Re: Mass Storage Question
OK, I'll try to define the problem more accurately. Let's assume:
(1) All different words written by all humans in any language. Matched words are to be ignored. Common slang words are acceptable.
(2) Words composed of multiple words do not count. These must be separate words and not joined by hyphens or combined by inept text messagers or those text messagers trying to show off.
(3) Trivial concocted abbreviations and combinations are not to be included, such as URDum, PITA, and TIA.
Regardless of these restrictions, I have been told that the answer is many Petabytes, even with compression. I am having a hard time believing that.
Now what would it take to store all words?
-
Re: Mass Storage Question
What about when new words are created or evolve while you are writing to the hard drive? Will you have an ever-continuing process handling new words? This will be a never-ending task, so we cannot tell you how much space :p
:D
-
Re: Mass Storage Question
Quote:
Originally Posted by
Code Doc
(2) Words composed of multiple words do not count. These must be separate words and not joined by hyphens or combined by inept text messagers or those text messagers trying to show off.
Most languages contain many words formed from other words, or words with prefixes and suffixes.
What about languages which don't have words?
-
Re: Mass Storage Question
Quote:
Originally Posted by
penagate
What about languages which don't have words?
Which reminds me, my wife had this strange look the other day; afterwards she blamed me for not ..... Sounds like a language without words ;)
-
Re: Mass Storage Question
So just words? Not phrases, sentences, paragraphs.... Only unique words regardless of written language?
According to Google, they have a nice collection of over 13 million unigrams (words). Assuming 5.1 characters per word, that is about 66 megabytes (more like 80 MB with metadata). I believe it's limited to English, though. Assuming around 5000 written languages, perhaps 323 (391) GB is a good upper limit figure. This is using zero compression. Compression would be interesting on such a unique dataset. There's also the matter of delimiters, and potentially the character sets used (metadata).
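
For anyone who wants to redo the arithmetic, here is a rough sketch of the estimate. The unigram count, average word length and language count are the figures from this post; one byte per character is an extra assumption made only for this sketch (many scripts need more), and the function names are made up for illustration.

```python
# Back-of-envelope storage estimate using the figures quoted in the post.
UNIGRAMS = 13_000_000     # "over 13 million unigrams"
CHARS_PER_WORD = 5.1      # average word length assumed in the post
LANGUAGES = 5000          # assumed number of written languages
BYTES_PER_CHAR = 1        # assumption for this sketch: one byte per character


def per_language_bytes(unigrams=UNIGRAMS, chars=CHARS_PER_WORD,
                       bytes_per_char=BYTES_PER_CHAR):
    """Bytes needed for one language's word list, without delimiters."""
    return unigrams * chars * bytes_per_char


def all_languages_bytes(per_language, languages=LANGUAGES):
    """Scale one language's word list up to every written language."""
    return per_language * languages


english = per_language_bytes()
total = all_languages_bytes(english)

print(f"one language:  {english / 1e6:.1f} MB")
print(f"all languages: {total / 1e9:.1f} GB (decimal)")
print(f"all languages: {total / 2**30:.1f} GiB (binary)")
```

With decimal units this comes to roughly 66 MB per language and about 330 GB in total. The 323 and 391 figures look consistent with dividing by 1024 instead of 1000, but that is a guess at how they were rounded.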
-
Re: Mass Storage Question
Quote:
Originally Posted by
FireXtol
Compression would be interesting on such a unique dataset.
Since we're talking about unique datasets, I don't think compression will be that "interesting".
-
Re: Mass Storage Question
I suppose if the words were randomly unique, the gains from compression would be trivial. But language presents a lot of redundancy, and the character sets are limited; those are two good indicators that compression can be substantial. I'd imagine a 90% compression ratio would be possible, shrinking the upper limit to under 40 GB.
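
A small experiment along these lines, as a sketch only: the word list below is a made-up toy vocabulary (stems glued to suffixes, so some entries are not real words), chosen to mimic the shared prefixes and endings of a real dictionary. It uses Python's standard zlib module and says nothing about whether 90% is reachable on real data.

```python
import itertools
import zlib

# Hypothetical toy vocabulary: every stem combined with every suffix.
STEMS = ["walk", "talk", "jump", "play", "read",
         "write", "store", "count", "print", "type"]
SUFFIXES = ["", "s", "ed", "ing", "er", "ers", "able", "ably"]


def build_word_list(stems, suffixes):
    """Return a sorted, newline-delimited list of unique words as bytes."""
    words = sorted({stem + suffix
                    for stem, suffix in itertools.product(stems, suffixes)})
    return "\n".join(words).encode("ascii")


raw = build_word_list(STEMS, SUFFIXES)
packed = zlib.compress(raw, 9)  # level 9 asks zlib for its best compression

saving = 1 - len(packed) / len(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, "
      f"saving: {saving:.0%}")

# Round trip to confirm the compression is lossless.
restored = zlib.decompress(packed)
```

Every word in the list is unique, yet the sorted list still compresses, because neighbouring entries share most of their letters. Real vocabularies across thousands of scripts would behave differently, so treat the printed saving as an illustration, not a prediction.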
-
Re: Mass Storage Question
You should think about how many languages there are... Think about it: the English dictionary contains a bit over 700,000 words, so all languages together would be millions of words.
-
Re: Mass Storage Question
Quote:
Originally Posted by
Justa Lol
You should think about how many languages there are... Think about it: the English dictionary contains a bit over 700,000 words, so all languages together would be millions of words.
Hmmm. Point taken. Using Google's 13 and some odd million unigrams, times 5000 written languages is about 68 billion words. I'm not sure compression could reduce the number of bytes to lower than the word count. That'd be really impressive! Perhaps 70 GB is a more reasonable upper limit given these assumptions/figures.
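
To make the "bytes lower than the word count" point concrete, here is a quick sketch. Reading "13 and some odd million" as 13.6 million is an assumption chosen so the product matches the 68 billion quoted above; the candidate sizes are the figures floated in this thread.

```python
UNIGRAMS_PER_LANGUAGE = 13_600_000  # assumption: "13 and some odd million"
LANGUAGES = 5000                    # assumed number of written languages

total_words = UNIGRAMS_PER_LANGUAGE * LANGUAGES


def bytes_per_word(storage_bytes, words=total_words):
    """Average storage budget per word for a given total size."""
    return storage_bytes / words


candidates = [("40 GB", 40e9), ("70 GB", 70e9), ("323 GB", 323e9)]

print(f"total words: {total_words:,}")
for label, size in candidates:
    print(f"{label:>7}: {bytes_per_word(size):.2f} bytes per word")
```

Under these numbers, 40 GB leaves less than one byte per word, while 70 GB leaves just over one byte per word, which is why it reads as the more cautious upper limit.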
-
Re: Mass Storage Question
Average: 70,000 words per language, 80 languages currently, and a language updates every 300 years or so. A text file of a dictionary is about 3 MB, and humans have existed for 200,000 years.
But how many communities and types of humans were there on average in the past? Do a range.
Do Neanderthals count?
I guess a gigabyte is more than enough.
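
The range can be sketched from these figures. The low end stores one 3 MB dictionary per current language. The high end assumes, purely for illustration, that every 300-year update produces a completely new dictionary for all 200,000 years, which is a deliberately generous worst case and not something the post claims.

```python
DICTIONARY_BYTES = 3_000_000  # "text file of a dictionary is about 3 MB"
CURRENT_LANGUAGES = 80        # figure from the post
YEARS_OF_HUMANS = 200_000     # figure from the post
YEARS_PER_UPDATE = 300        # figure from the post


def current_snapshot_bytes():
    """Low end: one dictionary per language that exists today."""
    return DICTIONARY_BYTES * CURRENT_LANGUAGES


def worst_case_bytes():
    """High end: a brand new dictionary per language at every update."""
    generations = YEARS_OF_HUMANS / YEARS_PER_UPDATE
    return current_snapshot_bytes() * generations


low = current_snapshot_bytes()
high = worst_case_bytes()

print(f"low end:  {low / 1e6:.0f} MB")
print(f"high end: {high / 1e9:.0f} GB")
```

The low end is 240 MB, which does fit in a gigabyte. The high end is about 160 GB, so whether a gigabyte is enough depends on how much of the historical turnover is counted.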
-
Re: Mass Storage Question
Quote:
Originally Posted by
moti barski
Average: 70,000 words per language, 80 languages currently, and a language updates every 300 years or so. A text file of a dictionary is about 3 MB, and humans have existed for 200,000 years.
But how many communities and types of humans were there on average in the past? Do a range.
Do Neanderthals count?
I guess a gigabyte is more than enough.
I would say they don't, since they didn't have written language. Just pictograms.
-
Re: Mass Storage Question
I think there are more than 80 languages? I speak 8 languages myself, and being able to speak 10% of the languages currently in use seems a bit much :D
There are 192 or 196 countries in the world, depending on how you define a country (193 or 197 if you count the Faroe Islands as a country; they are not a part of Denmark, only a member of the Danish kingdom), and I bet over half of those have their own language.
-
Re: Mass Storage Question
Quote:
Originally Posted by
FireXtol
So just words? Not phrases, sentences, paragraphs.... Only unique words regardless of written language?
According to Google, they have a nice collection of over 13 million unigrams (words). Assuming 5.1 characters per word, that is about 66 megabytes (more like 80 MB with metadata). I believe it's limited to English, though. Assuming around 5000 written languages, perhaps 323 (391) GB is a good upper limit figure. This is using zero compression. Compression would be interesting on such a unique dataset. There's also the matter of delimiters, and potentially the character sets used (metadata).
I tend to agree. You could likely store all unique words that have ever been written in all of human history in half a terabyte. Further advances in compression could shrink that somewhat, but I am not sure there is any more payoff in that. Mass storage expansion and communication speeds have trumped that development, the same way that the Internet has all but crushed the compact disk and the floppy disk. :ehh:
-
Re: Mass Storage Question
Quote:
Originally Posted by
Code Doc
I tend to agree. You could likely store all unique words that have ever been written in all of human history in half a terabyte. Further advances in compression could shrink that somewhat, but I am not sure there is any more payoff in that. Mass storage expansion and communication speeds have trumped that development, the same way that the Internet has all but crushed the compact disk and the floppy disk. :ehh:
Compression on storage media seems to be passé. However, transmitted data will probably receive compression for years to come.
-
Re: Mass Storage Question
Quote:
Originally Posted by
BillGeek
Who'll do the re-typing?
The monkeys of course! Bonus points if they churn out a Shakespeare play.
-
Re: Mass Storage Question
Quote:
Originally Posted by
kregg
The monkeys of course! Bonus points if they churn out a Shakespeare play.
the monkeys are currently busy, they're working for youtube now.
-
Re: Mass Storage Question
Quote:
Originally Posted by
Justa Lol
the monkeys are currently busy, they're working for youtube now.
That explains a LOT of the videos on there :lol:
-
Re: Mass Storage Question
Quote:
Originally Posted by
NickThissen
That explains a LOT of the videos on there :lol:
i suppose so, but i meant the "a team of highly trained monkeys have been dispatched to deal with the situation" error message xD
-
Re: Mass Storage Question
Quote:
Originally Posted by
Justa Lol
i suppose so, but i meant the "a team of highly trained monkeys have been dispatched to deal with the situation" error message xD
I laugh so hard when I see that error message :lol: