How to compress a string of digits?

**Lior** · May 25th, 2001, 05:27 AM

Hello,

Anyone got an idea of how to compress a string of digits (0-9) ?

Let's say the string is:
"897459473597349857893475897234895798347598734985798347589734897589347985793487598734985798347589734 8
9578347598798347534975983479857947578974984735983749857983475987349857983475987349579384759837498573
9487598734975983475987349857934758973495793475783459873498579384759873497598734953497593475973459709
740243522332432432438208394234020394908324"

I know I can take every 2 digits and convert them into a char using the CHR function. In some cases I can convert even 3 digits into a char (in case the 3 digits gives me a number between 100-255).

This technique compresses the string into a half of it. (at least).

Does anyone have a better compression technique or algorithm for compressing such string?

Thank you very much.

TB · May 26th, 2001, 04:25 AM

I have a solution for your problem, but it won't with long strings.
You can see the string as a big number. So you can convert it to a binary number.
Because VB can't handle with such big numbers, you will have to write a function that converts the string into a binary string.

There is another solution, you can make in C++.
I got a class, called FLINT.
With that, you can handle with long, and I mean very long numbers.
I can send you that class and the manual for it.

**Lior** · May 26th, 2001, 12:51 PM

A little problem with your solution:

Even when I will convert it into a binary number, and then save each 8 digits of the binary string as a byte, I will not be able to reverse it back.
because if (for e.g.) I saved the byte 11001011, it can be reversed in many many different combinations:
1100+10+11
1100+1011
1100+101+1
1+100+10+1+1
and many many many more options.

How will I be able to reverse it to the EXACT combination ???

**Lemon Lime** · May 26th, 2001, 02:53 PM

it is possible as long as you use a header, but it might give you a larger filesize....

**HarryW** · May 26th, 2001, 03:01 PM

Ever heard of Huffman coding? Since you have a limited character set here, you might be able to use that to reduce the amount of info stored.

**Lemon Lime** · May 26th, 2001, 03:13 PM

Originally posted by HarryW
Ever heard of Huffman coding? Since you have a limited character set here, you might be able to use that to reduce the amount of info stored.

you're right, but Lior tries to use other things since two years ago...
he is quite obsessed, coz he wants to kick Huffman Coding's ass...

**Lior** · May 26th, 2001, 08:36 PM

Acctually, I have learned most of the standard formats used today.
This includes Huffman Encoding, LZW (ZIP), LZH, and the simple RLE.
I am tring to make a new algorithm for lossless data-compressing, which might compress with better ratio.
My key is to be able to compress ANY file.
This means I will be able to compress even compressed files!
You see, the compression will not be depended on some orders of chars, which will prevent it of re-compressing.
The more you use the algorithm on the new-compressed file, the better ratio you'll get.
Even if I compress files into 90% of their original size, it worths it! because you will be able to compress the new compressed file again, and now it became only 81% of the original size.
Go on this way, and congratulations! You're responsible for a new format in the world! ZIP will die...

TB · May 28th, 2001, 09:47 AM

No, there is only one combination.

If you have the number 10111010.
You convert it into a decimal number that way:

You begin with the last digit:
0, that means 0 * 2^0
1, 1 * 2^1 +
0, 0 * 2^2 +
1, 1 * 2^3 +
1, 1 * 2^4 +
1, 1 * 2^5 +
0, 0 * 2^6 +
1, 1 * 2^7
------------
After the addition of those numbers, you'll have the decimal number.
You will get problems withh a long binary number, to convert it back into a decimal, because VB cannot handle such big numbers.
You should write your program in C++.
If you want to, I can send you the classes, so that you can handle that number.

**Lior** · May 28th, 2001, 01:11 PM

Hi man, I think there was a little misunderstanding here and I didnt make myself clear enough.
I didnt say that the whole binary number can be converted into many decimal numbers. What I said is that if I saparate it into binary chunks (which some of them will have less than 8 binary digits), then there are many many combinations.
I think an example will make it clearer:

As you know there are 256 different ASCII codes (0-255).
And the number 11111111 in binary base is 255 in decimal base.
This means I can assign binary numbers to the whole ASCII chars table.
Now, Lets say that the char "A" is the most frequent char in the file, and therefore I assigned him the shortest binary number (0).
and the char "B" is the next frequent char appears in the file, and that's why he gets the second shortest binary number (1).
And then "C" is the next most frequent char, and is assigned with the next shortest binary number available (10).

Now, let's say I have a file containing this: "BBACAC"
Then I translate each byte (ASCII char) in the file according to my assigning table.
The translated string is: "11010010", right? (1+1+0+10+0+10).
I can save this in only 1 char which its ASCII value (in decimal) equals to the binary number 11010010.
When we check we see that 11010010 in binary is 210 in decimal.
so I save to the compressed file the ASCII char number 210.

When I want to reverse it back,I just take the char 210 and convert it back into a binary number. and gets back the stream "11010010".
Everything sounds just fine by now, right?
SO HERE COMES THE PROBLEM:
11010010 can be recognized in many many ways such as:

"1"+"1"+"0"+"10"+"0"+"10" (How it really should be recognized).
"110"+"100"+"10"
"1"+"10100"+"1"+"0"
"1"+"10"+1001"+"0"

Of course there are many more different combinations, so I dont know what will make to choose the exact correct combination.

This is what I meant when I said there are many combinations in the decoding process.

Any Idea guys?
Thanks in advance.

**Illuminator** · May 28th, 2001, 06:25 PM

I dont thin you quite understand huffman coding.

0 = A and 1 = B isnt valid because you just blocked out the rest of the character set

this is how it should be
0 = A
1 = leads to one of B C or D
then
11 = C
10 = lead to one of C or D
continuing
101 = B
100 = D

Then the string 10011110 can only be interpreted one way
100 = D
11 = C
11 = C
0 = A

so 10011110 can only equal DCCA

TB · May 29th, 2001, 08:51 AM

You should not see the string as an array of chars.
If the string's content are only numbers, the you can virtually convert the string into a number, a long number.

"234" = 234 -> 11101010
"34536346" = 34536346 -> 10000011101111101110011010

Then you cut the binary number into chunks of 8bit to save them as bytes.

The problem that would appear is, that you can't convert a string with, let's say 2000 numbers - that means, that the number will have 2000 digits - because VB can't handle with such big number.

**Lior** · May 30th, 2001, 03:00 PM

Illuminator, I didnt say my idea is based on the huffman coding, and I didnt try to explain the huffman coding at all.

TB, thanks for your criative idea, I will check it out.

Thanks to you two, for trying to help.

More ideas will be well-appriciated.

Thanks.

Thread: How to compress a string of digits?

Thread Tools

Display

How to compress a string of digits?

Posting Permissions