Results 1 to 12 of 12

Thread: How to compress a string of digits?

  1. #1

    Thread Starter
    Registered User Lior's Avatar
    Join Date
    Jan 2000
    Posts
    307

    How to compress a string of digits?

    Hello,

    Anyone got an idea of how to compress a string of digits (0-9) ?

    Let's say the string is:
    "897459473597349857893475897234895798347598734985798347589734897589347985793487598734985798347589734 8
    9578347598798347534975983479857947578974984735983749857983475987349857983475987349579384759837498573
    9487598734975983475987349857934758973495793475783459873498579384759873497598734953497593475973459709
    740243522332432432438208394234020394908324"

    I know I can take every 2 digits and convert them into a char using the CHR function. In some cases I can convert even 3 digits into a char (in case the 3 digits gives me a number between 100-255).

    This technique compresses the string into a half of it. (at least).

    Does anyone have a better compression technique or algorithm for compressing such string?

    Thank you very much.

  2. #2
    Lively Member TB's Avatar
    Join Date
    Feb 2001
    Location
    Austria
    Posts
    106
    I have a solution for your problem, but it won't with long strings.
    You can see the string as a big number. So you can convert it to a binary number.
    Because VB can't handle with such big numbers, you will have to write a function that converts the string into a binary string.

    There is another solution, you can make in C++.
    I got a class, called FLINT.
    With that, you can handle with long, and I mean very long numbers.
    I can send you that class and the manual for it.
    mojo

  3. #3

    Thread Starter
    Registered User Lior's Avatar
    Join Date
    Jan 2000
    Posts
    307
    A little problem with your solution:

    Even when I will convert it into a binary number, and then save each 8 digits of the binary string as a byte, I will not be able to reverse it back.
    because if (for e.g.) I saved the byte 11001011, it can be reversed in many many different combinations:
    1100+10+11
    1100+1011
    1100+101+1
    1+100+10+1+1
    and many many many more options.

    How will I be able to reverse it to the EXACT combination ???

  4. #4
    Addicted Member Lemon Lime's Avatar
    Join Date
    Jul 2000
    Location
    Space, the final frontier, go along the yellow brick road,turn left then left again. In the service window, ask for the insane dude.
    Posts
    167
    it is possible as long as you use a header, but it might give you a larger filesize....
    I've had enough with sainity!
    What's the use of it anyway?

  5. #5
    Frenzied Member HarryW's Avatar
    Join Date
    Jan 2000
    Location
    Heiho no michi
    Posts
    1,827
    Ever heard of Huffman coding? Since you have a limited character set here, you might be able to use that to reduce the amount of info stored.
    Harry.

    "From one thing, know ten thousand things."

  6. #6
    Addicted Member Lemon Lime's Avatar
    Join Date
    Jul 2000
    Location
    Space, the final frontier, go along the yellow brick road,turn left then left again. In the service window, ask for the insane dude.
    Posts
    167
    Originally posted by HarryW
    Ever heard of Huffman coding? Since you have a limited character set here, you might be able to use that to reduce the amount of info stored.
    you're right, but Lior tries to use other things since two years ago...
    he is quite obsessed, coz he wants to kick Huffman Coding's ass...
    I've had enough with sainity!
    What's the use of it anyway?

  7. #7

    Thread Starter
    Registered User Lior's Avatar
    Join Date
    Jan 2000
    Posts
    307
    Acctually, I have learned most of the standard formats used today.
    This includes Huffman Encoding, LZW (ZIP), LZH, and the simple RLE.
    I am tring to make a new algorithm for lossless data-compressing, which might compress with better ratio.
    My key is to be able to compress ANY file.
    This means I will be able to compress even compressed files!
    You see, the compression will not be depended on some orders of chars, which will prevent it of re-compressing.
    The more you use the algorithm on the new-compressed file, the better ratio you'll get.
    Even if I compress files into 90% of their original size, it worths it! because you will be able to compress the new compressed file again, and now it became only 81% of the original size.
    Go on this way, and congratulations! You're responsible for a new format in the world! ZIP will die...
    Last edited by Lior; May 26th, 2001 at 08:41 PM.

  8. #8
    Lively Member TB's Avatar
    Join Date
    Feb 2001
    Location
    Austria
    Posts
    106

    Lightbulb

    No, there is only one combination.

    If you have the number 10111010.
    You convert it into a decimal number that way:

    You begin with the last digit:
    0, that means 0 * 2^0
    1, 1 * 2^1 +
    0, 0 * 2^2 +
    1, 1 * 2^3 +
    1, 1 * 2^4 +
    1, 1 * 2^5 +
    0, 0 * 2^6 +
    1, 1 * 2^7
    ------------
    After the addition of those numbers, you'll have the decimal number.
    You will get problems withh a long binary number, to convert it back into a decimal, because VB cannot handle such big numbers.
    You should write your program in C++.
    If you want to, I can send you the classes, so that you can handle that number.
    mojo

  9. #9

    Thread Starter
    Registered User Lior's Avatar
    Join Date
    Jan 2000
    Posts
    307
    Hi man, I think there was a little misunderstanding here and I didnt make myself clear enough.
    I didnt say that the whole binary number can be converted into many decimal numbers. What I said is that if I saparate it into binary chunks (which some of them will have less than 8 binary digits), then there are many many combinations.
    I think an example will make it clearer:

    As you know there are 256 different ASCII codes (0-255).
    And the number 11111111 in binary base is 255 in decimal base.
    This means I can assign binary numbers to the whole ASCII chars table.
    Now, Lets say that the char "A" is the most frequent char in the file, and therefore I assigned him the shortest binary number (0).
    and the char "B" is the next frequent char appears in the file, and that's why he gets the second shortest binary number (1).
    And then "C" is the next most frequent char, and is assigned with the next shortest binary number available (10).

    Now, let's say I have a file containing this: "BBACAC"
    Then I translate each byte (ASCII char) in the file according to my assigning table.
    The translated string is: "11010010", right? (1+1+0+10+0+10).
    I can save this in only 1 char which its ASCII value (in decimal) equals to the binary number 11010010.
    When we check we see that 11010010 in binary is 210 in decimal.
    so I save to the compressed file the ASCII char number 210.

    When I want to reverse it back,I just take the char 210 and convert it back into a binary number. and gets back the stream "11010010".
    Everything sounds just fine by now, right?
    SO HERE COMES THE PROBLEM:
    11010010 can be recognized in many many ways such as:

    "1"+"1"+"0"+"10"+"0"+"10" (How it really should be recognized).
    "110"+"100"+"10"
    "1"+"10100"+"1"+"0"
    "1"+"10"+1001"+"0"

    Of course there are many more different combinations, so I dont know what will make to choose the exact correct combination.

    This is what I meant when I said there are many combinations in the decoding process.

    Any Idea guys?
    Thanks in advance.
    Last edited by Lior; May 28th, 2001 at 02:17 PM.

  10. #10
    Lively Member
    Join Date
    May 2000
    Posts
    84
    I dont thin you quite understand huffman coding.

    0 = A and 1 = B isnt valid because you just blocked out the rest of the character set

    this is how it should be
    0 = A
    1 = leads to one of B C or D
    then
    11 = C
    10 = lead to one of C or D
    continuing
    101 = B
    100 = D

    Then the string 10011110 can only be interpreted one way
    100 = D
    11 = C
    11 = C
    0 = A

    so 10011110 can only equal DCCA

  11. #11
    Lively Member TB's Avatar
    Join Date
    Feb 2001
    Location
    Austria
    Posts
    106
    You should not see the string as an array of chars.
    If the string's content are only numbers, the you can virtually convert the string into a number, a long number.

    "234" = 234 -> 11101010
    "34536346" = 34536346 -> 10000011101111101110011010

    Then you cut the binary number into chunks of 8bit to save them as bytes.

    The problem that would appear is, that you can't convert a string with, let's say 2000 numbers - that means, that the number will have 2000 digits - because VB can't handle with such big number.
    mojo

  12. #12

    Thread Starter
    Registered User Lior's Avatar
    Join Date
    Jan 2000
    Posts
    307
    Illuminator, I didnt say my idea is based on the huffman coding, and I didnt try to explain the huffman coding at all.

    TB, thanks for your criative idea, I will check it out.

    Thanks to you two, for trying to help.

    More ideas will be well-appriciated.

    Thanks.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width