dcsimg
Page 2 of 2 FirstFirst 12
Results 41 to 50 of 50

Thread: [RESOLVED] Can someone fix my sorting algorithm up?

  1. #41

    Thread Starter
    Member
    Join Date
    Apr 2009
    Posts
    55

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    what a life don't worry about this too much do whatever you want thanks for trying to help me out. you should definitely get some sleep if i was moving I would stay of computers for a while just to get to know the new area.. just go on when i'm bored.. this isn't really much of a priority for this. enjoy your new job so much new things to embrace in idk how you even got time for this, haha idk what to write.. snowing that much be tough and pleasant at the same time to drive in haha.

    I myself didn't test the results from my comparisons and yours to see which is right I would dump the first 100 digits from both arrays and space them out in a notepad to see where the difference happens but I didn't do it too lazy I just assumed that if there is 400 then 401 after it its just 1 digit difference top one being a 0 and the bottom one being a 1. I don't even think identical lines are possible in rotations maybe with some luck collisions could happen but why are the collisions soo close in row index's they should be completely random numbers since the rotations only change 1 byte a time then its just the last digit where zero's slowly go down and 1 come in

  2. #42
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    It depends on the data of course, A simple case is alternating 1's and 0's . If you rotate that, every other rotation is going to match. Pretty much any symmetrical pattern will match at some point, and possibly a number of times.

    Looking at possible ways to make the search more efficient, I added a little bit of code during the filling of the rotations array, to also count runs of zeros or ones (starting with the first byte of the rotation, how many of the following sequential bytes are the same. The one with the longest run of zeros would be the first item in the sorted array, and the one with the longest run of ones would be the last. If you group these first, then you can skip all the leading zeros and ones and just start sorting after that point, on matching runs.

    It takes almost no time to count those as you go along (filling the rotations array backwards again, so that you can store the count as you go along). That showed me that in your set of 214,548 bytes, only 820 of them were 1s, and also that none of those bytes with 1 in them were contiguous, i.e. there was never more than one byte with 1 in it in a row.

    I don't know if the data you're dealing with, that would always be the case or not, but if it was, then a specialized sort could take advantage of that to greatly reduce the number of compares necessary to sort the data. One way to look at it would be that you have 820 different (or so) runs of zeros counts that you want to sort. For the first iteration you would sort all the runs that start in the first byte, by the size of the run. You would have essentially, fairly quickly divided the big list into 820+ groups to sort. Theoretically that would mean on average each of those 820 groups would have 260 items to be sorted to have the whole thing sorted.

    Of course, in reality the groups won't be the same size. You would have (in this set of data) a small number of groups with large (as large as 1700+ 0s in the run) in them to be sorted, and a large number of groups with a lot of small runs in them. Of the 820 groups, the smallest number of rows in the group that would need to be sorted would be 1, and the largest number of rows in one of those 820 groups, would be 820 itself. That is by definition since you only have 820 1s in the array. Because of the rotation, each of those 820 1s will end up in the second byte of the rotation series once, which means you have 820 rotations that have a run of one 0 in the first byte. Of course, you could also have a group with 820 rotations with two 0s at that front, the same thing for three. In fact, whatever the smallest gap between 1s in the series should determine the number of groups that would have 820 "rows" with that number of 0s, or less at the start to sort.

    By having each byte in the series having the number of zeros that follow it associated with it, you can sort based on that number, then skip over that number to the next zero following the 1 byte location to see how many zeros are in that series and sort that group based on that number. Skipping all those bytes would reduce the group being sorted and the number of comparisons needed greatly. Another statistic, I don't know that I mentioned before, is the current code ends up calling the recursive group sort sub over 8 million times with your current data set. With sparsely scattered ones in the data as you have, I assume the calls to sort groups would be reduced to only a couple of hundred thousand times.

  3. #43
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    I was already investigated the sorting problem when I wrote the above post, so I decided to continue investigating since we knew it failed very early (relatively) in the sort. Essentially using a conditional break, I let it run until it got the to sorting the group of the first 10 elements (the last of the 0 group sorts, where the recursion depth was in the low 1700s.

    Without too much elaboration, the problem was cause by the minor optimization of updating the two pointers after doing a swap. This is fine as long as there are at least two items left to be compared in the group. When there is only one item left in the group to compare, it exited the loop because there was no purpose in comparing the item with itself (the two pointers would be pointing at the remaining item). But, the last item should be compared to the item in front of it and possibly behind it, since it may need to be swapped with one of those neighbors.

    In the SortSegment Sub, after the items were swapped were these two lines:
    Code:
                upPtr += 1 'we can update the pointers since we
                dnPtr -= 1 'don't need to test the swapped rows again
    If we had three items left before the swap, then we would have only one item left after and we need to test that item with its neighbors, so we add a guard around those two lines:
    Code:
              If dnPtr - upPtr > 3 Then
                upPtr += 1 'we can update the pointers since we
                dnPtr -= 1 'don't need to test the swapped rows again
              End If
    As part of my earlier verification, I added a routine to compare two entries (I believe I mentioned that before). So I added a button to loop through the whole array and compare every entry with its next neighbor to verify they all were in order. This quick sanity check (it is very quick) not only proved the items were now in order, but also that there were no duplicates. To be a proper test, it probably should allow for duplicates, but this was just a quick sanity check, not a complete test.

    In case I didn't post the test function before, I'll post it here, and the added button click handler code that does the sanity check. Your post had some higher index values showing as well, and my index is two lower than what you are showing, but the numbers are in the same sequence. Since my code appears to be output correctly all the items, the difference may be on your end, I can't say for sure.
    Code:
      Function CompareTwoRows(row1 As Integer, row2 As Integer) As Integer
        Dim retVal As Integer
        Dim DataLen As Integer = Rotations.Length
    
        'convert RotationIndexes to PointerIndexes
        Dim ptr1 As Integer = (DataLen - row1) Mod DataLen
        Dim ptr2 As Integer = (DataLen - row2) Mod DataLen
    
        For i = 0 To DataLen - 1
          If rotatedData(ptr1) <> rotatedData(ptr2) Then
            If rotatedData(ptr1) < rotatedData(ptr2) Then
              retVal = -1
            Else
              retVal = 1
            End If
            Exit For
          End If
          ptr1 += 1 : ptr2 += 1
        Next
    
        Return retVal
      End Function
    
      Private Sub Button2_Click(sender As System.Object, e As System.EventArgs) Handles Button2.Click
        Dim cmpVal As Integer
    
        Label1.Text = "Comparing"
        Label1.Refresh()
    
        'santity check
        For i As Integer = 0 To Rotations.Length - 2
          cmpVal = CompareTwoRows(Rotations(i), Rotations(i + 1))
          If cmpVal > 0 Then Stop   'this breaks in the IDE, it probably does nothing in an executable, but I didn't test it outside the IDE
        Next
        Label1.Text = "Full array order tested, and passed. Sort appears valid"
      End Sub
    Instead of doing a Stop statement call to break into the debugger if it finds an item out of sort, it should probably be changed to a messagebox with meaningful information, if you wanted to make it a permanent part of the program. Because it is pretty quick, you could always do the sanity check at the end of the sort as part of the process, if desired.

    p.s. I also removed the Mod operators from the above code and in the other place it was used. It doesn't really make a noticeable speed difference, but it is unnecessary because I switched to using the 0 pointer in the initial array setup rather than the data.Length offset for the Rotation 0 offset.

  4. #44
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    Heading up to the Lobby to get breakfast.
    Decided to build a few variations and see what times I'm getting.
    The X86 versions are slightly faster than the X64 versions. This seems to be true for most of the code I write. I think I've only had one case where I was using a lot of data and doing a lot of math on the data where the X64 version of the code was faster than the X86 code.
    (I am running on an Windows 10 64-bit "Engineering" laptop).

    So, some of the results (the X-64 versions were only slightly slower, don't show all the variations in this list, just the one for example). The AnyCPU versions was only slightly slower, say about 7ms compared to the X-64 version
    Code:
    Proc     build    In/Out IDE Time
    -------- -------- ---------- -------------
    X-86     Debug    In         720 ms
    X-86     Release  In         463 ms
    AnyCpu   Release  Out        297 ms
    X-64     Release  Out        290 ms
    X-86     Release  Out        199 ms
    This was using your dump.bin data, read in during the Form Load so loading of the OriginalData was not part of the Sort timing. Also didn't include the time to verify the sort order (which would be much shorter than the time to actually sort the data).

  5. #45

    Thread Starter
    Member
    Join Date
    Apr 2009
    Posts
    55

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    Many thanks for your commitment and support!!!!!! finally could close the book on this question once and for all

    Hey passel I haven't been able to access a computer for 2 days but ya thanks for that If statement > 3 it solved the whole issue. Ya X86 is faster probably because it all works on 32 bit integers under the hood. AnyCpu is pretty funny lol how does it work on both CPUs probably thats not native at all but completely interpreted bytecode version for the .NET framework then again it all is so i dont know what im talking about

    The sanity check function is interesting way to figure out if its all valid but I had a hard time understanding it just checks 2 rows at a time to check if they are sorted this could still be fooled if its all sorted by segments and not as a whole haha like collisions but I tested it with a method that 100% gives the correct result. It's a burrow wheeler reverser I madeI inputted the original data before it got doubled in rotatedData and I generated a new data from the last data element from every sorted rotation (thats how BWT works) and then I took which rotation contains the original data aka Rotation 0's position I had to use my Mod Pointer to rotation index converter for this set so I could make the reverse easier.. then I dumped both datas the original data and the decoded data from the reverse and I ran it in HexCmp (a app that compares byte by byte for changes and clicked if any changes existed and it said no so yup it's perfect!!! the chance of it still being unsorted is like 1 in 820 lol

    Heres how the whole thing looks like

    Code:
        Public Function SortLexicoGraphicallyArrayPointers(ByRef data As Byte(), ByRef outdata As Byte()) As UInteger
            Dim rotatedData As Byte()
            Dim index As Long
            ReDim rotatedData(data.LongLength + data.LongLength - 1)
    
            'File.WriteAllBytes("C:\Users\Igor\Desktop\compress\BWT Gap Fixer Compression - Copy\WindowsApp\WindowsApp\bin\x64\Debug\dump.bin", data)
    
    
            Array.Copy(data, 0, rotatedData, 0, data.LongLength)
            Array.Copy(data, 0, rotatedData, data.LongLength, data.LongLength)
    
            Dim Rotations As UInteger()
            ReDim Rotations(data.LongLength - 1)
    
            Dim rotation As UInteger = 0
            'Rotations points to where each Rotation exists in the Original array
            For i = data.LongLength To 1 Step -1
                Rotations(rotation) = i
                rotation += 1
            Next
    
            Dim pos As Long = 0
    
            SortSegment(rotatedData, Rotations, 0, data.Length - 1, 0)  'sort the whole array
    
            'Dim outData() As Byte
            ReDim outData(data.Length - 1)
            For i As Integer = 0 To data.Length - 1
                's += Chr(data(i))
                outData(i) = rotatedData(Rotations(i) + data.Length - 1)
            Next
            'f_frmMain.txtDebug.Text += s + vbCrLf
    
    
            'Convert rotation pointers back to rotation values.
            For index = 0 To Rotations.LongLength - 1
                rotation = (data.LongLength - Rotations(index)) Mod data.LongLength
                Rotations(index) = rotation
            Next
    
            Dim offsetIndexOfOriginalData As Integer = 0
            For i As Integer = 0 To Rotations.Length - 1
                If Rotations(i) = 0 Then
                    offsetIndexOfOriginalData = i
                End If
            Next
    
            Return offsetIndexOfOriginalData
        End Function
    Code:
        Private Sub SortSegment(RotatedData() As Byte, Rotations() As UInteger, segBottomIn As Integer, segTopIn As Integer, Offset As Integer)
            Dim upPtr, dnPtr As UInteger
            upPtr = segBottomIn : dnPtr = segTopIn
            Dim zeroTop As Integer = segBottomIn - 1
            Dim segTop As Integer = segTopIn
            Dim segBottom As Integer = segBottomIn
    
            If Offset < Rotations.Length - 1 Then
                Do While upPtr < dnPtr
    
                    'Search from the bottom of the segment, looking for a byte with 1 in it
                    'Keep track of where the last 0 was found for our split point
                    Do While RotatedData(Rotations(upPtr) + Offset) = 0
                        zeroTop = upPtr
                        If upPtr < dnPtr Then
                            upPtr += 1
                        Else
                            Exit Do
                        End If
                    Loop
    
                    'Search from the top, looking for a byte with 0 in it
                    Do While RotatedData(Rotations(dnPtr) + Offset) = 1 AndAlso upPtr < dnPtr
                        dnPtr -= 1
                    Loop
    
                    'If two bytes were found that need to be swapped, swap the rows
                    If upPtr < dnPtr Then
                        Dim tmpFirst As UInteger = Rotations(upPtr)
                        Rotations(upPtr) = Rotations(dnPtr)
                        Rotations(dnPtr) = tmpFirst
                        zeroTop = upPtr
                        If dnPtr - upPtr > 3 Then
                            upPtr += 1 'we can update the pointers since we
                            dnPtr -= 1 'don't need to test the swapped rows again
                        End If
                    End If
                Loop
    
                'if we have multiple entrys in the 0 end of the segment, split the 0's subsegment
                'based on the next byte in the seqence
                If zeroTop > segBottom Then
                    SortSegment(RotatedData, Rotations, segBottom, zeroTop, Offset + 1)
                End If
    
                'If we hae multiple entrys in the 1 end of the segment, split the 1's subsegment
                If zeroTop + 1 < segTop Then
                    SortSegment(RotatedData, Rotations, zeroTop + 1, segTop, Offset + 1)
                End If
            End If
        End Sub
    Code:
        Public Function reverseBurrowsWheelerTransform(ByVal data As Byte(), ByVal decodeIndex As Long) As Byte()
            Dim dataSize As UInteger = data.Length - 1
    
            Dim countBytes As UInteger() = New UInteger(255) {}
            Dim byteCountsPerIndex As UInteger() = New UInteger(dataSize) {}
    
            'Counts amount of same bytes already occurred before the next substring
            For i As Integer = 0 To dataSize
                byteCountsPerIndex(i) = countBytes(data(i))
                countBytes(data(i)) += 1
            Next
    
            'Lexicographically less then decodeIndex
            Dim sum As UInteger = 0
            Dim tmpCount As UInteger = 0
    
            For i As Integer = 0 To countBytes.Length - 1
                tmpCount = countBytes(i)
                countBytes(i) = sum
                sum += tmpCount
            Next
    
            Dim decodedData As Byte() = New Byte(dataSize) {}
            tmpCount = decodeIndex
            For i As Integer = dataSize To 0 Step -1
                decodedData(i) = data(tmpCount)
                tmpCount = byteCountsPerIndex(tmpCount) + countBytes(data(tmpCount))
            Next
    
            Return decodedData
        End Function

  6. #46
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    Quote Originally Posted by pkedpker View Post
    ... AnyCpu is pretty funny lol how does it work on both CPUs probably thats not native at all but completely interpreted bytecode version for the .NET framework then again it all is so i dont know what im talking about...
    The .exe file is not all native code, it is mostly Microsoft Intermediate Language (MSIL), which when you run the exe file, a JIT (Just In Time) compiler can efficiently convert MSIL to native code. So, with "AnyCpu", you support JIT converting into either x86 or X64 native code. There's more to it than that, I'm sure, but that is the crux of it. If the controls and methods you use in your program are supported by mono, you can just run the exe you created on the windows box on a linux box without having to recompile the code for Linux, for instance. You don't run the exe directly, you run mono and pass the exe to it, so it accesses the MSIL code in the exe and compiles it (or perhaps interprets it in some cases, I don't know).

    Quote Originally Posted by pkedpker View Post
    The sanity check function is interesting way to figure out if its all valid but I had a hard time understanding it just checks 2 rows at a time to check if they are sorted this could still be fooled if its all sorted by segments and not as a whole haha like collisions but I tested it with a method that 100% gives the correct result. ...
    It doesn't matter if it is sorted by segments. You only run it after the sort is done, i.e. all the segments have been sorted. It simply goes through the whole array from front to back (i.e. in sorted order) and verifies that the lower row is not greater than the row after it. If any row was not sorted correctly, it would fail that test at some point.

    Since the recursive sub call fails in the worse case scenario for this sort (only one byte set causes a stack overflow), I went ahead and modified the code slightly to not use recursion. Where it would recall the sub to split the next segment, I just push the three paramters on a stack. The program then just loops back up to the top, and check if there is anything on the stack, and if so pops it and processes it. If there is nothing on the stack, then you're done so the loop exits.

    It essentially runs at about the same speed as the recursion version, although it might be slightly slower on average. But, at least it doesn't bomb out on the worse case test. It does take awhile to finish though for the worse case, generally in my machine from 56 to 59 seconds. It's also a worse case for the current version of my sanity check routine as it also takes about the same amount of time, perhaps a little more, to verify that the rows are sorted properly. That has given me an idea about how to improve the speed of the santity check, but I haven't tried to improve it yet.

    In post#36, when I found that the recursive sort fails on a 200,000 byte array with only 1 byte set, I mentioned a possible approach (before I implemented the non-recursive version).
    So, if you have a lot of zeros in your data (runs of thousands of zeros), a recursive implementation is not the way to go. I would have to think some more about a better approach for that case. Perhaps look for the longest runs of zeros and ones, and "presort" those into matching subsegments, then start the sorting by byte offset in each of those subsegments to do the final sorting.
    I didn't implement it first because using a stack instead of being recursive was a simple change and could be done in a few minutes.
    Doing the approach of looking at runs and presorting those into matching subsegments would involve multiple lists to collect rows that started with matching runs into groups to be sorted, which is a more complicated thing to work out.

    In post #42 I went into greater detail about how using runs to create the groups/segments to be sorted, would work. Earlier today I went ahead and created a version along those lines, using a linked list approach to collect rows into groups, then process the linked lists to fill the Rotations array in sorted group order, and then sorted each of those groups to get the final sort.

    Running that on your set of data, it was about twice as fast as the recursive method (I got around 128 ms for the x86 Release out of the IDE run.

    Interestingly (though not unexpected), the "worse case" version which the recursive method couldn't sort, and the non-recursive version took about a minute to sort, the "sort runs into sorted groups" version did it in 16 ms.

    It is a bit more complicated to understand, in its current form as it is kind of hacked together. If you want it, I should go through it and clean it up. Since the first step is to go through the array once and generate the run counts (and that is quite fast since it is just one pass through the data), based on the statistics of ratio of 0's to 1's, it might make sense to have both methods available if there is a case where recursion has a chance to be faster by a fair amount.

    I haven't tested a fairly balanced mix of 0's and 1's yet to see how this version of the sort does in that case. Since it is twice as fast on your data set, and over 300 time faster on the "worse case" it looks like a winner, but the "worse" case is as sparse as you can get, and your data set was pretty sparse as well, over 260 0's for every 1 in the file. I can assume that at some point, there will be a cross over where a lower ratio of 0's to 1's (more evenly matched) will make the "sorted runs" approach the significantly slower version.

    It's late so I'm not testing it now.
    Last edited by passel; Dec 21st, 2018 at 12:38 AM.

  7. #47

    Thread Starter
    Member
    Join Date
    Apr 2009
    Posts
    55

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    if you want to give the non-recursive version that would be cool i dont care how the code looks like haha i wouldn't look it more then one time since you wasted your time writing it the least you could do is release it since there is no point holding it. I don't think the worst case scenario is possible the distribution of 1's and 0's will always be alike give or take 10 or so less or more 1's. This whole project I am working on was just a theory to test out a assumption I had about the BWT that if you run the sorted dataset (last value of every rotation) on it self and only store the rotation index of the original data, where the output is ran on itself for another pass over and over again that sooner or later I would have a dataset with the 1's being all close together give or take a few zero's thats my initially goal but it still looks random even with the sorted data in play so I would need to look at a different way of doing this... dont even know if its possible im trying this to achieve some kind of like random number generator that would move the 1's around to bunch them up as close as possible then be able to undo it with as little overhead information as possible to achieve compression. Where the 1's bunch up I don't care either since I could just rotate the data once and get them all near the start of the datafile. Thats about all..

    My previous theory was to down-size all the gaps by half but this caused a problem where there was only one 0 or no 0's at all before hitting the next 1 which also failed idea since u can't down-size those values and it throws the whole thing off at undoing the process.

  8. #48
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    You're only rotating the bytes around. The sequence and spacing of the 1's and 0's will always be the same, just starting at a different location.
    Any column in the rotated matrix will be the same sequence, but in various rotations, in reverse order from top to bottom compared to left to right horizontally.

    Now, if you sort the rotations, then the first row of course will have the 1's packed as "tight" as they can be. The longest run of 0's will be at front.
    So, in theory, you could just eliminate the leading 0's and the first 1, and save the amount you rotated and you could restore the original data.
    Essentially this is similar to the normalization that takes place after a floating point calculation in a math coprocessor or library. The leading 0's and the first one are shifted off (removed from the number) (the result referred to as the mantissa) and the amount shifted is stored in the exponent portion of the number.

    Perhaps after sorting, then the last column (or some other column) may have the 1's grouped closer together, but I don't see how that can be of much use, because sorting reorders the data, and you can't unsort sorted data back into the original order without having a lot of information about the original order.

    I'm remembering a story from the late 1970's when using a word processor on a person computer was one of the primary selling points of why someone might want to buy a home computer. The guy presenting the software opened a document and showed how you could do a search and replace to replace all the "p"'s in the document with an "o". And then, you could do a search and replace of the "o"s back to "p", but then realized it changed all the "o"s to "p", not just the ones he had originally changed. Of course later, with more memory available and better software they added the "Undo" function, but in order for that to work, you either have to snapshot the document before the search and replace operation, or track every letter changed so you can undo the changes in reverse order.

    So, to unsort sorted data back to its original order, you would either need a copy of the original data, or track how the data moved from original order to sorted order. I think that is the fallacy of your thinking. That you can somehow unsort data back into its original order by some complimentary algorithm related to how you sorted the data. That is not possible.

    Sorting data is not the same as scrambling data with an algorithm. With a scrambling algorithm, assuming the algorithm is not one way (which technically, a sort is one way reordering), you can use a complimentary algorithm to unscramble the data.

    All compression schemes work by removing redundant information that can be encode into a shorter form, i.e. runs of data reduced to a single instance of the data and a run length, or a single copy of the data in a lookup table, and an index into the table location stored instead of the data each place it occurs.
    There are compression schemes that depend on determining fractal equations that can reproduce portions of data, but I have no chance of understanding how that is done, and I've hear of wavelet compression, but I don't understand that either.

    I could post the non-recursive version, as it was just a simple modification, so it is essentially the same program already posted, with pushing data to a stack instead of doing a recursive call. A loop had to be added of course so that stack gets drained.

    The code that I said is hacked together and not organized is the third version which uses the runs of repetitive bytes to essentially treat the data in a compressed way. That improved the time quite a bit in some case, ( I mentioned it doubled the speed on your data), but it is a partial implementation, only considering the first run of each rotation to initially divide the data into 820 groups of rotations that start runs of 0's, and one group that contains the rotations that start with a 1 (a single group containing 820 rotations). It then uses the existing sort method to sort those groups.

    That sorted your data about twice as fast, and a randomly filled array fast as well. But some further testing, setting two 0's in the data took twice as long a single zero, and setting 2, 3, 4 and 5 bytes with 1, but fairly widely space, took increasing long periods, e.g. the 4 byte version took about 15 seconds and the 5 byte set version took about 30 seconds. This has to do with the sort reverting to the existing sort after the first groups are created. I think the solution for a consistently fast sort would be to finish the pattern, by sorting the runs at every pass, not just the first. Since that code is more complicated, and involves a form of creating linked lists, which in this case I think can be simplified to a single list, or array, of integers, rather than an array of structures or classes. So, I would rather work that out, rather than post a complicated half implementation which would not be that useful, I think.

    So, since you asked, I'll post the non-recursive version, and you should see it isn't too much different from the recursive version.
    I'm posting all the code here, so be aware the are extra references to labels since I put some information to labels rather than use debug.print.
    Also, an OpenFileDialog opens at the start, so you can select the .bin file to be loaded for the initial data (since I used your data in several different version, I just select it from one location, and didn't bother write the code to figure out where that location is). If you just cancel out, then it will create a 200,000 byte array, and put some data in it. I just checked, and it isn't filling it with random 1s, it is just setting one 1 in the array. That is just what it was when I copied it.

    There is also a bunch of commented out code, as I replaced existing code with alternatives.
    Code:
    Public Class Form1
    
      Private OriginalData() As Byte
      Private SortedOrder() As Integer
      Private rotatedData As Byte()
      Private Rotations As Integer()
      Private rand As New Random
      Private maxRecursionDepth As Integer
      Private RecursionDepth As Integer
      Private InitialRunCount() As Integer
    
      Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        Dim fd As OpenFileDialog = New OpenFileDialog
        Dim fileName As String
        fd.InitialDirectory = Environment.CurrentDirectory
        If fd.ShowDialog = Windows.Forms.DialogResult.OK Then
          fileName = fd.FileName
          OriginalData = IO.File.ReadAllBytes(fileName)
        Else
          ReDim OriginalData(199999)                       ' Create a 200,000 byte array
          'For i As Integer = 0 To OriginalData.Length - 1
          '  OriginalData(i) = CByte(rand.Next(2))               '   fill it with random 0's and 1's
          'Next
          OriginalData(100) = 1   'worse case scenarion, only one 1 in the array
        End If
      End Sub
    
      Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
        Dim sw As New Stopwatch
        sw.Start()
        SortedOrder = SortUpDownArrayPointers(OriginalData)
        sw.Stop()
        ' Debug.Print("Elapsed milliseconds = {0}", sw.ElapsedMilliseconds)
        ' Debug.Print("Max Recursion Depth: {0}", maxRecursionDepth)
        Label2.Text = String.Format("Elapsed time: {0} ms {1}Max Recursion Depth: {2}",
                                    sw.ElapsedMilliseconds.ToString, vbCrLf, maxRecursionDepth.ToString)
        ' DebugPrintMatrix()
      End Sub
    
      Public Function SortUpDownArrayPointers(ByRef data As Byte()) As Integer()
        ReDim rotatedData(data.Length + data.Length - 1)
    
        Array.Copy(data, 0, rotatedData, 0, data.Length)
        Array.Copy(data, 0, rotatedData, data.Length, data.Length)
    
        ReDim Rotations(data.Length - 1)
        ReDim InitialRunCount(data.Length - 1)  'hold the count of the number of bytes in sequence, starting with the first, that match the first
    
        Dim rotation As Integer = 0
        'Rotations points to where each Rotation exists in the Original array
        'For i = data.Length To 1 Step -1
        '  Rotations(rotation) = i
        '  rotation += 1
        'Next
        Dim runCnt As Integer
        Dim runDir As Integer = 2 * rotatedData(data.Length - 1) - 1  '0 = -1, 1 = 1
        Dim newRunDir As Integer
        Dim tmpOneCnt As Integer
    
        'We initial the Rotations pointer array, but we also count strings of 0s and 1s associated with the starting byte of that rotation
        'so we can presort the array based on these runs. The sequences that start with the longest run of 0s has to be the at the front of the sort
        'and ones that start with the longest run of 1s has to be at the end of the sort.
        'To save having to use two arrays, one for 0s and one for 1s, we just count 0s in a negative direction and 1s in a positive direction
        'so the sign of the count indicates whether we're counting 0s or 1s, and the magnitude of the number indicates the count
        For i = data.Length - 1 To 0 Step -1
          Rotations(i) = i    'Set the index pointer (or offset of the start of the sequence in the array, in other words)
    
          newRunDir = 2 * rotatedData(i) - 1  'common logic for counting, so calculate the direction first to see if we're changing what we're counting
          If newRunDir = runDir Then  'if this byte is the same value as the one we're counting Then
            runCnt += runDir          '  Count it
          Else                        'else the byte is a different value from what we're currently counting
            '  Debug.Print(runCnt.ToString)
            runDir = newRunDir        '  update the run direction flag
            runCnt = runDir           '  and reinitialize the current count (to 1 or -1)
            If runCnt = 1 Then tmpOneCnt += 1
          End If
          InitialRunCount(i) = runCnt 'save the number of intial bytes in the series match the starting byte
    
        Next
        Debug.Print("Number of 1s: {0}", tmpOneCnt.ToString)
    
        'Start the sorting by kicking off the recursive function to split the array based on the 
        'value of the first byte of the sequence
        'SortSegment(Segment first byte, segment last byte, byte offset to test
        SortSegment(0, data.Length - 1, 0)  'sort the whole array
    
        'Convert rotation pointers back to rotation values.
        For index = 0 To Rotations.Length - 1
          rotation = (data.Length - Rotations(index)) ' Mod data.Length
          Rotations(index) = rotation
        Next
        Return Rotations
      End Function
    
      Private Sub SortSegment(segBottomIn As Integer, segTopIn As Integer, Offset As Integer)
        Dim upPtr, dnPtr As Integer
        Dim zeroTop, segTop, segBottom As Integer
    
        Dim LocalStack As New Stack(Of Integer)
    
        LocalStack.Push(Offset)
        LocalStack.Push(segTopIn)
        LocalStack.Push(segBottomIn)
    
        Do While LocalStack.Count > 0
          segBottomIn = LocalStack.Pop
          segTopIn = LocalStack.Pop
          Offset = LocalStack.Pop
          If LocalStack.Count > maxRecursionDepth Then maxRecursionDepth = LocalStack.Count
    
          zeroTop = segBottomIn - 1
          segTop = segTopIn
          segBottom = segBottomIn
          upPtr = segBottom
          dnPtr = segTop
    
          If Offset < Rotations.Length - 1 Then
            Do While upPtr < dnPtr
    
              'Search from the bottom of the segment, looking for a byte with 1 in it
              'Keep track of where the last 0 was found for our split point
              Do While rotatedData(Rotations(upPtr) + Offset) = 0
                zeroTop = upPtr
                If upPtr < dnPtr Then
                  upPtr += 1
                Else
                  Exit Do
                End If
              Loop
    
              'Search from the top, looking for a byte with 0 in it
              Do While rotatedData(Rotations(dnPtr) + Offset) = 1 AndAlso upPtr < dnPtr
                dnPtr -= 1
              Loop
    
              'If two bytes were found that need to be swapped, swap the rows
              If upPtr < dnPtr Then
                Dim tmpFirst As Integer = Rotations(upPtr)
                Rotations(upPtr) = Rotations(dnPtr)
                Rotations(dnPtr) = tmpFirst
                zeroTop = upPtr
                If dnPtr - upPtr > 3 Then
                  upPtr += 1 'we can update the pointers since we
                  dnPtr -= 1 'don't need to test the swapped rows again
                End If
              End If
            Loop
    
    
            'if we have multiple entrys in the 0 end of the segment, split the 0's subsegment
            'based on the next byte in the seqence
            If zeroTop > segBottom Then
              ' If zeroTop < 10 Then Stop
              '  SortSegment(segBottom, zeroTop, Offset + 1)
              LocalStack.Push(Offset + 1)
              LocalStack.Push(zeroTop)
              LocalStack.Push(segBottom)
            End If
    
            'If we hae multiple entrys in the 1 end of the segment, split the 1's subsegment
            If zeroTop + 1 < segTop Then
              '     SortSegment(zeroTop + 1, segTop, Offset + 1)
              LocalStack.Push(Offset + 1)
              LocalStack.Push(segTop)
              LocalStack.Push(zeroTop + 1)
            End If
    
          End If
        Loop
    
      End Sub
    
      Sub DebugPrintMatrix()
        Dim row As Integer = 0
        Dim idx As Integer
        '    For Each idx As Integer In Rotations
        For j As Integer = 0 To 100
          idx = Rotations(j)
          Debug.Write(String.Format("{0,6}: ", row.ToString))
          row += 1
          For i = 0 To 80 'OriginalData.Length - 1
            Debug.Write(rotatedData(idx + i).ToString)
          Next
          Debug.WriteLine("")
        Next
      End Sub
    
      Private Sub Form1_Shown(sender As Object, e As System.EventArgs) Handles Me.Shown
        NumericUpDown1.Maximum = OriginalData.Length - 1
        NumericUpDown2.Maximum = NumericUpDown1.Maximum
      End Sub
    
      Private Sub Button2_Click(sender As System.Object, e As System.EventArgs) Handles Button2.Click
        '   Dim Result As String = "Rows are equal"
        Dim cmpVal As Integer
    
        Label1.Text = "Comparing"
        Label1.Refresh()
        '   cmpVal = CompareTwoRows(CInt(NumericUpDown1.Value), CInt(NumericUpDown2.Value))
    
        'santity check
        For i As Integer = 0 To Rotations.Length - 2
          cmpVal = CompareTwoRows(Rotations(i), Rotations(i + 1))
          If cmpVal > 0 Then Stop
        Next
        'If cmpVal < 0 Then
        '  Result = "First Row is less than Second Row"
        'ElseIf cmpVal > 0 Then
        '  Result = "First Row is greater than Second Row"
        'End If
        Label1.Text = "Full array order tested, and passed. Sort appears valid"
      End Sub
    
      Function CompareTwoRows(row1 As Integer, row2 As Integer) As Integer
        Dim retVal As Integer
        Dim DataLen As Integer = Rotations.Length
        'convert RotationIndexes to PointerIndexes
        Dim ptr1 As Integer = (DataLen - row1) 'Mod DataLen
        Dim ptr2 As Integer = (DataLen - row2) 'Mod DataLen
    
        For i = 0 To DataLen - 1
          If rotatedData(ptr1) <> rotatedData(ptr2) Then
            If rotatedData(ptr1) < rotatedData(ptr2) Then
              retVal = -1
            Else
              retVal = 1
            End If
            Exit For
          End If
          ptr1 += 1 : ptr2 += 1
        Next
    
        Return retVal
      End Function
    End Class

  9. #49

    Thread Starter
    Member
    Join Date
    Apr 2009
    Posts
    55

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    thanks for the non-recursion function it'll do :P

    by the way you said the way of sorting the data and getting the last column is not possible to get back the original data.. it is.. it's called Burrow Wheeler Transform I posted my reverse code as reverseBurrowsWheelerTransform you need to pass the offsetIndexOfOriginalData which is the first rotation index (0)'s position where its sorted in and you also pass in the last column data.. but it doesn't stack the 1's as tightly as it could... there is still big gaps like 300 - 800 0's in between. It doesn't suffer from the replacing all P's with O's then you lose all the original O's when you replace them back to P's

  10. #50
    Sinecure devotee
    Join Date
    Aug 2013
    Location
    Southern Tier NY
    Posts
    4,907

    Re: [RESOLVED] Can someone fix my sorting algorithm up?

    I didn't say is not possible to get back the original data from the last column. I said it wasn't possible to unsort sorted data without a method of tracking.
    In this case, the tracking bit you have is you have the offsetIndexOfOriginalData. When you sort the data you are merely starting with the longest runs of 0's at the front, and the longest runs (in this case, you only have runs of one byte) of 1's at the end. As I surmised in my later posts, if you just count the lengths of the runs and sort based on that, you can sort the data pretty quickly.

    Before you sort, every column of data is the same data in your original row, just in reverse sequence order. So, the last column of data is the first row of data in reverse order, left to right vs top to bottom.
    After you sort, then the last column should be related to the reverse order of your original data, but sorted in reverse order.
    So, knowing where the original data ended up when you sorted, and then sorting the data in reverse order using the index to divide the resort into above and below the index you gave, you get the original data back.

    I don't know if doing multiple sorts, and keeping track of the relative original index in each sort, to try to get the ones packed tighter will be efficient.
    I mean, you would have to reverse each sort to get back to the original data.

    Your example data had 820 ones in it, and the longest run of zeros between any pair of ones was 1710.
    Since 1710 will fit in a two byte Integer, then a simple compression would just store 820 2-byte integers of the size of the runs in order. Even if you packed the 820 bytes of 1's next to each other through multiple rotations, you're talking 820 bytes, plus all the OriginalIndexes of each sort used to get the 1's to gather. Without some empirical analysis, I don't even know the repeated rotations will result in gathering 1's together.

    Using 1640 bytes in a straight forward RunLengthEncoded scheme seems like it would be fairly good compression. Your zip file compresses the data to 2.1 K so is close to the same compression, using a more generic compression scheme.

    Anyway, I think I'm about out of steam on this topic, so will probably be giving it a rest. The partial sort on runs code seems to have a bug when tested with more combinations, so I would need to trouble shoot the issue, and I think that won't happen soon at this point, with the holiday activities picking up, and work always needed some of my attention.
    Last edited by passel; Dec 23rd, 2018 at 10:07 PM.

Page 2 of 2 FirstFirst 12

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Featured


Click Here to Expand Forum to Full Width