
Dec 18th, 2018, 06:49 PM
#41
Thread Starter
Member
Re: [RESOLVED] Can someone fix my sorting algorithm up?
what a life don't worry about this too much do whatever you want thanks for trying to help me out. you should definitely get some sleep if i was moving I would stay of computers for a while just to get to know the new area.. just go on when i'm bored.. this isn't really much of a priority for this. enjoy your new job so much new things to embrace in idk how you even got time for this, haha idk what to write.. snowing that much be tough and pleasant at the same time to drive in haha.
I myself didn't test the results from my comparisons and yours to see which is right I would dump the first 100 digits from both arrays and space them out in a notepad to see where the difference happens but I didn't do it too lazy I just assumed that if there is 400 then 401 after it its just 1 digit difference top one being a 0 and the bottom one being a 1. I don't even think identical lines are possible in rotations maybe with some luck collisions could happen but why are the collisions soo close in row index's they should be completely random numbers since the rotations only change 1 byte a time then its just the last digit where zero's slowly go down and 1 come in

Dec 18th, 2018, 11:56 PM
#42
Re: [RESOLVED] Can someone fix my sorting algorithm up?
It depends on the data of course, A simple case is alternating 1's and 0's . If you rotate that, every other rotation is going to match. Pretty much any symmetrical pattern will match at some point, and possibly a number of times.
Looking at possible ways to make the search more efficient, I added a little bit of code during the filling of the rotations array, to also count runs of zeros or ones (starting with the first byte of the rotation, how many of the following sequential bytes are the same. The one with the longest run of zeros would be the first item in the sorted array, and the one with the longest run of ones would be the last. If you group these first, then you can skip all the leading zeros and ones and just start sorting after that point, on matching runs.
It takes almost no time to count those as you go along (filling the rotations array backwards again, so that you can store the count as you go along). That showed me that in your set of 214,548 bytes, only 820 of them were 1s, and also that none of those bytes with 1 in them were contiguous, i.e. there was never more than one byte with 1 in it in a row.
I don't know if the data you're dealing with, that would always be the case or not, but if it was, then a specialized sort could take advantage of that to greatly reduce the number of compares necessary to sort the data. One way to look at it would be that you have 820 different (or so) runs of zeros counts that you want to sort. For the first iteration you would sort all the runs that start in the first byte, by the size of the run. You would have essentially, fairly quickly divided the big list into 820+ groups to sort. Theoretically that would mean on average each of those 820 groups would have 260 items to be sorted to have the whole thing sorted.
Of course, in reality the groups won't be the same size. You would have (in this set of data) a small number of groups with large (as large as 1700+ 0s in the run) in them to be sorted, and a large number of groups with a lot of small runs in them. Of the 820 groups, the smallest number of rows in the group that would need to be sorted would be 1, and the largest number of rows in one of those 820 groups, would be 820 itself. That is by definition since you only have 820 1s in the array. Because of the rotation, each of those 820 1s will end up in the second byte of the rotation series once, which means you have 820 rotations that have a run of one 0 in the first byte. Of course, you could also have a group with 820 rotations with two 0s at that front, the same thing for three. In fact, whatever the smallest gap between 1s in the series should determine the number of groups that would have 820 "rows" with that number of 0s, or less at the start to sort.
By having each byte in the series having the number of zeros that follow it associated with it, you can sort based on that number, then skip over that number to the next zero following the 1 byte location to see how many zeros are in that series and sort that group based on that number. Skipping all those bytes would reduce the group being sorted and the number of comparisons needed greatly. Another statistic, I don't know that I mentioned before, is the current code ends up calling the recursive group sort sub over 8 million times with your current data set. With sparsely scattered ones in the data as you have, I assume the calls to sort groups would be reduced to only a couple of hundred thousand times.

Dec 19th, 2018, 02:05 AM
#43
Re: [RESOLVED] Can someone fix my sorting algorithm up?
I was already investigated the sorting problem when I wrote the above post, so I decided to continue investigating since we knew it failed very early (relatively) in the sort. Essentially using a conditional break, I let it run until it got the to sorting the group of the first 10 elements (the last of the 0 group sorts, where the recursion depth was in the low 1700s.
Without too much elaboration, the problem was cause by the minor optimization of updating the two pointers after doing a swap. This is fine as long as there are at least two items left to be compared in the group. When there is only one item left in the group to compare, it exited the loop because there was no purpose in comparing the item with itself (the two pointers would be pointing at the remaining item). But, the last item should be compared to the item in front of it and possibly behind it, since it may need to be swapped with one of those neighbors.
In the SortSegment Sub, after the items were swapped were these two lines:
Code:
upPtr += 1 'we can update the pointers since we
dnPtr = 1 'don't need to test the swapped rows again
If we had three items left before the swap, then we would have only one item left after and we need to test that item with its neighbors, so we add a guard around those two lines:
Code:
If dnPtr  upPtr > 3 Then
upPtr += 1 'we can update the pointers since we
dnPtr = 1 'don't need to test the swapped rows again
End If
As part of my earlier verification, I added a routine to compare two entries (I believe I mentioned that before). So I added a button to loop through the whole array and compare every entry with its next neighbor to verify they all were in order. This quick sanity check (it is very quick) not only proved the items were now in order, but also that there were no duplicates. To be a proper test, it probably should allow for duplicates, but this was just a quick sanity check, not a complete test.
In case I didn't post the test function before, I'll post it here, and the added button click handler code that does the sanity check. Your post had some higher index values showing as well, and my index is two lower than what you are showing, but the numbers are in the same sequence. Since my code appears to be output correctly all the items, the difference may be on your end, I can't say for sure.
Code:
Function CompareTwoRows(row1 As Integer, row2 As Integer) As Integer
Dim retVal As Integer
Dim DataLen As Integer = Rotations.Length
'convert RotationIndexes to PointerIndexes
Dim ptr1 As Integer = (DataLen  row1) Mod DataLen
Dim ptr2 As Integer = (DataLen  row2) Mod DataLen
For i = 0 To DataLen  1
If rotatedData(ptr1) <> rotatedData(ptr2) Then
If rotatedData(ptr1) < rotatedData(ptr2) Then
retVal = 1
Else
retVal = 1
End If
Exit For
End If
ptr1 += 1 : ptr2 += 1
Next
Return retVal
End Function
Private Sub Button2_Click(sender As System.Object, e As System.EventArgs) Handles Button2.Click
Dim cmpVal As Integer
Label1.Text = "Comparing"
Label1.Refresh()
'santity check
For i As Integer = 0 To Rotations.Length  2
cmpVal = CompareTwoRows(Rotations(i), Rotations(i + 1))
If cmpVal > 0 Then Stop 'this breaks in the IDE, it probably does nothing in an executable, but I didn't test it outside the IDE
Next
Label1.Text = "Full array order tested, and passed. Sort appears valid"
End Sub
Instead of doing a Stop statement call to break into the debugger if it finds an item out of sort, it should probably be changed to a messagebox with meaningful information, if you wanted to make it a permanent part of the program. Because it is pretty quick, you could always do the sanity check at the end of the sort as part of the process, if desired.
p.s. I also removed the Mod operators from the above code and in the other place it was used. It doesn't really make a noticeable speed difference, but it is unnecessary because I switched to using the 0 pointer in the initial array setup rather than the data.Length offset for the Rotation 0 offset.

Dec 19th, 2018, 08:00 AM
#44
Re: [RESOLVED] Can someone fix my sorting algorithm up?
Heading up to the Lobby to get breakfast.
Decided to build a few variations and see what times I'm getting.
The X86 versions are slightly faster than the X64 versions. This seems to be true for most of the code I write. I think I've only had one case where I was using a lot of data and doing a lot of math on the data where the X64 version of the code was faster than the X86 code.
(I am running on an Windows 10 64bit "Engineering" laptop).
So, some of the results (the X64 versions were only slightly slower, don't show all the variations in this list, just the one for example). The AnyCPU versions was only slightly slower, say about 7ms compared to the X64 version
Code:
Proc build In/Out IDE Time
   
X86 Debug In 720 ms
X86 Release In 463 ms
AnyCpu Release Out 297 ms
X64 Release Out 290 ms
X86 Release Out 199 ms
This was using your dump.bin data, read in during the Form Load so loading of the OriginalData was not part of the Sort timing. Also didn't include the time to verify the sort order (which would be much shorter than the time to actually sort the data).

Dec 20th, 2018, 09:29 PM
#45
Thread Starter
Member
Re: [RESOLVED] Can someone fix my sorting algorithm up?
Many thanks for your commitment and support!!!!!! finally could close the book on this question once and for all
Hey passel I haven't been able to access a computer for 2 days but ya thanks for that If statement > 3 it solved the whole issue. Ya X86 is faster probably because it all works on 32 bit integers under the hood. AnyCpu is pretty funny lol how does it work on both CPUs probably thats not native at all but completely interpreted bytecode version for the .NET framework then again it all is so i dont know what im talking about
The sanity check function is interesting way to figure out if its all valid but I had a hard time understanding it just checks 2 rows at a time to check if they are sorted this could still be fooled if its all sorted by segments and not as a whole haha like collisions but I tested it with a method that 100% gives the correct result. It's a burrow wheeler reverser I madeI inputted the original data before it got doubled in rotatedData and I generated a new data from the last data element from every sorted rotation (thats how BWT works) and then I took which rotation contains the original data aka Rotation 0's position I had to use my Mod Pointer to rotation index converter for this set so I could make the reverse easier.. then I dumped both datas the original data and the decoded data from the reverse and I ran it in HexCmp (a app that compares byte by byte for changes and clicked if any changes existed and it said no so yup it's perfect!!! the chance of it still being unsorted is like 1 in 820 lol
Heres how the whole thing looks like
Code:
Public Function SortLexicoGraphicallyArrayPointers(ByRef data As Byte(), ByRef outdata As Byte()) As UInteger
Dim rotatedData As Byte()
Dim index As Long
ReDim rotatedData(data.LongLength + data.LongLength  1)
'File.WriteAllBytes("C:\Users\Igor\Desktop\compress\BWT Gap Fixer Compression  Copy\WindowsApp\WindowsApp\bin\x64\Debug\dump.bin", data)
Array.Copy(data, 0, rotatedData, 0, data.LongLength)
Array.Copy(data, 0, rotatedData, data.LongLength, data.LongLength)
Dim Rotations As UInteger()
ReDim Rotations(data.LongLength  1)
Dim rotation As UInteger = 0
'Rotations points to where each Rotation exists in the Original array
For i = data.LongLength To 1 Step 1
Rotations(rotation) = i
rotation += 1
Next
Dim pos As Long = 0
SortSegment(rotatedData, Rotations, 0, data.Length  1, 0) 'sort the whole array
'Dim outData() As Byte
ReDim outData(data.Length  1)
For i As Integer = 0 To data.Length  1
's += Chr(data(i))
outData(i) = rotatedData(Rotations(i) + data.Length  1)
Next
'f_frmMain.txtDebug.Text += s + vbCrLf
'Convert rotation pointers back to rotation values.
For index = 0 To Rotations.LongLength  1
rotation = (data.LongLength  Rotations(index)) Mod data.LongLength
Rotations(index) = rotation
Next
Dim offsetIndexOfOriginalData As Integer = 0
For i As Integer = 0 To Rotations.Length  1
If Rotations(i) = 0 Then
offsetIndexOfOriginalData = i
End If
Next
Return offsetIndexOfOriginalData
End Function
Code:
Private Sub SortSegment(RotatedData() As Byte, Rotations() As UInteger, segBottomIn As Integer, segTopIn As Integer, Offset As Integer)
Dim upPtr, dnPtr As UInteger
upPtr = segBottomIn : dnPtr = segTopIn
Dim zeroTop As Integer = segBottomIn  1
Dim segTop As Integer = segTopIn
Dim segBottom As Integer = segBottomIn
If Offset < Rotations.Length  1 Then
Do While upPtr < dnPtr
'Search from the bottom of the segment, looking for a byte with 1 in it
'Keep track of where the last 0 was found for our split point
Do While RotatedData(Rotations(upPtr) + Offset) = 0
zeroTop = upPtr
If upPtr < dnPtr Then
upPtr += 1
Else
Exit Do
End If
Loop
'Search from the top, looking for a byte with 0 in it
Do While RotatedData(Rotations(dnPtr) + Offset) = 1 AndAlso upPtr < dnPtr
dnPtr = 1
Loop
'If two bytes were found that need to be swapped, swap the rows
If upPtr < dnPtr Then
Dim tmpFirst As UInteger = Rotations(upPtr)
Rotations(upPtr) = Rotations(dnPtr)
Rotations(dnPtr) = tmpFirst
zeroTop = upPtr
If dnPtr  upPtr > 3 Then
upPtr += 1 'we can update the pointers since we
dnPtr = 1 'don't need to test the swapped rows again
End If
End If
Loop
'if we have multiple entrys in the 0 end of the segment, split the 0's subsegment
'based on the next byte in the seqence
If zeroTop > segBottom Then
SortSegment(RotatedData, Rotations, segBottom, zeroTop, Offset + 1)
End If
'If we hae multiple entrys in the 1 end of the segment, split the 1's subsegment
If zeroTop + 1 < segTop Then
SortSegment(RotatedData, Rotations, zeroTop + 1, segTop, Offset + 1)
End If
End If
End Sub
Code:
Public Function reverseBurrowsWheelerTransform(ByVal data As Byte(), ByVal decodeIndex As Long) As Byte()
Dim dataSize As UInteger = data.Length  1
Dim countBytes As UInteger() = New UInteger(255) {}
Dim byteCountsPerIndex As UInteger() = New UInteger(dataSize) {}
'Counts amount of same bytes already occurred before the next substring
For i As Integer = 0 To dataSize
byteCountsPerIndex(i) = countBytes(data(i))
countBytes(data(i)) += 1
Next
'Lexicographically less then decodeIndex
Dim sum As UInteger = 0
Dim tmpCount As UInteger = 0
For i As Integer = 0 To countBytes.Length  1
tmpCount = countBytes(i)
countBytes(i) = sum
sum += tmpCount
Next
Dim decodedData As Byte() = New Byte(dataSize) {}
tmpCount = decodeIndex
For i As Integer = dataSize To 0 Step 1
decodedData(i) = data(tmpCount)
tmpCount = byteCountsPerIndex(tmpCount) + countBytes(data(tmpCount))
Next
Return decodedData
End Function

Dec 21st, 2018, 12:33 AM
#46
Re: [RESOLVED] Can someone fix my sorting algorithm up?
Originally Posted by pkedpker
... AnyCpu is pretty funny lol how does it work on both CPUs probably thats not native at all but completely interpreted bytecode version for the .NET framework then again it all is so i dont know what im talking about...
The .exe file is not all native code, it is mostly Microsoft Intermediate Language (MSIL), which when you run the exe file, a JIT (Just In Time) compiler can efficiently convert MSIL to native code. So, with "AnyCpu", you support JIT converting into either x86 or X64 native code. There's more to it than that, I'm sure, but that is the crux of it. If the controls and methods you use in your program are supported by mono, you can just run the exe you created on the windows box on a linux box without having to recompile the code for Linux, for instance. You don't run the exe directly, you run mono and pass the exe to it, so it accesses the MSIL code in the exe and compiles it (or perhaps interprets it in some cases, I don't know).
Originally Posted by pkedpker
The sanity check function is interesting way to figure out if its all valid but I had a hard time understanding it just checks 2 rows at a time to check if they are sorted this could still be fooled if its all sorted by segments and not as a whole haha like collisions but I tested it with a method that 100% gives the correct result. ...
It doesn't matter if it is sorted by segments. You only run it after the sort is done, i.e. all the segments have been sorted. It simply goes through the whole array from front to back (i.e. in sorted order) and verifies that the lower row is not greater than the row after it. If any row was not sorted correctly, it would fail that test at some point.
Since the recursive sub call fails in the worse case scenario for this sort (only one byte set causes a stack overflow), I went ahead and modified the code slightly to not use recursion. Where it would recall the sub to split the next segment, I just push the three paramters on a stack. The program then just loops back up to the top, and check if there is anything on the stack, and if so pops it and processes it. If there is nothing on the stack, then you're done so the loop exits.
It essentially runs at about the same speed as the recursion version, although it might be slightly slower on average. But, at least it doesn't bomb out on the worse case test. It does take awhile to finish though for the worse case, generally in my machine from 56 to 59 seconds. It's also a worse case for the current version of my sanity check routine as it also takes about the same amount of time, perhaps a little more, to verify that the rows are sorted properly. That has given me an idea about how to improve the speed of the santity check, but I haven't tried to improve it yet.
In post#36, when I found that the recursive sort fails on a 200,000 byte array with only 1 byte set, I mentioned a possible approach (before I implemented the nonrecursive version).
So, if you have a lot of zeros in your data (runs of thousands of zeros), a recursive implementation is not the way to go. I would have to think some more about a better approach for that case. Perhaps look for the longest runs of zeros and ones, and "presort" those into matching subsegments, then start the sorting by byte offset in each of those subsegments to do the final sorting.
I didn't implement it first because using a stack instead of being recursive was a simple change and could be done in a few minutes.
Doing the approach of looking at runs and presorting those into matching subsegments would involve multiple lists to collect rows that started with matching runs into groups to be sorted, which is a more complicated thing to work out.
In post #42 I went into greater detail about how using runs to create the groups/segments to be sorted, would work. Earlier today I went ahead and created a version along those lines, using a linked list approach to collect rows into groups, then process the linked lists to fill the Rotations array in sorted group order, and then sorted each of those groups to get the final sort.
Running that on your set of data, it was about twice as fast as the recursive method (I got around 128 ms for the x86 Release out of the IDE run.
Interestingly (though not unexpected), the "worse case" version which the recursive method couldn't sort, and the nonrecursive version took about a minute to sort, the "sort runs into sorted groups" version did it in 16 ms.
It is a bit more complicated to understand, in its current form as it is kind of hacked together. If you want it, I should go through it and clean it up. Since the first step is to go through the array once and generate the run counts (and that is quite fast since it is just one pass through the data), based on the statistics of ratio of 0's to 1's, it might make sense to have both methods available if there is a case where recursion has a chance to be faster by a fair amount.
I haven't tested a fairly balanced mix of 0's and 1's yet to see how this version of the sort does in that case. Since it is twice as fast on your data set, and over 300 time faster on the "worse case" it looks like a winner, but the "worse" case is as sparse as you can get, and your data set was pretty sparse as well, over 260 0's for every 1 in the file. I can assume that at some point, there will be a cross over where a lower ratio of 0's to 1's (more evenly matched) will make the "sorted runs" approach the significantly slower version.
It's late so I'm not testing it now.
Last edited by passel; Dec 21st, 2018 at 12:38 AM.

Dec 22nd, 2018, 02:40 PM
#47
Thread Starter
Member
Re: [RESOLVED] Can someone fix my sorting algorithm up?
if you want to give the nonrecursive version that would be cool i dont care how the code looks like haha i wouldn't look it more then one time since you wasted your time writing it the least you could do is release it since there is no point holding it. I don't think the worst case scenario is possible the distribution of 1's and 0's will always be alike give or take 10 or so less or more 1's. This whole project I am working on was just a theory to test out a assumption I had about the BWT that if you run the sorted dataset (last value of every rotation) on it self and only store the rotation index of the original data, where the output is ran on itself for another pass over and over again that sooner or later I would have a dataset with the 1's being all close together give or take a few zero's thats my initially goal but it still looks random even with the sorted data in play so I would need to look at a different way of doing this... dont even know if its possible im trying this to achieve some kind of like random number generator that would move the 1's around to bunch them up as close as possible then be able to undo it with as little overhead information as possible to achieve compression. Where the 1's bunch up I don't care either since I could just rotate the data once and get them all near the start of the datafile. Thats about all..
My previous theory was to downsize all the gaps by half but this caused a problem where there was only one 0 or no 0's at all before hitting the next 1 which also failed idea since u can't downsize those values and it throws the whole thing off at undoing the process.

Dec 22nd, 2018, 09:42 PM
#48
Re: [RESOLVED] Can someone fix my sorting algorithm up?
You're only rotating the bytes around. The sequence and spacing of the 1's and 0's will always be the same, just starting at a different location.
Any column in the rotated matrix will be the same sequence, but in various rotations, in reverse order from top to bottom compared to left to right horizontally.
Now, if you sort the rotations, then the first row of course will have the 1's packed as "tight" as they can be. The longest run of 0's will be at front.
So, in theory, you could just eliminate the leading 0's and the first 1, and save the amount you rotated and you could restore the original data.
Essentially this is similar to the normalization that takes place after a floating point calculation in a math coprocessor or library. The leading 0's and the first one are shifted off (removed from the number) (the result referred to as the mantissa) and the amount shifted is stored in the exponent portion of the number.
Perhaps after sorting, then the last column (or some other column) may have the 1's grouped closer together, but I don't see how that can be of much use, because sorting reorders the data, and you can't unsort sorted data back into the original order without having a lot of information about the original order.
I'm remembering a story from the late 1970's when using a word processor on a person computer was one of the primary selling points of why someone might want to buy a home computer. The guy presenting the software opened a document and showed how you could do a search and replace to replace all the "p"'s in the document with an "o". And then, you could do a search and replace of the "o"s back to "p", but then realized it changed all the "o"s to "p", not just the ones he had originally changed. Of course later, with more memory available and better software they added the "Undo" function, but in order for that to work, you either have to snapshot the document before the search and replace operation, or track every letter changed so you can undo the changes in reverse order.
So, to unsort sorted data back to its original order, you would either need a copy of the original data, or track how the data moved from original order to sorted order. I think that is the fallacy of your thinking. That you can somehow unsort data back into its original order by some complimentary algorithm related to how you sorted the data. That is not possible.
Sorting data is not the same as scrambling data with an algorithm. With a scrambling algorithm, assuming the algorithm is not one way (which technically, a sort is one way reordering), you can use a complimentary algorithm to unscramble the data.
All compression schemes work by removing redundant information that can be encode into a shorter form, i.e. runs of data reduced to a single instance of the data and a run length, or a single copy of the data in a lookup table, and an index into the table location stored instead of the data each place it occurs.
There are compression schemes that depend on determining fractal equations that can reproduce portions of data, but I have no chance of understanding how that is done, and I've hear of wavelet compression, but I don't understand that either.
I could post the nonrecursive version, as it was just a simple modification, so it is essentially the same program already posted, with pushing data to a stack instead of doing a recursive call. A loop had to be added of course so that stack gets drained.
The code that I said is hacked together and not organized is the third version which uses the runs of repetitive bytes to essentially treat the data in a compressed way. That improved the time quite a bit in some case, ( I mentioned it doubled the speed on your data), but it is a partial implementation, only considering the first run of each rotation to initially divide the data into 820 groups of rotations that start runs of 0's, and one group that contains the rotations that start with a 1 (a single group containing 820 rotations). It then uses the existing sort method to sort those groups.
That sorted your data about twice as fast, and a randomly filled array fast as well. But some further testing, setting two 0's in the data took twice as long a single zero, and setting 2, 3, 4 and 5 bytes with 1, but fairly widely space, took increasing long periods, e.g. the 4 byte version took about 15 seconds and the 5 byte set version took about 30 seconds. This has to do with the sort reverting to the existing sort after the first groups are created. I think the solution for a consistently fast sort would be to finish the pattern, by sorting the runs at every pass, not just the first. Since that code is more complicated, and involves a form of creating linked lists, which in this case I think can be simplified to a single list, or array, of integers, rather than an array of structures or classes. So, I would rather work that out, rather than post a complicated half implementation which would not be that useful, I think.
So, since you asked, I'll post the nonrecursive version, and you should see it isn't too much different from the recursive version.
I'm posting all the code here, so be aware the are extra references to labels since I put some information to labels rather than use debug.print.
Also, an OpenFileDialog opens at the start, so you can select the .bin file to be loaded for the initial data (since I used your data in several different version, I just select it from one location, and didn't bother write the code to figure out where that location is). If you just cancel out, then it will create a 200,000 byte array, and put some data in it. I just checked, and it isn't filling it with random 1s, it is just setting one 1 in the array. That is just what it was when I copied it.
There is also a bunch of commented out code, as I replaced existing code with alternatives.
Code:
Public Class Form1
Private OriginalData() As Byte
Private SortedOrder() As Integer
Private rotatedData As Byte()
Private Rotations As Integer()
Private rand As New Random
Private maxRecursionDepth As Integer
Private RecursionDepth As Integer
Private InitialRunCount() As Integer
Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
Dim fd As OpenFileDialog = New OpenFileDialog
Dim fileName As String
fd.InitialDirectory = Environment.CurrentDirectory
If fd.ShowDialog = Windows.Forms.DialogResult.OK Then
fileName = fd.FileName
OriginalData = IO.File.ReadAllBytes(fileName)
Else
ReDim OriginalData(199999) ' Create a 200,000 byte array
'For i As Integer = 0 To OriginalData.Length  1
' OriginalData(i) = CByte(rand.Next(2)) ' fill it with random 0's and 1's
'Next
OriginalData(100) = 1 'worse case scenarion, only one 1 in the array
End If
End Sub
Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
Dim sw As New Stopwatch
sw.Start()
SortedOrder = SortUpDownArrayPointers(OriginalData)
sw.Stop()
' Debug.Print("Elapsed milliseconds = {0}", sw.ElapsedMilliseconds)
' Debug.Print("Max Recursion Depth: {0}", maxRecursionDepth)
Label2.Text = String.Format("Elapsed time: {0} ms {1}Max Recursion Depth: {2}",
sw.ElapsedMilliseconds.ToString, vbCrLf, maxRecursionDepth.ToString)
' DebugPrintMatrix()
End Sub
Public Function SortUpDownArrayPointers(ByRef data As Byte()) As Integer()
ReDim rotatedData(data.Length + data.Length  1)
Array.Copy(data, 0, rotatedData, 0, data.Length)
Array.Copy(data, 0, rotatedData, data.Length, data.Length)
ReDim Rotations(data.Length  1)
ReDim InitialRunCount(data.Length  1) 'hold the count of the number of bytes in sequence, starting with the first, that match the first
Dim rotation As Integer = 0
'Rotations points to where each Rotation exists in the Original array
'For i = data.Length To 1 Step 1
' Rotations(rotation) = i
' rotation += 1
'Next
Dim runCnt As Integer
Dim runDir As Integer = 2 * rotatedData(data.Length  1)  1 '0 = 1, 1 = 1
Dim newRunDir As Integer
Dim tmpOneCnt As Integer
'We initial the Rotations pointer array, but we also count strings of 0s and 1s associated with the starting byte of that rotation
'so we can presort the array based on these runs. The sequences that start with the longest run of 0s has to be the at the front of the sort
'and ones that start with the longest run of 1s has to be at the end of the sort.
'To save having to use two arrays, one for 0s and one for 1s, we just count 0s in a negative direction and 1s in a positive direction
'so the sign of the count indicates whether we're counting 0s or 1s, and the magnitude of the number indicates the count
For i = data.Length  1 To 0 Step 1
Rotations(i) = i 'Set the index pointer (or offset of the start of the sequence in the array, in other words)
newRunDir = 2 * rotatedData(i)  1 'common logic for counting, so calculate the direction first to see if we're changing what we're counting
If newRunDir = runDir Then 'if this byte is the same value as the one we're counting Then
runCnt += runDir ' Count it
Else 'else the byte is a different value from what we're currently counting
' Debug.Print(runCnt.ToString)
runDir = newRunDir ' update the run direction flag
runCnt = runDir ' and reinitialize the current count (to 1 or 1)
If runCnt = 1 Then tmpOneCnt += 1
End If
InitialRunCount(i) = runCnt 'save the number of intial bytes in the series match the starting byte
Next
Debug.Print("Number of 1s: {0}", tmpOneCnt.ToString)
'Start the sorting by kicking off the recursive function to split the array based on the
'value of the first byte of the sequence
'SortSegment(Segment first byte, segment last byte, byte offset to test
SortSegment(0, data.Length  1, 0) 'sort the whole array
'Convert rotation pointers back to rotation values.
For index = 0 To Rotations.Length  1
rotation = (data.Length  Rotations(index)) ' Mod data.Length
Rotations(index) = rotation
Next
Return Rotations
End Function
Private Sub SortSegment(segBottomIn As Integer, segTopIn As Integer, Offset As Integer)
Dim upPtr, dnPtr As Integer
Dim zeroTop, segTop, segBottom As Integer
Dim LocalStack As New Stack(Of Integer)
LocalStack.Push(Offset)
LocalStack.Push(segTopIn)
LocalStack.Push(segBottomIn)
Do While LocalStack.Count > 0
segBottomIn = LocalStack.Pop
segTopIn = LocalStack.Pop
Offset = LocalStack.Pop
If LocalStack.Count > maxRecursionDepth Then maxRecursionDepth = LocalStack.Count
zeroTop = segBottomIn  1
segTop = segTopIn
segBottom = segBottomIn
upPtr = segBottom
dnPtr = segTop
If Offset < Rotations.Length  1 Then
Do While upPtr < dnPtr
'Search from the bottom of the segment, looking for a byte with 1 in it
'Keep track of where the last 0 was found for our split point
Do While rotatedData(Rotations(upPtr) + Offset) = 0
zeroTop = upPtr
If upPtr < dnPtr Then
upPtr += 1
Else
Exit Do
End If
Loop
'Search from the top, looking for a byte with 0 in it
Do While rotatedData(Rotations(dnPtr) + Offset) = 1 AndAlso upPtr < dnPtr
dnPtr = 1
Loop
'If two bytes were found that need to be swapped, swap the rows
If upPtr < dnPtr Then
Dim tmpFirst As Integer = Rotations(upPtr)
Rotations(upPtr) = Rotations(dnPtr)
Rotations(dnPtr) = tmpFirst
zeroTop = upPtr
If dnPtr  upPtr > 3 Then
upPtr += 1 'we can update the pointers since we
dnPtr = 1 'don't need to test the swapped rows again
End If
End If
Loop
'if we have multiple entrys in the 0 end of the segment, split the 0's subsegment
'based on the next byte in the seqence
If zeroTop > segBottom Then
' If zeroTop < 10 Then Stop
' SortSegment(segBottom, zeroTop, Offset + 1)
LocalStack.Push(Offset + 1)
LocalStack.Push(zeroTop)
LocalStack.Push(segBottom)
End If
'If we hae multiple entrys in the 1 end of the segment, split the 1's subsegment
If zeroTop + 1 < segTop Then
' SortSegment(zeroTop + 1, segTop, Offset + 1)
LocalStack.Push(Offset + 1)
LocalStack.Push(segTop)
LocalStack.Push(zeroTop + 1)
End If
End If
Loop
End Sub
Sub DebugPrintMatrix()
Dim row As Integer = 0
Dim idx As Integer
' For Each idx As Integer In Rotations
For j As Integer = 0 To 100
idx = Rotations(j)
Debug.Write(String.Format("{0,6}: ", row.ToString))
row += 1
For i = 0 To 80 'OriginalData.Length  1
Debug.Write(rotatedData(idx + i).ToString)
Next
Debug.WriteLine("")
Next
End Sub
Private Sub Form1_Shown(sender As Object, e As System.EventArgs) Handles Me.Shown
NumericUpDown1.Maximum = OriginalData.Length  1
NumericUpDown2.Maximum = NumericUpDown1.Maximum
End Sub
Private Sub Button2_Click(sender As System.Object, e As System.EventArgs) Handles Button2.Click
' Dim Result As String = "Rows are equal"
Dim cmpVal As Integer
Label1.Text = "Comparing"
Label1.Refresh()
' cmpVal = CompareTwoRows(CInt(NumericUpDown1.Value), CInt(NumericUpDown2.Value))
'santity check
For i As Integer = 0 To Rotations.Length  2
cmpVal = CompareTwoRows(Rotations(i), Rotations(i + 1))
If cmpVal > 0 Then Stop
Next
'If cmpVal < 0 Then
' Result = "First Row is less than Second Row"
'ElseIf cmpVal > 0 Then
' Result = "First Row is greater than Second Row"
'End If
Label1.Text = "Full array order tested, and passed. Sort appears valid"
End Sub
Function CompareTwoRows(row1 As Integer, row2 As Integer) As Integer
Dim retVal As Integer
Dim DataLen As Integer = Rotations.Length
'convert RotationIndexes to PointerIndexes
Dim ptr1 As Integer = (DataLen  row1) 'Mod DataLen
Dim ptr2 As Integer = (DataLen  row2) 'Mod DataLen
For i = 0 To DataLen  1
If rotatedData(ptr1) <> rotatedData(ptr2) Then
If rotatedData(ptr1) < rotatedData(ptr2) Then
retVal = 1
Else
retVal = 1
End If
Exit For
End If
ptr1 += 1 : ptr2 += 1
Next
Return retVal
End Function
End Class

Dec 23rd, 2018, 01:20 AM
#49
Thread Starter
Member
Re: [RESOLVED] Can someone fix my sorting algorithm up?
thanks for the nonrecursion function it'll do :P
by the way you said the way of sorting the data and getting the last column is not possible to get back the original data.. it is.. it's called Burrow Wheeler Transform I posted my reverse code as reverseBurrowsWheelerTransform you need to pass the offsetIndexOfOriginalData which is the first rotation index (0)'s position where its sorted in and you also pass in the last column data.. but it doesn't stack the 1's as tightly as it could... there is still big gaps like 300  800 0's in between. It doesn't suffer from the replacing all P's with O's then you lose all the original O's when you replace them back to P's

Dec 23rd, 2018, 09:57 PM
#50
Re: [RESOLVED] Can someone fix my sorting algorithm up?
I didn't say is not possible to get back the original data from the last column. I said it wasn't possible to unsort sorted data without a method of tracking.
In this case, the tracking bit you have is you have the offsetIndexOfOriginalData. When you sort the data you are merely starting with the longest runs of 0's at the front, and the longest runs (in this case, you only have runs of one byte) of 1's at the end. As I surmised in my later posts, if you just count the lengths of the runs and sort based on that, you can sort the data pretty quickly.
Before you sort, every column of data is the same data in your original row, just in reverse sequence order. So, the last column of data is the first row of data in reverse order, left to right vs top to bottom.
After you sort, then the last column should be related to the reverse order of your original data, but sorted in reverse order.
So, knowing where the original data ended up when you sorted, and then sorting the data in reverse order using the index to divide the resort into above and below the index you gave, you get the original data back.
I don't know if doing multiple sorts, and keeping track of the relative original index in each sort, to try to get the ones packed tighter will be efficient.
I mean, you would have to reverse each sort to get back to the original data.
Your example data had 820 ones in it, and the longest run of zeros between any pair of ones was 1710.
Since 1710 will fit in a two byte Integer, then a simple compression would just store 820 2byte integers of the size of the runs in order. Even if you packed the 820 bytes of 1's next to each other through multiple rotations, you're talking 820 bytes, plus all the OriginalIndexes of each sort used to get the 1's to gather. Without some empirical analysis, I don't even know the repeated rotations will result in gathering 1's together.
Using 1640 bytes in a straight forward RunLengthEncoded scheme seems like it would be fairly good compression. Your zip file compresses the data to 2.1 K so is close to the same compression, using a more generic compression scheme.
Anyway, I think I'm about out of steam on this topic, so will probably be giving it a rest. The partial sort on runs code seems to have a bug when tested with more combinations, so I would need to trouble shoot the issue, and I think that won't happen soon at this point, with the holiday activities picking up, and work always needed some of my attention.
Last edited by passel; Dec 23rd, 2018 at 10:07 PM.
Posting Permissions
 You may not post new threads
 You may not post replies
 You may not post attachments
 You may not edit your posts

Forum Rules

Click Here to Expand Forum to Full Width
