-
Jun 27th, 2011, 04:13 PM
#1
Thread Starter
Hyperactive Member
[RESOLVED] Merge strings and remove overlap
Hi,
I have many strings in a two column text file. The first column is the start position, the second one is the string.
Ex:
Code:
41800403 ACTTACTTACCTACTTCCTTCCCCAAGCCCTTTTCCCCTGTTAAACCCCCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATC
41800409 TTAGCTCCTTGCCCCCCCAAGGCCCTTTGCGCTGGTAACCTCTCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACT
41800426 CAAGGCCCTTTGCGCTGGTAAACTCTCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTAGAGTACCTTCCA
41800448 CTCCCCCTGCCACACTCCCAACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAAC
41800458 CACCCTCCCAACCCCCACCCGTCTTTCGGGGGAGGCTGGCTGCATCCCCCTTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTC
41800458 CNCGCTCCCAGCCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTC
41800459 CCCCGCCCACCCCCCCTCCTTCCGTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCT
41800460 CACCCCCAACCACCACCCGTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTC
41800462 CTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCCGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACACACTTAGACTCAC
41800464 CCCAACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTT
41800464 CCCACCCCCCATCCTGCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTT
41800467 CACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAG
41800476 CTGTCGTGCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAA
41800478 TTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT
41800478 TTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT
Each string in the second column is 100 characters long. Given the start positions and the fixed length, the strings clearly overlap.
How can I merge them together?
For example, if I want the start point to be 38781735 and the end point to be 38781900, the merged string length is 38781900 - 38781735 + 1 = 166.
Thanks.
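For what it's worth, the amount of overlap between two rows follows directly from their start positions and the fixed string length. A minimal sketch, assuming the rows are sorted by start position and each string is exactly 100 characters long (`OverlapLength` is only an illustrative name, not something from the original data or posts):

```vbnet
Imports System

Module OverlapDemo
    ' Number of characters shared by a read starting at startA (length lengthA)
    ' and a later read starting at startB. Returns 0 when there is a gap.
    ' Assumes startB >= startA.
    Function OverlapLength(ByVal startA As Integer, ByVal lengthA As Integer, ByVal startB As Integer) As Integer
        Dim endA As Integer = startA + lengthA - 1   ' last position covered by read A
        Return Math.Max(0, endA - startB + 1)
    End Function

    Sub Main()
        ' First two rows of the sample: starts 41800403 and 41800409, 100 characters each.
        Console.WriteLine(OverlapLength(41800403, 100, 41800409))   ' prints 94
    End Sub
End Module
```

So only the first 6 characters of the first row are not covered again by the second row.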
-
Jun 28th, 2011, 01:15 AM
#2
Addicted Member
Re: Merge strings and remove overlap
I have a question about the data you posted. If the first column is the start position and the second column holds the 100 characters from that point on, why do some index positions result in different data in column two (for example, the two entries at 41800458)? That also isn't always the case: notice that the final two entries are identical.
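One way to check how much two overlapping rows actually disagree is to compare them over the region they share. This is only a rough sketch; `CountMismatches` is a made-up helper name, and the strings in `Main` are toy data rather than rows from the file:

```vbnet
Imports System

Module MismatchDemo
    ' Counts positions where two overlapping reads disagree.
    ' Assumes startB >= startA; only the shared region is compared.
    Function CountMismatches(ByVal startA As Integer, ByVal readA As String, ByVal startB As Integer, ByVal readB As String) As Integer
        Dim offset As Integer = startB - startA   ' where read B begins inside read A
        Dim mismatches As Integer = 0
        Dim i As Integer = 0
        Do While offset + i < readA.Length AndAlso i < readB.Length
            If readA(offset + i) <> readB(i) Then mismatches += 1
            i += 1
        Loop
        Return mismatches
    End Function

    Sub Main()
        ' Toy data: read B starts 2 positions after read A.
        ' Shared region is "GTAC" versus "GTTC", so one character differs.
        Console.WriteLine(CountMismatches(10, "ACGTAC", 12, "GTTC"))   ' prints 1
    End Sub
End Module
```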
-
Jun 28th, 2011, 07:30 AM
#3
Thread Starter
Hyperactive Member
Re: Merge strings and remove overlap
It is possible. If two entries are identical, we can ignore one of them. Ideally the data should be pre-processed so that duplicate entries in the raw data are removed first.
Where entries conflict, we can overwrite the earlier data with the later entry, since the experiments contain data errors/mismatches.
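A possible pre-processing step along those lines, sketched under the assumption that each line is "start position, space, string" and that the later line should win when a start position repeats. `KeepLatestPerStart` is a hypothetical helper name, and the sample lines are toy data:

```vbnet
Imports System
Imports System.Collections.Generic

Module DedupeDemo
    ' Keeps one read per start position; a later line replaces an earlier one.
    ' The SortedDictionary also returns the reads ordered by start position.
    Function KeepLatestPerStart(ByVal lines As IEnumerable(Of String)) As SortedDictionary(Of Integer, String)
        Dim reads As New SortedDictionary(Of Integer, String)
        For Each line As String In lines
            Dim parts() As String = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
            If parts.Length < 2 Then Continue For       ' skip blank or malformed lines
            reads(Integer.Parse(parts(0))) = parts(1)   ' indexer assignment adds or overwrites
        Next
        Return reads
    End Function

    Sub Main()
        ' Toy input: position 5 appears twice, so the second entry wins.
        Dim sample() As String = {"5 AAAA", "5 AACA", "7 CATT"}
        For Each pair As KeyValuePair(Of Integer, String) In KeepLatestPerStart(sample)
            Console.WriteLine(pair.Key.ToString() & " " & pair.Value)
        Next
    End Sub
End Module
```

Identical duplicates need no special handling here, because overwriting an entry with the same string changes nothing.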
-
Jun 28th, 2011, 08:37 AM
#4
Addicted Member
Re: Merge strings and remove overlap
I thought I would give it a whirl. If I understand you correctly, the latest entry for a given index always overwrites any previous entry with the same index. This code takes the difference between the start position of each line and the next, keeps only that many characters from the start of the current line, and repeats this until the end of the file. The last line is copied in full.
I also verified the result manually, so if it comes out incorrect, I misunderstood what you were asking.
Let me know if something doesn't come out the way you want.
Code:
Imports System.IO

Public Class Form1
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Debug.WriteLine("-")
        Dim DnaIndex, DnaData As New List(Of String)
        Dim resultParts As New List(Of String)
        ' Read the file; a later entry with the same start position replaces the earlier one.
        Using myReader As New StreamReader("H:\dna.txt")
            Do While myReader.Peek <> -1
                Dim parts() As String = myReader.ReadLine().Split(" "c)
                Dim existing As Integer = DnaIndex.IndexOf(parts(0))
                If existing >= 0 Then
                    ' Overwrite in place so DnaIndex and DnaData stay aligned.
                    DnaData(existing) = parts(1)
                Else
                    DnaIndex.Add(parts(0))
                    DnaData.Add(parts(1))
                End If
            Loop
        End Using
        ' From each line keep only the characters before the next line's start position.
        For line As Integer = 0 To DnaIndex.Count - 1
            Dim linedata As String = DnaData(line)
            If line = DnaIndex.Count - 1 Then
                ' The last line is copied completely.
                resultParts.Add(linedata)
            Else
                Dim difference As Integer = CInt(DnaIndex(line + 1)) - CInt(DnaIndex(line))
                resultParts.Add(Mid(linedata, 1, difference))
            End If
        Next
        Dim NewStr As String = String.Concat(resultParts.ToArray)
        Debug.WriteLine(NewStr)
    End Sub
End Class
Oh, and NewStr is the result: "ACTTACTTAGCTCCTTGCCCCCCCAAGGCCCTTTGCGCTGGTAAACTCCCCCTGCCCCACTCCCCACCCCCATCTTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT"
Last edited by skor13; Jun 28th, 2011 at 09:02 AM.
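A different way to get "later data overwrites earlier data" is to write every string into one character buffer at its offset from the first start position, instead of trimming each line to the start of the next. This is a swapped-in technique, not the code posted above. The sketch assumes the reads are already parsed into parallel lists, and filling uncovered positions with "N" is purely an assumption of this example; `OverlayReads` is a hypothetical name:

```vbnet
Imports System
Imports System.Collections.Generic

Module OverlayDemo
    ' Writes every read into one buffer at (start - firstStart); reads processed
    ' later overwrite earlier characters. Uncovered positions stay as "N".
    Function OverlayReads(ByVal starts As IList(Of Integer), ByVal reads As IList(Of String)) As String
        If starts.Count = 0 Then Return String.Empty
        Dim firstStart As Integer = Integer.MaxValue
        Dim lastEnd As Integer = Integer.MinValue
        For i As Integer = 0 To starts.Count - 1
            firstStart = Math.Min(firstStart, starts(i))
            lastEnd = Math.Max(lastEnd, starts(i) + reads(i).Length - 1)
        Next
        Dim buffer() As Char = New String("N"c, lastEnd - firstStart + 1).ToCharArray()
        For i As Integer = 0 To starts.Count - 1
            ' Copy the whole read to its offset inside the merged buffer.
            reads(i).CopyTo(0, buffer, starts(i) - firstStart, reads(i).Length)
        Next
        Return New String(buffer)
    End Function

    Sub Main()
        ' Toy data: the second read overwrites "GT" from the first; 16..19 are uncovered.
        Dim starts As New List(Of Integer)(New Integer() {10, 12, 20})
        Dim reads As New List(Of String)(New String() {"ACGT", "TTAA", "GG"})
        Console.WriteLine(OverlayReads(starts, reads))   ' prints ACTTAANNNNGG
    End Sub
End Module
```

For sorted, gap-free reads of equal length this should produce the same string as trimming each line, but it also copes with gaps and with reads that are not in order.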
-
Jun 28th, 2011, 08:43 AM
#5
Re: Merge strings and remove overlap
What output are you expecting from the data you showed in post #1?
-
Jun 28th, 2011, 09:29 AM
#6
Thread Starter
Hyperactive Member
Re: [RESOLVED] Merge strings and remove overlap
-
Jun 28th, 2011, 01:08 PM
#7
Re: [RESOLVED] Merge strings and remove overlap
Did skor13's answer solve the problem?
-
Jun 28th, 2011, 04:56 PM
#8
Addicted Member
Re: [RESOLVED] Merge strings and remove overlap
Well, I suppose it did, as it's listed as resolved. If you planned to use this on a more specific range, note that I hadn't put in any code to stop at a specific point: it ends by copying the last line completely. If you want to change that, the place is where I have "resultParts.Add(linedata)"; you would cut characters off there, since that is the final piece of the string.
Edit: Updated it to take start and end positions. It will display up to the end position - 1.
Code:
Imports System.IO

Public Class Form1
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Debug.WriteLine(MergeDna("E:\dna.txt", 0, 41800478 + 105))
    End Sub

    ' Merges the lines in the file and returns the characters for positions
    ' StartPosition up to EndPosition - 1.
    Function MergeDna(ByVal FileLocation As String, ByVal StartPosition As Integer, ByVal EndPosition As Integer) As String
        Dim DnaIndex, DnaData As New List(Of String)
        Dim resultParts As New List(Of String)
        Using myReader As New StreamReader(FileLocation)
            Do While myReader.Peek <> -1
                Dim parts() As String = myReader.ReadLine().Split(" "c)
                Dim existing As Integer = DnaIndex.IndexOf(parts(0))
                If existing >= 0 Then
                    ' Overwrite in place so DnaIndex and DnaData stay aligned.
                    DnaData(existing) = parts(1)
                Else
                    DnaIndex.Add(parts(0))
                    DnaData.Add(parts(1))
                End If
            Loop
        End Using
        Dim firstIndex As Integer = CInt(DnaIndex(0))
        ' From each line keep only the characters before the next line's start position.
        For line As Integer = 0 To DnaIndex.Count - 1
            Dim linedata As String = DnaData(line)
            If line = DnaIndex.Count - 1 Then
                resultParts.Add(linedata)
            Else
                Dim difference As Integer = CInt(DnaIndex(line + 1)) - CInt(DnaIndex(line))
                resultParts.Add(Mid(linedata, 1, difference))
            End If
        Next
        Dim merged As String = String.Concat(resultParts.ToArray)
        If StartPosition < firstIndex Then
            Debug.WriteLine("Start value is out of range. Setting to start from beginning.")
            StartPosition = firstIndex
        End If
        ' Mid is 1-based; the length is clamped at zero so an end position before the start returns an empty string.
        Dim NewStr As String = Mid(merged, StartPosition - firstIndex + 1, System.Math.Max(0, EndPosition - StartPosition))
        Return NewStr
    End Function
End Class
Last edited by skor13; Jun 29th, 2011 at 11:06 PM.
Reason: Another update
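If the end position should be included, as in the original request (length = end - start + 1), the final cut could look roughly like this. It is only a sketch: `SliceInclusive` and its parameter names are made up for illustration, and it assumes the merged string's first character sits at position `firstIndex`:

```vbnet
Imports System

Module SliceDemo
    ' Returns the characters for positions startPos..endPos (both inclusive)
    ' from a merged string whose first character sits at position firstIndex.
    Function SliceInclusive(ByVal merged As String, ByVal firstIndex As Integer, ByVal startPos As Integer, ByVal endPos As Integer) As String
        Dim lastIndex As Integer = firstIndex + merged.Length - 1
        Dim fromPos As Integer = Math.Max(startPos, firstIndex)   ' clamp to the covered range
        Dim toPos As Integer = Math.Min(endPos, lastIndex)
        If toPos < fromPos Then Return String.Empty
        Return merged.Substring(fromPos - firstIndex, toPos - fromPos + 1)
    End Function

    Sub Main()
        ' Toy data: "ACGTACGT" covers positions 100..107.
        Console.WriteLine(SliceInclusive("ACGTACGT", 100, 102, 105))   ' prints GTAC
    End Sub
End Module
```

Requests that fall partly outside the covered range are clamped, and a request entirely outside it returns an empty string.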