Thread: [RESOLVED] Merge strings and remove overlap

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2008
    Posts
    470

    [RESOLVED] Merge strings and remove overlap

    Hi,
    I have many strings in a two column text file. The first column is the start position, the second one is the string.
    Ex:
    Code:
    41800403 ACTTACTTACCTACTTCCTTCCCCAAGCCCTTTTCCCCTGTTAAACCCCCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATC
    41800409 TTAGCTCCTTGCCCCCCCAAGGCCCTTTGCGCTGGTAACCTCTCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACT
    41800426 CAAGGCCCTTTGCGCTGGTAAACTCTCCCTGCCACACTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTAGAGTACCTTCCA
    41800448 CTCCCCCTGCCACACTCCCAACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAAC
    41800458 CACCCTCCCAACCCCCACCCGTCTTTCGGGGGAGGCTGGCTGCATCCCCCTTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTC
    41800458 CNCGCTCCCAGCCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTC
    41800459 CCCCGCCCACCCCCCCTCCTTCCGTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCT
    41800460 CACCCCCAACCACCACCCGTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTC
    41800462 CTCCCAACCCCCATCCTTCCTTCAGGGGAGGCTGGCTGCATCCCCACTTCCCGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACACACTTAGACTCAC
    41800464 CCCAACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTT
    41800464 CCCACCCCCCATCCTGCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTT
    41800467 CACCCCCATCCTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAG
    41800476 CTGTCGTGCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAA
    41800478 TTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT
    41800478 TTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT
    The second column contains 100 characters. Given the start positions and the string length, these strings obviously overlap.
    How can I merge them together?
    For example, if I want the start point to be 38781735 and the end point to be 38781900, the merged string length is 38781900 - 38781735 + 1.
    Thanks.

  2. #2
    Addicted Member
    Join Date
    Apr 2011
    Posts
    223

    Re: Merge strings and remove overlap

    Quote Originally Posted by zhshqzyc
    I have many strings in a two column text file. The first column is the start position, the second one is the string.
    I have a question about the data you posted. If the first column is the start position and the second column holds the 100 characters from that point, why do some rows with the same start position (for example, the two 41800458 rows) contain different data in column two? That is not always the case either: the final two entries, both at 41800478, are identical.

  3. #3

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2008
    Posts
    470

    Re: Merge strings and remove overlap

    That is possible. If two entries are identical, we can ignore one of them. Ideally the data would be pre-processed so that duplicate entries in the raw data are removed first.

    Where two entries share a start position but differ, we can overwrite the earlier one with the later one, since the experiments contain data errors and mismatches.
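
    A minimal sketch of that "later entry wins" rule, assuming each line is a start position, a space, and the string. The module name, the KeepLatest helper, and the sample lines are hypothetical and not from the original data.

```vbnet
Imports System
Imports System.Collections.Generic

Module DedupSketch
    ' Hypothetical helper: keeps one string per start position; the last one read wins.
    Function KeepLatest(ByVal lines As IEnumerable(Of String)) As SortedDictionary(Of Integer, String)
        Dim reads As New SortedDictionary(Of Integer, String)
        For Each line As String In lines
            Dim parts() As String = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
            If parts.Length < 2 Then Continue For ' skip blank or malformed lines
            ' Assigning through the indexer adds a new key or overwrites an existing one.
            reads(Integer.Parse(parts(0))) = parts(1)
        Next
        Return reads
    End Function

    Sub Main()
        ' Made-up sample: position 102 appears twice with different data.
        Dim sample() As String = {"100 AAAA", "102 CCGG", "102 CCTT"}
        For Each entry As KeyValuePair(Of Integer, String) In KeepLatest(sample)
            Console.WriteLine("{0} {1}", entry.Key, entry.Value)
        Next
        ' Prints:
        ' 100 AAAA
        ' 102 CCTT
    End Sub
End Module
```

    A SortedDictionary also keeps the positions in ascending order, which any later merge step would rely on.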

  4. #4
    Addicted Member
    Join Date
    Apr 2011
    Posts
    223

    Re: Merge strings and remove overlap

    I thought I would give it a whirl. If I understand you correctly, the latest entry for a given index always overwrites any previous entry with the same index. This code takes the difference between the start positions of each line and the next, keeps only that many characters from the start of the current line, and repeats this until the end of the file. The last line is kept whole.

    I also verified the result manually, so if it comes out incorrect, I misunderstood what you were asking.

    Let me know if something doesn't come out how you want it.

    Code:
    Imports System.IO
    Public Class Form1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Debug.WriteLine("-")
            Dim DnaIndex, DnaData As New List(Of String)
            Dim resultParts As New List(Of String)
            Dim myReader As StreamReader = New StreamReader("H:\dna.txt")
            Using myReader
                Do While myReader.Peek <> -1
                    Dim parts() As String = Split(myReader.ReadLine, " "c)
    
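                    ' A repeated start position overwrites the earlier string in place,
                    ' so DnaIndex and DnaData stay aligned even if the file is unsorted.
                    If DnaIndex.Contains(parts(0)) Then
                        DnaData(DnaIndex.IndexOf(parts(0))) = parts(1)
                    Else
                        DnaIndex.Add(parts(0))
                        DnaData.Add(parts(1))
                    End If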
                Loop
            End Using
    
            Dim NewStr As String = Nothing
            For line As Integer = 0 To DnaIndex.Count - 1
                Dim difference As Integer
                Dim linedata As String = DnaData(line)
                If line = DnaIndex.Count - 1 Then
                    resultParts.Add(linedata)
                Else
                    difference = CInt(DnaIndex(line + 1)) - CInt(DnaIndex(line))
                    resultParts.Add(Mid(linedata, 1, difference))
                End If
            Next
    
            NewStr = String.Concat(resultParts.ToArray)
            Debug.WriteLine(NewStr)
        End Sub
    End Class
    Oh, and NewStr is the result: "ACTTACTTAGCTCCTTGCCCCCCCAAGGCCCTTTGCGCTGGTAAACTCCCCCTGCCCCACTCCCCACCCCCATCTTTCTTTCAGGGGAGGCTGGCTGCATCCCCACTTCCTGGAGTACCTTCCCAGATCTCCTGGGACAGGTCAACAAACTTAGTCTCACTTTAGGTTTTCCAAAT"
    Last edited by skor13; Jun 28th, 2011 at 09:02 AM.
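
    To make the cutting step easier to follow, here is a small console sketch of the same idea on made-up ten-character data. The MergeReads helper and the sample arrays are hypothetical; it assumes the positions are already sorted ascending with duplicates removed, and that there are no gaps between reads.

```vbnet
Imports System
Imports System.Text

Module MergeSketch
    ' Hypothetical helper: positions must be sorted ascending and unique.
    Function MergeReads(ByVal positions() As Integer, ByVal reads() As String) As String
        Dim merged As New StringBuilder()
        For i As Integer = 0 To positions.Length - 1
            If i = positions.Length - 1 Then
                merged.Append(reads(i)) ' the last read is copied whole
            Else
                ' Keep only the characters that come before the next read starts.
                Dim gap As Integer = positions(i + 1) - positions(i)
                merged.Append(reads(i).Substring(0, Math.Min(gap, reads(i).Length)))
            End If
        Next
        Return merged.ToString()
    End Function

    Sub Main()
        Dim positions() As Integer = {10, 12, 15}
        Dim reads() As String = {"ABCDE", "CDEFG", "FGHIJ"}
        Console.WriteLine(MergeReads(positions, reads)) ' prints ABCDEFGHIJ
    End Sub
End Module
```

    Here the read at 10 contributes "AB", the read at 12 contributes "CDE", and the read at 15 is appended whole. If two neighbouring reads did not overlap at all, the output would no longer line up with the positions, so real data may need a gap check.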

  5. #5
    VB Addict Pradeep1210
    Join Date
    Apr 2004
    Location
    Inside the CPU...
    Posts
    6,614

    Re: Merge strings and remove overlap

    What output are you expecting from the data you showed in post #1?
    Pradeep, Microsoft MVP (Visual Basic)

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2008
    Posts
    470

    Re: [RESOLVED] Merge strings and remove overlap

    Thank you anyway.

  7. #7
    eXtreme Programmer .paul.
    Join Date
    May 2007
    Location
    Chelmsford UK
    Posts
    26,423

    Re: [RESOLVED] Merge strings and remove overlap

    Did skor13's answer solve the problem?

  8. #8
    Addicted Member
    Join Date
    Apr 2011
    Posts
    223

    Re: [RESOLVED] Merge strings and remove overlap

    Well, I suppose it did, as the thread is listed as resolved. If you plan to use this over a more specific range, note that I hadn't put in any code to terminate at a specific point: it ends by copying the last line completely. To change that, trim characters where I have "resultParts.Add(linedata)", since that is the final line of the string.

    Edit: Updated it to take start and end positions. It returns the characters up to the end position - 1.

    Code:
    Imports System.IO
    Public Class Form1
        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    
            Debug.WriteLine(MergeDna("E:\dna.txt", 0, 41800478 + 105))
    
        End Sub
    
        Function MergeDna(ByVal FileLocation As String, ByVal StartPosition As Integer, ByVal EndPosition As Integer) As String
            Dim DnaIndex, DnaData As New List(Of String)
            Dim resultParts As New List(Of String)
            Dim myReader As StreamReader = New StreamReader(FileLocation)
            Using myReader
                Do While myReader.Peek <> -1
                    Dim parts() As String = Split(myReader.ReadLine, " "c)
    
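                    ' A repeated start position overwrites the earlier string in place,
                    ' so DnaIndex and DnaData stay aligned even if the file is unsorted.
                    If DnaIndex.Contains(parts(0)) Then
                        DnaData(DnaIndex.IndexOf(parts(0))) = parts(1)
                    Else
                        DnaIndex.Add(parts(0))
                        DnaData.Add(parts(1))
                    End If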
                Loop
            End Using
    
            Dim firstIndex As Integer = CInt(DnaIndex(0))
    
            Dim NewStr As String = Nothing
            For line As Integer = 0 To DnaIndex.Count - 1
                Dim difference As Integer
                Dim linedata As String = DnaData(line)
                If line = DnaIndex.Count - 1 Then
                    resultParts.Add(linedata)
                Else
                    difference = CInt(DnaIndex(line + 1)) - CInt(DnaIndex(line))
                    resultParts.Add(Mid(linedata, 1, difference))
                End If
            Next
            ' Clamp the start to the first position in the file, then take
            ' EndPosition - start characters (positions start to EndPosition - 1).
            Dim merged As String = String.Concat(resultParts.ToArray)
            Dim effectiveStart As Integer = StartPosition
            If effectiveStart < firstIndex Then
                Debug.WriteLine("Start value is out of range. Setting to start from beginning.")
                effectiveStart = firstIndex
            End If
            ' Mid throws if the length is negative, so never pass less than 0.
            NewStr = Mid(merged, effectiveStart - firstIndex + 1, System.Math.Max(0, EndPosition - effectiveStart))
            Return NewStr
        End Function
    End Class
    Last edited by skor13; Jun 29th, 2011 at 11:06 PM. Reason: Another update
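
    Post #1 asks for a length of end - start + 1, meaning the end position is included, whereas the function above stops at the end position - 1. A minimal sketch of an inclusive slice follows, assuming the merged string and its first position are already known. The SliceInclusive helper and the sample values are hypothetical.

```vbnet
Imports System

Module RangeSketch
    ' Hypothetical helper: returns the characters at positions startPos..endPos inclusive.
    ' firstIndex is the position of merged's first character.
    Function SliceInclusive(ByVal merged As String, ByVal firstIndex As Integer, ByVal startPos As Integer, ByVal endPos As Integer) As String
        Dim lastIndex As Integer = firstIndex + merged.Length - 1
        ' Clamp the requested range to what the merged string actually covers.
        Dim fromPos As Integer = Math.Max(startPos, firstIndex)
        Dim toPos As Integer = Math.Min(endPos, lastIndex)
        If toPos < fromPos Then Return String.Empty
        Return merged.Substring(fromPos - firstIndex, toPos - fromPos + 1)
    End Function

    Sub Main()
        ' "ABCDEFGHIJ" covers positions 10 to 19 in this made-up example.
        Console.WriteLine(SliceInclusive("ABCDEFGHIJ", 10, 12, 15)) ' prints CDEF
        Console.WriteLine(SliceInclusive("ABCDEFGHIJ", 10, 5, 11))  ' prints AB
    End Sub
End Module
```

    Clamping both ends means an out-of-range request returns whatever part of the range exists rather than throwing.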
