Random DNA Base Pair Creation

**overhill** · Nov 19th, 2000, 02:31 PM

I have made a program that calculates a user-specified number of random DNA base pairs. The four possible pair combinations are:

A:T
T:A
G:C
C:G

The problem comes when calculating millions of these pairs. How can I quickly calculate 10 million base pairs? With the code that I am using it takes over 15 seconds for each million base pairs as a compiled EXE. All that really needs to be calculated is the number of each kind so that I can display percentages.

So in summary I need a way to randomly choose one of these four base pairs millions of times and display totals. Thanks for any help!

Portion of code used:

Code:

    NumberofTimes = 10000000
    Randomize
    For X = 1 To NumberofTimes
        r = Int(Rnd * 8)
        If r = 0 Or r = 1 Then
            '-Base Pair A:T
            AT = AT + 1
        ElseIf r = 2 Or r = 3 Then
            '-Base Pair T:A
            TA = TA + 1
        ElseIf r = 4 Or r = 5 Then
            '-Base Pair G:C
            GC = GC + 1
        ElseIf r = 6 Or r = 7 Or r = 8 Then
            '-Base Pair C:G
            CG = CG + 1
        End If
    Next X

[Edited by overhill on 11-19-2000 at 02:37 PM]

**kedaman** · Nov 19th, 2000, 02:41 PM

May I ask what you need this code for? What's the purpose? I'm pretty sure the results will be about the same every time. It also looks like you wanted C:G to occur more often, but that won't work anyway, since r will never be 8: Int(Rnd * 8) only gives 0 through 7.
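To illustrate why r never reaches 8, here is a minimal sketch. It is not from the original posts: the Sub name `ShowIntRange` and the variable `maxSeen` are made up for the demonstration. It relies on two documented behaviours of VB: `Rnd` returns a value greater than or equal to 0 and less than 1, and `Int` drops the fractional part instead of rounding.

```vb
' Hypothetical demo Sub, not part of overhill's program.
Sub ShowIntRange()
    Dim i As Long, r As Long, maxSeen As Long

    Debug.Print Int(7.99)       ' Int drops the fraction, so this prints 7

    Randomize
    For i = 1 To 100000
        r = Int(Rnd * 8)        ' Rnd is >= 0 and < 1, so r is 0 through 7
        If r > maxSeen Then maxSeen = r
    Next i

    Debug.Print maxSeen         ' never larger than 7
End Sub
```

So with `Int(Rnd * 8)` there are eight equally likely values, 0 through 7, and a branch testing for 8 is dead code.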

**overhill** · Nov 19th, 2000, 08:47 PM

Thank you for pointing out that it will never equal 8. I originally thought that 7.5 and above would round up to 8, but not with Int(). Thanks.

Using smaller numbers (1 to 1,000,000) shows how the percentage of each base pair gets closer and closer to 25% as the number increases. As for larger numbers, the human genome contains about 3 billion base pairs. Some students wanted to have the computer generate a sample human's DNA. When I figured out how long it was going to take, I wanted to see if some other code would be more efficient.

Essentially, I guess I was wondering if there is a more efficient way to code this. Here is the updated code.

Code:

    NumberofTimes = 10000000
    Randomize
    For X = 1 To NumberofTimes
        r = Int(Rnd * 4)
        If r = 0 Then
            '-Base Pair A:T
            AT = AT + 1
        ElseIf r = 1 Then
            '-Base Pair T:A
            TA = TA + 1
        ElseIf r = 2 Then
            '-Base Pair G:C
            GC = GC + 1
        ElseIf r = 3 Then
            '-Base Pair C:G
            CG = CG + 1
        End If
    Next X

**Kaverin** · Nov 20th, 2000, 12:41 AM

Just to insert a little comment here: that counting wouldn't really be that helpful. If you only have a range of 4 possible numbers, the larger your sample size gets, the closer the percentage of each number will get to 25%. In general, as sample size increases, each number of a set of N equally likely numbers will start to appear about 1/N of the time. Unless you wanted to use a bigger range of numbers and put more of them into the If/ElseIf branches, say to skew the occurrence of G:C. You could make it slightly more efficient (I think) using something like this, although making VB do something this many times isn't really efficient in the first place.

Code:

   Dim alngCounts(1 To 4) As Long
   Dim intIndex As Integer
   Dim i As Long

   Randomize Timer
   
   For i = 1 To 1000000
      intIndex = Int(Rnd * 4) + 1
      alngCounts(intIndex) = alngCounts(intIndex) + 1
   Next i

Oh, and as a biology person, I'd say you might want to take Chargaff's rule into account when you make a program for this. Because adenine pairs with thymine and cytosine pairs with guanine, the percentages within each pair are very close. A genome might be 30% A, 30% T, 20% G and 20% C. And if you're really a stickler for biochem, don't forget uracil and other purine/pyrimidine bases that can get put in (in RNA, for example). I could ramble about bio all day....
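As a rough sketch of how the skewed percentages could be wired into the counting loop: the code below takes the 30/30/20/20 example figures and applies them to the four pair types, which is an assumption made here for illustration (the post gives them as base percentages, not pair percentages). The Sub name `CountWeightedPairs`, the constant `TOTAL` and the cutoff values are all hypothetical.

```vb
' Hypothetical sketch: the weights are the example figures from the post,
' applied to the four pair types as an assumption.
Sub CountWeightedPairs()
    Dim alngCounts(1 To 4) As Long   ' 1 = A:T, 2 = T:A, 3 = G:C, 4 = C:G
    Dim sngR As Single
    Dim i As Long
    Const TOTAL As Long = 1000000

    Randomize
    For i = 1 To TOTAL
        sngR = Rnd                   ' one draw, >= 0 and < 1
        If sngR < 0.3 Then           ' first 30% of the range: A:T
            alngCounts(1) = alngCounts(1) + 1
        ElseIf sngR < 0.6 Then       ' next 30%: T:A
            alngCounts(2) = alngCounts(2) + 1
        ElseIf sngR < 0.8 Then       ' next 20%: G:C
            alngCounts(3) = alngCounts(3) + 1
        Else                         ' remaining 20%: C:G
            alngCounts(4) = alngCounts(4) + 1
        End If
    Next i

    ' Show each pair type as a percentage of all draws.
    For i = 1 To 4
        Debug.Print Format$(alngCounts(i) / TOTAL, "0.00%")
    Next i
End Sub
```

Comparing one random value against running cutoffs keeps it to a single `Rnd` call per pair, the same as the evenly weighted version.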

**kedaman** · Nov 20th, 2000, 12:03 PM

Well, I'm not that into genetics and stuff like you, Kaverin, although I had some of that in my biology courses and it was the part I liked most. Anyway, you're absolutely right, there's no good in emulating the randomness by counting each individual combination of 10000000. A waste of CPU, I think. Well, you could do it the mathematical way using the binomial formula. I'm not sure how you implement it with 10000000 elements, but I'm searching for a shortcut right now.
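One possible reading of the "mathematical way", sketched under stated assumptions: if each of N draws picks a given pair type with probability p = 1/4, the count for that type follows a binomial distribution, with mean N * p and standard deviation Sqr(N * p * (1 - p)). That gives the expected totals and their typical spread without looping at all. The Sub and variable names below are made up, and this is not necessarily the shortcut kedaman had in mind.

```vb
' Hypothetical sketch; the Sub name and variable names are made up.
Sub ExpectedSpread()
    Dim dblN As Double, dblP As Double
    Dim dblMean As Double, dblSD As Double

    dblN = 10000000#     ' number of base pairs drawn
    dblP = 0.25          ' chance of any one pair type on each draw

    dblMean = dblN * dblP                    ' 2,500,000 expected per type
    dblSD = Sqr(dblN * dblP * (1 - dblP))    ' roughly 1,369

    Debug.Print "Expected count: "; dblMean
    Debug.Print "Std deviation:  "; Format$(dblSD, "0.0")
    Debug.Print "As share of N:  "; Format$(dblSD / dblN, "0.0000%")
End Sub
```

For 10 million pairs the typical deviation is only around 0.014 percentage points, which is why the displayed percentages all look like 25%.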
