Results 1 to 16 of 16

Thread: [RESOLVED]One Billion Row Challange

  1. #1

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Resolved [RESOLVED]One Billion Row Challange

    Here was the original post. Since I don't do Java unless it is liquid I decided to give it a try in VB. I didn't expect to get near the published times that were shown, 1.5 seconds.

    What followed was humbling. After creating the text file with the billion rows I did some rough timings. Using a StreamReader:

    1. reading 512MB blocks until end of stream took 18 seconds
    2. reading one line at a time until end of stream took 36 seconds


    That was the best I could hope for. The bottom line doing all the required computation is 1.7 minutes on my PC. Where did I go wrong?????

    I'm attaching the project I created for this. It will create the billion row file if you click the create button. The creation alone takes 3 minutes on my PC. The city names were from a list of names I had in Missouri.

    The uploaded project has no bin or obj folder.

    BillRowChal.zip

    Edit: FWIW I did things for the sake of speed I would never actually do.
    Last edited by dbasnett; May 15th, 2024 at 11:29 AM.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  2. #2
    King of sapila
    Join Date
    Oct 2006
    Location
    Greece
    Posts
    6,621

    Re: Billion Row Challange

    I don't wanna be the smartarse here but doing this on a standalone not used server is not a life time example.
    What need to be done is a performance test and traffic simulation with p.e. a tool like JMeter Load Testing tool , this way you will get a glimpse of what will go on when it's live
    else you are just relying on the different hardware everyone posses when running.
    Anyhow for the sake of a test example I guess all the above is not relevant but taken in consideration if for example 100 users try to do the same thing on the server (also I'm guessing you should not process row 1bil rows in a live environment, except if you have a server at CERN or something :P ) and see the results there.

    Also this should probably be better in codebank.
    ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
    πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν·

  3. #3

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    Quote Originally Posted by sapator View Post
    I don't wanna be the smartarse here but doing this on a standalone not used server is not a life time example.... Also this should probably be better in codebank.
    If you run the project and do the create, and the basic read some numbers will be shown that give you an idea of the relative performance between your machine and mine. The reason I didn't put this in the CodeBank was because of the performance difference between my effort(1.7 minutes) and the best Java effort(1.5 seconds). Obviously something I'm not doing correctly.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  4. #4
    PowerPoster PlausiblyDamp's Avatar
    Join Date
    Dec 2016
    Location
    Pontypool, Wales
    Posts
    2,547

    Re: Billion Row Challange

    Quote Originally Posted by dbasnett View Post
    If you run the project and do the create, and the basic read some numbers will be shown that give you an idea of the relative performance between your machine and mine. The reason I didn't put this in the CodeBank was because of the performance difference between my effort(1.7 minutes) and the best Java effort(1.5 seconds). Obviously something I'm not doing correctly.
    I haven't had time to look at your code properly yet, but you might consider moving to net 8 instead of staying on framework 4.8.1 - when I ran your app here it took 206.2 seconds to run your solution. Converting to .net 8 and trying again took 150.3 seconds! So there is a fair performance win by upgrading your version of .net alone!
    Last edited by PlausiblyDamp; May 9th, 2024 at 06:12 PM.

  5. #5
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,569

    Re: Billion Row Challange

    Interesting... I'll have to take a look at this later when I have more time.


    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  6. #6

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    Quote Originally Posted by PlausiblyDamp View Post
    I haven't had time to look at your code properly yet, but you might consider moving to net 8 instead of staying on framework 4.8.1 - when I ran your app here it took 206.2 seconds to run your solution. Converting to .net 8 and trying again took 150.3 seconds! So there is a fair performance win by upgrading your version of .net alone!
    From 1.7 minutes to < 1 minute under .Net 8. WOW.

    edit: The performance seems to be IO. The simple read test was 2.5 times faster.
    Last edited by dbasnett; May 10th, 2024 at 10:06 AM.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  7. #7
    Fanatic Member 2kaud's Avatar
    Join Date
    May 2014
    Location
    England
    Posts
    1,020

    Re: Billion Row Challange

    The performance seems to be IO
    You mean console output? Try re-directing stdout to nul
    All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  8. #8

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    Quote Originally Posted by 2kaud View Post
    You mean console output? Try re-directing stdout to nul
    No. IO to the billion row text file appears a lot faster.
    Last edited by dbasnett; May 15th, 2024 at 10:36 AM.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  9. #9
    eXtreme Programmer .paul.'s Avatar
    Join Date
    May 2007
    Location
    Chelmsford UK
    Posts
    25,520

    Re: Billion Row Challange

    @dbasnett… I’d be interested to see the Java code that you originally found…

    Edit… I found your link now. I’ve never noticed Java being significantly faster than .Net generally.
    Last edited by .paul.; May 12th, 2024 at 03:54 AM.

  10. #10
    PowerPoster wqweto's Avatar
    Join Date
    May 2011
    Location
    Sofia, Bulgaria
    Posts
    5,230

    Re: Billion Row Challange

    There is sample input weather_stations.csv file in https://github.com/gunnarmorling/1brc/tree/main/data

    Is this data loaded for 18 seconds?

    Btw, you can check out some C# implementations too like https://github.com/buybackoff/1brc -- this one is clocked at ~2 sec or https://github.com/noahfalk/1brc/ -- even faster one.

    Of course it's ugly, unreadable source, gory details all over it -- that's what happens when you have to smash these high-level languages down to the metal.

    Here are a lot of optimizations explained: https://hotforknowledge.com/2024/01/...ation-journey/

    cheers,
    </wqw>

  11. #11
    PowerPoster PlausiblyDamp's Avatar
    Join Date
    Dec 2016
    Location
    Pontypool, Wales
    Posts
    2,547

    Re: Billion Row Challange

    I stumbled across those myself, amazed at just how much you can push c# when you need to.

  12. #12

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    Under .Net 8 using memory mapped file I'm under 25 seconds. Here is the code and associated class. It took a little effort to get the file segments to line up.

    Code:
        Private Sub SolutionMMF()
            ' < 23 seconds
            Dim numTasks As Integer = 0 'number of tasks
            Select Case NumProcessor 'guessing
                Case Is <= 4
                    numTasks = 2
                Case Is <= 8
                    numTasks = 4
                Case Is <= 16
                    numTasks = 10
                Case Is <= 32 'my PC
                    numTasks = 26
                Case Else
                    numTasks = NumProcessor - 6
            End Select
    
            Dim lenData As Long = New IO.FileInfo(DataPath).Length
            Dim lenSeg As Long = lenData \ numTasks 'default segment length
            Dim lenSeg1 As Long = (lenData - (lenSeg * numTasks)) + lenSeg 'first segment length
            Dim DAGS As New List(Of DoAGroupMMF)
            Using MMF As IO.MemoryMappedFiles.MemoryMappedFile = IO.MemoryMappedFiles.MemoryMappedFile.CreateFromFile(DataPath,
                                                                                                                       IO.FileMode.Open,
                                                                                                                       Nothing,
                                                                                                                       lenData,
                                                                                                                       IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
                Dim offset As Long = 0L
                Dim Len As Long = lenSeg1
                Using accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor = MMF.CreateViewAccessor(0L, 0L,
                                                                                                         IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
                    For x As Integer = 1 To numTasks
                        Dim byt As Byte = Nothing
                        Do
                            'position so the segment ends on CR
                            If (offset + Len) < lenData Then
                                byt = accessor.ReadByte(offset + Len)
                                Len += 1L
                                If byt = 13 Then
                                    Exit Do
                                End If
                            Else
                                Exit Do
                            End If
                        Loop
                        Dim DAG As New DoAGroupMMF(MMF, offset, Len)
                        DAGS.Add(DAG)
                        offset += Len 'adjust offset
                        Len = lenSeg 'computed length
                    Next
                End Using
            End Using
    
            Dim taskList As New List(Of Task)
            Dim results As New List(Of Dictionary(Of String, CityData))
            For Each ag As DoAGroupMMF In DAGS
                taskList.Add(ag.this)
                results.Add(ag.myDict)
            Next
    
            Task.WaitAll(taskList.ToArray)
            'consolidate results
            DictA = results(0)
            For Tidx As Integer = 1 To numTasks - 1
                Dim d As Dictionary(Of String, CityData) = results(Tidx)
                For Each kvp As KeyValuePair(Of String, CityData) In d
                    Dim cd As CityData = kvp.Value
                    Dim cdD As CityData = DictA(kvp.Key)
                    cdD.Count += cd.Count
                    cdD.Total += cd.Total
                Next
            Next
            STPW.Stop()
    
            CommonResults()
    
        End Sub
    and the class

    Code:
       Private Class DoAGroupMMF
            Public this As Task
            Public myDict As New Dictionary(Of String, CityData)(512)
            Public accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor
            Public Offs As Long = 0L
            Public Len As Long = 0L
            Public procCT As Integer = 0
    
            Public Sub New(MMF As IO.MemoryMappedFiles.MemoryMappedFile, Offset As Long, Length As Long)
                Me.accessor = MMF.CreateViewAccessor(0L, 0L,
                                                     IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
                Me.Offs = Offset
                Me.Len = Length
                Me.this = Task.Run(Sub() Me.DoWork())
            End Sub
    
            Public Sub DoWork()
                Dim sb As New System.Text.StringBuilder(128)
                For idx As Long = 0L To Me.Len - 1L
                    Dim byt As Byte = Me.accessor.ReadByte(Me.Offs + idx) 'ReadChar didn't work
                    If byt = 0 Then
                        Continue For
                    End If
                    Dim c As Char = Convert.ToChar(byt)
                    If c = ControlChars.Cr Then
                        procCT += 1
                        Dim message As String = sb.ToString
                        sb.Length = 0
                        Dim parts() As String = message.Split(";"c) 'split the message
                        Dim val As Integer = Integer.Parse(parts(1).Replace(".", "")) 'treat val as integer
                        Dim key As String = parts(0)
                        Dim CD As CityData
                        If Me.myDict.ContainsKey(key) Then
                            CD = Me.myDict(key)
                            CD.Count += 1
                            CD.Total += val
                            If val > CD._Max Then
                                CD._Max = val
                            ElseIf val < CD._Min Then
                                CD._Min = val
                            End If
                        Else
                            CD = New CityData(key, val)
                            Me.myDict.Add(key, CD)
                        End If
                    Else
                        sb.Append(c)
                    End If
                Next
                Me.accessor.Dispose()
            End Sub
        End Class
    Some of the other 'tricks' mentioned I didn't understand.
    Last edited by dbasnett; May 13th, 2024 at 11:29 AM.
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  13. #13

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    By getting rid of the split and the period replacement I'm under 17 seconds.

    Code:
        Private Class DoAGroupMMF
            Public this As Task
            Public myDict As New Dictionary(Of String, CityData)(512)
            Public accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor
            Public Offs As Long = 0L
            Public Len As Long = 0L
            Public procCT As Integer = 0
    
            Public Sub New(MMF As IO.MemoryMappedFiles.MemoryMappedFile,
                             Offset As Long,
                             Length As Long)
                Me.accessor = MMF.CreateViewAccessor(0L, 0L,
                                                     IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
                Me.Offs = Offset
                Me.Len = Length
                Me.this = Task.Run(Sub() Me.DoWork())
            End Sub
    
            Public Sub DoWork()
                Dim cnm As New System.Text.StringBuilder(128) 'the key, city name
                Dim valS As New System.Text.StringBuilder(32) 'the integer value
                Dim havesemi As Boolean = False 'have semicolon
                Dim byt As Byte
                Dim c As Char
                For idx As Long = 0L To Me.Len - 1L
                    byt = Me.accessor.ReadByte(Me.Offs + idx) 'ReadChar didn't work
                    If byt = 59 Then 'semi colon
                        havesemi = True 'strip semi
                    ElseIf byt = 46 AndAlso havesemi Then
                        'strip period after semi
                    ElseIf byt <> 13 Then
                        c = Convert.ToChar(byt)
                        If Not havesemi Then
                            cnm.Append(c)
                        Else
                            valS.Append(c)
                        End If
                    Else
                        Me.procCT += 1
                        Dim val As Integer = Integer.Parse(valS.ToString) 'treat val as integer, period was stripped
                        Dim key As String = cnm.ToString 'the key
                        Dim CD As CityData
                        If Me.myDict.ContainsKey(key) Then
                            CD = Me.myDict(key)
                            CD.Count += 1
                            CD.Total += val
                            If val > CD._Max Then CD._Max = val
                            If val < CD._Min Then CD._Min = val
                        Else
                            CD = New CityData(key, val)
                            Me.myDict.Add(key, CD)
                        End If
                        cnm.Length = 0
                        valS.Length = 0
                        havesemi = False
                    End If
                Next
                Me.accessor.Dispose()
            End Sub
        End Class
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  14. #14

    Thread Starter
    Powered By Medtronic dbasnett's Avatar
    Join Date
    Dec 2007
    Location
    Jefferson City, MO
    Posts
    9,776

    Re: Billion Row Challange

    I'm attaching my final .Net 8 version. VS2022, 64 bit. When ran by itself in the IDE it is consistently below 16 seconds. If I ever need to read a billion line text file I now know how. Been fun.

    Thanks for the help.

    BillRowFin.zip
    My First Computer -- Documentation Link (RT?M) -- Using the Debugger -- Prime Number Sieve
    Counting Bits -- Subnet Calculator -- UI Guidelines -- >> SerialPort Answer <<

    "Those who use Application.DoEvents have no idea what it does and those who know what it does never use it." John Wein

  15. #15
    Fanatic Member
    Join Date
    Jun 2019
    Posts
    564

    Re: [RESOLVED] Billion Row Challange

    Compile and run it in Release mode and you will see better results.

  16. #16
    PowerPoster PlausiblyDamp's Avatar
    Join Date
    Dec 2016
    Location
    Pontypool, Wales
    Posts
    2,547

    Re: [RESOLVED] Billion Row Challange

    Quote Originally Posted by peterst View Post
    Compile and run it in Release mode and you will see better results.
    Some of the optimisations added to .Net 7 and .Net 8 are getting almost unbelievable. Things like Dynamic PGO and On-Stack Replacement can make such a difference when compiled in release mode. Given how much bad press LinQ gets in terms of performance it is quite refreshing to see some of the fastest entries even using LinQ as part of the solution.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width