|
-
May 8th, 2024, 12:12 PM
#1
[RESOLVED]One Billion Row Challange
Here was the original post. Since I don't do Java unless it is liquid I decided to give it a try in VB. I didn't expect to get near the published times that were shown, 1.5 seconds .
What followed was humbling. After creating the text file with the billion rows I did some rough timings. Using a StreamReader:
- reading 512MB blocks until end of stream took 18 seconds
- reading one line at a time until end of stream took 36 seconds
That was the best I could hope for. The bottom line doing all the required computation is 1.7 minutes on my PC. Where did I go wrong?????
I'm attaching the project I created for this. It will create the billion row file if you click the create button. The creation alone takes 3 minutes on my PC. The city names were from a list of names I had in Missouri.
The uploaded project has no bin or obj folder.
BillRowChal.zip
Edit: FWIW I did things for the sake of speed I would never actually do.
Last edited by dbasnett; May 15th, 2024 at 11:29 AM.
-
May 8th, 2024, 01:28 PM
#2
Re: Billion Row Challange
I don't wanna be the smartarse here but doing this on a standalone not used server is not a life time example.
What need to be done is a performance test and traffic simulation with p.e. a tool like JMeter Load Testing tool , this way you will get a glimpse of what will go on when it's live
else you are just relying on the different hardware everyone posses when running.
Anyhow for the sake of a test example I guess all the above is not relevant but taken in consideration if for example 100 users try to do the same thing on the server (also I'm guessing you should not process row 1bil rows in a live environment, except if you have a server at CERN or something :P ) and see the results there.
Also this should probably be better in codebank.
ἄνδρα μοι ἔννεπε, μοῦσα, πολύτροπον, ὃς μάλα πολλὰ
πλάγχθη, ἐπεὶ Τροίης ἱερὸν πτολίεθρον ἔπερσεν·
-
May 8th, 2024, 01:35 PM
#3
Re: Billion Row Challange
 Originally Posted by sapator
I don't wanna be the smartarse here but doing this on a standalone not used server is not a life time example.... Also this should probably be better in codebank.
If you run the project and do the create, and the basic read some numbers will be shown that give you an idea of the relative performance between your machine and mine. The reason I didn't put this in the CodeBank was because of the performance difference between my effort(1.7 minutes) and the best Java effort(1.5 seconds). Obviously something I'm not doing correctly.
-
May 9th, 2024, 06:05 PM
#4
Re: Billion Row Challange
 Originally Posted by dbasnett
If you run the project and do the create, and the basic read some numbers will be shown that give you an idea of the relative performance between your machine and mine. The reason I didn't put this in the CodeBank was because of the performance difference between my effort(1.7 minutes) and the best Java effort(1.5 seconds). Obviously something I'm not doing correctly.
I haven't had time to look at your code properly yet, but you might consider moving to net 8 instead of staying on framework 4.8.1 - when I ran your app here it took 206.2 seconds to run your solution. Converting to .net 8 and trying again took 150.3 seconds! So there is a fair performance win by upgrading your version of .net alone!
Last edited by PlausiblyDamp; May 9th, 2024 at 06:12 PM.
-
May 10th, 2024, 08:03 AM
#5
Re: Billion Row Challange
Interesting... I'll have to take a look at this later when I have more time.
-tg
-
May 10th, 2024, 09:27 AM
#6
Re: Billion Row Challange
 Originally Posted by PlausiblyDamp
I haven't had time to look at your code properly yet, but you might consider moving to net 8 instead of staying on framework 4.8.1 - when I ran your app here it took 206.2 seconds to run your solution. Converting to .net 8 and trying again took 150.3 seconds! So there is a fair performance win by upgrading your version of .net alone!
From 1.7 minutes to < 1 minute under .Net 8. WOW.
edit: The performance seems to be IO. The simple read test was 2.5 times faster.
Last edited by dbasnett; May 10th, 2024 at 10:06 AM.
-
May 10th, 2024, 11:10 AM
#7
Re: Billion Row Challange
The performance seems to be IO
You mean console output? Try re-directing stdout to nul
All advice is offered in good faith only. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
May 10th, 2024, 01:56 PM
#8
Re: Billion Row Challange
 Originally Posted by 2kaud
You mean console output? Try re-directing stdout to nul
No. IO to the billion row text file appears a lot faster.
Last edited by dbasnett; May 15th, 2024 at 10:36 AM.
-
May 12th, 2024, 03:44 AM
#9
Re: Billion Row Challange
@dbasnett… I’d be interested to see the Java code that you originally found…
Edit… I found your link now. I’ve never noticed Java being significantly faster than .Net generally.
Last edited by .paul.; May 12th, 2024 at 03:54 AM.
- Coding Examples:
- Features:
- Online Games:
- Compiled Games:
-
May 12th, 2024, 04:11 AM
#10
Re: Billion Row Challange
There is sample input weather_stations.csv file in https://github.com/gunnarmorling/1brc/tree/main/data
Is this data loaded for 18 seconds?
Btw, you can check out some C# implementations too like https://github.com/buybackoff/1brc -- this one is clocked at ~2 sec or https://github.com/noahfalk/1brc/ -- even faster one.
Of course it's ugly, unreadable source, gory details all over it -- that's what happens when you have to smash these high-level languages down to the metal.
Here are a lot of optimizations explained: https://hotforknowledge.com/2024/01/...ation-journey/
cheers,
</wqw>
Last edited by wqweto; May 12th, 2024 at 04:18 AM.
-
May 12th, 2024, 08:40 AM
#11
Re: Billion Row Challange
I stumbled across those myself, amazed at just how much you can push c# when you need to.
-
May 13th, 2024, 10:56 AM
#12
Re: Billion Row Challange
Under .Net 8 using memory mapped file I'm under 25 seconds. Here is the code and associated class. It took a little effort to get the file segments to line up.
Code:
Private Sub SolutionMMF()
' < 23 seconds
Dim numTasks As Integer = 0 'number of tasks
Select Case NumProcessor 'guessing
Case Is <= 4
numTasks = 2
Case Is <= 8
numTasks = 4
Case Is <= 16
numTasks = 10
Case Is <= 32 'my PC
numTasks = 26
Case Else
numTasks = NumProcessor - 6
End Select
Dim lenData As Long = New IO.FileInfo(DataPath).Length
Dim lenSeg As Long = lenData \ numTasks 'default segment length
Dim lenSeg1 As Long = (lenData - (lenSeg * numTasks)) + lenSeg 'first segment length
Dim DAGS As New List(Of DoAGroupMMF)
Using MMF As IO.MemoryMappedFiles.MemoryMappedFile = IO.MemoryMappedFiles.MemoryMappedFile.CreateFromFile(DataPath,
IO.FileMode.Open,
Nothing,
lenData,
IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
Dim offset As Long = 0L
Dim Len As Long = lenSeg1
Using accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor = MMF.CreateViewAccessor(0L, 0L,
IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
For x As Integer = 1 To numTasks
Dim byt As Byte = Nothing
Do
'position so the segment ends on CR
If (offset + Len) < lenData Then
byt = accessor.ReadByte(offset + Len)
Len += 1L
If byt = 13 Then
Exit Do
End If
Else
Exit Do
End If
Loop
Dim DAG As New DoAGroupMMF(MMF, offset, Len)
DAGS.Add(DAG)
offset += Len 'adjust offset
Len = lenSeg 'computed length
Next
End Using
End Using
Dim taskList As New List(Of Task)
Dim results As New List(Of Dictionary(Of String, CityData))
For Each ag As DoAGroupMMF In DAGS
taskList.Add(ag.this)
results.Add(ag.myDict)
Next
Task.WaitAll(taskList.ToArray)
'consolidate results
DictA = results(0)
For Tidx As Integer = 1 To numTasks - 1
Dim d As Dictionary(Of String, CityData) = results(Tidx)
For Each kvp As KeyValuePair(Of String, CityData) In d
Dim cd As CityData = kvp.Value
Dim cdD As CityData = DictA(kvp.Key)
cdD.Count += cd.Count
cdD.Total += cd.Total
Next
Next
STPW.Stop()
CommonResults()
End Sub
and the class
Code:
Private Class DoAGroupMMF
Public this As Task
Public myDict As New Dictionary(Of String, CityData)(512)
Public accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor
Public Offs As Long = 0L
Public Len As Long = 0L
Public procCT As Integer = 0
Public Sub New(MMF As IO.MemoryMappedFiles.MemoryMappedFile, Offset As Long, Length As Long)
Me.accessor = MMF.CreateViewAccessor(0L, 0L,
IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
Me.Offs = Offset
Me.Len = Length
Me.this = Task.Run(Sub() Me.DoWork())
End Sub
Public Sub DoWork()
Dim sb As New System.Text.StringBuilder(128)
For idx As Long = 0L To Me.Len - 1L
Dim byt As Byte = Me.accessor.ReadByte(Me.Offs + idx) 'ReadChar didn't work
If byt = 0 Then
Continue For
End If
Dim c As Char = Convert.ToChar(byt)
If c = ControlChars.Cr Then
procCT += 1
Dim message As String = sb.ToString
sb.Length = 0
Dim parts() As String = message.Split(";"c) 'split the message
Dim val As Integer = Integer.Parse(parts(1).Replace(".", "")) 'treat val as integer
Dim key As String = parts(0)
Dim CD As CityData
If Me.myDict.ContainsKey(key) Then
CD = Me.myDict(key)
CD.Count += 1
CD.Total += val
If val > CD._Max Then
CD._Max = val
ElseIf val < CD._Min Then
CD._Min = val
End If
Else
CD = New CityData(key, val)
Me.myDict.Add(key, CD)
End If
Else
sb.Append(c)
End If
Next
Me.accessor.Dispose()
End Sub
End Class
Some of the other 'tricks' mentioned I didn't understand.
Last edited by dbasnett; May 13th, 2024 at 11:29 AM.
-
May 14th, 2024, 09:32 AM
#13
Re: Billion Row Challange
By getting rid of the split and the period replacement I'm under 17 seconds.
Code:
Private Class DoAGroupMMF
Public this As Task
Public myDict As New Dictionary(Of String, CityData)(512)
Public accessor As IO.MemoryMappedFiles.MemoryMappedViewAccessor
Public Offs As Long = 0L
Public Len As Long = 0L
Public procCT As Integer = 0
Public Sub New(MMF As IO.MemoryMappedFiles.MemoryMappedFile,
Offset As Long,
Length As Long)
Me.accessor = MMF.CreateViewAccessor(0L, 0L,
IO.MemoryMappedFiles.MemoryMappedFileAccess.Read)
Me.Offs = Offset
Me.Len = Length
Me.this = Task.Run(Sub() Me.DoWork())
End Sub
Public Sub DoWork()
Dim cnm As New System.Text.StringBuilder(128) 'the key, city name
Dim valS As New System.Text.StringBuilder(32) 'the integer value
Dim havesemi As Boolean = False 'have semicolon
Dim byt As Byte
Dim c As Char
For idx As Long = 0L To Me.Len - 1L
byt = Me.accessor.ReadByte(Me.Offs + idx) 'ReadChar didn't work
If byt = 59 Then 'semi colon
havesemi = True 'strip semi
ElseIf byt = 46 AndAlso havesemi Then
'strip period after semi
ElseIf byt <> 13 Then
c = Convert.ToChar(byt)
If Not havesemi Then
cnm.Append(c)
Else
valS.Append(c)
End If
Else
Me.procCT += 1
Dim val As Integer = Integer.Parse(valS.ToString) 'treat val as integer, period was stripped
Dim key As String = cnm.ToString 'the key
Dim CD As CityData
If Me.myDict.ContainsKey(key) Then
CD = Me.myDict(key)
CD.Count += 1
CD.Total += val
If val > CD._Max Then CD._Max = val
If val < CD._Min Then CD._Min = val
Else
CD = New CityData(key, val)
Me.myDict.Add(key, CD)
End If
cnm.Length = 0
valS.Length = 0
havesemi = False
End If
Next
Me.accessor.Dispose()
End Sub
End Class
-
May 15th, 2024, 10:42 AM
#14
Re: Billion Row Challange
I'm attaching my final .Net 8 version. VS2022, 64 bit. When ran by itself in the IDE it is consistently below 16 seconds. If I ever need to read a billion line text file I now know how. Been fun.
Thanks for the help.
BillRowFin.zip
-
May 15th, 2024, 11:06 AM
#15
Re: [RESOLVED] Billion Row Challange
Compile and run it in Release mode and you will see better results.
-
May 15th, 2024, 02:12 PM
#16
Re: [RESOLVED] Billion Row Challange
 Originally Posted by peterst
Compile and run it in Release mode and you will see better results.
Some of the optimisations added to .Net 7 and .Net 8 are getting almost unbelievable. Things like Dynamic PGO and On-Stack Replacement can make such a difference when compiled in release mode. Given how much bad press LinQ gets in terms of performance it is quite refreshing to see some of the fastest entries even using LinQ as part of the solution.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|