This code tackles it twice, once the "long way" for ilustration:
Code:
Option Explicit
'Implementing the Equivalent of a FULL OUTER JOIN in Microsoft Jet SQL.
Private Sub Main()
'We'll do our work in App.Path:
ChDir App.Path
ChDrive App.Path
'Clean up from any prior test run:
On Error Resume Next
Kill "inner.txt"
Kill "left.txt"
Kill "right.txt"
Kill "c4steps.txt"
Kill "c.txt"
Kill "schema.ini"
On Error GoTo 0
With New ADODB.Connection
.Open "Provider=Microsoft.Jet.OLEDB.4.0;" _
& "Data Source='.';" _
& "Extended Properties='Text;Hdr=No'"
'Do it in 4 steps for illustration:
.Execute "SELECT [A].*, [B].[F2], [B].[F3], [B].[F4] " _
& "INTO [inner.txt] FROM " _
& "[a.txt] [A] INNER JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1]", _
, _
adCmdText Or adExecuteNoRecords
.Execute "SELECT [A].*, 0 AS [B_F2], 0 AS [B_F3], 0 AS [B_F4] " _
& "INTO [left.txt] FROM " _
& "[a.txt] [A] LEFT JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1] " _
& "WHERE [B].[F1] IS NULL", _
, _
adCmdText Or adExecuteNoRecords
.Execute "SELECT [B].[F1], 0 AS [A_F2], 0 AS [A_F3], 0 AS [A_F4], " _
& "[B].[F2], [B].[F3], [B].[F4] " _
& "INTO [right.txt] FROM " _
& "[a.txt] [A] RIGHT JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1] " _
& "WHERE [A].[F1] IS NULL", _
, _
adCmdText Or adExecuteNoRecords
.Execute "SELECT * " _
& "INTO [c4steps.txt] FROM (" _
& "SELECT * FROM [inner.txt] UNION ALL " _
& "SELECT * FROM [left.txt] UNION ALL " _
& "SELECT * FROM [right.txt]) " _
& "ORDER BY [F1]", _
, _
adCmdText Or adExecuteNoRecords
'Do it all in one go:
.Execute "SELECT * " _
& "INTO [c.txt] FROM (" _
& "SELECT [A].*, [B].[F2], [B].[F3], [B].[F4] " _
& "FROM [a.txt] [A] INNER JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1] UNION ALL " _
& "SELECT [A].*, 0 AS [B_F2], 0 AS [B_F3], 0 AS [B_F4] " _
& "FROM [a.txt] [A] LEFT JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1] " _
& "WHERE [B].[F1] IS NULL UNION ALL " _
& "SELECT [B].[F1], 0 AS [A_F2], 0 AS [A_F3], 0 AS [A_F4], " _
& "[B].[F2], [B].[F3], [B].[F4] " _
& "FROM [a.txt] [A] RIGHT JOIN [b.txt] [B] ON " _
& "[A].[F1] = [B].[F1] " _
& "WHERE [A].[F1] IS NULL) " _
& "ORDER BY [F1]", _
, _
adCmdText Or adExecuteNoRecords
.Close
End With
MsgBox "Done"
End Sub
My question is: Do we have another way to do this? Perhaps using some other preinstalled library, manually via brute force VB6 code, or with the aid of some 3rd party library for acting on tabular text data?
I was thinking of the free Log Parser 2.2 from Microsoft but it doesn't do joins of any kind.
My first thought would be execute a left join first then execute a right join using a where clause that filters it to get only the items where there is no match in the other table/file then combine the results.
Edit: I have not tried it of course but seems like it would do the trick?
Here is a solution (at least returns same results as your project). It is based off this wikipedia post.
Using sSQL to organize my thoughts
Code:
sSQL = "Select [A].*, [B].[F2],[B].[F3],[B].[F4] FROM [a.txt] AS [A] " & _
"INNER JOIN [b.txt] As [B] ON [B].[F1]=[A].[F1] "
sSQL = sSQL & "UNION ALL Select [A1].*, [B1].[F2],[B1].[F3],[B1].[F4] " & _
"FROM [a.txt] [A1] LEFT OUTER JOIN [b.txt] AS [B1] ON [B1].[F1]=[A1].[F1] " & _
"WHERE [A1].[F1]=[B1].[F1] AND [B1].[F1] IS NULL "
sSQL = sSQL & "UNION ALL Select [A2].*, [B2].[F2],[B2].[F3],[B2].[F4] " & _
"FROM [a.txt] [A2] LEFT OUTER JOIN [b.txt] AS [B2] ON [B2].[F1]=[A2].[F1] " & _
"WHERE [A2].[F1]=[B2].[F1] AND [A2].[F1] IS NULL"
sSQL = "SELECT foj.* INTO [inner.txt] FROM (" & sSQL & ") As foj "
Now execute that SQL statement.
Edited: Ignore above. I misread your code and tried to match inner text. Oops. I think a similar solution can be made to match your c.txt. If you don't find it in a bit, I'll look again.
Last edited by LaVolpe; Sep 20th, 2017 at 04:07 PM.
Insomnia is just a byproduct of, "It can't be done"
I know you're no fan of SQLite and vbRichClient5, but I did manage to meet your data requirements using a "simulated" full outer join in SQLite as follows:
Code:
Sub TestJoin()
Dim lo_Cnn As vbRichClient5.cConnection
Dim lo_Rs As vbRichClient5.cRecordset
Dim ii As Long
' Populate sample tables and data....you could slurp this information from a file into the DB of course.
' RC5 has a CSV parsing class, StringBuilder class, and FSO class that can make this process easier.
Set lo_Cnn = New_c.Connection(":memory:", DBCreateInMemory)
lo_Cnn.Execute "CREATE TABLE table_a (year INTEGER, a INTEGER DEFAULT 0, b INTEGER DEFAULT 0, c INTEGER DEFAULT 0)"
lo_Cnn.Execute "CREATE TABLE table_b (year INTEGER, a INTEGER DEFAULT 0, b INTEGER DEFAULT 0, c INTEGER DEFAULT 0)"
lo_Cnn.Execute "INSERT INTO table_a (year, a, b, c) VALUES (1901, 1,2,3)"
lo_Cnn.Execute "INSERT INTO table_a (year, a, b, c) VALUES (1902, 4,5,6)"
lo_Cnn.Execute "INSERT INTO table_a (year, a, b, c) VALUES (1904, 7,8,9)"
lo_Cnn.Execute "INSERT INTO table_a (year, a, b, c) VALUES (1906, 10,11,12)"
lo_Cnn.Execute "INSERT INTO table_a (year, a, b, c) VALUES (1908, 13,14,15)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1901, 101,102,103)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1902, 104,105,106)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1903, 107,108,109)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1904, 110,111,112)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1905, 113,114,115)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1906, 116,117,118)"
lo_Cnn.Execute "INSERT INTO table_b (year, a, b, c) VALUES (1908, 119,120,121)"
' This is the SQL statement that joins the data
Set lo_Rs = lo_Cnn.OpenRecordset("SELECT table_a.year, ifnull(table_a.a, 0), ifnull(table_a.b, 0), ifnull(table_a.c, 0), ifnull(table_b.a, 0), ifnull(table_b.b, 0), ifnull(table_b.c, 0) FROM table_a " & _
"LEFT JOIN table_b USING(year) UNION ALL " & _
"SELECT table_b.year, ifnull(table_a.a, 0), ifnull(table_a.b, 0), ifnull(table_a.c, 0), ifnull(table_b.a, 0), ifnull(table_b.b, 0), ifnull(table_b.c, 0) FROM table_b " & _
"LEFT JOIN table_a USING(year) WHERE table_a.year IS NULL ORDER BY year ASC")
' Output the joined data
Do While Not lo_Rs.EOF
For ii = 0 To lo_Rs.Fields.Count - 1
Debug.Print lo_Rs.Fields(ii).Value & vbTab;
Next ii
Debug.Print vbNewLine
lo_Rs.MoveNext
Loop
End Sub
In case the Data-Fields (besides the Year-Column) are all Numbers, you could do a unified Group By Query (with a Sum-Aggregate on all Data-Columns) like this:
Code:
Select Year, Sum(F1) As F1, Sum(F2) As F2, Sum(F3) As F3, Sum(F4) As F4, Sum(F5) As F5, Sum(F6) As F6 From (
Select Year, F1, F2, F3, 0 As F4, 0 As F5, 0 As F6 From A
Union All
Select Year, 0 As F1, 0 As F2, 0 As F3, F1 As F4, F2 As F5, F3 As F6 From B
) Group By Year
If the CSV-Tables A and B are huge ones, SQLite can act as a fast Import-Target, which then allows also
faster Queries on its (temporarily created) InMemory or (if the import-data is >500MB) also FileBased-DB-Tables.
Here's a Demo (needs a Project-Reference to vbRIchClient5 and also an MSHFlexGrid1 on the Form):
Code:
Option Explicit
Implements ICSVCallback
Private Cnn As cConnection, CSV As cCSV, Cmd As cCommand
Private Sub Form_Load()
Set Cnn = New_c.Connection(, DBCreateInMemory)
Cnn.Execute "Create Table A(Year Integer Primary Key, F1 Integer, F2 Integer, F3 Integer)"
Cnn.Execute "Create Table B(Year Integer Primary Key, F1 Integer, F2 Integer, F3 Integer)"
With New_c.StringBuilder
.Clear
.AppendNL "1901,1,1,1"
.AppendNL "1902,2,2,2"
.AppendNL "1904,4,4,4"
.AppendNL "1906,6,6,6"
.AppendNL "1908,8,8,8"
ImportToTable "A", .ToUTF8
.Clear
.AppendNL "1901,11,11,11"
.AppendNL "1902,12,12,12"
.AppendNL "1903,13,13,13"
.AppendNL "1904,14,14,14"
.AppendNL "1905,15,15,15"
.AppendNL "1906,16,16,16"
ImportToTable "B", .ToUTF8
.Clear
.AppendNL "Select A.Year, A.F1,A.F2,A.F3, B.F1 F4,B.F2 F5,B.F3 F6 From A Left Outer Join B Using(Year)"
.AppendNL "Union"
.AppendNL "Select B.Year, A.F1,A.F2,A.F3, B.F1 F4,B.F2 F5,B.F3 F6 From B Left Outer Join A Using(Year)"
Set MSHFlexGrid1.DataSource = Cnn.OpenRecordset(.ToString).DataSource
End With
End Sub
Sub ImportToTable(TableName As String, CSVContent() As Byte)
Cnn.BeginTrans
Set Cmd = Cnn.CreateCommand("Insert Into " & TableName & " Values(?,?,?,?)")
Set CSV = New_c.CSV
CSV.ParseBytes CSVContent, Me
Cnn.CommitTrans
End Sub
Private Function ICSVCallback_NewValue(ByVal RowNr As Long, ByVal ColNr As Long, B() As Byte, ByVal BValStartPos As Long, ByVal BValLen As Long) As Long
Cmd.SetInt32 ColNr + 1, CSV.parseNumber(B, BValStartPos, BValLen)
If ColNr = 3 Then Cmd.Execute
End Function
Here's the above Demos Output-Result (using the example-data you have given):
HTH
Olaf
Last edited by Schmidt; Sep 20th, 2017 at 04:22 PM.
This is revised from post #2. It produces same results as your c.txt file and runs significantly faster than the multi-file solution in your zip. If useful, great. If not, no biggie.
Code:
sSQL = "Select [A].*, [B].[F2],[B].[F3],[B].[F4] " & _
"FROM [a.txt] AS [A] INNER JOIN [b.txt] As [B] ON [B].[F1]=[A].[F1] "
sSQL = sSQL & "UNION ALL Select [A1].*, 0 As [F2],0 As [F3],0 As [F4] " & _
"FROM [a.txt] [A1] LEFT OUTER JOIN [b.txt] AS [B1] ON [B1].[F1]=[A1].[F1] " & _
"WHERE [B1].[F1] IS NULL "
sSQL = sSQL & "UNION ALL Select [B2].[F1] ,0 As [F2],0 As [F3],0 As [F4], [B2].[F2],[B2].[F3],[B2].[F4] " & _
"FROM [a.txt] [A2] RIGHT OUTER JOIN [b.txt] AS [B2] ON [B2].[F1]=[A2].[F1] " & _
"WHERE [A2].[F1] IS NULL"
sSQL = "SELECT foj.* INTO [C.txt] FROM (" & sSQL & ") As foj ORDER BY F1"
Edited: The downside to this approach is that you have to pre-fill the null columns which means you need to know in advance or build a routine to discover how many there will be in both source files.
Last edited by LaVolpe; Sep 20th, 2017 at 04:31 PM.
Insomnia is just a byproduct of, "It can't be done"
The multiple query approach was there for illustration. It is followed by a single compound query.
I guess what I was hoping to find is something more specifically designed to work with tabular data text files that does this kind of "combination."
I suppose you could build something around the Collection class as well... ok, yes that seems to work:
Code:
Option Explicit
Private Sub Main()
Dim Collection As Collection
Dim F As Integer
Dim Y As Integer, N1 As Integer, N2 As Integer, N3 As Integer
Dim Item As Variant
'We'll do our work in App.Path:
ChDir App.Path
ChDrive App.Path
Set Collection = New Collection
F = FreeFile(0)
Open "a.txt" For Input As #F
Do Until EOF(F)
Input #F, Y, N1, N2, N3
InsertInOrder Collection, Array(Y, N1, N2, N3, 0, 0, 0)
Loop
Close #F
F = FreeFile(0)
Open "b.txt" For Input As #F
Do Until EOF(F)
Input #F, Y, N1, N2, N3
JoinInOrder Collection, Array(Y, 0, 0, 0, N1, N2, N3)
Loop
Close #F
F = FreeFile(0)
Open "c.txt" For Output As #F
For Each Item In Collection
Write #F, Item(0), Item(1), Item(2), Item(3), Item(4), Item(5), Item(6)
Next
Close #F
MsgBox "Done"
End Sub
Code:
Option Explicit
Private Const KEY_INDEX = 0
Public Sub InsertInOrder( _
ByVal Collection As Collection, _
ByRef NewItem As Variant)
'NewItem(0) is the key field.
Dim Min As Long, Mid As Long, Max As Long
Dim NewKey As Variant, TestKey As Variant
Dim CompareStrings As Boolean, Compare As Integer
With Collection
If .Count = 0 Then
.Add NewItem
Exit Sub
End If
Min = 0
Max = .Count - 1
NewKey = NewItem(KEY_INDEX)
CompareStrings = VarType(NewKey) = vbString
Do Until Min > Max
Mid = (Min + Max) \ 2
TestKey = .Item(Mid + 1)(KEY_INDEX)
If CompareStrings Then
Compare = StrComp(TestKey, NewKey, vbBinaryCompare)
Else
Compare = Sgn(TestKey - NewKey)
End If
Select Case Compare
Case Is < 0
Min = Mid + 1
Case Is > 0
Max = Mid - 1
Case 0
.Add NewItem, , , Mid + 1
Exit Sub
End Select
Loop
If Min = Mid Then
.Add NewItem, , Mid + 1
Else
.Add NewItem, , , Mid + 1
End If
End With
End Sub
Public Sub JoinInOrder( _
ByVal Collection As Collection, _
ByRef NewItem As Variant)
'NewItem(0) is the key field.
Dim Min As Long, Mid As Long, Max As Long
Dim NewKey As Variant, TestItem As Variant, TestKey As Variant
Dim CompareStrings As Boolean, Compare As Integer
With Collection
If .Count = 0 Then
.Add NewItem
Exit Sub
End If
Min = 0
Max = .Count - 1
NewKey = NewItem(KEY_INDEX)
CompareStrings = VarType(NewKey) = vbString
Do Until Min > Max
Mid = (Min + Max) \ 2
TestItem = .Item(Mid + 1)
TestKey = TestItem(KEY_INDEX)
If CompareStrings Then
Compare = StrComp(TestKey, NewKey, vbBinaryCompare)
Else
Compare = Sgn(TestKey - NewKey)
End If
Select Case Compare
Case Is < 0
Min = Mid + 1
Case Is > 0
Max = Mid - 1
Case 0
TestItem(4) = NewItem(4)
TestItem(5) = NewItem(5)
TestItem(6) = NewItem(6)
.Remove Mid + 1
If Mid + 1 = .Count Then
.Add TestItem
Else
.Add TestItem, , Mid + 1
End If
Exit Sub
End Select
Loop
If Min = Mid Then
.Add NewItem, , Mid + 1
Else
.Add NewItem, , , Mid + 1
End If
End With
End Sub
Found and fixed a bug. Shows that this approach is perhaps even trickier to get right than a complex SQL query.
Added slightly larger test inputs.
Last edited by dilettante; Sep 20th, 2017 at 06:42 PM.
Not sure, for what scenario the whole thing should be optimized.
Is it "more speed, when the Files get larger" or is it "sparse, easy to write code"?
FWIW, I've made a Comparison-Zip with the current approaches which came in so far (including a new, RC5-SortedDictionary one): OuterJoinComparisons.zip
There's 5 variants (each of them reading from the same a- and b.txt ... and then writing c.txt as the output-result,
c.txt is Killed before each of the test-runs - and a CheckSum of the new generated c.txt of each approach is calculated)
Here's the results (measured on a native compiled binary, but the IDE-results don't differ much):
I changed the Files to ; instead of ,
in Post#1 you said CSV
Edit revised Code
Code:
Dim Cn As ADODB.Connection
Private Sub cmdExecute_Click()
Command1.Value = True 'create Table
Command2.Value = True 'insert File a.txt
Command3.Value = True 'insert File b.txt
Command6.Value = True 'create Helper Table
Command4.Value = True 'Update Value part1
Command5.Value = True 'update Value part2
Command7.Value = True 'insert Sorted data to new Table
End Sub
Private Sub Command1_Click()
Dim sSql As String
Dim s() As String
Dim TableName As String
TableName = "tbl_Test"
On Error Resume Next
sSql = "Drop Table " & TableName
Cn.Execute sSql
Err.Clear
On Error GoTo 0
'with DEFAULT 0
ReDim s(11)
s(0) = "Create Table " & TableName
s(1) = "(m_ID Int Identity"
s(1) = "(m_ID Int Identity Primary Key" 'PrimärKey
s(2) = ",m_Year Int"
s(3) = ",m_atxt1 Int DEFAULT 0 "
s(4) = ",m_atxt2 Int DEFAULT 0"
s(5) = ",m_atxt3 Int DEFAULT 0"
s(6) = ",m_btxt1 Int DEFAULT 0 "
s(7) = ",m_btxt2 Int DEFAULT 0"
s(8) = ",m_btxt3 Int DEFAULT 0)"
'without DEFAULT
' ReDim s(11)
' s(0) = "Create Table " & TableName
' s(1) = "(m_ID Int Identity"
' ' s(1) = "(m_ID Int Identity Primary Key" 'zusätzlich PrimärKey
' s(2) = ",m_Year Int"
' s(3) = ",m_atxt1 Int"
' s(4) = ",m_atxt2 Int"
' s(5) = ",m_atxt3 Int"
' s(6) = ",m_btxt1 Int"
' s(7) = ",m_btxt2 Int"
' s(8) = ",m_btxt3 Int)"
sSql = Join(s, " ")
Cn.Execute sSql
End Sub
Private Sub Command2_Click()
Dim sSql As String
'first a.txt:
sSql = "INSERT INTO [tbl_Test] (m_Year,m_atxt1,m_atxt2,m_atxt3)"
sSql = sSql & "SELECT F1 ,F2,F3,F4 FROM [Text;DATABASE=C:\Text_Test;HDR=No].[a.txt]"
Cn.Execute sSql
End Sub
Private Sub Command3_Click()
Dim sSql As String
'now b.txt
sSql = "INSERT INTO [tbl_Test] (m_Year,m_btxt1,m_btxt2,m_btxt3)"
sSql = sSql & " SELECT F1 ,F2,F3,F4 FROM [Text;DATABASE=C:\Text_Test;HDR=No].[b.txt]"
Cn.Execute sSql
End Sub
Private Sub Command6_Click()
Dim sSql As String
Dim TableName As String
TableName = "tbl_Dup"
On Error Resume Next
sSql = "Drop Table " & TableName
Cn.Execute sSql
Err.Clear
On Error GoTo 0
sSql = "SELECT tbl_Test.m_Year, tbl_Test.m_atxt1, tbl_Test.m_atxt2, "
sSql = sSql & "tbl_Test.m_atxt3, tbl_Test.m_btxt1, tbl_Test.m_btxt2, "
sSql = sSql & "tbl_Test.m_btxt3 INTO tbl_Dup FROM tbl_Test;"
Cn.Execute sSql
End Sub
Private Sub Command4_Click()
'Update the Values
Dim sSql As String
sSql = "UPDATE tbl_Dup LEFT JOIN tbl_Test ON tbl_Dup.m_Year "
sSql = sSql & "= tbl_Test.m_Year SET tbl_Test.m_btxt1 = "
sSql = sSql & "[tbl_Dup].[m_btxt1], tbl_Test.m_btxt2 = [tbl_Dup].[m_btxt2], "
sSql = sSql & "tbl_Test.m_btxt3 = [tbl_Dup].[m_btxt3]"
sSql = sSql & " WHERE (((tbl_Test.m_btxt1) Like 0) AND "
sSql = sSql & "((tbl_Test.m_btxt2) Like 0) AND ((tbl_Test.m_btxt3) Like 0) "
sSql = sSql & " AND ((tbl_Test.m_Year) Like [tbl_Dup].[m_Year]));"
Cn.Execute sSql
End Sub
Private Sub Command5_Click()
'Update
Dim sSql As String
sSql = "UPDATE tbl_Dup RIGHT JOIN tbl_Test ON tbl_Dup.m_Year "
sSql = sSql & "= tbl_Test.m_Year SET tbl_Test.m_atxt1 = "
sSql = sSql & "[tbl_Dup].[m_atxt1], tbl_Test.m_atxt2 = [tbl_Dup].[m_atxt2],"
sSql = sSql & "tbl_Test.m_atxt3 = [tbl_Dup].[m_atxt3]"
sSql = sSql & "WHERE (((tbl_Test.m_atxt1) Like 0) AND "
sSql = sSql & "((tbl_Test.m_atxt2) Like 0) AND ((tbl_Test.m_atxt3) Like 0) "
sSql = sSql & "AND ((tbl_Test.m_Year) Like [tbl_Dup].[m_Year]));"
Cn.Execute sSql
End Sub
Private Sub Command7_Click()
'Last step sort the Data into new Table
Dim sSql As String
Dim TableName As String
TableName = "tbl_Sorted"
On Error Resume Next
sSql = "Drop Table " & TableName
Cn.Execute sSql
Err.Clear
On Error GoTo 0
sSql = "SELECT DISTINCT tbl_Test.m_Year, tbl_Test.m_atxt1, tbl_Test.m_atxt2, "
sSql = sSql & "tbl_Test.m_atxt3, tbl_Test.m_btxt1, tbl_Test.m_btxt2, tbl_Test.m_btxt3"
sSql = sSql & " INTO tbl_Sorted FROM tbl_Test;"
Cn.Execute sSql
End Sub
Public Function openTheDatabase() As Boolean
'-- DB öffnen
Dim sConnectionString As String
On Error GoTo dbError
'-- New connection --
Set Cn = New ADODB.Connection
sConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" _
& "Data Source=" & App.Path & "\db4.mdb"
Cn.Open sConnectionString
openTheDatabase = True
Exit Function
dbError:
MsgBox (Err.Description)
openTheDatabase = False
End Function
Private Sub Form_Load()
If (Not openTheDatabase()) Then
MsgBox "Datenbank fehler..."
Else
MsgBox "OK"
Exit Sub
End If
End Sub
regards
Chris
Last edited by ChrisE; Sep 21st, 2017 at 08:06 AM.
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.
I'll mark this resolved. It wasn't for any one particular case:
The files might be large but more often fairly small like 1000 lines of data or less. There might be just the two files A and B or Left and Right, or there might be 3 files (adding a Middle). Or there might be 4 or even 5 files.
The data might always be comma delimited, either following the locale or always using the invariant locale like most modern text file formats do. Or it might sometimes be Tab delimited.
The files might even have different numbers and types of columns outside of the matching key column.
It sounds like we have a few ways to go about this.
If the inputs all have the same format a Full Outer Join works - as long as we have one. When we don't we can simulate one by compounding other Joins.
Or we can just code something up to do the same thing.
To handle more exotic cases where inputs have varying formats we'd have to code something up. If there might be several inputs instead of just two files this may be a better option than using a SQL database engine too.
I was hoping somebody knew of something like Log Parser that can do this. Log Parser (2.2 or so is the latest I think) is a Microsoft tool for handling text "log" formats. It has both a command line utility program and an ActiveX DLL.
Log Parser
Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®.
You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart.
Here's the results (measured on a native compiled binary, but the IDE-results don't differ much)...
That seems very odd.
When I run the Collection-based code I posted above it takes very little time. Certainly nothing as long as the other implementations do. I wonder where the bottleneck might be on your machine? Some sort of slowdown of VB6 native text I/O when the language setting uses non-standard delimiters?
More likely it has to do with whether it is compiled or run in the IDE, how much data there is, etc.
No big deal either way. An approach based on SQL Joins is probably easier to get working correctly for all inputs. Correct is always more important than fast.
Last edited by dilettante; Sep 21st, 2017 at 09:57 AM.
Re: [RESOLVED] CSV Full Outer Join - alternatives?
dil
As Chris noted, you did say CSV files.
So I tried this (no SQL stuff) ..
Code:
' join
Dim dilC() As String
ReDim dilC(10)
dilC(0) = "0000" ' populate element 0
fpath = "D:\VBForums\dilA.txt"
Open fpath For Input As #1
'
nn = 1
For aa = 1 To 10
If EOF(1) Then
Exit For
End If
Line Input #1, xtra
yra = CInt(Left(xtra, 4))
fpath = "D:\VBForums\dilB.txt"
Open fpath For Input As #2
For bb = 1 To 10
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
yrc = CInt(Left(dilC(nn - 1), 4))
' 1. did it
If yrb <= yrc Then
b = nada
' 2. insert new yrB before curr yrA
ElseIf yrb < yra Then
dilC(nn) = Left(xtrb, 4) & ",0,0,0" & Mid(xtrb, 5)
nn = nn + 1
' do curr yrA
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
If yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
Else
b = b
End If
Exit For
' 3. same yr
ElseIf yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
Exit For
' 4. new yr after
ElseIf yrb > yra Then
b = b
End If
' 5. no yrB
If EOF(2) Then
dilC(nn) = xtra & ",0,0,0"
nn = nn + 1
Exit For
End If
Next bb
Close #2
Next aa
Close #1
' create dilC.txt
fpath = "D:\VBForums\dilC.txt"
Open fpath For Output As #1
For ii = 1 To nn - 1
Write #1, dilC(ii)
Next ii
Close #1
.. and got this ..
Here is the file
There are a few loose ends
assumes Year in both files are already "sorted" .. ie, chronological
there are one or 2 branches that need attention
I don't know why the created dilC.txt has leading and trailing " (quotation marks)
But, it's pretty quick.
PS. You marked this as RESOLVED but I did begin work on it before so .. reason for post #10 ..
When I run the Collection-based code I posted above it takes very little time. Certainly nothing as long as the other implementations do.
You've included very small a.txt and b.txt files (only 70 lines and 45 lines) in your Collection-example-Zip...
To exclude non-related overheads, the timings I've posted were done with 5000 lines in each file (leaving a few gaps in each - per Mod 10 and Mod 8 respectively).
Already with that relative small LineCount the Collection-approach shows tendencies to become exponentially worse
(above 10000 lines in each file it becomes unusable).
The other approaches do not show such an expontial worsening (they scale linearily).
64000 lines:
640000 lines:
Originally Posted by dilettante
No big deal either way. An approach based on SQL Joins is probably easier to get working correctly for all inputs. Correct is always more important than fast.
In case the input-files get larger, one might consider using SQLite and the RC5-FileParsing-Class cCSV (instead of an unknown MS-Tool),
which is consistently about factor 3 as fast as the JET-ISAM-driver...
That even whilst currently producing the output-text-file "by hand" (after the Outer-Join-Select, in an Rs.ValueMatrix(i,j) -Loop).
If one decides to leave the imported CSV-data within the SQLite-DB (and use that File further instead of c.txt), then only
about half the time again would be needed for the plain a,b imports...
Re: [RESOLVED] CSV Full Outer Join - alternatives?
It is all good feedback, and we have some good alternatives for anyone finding this thread by searching.
I think one big weakness of my Collection-based approach is relying on binary searching. The cost is minimal for small sets of data but grows with larger sets of data. Databases normally do far better, and you can add an index to the "inputs" that will outstrip my "search and insert" approach.
This originally came up in the context of fairly small files. If the files get large or you need to do a lot of this then performance does become an issue.
I had been hoping for some utility or library that might do this sort of thing generically. No need to write careful code or craft any SQL: just supply the list of files and which field is the key for matching.
The more I think about this though, hardly anyone ever needs this. If they did we'd have such a tool, or at least more databases would support Full Outer Joins. SQL Server has them, MySQL may I don't know. But embedded/desktop DMBSs don't seem to. Probably because they aren't needed very much. As seen above we can produce the same result with a little effort.
Re: [RESOLVED] CSV Full Outer Join - alternatives?
Hi,
I know this is marked as resolved, but I had another go at this with a shema.ini
this output isn't quit what you want, I think this method could be improved on.
here the sample..
Code:
'create a schema.ini and place it with the
'textfiles in Folder C:\Text_Test
'[a.txt]
'ColNameHeader = True
'Format=Delimited(;)
'
'[b.txt]
'ColNameHeader = True
'Format=Delimited(;)
'textfile a.txt looks like this
'nYear;a1;a2;a3
'1901;1;1;1
'1902;2;2;2
'1904;4;4;4
'1906;6;6;6
'1908;8;8;8
'textfile b.txt looks like this
'nYear;a4;a5;a6
'1901;11;11;11
'1902;12;12;12
'1903;13;13;13
'1904;14;14;14
'1905;15;15;15
'1906;16;16;16
Const myFolder = "C:\Text_Test\"
Const a_File = "a.txt"
Const b_File = "b.txt"
Const adOpenStatic = 3
Const adLockOptimistic = 3
Const adCmdText = &H1
Private Sub Command1_Click()
Dim rst
Dim conn
Dim fso
Dim outPath
Dim outStream
Dim colCount, i
Dim hdr()
Set fso = CreateObject("Scripting.FileSystemObject")
'Outpufile
outPath = myFolder & "out.txt"
If fso.FileExists(outPath) Then fso.DeleteFile outPath, True
'Connection
Set conn = CreateObject("ADODB.Connection")
conn.provider = "Microsoft.Jet.OLEDB.4.0"
conn.ConnectionString = "Data Source='" & myFolder & "';Extended Properties='text'" 'HDR=Yes;FMT=Delimited('';'')';"
conn.Open
Set rst = CreateObject("ADODB.recordset")
rst.Open "SELECT t1.*, t2.* FROM " & a_File & " AS t1 INNER JOIN " & b_File & " AS t2 ON t1.nYear = t2.nYear", _
conn, adOpenStatic, adLockOptimistic, adCmdText
'Header
colCount = rst.fields.Count
ReDim hdr(colCount - 1)
For i = 0 To colCount - 1
hdr(i) = rst.fields(i).Name
Next
'Out-Stream
Set outStream = fso.OpenTextFile(outPath, 2, True) ' 2 = ForWriting
'write Header
outStream.writeLine Join(hdr, ";")
'write new file
outStream.writeLine rst.GetString(2, , ";", vbCrLf)
outStream.Close
End Sub
schema.ini..
output..
regards
Chris
Last edited by ChrisE; Sep 22nd, 2017 at 04:23 AM.
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.
Re: [RESOLVED] CSV Full Outer Join - alternatives?
Originally Posted by dilettante
If they are already in sequence, great. I wasn't really assuming that though. Still, in some cases the assumption may be valid.
You used Write # where you probably wanted Print #.
First off, Print did the trick ..
Secondly, I've now dealt with "NOT in sequence" by doing a bubble sort.
I started the timer before the sorting
Code:
QueryPerformanceFrequency msFreq 'get frequency
QueryPerformanceCounter msTime1 'get first time
Dim aaA(), aaB()
ReDim aaA(1, nna)
ReDim aaB(1, nnb)
' 1. unsorted a, populate array aaA
fpath = "D:\VBForums\dilau.txt"
Open fpath For Input As #1
nn = 0
Do While Not EOF(1)
Line Input #1, xtr
aaA(0, nn + 1) = xtr
aaA(1, nn + 1) = xtr
nn = nn + 1
Loop
Close #1
' 2. bubble sort a
ns = 0
For ii = nn To 1 Step -1
didswap = 0
For jj = 1 To ii - 1
yr0 = CInt(Left(aaA(1, jj), 4))
yr1 = CInt(Left(aaA(1, jj + 1), 4))
If yr0 > yr1 Then
swap = aaA(1, jj)
aaA(1, jj) = aaA(1, jj + 1)
aaA(1, jj + 1) = swap
didswap = 1
ns = ns + 1
End If
Next jj
If didswap = 0 Then
Exit For
End If
Next ii
' 3. display a sorted, write to file
fpath = "D:\VBForums\dilAs.txt"
Open fpath For Output As #1
For ii = 1 To nna
Label2(2).Caption = Label2(2).Caption & aaA(1, ii) & vbCrLf
Print #1, aaA(1, ii)
Next ii
Close #1
' 4. unsorted b, populate array aaB
fpath = "D:\VBForums\dilbu.txt"
Open fpath For Input As #1
nn = 0
Do While Not EOF(1)
Line Input #1, xtr
aaB(0, nn + 1) = xtr
aaB(1, nn + 1) = xtr
nn = nn + 1
Loop
Close #1
' 5. bubble sort b
ns = 0
For ii = nn To 1 Step -1
didswap = 0
For jj = 1 To ii - 1
yr0 = CInt(Left(aaB(1, jj), 4))
yr1 = CInt(Left(aaB(1, jj + 1), 4))
If yr0 > yr1 Then
swap = aaB(1, jj)
aaB(1, jj) = aaB(1, jj + 1)
aaB(1, jj + 1) = swap
didswap = 1
ns = ns + 1
End If
Next jj
If didswap = 0 Then
Exit For
End If
Next ii
' 6. display b sorted, write to file
fpath = "D:\VBForums\dilBs.txt"
Open fpath For Output As #1
For ii = 1 To nnb
Label2(3).Caption = Label2(3).Caption & aaB(1, ii) & vbCrLf
Print #1, aaB(1, ii)
Next ii
Close #1
' 7. join
ReDim dilC(10)
dilC(0) = "0000" ' populate element 0
fpath = "D:\VBForums\dilAs.txt"
Open fpath For Input As #1
'
Dim yrAdid, yrBdid
nn = 1
For aa = 1 To 10
If EOF(1) Then
Exit For
End If
Line Input #1, xtra
yra = CInt(Left(xtra, 4))
fpath = "D:\VBForums\dilBs.txt"
Open fpath For Input As #2
For bb = 1 To 10
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
yrc = CInt(Left(dilC(nn - 1), 4))
' 1. did it
If yrb <= yrc Then
b = nada
' 2. insert new yrB before curr yrA
ElseIf yrb < yra Then
dilC(nn) = Left(xtrb, 4) & ",0,0,0" & Mid(xtrb, 5)
nn = nn + 1
yrBdid = yrb
' do curr yrA
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
If yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
yrAdid = yra
Else
b = b
End If
Exit For
' 3. same yr
ElseIf yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
yrAdid = yra
yrBdid = yrn
Exit For
' 4. new yr after
ElseIf yrb > yra Then
' do curr yrA
If yra > yrAdid Then
dilC(nn) = xtra & ",0,0,0"
nn = nn + 1
yrAdid = yra
End If
' do new yrB
dilC(nn) = Left(xtrb, 4) & ",0,0,0" & Mid(xtrb, 5)
nn = nn + 1
yrBdid = yrb
End If
' 5. no yrB
If EOF(2) Then
If yra > yrAdid Then
dilC(nn) = xtra & ",0,0,0"
nn = nn + 1
End If
Exit For
End If
Next bb
Close #2
Next aa
Close #1
' create dilC.txt
fpath = "D:\VBForums\dilC.txt"
Open fpath For Output As #1
For ii = 1 To nn - 1
Print #1, dilC(ii)
Next ii
Close #1
'
QueryPerformanceCounter msTime2 'get second time
zz = (msTime2 - msTime1) / msFreq
Here is the image
Here is the file
Comments
6 steps were added before "join" to do the sorting
arrays are introduced for sorting
I still used text files as well .. if only arrays were used, could be a bit quicker
I added 2 more dates from your zip file, such that b.txt now goes beyond a.txt
only one "loose end" .. may not even be an issue
the 2nd former loose end (' 4. new yr after) is now resolved.
It took a little longer, but still is pretty quick
Re: [RESOLVED] CSV Full Outer Join - alternatives?
Originally Posted by Spooman
First off, Print did the trick ..
Secondly, I've now dealt with "NOT in sequence" by doing a bubble sort.
I started the timer before the sorting
Code:
QueryPerformanceFrequency msFreq 'get frequency
QueryPerformanceCounter msTime1 'get first time
Dim aaA(), aaB()
ReDim aaA(1, nna)
ReDim aaB(1, nnb)
' 1. unsorted a, populate array aaA
fpath = "D:\VBForums\dilau.txt"
Open fpath For Input As #1
nn = 0
Do While Not EOF(1)
Line Input #1, xtr
aaA(0, nn + 1) = xtr
aaA(1, nn + 1) = xtr
nn = nn + 1
Loop
Close #1
' 2. bubble sort a
ns = 0
For ii = nn To 1 Step -1
didswap = 0
For jj = 1 To ii - 1
yr0 = CInt(Left(aaA(1, jj), 4))
yr1 = CInt(Left(aaA(1, jj + 1), 4))
If yr0 > yr1 Then
swap = aaA(1, jj)
aaA(1, jj) = aaA(1, jj + 1)
aaA(1, jj + 1) = swap
didswap = 1
ns = ns + 1
End If
Next jj
If didswap = 0 Then
Exit For
End If
Next ii
' 3. display a sorted, write to file
fpath = "D:\VBForums\dilAs.txt"
Open fpath For Output As #1
For ii = 1 To nna
Label2(2).Caption = Label2(2).Caption & aaA(1, ii) & vbCrLf
Print #1, aaA(1, ii)
Next ii
Close #1
' 4. unsorted b, populate array aaB
fpath = "D:\VBForums\dilbu.txt"
Open fpath For Input As #1
nn = 0
Do While Not EOF(1)
Line Input #1, xtr
aaB(0, nn + 1) = xtr
aaB(1, nn + 1) = xtr
nn = nn + 1
Loop
Close #1
' 5. bubble sort b
ns = 0
For ii = nn To 1 Step -1
didswap = 0
For jj = 1 To ii - 1
yr0 = CInt(Left(aaB(1, jj), 4))
yr1 = CInt(Left(aaB(1, jj + 1), 4))
If yr0 > yr1 Then
swap = aaB(1, jj)
aaB(1, jj) = aaB(1, jj + 1)
aaB(1, jj + 1) = swap
didswap = 1
ns = ns + 1
End If
Next jj
If didswap = 0 Then
Exit For
End If
Next ii
' 6. display b sorted, write to file
fpath = "D:\VBForums\dilBs.txt"
Open fpath For Output As #1
For ii = 1 To nnb
Label2(3).Caption = Label2(3).Caption & aaB(1, ii) & vbCrLf
Print #1, aaB(1, ii)
Next ii
Close #1
' 7. join
ReDim dilC(10)
dilC(0) = "0000" ' populate element 0
fpath = "D:\VBForums\dilAs.txt"
Open fpath For Input As #1
'
Dim yrAdid, yrBdid
nn = 1
For aa = 1 To 10
If EOF(1) Then
Exit For
End If
Line Input #1, xtra
yra = CInt(Left(xtra, 4))
fpath = "D:\VBForums\dilBs.txt"
Open fpath For Input As #2
For bb = 1 To 10
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
yrc = CInt(Left(dilC(nn - 1), 4))
' 1. did it
If yrb <= yrc Then
b = nada
' 2. insert new yrB before curr yrA
ElseIf yrb < yra Then
dilC(nn) = Left(xtrb, 4) & ",0,0,0" & Mid(xtrb, 5)
nn = nn + 1
yrBdid = yrb
' do curr yrA
Line Input #2, xtrb
yrb = CInt(Left(xtrb, 4))
If yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
yrAdid = yra
Else
b = b
End If
Exit For
' 3. same yr
ElseIf yrb = yra Then
dilC(nn) = xtra & Mid(xtrb, 5)
nn = nn + 1
yrAdid = yra
yrBdid = yrn
Exit For
' 4. new yr after
ElseIf yrb > yra Then
' do curr yrA
If yra > yrAdid Then
dilC(nn) = xtra & ",0,0,0"
nn = nn + 1
yrAdid = yra
End If
' do new yrB
dilC(nn) = Left(xtrb, 4) & ",0,0,0" & Mid(xtrb, 5)
nn = nn + 1
yrBdid = yrb
End If
' 5. no yrB
If EOF(2) Then
If yra > yrAdid Then
dilC(nn) = xtra & ",0,0,0"
nn = nn + 1
End If
Exit For
End If
Next bb
Close #2
Next aa
Close #1
' create dilC.txt
fpath = "D:\VBForums\dilC.txt"
Open fpath For Output As #1
For ii = 1 To nn - 1
Print #1, dilC(ii)
Next ii
Close #1
'
QueryPerformanceCounter msTime2 'get second time
zz = (msTime2 - msTime1) / msFreq
Here is the image
Here is the file
Comments
6 steps were added before "join" to do the sorting
arrays are introduced for sorting
I still used text files as well .. if only arrays were used, could be a bit quicker
I added 2 more dates from your zip file, such that b.txt now goes beyond a.txt
only one "loose end" .. may not even be an issue
the 2nd former loose end (' 4. new yr after) is now resolved.
It took a little longer, but still is pretty quick
Any more challenges? ..
Spoo
nice one Spoo !!
regards
Chris
to hunt a species to extinction is not logical !
since 2010 the number of Tigers are rising again in 2016 - 3900 were counted. with Baby Callas it's 3901, my wife and I had 2-3 months the privilege of raising a Baby Tiger.