Import data from multiple html files into one excel worksheet
I got the below macro from a old thread. I have the lines on the html page files saved on c drive, which are supposed to be in a table format but they are not and there is no separator between each column data except some undefined spaces as shown in below image, so how can i import it as a proper table .i.e. into separate columns in the excel file, also i have multiple such html files so can i import multiple files into one worksheet at one go.
Code:
Sub HTML_Table_To_Excel()
Dim htm As Object
Dim Tr As Object
Dim Td As Object
Dim Tab1 As Object
'Replace the URL of the webpage that you want to download
Web_URL = VBA.Trim(Sheets(1).Cells(1, 1))
'Create HTMLFile Object
Set HTML_Content = CreateObject("htmlfile")
'Get the WebPage Content to HTMLFile Object
With CreateObject("msxml2.xmlhttp")
.Open "GET", Web_URL, False
.send
HTML_Content.body.innerHTML = .responseText
End With
Column_Num_To_Start = 1
iRow = 2
iCol = Column_Num_To_Start
iTable = 10
'Loop Through Each Table and Download it to Excel in Proper Format
For Each Tab1 In HTML_Content.getElementsByTagName("table")
With HTML_Content.getElementsByTagName("table")(iTable)
For Each Tr In .Rows
For Each Td In Tr.Cells
Sheets(1).Cells(iRow, iCol).Select
Sheets(1).Cells(iRow, iCol) = Td.innerText
iCol = iCol + 1
Next Td
iCol = Column_Num_To_Start
iRow = iRow + 1
Next Tr
End With
iTable = iTable + 1
iCol = Column_Num_To_Start
iRow = iRow + 1
Next Tab1
End Sub
Re: Import data from multiple html files into one excel worksheet
Is that really an HTML file with HTML tags, or is it actually just a text file with a fixed width fond and spaces to get the data into the correct columns? If it is a text file, you would need something like this:
Re: Import data from multiple html files into one excel worksheet
Originally Posted by jdc2000
Is that really an HTML file with HTML tags, or is it actually just a text file with a fixed width fond and spaces to get the data into the correct columns? If it is a text file, you would need something like this:
Its an html file only, i had opened it in the notepad++ and taken a snapshot of couple of lines to shown the example and the spaces between the table columns are same most of the time but not always, can we change the above piece of code work with it to identify the lines by html tag names
Last edited by abhay_547; Jan 31st, 2018 at 11:33 PM.
Re: Import data from multiple html files into one excel worksheet
from what I can see of that screen capture, there do not appear to be any HTML tags present. Does it look the same in Notepad (NOT Notepad++)? What happens if you start Excel and try to open the file? (Make sure that you have File Types set to ALL.) Can you open the file?
Re: Import data from multiple html files into one excel worksheet
Originally Posted by jdc2000
from what I can see of that screen capture, there do not appear to be any HTML tags present. Does it look the same in Notepad (NOT Notepad++)? What happens if you start Excel and try to open the file? (Make sure that you have File Types set to ALL.) Can you open the file?
Yes, there are close to 500 .htm files on my c drive folder. when i open it in excel the data which should come in multiple columns comes in one cell.
Re: Import data from multiple html files into one excel worksheet
Did you open one of the files using Notepad (NOT Notepad++)? Post a SMALL screen capture of what you get so we can see the file data. Are there any HTML tags visible?
Re: Import data from multiple html files into one excel worksheet
Originally Posted by jdc2000
Did you open one of the files using Notepad (NOT Notepad++)? Post a SMALL screen capture of what you get so we can see the file data. Are there any HTML tags visible?
Yes, I did open it in the Notepad (Not NOTEPAD++) and i could see the HTML tags in it.
Re: Import data from multiple html files into one excel worksheet
post a sample of part of one of the html files
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Originally Posted by westconn1
post a sample of part of one of the html files
Below is the html page script and attached is the image how it looks in chrome.
Code:
<!DOCTYPE HTML PUBLIC - //W3C// DTD HTML 4.0 Strict//EN>
<HTML><HEAD></HEAD><BODY>
<style>
table {font-family:monospace;font-size:13px}
td {white-space:pre}
</style>
<TABLE MARGIN=0 BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500">XXX SALES CORPORATION SALES DATE: 01/17/2018 SCSLE-ID: ZRR REPORT-ID: 00007040000402242018</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500">XXX CORPORATION DAILY SALES ACTIVITY REPORT SYSTEM DATE: 01/17/2018 SYSTEM TIME: 17:40:20 PAGE 85</SPAN></TD></TR>
<TR><TD BGCOLOR="white"> </TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500">XXX SALES GROUP XXDD CLASSIFICATION INTERNAL/OTHER</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500">ACCOUNT TYPE: D </SPAN></TD></TR>
<TR><TD BGCOLOR="white"> </TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> B S TRXN SALE SALE TRXN INFO OPP KLC ACCOUNT</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> DESCRIPTION DATE D C RESI SINGL PRICE PRICE AMOUNT R EFF M/K SEF ORGZ TYP NUMBER</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"></SPAN><SPAN STYLE="font-weight:500;text-decoration:underline">
<TR><TD BGCOLOR="lightgrey"> </TD></TR>
<TR><TD BGCOLOR="lightgrey"> </TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> (continued)</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 14.64000 16.11500 445.00 00889 047ACDDR5</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 1 14.64000 16.11500 678.00 00889 072R14433</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 14.64000 16.11500 445.00 00889 047ACDDR5</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 14.22000 16.11500 324.00 00889 047ACDDK5</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 14.74000 16.17500 755.00 00889 047ACDDS5</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 15.14000 17.21500 445.00 00889 047ACDDR5</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 1 15.14000 17.21500 218.00 00819 072R166S4</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 1 15.14000 17.21500 218.00 00889 072R166S4</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 12.32000 16.12400 885.00 00889 047ACDDR5</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 12.12000 16.12400 764.00 00819 047AK67C8</SPAN></TD></TR>
<TR><TD BGCOLOR="lightgrey"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 12.77000 16.17500 755.00 00889 047AKRED6</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 2 16.23000 14.14500 442.00 00886 047AERTH3</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 1 16.23000 14.21400 461.00 00819 072SIOOP2</SPAN></TD></TR>
<TR><TD BGCOLOR="white"><SPAN STYLE="font-weight:500"> CREDIT 01/17 B C 1 16.23000 14.21400 461.00 00889 072UYOOL6</SPAN></TD></TR>
Below is the image how it looks in chrome, now I have 500+ *.htm files saved on my local drive similar to this and I am looking to extract data from all files and import into a worksheet in excel, i want the header rows which looks like is in a tabular format.
Last edited by abhay_547; Feb 9th, 2018 at 01:20 AM.
Re: Import data from multiple html files into one excel worksheet
as above, this reads the data correctly but has some errors with the headers, where there is some empty cells and some additional spacings, in files with large data, it would probably be faster to fix those errors than loop all the rows
Code:
Dim doc As MSHTML.HTMLDocument
Open "c:\temp\547.htm" For Input As 1
htm = Input(LOF(1), #1)
Close 1
Set doc = CreateObject("htmlfile")
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
Set myclip = CreateObject("clipbrd.clipboard")
myclip.Clear
myclip.settext t.innerText
Range("A6").PasteSpecial
it is pretty simple to build a loop through all files in a folder and add below the above data
do you actually want the headers for each file? easy to copy all then remove the headers if not required
Last edited by westconn1; Feb 10th, 2018 at 12:02 AM.
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
I am getting the error on the setmyclip line as Activex Component can't create object. I know that I have to register regsvr32 c:\users\abhy\desktop\clipboard.dll, but I can't do this at office since we wouldn't have admin rights in office to register any dll file also I can't ask IT person to register any dll file on office machine like this since its restricted. do we have any reference which we can select in the references which will make this work. also below is what i have got for looping through files so far. And ok if it imports the header for only first file and not for rest since we are going to append the data one below other from each .htm file into one worksheet.
Code:
Sub GetHtmlFiles()
'reference set to Microsoft HTML Object Library
Dim f As String, Path As String, NextFile As String, ThisBook As String
Path = "C:\users\abhy\desktop\htmldatafiles\"
ChDir Path
f = Dir(Path & "*.htm")
While f <> ""
Dim doc As MSHTML.HTMLDocument
Open f For Input As 1
htm = Input(LOF(1), #1)
Close 1
Set doc = CreateObject("htmlfile")
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
Set myclip = CreateObject("clipbrd.clipboard")
myclip.Clear
myclip.settext t.innerText
Range("A6").PasteSpecial
MsgBox f
Workbooks(f).Close SaveChanges:=False
f = Dir()
Wend
End Sub
Last edited by abhay_547; Feb 10th, 2018 at 07:46 AM.
Re: Import data from multiple html files into one excel worksheet
do we have any reference which we can select in the references which will make this work.
it would still have to be registered, i will figure some alternative, using a dataobject, i just find it a simple way to use the clipboard object in vba
do you want the continued line at the top of the data? is that common to be there?
are the header rows all the same?
Code:
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
fpath = "c:\temp\"
f = Dir(fpath & "*.htm")
Do While f <> ""
Open fpath & f For Input As 1
htm = Input(LOF(1), #1)
Close 1
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
If rw = 1 Then 'fix headers
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
End If
f = Dir
Loop
will require reference to msforms2, i moved the object creation out of the loop so they can just be reused
as the html file does not contain any columns (just formatted rows), it is probably easier to just put the headers for the top of worksheet from a template file or when a new worksheet is created, then just always delete the headers, they can not easily be fixed from the html file
the delete headers may require some adjustment, but works with the sample data file
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
I just need the header from the first file of html folder, it can be ignored from rest all files since we will append data from each file one below another. Below is the snapshot highlighting the header row.
Also I added the reference to Microsoft Forms 2.0 Object Library, it wasn't appearing in the references in excel 2010 but i used the browse button and selected it from C:\WINDOWS\system32\FM20.DLL and then it started appearing in the references but when I run the code nothing happens. It doesn't show any error as well.
Code:
Sub GetHTMLFiles ()
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
fpath = "C:\users\abhy\Appdata\Local\Temp"
f = Dir(fpath & "*.htm")
Do While f <> ""
Open fpath & f For Input As 1
htm = Input(LOF(1), #1)
Close 1
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
If rw = 1 Then 'fix headers
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
End If
f = Dir
Loop
End Sub
Last edited by abhay_547; Feb 11th, 2018 at 05:00 AM.
Re: Import data from multiple html files into one excel worksheet
but when I run the code nothing happens. It doesn't show any error as well.
it was definitely working when i tested, imported your sample and someother htm files, without problem
does f get a value?
you have 2 file paths, is the 2nd one correct?
do the other variables get a correct value as the code runs through them?
step through the code or msgbox the variables to check the values are correct
the results should be pasted to the active worksheet, if there is any existing data the results should be below that
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. I have commented out the below temp folder fpath in the code and it worked but it imports everything into one column. Attached is the htm sample file and the excel file with macro and imported text for your reference. can you please have a look.
Code:
Sub GetHTMLFiles ()
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
'fpath = "C:\users\abhy\Appdata\Local\Temp"
f = Dir(fpath & "*.htm")
Do While f <> ""
Open fpath & f For Input As 1
htm = Input(LOF(1), #1)
Close 1
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
If rw = 1 Then 'fix headers
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
End If
f = Dir
Loop
End Sub
Re: Import data from multiple html files into one excel worksheet
but it imports everything into one column.
on testing, seems like this happens in later versions, should be easy enough to fix, will post later
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. just to confirm the version again, I am using excel 2010 in office and my personal pc I got excel 2013. I tested in both and the output is coming in one column
Re: Import data from multiple html files into one excel worksheet
this is tested in excel 2013
Code:
Sub GetHTMLFiles()
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
fpath = "C:\Temp\"
f = Dir(fpath & "html extract.htm")
Do While f <> ""
Open fpath & f For Input As 1
htm = Input(LOF(1), #1)
Close 1
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
Dim r As Range
If rw = 1 Then 'fix headers
Cells(1, 1).Resize(10).EntireRow.Delete
Range("a1:n1") = Array("", "", "Bs", "", "", "TRXN", "SALE", "SALE", "TRXN", "INFO", "", "OPP", "KLC", "ACCOUNT")
Range("a2:n2") = Array("DESCRIPTION", "DATE", "Dc", "RESI", "SINGL", "PRICE", "PRICE", "AMOUNT", "REFF", "M/K", "SEF", "ORGZ", "TYP", "NUMBER")
Cells(3, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - 2).TextToColumns , xlDelimited, xlTextQualifierNone, , , , , True
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
Set r = Cells(rw, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - rw)
r.TextToColumns r, xlDelimited, xlTextQualifierNone, , , , , True
End If
f = Dir(fpath & "html extract.htm")
Loop
End Sub
unlike last time I tested on this machine the paste did go across multiple columns, the same as previous tests on the older version of excel, I have no idea what has changed to make the difference in the results, the call to texttocolumns will not matter if the data is already in multiple columns
you may want to remove the last 2 lines of data from each import, if it is the same as the sample file
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. initially when i tried to run the above code, it didn't work but when I commented out the below line it worked, but in my folder i had only one sample htm file and it kept looping and importing the data from the same file. Ideally if it finds one file then it should import that and then stop but it keeps looping and importing the same file. can you advise what's going wrong, both the source path and temp folder path are declared as fpath probably that's what is creating the problem and when i comment out it works but then keeps importing the data from the same file again and again.
Re: Import data from multiple html files into one excel worksheet
i had only one sample htm file and it kept looping and importing the data from the same file
i did that deliberately for test as i also only had one sample file, just forgot to remove when posting, remembered after shutting down the computer
change f = Dir(fpath & "html extract.htm") at the bottom of the loop to
Code:
f = Dir
and the initial call to dir, to your file pattern as you had before, i changed that also as there were other html files in the folder, that were no use for testing
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. Great. Thanks a lot for your help . It's getting imported but for the first column if there is space in the text then it will import the text into 2 or 3 columns for e.g. if the text in the first column is just CREDIT then it will import it into the Column A but if the text in the column A is CREDIT INTERNAL (INT) then it will import the CREDIT into Column A and INTERNAL in Column B and (INT) in column D this is the case only with the first column since it consists of the text with space in some pages for e.g. in 500+ htm files around 120 htm pages (randomly) would have the text "CREDIT INTERNAL (INT)" or "DEBIT INTERNAL (INT)" in the first column. apart from this for few page line items the data is not available in the date column so instead of keeping that cell blank it imports the next column data in it because of which the column order changes for around 12000 rows of data. Would it be possible to take care of such thing in the code itself or will have to do a post import formatting. Also i have added the code to delete blank rows in between and also the lines which says contiuned... but how do i add 2 conditions over there...e.g. "*continued*" and "*ENP*" how can i add this...
Code:
Sub GetHTMLFiles()
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
f = Dir(fpath)
Do While f <> ""
Open fpath & f For Input As 1
htm = Input(LOF(1), #1)
Close 1
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
Dim r As Range
If rw = 1 Then 'fix headers
Cells(1, 1).Resize(10).EntireRow.Delete
Range("a1:n1") = Array("", "", "Bs", "", "", "TRXN", "SALE", "SALE", "TRXN", "INFO", "", "OPP", "KLC", "ACCOUNT")
Range("a2:n2") = Array("DESCRIPTION", "DATE", "Dc", "RESI", "SINGL", "PRICE", "PRICE", "AMOUNT", "REFF", "M/K", "SEF", "ORGZ", "TYP", "NUMBER")
Cells(3, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - 2).TextToColumns , xlDelimited, xlTextQualifierNone, , , , , True
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
Set r = Cells(rw, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - rw)
r.TextToColumns r, xlDelimited, xlTextQualifierNone, , , , , True
End If
f = Dir
Loop
On Error Resume Next
Columns("A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
With ActiveSheet
.AutoFilterMode = False
With Range("A1", Range("A" & Rows.Count).End(xlUp))
.AutoFilter 1, "*continued*"
On Error Resume Next
.Offset(1).SpecialCells(12).EntireRow.Delete
End With
.AutoFilterMode = False
End With
End Sub
Last edited by abhay_547; Feb 14th, 2018 at 03:34 AM.
Re: Import data from multiple html files into one excel worksheet
Would it be possible to take care of such thing in the code itself
probably loop down the second column (or use find) for internal, then easy to fix, add the word internal to column a and delete the cell in column B
i thought that the continued line was already take care of in the code i posted, but that assumed all the headers were the same, and if some file did not have a continued line then a row of data may have been deleted, hence the comment "may need adjustment"
i only hade one sample file to work with that did not contain all possibilities of the data
as far as the last line and blank line above, it would be easy to check for a blank line above the last line, then paste the new data over it, no need to delete
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
i had another thought, if internal is the only word that causes the 1st column to break, then just replace that word in the original string
add the 2nd line
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Originally Posted by westconn1
i had another thought, if internal is the only word that causes the 1st column to break, then just replace that word in the original string
add the 2nd line
Re: Import data from multiple html files into one excel worksheet
no, i suggested finding " internal" and replacing the preceding space with an underscore so that it would be continuous with the word before it, being debit, credit or any other
as mentioned, this would need testing
your code, would just add more spaces, making more additional columns, you would end up with 4 columns rather than the current 2 columns
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. Got it. I tested it and it seems to work. can you advise on the below 3 things as well.
1) Right now it doesn't loop the files and import the data in the file order/sequence for e.g. I have files named as FILE1.HTM, FILE2.HTM...and so on which goes till FILE500.htm or even FILE1000.htm but when the import happens it doesn't import the data of FILE1.HTM first and then FILE2.HTM and so on, it doesn't follow the sequence. can we modify the code to import the data by looping the files in order/sequence.
2) for the account column whereever the account number starts with Zero e.g. 05347875, it will truncate the zero and import the number only starting 5 where else in some cases even if the file number starts with zero but ends with some alphabets then it wouldn't truncate the starting zero for e.g. 0573232A3 will get imported as it is, I believe such account numbers are treated as text since they have a alphabet in it and hence they get imported fully as text where else if the account number doesn't have any alphabet in it then its treated as a number hence the first/zero is getting truncated. Can we set the account column as text so that it gets imported correctly all time.
3) As mentioned in my earlier comment, for few page line items the data is not available in the date column so instead of keeping that cell blank it imports the next column data in the date column because of which the column order changes for around 12000 rows of data. Would it be possible to take care of such thing in the code itself.
Re: Import data from multiple html files into one excel worksheet
instead of using dir to get the filename you could try like
Code:
x = 1
fname = mypath & "file" & x & ".htm"
do until fname = ""
''''all your code for each file
x= x + 1
fname = mypath & "file" & x & ".htm"
loop
if any files are missing from the number sequence, it will stop at that point
Can we set the account column
set column to text, has to be done before the data is put in the cell
Code:
range("d1").entirecolumn.numberformat = "@"
change the D to whatever column you want
As mentioned in my earlier comment, for few page line items the data is not available in the date column so instead of keeping that cell blank it imports the next column data in the date column
as there are no columns at all in the html code, it is hard to change the way excel converts text to columns, possibly the easiest way would be, after import, to check where the last column is empty then insert the additional cell for those rows, but i am not sure how accurate that would be, and it would probably take quite some time to run, post some sample data of rows that do not have date and i will see if it is possible to fix the raw data, before importing, similar to the fix for internal
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. I have made the changes as you suggested but now it keeps looping on the last file again and again and keeps importing its data, not sure what is going wrong since it doesn't throw any error. can you check and advise.
Code:
Sub GetHTMLFiles()
Dim doc As MSHTML.HTMLDocument, d As DataObject, htm As String
Set d = New DataObject
Set doc = CreateObject("htmlfile")
fpath = "C:\users\abhy\desktop\htmldatafiles\"
x = 0
fname = fpath & "FILE" & x & ".htm"
Do until fname = ""
Open fname For Input As 1
htm = Input(LOF(1), #1)
Close 1
htm = Replace(htm, " CREDIT", "_CREDIT", , , vbTextCompare)
doc.body.innerHTML = htm
Set t = doc.getElementsByTagName("table")(0)
d.Clear
Range("J1").entirecolumn.numberformat = "@"
d.settext t.innerText
d.PutInClipboard
rw = Cells(Rows.Count, 1).End(xlUp).Row + 1
If rw = 2 Then rw = 1
Cells(rw, 1).PasteSpecial
Dim r As Range
If rw = 1 Then 'fix headers
Cells(1, 1).Resize(10).EntireRow.Delete
Range("a1:n1") = Array("", "", "Bs", "", "", "TRXN", "SALE", "SALE", "TRXN", "INFO", "", "OPP", "KLC", "ACCOUNT")
Range("a2:n2") = Array("DESCRIPTION", "DATE", "Dc", "RESI", "SINGL", "PRICE", "PRICE", "AMOUNT", "REFF", "M/K", "SEF", "ORGZ", "TYP", "NUMBER")
Cells(3, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - 2).TextToColumns , xlDelimited, xlTextQualifierNone, , , , , True
Else ' delete headers
hrows = 12 ' may need adjustment
Cells(rw, 1).Resize(hrows).EntireRow.Delete
Set r = Cells(rw, 1).Resize(Cells(Rows.Count, 1).End(xlUp).Row - rw)
r.TextToColumns r, xlDelimited, xlTextQualifierNone, , , , , True
End If
x= x + 1
fname = fpath & "FILE" & x & ".htm"
loop
On Error Resume Next
Columns("A").SpecialCells(xlCellTypeBlanks).EntireRow.Delete
With ActiveSheet
.AutoFilterMode = False
With Range("A1", Range("A" & Rows.Count).End(xlUp))
.AutoFilter 1, "*continued*"
On Error Resume Next
.Offset(1).SpecialCells(12).EntireRow.Delete
End With
.AutoFilterMode = False
End With
End Sub
Re: Import data from multiple html files into one excel worksheet
I'm not following the suggestion of:
Code:
Do until fname = ""
since, as coded, fname will never be empty. It may contain a path to a file that doesn't exist, but that's not what is being checked for.
So, since fname is never empty, the Do-Loop will never terminate. And I'm not sure why no errors are thrown when it attempts to open and read from files that don't exist, but since you report that it keeps processing the data from the last file that exists, it looks like it is just plowing past those lines of code and re-processing the value stored in "htm", which would be the contents of the "last" file.
The following lines of code might work better for you:
Change the first few lines of code from:
Code:
Do until fname = ""
Open fname For Input As 1
htm = Input(LOF(1), #1)
Close 1
to:
Code:
Do
htm=""
Open fname For Input As 1
htm = Input(LOF(1), #1)
Close 1
If htm = "" Then Exit Do
That will clear out the htm variable before each attempt to populate it from a file. If it encounters a filename that doesn't exist (which would mean you've reached the end of your series of files), then the htm variable will stay empty, and then the If statement will terminate the Do-Loop, which I think is what you are looking for.
Re: Import data from multiple html files into one excel worksheet
my error
should have been
Code:
x = 1
fname = dir(mypath & "file" & x & ".htm")
do until fname = ""
Open mypath & fname For Input As 1
''''all your code for each file
x= x + 1
fname = dir(mypath & "file" & x & ".htm")
loop
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. got it. My folder consists of the htm files 0 to 653 .i.e. FILE0.htm, FILE1.htm and so on, if i don't want to import all the files which are there in the folder then I can just enter the start number of the file and end number of the file so that it imports the data from the start number file to end number file. for e.g. even though i have 653 + 0 = 657 files in the folder but i want to import only from FILE242.htm to FILE411.htm, to make this work, i have created a string for both startfilenum and endfilenum but it's not working, it again keeping looping the last file, what is going wrong.
Code:
Dim StartFileNum As String
Dim EndFileNum As String
StartFileNum = ThisWorkbook.Sheets("Sheet1").Range("D3").Value
EndFileNum = ThisWorkbook.Sheets("Sheet1").Range("D4").Value
x = StartFileNum
fname = dir(mypath & "file" & x & ".htm")
do until fname = "mypath & "file" & x & ".htm""
Open mypath & fname For Input As 1
''''all your code for each file
x= x + 1
fname = dir(mypath & "file" & x & ".htm")
loop
Re: Import data from multiple html files into one excel worksheet
Code:
x= x + 1
if x > endfilenum then exit do
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Ok. below is the change I made, so if I enter zero and Endfilenum as 653 then it doesn't import entire thing...may be something is going wrong, it works when I enter the page 242 to 411.
Code:
Dim StartFileNum As String
Dim EndFileNum As String
StartFileNum = ThisWorkbook.Sheets("Sheet1").Range("D3").Value
EndFileNum = ThisWorkbook.Sheets("Sheet1").Range("D4").Value
x = StartFileNum
fname = dir(mypath & "file" & x & ".htm")
do until fname = "mypath & "file" EndFilenum x & ".htm""
Open mypath & fname For Input As 1
''''all your code for each file
x= x + 1
if x > endfilenum then exit do
fname = dir(mypath & "file" & x & ".htm")
loop
Re: Import data from multiple html files into one excel worksheet
absolutely no idea
what have you tried to diagnose the problem?
i would try
Code:
Dim StartFileNum As String
Dim EndFileNum As String
StartFileNum = ThisWorkbook.Sheets("Sheet1").Range("D3").Value
EndFileNum = ThisWorkbook.Sheets("Sheet1").Range("D4").Value
x = StartFileNum
fname = dir(mypath & "file" & x & ".htm")
do until fname = "mypath & "file" EndFilenum x & ".htm""
debug.print fname
x= x + 1
if x > endfilenum then exit do
fname = dir(mypath & "file" & x & ".htm")
loop
this should just loop through and print the flenames, if any file in the numeric order is missing then the code will not process further files, as designed, as you now have starting and finishing points you could also try
Code:
for x = startfilenumber to endfilenumber
fname = dir(mypath & "file" & x & ".htm")
if len(fname) > 0 then
debug.print fname
'''' open file and do stuff
else msgbox "file " & x & " not found
end if
next
apart from the msgbox that could be removed, missing files will just be ignored
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
ok. It works now as expected. Thanks a lot for all your guidance and help Really appreciate it.
Only one thing which is still an issue is the date column coming as blank which imports the next data column inside date column due to which it changes the column order for certain number of rows. As suggested in your earlier comment, i am attaching herewith the sample file where the date column doesn't consist of data for certain rows (in few pages the date column is completely blank also). this seems to be the issue with few other columns as well in some pages e.g. RESI which comes blank for certain rows in certain number of pages (or completely blank in few pages, discovered by me while checking the data post import). I think once a fix is figured out for date column same logic can be applied for all may be if the space is treated as a character before import then it may import the character in the column.
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
Great. Its working as expected
The only issue is that when post running the macro when I try to replace the "~" with Null it doesn't work. It says Microsoft Excel cannot find any data to replace.
Another thing, while data is getting imported after the last column, can we input the file number for e.g. the macro has imported File256.htm then in column P can we enter number 256 or even the file256.htm text against each row which is imported from that file, just to identify the rows that which is imported from which file.
Last edited by abhay_547; Feb 18th, 2018 at 01:02 PM.
you could always insert some additional columns, if needed, to the entire worksheet after all the files are imported
i have always been trying to avoid looping through all the data line by line
i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case. Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next
dim all variables as required as often i have done so elsewhere in my code but only posted the relevant part
come back and mark your original post as resolved if your problem is fixed
pete
Re: Import data from multiple html files into one excel worksheet
thanks. but what i mean is for e.g. if the date column has missing values for most of the rows, then instead of showing the character "~" I want to show "Null" text for those row items. and I am trying to do this post the macro run .i.e. a simple find and replace with "~" and "Null" but excel is not able to find the "~" to replace it with "Null" not sure why.