Jun 20th, 2006, 08:01 AM
#1
Thread Starter
Addicted Member
[RESOLVED] How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Well, almost solved, though I can't get rid of the darn JavaScript. Here's what I've got so far:
Add references to "Microsoft VBScript Regular Expressions 5.5".
On Form:
place 2 text boxes, 1 Inet control, and 2 buttons.
VB Code:
Private Sub Command1_Click()
Text1 = Inet1.OpenURL("http://www.yujunet.com/")
End Sub
Private Sub Command2_Click()
Dim temp1 As String
Dim temp2 As String
Dim temp3 As String
Dim newstring
temp1 = RemoveLines(Text1)
temp2 = RegExFind(temp1, "<script[^>]*>(.*)</script>")
temp3 = RegExReplace(temp1, temp2, "")
temp3 = RemoveHTML(temp3)
Text2 = temp3
End Sub
In Module:
VB Code:
Function RemoveLines(myString As String)
'convert multiline to single line string:
myString = Replace(myString, vbTab, " ") 'removes Tabs
myString = Replace(myString, Chr(13), " ")
myString = Replace(myString, Chr(10), " ")
myString = Replace(myString, vbCrLf, " ")
myString = Replace(myString, vbNewLine, " ")
RemoveLines = myString
End Function
Function RegExFind(myString As String, FindWhat As String)
On Error Resume Next
'Create objects.
Dim objRegExp As RegExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim RetStr As String
Set objRegExp = New RegExp
objRegExp.Pattern = FindWhat
objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.MultiLine = True
If (objRegExp.Test(myString) = True) Then
Set colMatches = objRegExp.Execute(myString)
For Each objMatch In colMatches
RetStr = objMatch.Value
Next
Else
RetStr = "" 'No matches
End If
RegExFind = RetStr
End Function
Function RegExReplace(myString As String, FindThis As String, ReplaceWithThis As String)
On Error Resume Next
'search string for item and then replace with new item:
Dim sourse1 As String, resourse As Object
sourse1 = myString
Set resourse = New RegExp
resourse.Pattern = FindThis
resourse.Global = True
resourse.IgnoreCase = True
If resourse.Test(sourse1) = True Then
myString = resourse.Replace(sourse1, ReplaceWithThis)
End If
RegExReplace = myString
End Function
Function RemoveHTML(strText As String)
Dim RegEx
Set RegEx = New RegExp
RegEx.Pattern = "<[^>]*>"
RegEx.Global = True
RegEx.IgnoreCase = True
strText = Replace(strText, "&nbsp;", "")
RemoveHTML = RegEx.Replace(strText, "")
End Function
Any suggestions would really help
-
Jun 20th, 2006, 08:38 AM
#2
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
What do you want as a result: all the HTML, or just the body?
JPnyc rocks!! (Just ask him!)
If u have your answer please go to the thread tools and click "Mark Thread Resolved"
-
Jun 20th, 2006, 08:47 AM
#3
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
I need to get rid of HTML and JavaScript.
-
Jun 20th, 2006, 08:52 AM
#4
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
So what do you want? Just the text of the site?
-
Jun 20th, 2006, 09:03 AM
#5
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
-
Jun 20th, 2006, 09:06 AM
#6
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Add the WebBrowser control to your project:
VB Code:
Private Sub Form_Load()
WebBrowser1.Navigate "http://www.yujunet.com/"
End Sub
Private Sub WebBrowser1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
If (pDisp Is WebBrowser1.Application) Then
Debug.Print WebBrowser1.Document.documentElement.innerText
End If
End Sub
That's it.
-
Jun 20th, 2006, 09:14 AM
#7
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Thanks Static, but I don't want to use WebBrowser, as it loads too many items that are useless to me. That's why I wanted to use Inet.
-
Jun 20th, 2006, 09:16 AM
#8
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Method 2:
Add a reference to the MS HTML Object Library
Remove the webbrowser control
VB Code:
Dim HTML As New HTMLDocument
Dim DOC As HTMLDocument
Set DOC = HTML.createDocumentFromUrl("http://www.yujunet.com/", vbNullString)
Do While DOC.ReadyState <> "complete"
DoEvents
Loop
Debug.Print DOC.documentElement.innerText
-
Jun 20th, 2006, 09:16 AM
#9
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Here's my alternative that I'm working on, though it's buggy too:
http://www.vbforums.com/showthread.php?t=412337
-
Jun 20th, 2006, 11:39 AM
#10
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Trying something else ... though still unsuccessful:
VB Code:
'Use the same module functions as in the first post.
Private Sub Command1_Click()
Dim i As String
i = Inet1.OpenURL("http://www.yujunet.com/")
Text1 = RemoveLines(i)
End Sub
Private Sub Command2_Click()
Dim i As String
i = RemoveSpaces(Text1)
Text2 = Trim$(i)
End Sub
Private Sub Command3_Click()
'this one finds the tags, but it finds the first <script[^>]*> and the last </script>, while I need to find EVERY match.
Text3 = RegExFind(Text2, "<script[^>]*>(.*)</script>")
End Sub
Private Sub Command4_Click()
sArray = Split(sText, "<script[^>]*>")
For i = 0 To Len(Text2)
iPoe = InStr(sArray(i), "</script>")
If iPoe Then
iPor = "<script[^>]*>" & Mid$(sArray(i), 1, (iPoe - 1)) & "</script>"
Text4 = Trim$(Replace(sText, iPor, " "))
End If
Next i
End Sub
Any help?
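For reference, the behaviour described in the Command3_Click comment is what a greedy (.*) produces: it runs from the first <script ...> to the last </script>. Below is a minimal sketch of a non-greedy pattern that gives one match per script block. It assumes the same "Microsoft VBScript Regular Expressions 5.5" reference, and CountScriptBlocks is a made-up name used only for illustration:

```vb
' Sketch only: CountScriptBlocks is a hypothetical helper, not part of the code above.
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
Function CountScriptBlocks(ByVal html As String) As Long
    Dim re As RegExp
    Set re = New RegExp
    ' [\s\S]*? is non-greedy, so each match stops at the nearest </script>.
    ' Unlike ".", [\s\S] also matches line breaks.
    re.Pattern = "<script[^>]*>[\s\S]*?</script>"
    re.IgnoreCase = True
    re.Global = True
    CountScriptBlocks = re.Execute(html).Count
End Function
```

With this pattern, CountScriptBlocks("<script>a</script>text<script>b</script>") should count two separate blocks, whereas the greedy (.*) version treats the whole string as a single match.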
-
Jun 20th, 2006, 02:35 PM
#11
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Well, looks like I've solved the puzzle. It strips HTML, JavaScript, CSS, and comment tags from an HTML file and leaves just the text. Something similar to WebBrowser1.Document.documentElement.innerText, but using RegEx:
form:
VB Code:
Private Sub Command1_Click()
Dim i As String
i = Inet1.OpenURL("http://www.yujunet.com/")
i = RemoveLines(i)
i = RegExReplace(i, "<style[^>]*>[\s\S]*?</style>", " ")
i = RegExReplace(i, "<script[^>]*>[\s\S]*?</script>", " ")
i = RegExReplace(i, "<!--[\s\S]*?-->", " ")
i = RegExReplace(i, "<[^>]*>", " ")
i = RegExReplace(i, "&nbsp;", " ")
i = RegExReplace(i, "&", " ")
i = RemoveSpaces(i)
Text1 = Trim$(i)
End Sub
module:
VB Code:
Function RegExReplace(myString As String, FindThis As String, ReplaceWithThis As String)
On Error Resume Next
'search string for item and then replace with new item:
Dim sourse1 As String, resourse As Object
sourse1 = myString
Set resourse = New RegExp
resourse.Pattern = FindThis
resourse.Global = True
resourse.IgnoreCase = True
If resourse.Test(sourse1) = True Then
myString = resourse.Replace(sourse1, ReplaceWithThis)
End If
RegExReplace = myString
End Function
Function RemoveSpaces(myString As String)
'collapse runs of spaces into a single space:
Do Until InStr(1, myString, Space$(2)) = 0
myString = Replace(myString, Space$(2), " ")
Loop
RemoveSpaces = myString
End Function
Function RemoveLines(myString As String)
'convert multiline to single line string:
myString = Replace(myString, vbTab, " ") 'removes Tabs
myString = Replace(myString, Chr(13), " ") ' vbNullString
myString = Replace(myString, Chr(10), " ")
myString = Replace(myString, vbCrLf, " ")
myString = Replace(myString, vbNewLine, " ")
RemoveLines = myString
End Function
It works fine, though any improvement suggestions are really appreciated
-
Jun 20th, 2006, 03:03 PM
#12
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Looks good to me. Nice work!
-
Jun 20th, 2006, 03:36 PM
#13
PowerPoster
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost sol
And without RegExpressions ..
Added removal of Extra Chars, Special Symbols, Single Letters, Digits, Common Words. Upper Case first letter of each word.
VB Code:
Option Explicit
Private Sub Command1_Click()
Dim i As String
i = Inet1.OpenURL("http://www.yujunet.com/")
If Len(i) Then
i = RemoveLines(i)
i = RemoveTags(i, "<style", "</style>")
i = RemoveTags(i, "<script", "</script>")
i = RemoveTags(i, "<!--", "-->")
i = RemoveTags(i, "<", ">")
i = RemoveTags(i, "&#", ";") ' SPECIAL SYMBOLS
i = RemoveChars(i, "&nbsp;#&amp;#&quot;#&gt;#&lt;#[#]#""#;#:#.#,#'#/#$#%#?#!#|#(#)#=#-#+#&#*#©#®")
i = RemoveDigits(i, "0 1 2 3 4 5 6 7 8 9")
i = RemoveCommon(i, "a b c d e f g h i j k l m n o p q r s t u v w x y z")
i = RemoveCommon(i, "at and com is or of to that this then the was what with where who when")
i = RemoveMultiple(i, Space$(2)) ' GET RID OF MULTIPLE SPACES
i = StrConv(i, vbProperCase) ' UPPER CASE FIRST LETTER
Text1 = Trim$(i)
End If
End Sub
Private Function RemoveTags(ByVal myString As String, _
start As String, finish As String) As String
Dim sArray() As String, i As Integer
Dim iPor As String, iPoe As Integer
sArray = Split(myString, start, , 3) ' SPLIT BY TAG START
For i = 0 To UBound(sArray) ' LOOP THROUGH
iPoe = InStr(1, sArray(i), finish, 3) ' GET REPLACE LENGTH
If iPoe Then ' IF EXISTS IN TEXT
iPor = start & Mid$(sArray(i), 1, (iPoe - 1)) & finish ' OUR REPLACE STRING
myString = Trim$(Replace(myString, iPor, " ", , , 3)) ' REPLACE IN TEXT
End If
Next i ' NEXT TAG START
RemoveTags = myString
End Function
Private Function RemoveCommon(ByVal myString As String, _
myVal As String) As String
Dim sArray() As String, i As Integer
sArray = Split(myVal)
For i = 0 To UBound(sArray)
Do While (InStr(1, " " & myString & " ", " " & sArray(i) & " ", 3))
myString = Replace(" " & myString & " ", " " & sArray(i) & " ", " ", , , 3)
Loop
Next
RemoveCommon = myString
End Function
Private Function RemoveDigits(ByVal myString As String, _
myVal As String) As String
Dim sArray() As String, i As Integer
sArray = Split(myVal)
For i = 0 To UBound(sArray)
Do While (InStr(myString, sArray(i)))
myString = Replace(myString, sArray(i), " ")
Loop
Next
RemoveDigits = myString
End Function
Private Function RemoveChars(ByVal myString As String, _
myVal As String) As String
Dim sArray() As String, i As Integer
sArray = Split(myVal, "#")
For i = 0 To UBound(sArray)
myString = Replace(myString, sArray(i), " ", , , 3)
Next i
myString = Replace(myString, "#", " ")
RemoveChars = myString
End Function
Private Function RemoveMultiple(ByVal myString As String, _
myVal As String) As String
Do While (InStr(myString, myVal))
myString = Replace(myString, myVal, " ", , , 3)
Loop
RemoveMultiple = myString
End Function
Private Function RemoveLines(ByVal myString As String) As String
myString = Replace(myString, vbTab, " ")
myString = Replace(myString, Chr(13), " ")
myString = Replace(myString, Chr(10), " ")
RemoveLines = myString
End Function
Last edited by rory; Jun 20th, 2006 at 06:17 PM.
-
Jun 20th, 2006, 04:31 PM
#14
Thread Starter
Addicted Member
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Wow! Thanks guys. Now we're getting somewhere
-
Jun 20th, 2006, 04:37 PM
#15
PowerPoster
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost sol
Updated. Try it now on something like Yahoo, a page with a ton of text.
Basically I turned RemoveSpaces into a general RemoveMultiple function, so you can remove "..." as well as extra spaces, or any other run of repeated characters.
In the case of the "." you want to keep it if it is part of something like "$200.00",
but not in "End of Sentence."
You also want to replace commas, but not in "$200,000.00".
Hope it helps.
If you want to get rid of numbers, etc., you'll need to add a function for that, or let us know.
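One possible way to handle the "$200.00" case is sketched below. It steps away from the non-RegEx approach of this post and uses a lookahead instead, so it assumes the "Microsoft VBScript Regular Expressions 5.5" reference from the first post. RemoveLoosePunctuation is a hypothetical name:

```vb
' Sketch only: RemoveLoosePunctuation is a hypothetical helper.
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
Function RemoveLoosePunctuation(ByVal s As String) As String
    Dim re As RegExp
    Set re = New RegExp
    ' Match a period or comma only when it is NOT followed by a digit,
    ' so the separators inside "$200,000.00" are left alone.
    re.Pattern = "[.,](?!\d)"
    re.Global = True
    RemoveLoosePunctuation = re.Replace(s, " ")
End Function
```

Called on "Costs $200,000.00 today.", this should keep the amount intact and replace only the final period with a space. It is not a complete rule: a leading decimal such as ".5" would also be kept.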
Last edited by rory; Jun 20th, 2006 at 04:57 PM.
-
Jun 20th, 2006, 05:09 PM
#16
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost sol
 Originally Posted by foxter
module:
VB Code:
Function RemoveLines(myString As String)
'convert multiline to single line string:
myString = Replace(myString, vbTab, " ") 'removes Tabs
myString = Replace(myString, Chr(13), " ") ' vbNullString
myString = Replace(myString, Chr(10), " ")
myString = Replace(myString, vbCrLf, " ")
myString = Replace(myString, vbNewLine, " ")
RemoveLines = myString
End Function
Once you've removed all occurrences of Chr(13) and Chr(10), there are no occurrences of vbCrLf or vbNewLine - you've removed them. vbCr is Chr(13), vbLf is Chr(10) and vbNewLine is vbCr & vbLf.
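A quick way to check this yourself is sketched below (ShowLineBreakConstants is just an illustrative name, and the vbNewLine comparison assumes VB6 on Windows):

```vb
' Sketch only: call from the Immediate window or from any Sub.
Sub ShowLineBreakConstants()
    Dim s As String
    s = "one" & vbCrLf & "two"
    s = Replace(s, Chr(13), " ")   ' same as vbCr
    s = Replace(s, Chr(10), " ")   ' same as vbLf
    ' No CR/LF pair is left, so InStr finds nothing and returns 0.
    Debug.Print InStr(s, vbCrLf)
    ' On Windows, vbNewLine and vbCrLf are the same two characters.
    Debug.Print (vbNewLine = vbCrLf)
End Sub
```

So the last two Replace calls in RemoveLines never find anything to replace and can be dropped.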
The most difficult part of developing a program is understanding the problem.
The second most difficult part is deciding how you're going to solve the problem.
Actually writing the program (translating your solution into some computer language) is the easiest part.
Please indent your code and use [HIGHLIGHT="VB"] [/HIGHLIGHT] tags around it to make it easier to read.
-
Jun 20th, 2006, 07:58 PM
#17
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
I think it would be quicker to use the HTML library and the innerText property like Static originally suggested. Reference against the Microsoft HTML Object Library (or whatever it's called) rather than the Webbrowser control. I agree that using a control for this is not really appropriate but it doesn't mean you should shut yourself out from taking advantage of an already present routine which is likely to be more efficient and powerful.
-
Jun 20th, 2006, 11:16 PM
#18
PowerPoster
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost sol
Other methods to get the text: API, MSXML, HTML Object Library (already suggested), Winsock (won't show that here)
API (no controls needed):
sText = SendAPIRequest("http://www.mywebsitelink.com")
VB Code:
Option Explicit
Private Const STRING_SIZE = 128
Private Const INTERNET_OPEN_TYPE_DIRECT = 1
Private Const INTERNET_FLAG_NO_CACHE_WRITE = &H4000000
Private Declare Function InternetOpen Lib "wininet" Alias "InternetOpenA" _
(ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, _
ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet" (ByVal hInet As Long) As Long
Private Declare Function InternetReadFile Lib "wininet" _
(ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, lNumberOfBytesRead As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet" Alias "InternetOpenUrlA" _
(ByVal hInternetSession As Long, ByVal lpszUrl As String, ByVal lpszHeaders As String, _
ByVal dwHeadersLength As Long, ByVal dwFlags As Long, ByVal dwContext As Long) As Long
'// GET TEXT FROM WEB PAGE **** USING API
Private Function SendAPIRequest(ByVal strUrl As String) As String
    Dim hOpen As Long, hFile As Long
    Dim Ret As Long, sBuffer As String * 128
    Dim sData As String
    hOpen = InternetOpen("VB Program", INTERNET_OPEN_TYPE_DIRECT, vbNullString, vbNullString, 0)
    If hOpen = 0 Then
        MsgBox "Error opening Internet connection"
        Exit Function
    End If
    hFile = InternetOpenUrl(hOpen, strUrl, vbNullString, 0, INTERNET_FLAG_NO_CACHE_WRITE, 0)
    If hFile = 0 Then
        MsgBox "Error opening Web page"
    Else
        ' Read until the call fails or reports 0 bytes (end of data),
        ' appending only the bytes actually read on each pass.
        Do
            If InternetReadFile(hFile, sBuffer, STRING_SIZE, Ret) = 0 Then Exit Do
            If Ret = 0 Then Exit Do
            sData = sData & Left$(sBuffer, Ret)
        Loop
        InternetCloseHandle hFile
    End If
    InternetCloseHandle hOpen
    SendAPIRequest = sData
End Function
MSXML: Reference Microsoft XML, version 2.0 (or above if your server supports it - 4.0 suggested)
sText = SendRequest("http://www.mywebsitelink.com")
VB Code:
Option Explicit
Private Function SendRequest(ByVal strUrl As String) _
As String
On Error Resume Next
Dim objHTTP As New MSXML.XMLHTTPRequest ' CREATE OBJECT
objHTTP.Open "GET", strUrl, False ' START REQUEST
objHTTP.setRequestHeader "Content-Type", "text/html"
If Err = 0 Then ' NO ERRORS
objHTTP.send ' SEND REQUEST
SendRequest = objHTTP.responseText ' GET TEXT
Else
MsgBox "Error " & Err.Number & _
vbNewLine & Err.Description
End If
End Function
And the One that was posted above .. HTML Object Library ..
Reference Microsoft HTML Object Library.
In this case as shown by static, it strips all the tags already ..
though you would still need to clean up the text.
sText = getHTMLDocument("http://www.mywebsitelink.com")
VB Code:
Option Explicit
Private Function getHTMLDocument(ByVal strUrl As String) As String
Dim HTML As New HTMLDocument
Dim DOC As HTMLDocument
Set DOC = HTML.createDocumentFromUrl(strUrl, vbNullString)
Do While DOC.ReadyState <> "complete"
DoEvents
Loop
getHTMLDocument = DOC.documentElement.innerText
End Function
Last edited by rory; Jun 21st, 2006 at 01:24 AM.
-
Jun 20th, 2006, 11:19 PM
#19
PowerPoster
Re: How do I remove JavaScript from HTML source using Regular Expressions. Almost sol
 Originally Posted by penagate
I think it would be quicker to use the HTML library and the innerText property like Static originally suggested. Reference against the Microsoft HTML Object Library (or whatever it's called) rather than the Webbrowser control. I agree that using a control for this is not really appropriate but it doesn't mean you should shut yourself out from taking advantage of an already present routine which is likely to be more efficient and powerful.
Agreed, didn't even know that control existed .. :-)
-
Jun 28th, 2006, 04:10 PM
#20
Member
Re: [RESOLVED] How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
I have a similar task to read text from a FAQ webpage and store the Q/A pairs somehow. They will then all be written to an aiml output file in the format below
<aiml>
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
.
.
.
</aiml>
see -- http://www.alicebot.org/aiml.html
The use of the WebBrowser control is pretty cool. Any comments?
-
Jun 29th, 2006, 12:02 AM
#21
Thread Starter
Addicted Member
Re: [RESOLVED] How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
In the code from my first post:
VB Code:
temp2 = RegExFind(temp1, "<script[^>]*>(.*)</script>")
change to:
VB Code:
temp2 = RegExFind(temp1, "<aiml>(.*)</aiml>")
and then search for text within <TEMPLATE> tags within that string:
VB Code:
myNewString = RegExFind(temp2, "<template>(.*)</template>")
-
Jun 29th, 2006, 12:26 AM
#22
Member
Re: [RESOLVED] How do I remove JavaScript from HTML source using Regular Expressions. Almost solved.
Actually, after reading the text from a FAQ webpage, the program will then output an AIML file complete with the <aiml>...</aiml> tags. I'm not trying to strip text from AIML pages at all. Is this your understanding, foxter? It seems you're recommending how to strip text from an AIML file.
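To make the goal concrete, here is a rough sketch of the output step only. It assumes the Q/A pairs have already been extracted into two parallel string arrays; WriteAiml and its parameters are hypothetical names, and the sketch does not escape characters such as & or < in the text:

```vb
' Sketch only: WriteAiml is a hypothetical helper for the output step.
' questions() and answers() are assumed to hold matching Q/A pairs.
Sub WriteAiml(ByVal path As String, questions() As String, answers() As String)
    Dim f As Integer, i As Long
    f = FreeFile
    Open path For Output As #f
    Print #f, "<aiml>"
    For i = LBound(questions) To UBound(questions)
        Print #f, "<category>"
        ' Patterns are written in upper case, as in the sample above.
        Print #f, "<pattern>" & UCase$(questions(i)) & "</pattern>"
        Print #f, "<template>" & answers(i) & "</template>"
        Print #f, "</category>"
    Next i
    Print #f, "</aiml>"
    Close #f
End Sub
```

The page text itself could come from any of the methods earlier in the thread; splitting it into questions and answers depends on how the particular FAQ page is laid out.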