-
Aug 30th, 2020, 03:10 PM
#1
Thread Starter
New Member
RegEx of txt that has been converted from PDF
I need to convert a PDF into a format that can be loaded into a DataGridView.
The only solution I have come up with is to use iTextSharp to convert the PDF to a text file; for the most part the layout is kept.
Here is the code that extracts the PDF to text.
Code:
' Needs at the top of the file:
'   Imports System.IO
'   Imports System.Text
'   Imports iTextSharp.text.pdf.parser
Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages()
Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
'Create the text file.
Dim fs As FileStream = File.Create(mTXT)
Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
'Extract the text of each page and append it to the text file.
For i As Integer = 1 To mPageCount
    strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
    Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
    fs.Write(info, 0, info.Length)
Next
fs.Close()
mPDFreader.Close()
The text output ends up looking like this (see also the attached copy of file.txt):
63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS
Which is "pretty close".
The issue is the lines where EXPRESS is immediately followed by the next record's number (look at line 65, where 66 starts on the same line). It should look like this throughout, to make adding it to a DataGridView easier:
63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS
My attempt was to use RegEx to remove everything except this format:
"FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS"
In some cases it may end a bit differently, like this:
FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order
The RegEx is:
Code:
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)
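For comparison, a stricter pattern for one record could be sketched as below. It is only a sketch built from the sample lines shown above: it assumes the tracking id is always four capital letters plus ten digits, the order id is always "OD" plus eighteen digits, and the optional suffix is always spelled "Replacement Order". The module name and the group names (no, tracking, order, date, type) are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RecordPatternSketch
    ' Assumed record layout, inferred only from the sample lines in this post:
    '   <number> <4 letters + 10 digits> <OD + 18 digits> <Mon d, yyyy hh:mm AM/PM> EXPRESS[ , Replacement Order]
    ' \d+ (rather than \d{2}) lets the record number have any length, e.g. 490.
    Public ReadOnly RecordPattern As New Regex(
        "(?<no>\d+)\s+(?<tracking>[A-Z]{4}\d{10})\s+(?<order>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M)\s+" &
        "(?<type>EXPRESS(?:\s*,\s*Replacement Order)?)")

    Sub Main()
        Dim sample As String =
            "67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order"
        Dim m As Match = RecordPattern.Match(sample)
        If m.Success Then
            ' Each field is available by name instead of by position.
            Console.WriteLine(m.Groups("no").Value)
            Console.WriteLine(m.Groups("tracking").Value)
            Console.WriteLine(m.Groups("order").Value)
            Console.WriteLine(m.Groups("date").Value)
            Console.WriteLine(m.Groups("type").Value)
        End If
    End Sub
End Module
```

Writing the optional suffix as one optional group avoids the letter-by-letter `R*e*p*...` form, which would also accept text such as "EXPRESSRRR".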
Question: does anyone have a better or cleaner solution? What I need is
the PDF somehow converted to a format that can be loaded into a DataGridView in the appropriate rows and columns.
Any method that does this is appreciated.
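One possible route from the extracted text to grid rows and columns is to put every matched record into a DataTable. The sketch below assumes the record layout seen in the samples above; the function name BuildTable and the column names are hypothetical, chosen only for illustration.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module RecordsToTableSketch
    ' Assumed record layout, inferred only from the sample lines in this post.
    Private ReadOnly RecordPattern As New Regex(
        "(?<no>\d+)\s+(?<tracking>[A-Z]{4}\d{10})\s+(?<order>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M)\s+" &
        "(?<type>EXPRESS(?:\s*,\s*Replacement Order)?)")

    ' Builds one row per record found anywhere in the text, so it does not
    ' matter how the records are split across lines.
    Public Function BuildTable(text As String) As DataTable
        Dim table As New DataTable("Records")
        ' Hypothetical column names.
        For Each columnName As String In {"No", "TrackingId", "OrderId", "Date", "Type"}
            table.Columns.Add(columnName, GetType(String))
        Next
        For Each m As Match In RecordPattern.Matches(text)
            table.Rows.Add(m.Groups("no").Value,
                           m.Groups("tracking").Value,
                           m.Groups("order").Value,
                           m.Groups("date").Value,
                           m.Groups("type").Value)
        Next
        Return table
    End Function

    Sub Main()
        Dim text As String =
            "63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS" & Environment.NewLine &
            "64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS"
        Dim table As DataTable = BuildTable(text)
        For Each row As DataRow In table.Rows
            Console.WriteLine(String.Join(" | ", row.ItemArray))
        Next
    End Sub
End Module
```

In a Windows Forms project the table could then be bound with something like `DataGridView1.DataSource = BuildTable(File.ReadAllText(mTXT))`, where DataGridView1 is whatever the grid control is called on the form.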
Edit:
I am using RegEx at the moment. This is the sub:
Code:
Private Sub Fixtext()
    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While (True)
            Dim line As String = reader.ReadLine()
            'ReadLine returns Nothing at the end of the file.
            '(line = Nothing would also be True for an empty line.)
            If line Is Nothing Then
                Return
            End If
            Dim match As Match = regex.Match(line)
            If match.Success Then
                Dim value As String = match.Groups(1).Value
                Console.WriteLine(line)
            End If
        End While
    End Using
End Sub
However, the output still has a few problems.
490 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS 491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS 494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS 495 FMPC0847571813 OD119522919323643000 Aug 28, 2020 03:21 PM EXPRESS
That is one issue: it isn't removing the "Tracking Id Forms Required Order Id RTS done on Notes" text, which should be removed.
A few lines are also still crammed together:
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
The result should be:
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
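One way to get that result can be sketched as follows. The sub above calls Match once per line and then writes the whole input line, so anything else on that line (a second record, the header text) is printed along with it. The sketch instead runs Matches over the text and keeps only the matched records, one per output line. The pattern is an assumption based on the sample rows, and the names ExtractRecords and SplitRecordsSketch are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SplitRecordsSketch
    ' Assumed record layout, inferred only from the sample lines in this post.
    Private ReadOnly RecordPattern As New Regex(
        "\d+\s+[A-Z]{4}\d{10}\s+OD\d{18}\s+" &
        "[A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M\s+" &
        "EXPRESS(?:\s*,\s*Replacement Order)?")

    ' Returns every record found in the text, in order of appearance.
    ' Anything between records (such as the repeated header text) is dropped.
    Public Function ExtractRecords(text As String) As List(Of String)
        Dim records As New List(Of String)
        For Each m As Match In RecordPattern.Matches(text)
            ' Collapse runs of whitespace so every record is a single tidy line.
            records.Add(Regex.Replace(m.Value, "\s+", " "))
        Next
        Return records
    End Function

    Sub Main()
        Dim crammed As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS" &
            "66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS " &
            "Tracking Id Forms Required Order Id RTS done on Notes" &
            "492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"
        For Each record As String In ExtractRecords(crammed)
            Console.WriteLine(record)
        Next
    End Sub
End Module
```

The same function could be fed the whole file at once, for example with `File.ReadAllText`, since it does not depend on where the line breaks fall.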
Any help would be great! Thank you!
-
Sep 1st, 2020, 06:01 AM
#2
Re: RegEx of txt that has been converted from PDF
innww,
A quick Google search told me there are several ways to go about converting PDF files to .txt files. https://www.google.com/search?rlz=1C...4dUDCA0&uact=5.
The code you posted looks fairly decent and without examining it in depth I wouldn't know what exactly could be going wrong.
yours,
Peter Swinkels
-
Sep 1st, 2020, 07:44 AM
#3
Thread Starter
New Member
Re: RegEx of txt that has been converted from PDF
Thank you for the attempt! The issue was with the RegEx. I have resolved it! Thank you.
-
Sep 1st, 2020, 11:51 AM
#4
Re: RegEx of txt that has been converted from PDF
Good to hear! Did you mark the thread as resolved? And you're welcome, btw.