Thread: RegEx of txt that has been converted from PDF

  1. #1

    Thread Starter
    New Member
    Join Date
    Aug 2020
    Posts
    6

    RegEx of txt that has been converted from PDF

    I need to convert a PDF into a format that can be loaded into a DataGridView.

    The only solution I can come up with is using iTextSharp to convert the PDF to a text file. For the most part, the format is kept.

    Here is the code that converts the PDF to text.

    Code:
    Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
    Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
    Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
    Dim mPageCount As Integer = mPDFreader.NumberOfPages
    Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
    Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    'Create the text file. Using closes it even if extraction throws.
    Using fs As FileStream = File.Create(mTXT)
        For i As Integer = 1 To mPageCount
            'Extract the text of page i and append it to the file as UTF-8.
            'Pages are written back to back, with no separator between them.
            strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
            Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
            fs.Write(info, 0, info.Length)
        Next
    End Using
    mPDFreader.Close()

    The text output ends up looking like this. (also see attached copy of file.txt)


    63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS

    64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS

    65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS

    67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS


    Which is "pretty close".


    The issue is the lines where EXPRESS is immediately followed by the next record's number (look at record 65, where record 66 starts on the same line). It should look like this throughout, to make adding it to a DataGridView easier:


    63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS

    64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS

    65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS

    66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS

    67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS


    My attempt was to use a RegEx to remove everything but this format:


    "FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS"

    In some cases it may end a bit differently, like this:

    FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order

    The RegEx is:

    Code:
    (\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)

    Question: Does anyone have a better or cleaner solution? What I need is:

    The PDF somehow converted to a format that can be loaded into a DataGrid in the appropriate rows and columns.

    Any method that does this would be appreciated.

    Edit:

    I am using RegEx at the moment. This is the sub:

    Code:
    Private Sub Fixtext()
        Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
        Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
            While (True)
                Dim line As String = reader.ReadLine()
                If line = Nothing Then
                    Return
                End If
                Dim match As Match = regex.Match(line)
                If match.Success Then
                    Dim value As String = match.Groups(1).Value
                    Console.WriteLine(line)
                End If
            End While
        End Using
    End Sub
    The problem is that the output still has a few issues.

    490 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS 491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS 494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS 495 FMPC0847571813 OD119522919323643000 Aug 28, 2020 03:21 PM EXPRESS
    That is one issue: it isn't removing the "Tracking Id Forms Required Order Id RTS done on Notes" text, which should be removed.

    And a few lines are still crammed together.

    65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
    The result should be:

    65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS

    66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS

    Any help would be great! Thank you!
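
    A few things in the sub above would explain the output shown. Console.WriteLine(line) prints the whole input line rather than the matched text, so anything else on the line (such as the "Tracking Id ..." text) is printed too. Regex.Match returns only the first match, so a second record on the same line is never separated. And in VB, line = Nothing is also True for an empty string, so the loop returns at the first blank line; line Is Nothing tests for end of file only.

    A minimal sketch of one way around this, not necessarily the fix the poster ended up with: run Regex.Matches over the whole text and emit each match on its own line. The pattern is an assumption inferred from the sample records in this post (a serial number, "FMP" plus a letter and 10 digits, "OD" plus 18 digits, a date, a time, EXPRESS, and an optional ", Replacement Order"). The names RecordSplitter and SplitRecords are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RecordSplitter
    ' Assumed record layout, inferred from the samples in the thread.
    ' The serial number is not anchored on its left, so "EXPRESS66 ..." and
    ' "Notes492 ..." still yield records starting at 66 and 492.
    Private ReadOnly RecordPattern As New Regex(
        "\d+\s+FMP[A-Z]\d{10}\s+OD\d{18}\s+" &
        "[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4}\s+\d{2}:\d{2}\s+[AP]M\s+" &
        "EXPRESS(?:\s*,\s*(?i:Replacement\s+Order))?")

    ' Returns one string per record found anywhere in rawText.
    ' Text that is not part of a record falls outside every match and is dropped.
    Public Function SplitRecords(rawText As String) As List(Of String)
        Dim records As New List(Of String)
        For Each m As Match In RecordPattern.Matches(rawText)
            ' Collapse whitespace runs (including line breaks) to single spaces.
            records.Add(Regex.Replace(m.Value, "\s+", " "))
        Next
        Return records
    End Function

    Sub Main()
        ' Two problem cases taken from the post: records run together,
        ' and stray text between records.
        Dim sample As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS" &
            "66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS" &
            Environment.NewLine &
            "491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS " &
            "Tracking Id Forms Required Order Id RTS done on Notes" &
            "492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"

        For Each record As String In SplitRecords(sample)
            Console.WriteLine(record)
        Next
    End Sub
End Module
```

    Matching against the full text (for example the result of File.ReadAllText) instead of line by line means the line breaks in the extracted file no longer matter. If the real data contains tracking or order IDs in other shapes, the pattern would need loosening.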

  2. #2
    Frenzied Member
    Join Date
    Feb 2003
    Posts
    1,807

    Re: RegEx of txt that has been converted from PDF

    innww,

    A quick Google search told me there are several ways to go about converting PDF files to .txt files. https://www.google.com/search?rlz=1C...4dUDCA0&uact=5.

    The code you posted looks fairly decent and without examining it in depth I wouldn't know what exactly could be going wrong.

    yours,
    Peter Swinkels

  3. #3

    Thread Starter
    New Member
    Join Date
    Aug 2020
    Posts
    6

    Re: RegEx of txt that has been converted from PDF

    Quote: Originally posted by Peter Swinkels
    innww,

    A quick Google search told me there are several ways to go about converting PDF files to .txt files. https://www.google.com/search?rlz=1C...4dUDCA0&uact=5.

    The code you posted looks fairly decent and without examining it in depth I wouldn't know what exactly could be going wrong.

    yours,
    Peter Swinkels

    Thank you for the attempt! The issue was with the RegEx. I have resolved it! Thank you.
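
    The corrected pattern was not posted in the thread. For later readers, here is a minimal sketch of how the records could be split into columns for a DataGridView, assuming the record layout seen in the samples in post #1. The pattern, the column names, and the names RecordTable and BuildTable are all assumptions made for illustration, not the poster's actual fix.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module RecordTable
    ' Assumed layout: serial, tracking id, order id, date, time, service,
    ' and an optional ", Replacement Order" note. Named groups map to columns.
    Private ReadOnly FieldPattern As New Regex(
        "(?<serial>\d+)\s+(?<tracking>FMP[A-Z]\d{10})\s+(?<order>OD\d{18})\s+" &
        "(?<date>[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4})\s+(?<time>\d{2}:\d{2}\s+[AP]M)\s+" &
        "(?<service>EXPRESS)(?:\s*,\s*(?<note>(?i:Replacement\s+Order)))?")

    ' Builds a DataTable with one row per record found anywhere in rawText.
    Public Function BuildTable(rawText As String) As DataTable
        Dim table As New DataTable("Shipments")
        ' Hypothetical column names; all values are kept as strings.
        For Each columnName As String In
            {"Serial", "TrackingId", "OrderId", "Date", "Time", "Service", "Note"}
            table.Columns.Add(columnName, GetType(String))
        Next

        For Each m As Match In FieldPattern.Matches(rawText)
            ' A group that did not take part in the match has an empty Value,
            ' so Note is "" for records without the replacement suffix.
            table.Rows.Add(
                m.Groups("serial").Value,
                m.Groups("tracking").Value,
                m.Groups("order").Value,
                m.Groups("date").Value,
                m.Groups("time").Value,
                m.Groups("service").Value,
                m.Groups("note").Value)
        Next
        Return table
    End Function

    Sub Main()
        Dim sample As String =
            "66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS" &
            "67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order"

        Dim table As DataTable = BuildTable(sample)
        For Each row As DataRow In table.Rows
            Console.WriteLine(String.Join(" | ", row.ItemArray))
        Next
    End Sub
End Module
```

    In a WinForms project the table could then be bound with something like DataGridView1.DataSource = BuildTable(File.ReadAllText(mTXT)), where DataGridView1 is a placeholder control name and mTXT is the text file path from post #1.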

  4. #4
    Frenzied Member
    Join Date
    Feb 2003
    Posts
    1,807

    Re: RegEx of txt that has been converted from PDF

    Good to hear! Did you mark the thread as resolved? And you're welcome, btw.
