Thread: [RESOLVED] I've got new problem can someone please shed some light on this for me

  1. #1

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    [RESOLVED] I've got new problem can someone please shed some light on this for me

    I'm downloading files from a web site. They're PDF files. For some reason the web site is appending the HTML page source code to the end of each PDF file.

    So the PDF files will not open. I can open a PDF file in UltraEdit, search the doc for "EOF.....<!DOCTYPE html PUBLIC", cut everything after the EOF, add 0A to the end, and the PDF file is fixed.
    My problem is that all of the PDF files at this web site are doing this.

    How can I code something that will start at the end of the PDF file and scan backwards, find the offending code's address, save from the beginning of the file to that address, and add the 0A to the end of it?

    Or whichever way would be better, beginning to end or end to beginning.

  2. #2
    Addicted Member
    Join Date
    Jan 2013
    Location
    Overland Park Kansas
    Posts
    183

    Re: I've got new problem can someone please shed some light on this for me

    You could read the source code into a string and then split the string as needed, or use the Left() or Mid() functions to shave characters off the string.

  3. #3
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    34,687

    Re: I've got new problem can someone please shed some light on this for me

    Ultimately it depends on how you're downloading it, too. That would be the first thing I'd look at: make sure that when you're downloading it, you're getting the PDF and just the PDF. It seems a bit odd that you're getting extra stuff.

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*

  4. #4

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    Re: I've got new problem can someone please shed some light on this for me

    Hi, no, the way it's downloading is apparently a bug with the web site. The file extensions are correct. The XML files use the same type of link and those files are fine, as are the zip files. But the PDFs for some reason are getting the HTML code attached to the end. I know it's odd; it's the first time I've ever run across this type of issue.

    As for making the HTML a string, I don't think that will work. It would mean reading the whole PDF as a string, and PDFs, like any file that isn't plain text, contain non-printable bytes, so reading it as a string won't work.

  5. #5
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,290

    Re: I've got new problem can someone please shed some light on this for me

    This may work: open the PDF as a text file and chop off what you don't need. Convert the remaining string to bytes and then write it back to a file with a .pdf extension using BinaryWriter. If you provide a sample PDF file, I'll see what I can do.
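    A rough sketch of that text-based idea, for illustration only. It assumes the file is decoded as ISO-8859-1, which maps every byte value to the character with the same code, so the binary parts of the PDF survive the round trip. It also assumes the appended page starts with "<!DOCTYPE html" and that the PDF proper ends with the "%%EOF" marker. The module and method names (PdfTextTrim, ChopAsText) are made up for this example.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Text

    Module PdfTextTrim

        ' Reads the file as ISO-8859-1 text, drops everything after the "%%EOF"
        ' that precedes the appended HTML, and writes the remaining bytes out.
        Sub ChopAsText(sourcePath As String, destPath As String)
            ' ISO-8859-1 maps each byte 0-255 to the character with the same
            ' code, so decoding and re-encoding does not alter binary data.
            Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
            Dim content As String = latin1.GetString(File.ReadAllBytes(sourcePath))

            ' Assumed marker for the start of the appended page.
            Dim htmlStart As Integer = content.LastIndexOf("<!DOCTYPE html", StringComparison.Ordinal)
            If htmlStart < 0 Then Return ' no appended HTML found; nothing to do

            ' Search backwards from the HTML for the PDF end-of-file marker.
            Dim eofStart As Integer = content.LastIndexOf("%%EOF", htmlStart, StringComparison.Ordinal)
            If eofStart < 0 Then Return

            Dim trimmed As String = content.Substring(0, eofStart + "%%EOF".Length)
            Using writer As New BinaryWriter(File.Open(destPath, FileMode.Create))
                ' Write(Byte()) emits the raw bytes with no length prefix.
                writer.Write(latin1.GetBytes(trimmed))
            End Using
        End Sub

    End Module
    ```

    Decoding with a multi-byte encoding such as UTF-8 instead would likely corrupt the binary streams, which is why the encoding choice matters here.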
    Let us have faith that right makes might, and in that faith, let us, to the end, dare to do our duty as we understand it.
    - Abraham Lincoln -

  6. #6

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    Re: I've got new problem can someone please shed some light on this for me

    It took a while to find something small enough to attach. The attached file gives a good idea of what's going on.
    Attached Files

  7. #7
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 2012
    Posts
    8,245

    Re: I've got new problem can someone please shed some light on this for me

    "The attached file gives a good idea of what's going on."
    Well, not really, as my PDF reader simply reads this as a PDF file without any apparent difficulty or any extraneous matter displayed. As the guys said, without some information on the method of download (weirdly, we like code better than vague descriptions!) and, if possible, the site in question there's really not a whole lot we can do!
    As the 6-dimensional mathematics professor said to the brain surgeon, "It ain't Rocket Science!"

    Reviews: "dunfiddlin likes his DataTables" - jmcilhinney

    Please be aware that whilst I will read private messages (one day!) I am unlikely to reply to anything that does not contain offers of cash, fame or marriage!

  8. #8

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    Re: I've got new problem can someone please shed some light on this for me

    If you open that file in a hex editor, you will see quite clearly that from the end of the file upwards there is nothing but HTML code.
    Not the whole file, mind you, but at some point you find the beginning of the HTML file, and just before that you will find the actual end of the PDF: EOF, or 45 4F 46 in hex. Removing everything after that EOF and making the last byte 0A fixes the file. I'm not sure how you got it to open in Acrobat, as I've tried Reader and the full version of 7. Maybe a newer Acrobat like 11 will open it, but I'm not installing something I don't need just to open a file that, when fixed to the correct length, will open just fine in the version I have installed.

  9. #9

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    Re: I've got new problem can someone please shed some light on this for me

    After taking another look at a good PDF file, I realized that the 0A I mentioned that needed to be put on the end is not required.
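    For anyone wanting to automate the manual fix, here is a minimal byte-level sketch. It scans backwards for the start of the appended HTML, then backwards again for the end-of-file marker, and saves everything up to and including that marker with no extra 0A. The marker strings are assumptions: "<!DOCTYPE html" for the page, and "%%EOF" for the PDF end (the full form of the EOF bytes 45 4F 46). The names PdfTrimmer, LastIndexOfBytes and TrimAppendedHtml are invented for the example.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Text

    Module PdfTrimmer

        ' Returns the index of the last occurrence of pattern that lies
        ' entirely before index limit in data, or -1 if there is none.
        Function LastIndexOfBytes(data As Byte(), pattern As Byte(), limit As Integer) As Integer
            For start As Integer = Math.Min(limit, data.Length) - pattern.Length To 0 Step -1
                Dim matched As Boolean = True
                For j As Integer = 0 To pattern.Length - 1
                    If data(start + j) <> pattern(j) Then
                        matched = False
                        Exit For
                    End If
                Next
                If matched Then Return start
            Next
            Return -1
        End Function

        ' Saves the part of sourcePath that ends with the last "%%EOF" before
        ' the appended HTML. Returns True if destPath was written, False if
        ' either marker was not found.
        Function TrimAppendedHtml(sourcePath As String, destPath As String) As Boolean
            Dim data As Byte() = File.ReadAllBytes(sourcePath)
            Dim htmlMarker As Byte() = Encoding.ASCII.GetBytes("<!DOCTYPE html")
            Dim eofMarker As Byte() = Encoding.ASCII.GetBytes("%%EOF")

            ' Scan backwards from the end of the file for the start of the page.
            Dim htmlStart As Integer = LastIndexOfBytes(data, htmlMarker, data.Length)
            If htmlStart < 0 Then Return False

            ' The real end of the PDF is the last %%EOF before the HTML.
            Dim eofStart As Integer = LastIndexOfBytes(data, eofMarker, htmlStart)
            If eofStart < 0 Then Return False

            Dim keep As Integer = eofStart + eofMarker.Length
            Using output As New FileStream(destPath, FileMode.Create, FileAccess.Write)
                output.Write(data, 0, keep)
            End Using
            Return True
        End Function

    End Module
    ```

    A call might look like PdfTrimmer.TrimAppendedHtml("C:\downloads\report.pdf", "C:\downloads\report_fixed.pdf"), where both paths are placeholders. Working on the raw bytes avoids any text-encoding questions, at the cost of loading the whole file into memory.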

  10. #10
    PowerPoster dunfiddlin's Avatar
    Join Date
    Jun 2012
    Posts
    8,245

    Re: I've got new problem can someone please shed some light on this for me

    I use Foxit Reader (free and free from Adobe bloat) but it also shows perfectly adequately in Universal Viewer and Internet Explorer.

  11. #11
    PowerPoster stanav's Avatar
    Join Date
    Jul 2006
    Location
    Providence, RI - USA
    Posts
    9,290

    Re: I've got new problem can someone please shed some light on this for me

    If you use Notepad to open the sample PDF file you uploaded, you'll see that it uses external references. The xref points to the second half of the file, which is the HTML source code of a web page. To get rid of the HTML, you need to open the file in a PDF reader and then save it. The act of opening and saving seems to consolidate those external references and gets rid of the HTML. From these findings, I've come up with a solution for you: use iTextSharp to open the original file and then use PdfCopy to save a copy of the file, which will be in proper PDF format. After that, you can delete the original and rename the newly created file to the old file name (optional).
    Here is the code for making a copy of the PDF file using iTextSharp:
    Code:
    Public Shared Sub FixPdf(ByVal sourcePdf As String)
        Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
        Dim ext As String = IO.Path.GetExtension(sourcePdf)
        Dim fileName As String = IO.Path.GetFileNameWithoutExtension(sourcePdf)
        'The copy is written next to the original as <name>_fixed<ext>.
        Dim outfile As String = IO.Path.Combine(IO.Path.GetDirectoryName(sourcePdf), String.Format("{0}_fixed{1}", fileName, ext))
        Try
            reader = New iTextSharp.text.pdf.PdfReader(sourcePdf)
            Dim doc As New iTextSharp.text.Document(reader.GetPageSizeWithRotation(1))
            Dim pdfCpy As New iTextSharp.text.pdf.PdfCopy(doc, New IO.FileStream(outfile, IO.FileMode.Create))
            doc.Open()
            'Copy every page into the new file (page numbers start at 1).
            For i As Integer = 1 To reader.NumberOfPages
                pdfCpy.AddPage(pdfCpy.GetImportedPage(reader, i))
            Next
            'Closing the document finishes the copy and closes the output stream.
            doc.Close()
        Finally
            'Release the reader even if the copy fails; errors still reach the caller.
            If reader IsNot Nothing Then reader.Close()
        End Try
        'Delete the original and rename the new pdf. This is optional, of course...
        'IO.File.Delete(sourcePdf)
        'IO.File.Move(outfile, sourcePdf)
    End Sub
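    Since every PDF from the site is affected, the routine could be run over a whole download folder. The following is only a sketch: BatchFix and FixAllPdfs are invented names, and the repair routine is passed in as a delegate so that the snippet does not assume which class FixPdf lives in. It skips files whose names already end in "_fixed", which is the suffix FixPdf gives its output.

    ```vbnet
    Imports System
    Imports System.IO

    Module BatchFix

        ' Calls fixOne for every .pdf file in folder, skipping output from an
        ' earlier run. Returns the number of files that could not be fixed.
        Function FixAllPdfs(folder As String, fixOne As Action(Of String)) As Integer
            Dim failures As Integer = 0
            For Each pdfPath As String In Directory.GetFiles(folder, "*.pdf")
                Dim name As String = Path.GetFileNameWithoutExtension(pdfPath)
                If name.EndsWith("_fixed", StringComparison.OrdinalIgnoreCase) Then Continue For
                Try
                    fixOne(pdfPath)
                Catch ex As Exception
                    ' One unreadable file should not stop the rest of the batch.
                    Console.WriteLine("Could not fix {0}: {1}", pdfPath, ex.Message)
                    failures += 1
                End Try
            Next
            Return failures
        End Function

    End Module
    ```

    Usage would be along the lines of BatchFix.FixAllPdfs("C:\downloads", AddressOf PdfTools.FixPdf), where the folder and the PdfTools class name are placeholders for wherever FixPdf is actually declared.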

  12. #12

    Thread Starter
    Hyperactive Member
    Join Date
    Mar 2012
    Posts
    281

    Re: I've got new problem can someone please shed some light on this for me

    That'll work. Thanks, bud.
