[RESOLVED] I've got new problem can someone please shed some light on this for me

**M@dH@tter** · Jan 30th, 2013, 10:27 PM

I'm downloading files from a web site. They're PDF files, and for some reason the web site is appending the HTML page source to the end of each PDF file.

So the PDF files will not open. I can open a PDF in UltraEdit, search for "EOF.....<!DOCTYPE html PUBLIC", cut everything after the EOF, and add 0A to the end, and the PDF file is fixed.
My problem is that all of the PDF files at this web site are doing this.

How can I code something that will start at the end of the PDF file and scan backwards, find the address where the offending code begins, save everything from the beginning of the file up to that address, and add the 0A to the end of it?

Or whichever way would be better, beginning to end or end to beginning.

**iamcpc** · Jan 30th, 2013, 10:47 PM

You could read the source code into a string and then split the string as needed, or use the Left() or Mid() functions to shave characters off the string.

**techgnome** · Jan 30th, 2013, 10:53 PM

Ultimately it depends on how you're downloading it too.... that would be the first thing I'd look at... make sure that when you're downloading it, you're getting the PDF and just the PDF... seems a bit odd that you're getting extra stuff...

-tg

**M@dH@tter** · Jan 30th, 2013, 11:20 PM

Hi, no, the way it's downloading is apparently a bug with the web site. The file extensions are correct, and the XML files use the same type of link and those files are fine, as are the zip files. But the PDFs for some reason are getting the HTML code attached to the end. I know it's odd; it's the first time I've ever run across this type of issue.

As for making the HTML a string, I don't think that will work. It would mean reading the whole PDF as a string, and PDFs, like any file that isn't plain text, contain non-printable bytes, so reading it as a string won't work.

**stanav** · Jan 31st, 2013, 09:27 AM

This may work: open the pdf as a text file and chop off what you don't need. Convert the remaining string to bytes and then write it back to a file with a .pdf extension using BinaryWriter. If you provide a sample pdf file, I'll see what I can do...

**M@dH@tter** · Jan 31st, 2013, 01:31 PM

Took a while to find something small enough to attach. The attached file gives a good idea of what's going on.

**dunfiddlin** · Jan 31st, 2013, 01:38 PM

The attached file gives a good idea of what's going on.

Well, not really, as my PDF reader simply reads this as a PDF file without any apparent difficulty or any extraneous matter displayed. As the guys said, without some information on the method of download (weirdly, we like code better than vague descriptions!) and, if possible, the site in question there's really not a whole lot we can do!

**M@dH@tter** · Jan 31st, 2013, 02:23 PM

If you open that file in a hex editor, you will see quite clearly that from the end of the file upwards there is nothing but HTML code.
Not the whole file, mind you, but at some point you find the beginning of the HTML file, and just before that you will find the actual end of the PDF: EOF, or 454F46 in hex. Removing everything after that EOF and making the last byte 0A fixes the file. I'm not sure how you got it to open in Acrobat, as I've tried Reader and the full version of 7. Maybe a newer Acrobat like 11 will open it, but I'm not installing something I don't need just to open a file that, once trimmed to the correct length, opens just fine in the version I have installed.

**M@dH@tter** · Jan 31st, 2013, 03:34 PM

After taking another look at a good PDF file, I realized that the 0A I mentioned adding to the end is not required.
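
The manual fix described above (find the real end of the PDF and drop everything after it) could be sketched roughly as follows. This is only a sketch under some assumptions: the names `PdfTrimmer`, `LastIndexOf` and `TrimAfterEof` are made up for illustration, the marker searched for is the full `%%EOF` trailer that PDF files end with, the appended HTML is assumed not to contain the text `%%EOF` itself, and the file is assumed small enough to read into memory in one go. It works on raw bytes, so the non-printable content of the PDF is never converted to a string.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Public Module PdfTrimmer
    ' Returns the index of the last occurrence of pattern in data, or -1 if absent.
    ' The search runs from the end of the array towards the beginning.
    Public Function LastIndexOf(ByVal data As Byte(), ByVal pattern As Byte()) As Integer
        For start As Integer = data.Length - pattern.Length To 0 Step -1
            Dim matched As Boolean = True
            For j As Integer = 0 To pattern.Length - 1
                If data(start + j) <> pattern(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then Return start
        Next
        Return -1
    End Function

    ' Writes a copy of sourcePdf that ends right after the last %%EOF marker.
    ' Returns False (and writes nothing) when no marker is found.
    Public Function TrimAfterEof(ByVal sourcePdf As String, ByVal outfile As String) As Boolean
        Dim data As Byte() = File.ReadAllBytes(sourcePdf)
        Dim marker As Byte() = Encoding.ASCII.GetBytes("%%EOF")
        Dim pos As Integer = LastIndexOf(data, marker)
        If pos < 0 Then Return False
        ' Keep everything up to and including the marker; no trailing 0A is added.
        Dim keep As Integer = pos + marker.Length
        Using fs As New FileStream(outfile, FileMode.Create)
            fs.Write(data, 0, keep)
        End Using
        Return True
    End Function
End Module
```

Writing to a separate output file rather than overwriting the download leaves the original untouched in case the marker turns up somewhere unexpected.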

**dunfiddlin** · Jan 31st, 2013, 03:49 PM

I use Foxit Reader (free and free from Adobe bloat) but it also shows perfectly adequately in Universal Viewer and Internet Explorer.

**stanav** · Jan 31st, 2013, 04:38 PM

If you use Notepad to open the sample pdf file you uploaded, you'll see that it uses external references. The xref points to the 2nd half of the file, which is the html source code of a web page. To get rid of the html, you will need to open the file in a pdf reader and then save it. The act of opening and saving seems to consolidate those external references and gets rid of the html. From these findings, I've come up with a solution for you. You can use iTextSharp to open the original file and then use PdfCopy to save a copy of the file, which will be in proper pdf format. After that, you can delete the original and rename the newly created file to the old file name (optional).
Here is the code for making a copy of the pdf file using iTextSharp:

Code:

Public Shared Sub FixPdf(ByVal sourcePdf As String)
    Dim ext As String = IO.Path.GetExtension(sourcePdf)
    Dim fileName As String = IO.Path.GetFileNameWithoutExtension(sourcePdf)
    'The repaired copy is written next to the original as <name>_fixed<ext>
    Dim outfile As String = IO.Path.Combine(IO.Path.GetDirectoryName(sourcePdf), String.Format("{0}_fixed{1}", fileName, ext))
    Dim reader As iTextSharp.text.pdf.PdfReader = Nothing
    Try
        reader = New iTextSharp.text.pdf.PdfReader(sourcePdf)
        Using fs As New IO.FileStream(outfile, IO.FileMode.Create)
            Dim doc As New iTextSharp.text.Document(reader.GetPageSizeWithRotation(1))
            Dim pdfCpy As New iTextSharp.text.pdf.PdfCopy(doc, fs)
            doc.Open()
            'Copy every page into the new document (page numbers start at 1)
            For i As Integer = 1 To reader.NumberOfPages
                pdfCpy.AddPage(pdfCpy.GetImportedPage(reader, i))
            Next
            'Closing the document finishes writing the new pdf
            doc.Close()
        End Using
    Finally
        'Release the source file even if the copy fails
        If reader IsNot Nothing Then reader.Close()
    End Try
    'Delete the original and rename the new pdf. This is optional, of course...
    'IO.File.Delete(sourcePdf)
    'IO.File.Move(outfile, sourcePdf)
End Sub
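
Since every PDF from the site has the same problem, a routine like FixPdf would probably be run over a whole download folder. A minimal sketch of that, assuming a hypothetical helper named `FixAll` that takes the per-file routine as a delegate and skips any `_fixed` copies left by an earlier run:

```vbnet
Imports System
Imports System.IO

Public Module PdfBatch
    ' Calls fixOne for every .pdf file in folder, skipping "_fixed" output
    ' from a previous run. Returns the number of files handed to fixOne.
    Public Function FixAll(ByVal folder As String, ByVal fixOne As Action(Of String)) As Integer
        Dim count As Integer = 0
        For Each pdfPath As String In Directory.GetFiles(folder, "*.pdf")
            Dim baseName As String = Path.GetFileNameWithoutExtension(pdfPath)
            If baseName.EndsWith("_fixed", StringComparison.OrdinalIgnoreCase) Then
                Continue For
            End If
            fixOne(pdfPath)
            count += 1
        Next
        Return count
    End Function
End Module
```

It could be called as `PdfBatch.FixAll("C:\Downloads\pdfs", AddressOf FixPdf)`, where the folder path is just a placeholder.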

**M@dH@tter** · Jan 31st, 2013, 08:32 PM

that'll work..thanks bud
