
Thread: replace and edit huge ansi file

  1. #1

    Thread Starter
    New Member
    Join Date
    Sep 2017
    Posts
    12

    Question replace and edit huge ansi file

    Hi,
    I have a very large file in ANSI format, sometimes more than 8 GB.
    I read it line by line and write each line to another file, doing the replace as I go,
    but after editing, some lines get messed up and the file size changes.

    For example, if I replace "user" with "user", the file size must not change, but it does.
    Sorry for my English.


    Code:
    
      ' Needs Imports System.Text (for Encoding) at the top of the file.
      Sub Main()

          Dim args() As String = System.Environment.GetCommandLineArgs()

          ' args(0) is the exe path, so the replace command needs args(1) to args(5).
          If args.Length > 5 Then

              If args(1) = "replace" Then

                  Dim MyFile As String = args(2)
                  Dim MyTempFile As String = args(3)

                  If IO.File.Exists(MyFile) Then
                      ' Read the source file and write the edited copy, both with Encoding.Default.
                      Dim sr As New IO.StreamReader(MyFile, Encoding.Default)
                      Dim sw As New IO.StreamWriter(MyTempFile, False, Encoding.Default)

                      ' Copy line by line, replacing args(4) with args(5) in each line.
                      While Not (sr.Peek = -1)
                          Dim Line As String = sr.ReadLine()
                          Line = Line.Replace(args(4), args(5))
                          sw.WriteLine(Line)
                      End While
                      sr.Close()
                      sw.Close()

                      GC.Collect()
                      ' Overwrite the original file with the edited copy.
                      IO.File.Copy(MyTempFile, MyFile, True)
                  End If

                  Console.WriteLine("done")

              End If

          End If

      End Sub

  2. #2
    .NUT jmcilhinney's Avatar
    Join Date
    May 2005
    Location
    Sydney, Australia
    Posts
    100,585

    Re: replace and edit huge ansi file

    Have you confirmed that Line has the same value before and after the Replace call? Are you sure that the file was written using the default encoding on the current machine in the first place?
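    For example, a quick check (just a sketch, reusing Line and args(4)/args(5) from the code in post #1) is to compare the line before and after the call and log any difference:
    Code:

      Dim before As String = Line
      Line = Line.Replace(args(4), args(5))
      If before <> Line Then
          ' The replacement actually changed something on this line.
          Console.WriteLine("changed: " & before)
      End If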
    Why is my data not saved to my database? | MSDN Data Walkthroughs
    VBForums Database Development FAQ
    My CodeBank Submissions: VB | C#
    My Blog: Data Among Multiple Forms (3 parts)
    Beginner Tutorials: VB | C# | SQL

  3. #3

    Thread Starter
    New Member
    Join Date
    Sep 2017
    Posts
    12

    Re: replace and edit huge ansi file

    I commented out that line (' Line = Line.Replace(args(4), args(5))) to narrow down the problem.
    The problem is in the reading and writing of the file.
    For example, with the file in the attachment:
    before reading, the file size is 6,688 KB, but after reading it and writing it to another file it is 6,697 KB.

    temp.zip

  4. #4
    PowerPoster techgnome's Avatar
    Join Date
    May 2002
    Posts
    31,943

    Re: replace and edit huge ansi file

    My guess is that it's the encoding. You're using the Encoding.Default, which could be anything. If I remember right, on a reader it will default to the encoding of the file... I don't remember what the result is on a writer... but my guess is that the source file is using one encoding (like ASCII for instance) and the other is using something different (like UTF-16).
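    One way to keep both sides consistent (a sketch only, reusing MyFile/MyTempFile from post #1) is to let the StreamReader sniff the first bytes for a BOM and then reuse whatever encoding it detected for the writer:
    Code:

      ' Let the reader detect the encoding from a BOM, then give the same encoding to the writer.
      Dim sr As New IO.StreamReader(MyFile, Encoding.Default, detectEncodingFromByteOrderMarks:=True)
      sr.Peek() ' forces the reader to look at the first bytes
      ' If there is no BOM, CurrentEncoding just stays at the fallback passed in (Encoding.Default here).
      Dim detected As Encoding = sr.CurrentEncoding
      Dim sw As New IO.StreamWriter(MyTempFile, False, detected)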

    On a side note... if you need to keep the files the same size, you need to do a little more than this:
    Line = Line.Replace(args(4), args(5))

    What if you're replacing "Users" with "User" ... it would shorten the file by a character for each replacement.... or the other way around: replacing "User" with "Users" ... now you're making the file one character longer for each replacement... so if the file length should never change, you need to add some logic to make sure that the lengths of the two strings are equal. If the new string is shorter, it should be padded with spaces (or something)... if it's longer, it needs to be truncated. -- That could also account for the size difference... your example replaced like for like but I don't know if that was just a for example, or a live case. But if you replace something with something longer, that would account for a larger file.
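    Something along these lines (just a sketch, using the args(4)/args(5) values from post #1) would force every replacement to be the same length as the text it replaces:
    Code:

      ' Pad or truncate the replacement so it is exactly as long as the search text
      ' (assumes single-byte characters, so string length matches byte length).
      Function SameLengthReplacement(oldText As String, newText As String) As String
          If newText.Length < oldText.Length Then
              Return newText.PadRight(oldText.Length)     ' pad a shorter replacement with spaces
          ElseIf newText.Length > oldText.Length Then
              Return newText.Substring(0, oldText.Length) ' truncate a longer replacement
          End If
          Return newText
      End Function

      ' Usage:
      ' Line = Line.Replace(args(4), SameLengthReplacement(args(4), args(5)))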

    -tg
    * I don't respond to private (PM) requests for help. It's not conducive to the general learning of others.*
    * I also don't respond to friend requests. Save a few bits and don't bother. I'll just end up rejecting anyways.*
    * How to get EFFECTIVE help: The Hitchhiker's Guide to Getting Help at VBF - Removing eels from your hovercraft *
    * How to Use Parameters * Create Disconnected ADO Recordset Clones * Set your VB6 ActiveX Compatibility * Get rid of those pesky VB Line Numbers * I swear I saved my data, where'd it run off to??? *

  5. #5

    Thread Starter
    New Member
    Join Date
    Sep 2017
    Posts
    12

    Re: replace and edit huge ansi file

    Forget about the replace for now.

    I tried Encoding.Default, UTF-16, iso-8859-15 and ASCII, but in every case the file size changed after reading and writing to the new file.

    I also tried System.Text.Encoding.Unicode, and again the file size changed by 2 KB and some characters turned into Chinese-looking text.
    Last edited by astarali; Dec 8th, 2017 at 09:01 AM.

  6. #6
    Super Moderator si_the_geek's Avatar
    Join Date
    Jul 2002
    Location
    Bristol, UK
    Posts
    40,341

    Re: replace and edit huge ansi file

    When you don't already know the encoding of a file, open it in a hex editor to see what the first few bytes are - because that is where the encoding is indicated, if there is a byte order mark.

    In this case, looking at the smaller file shows that over 1000 bytes at the start are 0, which shows there is no byte order mark (no encoding marker) at all. If there is an option for "no encoding" (plain bytes), that is what you should use in this case.
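    A quick way to do the same check from code (a sketch; "path" is just a placeholder for the file you want to inspect) is to read the first few bytes and print them:
    Code:

      ' Dump the first 4 bytes; EF-BB-BF = UTF-8 BOM, FF-FE = UTF-16 LE BOM, FE-FF = UTF-16 BE BOM.
      Dim head(3) As Byte
      Using fs As New IO.FileStream(path, IO.FileMode.Open, IO.FileAccess.Read)
          fs.Read(head, 0, head.Length)
      End Using
      Console.WriteLine(BitConverter.ToString(head))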

  7. #7

    Thread Starter
    New Member
    Join Date
    Sep 2017
    Posts
    12

    Re: replace and edit huge ansi file

    Yes, you are right. So I am trying to read the file into byte arrays and edit it in hex.

  8. #8
    You don't want to know.
    Join Date
    Aug 2010
    Posts
    4,580

    Re: replace and edit huge ansi file

    Argh. Pet peeve. "Editing in hex" is not a real phrase. I see this get people into very interesting situations that are hard to dig out of.

    A "hex editor" is really just "a binary editor". It's for editing the bytes of a file without much interpretation by the editor. 2 hexadecimal digits represent 8 bits, which makes 2-digit columns of hexadecimal digits a very convenient way for binary editors to represent the bytes in a file. But you aren't "editing hex" and there's nothing magic about hexadecimal. Most editors are happy to switch to octal or binary modes, too. The reason I nitpick is I often see people very hung up on strange ideas about editing bytes. The byte arrays are the same thing as "editing hex".

    You are in the worst possible case for text encoding: ANSI with an unknown code page. You might not be able to solve this problem. ANSI follows ASCII for the first 128 byte values (0-127), and every value above that only has meaning relative to a particular code page. There are dozens if not hundreds of valid code pages, sometimes several per language. Some of those code pages even specify that some characters take more than one byte to represent. Using the wrong code page often means the bulk of your text file is misinterpreted or unrecognizable.
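    To see what that means in practice, here is a small sketch (code pages 1252, 1251 and 1253 are just examples; on .NET Framework they are available out of the box) showing the same byte decoded under three different code pages:
    Code:

      ' The single byte &HE4 means three different characters in three different code pages.
      Dim b As Byte() = {&HE4}
      Console.WriteLine(Encoding.GetEncoding(1252).GetString(b)) ' "ä" (Western European)
      Console.WriteLine(Encoding.GetEncoding(1251).GetString(b)) ' "д" (Cyrillic)
      Console.WriteLine(Encoding.GetEncoding(1253).GetString(b)) ' "δ" (Greek)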

    Sadly, when ANSI was designed, no one thought to make a convention or standard for storing, "Which code page do I use?" as part of the text file. It was assumed if the files were on a networked environment, some other protocol or convention would handle that.

    So if you have "a random ANSI file" you can only make guesses as to the code page. Human interpretation is about the only way to know for sure it worked. For an 8GB file, that's impractical.

    That's why, more than 20 years ago, the Unicode standard was created, and a host of implementations exist. UTF-8 is the most popular because it "looks like ASCII" and a ton of fundamental protocols work only with ASCII (no ANSI.) UTF-16 is the next-most popular because it more uniformly represents common human languages. Encoding.Default is probably one of those two, I've never liked the name. (Windows defaults to UTF-16, .NET stores Strings as UTF-16, but many things in .NET use UTF-8 if you don't specify an encoding, so the concept of "default" is very muddy.)

    It is not required, but it is very common, to include some bytes at the start of a Unicode file to indicate both the encoding and the byte order. Most people think of it as an encoding indicator, but since every Unicode encoding can use multiple bytes per character, the byte order is the most relevant part. In UTF-16, the character "A" can be represented either as "00 41" or as "41 00". The byte order mark (BOM) is how you tell both: that it's UTF-16 and the order of the bytes.
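    A short sketch of what those bytes look like in .NET (Encoding.Unicode is UTF-16 little-endian, Encoding.BigEndianUnicode is UTF-16 big-endian):
    Code:

      ' "A" in the two UTF-16 byte orders, plus each encoding's BOM bytes.
      Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("A")))          ' 41-00
      Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("A"))) ' 00-41
      Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()))          ' FF-FE
      Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())) ' FE-FF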

    When you have a text file with no BOM, you have to guess. UTF-16 is easy to detect if the file uses English text: you'll see a lot of ASCII-range characters bordered with null like "00 41" or "41 00". UTF-8 is trickier to detect because it shares the ASCII range: for many English-only files there's not a distinction. It's when the file uses non-English characters it gets interesting. If you're lucky and it's UTF-8 you'll find enough recognizable multi-byte patterns to declare it as such. If you're not, you have to try interpreting it as many different code pages until you figure out which one's the right one.

    So I do not think you have enough information to solve the problem.

    You aren't getting what you expect because you know it's an ANSI-encoded file, but you're using "Encoding.Default" which is either UTF-8 or UTF-16. Neither of those is compatible with ANSI. If you want to work with ANSI-encoded files, you have to create an Encoding and specify the code page. If you don't know the code page, you have to guess.
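    In code, that looks something like this (a sketch only, reusing MyFile/MyTempFile/args from post #1; code page 1252 is nothing more than a guess and should be swapped for whatever code page actually fits the file):
    Code:

      ' Read and write with one explicit ANSI code page instead of Encoding.Default.
      Dim ansi As Encoding = Encoding.GetEncoding(1252)
      Using sr As New IO.StreamReader(MyFile, ansi),
            sw As New IO.StreamWriter(MyTempFile, False, ansi)
          While Not sr.EndOfStream
              sw.WriteLine(sr.ReadLine().Replace(args(4), args(5)))
          End While
      End Using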

    So if you have no clue what code page to use, you can't solve the problem. If you can narrow it down to even 10 or 15, you can try them all until you find the "right" one.

    A really good binary editor can help with this: the best ones I've used let you highlight arbitrary ranges of bytes and interpret them in many ways. So you could highlight some bytes in a great binary editor and try many different encodings until you find one that "fits".
    This answer is wrong. You should be using TableAdapter and Dictionaries instead.
