[RESOLVED] Converting String to ByteArray

**Doogle** · Nov 5th, 2012, 02:09 AM

I've written an Asynchronous Socket client, which is working ok. I'm still on the learning curve from VB6 to .NET and there's one or two things I'm not clear about.

Part of the code requires sending the user's input to the Socket, using the BeginSend method. This requires the data to be sent as a Byte Array. Currently I'm doing this:

Code:

        If txtToSend.Text <> vbNullString Then
            MessageToSend = txtToSend.Text & vbNewLine
            Dim bytSend(MessageToSend.Length - 1) As Byte
            For i = 0 To MessageToSend.Length - 1
                bytSend(i) = Asc(MessageToSend.Substring(i, 1))
            Next
            Client.Client.BeginSend(bytSend, 0, MessageToSend.Length, 0, AddressOf SendCallback, Client)
        End If

which seems to be a bit 'clunky' and 'vb6ish'. I feel as though I ought to be able to use 'MessageToSend.ToCharArray' and somehow coerce the result into a byte array.

Or is there something quite fundamental I'm missing?

**Niya** · Nov 5th, 2012, 02:35 AM

If you're gonna move to VB.Net, I implore you to stop thinking of text as directly interchangeable with bytes. Think of a string as a sequence of Unicode characters where conversion is necessary to switch between representations. In a Unicode world, thinking of strings as a 1 byte per character byte array could be detrimental. The exact same string could be represented by totally different byte sequences according to which Unicode encoding it's in. The closest thing to ASCII, which you would be used to from VB6, would be UTF8:-

vbnet Code:

        Dim message As String = "Hello world"
 
        'Gets a byte array that represents the string in UTF8 format
        Dim byMessage As Byte() = System.Text.Encoding.UTF8.GetBytes(message)

The great thing about this is that it would work with text in any language since UTF8 can represent any Unicode character, even non-Latin characters. UTF8 in particular is backward compatible with ASCII, so you can save the bytes directly into a text file and even an old DOS text editor would be able to read it, as long as you use only characters from the ASCII range.

Your code should look like this:-

vbnet Code:

        If txtToSend.Text <> vbNullString Then
            MessageToSend = txtToSend.Text & vbNewLine
            Dim bytSend As Byte() = System.Text.Encoding.UTF8.GetBytes(MessageToSend)

            ' Pass the byte count, not the character count: in UTF8 a character can take more than one byte
            Client.Client.BeginSend(bytSend, 0, bytSend.Length, 0, AddressOf SendCallback, Client)
        End If
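
One thing worth checking with the UTF8 approach is that the number of bytes is not always the same as String.Length. Here is a minimal sketch of that. The module name and the EncodeMessage helper are made up for illustration and are not part of the original client code:

```vbnet
Imports System
Imports System.Text

Module Utf8LengthDemo
    ' Hypothetical helper: encodes a message plus a trailing newline as UTF8.
    Function EncodeMessage(message As String) As Byte()
        Return Encoding.UTF8.GetBytes(message & vbNewLine)
    End Function

    Sub Main()
        For Each sample As String In {"Hello", "대한민국"}
            Dim bytes As Byte() = EncodeMessage(sample)
            ' The character count includes the two newline characters (CR and LF).
            Console.WriteLine("{0} chars -> {1} bytes", (sample & vbNewLine).Length, bytes.Length)
        Next
        ' Prints:
        ' 7 chars -> 7 bytes
        ' 6 chars -> 14 bytes
    End Sub
End Module
```

For plain ASCII text the two counts match, which is why a mismatch tends to go unnoticed until someone types a non-ASCII character.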

**Doogle** · Nov 5th, 2012, 02:41 AM

Thanks for that.

I seem to be on quite a steep learning curve and it's difficult getting my head round ASCII vs UTF8 etc after 40 years of programming. The little grey cells are not quite as active as they were. However, I suppose now is as good a time as any to start.

**Niya** · Nov 5th, 2012, 03:02 AM

Don't worry, you'll get it. It was confusing for me too in the beginning. When I realized that I should stop interfering with the bytes in Strings directly and treat them as black boxes where the only thing I know is the format and the character sequence, everything became clearer.

**Half** · Nov 5th, 2012, 08:23 PM

UTF-8 is not Unicode.
.NET is mostly using Unicode if the encoding is not explicitly specified but the forms themselves are not saved in Unicode.
Socket communication and pretty much everything else Internet is still using extended ASCII (the 0-255 range)
:-D

When I am sending 대한민국, I am not sending 대한민국 but the bytes EB,8C,80,ED,95,9C,EB,AF,BC,EA,B5,AD. Just like you are doing it in your original code. I think it is great to use what's easiest but only if you know why and how it is easier. Otherwise when your users say they are getting squares and question marks you won't know if the GetBytes' encoding was wrong, they don't have an adequate font, the send went wrong etc. etc.

I now find the basics to be really simple, the only thing I had to read back then:
http://www.joelonsoftware.com/articles/Unicode.html

**Evil_Giraffe** · Nov 6th, 2012, 04:32 AM

Originally Posted by Half

UTF-8 is not Unicode.

Um, yes it is.

Originally Posted by The very article you linked to

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Originally Posted by Half

Socket communication and pretty much everything else Internet is still using extended ASCII (the 0-255 range)

No, "Socket communication and pretty much everything else Internet" is using bytes. How the receiving systems interpret those bytes is up to them. The sending and receiving systems simply have to agree. This can either be by decree (as in this example: it is stated that text will be sent encoded as UTF-8), or by some form of negotiation/command (as in web pages, where the Content-Type header or a meta tag near the top of the HTML document should specify what encoding the document is in). Yes, declaring the encoding inside the document is a bit screwy; fortunately that tag only uses characters that are the same in basically every encoding ever invented.
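
A rough sketch of what "the two sides have to agree" means in practice. ToWire and FromWire are hypothetical stand-ins for the sending and receiving code, not anything from the original client:

```vbnet
Imports System
Imports System.Text

Module EncodingAgreementDemo
    ' Hypothetical sender side: turns text into bytes using the agreed encoding.
    Function ToWire(message As String, enc As Encoding) As Byte()
        Return enc.GetBytes(message)
    End Function

    ' Hypothetical receiver side: turns bytes back into text.
    Function FromWire(data As Byte(), enc As Encoding) As String
        Return enc.GetString(data)
    End Function

    Sub Main()
        Dim original As String = "대한민국"
        Dim wire As Byte() = ToWire(original, Encoding.UTF8)

        ' Both sides use UTF-8: the text survives the trip.
        Console.WriteLine(FromWire(wire, Encoding.UTF8) = original)      ' True

        ' Receiver wrongly assumes UTF-16: same bytes, different (garbage) text.
        Console.WriteLine(FromWire(wire, Encoding.Unicode) = original)   ' False
    End Sub
End Module
```

The bytes on the wire are identical in both cases; only the interpretation differs.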

**Half** · Nov 6th, 2012, 12:28 PM

Originally Posted by Evil_Giraffe

Originally Posted by Half

UTF-8 is not Unicode.

Um, yes it is.

Originally Posted by The very article I linked to

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Note the 'another system' piece of text. You can use a system of dead raccoons to store Unicode data and call it URCns-D but it won't make it Unicode. To illustrate the humongous difference between actual Unicode and a distorted thing like UTF-8:
input string: "대한민국"
UTF8.GetBytes("대한민국") => Byte Seq: 235 140 128 237 149 156 235 175 188 234 181 173
Unicode.GetBytes("대한민국") => Byte Seq: 0 179 92 213 252 187 109 173

-----------------------------------

Originally Posted by Evil_Giraffe

No, "Socket communication and pretty much everything else Internet" is using bytes. How the receiving systems interpret those bytes is up to them. The sending and receiving systems simply have to agree. This can either be by decree (as in this example: it is stated that text will be sent encoded as UTF-8), or by some form of negotiation/command (as in web pages, where the Content-Type header or a meta tag near the top of the HTML document should specify what encoding the document is in). Yes, declaring the encoding inside the document is a bit screwy; fortunately that tag only uses characters that are the same in basically every encoding ever invented.

Bytes and ASCII chars are interchangeable terms in most dev communities. You may not know it but e.g. hex editors, in the text portion, almost always show the bytes in ASCII and not in other encodings.

Of course I can see how beginners can become a bit frustrated when a byte is treated like a char and a char is treated like a byte but after battling with APIs for a while or working with webservers it all becomes a bit clearer. In any case it boils down to: either use simple extended ASCII when generating byte sequences for transfer and tell the client the encoding to use when treating those bytes (my preference)

or

Use some other encoding in generating the byte sequence and when Windows 9 decides it's time for middle-endianness, start digging for the source code to make sense of the output.

**dbasnett** · Nov 6th, 2012, 02:15 PM

What is Unicode? In today's computing environment it would be the exception that a character set isn't Unicode. The only trick is that the encoder and decoder agree on the encoding. There is even one for extended ASCII, code page 28591 (ISO-8859-1), which supplies a one-to-one code for each of the 256 byte values.
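
As a sketch of that one-to-one property (the module and function names here are invented for the example), you can push all 256 byte values through an encoding and see whether they come back unchanged:

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module Latin1RoundTripDemo
    ' Returns True if every byte value 0..255 survives decode-then-encode
    ' through the given encoding unchanged.
    Function RoundTripsAllBytes(enc As Encoding) As Boolean
        Dim allBytes As Byte() = Enumerable.Range(0, 256).Select(Function(i) CByte(i)).ToArray()
        Dim decoded As String = enc.GetString(allBytes)
        Return decoded.Length = 256 AndAlso enc.GetBytes(decoded).SequenceEqual(allBytes)
    End Function

    Sub Main()
        Console.WriteLine(RoundTripsAllBytes(Encoding.GetEncoding(28591))) ' True
        ' ASCII only defines 0..127, so bytes 128..255 are replaced with "?".
        Console.WriteLine(RoundTripsAllBytes(Encoding.ASCII))              ' False
    End Sub
End Module
```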

One other thing. If you are going to quote a link it is always best to read it thoroughly. From The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

"The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII."

**Evil_Giraffe** · Nov 8th, 2012, 05:46 AM

Originally Posted by Half

Note the 'another system' piece of text. You can use a system of dead raccoons to store Unicode data and call it URCns-D but it won't make it Unicode.

I think you're confusing "Unicode" with "some specific encoding of Unicode". Unicode text is simply a string of code points. Representing these code points as bytes (or, as in this [hopefully] hypothetical example, dead raccoons) is the job of the encoding. So the 'another system' refers to a different encoding system of Unicode code points, not a non-Unicode system.

Originally Posted by Half

To illustrate the humongous difference between actual Unicode and a distorted thing like UTF-8:
input string: "대한민국"
UTF8.GetBytes("대한민국") => Byte Seq: 235 140 128 237 149 156 235 175 188 234 181 173
Unicode.GetBytes("대한민국") => Byte Seq: 0 179 92 213 252 187 109 173

Yes, two different encodings end up with different bytes. Hence why the sending and receiving party need to agree on the encoding used. Maybe you're confused because the encoding class is called "Unicode". I agree it's oddly named, but look up the documentation and you'll find that it's simply little-endian UTF-16.
The "actual Unicode" isn't the result of Unicode.GetBytes. It's this: U+B300 U+D55C U+BBFC U+AD6D
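
To make the distinction concrete, here is a small sketch that prints the code points next to three different encodings of them. ToHex and CodePoints are helper names I made up, and CodePoints assumes the string has no characters outside the Basic Multilingual Plane:

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module CodePointsDemo
    ' Hypothetical helper: formats bytes as space-separated hex pairs.
    Function ToHex(data As Byte()) As String
        Return String.Join(" ", data.Select(Function(b) b.ToString("X2")))
    End Function

    ' Hypothetical helper: lists code points in U+XXXX form (BMP-only strings).
    Function CodePoints(sample As String) As String
        Return String.Join(" ", sample.Select(Function(c) "U+" & AscW(c).ToString("X4")))
    End Function

    Sub Main()
        Dim sample As String = "대한민국"
        Console.WriteLine(CodePoints(sample))                              ' U+B300 U+D55C U+BBFC U+AD6D
        Console.WriteLine(ToHex(Encoding.UTF8.GetBytes(sample)))             ' EB 8C 80 ED 95 9C EB AF BC EA B5 AD
        Console.WriteLine(ToHex(Encoding.Unicode.GetBytes(sample)))          ' 00 B3 5C D5 FC BB 6D AD
        Console.WriteLine(ToHex(Encoding.BigEndianUnicode.GetBytes(sample))) ' B3 00 D5 5C BB FC AD 6D
    End Sub
End Module
```

The first line is the Unicode text; the other three are just different byte representations of it.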

Originally Posted by Half

Use some other encoding in generating the byte sequence and when Windows 9 decides it's time for middle-endianness, start digging for the source code to make sense of the output.

No, as the quote that dbasnett has pulled out of the article states, you simply have to know what encoding was used to generate that set of bytes. Windows 9 doesn't decide what the encoding is, the application chooses the encoding scheme it uses to decode the bytes into a string.

I really think you need to go back and read that article again.

**Half** · Nov 8th, 2012, 09:14 PM

Mhhh this whole thing started by me trying to say that it is ok to simply loop through a string and get the bytes of each char. It seemed 'clunky' and 'vb6ish' to the OP but it is neither. If anything, it is C-ish.

I did not link to the now infamous webpage in order to somehow try to prove that encodings are irrelevant or useless, but to make it easier for those who would like to know why and how UTF8.GetBytes, Unicode.GetBytes, Default.GetBytes etc etc differ from one another.

I was going to ignore dbasnett's remark since it serves no other purpose other than a shot at being sarcastic but meh: the quoted text is indeed scary but it is just for drama. What happens if we do have a string without knowing its encoding? The Earth becomes a unipolar magnet, a hole in space-time opens, Stephen Hawking turns into a black hole? If it were such a big deal we would use it as an encryption system.

Originally Posted by Evil_Giraffe

Originally Posted by Half

To illustrate the humongous difference between actual Unicode and a distorted thing like UTF-8:
input string: "대한민국"
UTF8.GetBytes("대한민국") => Byte Seq: 235 140 128 237 149 156 235 175 188 234 181 173
Unicode.GetBytes("대한민국") => Byte Seq: 0 179 92 213 252 187 109 173

Yes, two different encodings end up with different bytes. Hence why the sending and receiving party need to agree on the encoding used. Maybe you're confused because the encoding class is called "Unicode". I agree it's oddly named, but look up the documentation and you'll find that it's simply little-endian UTF-16.
The "actual Unicode" isn't the result of Unicode.GetBytes. It's this: U+B300 U+D55C U+BBFC U+AD6D

but but but but... The bytes 0 179 92 213 252 187 109 173 in hex are 00 B3 | 5C D5 | FC BB | 6D AD
& we all know about endianness

**dbasnett** · Nov 9th, 2012, 09:12 AM

I was NOT being sarcastic. Several statements were made that were just wrong, and I felt they needed to be corrected. I wouldn't call the quoted text dramatic, unless that is what stating the obvious is.

Code:

82BB82EA82CD89E4815882AA96E291E882C582A082E982B182C682F0926D82C182C482A282E982B782D782C482C582CD82C882A2814182BB82EA82CD89E4815882AA82A082E982B182C682F08D7382A482C68E7682ED82EA82E982B782D782C482C582B78142

**Doogle** · Nov 10th, 2012, 02:08 AM

Although I marked this as resolved (as my original question was answered) the continuing discussion has me interested.

Let's see if I've got a grip on it yet.....

If I were developing a multi-lingual international Chat program where clients may be using, for instance, a Chinese character set and others may be using a Latin based character set, I would need to support multi-byte character transfer. In order to 'unscramble' the data I would need to know the character set that the client is using.

e.g.
Client 1 (using Chinese characters) sends a message, the Server picks it up and forwards it to Client 2 who's using English. In order for the Chinese characters to be displayed at Client 2 I would need to also send the Code Page ID(?) or at least something to tell Client 2's program how to interpret the multi-byte format of the data (since multi-byte data can be from 1 to 6 bytes(?)) so that the original Chinese characters are displayed at Client 2.

**Niya** · Nov 10th, 2012, 02:14 AM

Actually, you don't have to do a thing more than encode your strings in a format that can express the Unicode code-points used to represent Chinese characters. UTF8 should be sufficient for this, as long as both clients agree that strings passed between them are encoded in UTF8.
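
On the "1 to 6 bytes" question: UTF8 as currently standardized (RFC 3629) uses 1 to 4 bytes per code point; the 6-byte figure comes from older descriptions. A quick sketch to check the widths yourself, where Utf8Width is just a name I picked for the example:

```vbnet
Imports System
Imports System.Text

Module Utf8WidthDemo
    ' Number of UTF8 bytes needed for a single Unicode code point.
    Function Utf8Width(codePoint As Integer) As Integer
        Return Encoding.UTF8.GetByteCount(Char.ConvertFromUtf32(codePoint))
    End Function

    Sub Main()
        Console.WriteLine(Utf8Width(&H41))     ' 1: "A"
        Console.WriteLine(Utf8Width(&HE9))     ' 2: "é"
        Console.WriteLine(Utf8Width(&H4E2D))   ' 3: "中"
        Console.WriteLine(Utf8Width(&H1F600))  ' 4: an emoji outside the Basic Multilingual Plane
    End Sub
End Module
```

The receiver does not need to be told the width of each character; the UTF8 decoder works that out from the bytes themselves.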

**Doogle** · Nov 10th, 2012, 02:31 AM

Ah ha, got it. I could think of UTF8 as a 'universal panacea' in terms of unicode code-points. Once everyone's agreed that is the way data is going to be 'encoded' there shouldn't be any problems. (I'm always a bit wary about 'universal panaceas'.)

**Niya** · Nov 10th, 2012, 03:40 AM

Exactly
