L a char* -- convert to wchar_t*

Jun 8th, 2005, 10:35 PM
Halsafar

L a char* -- convert to wchar_t*

Okay, I have no idea what it all means, but it has something to do with wide vs. ANSI characters, or I dunno...

But a function I need to use once, that's it, requires a filename to be specified like:
L"C:\\Afile.ext"

Well I need to be able to pass in any filename...so I pass a char*
Is this where I use wchar? tchar?

O_o
Jun 9th, 2005, 12:33 AM
sunburnt

Re: L a char* -- convert to wchar_t*

Putting an L in front of a string literal makes it a wide-character string literal (see the thread about _T for reference).

TCHAR is a typedef that depends on the UNICODE (or maybe _UNICODE) macro.

If UNICODE is defined, TCHAR resolves to wchar_t.
If UNICODE is not defined, TCHAR resolves to char.
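
To make the effect of the L prefix concrete, here is a minimal sketch, assuming a C++11 compiler. The two helper functions are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <type_traits>

// A plain literal is an array of const char ...
static_assert(std::is_same<decltype("abc"), const char (&)[4]>::value,
              "narrow literal");
// ... while the L prefix turns it into an array of const wchar_t.
static_assert(std::is_same<decltype(L"abc"), const wchar_t (&)[4]>::value,
              "wide literal");

// Hypothetical helpers: count the characters before the terminator.
inline std::size_t narrow_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

inline std::size_t wide_length(const wchar_t* s) {
    std::size_t n = 0;
    while (s[n] != L'\0') ++n;
    return n;
}
```

The prefix only works on literals written in the source code. A char* variable holding a filename cannot be "prefixed"; it has to be converted at run time.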
Jun 9th, 2005, 11:07 AM
CornedBee

Re: L a char* -- convert to wchar_t*

TCHAR depends on UNICODE and is defined in windows.h. _TCHAR depends on _UNICODE and is defined in tchar.h.

TEXT() is a macro that emits a narrow or wide character (string) depending on the UNICODE setting, from windows.h. _T() and _TEXT() are equivalents from tchar.h, and depend on _UNICODE.

windows.h defines the following synonyms (the L variants are a relic from Win16):
CHAR = char
PSTR = char *
LPSTR = char *
PCSTR = const char *
LPCSTR = const char *
WCHAR = wchar_t
PWSTR = wchar_t *
LPWSTR = wchar_t *
PCWSTR = const wchar_t *
LPCWSTR = const wchar_t *
PTSTR = TCHAR *
LPTSTR = TCHAR *
PCTSTR = const TCHAR *
LPCTSTR = const TCHAR *

The C standard provides a wide-character counterpart for most string-related library functions. In the names, str is replaced by wcs. The wide versions are declared in wchar.h (Microsoft's CRT also declares them in whatever header their narrow counterpart is in).
E.g.:
strlen -> wcslen
strstr -> wcsstr
printf -> wprintf
sprintf -> swprintf (which also takes a buffer size; wsprintf is a WinAPI function, not the CRT counterpart)
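
A small sketch of the wide CRT functions in use, assuming the standard names wcslen, wcsstr and swprintf from wchar.h. The helper join_path is a made-up name for illustration.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper: builds L"<dir>\<file>" with the wide CRT functions.
inline std::wstring join_path(const wchar_t* dir, const wchar_t* file) {
    wchar_t buffer[260];
    // Standard swprintf takes the buffer size (in wide characters)
    // as its second argument.
    int written = std::swprintf(buffer, sizeof buffer / sizeof buffer[0],
                                L"%ls\\%ls", dir, file);
    // A negative result means an error or a buffer that was too small.
    return written < 0 ? std::wstring() : std::wstring(buffer, written);
}
```

wcslen and wcsstr are called exactly like strlen and strstr, only with wchar_t pointers.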

The mbstowcs and wcstombs functions convert between strings of the two types, while mbtowc and wctomb convert between single characters (as far as that is possible - conversions often don't work that way.)
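
For the original question (turning a char* into a wide string), a minimal sketch with the CRT conversion functions could look like this. The helper names widen and narrow are assumptions for illustration, and the conversion uses whatever C locale is currently active.

```cpp
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: widen a narrow string using the current C locale.
inline std::wstring widen(const char* text) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::vector<wchar_t> buffer(std::strlen(text) + 1, L'\0');
    std::size_t converted = std::mbstowcs(&buffer[0], text, buffer.size());
    if (converted == static_cast<std::size_t>(-1))
        return std::wstring();  // invalid multibyte sequence
    return std::wstring(&buffer[0], converted);
}

// Hypothetical helper: the opposite direction.
inline std::string narrow(const wchar_t* text) {
    // Each wide character needs at most MB_CUR_MAX bytes.
    std::vector<char> buffer(std::wcslen(text) * MB_CUR_MAX + 1, '\0');
    std::size_t converted = std::wcstombs(&buffer[0], text, buffer.size());
    if (converted == static_cast<std::size_t>(-1))
        return std::string();   // character not representable
    return std::string(&buffer[0], converted);
}
```

The c_str() of the returned std::wstring can be passed wherever a const wchar_t* is expected.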

tchar.h is a Microsoft extension that, in addition to the typedefs and macros I've already described, contains macros that evaluate to one of the two function variants, depending on _UNICODE. These macros use tcs in place of str/wcs and are prefixed with an underscore:
strlen/wcslen -> _tcslen
strstr/wcsstr -> _tcsstr
printf/wprintf -> _tprintf

In addition to that, the tchar.h header defines the _tmain and _tWinMain macros that you can use for your program's application entry point, because a Windows CRT program starts at wmain if it's compiled as Unicode. (This is, I believe, a violation of the standard.)

The windows.h header does the same thing for the WinAPI, with UNICODE. Every function that works with strings really exists in two variants, an A and a W version. For example, the MessageBox call is really two functions:
int MessageBoxA(HWND, PCSTR, PCSTR, UINT);
int MessageBoxW(HWND, PCWSTR, PCWSTR, UINT);
The headers also contain a macro that might evaluate to either:
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
The same thing is done for all structures that contain character members, except that a typedef is used instead of a macro.
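
The mechanism can be imitated in portable code. The following is only a sketch: DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT and the DemoLength functions are invented stand-ins, not real Windows names.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an API that exists in an A and a W variant.
inline std::size_t DemoLengthA(const char* s)    { return std::strlen(s); }
inline std::size_t DemoLengthW(const wchar_t* s) { return std::wcslen(s); }

// DEMO_UNICODE plays the role of UNICODE here; it is not a real Windows macro.
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT(s) L##s          // paste an L in front of the literal
#define DemoLength DemoLengthW
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT(s) s
#define DemoLength DemoLengthA
#endif

// This function compiles unchanged in both configurations.
inline std::size_t title_length() {
    const DEMO_TCHAR* title = DEMO_TEXT("Hello");
    return DemoLength(title);
}
```

Compiling with -DDEMO_UNICODE (or /DDEMO_UNICODE) switches the whole chain to the wide variants without touching title_length.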

Windows has the MultiByteToWideChar and WideCharToMultiByte API calls that convert between the string representations.
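
A hedged sketch of the usual two-call pattern for MultiByteToWideChar follows. The helper name ansi_to_wide is made up, CP_ACP (the system ANSI code page) is an assumption about what the input is encoded in, and the non-Windows branch exists only so the sketch builds elsewhere.

```cpp
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>
#endif

// Hypothetical helper: convert an ANSI-codepage string to a wide string.
inline std::wstring ansi_to_wide(const std::string& text) {
    if (text.empty()) return std::wstring();
#ifdef _WIN32
    // First call: ask how many wide characters are needed.
    int needed = MultiByteToWideChar(CP_ACP, 0, text.c_str(),
                                     static_cast<int>(text.size()), NULL, 0);
    if (needed <= 0) return std::wstring();
    std::vector<wchar_t> buffer(needed);
    // Second call: perform the conversion into the buffer.
    MultiByteToWideChar(CP_ACP, 0, text.c_str(),
                        static_cast<int>(text.size()), &buffer[0], needed);
    return std::wstring(buffer.begin(), buffer.end());
#else
    // Outside Windows, fall back to the CRT conversion.
    std::vector<wchar_t> buffer(text.size() + 1, L'\0');
    std::size_t converted =
        std::mbstowcs(&buffer[0], text.c_str(), buffer.size());
    if (converted == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(&buffer[0], converted);
#endif
}
```

The result's c_str() can then be handed to a function that expects a filename in the L"..." form.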

C++ brings a new layer into this. The entire iostream, locale and string libraries are built from the ground up to support characters of any type, by making them templated on the character type and also on the character traits.
Therefore, std::string and std::wstring are just specializations of the template std::basic_string<CHAR_TYPE, TRAITS_TYPE, ALLOCATOR_TYPE>. std::cin and std::wcin are instances of std::istream and std::wistream respectively, which in turn are specializations of std::basic_istream<CHAR_TYPE, TRAITS_TYPE>.
Conversion in C++ is done through the codecvt facet of the locale and is actually a rather advanced technique.

Nobody, not even Microsoft, provides tchar-like features for C++, however. That is, while it's very easy to just instantiate a std::basic_string<TCHAR>, there's no tstring anywhere.

Such a header is very easy to write, though. I call mine tcpp.
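
The actual tcpp header is not shown in this thread, so the following is only a guess at what such a header might contain. The names tstring, tostringstream and make_padding are assumptions, and the non-Windows typedef is a stand-in so the sketch builds without tchar.h.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#ifdef _WIN32
#include <tchar.h>
#else
typedef char TCHAR;  // stand-in so the sketch builds without tchar.h
#endif

// Hypothetical contents of a tchar-style header for C++.
typedef std::basic_string<TCHAR>        tstring;
typedef std::basic_ostringstream<TCHAR> tostringstream;

// Hypothetical helper showing the typedef in use.
inline tstring make_padding(TCHAR c, std::size_t count) {
    return tstring(count, c);
}
```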

Also, there's no quick-and-easy way to convert between std::string and std::wstring. In fact, doing so is extremely tiresome, because a std::string doesn't provide a target for the conversion. So I wrote these:

Code:

#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Primary template: declared only; every supported combination is specialized below.
template <typename TO, typename FROM>
inline std::basic_string<TO, std::char_traits<TO>, std::allocator<TO> >
chartype_cast(
    const std::basic_string<FROM, std::char_traits<FROM>, std::allocator<FROM> > &from,
    const std::locale & = std::locale());

// Same character type on both sides: nothing to convert.
template <>
inline std::string chartype_cast<char, char>(const std::string &from, const std::locale &)
{
    return from;
}

template <>
inline std::wstring chartype_cast<wchar_t, wchar_t>(const std::wstring &from, const std::locale &)
{
    return from;
}

// These can be better.
// wide -> narrow
template <>
inline std::string chartype_cast<char, wchar_t>(const std::wstring &from, const std::locale &loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    const cvt_t &cvt = std::use_facet<cvt_t>(loc);
    std::mbstate_t state = std::mbstate_t();  // the state must start zero-initialized
    std::vector<char> outbuf((from.size() + 1) * cvt.max_length(), '\0');
    const wchar_t *inseqptr = from.c_str();
    char *outseqptr = &outbuf[0];
    // The result code is not checked; a failed conversion yields a truncated string.
    cvt.out(state, from.c_str(), from.c_str() + from.size(), inseqptr,
            &outbuf[0], &outbuf[0] + outbuf.size(), outseqptr);
    // Return only what was actually written, not the whole buffer.
    return std::string(&outbuf[0], outseqptr);
}

// narrow -> wide
template <>
inline std::wstring chartype_cast<wchar_t, char>(const std::string &from, const std::locale &loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    const cvt_t &cvt = std::use_facet<cvt_t>(loc);
    std::mbstate_t state = std::mbstate_t();  // the state must start zero-initialized
    std::vector<wchar_t> outbuf(from.size() + 1, L'\0');
    const char *inseqptr = from.c_str();
    wchar_t *outseqptr = &outbuf[0];
    // The result code is not checked; a failed conversion yields a truncated string.
    cvt.in(state, from.c_str(), from.c_str() + from.size(), inseqptr,
           &outbuf[0], &outbuf[0] + outbuf.size(), outseqptr);
    // Return only what was actually written, not the whole buffer.
    return std::wstring(&outbuf[0], outseqptr);
}

As I remarked in the comment, the code is far from ideal. It is easy to use, though:
std::string str = chartype_cast<char>(otherstring);
where otherstring can be either a std::string or a std::wstring, the compiler chooses.
Jun 9th, 2005, 12:10 PM
Halsafar

Re: L a char* -- convert to wchar_t*

Nice little article there CB, I enjoyed it. You answered more questions than I even asked. For example, I always wondered why MessageBoxA and MessageBoxW existed when I only ever called MessageBox. I see that nearly EVERY API has a W/A counterpart, now it all makes sense.

So what is the purpose behind the 2 different char* types?
To me it seemingly complicated things for no reason....

As for the code you posted, I do not have time to study it now. I read your article and I have to go, but I do plan on implementing it since I HAVE to.

See, I need to pass a wide-character string to a function which loads a video from my game's directory, so it has to look something like:

->RenderFile((cGrameWindowFrame::AppPath() + "//test.avi"));
AppPath() returns a string of the application path obviously.
Looks like I'm in for a bit of work just to allow me to open any video I want from the current working directory....

Well I'm sure the code you posted will help, I'll reply more if I encounter more questions.
Jun 9th, 2005, 01:50 PM
sunburnt

Re: L a char* -- convert to wchar_t*

Nice answer, CB!

Quote:

So what is the purpose behind the 2 different char* types?
To me it seemingly complicated things for no reason....

Representing a character in English in only 8 bits (or, using ASCII, 7) is no big deal -- there are only 26 lowercase letters, 26 uppercase letters, and some punctuation and symbols.

Now consider a language like Spanish or German. In addition to all the English characters, they have versions with different types of accents. These accents can go on many different letters, and these have to be represented somehow too -- instead of just e and E, you have e, E, É, é, è, etc. This is a bit more data to store in one byte, but it still fits.

Now consider a language like Korean, Chinese, or Japanese. These languages have thousands of different characters that need to be represented -- far more than can be stored using just 8 bits. By using 16 bits for each character, these languages can be represented much more easily.

That's just one of the benefits of using unicode or some other wide character encoding.
Jun 9th, 2005, 01:53 PM
CornedBee

Re: L a char* -- convert to wchar_t*

As far as paths go, I tend to use Boost.Filesystem.

Quote:

So what is the purpose behind the 2 different char* types?

Which two? LPSTR and PSTR? Or PCSTR and PSTR? Or PSTR and PWSTR?

The last is simple. wchar_t uses more space than char (2 bytes on Windows, 4 bytes on Linux and most other Unix-like systems), but has the advantage of being capable of storing a wider range of characters without jumping through the hoops known as codepages, shift character sets and multi-byte character sets.
char is kept for historic reasons and because a lot of text (pure English, in particular) rarely makes use of the additional space.

The middle one is quite simple, too. const char * prevents modifications to the string, which is important in the type system.

The first one is more complicated. Nowadays, there's no difference. But in the 16-bit days, there was a problem: a computer that addresses its memory with 16-bit pointers is logically limited to 64 kB of memory. That's awfully little. That's why the 8086 and the 80286 had memory segments. "Near" pointers addressed memory within the active segment, while "long" or "far" pointers addressed memory anywhere, but they were larger and slower to dereference. PSTR used to be a near pointer, while LPSTR used to be a far pointer.
The actual way far pointers work was too complicated and absurd for me to remember.
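
For reference, in 8086 real mode a far pointer is a 16-bit segment plus a 16-bit offset, and the address is formed with one formula. The sketch below covers only that case (not 80286 protected mode, and it ignores the 20-bit wraparound); the function name is made up.

```cpp
#include <cstdint>

// Real-mode 8086: physical address = segment * 16 + offset.
inline std::uint32_t linear_address(std::uint16_t segment, std::uint16_t offset) {
    return (static_cast<std::uint32_t>(segment) << 4) + offset;
}
```

One consequence is that many different segment:offset pairs name the same byte, which is part of what made far pointers awkward to compare.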
Jun 9th, 2005, 02:34 PM
Halsafar

Re: L a char* -- convert to wchar_t*

Hmmm, I applaud the programmers of days past but I definitely do not envy them... Sounds like a load of complications for nothing.

Japan has 3 alphabets, Hiragana - 47 chars, Katakana - 47 chars, Kanji - unknown.
I studied Japanese back in high school.
I never really thought about it, but I've been playing FFXI a lot and the community is 1/2 Japanese at least. A lot of the Japanese people I've played with are able to type characters from their 2 main alphabets and English... I can see how that would require a different system than the one we use. Kanji is the part of Japanese writing that uses purely symbols, and as my sensei told me it has nearly infinite characters....

Anyway, I still haven't gotten around to testing to see if I can figure this out.

I do not plan on anyone from other countries playing my games for now...
I guess as a game developer this doesn't matter anyway, since anytime a game comes out in 2 different countries the game has been majorly re-vamped for the new country... Like there are several versions of Final Fantasy 7 -- US, Japanese, International.
Jun 9th, 2005, 05:18 PM
CornedBee

Re: L a char* -- convert to wchar_t*

The Chinese character mapping efforts are aware of around 30000 glyphs in their script. A very educated Chinese knows about 5000 of them.

This stuff matters very much to you, whether you're a game developer or a different kind of developer. Make no mistake - the games might seem majorly re-vamped, but they're not. They just appear that way because they were made with the necessary flexibility. A flexibility that in modern games includes, among other things, built-in internationalization (i18n) and localization (l10n) support. Which in turn usually means some kind of Unicode.

But Unicode isn't just about wchar_t. It's actually far more complicated than that. You see, Unicode is really just a numbering of all characters the committee can find; those numbers are called code points. What most Windows programmers know as Unicode, on the other hand, is just one mapping of these code points to an actual encoding scheme. This particular encoding scheme is known as UTF-16, and a very similar but more limited variant of it is UCS-2. (If I'm not mistaken, Win2k supports UCS-2, and XP extended that a little bit to provide true UTF-16 support.) Just about every character needs 2 bytes in this scheme, but some use 4 bytes. Those are the code points above U+FFFF, which UTF-16 encodes as a pair of 16-bit surrogate code units; UCS-2 cannot represent them at all.
UCS-4 (also called UTF-32) is the only encoding that can actually represent every single code point defined by Unicode without variable byte lengths.
The most popular encoding for storage is UTF-8, a multibyte encoding where a single character might use anything from 1 (the US-ASCII subset) to 4 bytes. It is a very smart encoding scheme that even lets you know when you're in the middle of a code sequence. UTF-8 is stored in the char type, so that type is far from useless.
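
To show how the variable lengths come about, here is a minimal sketch of both encoders, assuming a C++11 compiler and valid input (a code point up to U+10FFFF that is not itself a surrogate). The function names are made up. Under the current UTF-8 definition (RFC 3629) a code point takes at most 4 bytes.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
inline std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // US-ASCII subset: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Continuation bytes always have the bit pattern 10xxxxxx, which is how
// a reader can tell it is in the middle of a sequence.
inline bool is_continuation_byte(char byte) {
    return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
}

// Encode one code point as UTF-16 (1 or 2 code units).
inline std::u16string encode_utf16(unsigned long cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```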

These things are worth learning, in any case. The world gets smaller every day, and i18n becomes more important the smaller the world gets.
Jun 10th, 2005, 01:27 PM
Halsafar

Re: L a char* -- convert to wchar_t*

Hmmm, I got accepted to university. I start in Arts and Sciences, then I am going to major in computer science, going for a bachelor's, 4-5 years...

Will I learn about all this then, and how to program flexibly for it???
I do not think I wish to learn something more for this game engine I've been working on for several months.

Edit: Btw what does template <> signify in your functions you posted?

I copied and pasted the last function thinking I could just use it, but I had to include one of the top 2 as well. I'm not very familiar with the std library yet, though, and that is one complex snippet.

Further editing:
It compiles, but when I try and use the function then compile I get lots of errors, one is cannot resolve template for TO
Jun 10th, 2005, 01:59 PM
Halsafar

Re: L a char* -- convert to wchar_t*

It's the second parameter.

I got:

std::wstring wstrFileName;
std::locale temp;

wstrFileName = chartype_cast(strVideoFileName, temp);

Error:
error C2783: 'std::basic_string<TO,std::char_traits<TO>,std::allocator<TO>> chartype_cast(const std::basic_string<FROM,std::char_traits<FROM>,std::allocator<FROM>> &,const std::locale &)' : could not deduce template argument for 'TO'
c:\Steve\C++\Hals Dx9 Framework\include\app_D3DHelpers.h(69) : see declaration of 'chartype_cast'
Jun 10th, 2005, 02:32 PM
CornedBee

Re: L a char* -- convert to wchar_t*

You need to use it like a cast operator.

Code:

std::wstring wstrFileName = chartype_cast<wchar_t>(strVideoFileName);

You don't need to specify the locale, it has a default value.

The template <> means that the function is an explicit (full) specialization of the template declared at the top.
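
A stripped-down sketch of the same pattern, with a made-up name demo_cast and a naive ASCII-only widening, shows both points: what template <> does, and why the target type has to be written in angle brackets.

```cpp
#include <string>

// Primary template: declared but never defined, so only the
// specializations below can be called.
template <typename TO, typename FROM>
std::basic_string<TO> demo_cast(const std::basic_string<FROM>& from);

// "template <>" introduces an explicit (full) specialization:
// this definition is used when TO = char and FROM = char.
template <>
inline std::string demo_cast<char, char>(const std::string& from) {
    return from;
}

// Used when TO = wchar_t and FROM = char.
template <>
inline std::wstring demo_cast<wchar_t, char>(const std::string& from) {
    // Naive widening, only meaningful for ASCII text.
    return std::wstring(from.begin(), from.end());
}
```

FROM can be deduced from the argument, but TO appears only in the return type, so the compiler cannot deduce it. That is why demo_cast(str) fails with "could not deduce template argument for 'TO'", while demo_cast<wchar_t>(str) compiles.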