
In Sphinx Search, how do I add “hashtag” to the charset_table?

I would like people to be able to search #photography as well as photography. Those should be treated as two different words in Sphinx. By default, #photography maps to photography, and I can't search for hashtags.

I read on this page that you can add the hash tag to the charset_table to accomplish this. I am completely clueless on how to do that. I don't know unicode, and I don't know what my charset_table should be.

Can someone tell me what my charset_table should be? Thanks.

# charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F

Note: I plan on using a real-time index (not sure if this makes a difference).

Answer 1 (+80)

It's U+0023 according to the Unicode table. So the final configuration should look like this:

charset_table     = 0..9, A..Z->a..z, _, a..z, U+23, U+410..U+42F->U+430..U+44F, U+430..U+44F

Don't forget the charset_type variable. AFAIK, this example charset_table is for utf-8. Besides that, you should remove U+23 from the blend_chars variable to allow Sphinx to index it as a legal character.

Thanks Paul. What would the entire string look like? Do I just add that to the end, with a comma before it? Not sure what the final result will be... – TIMEX Apr 30 '12 at 18:49

Updated and provided some more info. – Pavel Selitskas May 2 '12 at 16:26

In addition to the current requirement, is there any way to make it so that when users search "photography", it also returns results from "#photography"? But not the other way around... – TIMEX May 4 '12 at 23:45

expand_keywords should resolve this issue, though infix search ought to be used instead of prefix search. I don't know if it works with special characters, such as hash sign. – Pavel Selitskas May 7 '12 at 11:20

Answer 2 (0)

I would like people to be able to search #photography as well as photography. Those should be treated as two different words in Sphinx. By default, #photography maps to photography, and I can't search for hashtags.

Good day.

I think there is a workaround for you, but note:

Calling the search function directly with the raw user query is a bad approach.

Before calling the search function of the sphinx engine, the user string needs some kind of processing. For example, you can check the user string for certain special characters and remove them from the query. Then you can call the search function with the processed query.

Good luck.


MySQL script doesn't execute properly when run via Python


Raw data:

Account: 1234 � Rent

I run this both by executing the script manually in a Python shell and by executing it directly in MySQL:

SELECT 
SUBSTRING_INDEX(SUBSTRING_INDEX(Account,':',-1),' ? ',-1) as Description, 
SUBSTRING_INDEX(SUBSTRING_INDEX(Account,':',-1),' ? ',1) as Acct_Number 
FROM table1 

MySQL output (correct):

Acct_Number Description 
1234    Rent 

Python output (incorrect):

Acct_Number Description 
1234   1234 ? Rent 

Is there a way to get python to read this strange character? I have successfully used Python to run similar scripts over account data (also using SUBSTRING_INDEX) that did not include this character, and it worked perfectly fine.

If the character does not show up in this post, here is a link to what I am referring to: https://apps.timwhitlock.info/unicode/inspect?s=%EF%BF%BD


It would be easier for people to help you if you posted some of the code you tried. What code did you use in python, and which variable showed you the special character? – Siddardha

Answer (0)

For the reason behind the "black diamond", see Trouble with UTF-8 characters; what I see is not what I stored.

Also see http://mysql.rjweb.org/doc.php/charcoll#python for tips on using UTF-8 with Python and MySQL. To determine where the problem lies, on the INSERT side versus the display side, it can be very important to fetch the hex of what was actually stored.

As you mentioned, hex EFBFBD represents the black diamond, which means the problem arose on the "storing" side of things.


PHP cannot send UTF-8 messages


I have the following PHP code:

<?php 

$to  = '[email protected]'; 
$subject = date("d/m/Y"); 
$message = 'Hey 123 [email protected]# αβγ'; 

$headers = "From: testsite <[email protected]>
"; 
$headers .= "Cc: testsite <[email protected]>
"; 
$headers .= "X-Sender: testsite <[email protected]>
"; 
$headers .= 'X-Mailer: PHP/' . phpversion(); 
$headers .= "X-Priority: 1
"; // Urgent message! 
$headers .= "Return-Path: [email protected]
"; // Return path for errors 
$headers .= "MIME-Version: 1.0
"; 
$headers .= "Content-Type: text/html; charset=iso-8859-1
"; 

mail($to,$subject,$message,$headers); 

echo "<script type='text/javascript'>alert('Message sent! We will get back to you soon!');</script>"; 
echo "<script>window.location.href = 'http://example.com';</script>"; 
?> 

The mail gets sent fine. The problem is that the αβγ (Unicode characters) are not received correctly on the recipient's end.

This is what the recipient sees: Hey 123 [email protected]# Î±Î²Î³. This is what he should see: Hey 123 [email protected]# αβγ.

I've looked everywhere and tried everything: changing the headers, converting my string to Unicode, and so on, but nothing has worked. Maybe I'm doing something wrong?


Possibly a problem with the actual mail reader? – GrumpyCrouton


You are using the wrong charset. –


@FunkFortyNiner How do I fix this? –

Answer (+4)

Use a UTF-8 charset in the header. Instead of:

$headers .= "Content-Type: text/html; charset=iso-8859-1\r\n"; 

use:

$headers .= "Content-Type: text/html; charset=UTF-8\r\n"; 

Does PVS-Studio know about Unicode characters?


This code gets warnings from PVS-Studio on the lines with return:

// Checks if the symbol defines two-symbols Unicode sequence 
bool doubleSymbol(const char c) { 
    static const char TWO_SYMBOLS_MASK = 0b110; 
    return (c >> 5) == TWO_SYMBOLS_MASK; 
} 

// Checks if the symbol defines three-symbols Unicode sequence 
bool tripleSymbol(const char c) { 
    static const char THREE_SYMBOLS_MASK = 0b1110; 
    return (c >> 4) == THREE_SYMBOLS_MASK; 
} 

// Checks if the symbol defines four-symbols Unicode sequence 
bool quadrupleSymbol(const char c) { 
    static const char FOUR_SYMBOLS_MASK = 0b11110; 
    return (c >> 3) == FOUR_SYMBOLS_MASK; 
} 

PVS says the expressions are always false (V547), but they actually aren't: a char can be part of a Unicode symbol that was read into a std::string! Here is the Unicode representation of symbols:
1 byte - 0xxx'xxxx - 7 bits
2 bytes - 110x'xxxx 10xx'xxxx - 11 bits
3 bytes - 1110'xxxx 10xx'xxxx 10xx'xxxx - 16 bits
4 bytes - 1111'0xxx 10xx'xxxx 10xx'xxxx 10xx'xxxx - 21 bits

The following code counts the number of symbols in a Unicode text:

size_t symbolCount = 0; 

std::string s; 
while (getline(std::cin, s)) { 
    for (size_t i = 0; i < s.size(); ++i) { 
     const char c = s[i]; 
     ++symbolCount; 
     if (doubleSymbol(c)) { 
      i += 1; 
     } else if (tripleSymbol(c)) { 
      i += 2; 
     } else if (quadrupleSymbol(c)) { 
      i += 3; 
     } 
    } 
} 

std::cout << symbolCount << "
"; 

For the input Hello! the output is 6, and for Привет, мир! it is 12, which is correct!

Am I wrong, or is it PVS that doesn't know something? ;)


This could be a signed char conversion issue. – user0042


@user0042 Then I don't understand. If there's a problem, why does it work? – SerVB


It's a latent problem. Is your char signed or unsigned? The shift operator gives different results for each. And does PVS know whether your char is signed? –

Answer (+2)

The PVS-Studio analyzer knows about signed and unsigned char types. Whether char is signed or unsigned depends on the compilation switches, and the PVS-Studio analyzer takes those switches into account.

I think this code was compiled with char being of type signed char. Let's see what consequences that has.

Let's look at just the first case:

bool doubleSymbol(const char c) { 
    static const char TWO_SYMBOLS_MASK = 0b110; 
    return (c >> 5) == TWO_SYMBOLS_MASK; 
} 

If the value of the variable 'c' is less than or equal to 01111111, the condition will always be false, because the maximum value you can get from the shift is 011.

This means we only care about the case where the highest bit of the variable 'c' equals 1. Since the variable is of type signed char, a set highest bit means the variable stores a negative value. Before the shift, the signed char is promoted to a signed int, and the value remains negative.

Now let's see what the standard says about right-shifting negative numbers:

The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1/2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

Thus, right-shifting a negative number is implementation-defined: the highest bits are filled either with ones or with zeros. Both are correct.

PVS-Studio assumes the highest bits are filled with ones. It has every right to assume so, since one of the implementations has to be chosen. So if the highest bit of the variable 'c' was originally 1, the expression (c >> 5) will have a negative value. A negative number cannot equal TWO_SYMBOLS_MASK.

It turns out that from PVS-Studio's point of view the condition is always false, and it correctly issues the warning V547.

In practice, the compiler's behavior may differ: the highest bits are filled with 0, and then everything works correctly.

In any case, the code needs to be fixed, since it relies on implementation-defined compiler behavior.

The code can be fixed, for example, as follows:

bool doubleSymbol(const unsigned char c) { 
    static const char TWO_SYMBOLS_MASK = 0b110; 
    return (c >> 5) == TWO_SYMBOLS_MASK; 
} 
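To see the signedness point concretely, here is a small demonstration of my own (an assumption: it runs on a platform where char is signed and right-shifting a negative value replicates the sign bit, which is what most compilers do):

#include <cstdio>

int main() {
    // 0xC0 has the bit pattern 11000000: the first byte of a two-byte UTF-8 sequence.
    const char c = static_cast<char>(0xC0);           // negative if char is signed

    std::printf("signed   c >> 5 = %d\n", c >> 5);    // typically -2: the sign bit is replicated
    std::printf("unsigned c >> 5 = %d\n",
                static_cast<unsigned char>(c) >> 5);  // 6 == 0b110 == TWO_SYMBOLS_MASK
    return 0;
}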

Java comparison to correctly sort strings containing symbols


I have a java program that builds a max heap, calls Heapify, and sorts any list. Currently it sorts letters without a problem, even a list of strings like apple, addle, azzle. Below is a screenshot of the program's input, which takes the number of items to sort on the first line, with the list below it:

[screenshot: the input list (green) and the program's output (white)]

The green is the input, which I know is sorted correctly. If you check the unicode table, you can see the green list is in the correct order. But my program's output (in white) is incorrect.

Below is the code snippet of my Heapify():

//takes the maxheap(array) and begins sorting starting with the root node 
public void Heapify(String[] A, int i) 
{ 
    if(i > (max_size - 2)) 
    { 
     System.out.println("
Heapify exceeded, here are the values:"); 
     System.out.println("max_size = " + max_size); 
     System.out.println("i = " + i); 
     return; 
    } 

    //if the l-child or r-child is going to exceed array, stop 
    if((2 * i) > max_size || ((2 * i) + 1) > max_size) 
     return; 

    String leftChild = getChild("l", i); //get left child value 
    String rightChild = getChild("r", i); //get right child value 

    if ( (A[i].compareTo(leftChild) > 0) && (A[i].compareTo(rightChild) > 0) ) 
     return; //i node is greater than its left and right child node, Heapify is done 

    //if left is greater than right, switch the current and left node 
    if(leftChild.compareTo(rightChild) > 0) 
    { 
     //Swap i and left child 
     Swap(i, (2 * i)); 
     Heapify(this.h, (2 * i)); 
    } else { 
     //Swap i and right child 
     Swap(i, ((2 * i) + 1)); 
     Heapify(this.h, ((2 * i) + 1)); 
    } 

} 

Ignoring the case at the start of the method, you can see that my string comparison simply happens with the standard String.compareTo() in Java. Why doesn't it sort strings containing symbols correctly? Note that I don't want a custom comparator; I just need symbols contained in strings (any symbol on the keyboard) to be evaluated by their unicode representation. The javadoc for compareTo reads as follows:

Compares two strings lexicographically. The comparison is based on the Unicode value of each character in the strings. The character sequence represented by this String object is compared lexicographically to the character sequence represented by the argument string. The result is a negative integer if this String object lexicographically precedes the argument string. The result is a positive integer if this String object lexicographically follows the argument string. The result is zero if the strings are equal; compareTo returns 0 exactly when the equals(Object) method would return true.

It states that it uses unicode, so any suggestions on my problem?

Test file (sorted): test.txt Code files: Main.java MaxHeap.java


Please leave a comment with your "-1" explaining why you downvoted the question, instead of doing a "drive-by downvote" – Chisx

Answer 1 (+2)

You are not using compareTo(); you are using compareToIgnoreCase(), which states that each character is converted to upper case, and then that character is converted to lower case, before comparison.

Your strings differ at their 6th letters, which are Y, n, and ]. After the conversion as documented, the characters are y, n, and ]. So the strings are ordered lexicographically as ], n, Y.


I actually changed it to use 'compareTo', and I still get exactly the same results? So without the conversion I have 'Y', ']', 'n', which lexicographically should stay in the same order.. – Chisx


@Chisx OK, post an [mcve] and we can look into it further. – erickson


I've posted hastebin links with the full code files, plus the test.txt that gets piped into the program with '<' on the Unix command line; you can obviously paste both into a command-line application. – Chisx

Answer 2 (+2)

You used compareToIgnoreCase, whose javadoc states:

This method returns an integer whose sign is that of calling compareTo with normalized versions of the strings, where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character.

So in your example, ']' and 'n' do come before 'y'.


I actually changed it to use 'compareTo', and I still get exactly the same results? So without the conversion I have 'Y', ']', 'n', which lexicographically should stay in the same order.. – Chisx


std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What is exactly a "wide character"?
Answer 1 (+9440)

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to unicode.

On Linux?

Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:

#include <cstring>
#include <iostream>

int main(int argc, char* argv[])
{
   const char text[] = "olé" ;


   std::cout << "sizeof(char)    : " << sizeof(char) << std::endl ;
   std::cout << "text            : " << text << std::endl ;
   std::cout << "sizeof(text)    : " << sizeof(text) << std::endl ;
   std::cout << "strlen(text)    : " << strlen(text) << std::endl ;

   std::cout << "text(ordinals)  :" ;

   for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned char>(text[i])
                          );
   }

   std::cout << std::endl << std::endl ;

   // - - - 

   const wchar_t wtext[] = L"olé" ;

   std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
   //std::cout << "wtext           : " << wtext << std::endl ; <- error
   std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl ;
   std::wcout << L"wtext           : " << wtext << std::endl;

   std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl ;
   std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl ;

   std::cout << "wtext(ordinals) :" ;

   for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned short>(wtext[i])
                              );
   }

   std::cout << std::endl << std::endl ;

   return 0;
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol?
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating or playing with unicode chars, because some combinations of bytes are forbidden in UTF-8.
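As an illustration of that caution, here is a hypothetical helper (mine, not part of the original answer) that truncates a UTF-8 std::string without cutting a multi-byte sequence in half:

#include <string>

// Truncate a UTF-8 string to at most maxBytes bytes without splitting a
// multi-byte sequence. A sketch only; production code should use a library like ICU.
std::string truncateUtf8(const std::string& s, std::string::size_type maxBytes)
{
    if (s.size() <= maxBytes) return s;
    std::string::size_type cut = maxBytes;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx; back up until
    // we land on the first byte of a sequence (or the start of the string).
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return s.substr(0, cut);
}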

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world, before the advent of Unicode.

So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.

For Unicode based applications, Windows uses wchar_t, which is 2-bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).

Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, other than that a UTF-8 text and a UTF-16 text will always use less than or the same amount of memory as a UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 text.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.

See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

  1. When should I use std::wstring over std::string?

    On Linux? Almost never (§).
    On Windows? Almost always (§).
    On cross-platform code? Depends on your toolkit...

    (§) : unless you use a toolkit/framework saying otherwise

  2. Can std::string hold all the ASCII character set including special characters?

    Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!

    On Linux? Yes.
    On Windows? Only special characters available for the current locale of the Windows user.

    Edit (After a comment from Johann Gerell):
    a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. a char from 0 to 127 will be held correctly
    3. a char from 128 to 255 will have a meaning depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
  3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC based compilers that are ported to Windows.
    It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.

  4. What is exactly a wide character?

    On C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to put inside characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

Answer 2 (+580)

I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
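A tiny C++11 illustration of that last point (my sketch, not the answerer's): one character outside the Basic Multilingual Plane occupies two UTF-16 code units:

#include <iostream>
#include <string>

int main()
{
    std::u16string s = u"\U0001F600";   // U+1F600, a code point outside the BMP
    std::cout << s.size() << '\n';      // prints 2: the character is stored as a surrogate pair
    return 0;
}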
Answer 3 (+370)

So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].

My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.

My solution, after in-depth investigation, much frustration and the consequential experiences is the following:

  1. accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)

  2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)

  3. accept that such an UTF8String object is just a dumb, but cheap container. Do never ever access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).

  4. use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String). This is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).

  5. use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.

  6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

    UCS2String ConvertToUCS2( const UTF8String &str );
    UTF8String ConvertToUTF8( const UCS2String &str );
    

The conversions are straightforward, google should help here ...
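For instance, here is one possible sketch of the two helpers on Windows, using the MultiByteToWideChar / WideCharToMultiByte APIs (my assumption; the answer deliberately leaves the implementation open, and error handling is omitted):

#include <string>
#include <windows.h>

typedef std::string  UTF8String;
typedef std::wstring UCS2String;

UCS2String ConvertToUCS2(const UTF8String &str)
{
    if (str.empty()) return UCS2String();
    // The first call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), NULL, 0);
    UCS2String result(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), &result[0], len);
    return result;
}

UTF8String ConvertToUTF8(const UCS2String &str)
{
    if (str.empty()) return UTF8String();
    int len = WideCharToMultiByte(CP_UTF8, 0, str.data(), (int)str.size(),
                                  NULL, 0, NULL, NULL);
    UTF8String result(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, str.data(), (int)str.size(),
                        &result[0], len, NULL, NULL);
    return result;
}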

That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.

Alternatives & Improvements

  • conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88591[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2 (a sketch follows after this list).

  • if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
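A minimal sketch of the translation-table idea from the first bullet (my illustration; for ISO-8859-1 the table is the identity mapping, because Latin-1 coincides with the first 256 Unicode code points, while other single-byte charsets would need a real 256-entry table):

#include <string>

typedef std::wstring UCS2String;   // as typedef'd above

UCS2String Latin1ToUCS2(const std::string &latin1)
{
    UCS2String out;
    out.reserve(latin1.size());
    for (std::string::size_type i = 0; i < latin1.size(); ++i)
        out += static_cast<wchar_t>(static_cast<unsigned char>(latin1[i]));
    return out;
}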

ICU or other unicode libraries?

For advanced stuff.

Answer 4 (+250)
  1. When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if I remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.

    If your wchar_t is 32 bits long, then you can use utf-32 as an unicode encoding, and you can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters.

  2. Yes, char is always at least 8 bit long, which means it can store all ASCII values.
  3. Yes, all major compilers support it.
Answer 5 (+50)

I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use utf-8 as the native string type as well.

For example, I use utf-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the string.

Answer 6 (+30)
  1. When you want to store 'wide' (Unicode) characters.
  2. Yes: 255 of them (excluding 0).
  3. Yes.
  4. Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
Answer 7 (+20)

Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.

The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.

Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.

The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.

If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.

Answer 8 (+10)
  1. when you want to use Unicode strings and not just ASCII, helpful for internationalisation
  2. yes, but it doesn't play well with 0
  3. not aware of any that don't
  4. wide character is the compiler-specific way of handling the fixed-length representation of a unicode character; for MSVC it is a 2 byte character, for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html

2. std::string can hold the NULL character just fine. It can also hold utf-8 and wide characters. – Juan Dec 31 '08 at 4:29

@Juan: That confuses me again. If std::string can hold unicode characters, what is special about std::wstring? – Appu Dec 31 '08 at 4:33

@Appu: std::string can hold UTF-8 unicode characters. There are a number of unicode standards targeted at different character widths. UTF-8 is 8 bits wide; there are also UTF-16 and UTF-32, at 16 and 32 bits wide respectively. – Greg D Dec 31 '08 at 4:40

Use std::wstring. With a fixed-length encoding, each unicode character can be one wchar_t, e.g. if you choose the joel-on-software approach that Greg links to. Then the length of the wstring is the number of unicode characters in the string. But it takes more space. – Juan Dec 31 '08 at 4:43

I did not say it cannot hold 0; I meant "doesn't play well" as in some methods may not give you the expected result containing all the data of the wstring. So harsh with the downvotes. – Greg Domjan Dec 31 '08 at 4:53

Answer 9 (0)

1) As mentioned by Greg, wstring is helpful for internationalization; that's when you will be releasing your product in languages other than English

4) Check this out for wide character http://en.wikipedia.org/wiki/Wide_character

Answer 10 (0)

There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using << and passing a filename to std::fstream.

I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter and I'm happy to change it if I have things wrong.

String literals

If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.

If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all Non ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.

The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.

std::cout

When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).

std::fstream filenames

Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.

Your options

If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.

If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.

If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.

Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code

#include <fstream>
#include <string>
#include <iostream>
#include <Windows.h>

#ifdef _WIN32
typedef std::wstring unicodestring;
#define UNI(text) L ## text
std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    //Call WideCharToMultiByte to do the conversion
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{

    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}

would be fine on either platform I think.

Answers

So to answer your questions

1) If you are programming for Windows, then all the time; if cross-platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform-specific #ifdefs to work around the differences; if just using Linux, then never.

2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.

3) I believe so, but if not, it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char.

4) A wide character is a character type which is bigger than the 1 byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.

關於“但是,在將源傳遞給編譯器之前,VS將UTF-8編碼的文本轉換為代碼頁編碼文本,代碼頁中缺少的任何字符都替換為?”。 - >當編譯器使用UTF-8編碼(使用/ utf-8)時,我不認為這是真的。 - Roi Danton 1月14日9:42

我不知道這是一個選擇。從這個鏈接docs.microsoft.com/en-us/cpp/build/reference/...似乎沒有要在項目屬性中選擇的複選框,您必須將其添加為附加命令行選項。好點! - 菲爾羅森伯格1月15日10:57

Answer 11 (0)

A good question! I think DATA ENCODING (sometimes a CHARSET is also involved) is a MEMORY EXPRESSION MECHANISM for saving data to a file or transferring data via a network, so I answer this question as:

1. When should I use std::wstring over std::string?

If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a network 2-byte stream, we should declare a std::wstring variable to process them easily. e.g.: wstring ws=L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get '国', and ws[2] to get 'a', etc.

2. Can std::string hold the entire ASCII character set, including the special characters?

Yes. But notice: American ASCII means each 0x00~0xFF octet stands for one character, including printable text such as "123abc&*_&" and the special ones you mentioned, which are mostly printed as a '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII" charsets, e.g. Chinese, which uses 2 octets to stand for one character.

3. Is std::wstring supported by all popular C++ compilers?

Maybe, or mostly. I have used VC++ 6 and GCC 3.3: YES.

4. What is exactly a "wide character"?

a wide character mostly means using 2 octets or 4 octets to hold all countries' characters. 2-octet UCS2 is a representative sample; for example, for English 'a', its memory is the 2 octets 0x0061 (vs. 'a' in ASCII, whose memory is the 1 octet 0x61).

Answer 12 (-50)

When should you NOT use wide-characters?

When you're writing code before the year 1990.

Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?

@dave: I don't know what headaches UTF-8 creates that are bigger than those of widechars (UTF-16). In UTF-16 you also have multi-unit characters. – Pavel Radzivilovsky Dec 29 '09 at 16:08

The problem is that if you are anywhere outside an English-speaking country, you ought to use wchar_t. Not to mention that some alphabets have far more characters than you can fit into a byte. We were there, on DOS. Codepage schizophrenia, no thanks, no more. – Swift - Friday Pie Nov 26 '16 at 23:02

@Swift The problem with wchar_t is that its size and meaning are OS-specific. It just swaps the old problems for new ones. Whereas a char is a char regardless of OS (on similar platforms, at least). So we might as well just use UTF-8, pack everything into sequences of chars, and lament how C++ leaves us completely on our own, without any standard way to measure, index, find etc. within such sequences. – underscore_d May 21 '17 at 14:16

@Swift You seem to have it completely backwards. wchar_t is a fixed-width data type, so an array of 10 wchar_t will always occupy sizeof(wchar_t) * 10 platform bytes. UTF-16 is a variable-width encoding in which characters can consist of 1 or 2 16-bit code points (and s/16/8/g for UTF-8). – underscore_d May 21 '17 at 14:42

@SteveHollasch The wchar_t representation of strings on Windows would encode characters larger than FFFF as special surrogate pairs; others would take only one wchar_t element. So that representation will not be compatible with the representation created by the gnu compiler (where all characters less than FFFF will have a zero word in front of them). What is stored in wchar_t is determined by the programmer and the compiler, not by some protocol. – Swift - Friday Pie May 5 '17 at 0:33


Unicode error in python 3.5 conda when implementing the RenderTree function


I am using the anytree package to create a tree:

udo = Node("Udo") 
marc = Node("Marc", parent=udo) 
print(RenderTree(udo)) 

While using the RenderTree function, I get the following unicode error:

Traceback (most recent call last): File "C:\Users\Neelakshi\workspace\LogisticRegression\TypeHierarchyTest.py", line 16, in print(RenderTree(udo)) File "C:\Miniconda3\envs\Python35\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-16: character maps to <undefined>

I found similar threads about this issue, but could not find a solution in them. I am running this sample program from eclipse, not from the command line. Here are the package details:

python: 3.5.1 
conda: 4.3.24 
Answer (0)

I solved this by changing the Eclipse encoding setting to UTF-8. It was cp1252 before.


How do I see what character set a MySQL database / table / column is?

What is the (default) charset for:

  • MySQL database

  • MySQL table

  • MySQL column


How do I remove the BOM character from my xml file [duplicate]

This question already has an answer here:

I am using xsl to control the output of my xml file, but the BOM character is being added.

Answer 1 (+1660)
# vim file.xml
:set nobomb
:wq
Answer 2 (+180)

The File BOM Detector (freeware for Windows) makes it easy to remove the byte order mark.

+1 I had a bunch of files with BOMs, and this tool helped me fix them easily. The only other way to batch-process them that I've found so far involves writing a script. Thanks! – Walter Stabosz Mar 21 '12 at 20:24

+1 It's a tiny standalone .exe, and it does exactly what you think it should/hope it will do to the BOMs in your xml files. – pettys Feb 18 '13 at 0:51

Answer 3 (+20)

just need to add this in your xslt file:

<xsl:output method="text"
        encoding="ASCII"/>
Answer 4 (+10)

Just strip first two bytes using any hex editor.
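If you would rather do it programmatically, a hypothetical sketch (mine, not the answerer's) that strips a UTF-8 BOM (the three bytes EF BB BF) could look like this:

#include <fstream>
#include <iterator>
#include <string>

int main(int argc, char* argv[])
{
    if (argc < 3) return 1;                        // usage: strip_bom in.xml out.xml
    std::ifstream in(argv[1], std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    if (data.compare(0, 3, "\xEF\xBB\xBF") == 0)   // UTF-8 BOM present?
        data.erase(0, 3);
    std::ofstream(argv[2], std::ios::binary) << data;
    return 0;
}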

Or 3, depending on the UTF flavour – Nov 17 '08 at 12:48

Or 4, for UTF-32. But it's most likely 3; UTF-8 is the most common XML encoding. – Alan Moore Nov 24 '08 at 5:59

Answer 5 (+10)

Removing the BOM symbol from a string with XSLT is pretty simple:

<xsl:value-of select="translate(StringWithBOM, '&#xFEFF;', '')"/>

Answer 6 (0)

I was under the impression that XML is encouraged to be written in Unicode, in some Unicode encoding, and that certain Unicode encodings are specified to contain an initial byte-order mark. Without that byte-order mark, your file is no longer correctly encoded in a Unicode encoding and therefore no longer correct XML. XML processors are encouraged to be unforgiving, to fail immediately on the slightest error (such as an incorrect Unicode encoding). What kinds of XML processors are you looking to break?

Obviously, stripping a byte-order mark from a UTF-8 encoded document makes that document appear to be ASCII encoded (not Unicode), and some text processors are capable only of using ASCII encoded documents. Is this what you're working with?

For XML files with no encoding specified and no BOM, UTF-8 is the default encoding. – mjn Dec 15 '09 at 14:07

Answer 7 (0)

What output encoding is your XSL set to use? What encoding is the input document? Where's the input coming from, and where was it saved/uploaded/downloaded in the meantime?

XML and XSL should default to using UTF-8 if nothing else is specified. But clearly, something's going wrong here.

One thing which might happen is, the XML is being served up by a web server which is set by default to serve in ISO-8859-1, a pretty good default ... pre-Unicode.

Slightly off-topic, but Joel's very instructive article about text encodings was an eye-opener to me. There are a lot of people out there who are otherwise very smart about programming, but who persist in thinking there's such a thing as "plain text" or calling their text "ASCII" or "ANSI". It's an issue you really need to get to grips with if you haven't yet.


Why do we convert from MultiByte to WideChar?


I am used to dealing with ASCII strings, but now with UNICODE I am getting confused by some terms:

What is a multibyte character and what is a widechar? What is the difference? Is multibyte a character that takes more than one byte in memory, and widechar just a data type to represent it?

  • Why do we convert with MultiByteToWideChar and WideCharToMultiByte?

If I declare something like this:

wchar_t* wcMsg = L"?????"; 
MessageBoxW(0, wcMsg, 0, 0); 

It prints the message correctly if I define UNICODE. But why didn't I have to convert with WideCharToMultiByte here?

  • What is the difference between the character sets in my project: _MBCS and UNICODE?

  • One last thing MSDN confuses me with: "Windows APIs" are UTF-16.

Can anyone explain with some examples? A good clarification is really appreciated.


The operating system uses utf16-encoded strings as its native string type. Not using wchar_t or std::wstring, and finding "ASCII strings" good enough, is how programmers end up using MultiByteToWideChar a lot. That isn't particularly wrong, just inefficient. But there is of course no hope at all of writing Arabic in an ASCII-encoded string. C++ and inefficient should not be two words used in the same sentence or program. Consider using a UI framework like Qt to ease the pain. –


@HansPassant: Thank you. How about adding that as an answer? It would be very helpful. – WonFeiHong


"I am used to dealing with ASCII strings": that is highly suspect. If you have been calling Win32 API functions like MessageBoxA, reading or writing text files, reading or writing the console, or using the C library, you have been using the user's "locale", which specifies a character encoding like CP437, Windows-1252 or similar, and is almost certainly not ASCII. –

Answer (+2)

An ASCII string has a char width of one byte (usually 8 bits, rarely 7, 9 or other bit widths). This is a legacy of the time when memory sizes were very small and expensive, and processors could often handle only one byte per instruction.

As is easy to imagine, one byte is far from enough to store all the glyphs available in the world. Chinese alone has 87,000 glyphs. A char can usually handle only 256 glyphs (in an 8 bit byte). ASCII defines only 96 glyphs (plus the lower 32 char values, which are defined as non-printable control characters). That is enough for English text characters, digits, and some punctuation and other glyphs.

To handle more glyphs than one byte can hold, one approach is to store the base glyphs in one byte, other common glyphs in two bytes, and rarely used glyphs in 3 or even more bytes. This approach is called a multi byte char set or variable-width encoding. A very common example is UTF-8, which uses from 1 to 4 bytes for one character. It stores the ASCII character set in one byte (so it is also backwards compatible to ASCII). The highest bit is defined as a switch: if it is set, other bytes will follow. The same applies to the following bytes, so that a "chain" of up to 4 bytes is formed. The pros of a variable-width char set are:

  • Backwards compatibility with the 8 bit ASCII char set
  • Memory friendly - uses as little memory as possible

The con is:

  • Harder and more expensive to process. You cannot simply iterate over a string and assume that each myString[n] yields one glyph; instead, you must evaluate each byte to know whether more bytes follow.

Another approach is to store every character in a fixed-length word made of n bytes, which is wide enough to hold all possible glyphs. This is called a fixed-width char set; all characters have the same width. A well-known example is UTF-32. It is 32 bits wide and can store all possible characters in one word. The pros and cons of a fixed-width char set are obviously the opposite of a variable-width char set: memory-heavy but easier to iterate.

But Microsoft chose their native char set even before UTF-32 was available: they use UTF-16 as the char set of Windows, which uses a word length of at least 2 bytes (16 bits). That is large enough to store a lot more glyphs than a single-byte char set, but not all of them. Considering this, Microsoft's distinction between "multi byte" and "Unicode" is a bit misleading today, because their unicode implementation is also a multi-byte char set, just one with a bigger minimum size for one glyph. Some say that is a good compromise, some say it is the worst of both worlds; anyway, that is the way it is. And at that time (Windows NT), it was the only available Unicode char set, and from this perspective their distinction between multi char and Unicode was correct at that time (see Raymond Chen's comment).

Of course, if you want to convert a string in one encoding (say UTF-8) into another one (say UTF-16), you have to convert them. That is what MultiByteToWideChar does for you, and WideCharToMultiByte the other way round. There are some other conversion functions and libraries, too.
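For example, a minimal sketch of such a conversion (my illustration, assuming a C++11 compiler and the Windows API): a UTF-8 string is widened with MultiByteToWideChar so it can be passed to a "W" function:

#include <windows.h>
#include <string>

int main()
{
    const char* utf8 = u8"ol\u00E9";   // a UTF-8 encoded string ("olé")

    // The first call computes the required buffer size in wchar_t units
    // (including the terminating zero); the second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], len);

    MessageBoxW(NULL, wide.c_str(), L"UTF-8 to UTF-16", MB_OK);
    return 0;
}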

These conversions cost a fair amount of time, so the conclusion is: if you make heavy use of strings and system calls, for performance's sake you should use the native char set of your operating system, which in your case would be UTF-16.

So for your string handling you should choose wchar_t, which in the case of Windows means UTF-16. Unfortunately, the width of wchar_t may vary from compiler to compiler; under Unix it is usually UTF-32, under Windows it is UTF-16.

_MBCS is an automatic preprocessor define which tells you that you have defined your character set as multi-byte; UNICODE tells you that you have set it to UTF-16.

Even in a program which does not have the UNICODE define set, you can write

wchar_t* wcMsg = L"?????"; 
MessageBoxW(0, wcMsg, 0, 0); 

The prefix L defines your string as a UNICODE (wide char) string, and you can call system functions with it.

Unfortunately, you cannot write

char* msg = u8"?????"; 
MessageBoxA(0, msg, 0, 0); 

Character set support was improved in C++11, so you can also define UTF-8 strings with the prefix u8. But the Windows functions with the "A" postfix do not understand UTF-8. (See also https://stackoverflow.com/a/504789/2328447.) This also suggests using UTF-16 aka UNICODE under Windows/Visual Studio.

Setting your project to "Use Multi-Byte Character Set" or "Use Unicode Character Set" also changes a lot of other character-dependent defines: the most common ones are the macros TCHAR, _T(), and all string-dependent Windows functions without a postfix, e.g. MessageBox() (without the W or A postfix). If you set your project to "Use Multi-Byte Character Set", TCHAR will expand to char, _T() will expand to nothing, and the Windows functions will get the A postfix attached. If you set your project to "Use Unicode Character Set", TCHAR will expand to wchar_t, _T() will expand to the L prefix, and the Windows functions will get the W postfix attached.

This means that writing

TCHAR* msg = _T("Hello"); 
MessageBox(0, msg, 0, 0); 

will compile with both the multi byte char set and the unicode set. You can find some comprehensive guides about these topics at MSDN.

Unfortunately,

TCHAR* msg = _T("?????"); 
MessageBox(0, msg, 0, 0); 

still does not work when "Use Multi-Byte Character Set" is selected. The Windows functions still do not support UTF-8, and you will even get some compiler warnings, because you have defined unicode characters which are contained in a string not marked as Unicode (_T() does not expand to u8).


The chronology is not quite right. The Unicode Consortium started with a 16-bit character set, and the recommended encoding was in 16-bit units. That encoding was called "Unicode". Windows followed the recommendation and called its encoding "Unicode". The Unicode Consortium did not switch to a 32-bit character set until 1996, long after Windows NT was released. So at the time Windows was designed, calling the 16-bit-character encoding "Unicode" was correct. (To make things even more confusing, the Unicode Consortium later shrank it to a 26-bit character set.) –


Oh, OK, thanks, then I misunderstood. I will try to fix it. – user2328447


(Correction: Unicode is currently a 21-bit character set, not 26. There are 17 planes, each with 65536 characters.) –