Information

  • Comparison of Unicode encodings - Wikipedia

    UTF-16 and UTF-32 do not have endianness defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from a byte-oriented file. This may be achieved by using a byte-order mark at the start of the text or assuming big-endian (RFC 2781). UTF-8, UTF-16BE, UTF-32BE, UTF-16LE and UTF-32LE are standardised on a single byte order and do not have this problem.

  • JavaScript’s internal character encoding: UCS-2 or UTF-16?

    UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF.

    JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16

    This article is truly excellent!
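
To make the byte-order point from the first excerpt concrete, here is a minimal Python sketch. It assumes only the standard `codecs` module; the variable names are illustrative and come from neither quoted article. It encodes the same character with both explicit byte orders, then shows that a decoder given plain "utf-16" uses the byte-order mark to pick the right one.

```python
import codecs

text = "A"  # U+0041, a single 16-bit code unit in UTF-16

# The explicit variants fix the byte order and write no byte-order mark.
big_endian = text.encode("utf-16-be")     # bytes 00 41
little_endian = text.encode("utf-16-le")  # bytes 41 00

# Prepending a BOM lets a generic "utf-16" decoder detect the byte order.
with_bom_be = codecs.BOM_UTF16_BE + big_endian     # FE FF 00 41
with_bom_le = codecs.BOM_UTF16_LE + little_endian  # FF FE 41 00

# Both decode to the same text; the BOM is consumed by the decoder.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

print(big_endian.hex(), little_endian.hex())
print(decoded_be == decoded_le == text)
```

Decoding the big-endian bytes as "utf-16-le" would not raise an error here; it would silently yield a different character (U+4100), which is why the byte order has to be agreed on or signalled.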

Summary

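  • UTF-16 is a variable-length Unicode transformation format, and it has an endianness issue. Code points at the start of the Unicode range (those that fit in 16 bits, U+0000 to U+FFFF) are encoded as a single 16-bit code unit. Code points beyond what 16 bits can represent are encoded as two 16-bit code units, that is, 32 bits in total.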

  • UCS-2 is the predecessor of UTF-16. It uses a fixed length of 2 bytes (16 bits) to represent each code point, so its capacity is smaller and it cannot represent every Unicode code point.
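
The difference between the two summary points can be sketched in a few lines of Python. This is an illustration of the standard surrogate-pair arithmetic, not code from either quoted article; the function name `utf16_code_units` is hypothetical. A code point up to U+FFFF maps to one code unit, which is all UCS-2 can express, while anything above is split into a high and a low surrogate.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # Fits in 16 bits: one code unit, identical to UCS-2.
        return [code_point]
    # Beyond 16 bits: subtract 0x10000, then split the 20-bit offset
    # into a high surrogate (top 10 bits) and a low surrogate (bottom 10 bits).
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


bmp_units = utf16_code_units(0x4E2D)     # U+4E2D, inside the 16-bit range
astral_units = utf16_code_units(0x1F600)  # U+1F600, outside the 16-bit range

print([hex(u) for u in bmp_units])     # ['0x4e2d']
print([hex(u) for u in astral_units])  # ['0xd83d', '0xde00']
```

The result for U+1F600 matches what Python's built-in codec produces for `"\U0001F600".encode("utf-16-be")`, namely the bytes D8 3D DE 00.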

References

  • UTF-16 - Wikipedia

    “UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.”

    UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set) once it became clear that 16 bits were not sufficient for Unicode’s user community.