| MIME / IANA | ISO-10646-UTF-1 |
|---|---|
| Language | International |
| Current status | Obscure, of mainly historical interest. |
| Classification | Unicode Transformation Format, extended ASCII, variable-width encoding |
| Extends | US-ASCII |
| Transforms / Encodes | ISO/IEC 10646 (Unicode) |
| Succeeded by | UTF-8 |
UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
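The filename problem can be illustrated with one row of the comparison table later in this article. The following is a minimal Python sketch, assuming only the byte values listed there for U+D7FF; the variable names are illustrative.

```python
# UTF-1 bytes for U+D7FF, copied from the comparison table in this article.
utf1_bytes = bytes.fromhex("F7 2F C3")

# UTF-8 bytes for the same code point, produced by Python's built-in codec.
utf8_bytes = chr(0xD7FF).encode("utf-8")

# 0x2F is the ASCII forward slash, which Unix treats as a path separator.
slash_in_utf1 = b"/" in utf1_bytes
slash_in_utf8 = b"/" in utf8_bytes

print(slash_in_utf1)  # True
print(slash_in_utf8)  # False
```

A naive byte-level search for "/" therefore reports a match inside the UTF-1 encoding of a single non-ASCII character, whereas UTF-8 never uses bytes below 0x80 in multi-byte sequences.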
Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (the code points U+0080 through U+009F are also single bytes).
UTF-1 does not use the C0 and C1 control codes, the space character, or DEL in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. This design, with 66 protected byte values, was intended to be compatible with ISO/IEC 2022.
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
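The three moduli can be re-derived by counting the protected byte values. This is a small Python check of the arithmetic quoted above; the variable names are illustrative.

```python
# Byte values UTF-1 never uses inside a multi-byte sequence:
# 0x00-0x20 (C0 controls and space) and 0x7F-0x9F (DEL and C1 controls).
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
utf1_base = 256 - len(utf1_protected)

# UTF-8 continuation bytes give up two of their eight bits,
# leaving six payload bits per byte.
utf8_base = 2 ** (8 - 2)

# Byte values BOCU-1 protects, as listed in the text above.
bocu1_protected = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
bocu1_base = 256 - len(bocu1_protected)

print(len(utf1_protected), utf1_base)    # 66 190
print(utf8_base)                         # 64
print(len(bocu1_protected), bocu1_base)  # 13 243
```
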
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|---|---|---|---|---|---|---|
| U+0000 | U+009F | 00–9F | | | | |
| U+00A0 | U+00FF | A0 | A0–FF | | | |
| U+0100 | U+4015 | A1–F5 | 21–7E, A0–FF | | | |
| U+4016 | U+38E2D | F6–FB | 21–7E, A0–FF | 21–7E, A0–FF | | |
| U+38E2E | U+7FFFFFFF | FC–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF | 21–7E, A0–FF |
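
The byte ranges in the table above amount to writing an offset code point in base 190 and mapping each digit to one of the 190 permitted trail bytes. The following is a minimal Python sketch of an encoder under that reading; the range offsets (0x100, 0x4016, 0x38E2E) come from the table, while the names `trail_byte` and `utf1_encode` are illustrative and not part of any library.

```python
def trail_byte(digit: int) -> int:
    """Map a base-190 digit (0..189) to a byte in 21-7E or A0-FF."""
    # Digits 0..93 land in 0x21..0x7E; digits 94..189 land in 0xA0..0xFF.
    return digit + 0x21 if digit < 0x5E else digit + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode a single code point (0..0x7FFFFFFF) as UTF-1 bytes."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range for UTF-1")
    if cp < 0xA0:                      # one byte: ASCII, DEL and C1 controls
        return bytes([cp])
    if cp < 0x100:                     # two bytes with the fixed lead byte A0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes, lead byte A1..F5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, trail_byte(y % 190)])
    if cp < 0x38E2E:                   # three bytes, lead byte F6..FB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      trail_byte(y // 190 % 190),
                      trail_byte(y % 190)])
    y = cp - 0x38E2E                   # five bytes, lead byte FC and up
    return bytes([0xFC + y // 190**4,
                  trail_byte(y // 190**3 % 190),
                  trail_byte(y // 190**2 % 190),
                  trail_byte(y // 190 % 190),
                  trail_byte(y % 190)])


for cp in (0x7F, 0xA0, 0x100, 0x4016, 0x10FFFF):
    print(f"U+{cp:04X} -> {utf1_encode(cp).hex().upper()}")
```

The integer divisions and remainders by 190 are the operations that make UTF-1 slower than UTF-8, which needs only shifts and masks.
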
| Code point | UTF-8 | UTF-1 |
|---|---|---|
| U+007F | 7F | 7F |
| U+0080 | C2 80 | 80 |
| U+009F | C2 9F | 9F |
| U+00A0 | C2 A0 | A0 A0 |
| U+00BF | C2 BF | A0 BF |
| U+00C0 | C3 80 | A0 C0 |
| U+00FF | C3 BF | A0 FF |
| U+0100 | C4 80 | A1 21 |
| U+015D | C5 9D | A1 7E |
| U+015E | C5 9E | A1 A0 |
| U+01BD | C6 BD | A1 FF |
| U+01BE | C6 BE | A2 21 |
| U+07FF | DF BF | AA 72 |
| U+0800 | E0 A0 80 | AA 73 |
| U+0FFF | E0 BF BF | B5 48 |
| U+1000 | E1 80 80 | B5 49 |
| U+4015 | E4 80 95 | F5 FF |
| U+4016 | E4 80 96 | F6 21 21 |
| U+D7FF | ED 9F BF | F7 2F C3 |
| U+E000 | EE 80 80 | F7 3A 79 |
| U+F8FF | EF A3 BF | F7 5C 3C |
| U+FDD0 | EF B7 90 | F7 62 BA |
| U+FDEF | EF B7 AF | F7 62 D9 |
| U+FEFF | EF BB BF | F7 64 4C |
| U+FFFD | EF BF BD | F7 65 AD |
| U+FFFE | EF BF BE | F7 65 AE |
| U+FFFF | EF BF BF | F7 65 AF |
| U+10000 | F0 90 80 80 | F7 65 B0 |
| U+38E2D | F0 B8 B8 AD | FB FF FF |
| U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 |
| U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A |
| U+100000 | F4 80 80 80 | FC 21 37 B2 7B |
| U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C |
| U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 |
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
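Rows of the comparison table can be read back with a decoder that reverses the base-190 arithmetic. The following is a minimal Python sketch, assuming the byte ranges given in the first table; the names `digit` and `utf1_decode_one` are illustrative and not part of any library.

```python
from typing import Tuple


def digit(b: int) -> int:
    """Map a trail byte back to its base-190 digit (0..189)."""
    if 0x21 <= b <= 0x7E:
        return b - 0x21
    if 0xA0 <= b <= 0xFF:
        return b - 0x42
    raise ValueError(f"invalid UTF-1 trail byte: {b:#04x}")


def utf1_decode_one(data: bytes) -> Tuple[int, int]:
    """Decode the first character; return (code_point, bytes_consumed)."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0xA0:                    # single byte stands for itself
        return lead, 1
    # The lead byte alone determines the sequence length.
    length = 2 if lead <= 0xF5 else 3 if lead <= 0xFB else 5
    if len(data) < length:
        raise ValueError("truncated UTF-1 sequence")
    if lead == 0xA0:                   # A0 xx encodes U+00A0..U+00FF directly
        if data[1] < 0xA0:
            raise ValueError("invalid byte after lead byte A0")
        return data[1], 2
    if length == 2:
        base, value = 0x100, lead - 0xA1
    elif length == 3:
        base, value = 0x4016, lead - 0xF6
    else:
        base, value = 0x38E2E, lead - 0xFC
    for b in data[1:length]:           # accumulate base-190 digits
        value = value * 190 + digit(b)
    cp = base + value
    if cp > 0x7FFFFFFF:
        raise ValueError("value exceeds the 31-bit UCS-4 range")
    return cp, length


cp, used = utf1_decode_one(bytes.fromhex("FD BD 2B B9 40"))
print(f"U+{cp:X}", used)  # U+7FFFFFFF 5
```

Because trail bytes share the ranges 21–7E and A0–FF with single-byte characters and lead bytes, such a decoder has to start at a known character boundary; it cannot resynchronize from an arbitrary offset in the way a UTF-8 decoder can.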