The Slice Type section of “the book” states:
Note: String slice range indices must occur at valid UTF-8 character boundaries. If you attempt to create a string slice in the middle of a multibyte character, your program will exit with an error. For the purposes of introducing string slices, we are assuming ASCII only in this section; a more thorough discussion of UTF-8 handling is in the “Storing UTF-8 Encoded Text with Strings” section of Chapter 8.
This note is best illustrated with the following Vietnamese poem verse:
Content of src\example_01.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    let slice = &vstr[0..3];
    println!("string = [{}]", vstr);
    println!("slice = [{}]", slice);
}
Compile and run with the following commands:
F:\rust\strings>rustc src\example_01.rs
F:\rust\strings>example_01.exe
And as per documentation, the executable exits with an error:
thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside 'ầ' (bytes 2..5) of `Đầu bút nghiễn hề sự cung đao`', src\example_01.rs:3:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Đ is 2 (two) bytes, and ầ is 3 (three) bytes, so ầ occupies bytes 2, 3 and 4. Byte index 3 therefore falls in the middle of ầ, which is what the above error is about.
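The panic can be avoided by testing the range before slicing. The following is a minimal sketch: is_char_boundary() and get() are standard str methods, while try_slice is a made-up helper name used only for illustration.

```rust
/// Hypothetical helper: returns the slice only when both ends of the
/// range fall on UTF-8 character boundaries and lie within the string.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // `str::get` returns `None` instead of panicking on a bad range.
    s.get(start..end)
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Byte 3 is inside 'ầ' (bytes 2..5); byte 5 is where 'u' starts.
    println!("boundary at 3? {}", vstr.is_char_boundary(3));
    println!("boundary at 5? {}", vstr.is_char_boundary(5));

    for end in [3, 5] {
        match try_slice(&vstr, 0, end) {
            Some(slice) => println!("0..{} -> slice = [{}]", end, slice),
            None => println!("0..{} is not a valid slice range", end),
        }
    }
}
```

With this approach a bad index becomes a None that the caller has to handle, rather than a run-time panic.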
For me, the obvious question is: for Unicode strings, how do we know the correct byte index to use? We could certainly iterate over the characters and calculate the byte index, but that seems overly complicated for such a simple task.
We can use the String’s chars() iterator to iterate over each character, and print out each character code and length in bytes as follows:
Content of src\example_02.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    for ch in vstr.chars() {
        println!("char: {}, code: {}, byte size: {}", ch, ch as u32, ch.len_utf8());
    }
}
There are 29 (twenty nine) characters, so the above executable prints 29 (twenty nine) lines, one for each character.
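On the question of finding the correct byte index: the standard char_indices() iterator yields each character together with the byte offset at which it starts, so the offset does not have to be added up by hand. A minimal sketch follows; first_n_chars is a hypothetical helper name, and “character” here means a Unicode scalar value as returned by chars().

```rust
/// Hypothetical helper: returns the slice holding the first `n` characters
/// of `s`, or all of `s` when it has fewer than `n` characters.
fn first_n_chars(s: &str, n: usize) -> &str {
    // The byte offset of the (n + 1)-th character is where the slice ends.
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Print the first few characters with their starting byte offsets.
    for (byte_idx, ch) in vstr.char_indices().take(4) {
        println!("byte offset: {}, char: {}", byte_idx, ch);
    }

    println!("first 3 chars = [{}]", first_n_chars(&vstr, 3));
}
```

Because the offsets come from the iterator itself, every index used for slicing is guaranteed to be a character boundary.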
We can use the String’s as_bytes() method to iterate over the bytes:
Content of src\example_03.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    let bytes = vstr.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        println!("i: {}, item: {}", i, item);
    }
}
There are 40 (forty) bytes in total. The first 5 (five) bytes are 196, 144, 225, 186 and 167, which correspond to the first two (2) characters Đầ.
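These byte values can be cross-checked one character at a time. The sketch below assumes a hypothetical helper, char_bytes, built on the standard char::encode_utf8() method, and also shows that byte index 5 is a valid place to end a slice.

```rust
/// Hypothetical helper: the UTF-8 bytes of a single character.
fn char_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char needs at most 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    println!("{:?}", char_bytes('Đ'));
    println!("{:?}", char_bytes('ầ'));

    // Byte 5 is a character boundary, so this slice does not panic.
    let slice = &vstr[0..5];
    println!("slice = [{}], bytes = {:?}", slice, slice.as_bytes());
}
```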
As per the documentation of String::from_utf8(…), if we feed the above 5 (five) bytes to this method, we get Đầ:
Content of src\example_04.rs:
fn main() {
    let first_two_char_bytes = vec![196, 144, 225, 186, 167];
    let first_two_char = String::from_utf8(first_two_char_bytes).unwrap();
    println!("{}", first_two_char);
}
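The unwrap() above panics when the bytes are not valid UTF-8, for example when a multibyte character is cut short. As a sketch of handling that case, the hypothetical helper decode below turns the error into a message; utf8_error() and valid_up_to() are standard methods on the error types returned by String::from_utf8().

```rust
/// Hypothetical helper: decodes bytes as UTF-8, returning a readable
/// error message instead of panicking on invalid input.
fn decode(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| {
        // `valid_up_to()` is the number of leading bytes that were valid UTF-8.
        format!(
            "invalid UTF-8 after {} valid byte(s)",
            e.utf8_error().valid_up_to()
        )
    })
}

fn main() {
    // All 5 bytes: decodes to the first two characters.
    println!("{:?}", decode(vec![196, 144, 225, 186, 167]));
    // Only 3 bytes: the second character is incomplete.
    println!("{:?}", decode(vec![196, 144, 225]));
}
```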
Letters f, j, w and z are not official letters of the Vietnamese alphabet (most Vietnamese are aware of them, even if they don’t know English). In addition to the remaining 22 (twenty two) letters, there are 67 (sixty seven) additional letters and, as in English, each has an upper case and a lower case form. Some of these letters are not uniquely Vietnamese and are found in other languages. Following are a few of them: à, â, đ, è, é, ê, ì, ò, ô and ù.
We could have global constants for the 67 (sixty seven) letters as follows:
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";
They’re listed in two orders: first by the base Latin letter, then by the Vietnamese diacritic tonal mark, where the acute tonal mark (e.g. Á, Ố etc.) comes before any other mark.
For the English alphabet, the ASCII codes are in sequence. That is, the ASCII code for capital A is 65, B is 66 and so on. This is not the case for the “Vietnamese” letters, as seen in the following table:
Upper Case | | | Lower Case | | |
Char | Code | Byte # | Char | Code | Byte # |
Á | 193 | 2 | á | 225 | 2 |
À | 192 | 2 | à | 224 | 2 |
Ả | 7842 | 3 | ả | 7843 | 3 |
à | 195 | 2 | ã | 227 | 2 |
Ạ | 7840 | 3 | ạ | 7841 | 3 |
Ă | 258 | 2 | ă | 259 | 2 |
Ắ | 7854 | 3 | ắ | 7855 | 3 |
Ằ | 7856 | 3 | ằ | 7857 | 3 |
Ẳ | 7858 | 3 | ẳ | 7859 | 3 |
Ẵ | 7860 | 3 | ẵ | 7861 | 3 |
Ặ | 7862 | 3 | ặ | 7863 | 3 |
 | 194 | 2 | â | 226 | 2 |
Ấ | 7844 | 3 | ấ | 7845 | 3 |
Ầ | 7846 | 3 | ầ | 7847 | 3 |
Ẩ | 7848 | 3 | ẩ | 7849 | 3 |
Ẫ | 7850 | 3 | ẫ | 7851 | 3 |
Ậ | 7852 | 3 | ậ | 7853 | 3 |
Đ | 272 | 2 | đ | 273 | 2 |
É | 201 | 2 | é | 233 | 2 |
È | 200 | 2 | è | 232 | 2 |
Ẻ | 7866 | 3 | ẻ | 7867 | 3 |
Ẽ | 7868 | 3 | ẽ | 7869 | 3 |
Ẹ | 7864 | 3 | ẹ | 7865 | 3 |
Ê | 202 | 2 | ê | 234 | 2 |
Ế | 7870 | 3 | ế | 7871 | 3 |
Ề | 7872 | 3 | ề | 7873 | 3 |
Ể | 7874 | 3 | ể | 7875 | 3 |
Ễ | 7876 | 3 | ễ | 7877 | 3 |
Ệ | 7878 | 3 | ệ | 7879 | 3 |
Í | 205 | 2 | í | 237 | 2 |
Ì | 204 | 2 | ì | 236 | 2 |
Ỉ | 7880 | 3 | ỉ | 7881 | 3 |
Ĩ | 296 | 2 | ĩ | 297 | 2 |
Ị | 7882 | 3 | ị | 7883 | 3 |
Ó | 211 | 2 | ó | 243 | 2 |
Ò | 210 | 2 | ò | 242 | 2 |
Ỏ | 7886 | 3 | ỏ | 7887 | 3 |
Õ | 213 | 2 | õ | 245 | 2 |
Ọ | 7884 | 3 | ọ | 7885 | 3 |
Ô | 212 | 2 | ô | 244 | 2 |
Ố | 7888 | 3 | ố | 7889 | 3 |
Ồ | 7890 | 3 | ồ | 7891 | 3 |
Ổ | 7892 | 3 | ổ | 7893 | 3 |
Ỗ | 7894 | 3 | ỗ | 7895 | 3 |
Ộ | 7896 | 3 | ộ | 7897 | 3 |
Ơ | 416 | 2 | ơ | 417 | 2 |
Ớ | 7898 | 3 | ớ | 7899 | 3 |
Ờ | 7900 | 3 | ờ | 7901 | 3 |
Ở | 7902 | 3 | ở | 7903 | 3 |
Ỡ | 7904 | 3 | ỡ | 7905 | 3 |
Ợ | 7906 | 3 | ợ | 7907 | 3 |
Ú | 218 | 2 | ú | 250 | 2 |
Ù | 217 | 2 | ù | 249 | 2 |
Ủ | 7910 | 3 | ủ | 7911 | 3 |
Ũ | 360 | 2 | ũ | 361 | 2 |
Ụ | 7908 | 3 | ụ | 7909 | 3 |
Ư | 431 | 2 | ư | 432 | 2 |
Ứ | 7912 | 3 | ứ | 7913 | 3 |
Ừ | 7914 | 3 | ừ | 7915 | 3 |
Ử | 7916 | 3 | ử | 7917 | 3 |
Ữ | 7918 | 3 | ữ | 7919 | 3 |
Ự | 7920 | 3 | ự | 7921 | 3 |
Ý | 221 | 2 | ý | 253 | 2 |
Ỳ | 7922 | 3 | ỳ | 7923 | 3 |
Ỷ | 7926 | 3 | ỷ | 7927 | 3 |
Ỹ | 7928 | 3 | ỹ | 7929 | 3 |
Ỵ | 7924 | 3 | ỵ | 7925 | 3 |
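The rows above can be reproduced with a short program. This is only a sketch, not necessarily how the table was produced; char_info is a hypothetical helper name, and the two constants are the ones defined earlier.

```rust
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

/// Hypothetical helper: (char, code point, UTF-8 byte length) for each
/// character of `s`.
fn char_info(s: &str) -> Vec<(char, u32, usize)> {
    s.chars().map(|c| (c, c as u32, c.len_utf8())).collect()
}

fn main() {
    let upper = char_info(VIETNAMESE_UPPERCASE);
    let lower = char_info(VIETNAMESE_LOWERCASE);

    // Pair each upper case letter with its lower case counterpart.
    for ((uc, ucode, ulen), (lc, lcode, llen)) in upper.iter().zip(lower.iter()) {
        println!("{} | {} | {} | {} | {} | {} |", uc, ucode, ulen, lc, lcode, llen);
    }
}
```

Running char_info on "AB" gives consecutive codes 65 and 66, whereas neighbouring Vietnamese letters such as À (192) and Ả (7842) are far apart.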
I think this is because the extended ASCII table already includes a few of these letters, so the Unicode Consortium only allocated new codes for the missing ones.
(Back in the 1990s, to display Vietnamese, some of the rarely used displayable extended ASCII characters were redrawn to look like Vietnamese letters, and the keyboard was programmed to match. For example, the earliest convention is VNI, which originated in the United States, whereby u followed by ? produces ủ, which replaces u. This convention is still in use today, but with Unicode.)
Rust provides several methods for case conversions: to_lowercase(…), to_uppercase(…), to_ascii_lowercase(…), to_ascii_uppercase(…), make_ascii_lowercase(…) and make_ascii_uppercase(…).
The first two methods work with Unicode strings:
Content of src\example_05.rs:
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";
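fn main() {
    // to_lowercase() is Unicode-aware: every letter is converted.
    let s = VIETNAMESE_UPPERCASE.to_lowercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);

    // Does not work: to_ascii_lowercase() only converts ASCII letters,
    // so the string comes back unchanged.
    let s = VIETNAMESE_UPPERCASE.to_ascii_lowercase();
    assert_eq!(s, VIETNAMESE_UPPERCASE);

    // Does not work: make_ascii_uppercase() also only converts ASCII letters.
    let mut s = String::from(VIETNAMESE_LOWERCASE);
    s.make_ascii_uppercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);
}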
I’m guessing that when we’re certain we only work with ASCII strings, it’s better to call ASCII-based methods?
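One way to act on that guess is to check is_ascii() first and fall back to the Unicode-aware method otherwise. This is a sketch only: to_upper is a hypothetical helper, and whether the ASCII path is actually faster is an assumption that would need measuring. What is certain is that the ASCII methods never change the byte length of a string (which is why the make_ascii_* variants can work in place), while the Unicode methods may.

```rust
/// Hypothetical helper: uses the ASCII-only method when the input is pure
/// ASCII, and the Unicode-aware method otherwise.
fn to_upper(s: &str) -> String {
    if s.is_ascii() {
        s.to_ascii_uppercase()
    } else {
        s.to_uppercase()
    }
}

fn main() {
    let mixed = "Đầu bút";

    // The ASCII method only touches a-z; 'ầ' and 'ú' are left as they are.
    println!("{}", mixed.to_ascii_uppercase());
    // The Unicode method converts every letter.
    println!("{}", mixed.to_uppercase());

    println!("{}", to_upper("cung dao"));
    println!("{}", to_upper(mixed));
}
```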
I’m not sure how relevant this post is to other people. I tried to understand this character versus byte issue in Rust, and these are the code examples I wrote to understand it. I’m documenting them so that I have a reference to go back to should the need arise.
I hope you find this post relevant somehow… Thank you for reading and stay safe as always.
✿✿✿