Rust: baby step — Unicode with Vietnamese text.

The Slice Type section of “the book” states:

Note: String slice range indices must occur at valid UTF-8 character boundaries. If you attempt to create a string slice in the middle of a multibyte character, your program will exit with an error. For the purposes of introducing string slices, we are assuming ASCII only in this section; a more thorough discussion of UTF-8 handling is in the “Storing UTF-8 Encoded Text with Strings” section of Chapter 8.

This note is best illustrated with the following Vietnamese poem verse:

Content of src\example_01.rs:

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    let slice = &vstr[0..3];

    println!("string = [{}]", vstr);
    println!("slice = [{}]", slice);
}

(Đầu bút nghiễn hề sự cung đao means The young husband puts aside his pen and ink, and picks up his sword and long bow to defend his country. It is from the 18th-century poem Chinh Phụ Ngâm, or The Ballad of a Soldier’s Wife.)

Compile and run with the following commands:

F:\rust\strings>rustc src\example_01.rs
F:\rust\strings>example_01.exe

As the documentation warns, the executable panics and exits with an error:

thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside 'ầ' (bytes 2..5) of `Đầu bút nghiễn hề sự cung đao`', src\example_01.rs:3:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Đ takes 2 bytes (bytes 0..2) and ầ takes 3 bytes (bytes 2..5), so byte index 3 falls in the middle of ầ. That is what the error above is telling us.
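If a panic is not what we want, we can ask before slicing. The sketch below relies on two standard library methods, `is_char_boundary` and `get`, both defined on `str`. The helper name `try_slice` is my own, made up for illustration.

```rust
/// Returns the slice `s[start..end]`, or `None` if either index falls
/// inside a multibyte character or is out of range.
/// (`try_slice` is a hypothetical helper name, not a standard API.)
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // `str::get` is the non-panicking counterpart of `&s[start..end]`.
    s.get(start..end)
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Byte 3 is inside 'ầ' (bytes 2..5), so it is not a boundary.
    println!("boundary at 2? {}", vstr.is_char_boundary(2));
    println!("boundary at 3? {}", vstr.is_char_boundary(3));
    println!("boundary at 5? {}", vstr.is_char_boundary(5));

    // The first range yields None, the second yields Some("Đầ").
    println!("0..3 = {:?}", try_slice(&vstr, 0, 3));
    println!("0..5 = {:?}", try_slice(&vstr, 0, 5));
}
```

This only tells us whether an index is valid; it does not tell us which index we should have used in the first place.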

For me, the obvious question is: for Unicode strings, how do we know the correct byte index to use? We could certainly iterate over the characters and calculate the byte index, but that seems overly complicated for such a simple task.
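Iterating turns out to be less work than it sounds, because the standard library's `char_indices()` yields each character together with its starting byte index. Here is a minimal sketch, assuming what we want is "the first n characters". The name `first_n_chars` is something I made up for illustration.

```rust
/// Returns the prefix of `s` holding its first `n` characters
/// (the whole string if it has fewer than `n`).
/// `first_n_chars` is an illustrative helper, not a standard API.
fn first_n_chars(s: &str, n: usize) -> &str {
    // The byte index where character number `n` starts is exactly
    // where a prefix of `n` characters ends.
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Prints: slice = [Đầu]
    println!("slice = [{}]", first_n_chars(&vstr, 3));
}
```

One caveat: a "character" here is a Rust `char` (a Unicode scalar value). If the text stores tone marks as separate combining characters instead of precomposed letters, one visible letter can span several `char`s, and this sketch would count them separately.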

We can use the String's chars() iterator to visit each character, and print out each character's code and its length in bytes as follows:

Content of src\example_02.rs:

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // `ch` rather than `char`, to avoid shadowing the name of the primitive type.
    for ch in vstr.chars() {
        println!("char: {}, code: {}, byte size: {}", ch, ch as u32, ch.len_utf8());
    }
}

There are 29 characters, so the above executable prints 29 lines, one for each character.
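To make the character versus byte distinction concrete, the sketch below counts both for the same verse. The helper `char_and_byte_counts` is a name I invented for illustration; `chars().count()`, `len()` and `len_utf8()` are standard.

```rust
/// Returns (number of chars, number of UTF-8 bytes) for `s`.
/// `char_and_byte_counts` is an illustrative name, not a standard API.
fn char_and_byte_counts(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    let (chars, bytes) = char_and_byte_counts(&vstr);

    // `len()` counts bytes, not characters.
    // Prints: chars: 29, bytes: 40
    println!("chars: {}, bytes: {}", chars, bytes);

    // Summing each character's UTF-8 length gives the byte length.
    let summed: usize = vstr.chars().map(char::len_utf8).sum();
    assert_eq!(summed, bytes);
}
```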

We can use the String's as_bytes() method to iterate over the bytes:

Content of src\example_03.rs:

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    let bytes = vstr.as_bytes();

    for (i, &item) in bytes.iter().enumerate() {
        println!("i: {}, item: {}", i, item);
    }
}

There are 40 bytes in total. The first 5 bytes are 196, 144, 225, 186 and 167, which correspond to the first 2 characters, Đầ.
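The byte values themselves show where each character starts. In UTF-8, the high bits of the first byte of a sequence encode how many bytes the sequence has, and every continuation byte has the bit pattern 10xxxxxx. The sketch below classifies the five bytes on that basis; `utf8_seq_len` is a helper I wrote for illustration, not something from the standard library.

```rust
/// Length in bytes of the UTF-8 sequence that starts with `first`,
/// or `None` if `first` is a continuation byte (10xxxxxx) or can
/// never start a valid sequence.
/// `utf8_seq_len` is an illustrative helper, not a standard API.
fn utf8_seq_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // 0xxxxxxx: ASCII
        0xC2..=0xDF => Some(2), // 110xxxxx
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF4 => Some(4), // 11110xxx
        _ => None,
    }
}

fn main() {
    for &b in [196u8, 144, 225, 186, 167].iter() {
        match utf8_seq_len(b) {
            Some(n) => println!("{} (0x{:02X}): starts a {}-byte sequence", b, b, n),
            None => println!("{} (0x{:02X}): not a lead byte", b, b),
        }
    }
}
```

So 196 opens the 2-byte sequence for Đ, and 225 opens the 3-byte sequence for ầ; the other three are continuation bytes.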

Going by the documentation for String::from_utf8(…), if we feed these 5 bytes to that method, we get Đầ back:

Content of src\example_04.rs:

fn main() {
    let first_two_char_bytes = vec![196, 144, 225, 186, 167];

    let first_two_char = String::from_utf8(first_two_char_bytes).unwrap();

    println!("{}", first_two_char);
}
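The unwrap() above is fine for bytes we know are valid, but from_utf8(…) returns a Result precisely because arbitrary bytes may not be valid UTF-8. The sketch below drops the last byte so that ầ is cut short. The helper `decode` and its error message are my own, for illustration only.

```rust
/// Decodes `bytes` as UTF-8, returning a readable message on failure.
/// `decode` is an illustrative helper, not a standard API.
fn decode(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| {
        // `valid_up_to()` is the number of leading bytes that were valid.
        format!(
            "invalid UTF-8 after {} valid byte(s)",
            e.utf8_error().valid_up_to()
        )
    })
}

fn main() {
    // All five bytes: Đ (2 bytes) followed by ầ (3 bytes).
    println!("{:?}", decode(vec![196, 144, 225, 186, 167]));

    // Only four bytes: ầ is cut short, so decoding fails after Đ.
    println!("{:?}", decode(vec![196, 144, 225, 186]));
}
```

If a best-effort result is acceptable, the standard library also offers String::from_utf8_lossy(…), which substitutes the replacement character for invalid sequences instead of failing.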

The letters f, j, w and z are not official Vietnamese letters, though most Vietnamese are aware of them even if they don’t know English. In addition to the remaining 22 letters, there are another 67 additional letters, each of which, as in English, has an upper-case and a lower-case form. Some of these letters are not uniquely Vietnamese and are found in other languages. A few of them are: à, â, đ, è, é, ê, ì, ò, ô and ù.

We could have global constants for the 67 (sixty seven) letters as follows:

static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

They’re ordered first by base Latin letter, then by Vietnamese diacritic tonal mark. That is, within each base letter, the form with the acute tonal mark (e.g. Á, Ố) comes before the forms with any other tonal mark.

For the English alphabet, the ASCII codes are in sequence: the ASCII code for capital A is 65, B is 66 and so on. This is not the case for the “Vietnamese” letters, as seen in the following table:

Upper Case			Lower Case
Char	Code	Bytes	Char	Code	Bytes
Á	193	2	á	225	2
À	192	2	à	224	2
Ả	7842	3	ả	7843	3
Ã	195	2	ã	227	2
Ạ	7840	3	ạ	7841	3
Ă	258	2	ă	259	2
Ắ	7854	3	ắ	7855	3
Ằ	7856	3	ằ	7857	3
Ẳ	7858	3	ẳ	7859	3
Ẵ	7860	3	ẵ	7861	3
Ặ	7862	3	ặ	7863	3
Â	194	2	â	226	2
Ấ	7844	3	ấ	7845	3
Ầ	7846	3	ầ	7847	3
Ẩ	7848	3	ẩ	7849	3
Ẫ	7850	3	ẫ	7851	3
Ậ	7852	3	ậ	7853	3
Đ	272	2	đ	273	2
É	201	2	é	233	2
È	200	2	è	232	2
Ẻ	7866	3	ẻ	7867	3
Ẽ	7868	3	ẽ	7869	3
Ẹ	7864	3	ẹ	7865	3
Ê	202	2	ê	234	2
Ế	7870	3	ế	7871	3
Ề	7872	3	ề	7873	3
Ể	7874	3	ể	7875	3
Ễ	7876	3	ễ	7877	3
Ệ	7878	3	ệ	7879	3
Í	205	2	í	237	2
Ì	204	2	ì	236	2
Ỉ	7880	3	ỉ	7881	3
Ĩ	296	2	ĩ	297	2
Ị	7882	3	ị	7883	3
Ó	211	2	ó	243	2
Ò	210	2	ò	242	2
Ỏ	7886	3	ỏ	7887	3
Õ	213	2	õ	245	2
Ọ	7884	3	ọ	7885	3
Ô	212	2	ô	244	2
Ố	7888	3	ố	7889	3
Ồ	7890	3	ồ	7891	3
Ổ	7892	3	ổ	7893	3
Ỗ	7894	3	ỗ	7895	3
Ộ	7896	3	ộ	7897	3
Ơ	416	2	ơ	417	2
Ớ	7898	3	ớ	7899	3
Ờ	7900	3	ờ	7901	3
Ở	7902	3	ở	7903	3
Ỡ	7904	3	ỡ	7905	3
Ợ	7906	3	ợ	7907	3
Ú	218	2	ú	250	2
Ù	217	2	ù	249	2
Ủ	7910	3	ủ	7911	3
Ũ	360	2	ũ	361	2
Ụ	7908	3	ụ	7909	3
Ư	431	2	ư	432	2
Ứ	7912	3	ứ	7913	3
Ừ	7914	3	ừ	7915	3
Ử	7916	3	ử	7917	3
Ữ	7918	3	ữ	7919	3
Ự	7920	3	ự	7921	3
Ý	221	2	ý	253	2
Ỳ	7922	3	ỳ	7923	3
Ỷ	7926	3	ỷ	7927	3
Ỹ	7928	3	ỹ	7929	3
Ỵ	7924	3	ỵ	7925	3

Vietnamese-specific letters: character codes (Unicode code points, in decimal) and UTF-8 byte sizes.
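The table does not have to be typed by hand. The sketch below regenerates its rows from the two constants, using the same `as u32` and `len_utf8()` calls as in example_02. The helper `row` is a name I chose for illustration.

```rust
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

/// One half of a table row: (char, code point, UTF-8 byte size).
/// `row` is an illustrative helper, not a standard API.
fn row(c: char) -> (char, u32, usize) {
    (c, c as u32, c.len_utf8())
}

fn main() {
    println!("Char\tCode\tBytes\tChar\tCode\tBytes");

    // Pair each upper-case letter with its lower-case counterpart.
    let pairs = VIETNAMESE_UPPERCASE.chars().zip(VIETNAMESE_LOWERCASE.chars());
    for (upper, lower) in pairs {
        let (uc, ucode, ulen) = row(upper);
        let (lc, lcode, llen) = row(lower);
        println!("{}\t{}\t{}\t{}\t{}\t{}", uc, ucode, ulen, lc, lcode, llen);
    }
}
```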

I think this is because the extended ASCII table already included a few of these letters (the ones with codes below 256 in the table above), and so the Unicode Consortium just allocated new codes for the missing ones.

(Back in the 1990s, to display Vietnamese, some of the rarely used displayable extended ASCII characters were redrawn to look like Vietnamese letters, and the keyboard was programmed to match. For example, the earliest convention is VNI, which originated in the United States, whereby u followed by ? produces ủ, which replaces u. This convention is still in use today, but with Unicode.)

Rust provides several methods for case conversions: to_lowercase(…), to_uppercase(…), to_ascii_lowercase(…), to_ascii_uppercase(…), make_ascii_lowercase(…) and make_ascii_uppercase(…).

The first two methods work with Unicode strings:

Content of src\example_05.rs:

static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

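fn main() {
    // Unicode-aware conversion: every letter is lowered.
    // to_lowercase() already returns a String, so no String::from is needed.
    let s = VIETNAMESE_UPPERCASE.to_lowercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);

    // ASCII-only conversion: non-ASCII letters are left untouched,
    // so the result is still all upper case.
    let s = VIETNAMESE_UPPERCASE.to_ascii_lowercase();
    assert_eq!(s, VIETNAMESE_UPPERCASE);

    // Same for the in-place variant: the string stays all lower case.
    let mut s = String::from(VIETNAMESE_LOWERCASE);
    s.make_ascii_uppercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);
}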

I’m guessing that when we’re certain we only work with ASCII strings, it’s better to call ASCII-based methods?
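To see what the ASCII-based methods actually do, it helps to try them on a string that mixes ASCII and non-ASCII letters. In this small sketch, `lower_both` is a helper I made up for illustration; it just calls the two standard methods side by side.

```rust
/// Returns (Unicode-aware lowercase, ASCII-only lowercase) of `s`.
/// `lower_both` is an illustrative helper, not a standard API.
fn lower_both(s: &str) -> (String, String) {
    (s.to_lowercase(), s.to_ascii_lowercase())
}

fn main() {
    let (unicode, ascii) = lower_both("ĐẦU BÚT");

    // Unicode-aware: every letter is lowered.
    println!("{}", unicode); // đầu bút

    // ASCII-only: only A-Z are touched; Đ, Ầ and Ú stay as they are.
    println!("{}", ascii); // ĐẦu bÚt
}
```

The ASCII variants never change the byte length of a string, which is why make_ascii_lowercase(…) and make_ascii_uppercase(…) can work in place without allocating a new String. That looks like one reason to prefer them when the input is known to be ASCII only.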

I’m not sure how relevant this post is to other people. I was trying to understand this character versus byte issue in Rust, and these are the examples I wrote to work through it. I’m documenting them so that I have a reference to go back to should the need arise.

I hope you find this post relevant somehow… Thank you for reading and stay safe as always.

✿✿✿
