Rust: baby step — Unicode with Vietnamese text.

We’re looking at some of Rust’s Unicode functionality, using Vietnamese text as examples.

The Slice Type section of “the book” states:

Note: String slice range indices must occur at valid UTF-8 character boundaries. If you attempt to create a string slice in the middle of a multibyte character, your program will exit with an error. For the purposes of introducing string slices, we are assuming ASCII only in this section; a more thorough discussion of UTF-8 handling is in the “Storing UTF-8 Encoded Text with Strings” section of Chapter 8.

This note is best illustrated with the following Vietnamese poem verse:

Content of src\example_01.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");
    let slice = &vstr[0..3];

    println!("string = [{}]", vstr);
    println!("slice = [{}]", slice);
}
(Đầu bút nghiễn hề sự cung đao means The young husband puts aside his pen and ink, and picks up his sword and long bow to defend his country. It is from the 18th century poem Chinh Phụ Ngâm, The Ballad Of A Soldier’s Wife.)

Compile and run with the following commands:

F:\rust\strings>rustc src\example_01.rs
F:\rust\strings>example_01.exe

And, as per the documentation, the executable exits with an error:

thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside 'ầ' (bytes 2..5) of `Đầu bút nghiễn hề sự cung đao`', src\example_01.rs:3:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Đ is 2 (two) bytes, and ầ is 3 (three) bytes. This explains what the above error is about: byte index 3 falls inside ầ, which occupies bytes 2 to 4.
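One way to avoid the panic is to check before slicing. The following is only a sketch, not part of the original examples: it relies on the standard library methods str::is_char_boundary and str::get, and the helper name safe_slice is made up for illustration.

```rust
/// Hypothetical helper: returns the slice only if both ends fall on
/// UTF-8 character boundaries, instead of panicking like `&s[start..end]`.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Byte 3 is inside 'ầ' (bytes 2..5), so it is not a boundary.
    println!("boundary at 2: {}", vstr.is_char_boundary(2)); // true
    println!("boundary at 3: {}", vstr.is_char_boundary(3)); // false

    // `get` returns None instead of panicking.
    println!("0..3 = {:?}", safe_slice(&vstr, 0, 3)); // None
    println!("0..5 = {:?}", safe_slice(&vstr, 0, 5)); // Some("Đầ")
}
```

This does not tell us which index to use, only whether a given index is safe.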

For me, the obvious question is: for Unicode strings, how do we know the correct byte index to use? We could certainly iterate over the characters and calculate the byte index, but that seems overly complicated for such a simple task.

We can use the String’s chars() iterator to iterate over each character, and print out each character’s code and length in bytes as follows:

Content of src\example_02.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // `ch` rather than `char`, to avoid shadowing the name of the primitive type.
    for ch in vstr.chars() {
        println!("char: {}, code: {}, byte size: {}", ch, ch as u32, ch.len_utf8());
    }
}

There are 29 (twenty nine) characters, so the above executable prints out 29 (twenty nine) lines, one for each character.
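The byte index asked about earlier does not have to be calculated by hand. As a sketch, assuming we want the first n characters, the standard char_indices() iterator yields each character together with its byte offset; the helper first_n_chars below is a hypothetical name, not something from the standard library.

```rust
/// Hypothetical helper: returns the first `n` characters of `s` as a slice.
fn first_n_chars(s: &str, n: usize) -> &str {
    // `char_indices` yields (byte offset, char); the offset of the
    // character at position `n` is where the slice must end.
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // `s` has `n` characters or fewer: take it all
    }
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    // Prints byte offsets 0, 2, 5 and 6 for 'Đ', 'ầ', 'u' and ' '.
    for (byte_idx, ch) in vstr.char_indices().take(4) {
        println!("byte offset: {}, char: {}", byte_idx, ch);
    }

    println!("first 3 chars = [{}]", first_n_chars(&vstr, 3)); // [Đầu]
}
```

Note that this counts Rust chars (Unicode scalar values), which works for the precomposed letters used in this post.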

We can use the String’s as_bytes() method to iterate over the bytes:

Content of src\example_03.rs:
fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    let bytes = vstr.as_bytes();

    for (i, &item) in bytes.iter().enumerate() {
        println!("i: {}, item: {}", i, item);
    }
}

There are 40 (forty) bytes in total. The first 5 (five) bytes are 196, 144, 225, 186 and 167, which correspond to the first 2 (two) characters, Đầ.
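The two counts can also be checked directly, without printing every line. A small sketch follows; byte_and_char_counts is a hypothetical helper name, while len() and chars().count() are standard methods.

```rust
/// Hypothetical helper: returns (byte count, character count) of a string.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    // `len` counts bytes; `chars().count()` counts characters.
    (s.len(), s.chars().count())
}

fn main() {
    let vstr = String::from("Đầu bút nghiễn hề sự cung đao");

    let (bytes, chars) = byte_and_char_counts(&vstr);
    println!("bytes: {}, chars: {}", bytes, chars); // bytes: 40, chars: 29

    // The first five bytes, i.e. the UTF-8 encoding of "Đầ".
    println!("first five bytes: {:?}", &vstr.as_bytes()[..5]);
}
```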

Following the documentation for from_utf8(…), if we feed the above 5 (five) bytes to this method, we get Đầ:

Content of src\example_04.rs:
fn main() {
    let first_two_char_bytes = vec![196, 144, 225, 186, 167];

    let first_two_char = String::from_utf8(first_two_char_bytes).unwrap();

    println!("{}", first_two_char);
}
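The unwrap() above is fine for known-good bytes, but from_utf8 returns a Result because not every byte sequence is valid UTF-8. As a sketch of what happens when the last byte of ầ is missing (the helper name decode is made up for illustration):

```rust
/// Hypothetical helper: decodes the bytes, or reports how many leading
/// bytes were valid UTF-8 before the first problem.
fn decode(bytes: Vec<u8>) -> Result<String, usize> {
    String::from_utf8(bytes).map_err(|e| e.utf8_error().valid_up_to())
}

fn main() {
    // All five bytes: decodes to "Đầ".
    println!("{:?}", decode(vec![196, 144, 225, 186, 167]));

    // Last byte dropped: 'ầ' is incomplete, only the first 2 bytes are valid.
    let truncated: Vec<u8> = vec![196, 144, 225, 186];
    println!("{:?}", decode(truncated.clone())); // Err(2)

    // `from_utf8_lossy` never fails: invalid sequences become U+FFFD.
    println!("{}", String::from_utf8_lossy(&truncated));
}
```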

The letters f, j, w and z are not officially part of the Vietnamese alphabet (most Vietnamese are aware of them, even if they don’t know English). In addition to the remaining 22 (twenty two) letters, there are another 67 (sixty seven) letters and, as in English, each has both an upper case and a lower case form. Some of these letters are not uniquely Vietnamese; they are found in other languages too. A few of them are: à, â, đ, è, é, ê, ì, ò, ô and ù.

We could have global constants for the 67 (sixty seven) letters as follows:

static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

They’re listed in two orders: first by Latin base letter, then by Vietnamese diacritic tonal mark. That is, within each base letter, the acute tonal mark (e.g. Á) comes before any other mark.

For the English alphabet, the ASCII codes are in sequence. That is, the ASCII code for capital A is 65, B is 66 and so on. This is not the case for the “Vietnamese” letters, as seen in the following table:

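| Upper case char | Code | Bytes | Lower case char | Code | Bytes |
|---|---|---|---|---|---|
| Á | 193 | 2 | á | 225 | 2 |
| À | 192 | 2 | à | 224 | 2 |
| Ả | 7842 | 3 | ả | 7843 | 3 |
| Ã | 195 | 2 | ã | 227 | 2 |
| Ạ | 7840 | 3 | ạ | 7841 | 3 |
| Ă | 258 | 2 | ă | 259 | 2 |
| Ắ | 7854 | 3 | ắ | 7855 | 3 |
| Ằ | 7856 | 3 | ằ | 7857 | 3 |
| Ẳ | 7858 | 3 | ẳ | 7859 | 3 |
| Ẵ | 7860 | 3 | ẵ | 7861 | 3 |
| Ặ | 7862 | 3 | ặ | 7863 | 3 |
| Â | 194 | 2 | â | 226 | 2 |
| Ấ | 7844 | 3 | ấ | 7845 | 3 |
| Ầ | 7846 | 3 | ầ | 7847 | 3 |
| Ẩ | 7848 | 3 | ẩ | 7849 | 3 |
| Ẫ | 7850 | 3 | ẫ | 7851 | 3 |
| Ậ | 7852 | 3 | ậ | 7853 | 3 |
| Đ | 272 | 2 | đ | 273 | 2 |
| É | 201 | 2 | é | 233 | 2 |
| È | 200 | 2 | è | 232 | 2 |
| Ẻ | 7866 | 3 | ẻ | 7867 | 3 |
| Ẽ | 7868 | 3 | ẽ | 7869 | 3 |
| Ẹ | 7864 | 3 | ẹ | 7865 | 3 |
| Ê | 202 | 2 | ê | 234 | 2 |
| Ế | 7870 | 3 | ế | 7871 | 3 |
| Ề | 7872 | 3 | ề | 7873 | 3 |
| Ể | 7874 | 3 | ể | 7875 | 3 |
| Ễ | 7876 | 3 | ễ | 7877 | 3 |
| Ệ | 7878 | 3 | ệ | 7879 | 3 |
| Í | 205 | 2 | í | 237 | 2 |
| Ì | 204 | 2 | ì | 236 | 2 |
| Ỉ | 7880 | 3 | ỉ | 7881 | 3 |
| Ĩ | 296 | 2 | ĩ | 297 | 2 |
| Ị | 7882 | 3 | ị | 7883 | 3 |
| Ó | 211 | 2 | ó | 243 | 2 |
| Ò | 210 | 2 | ò | 242 | 2 |
| Ỏ | 7886 | 3 | ỏ | 7887 | 3 |
| Õ | 213 | 2 | õ | 245 | 2 |
| Ọ | 7884 | 3 | ọ | 7885 | 3 |
| Ô | 212 | 2 | ô | 244 | 2 |
| Ố | 7888 | 3 | ố | 7889 | 3 |
| Ồ | 7890 | 3 | ồ | 7891 | 3 |
| Ổ | 7892 | 3 | ổ | 7893 | 3 |
| Ỗ | 7894 | 3 | ỗ | 7895 | 3 |
| Ộ | 7896 | 3 | ộ | 7897 | 3 |
| Ơ | 416 | 2 | ơ | 417 | 2 |
| Ớ | 7898 | 3 | ớ | 7899 | 3 |
| Ờ | 7900 | 3 | ờ | 7901 | 3 |
| Ở | 7902 | 3 | ở | 7903 | 3 |
| Ỡ | 7904 | 3 | ỡ | 7905 | 3 |
| Ợ | 7906 | 3 | ợ | 7907 | 3 |
| Ú | 218 | 2 | ú | 250 | 2 |
| Ù | 217 | 2 | ù | 249 | 2 |
| Ủ | 7910 | 3 | ủ | 7911 | 3 |
| Ũ | 360 | 2 | ũ | 361 | 2 |
| Ụ | 7908 | 3 | ụ | 7909 | 3 |
| Ư | 431 | 2 | ư | 432 | 2 |
| Ứ | 7912 | 3 | ứ | 7913 | 3 |
| Ừ | 7914 | 3 | ừ | 7915 | 3 |
| Ử | 7916 | 3 | ử | 7917 | 3 |
| Ữ | 7918 | 3 | ữ | 7919 | 3 |
| Ự | 7920 | 3 | ự | 7921 | 3 |
| Ý | 221 | 2 | ý | 253 | 2 |
| Ỳ | 7922 | 3 | ỳ | 7923 | 3 |
| Ỷ | 7926 | 3 | ỷ | 7927 | 3 |
| Ỹ | 7928 | 3 | ỹ | 7929 | 3 |
| Ỵ | 7924 | 3 | ỵ | 7925 | 3 |

Vietnamese-specific letters: character codes and byte sizes.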
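A table like the one above can be produced from the two constants. The following is a sketch of one way to do it, not necessarily how the table was originally generated; char_info is a hypothetical helper name.

```rust
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

/// Hypothetical helper: (char, code point, UTF-8 byte length) for each char.
fn char_info(s: &str) -> Vec<(char, u32, usize)> {
    s.chars().map(|c| (c, c as u32, c.len_utf8())).collect()
}

fn main() {
    let upper = char_info(VIETNAMESE_UPPERCASE);
    let lower = char_info(VIETNAMESE_LOWERCASE);

    // One tab-separated row per upper case / lower case pair.
    println!("Upper\tCode\tBytes\tLower\tCode\tBytes");
    for ((uc, ucode, ulen), (lc, lcode, llen)) in upper.into_iter().zip(lower) {
        println!("{}\t{}\t{}\t{}\t{}\t{}", uc, ucode, ulen, lc, lcode, llen);
    }
}
```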

I think this is because the extended ASCII table already includes a few of these letters, so the Unicode Consortium only allocated new codes for the missing ones.

(Back in the 1990s, to display Vietnamese, some of the rarely used displayable extended ASCII characters were redrawn to look like Vietnamese letters, and the keyboard was programmed to match. For example, the earliest convention is VNI, which originated in the United States, whereby u followed by ? produces ủ, which replaces u. This convention is still in use today, but with Unicode.)

Rust provides several methods for case conversions: to_lowercase(…), to_uppercase(…), to_ascii_lowercase(…), to_ascii_uppercase(…), make_ascii_lowercase(…) and make_ascii_uppercase(…).

The first two methods work with Unicode strings:

Content of src\example_05.rs:
static VIETNAMESE_UPPERCASE: &str = "ÁÀẢÃẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒỎÕỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
static VIETNAMESE_LOWERCASE: &str = "áàảãạăắằẳẵặâấầẩẫậđéèẻẽẹêếềểễệíìỉĩịóòỏõọôốồổỗộơớờởỡợúùủũụưứừửữựýỳỷỹỵ";

fn main() {
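    // Unicode-aware conversion: every letter is lowercased.
    let s = VIETNAMESE_UPPERCASE.to_lowercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);

    // ASCII-only conversion has no effect here: none of these letters
    // is ASCII, so the string comes back unchanged.
    let s = VIETNAMESE_UPPERCASE.to_ascii_lowercase();
    assert_eq!(s, VIETNAMESE_UPPERCASE);

    // Same for the in-place variant: the string is left unchanged.
    let mut s = String::from(VIETNAMESE_LOWERCASE);
    s.make_ascii_uppercase();
    assert_eq!(s, VIETNAMESE_LOWERCASE);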
}

I’m guessing that when we’re certain we only work with ASCII strings, it’s better to call ASCII-based methods?
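The difference is easier to see on a string that mixes ASCII and non-ASCII letters. The sketch below uses the first two words of the verse; both_uppercases is a hypothetical helper name, while to_uppercase() and to_ascii_uppercase() are the standard methods mentioned above.

```rust
/// Hypothetical helper: returns (Unicode-aware, ASCII-only) upper case versions.
fn both_uppercases(s: &str) -> (String, String) {
    (s.to_uppercase(), s.to_ascii_uppercase())
}

fn main() {
    let (unicode, ascii) = both_uppercases("Đầu bút");

    // Unicode-aware: every letter is converted.
    println!("{}", unicode); // ĐẦU BÚT

    // ASCII-only: only a to z are converted, other characters are left as-is.
    println!("{}", ascii); // ĐầU BúT
}
```

So the ASCII-based methods look like a reasonable choice only when the text is known to be ASCII. The make_ascii_* variants also work in place, without allocating a new String.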

I’m not sure how relevant this post is to other people. I tried to understand this character versus byte issue in Rust, and these are the examples I wrote along the way. I’m documenting them so that I have a reference to go back to should the need arise.

I hope you find this post relevant somehow… Thank you for reading and stay safe as always.

✿✿✿
