[][src]Crate utf8_ranges

Crate utf8-ranges converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes. This is useful for constructing byte based automatons that need to embed UTF-8 decoding.

See the documentation on the Utf8Sequences iterator for more details and an example.

Wait, what is this?

This is simplest to explain with an example. Let's say you wanted to test whether a particular byte sequence was a Cyrillic character. One possible scalar value range is [0400-04FF]. The set of allowed bytes for this range can be expressed as a sequence of byte ranges:

This example is not tested
[D0-D3][80-BF]

This is simple enough: simply encode the boundaries, 0400 encodes to D0 80 and 04FF encodes to D3 BF, and create ranges from each corresponding pair of bytes: D0 to D3 and 80 to BF.

However, what if you wanted to add the Cyrillic Supplementary characters to your range? Your range might then become [0400-052F]. The same procedure as above doesn't quite work because 052F encodes to D4 AF. The byte ranges you'd get from the previous transformation would be [D0-D4][80-AF]. However, this isn't quite correct because this range doesn't capture many characters, for example, 04FF (because its last byte, BF isn't in the range 80-AF).

Instead, you need multiple sequences of byte ranges:

This example is not tested
[D0-D3][80-BF]  # matches codepoints 0400-04FF
[D4][80-AF]     # matches codepoints 0500-052F

This gets even more complicated if you want bigger ranges, particularly if they naively contain surrogate codepoints. For example, the sequence of byte ranges for the basic multilingual plane ([0000-FFFF]) look like this:

This example is not tested
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]

Note that the byte ranges above will not match any erroneous encoding of UTF-8, including encodings of surrogate codepoints.

And, of course, for all of Unicode ([000000-10FFFF]):

This example is not tested
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]

This crate automates the process of creating these byte ranges from ranges of Unicode scalar values.

Why would I ever use this?

You probably won't ever need this. In 99% of cases, you just decode the byte sequence into a Unicode scalar value and compare scalar values directly. However, this explicit decoding step isn't always possible. For example, the construction of some finite state machines may benefit from converting ranges of scalar values into UTF-8 decoder automata (e.g., for character classes in regular expressions).

Lineage

I got the idea and general implementation strategy from Russ Cox in his article on regexps and RE2. Russ Cox got it from Ken Thompson's grep (no source, folk lore?). I also got the idea from Lucene, which uses it for executing automata on their term index.

Structs

Utf8Range

A single inclusive range of UTF-8 bytes.

Utf8Sequences

An iterator over ranges of matching UTF-8 byte sequences.

Enums

Utf8Sequence

Utf8Sequence represents a sequence of byte ranges.