Merge pull request #26 from rivo/width
Monospace string width calculation (à la wcwidth).
rivo authored 1 year, 8 months ago
GitHub committed 1 year, 8 months ago
2 | 2 | [![Go Reference](https://pkg.go.dev/badge/github.com/rivo/uniseg.svg)](https://pkg.go.dev/github.com/rivo/uniseg) |
3 | 3 | [![Go Report](https://img.shields.io/badge/go%20report-A%2B-brightgreen.svg)](https://goreportcard.com/report/github.com/rivo/uniseg) |
4 | 4 | |
5 | This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/) and Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 14.0.0). | |
5 | This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/), Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 14.0.0), and monospace font string width calculation similar to [wcwidth](https://man7.org/linux/man-pages/man3/wcwidth.3.html). | |
6 | 6 | |
7 | 7 | ## Background |
8 | 8 | |
30 | 30 | |
31 | 31 | Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters). |
32 | 32 | |
33 | ### Monospace Width | |
34 | ||
35 | Most terminals, text displays, and text editors that use a monospace font (for example source code editors) use a fixed width for each character. Some characters, such as emojis or characters found in Asian and other languages, may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the [package documentation](https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width) for more information. | |
36 | ||
33 | 37 | ## Installation |
34 | 38 | |
35 | 39 | ```bash |
44 | 48 | n := uniseg.GraphemeClusterCount("🇩🇪🏳️🌈") |
45 | 49 | fmt.Println(n) |
46 | 50 | // 2 |
51 | ``` | |
52 | ||
53 | ### Calculating the Monospace String Width | |
54 | ||
55 | ```go | |
56 | width := uniseg.StringWidth("🇩🇪🏳️🌈!") | |
57 | fmt.Println(width) | |
58 | // 5 | |
47 | 59 | ``` |
48 | 60 | |
49 | 61 | ### Using the [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) Class |
0 | 0 | /* |
1 | Package uniseg implements Unicode Text Segmentation and Unicode Line Breaking. | |
2 | Unicode Text Segmentation conforms to Unicode Standard Annex #29 | |
3 | (https://unicode.org/reports/tr29/) and Unicode Line Breaking conforms to | |
4 | Unicode Standard Annex #14 (https://unicode.org/reports/tr14/). | |
1 | Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and | |
2 | string width calculation for monospace fonts. Unicode Text Segmentation conforms | |
3 | to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode | |
4 | Line Breaking conforms to Unicode Standard Annex #14 | |
5 | (https://unicode.org/reports/tr14/). | |
5 | 6 | |
6 | 7 | In short, using this package, you can split a string into grapheme clusters |
7 | 8 | (what people would usually refer to as a "character"), into words, and into |
11 | 12 | other languages. Additionally, you can use it to implement line breaking (or |
12 | 13 | "word wrapping"), that is, to determine where text can be broken over to the |
13 | 14 | next line when the width of the line is not big enough to fit the entire text. |
15 | Finally, you can use it to calculate the display width of a string for monospace | |
16 | fonts. | |
14 | 17 | |
15 | Grapheme Clusters | |
18 | # Getting Started | |
19 | ||
20 | If you just want to count the number of characters in a string, you can use | |
21 | [GraphemeClusterCount]. If you want to determine the display width of a string, | |
22 | you can use [StringWidth]. If you want to iterate over a string, you can use | |
23 | [Step], [StepString], or the [Graphemes] class (more convenient but less | |
24 | performant). This will provide you with all information: grapheme clusters, | |
25 | word boundaries, sentence boundaries, line breaks, and monospace character | |
26 | widths. The specialized functions [FirstGraphemeCluster], | |
27 | [FirstGraphemeClusterInString], [FirstWord], [FirstWordInString], | |
28 | [FirstSentence], and [FirstSentenceInString] can be used if only one type of | |
29 | information is needed. | |
30 | ||
31 | # Grapheme Clusters | |
16 | 32 | |
17 | 33 | Consider the rainbow flag emoji: 🏳️🌈. On most modern systems, it appears as one |
18 | 34 | character. But its string representation actually has 14 bytes, so counting |
20 | 36 | either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function |
21 | 37 | utf8.RuneCountInString("🏳️🌈") and len([]rune("🏳️🌈")) will both return 4. |
22 | 38 | |
23 | The uniseg.GraphemeClusterCount(str) function will return 1 for the rainbow flag | |
24 | emoji. The Graphemes class and a variety of functions in this package will allow | |
25 | you to split strings into its grapheme clusters. | |
39 | The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji. | |
40 | The Graphemes class and a variety of functions in this package will allow you to | |
41 | split strings into their grapheme clusters. | |
26 | 42 | |
27 | Word Boundaries | |
43 | # Word Boundaries | |
28 | 44 | |
29 | 45 | Word boundaries are used in a number of different contexts. The most familiar |
30 | 46 | ones are selection (double-click mouse selection), cursor movement ("move to |
32 | 48 | search and replace. This package provides methods for determining word |
33 | 49 | boundaries. |
34 | 50 | |
35 | Sentence Boundaries | |
51 | # Sentence Boundaries | |
36 | 52 | |
37 | 53 | Sentence boundaries are often used for triple-click or some other method of |
38 | 54 | selecting or iterating through blocks of text that are larger than single words. |
40 | 56 | database queries. This package provides methods for determining sentence |
41 | 57 | boundaries. |
42 | 58 | |
43 | Line Breaking | |
59 | # Line Breaking | |
44 | 60 | |
45 | 61 | Line breaking, also known as word wrapping, is the process of breaking a section |
46 | 62 | of text into lines such that it will fit in the available width of a page, |
48 | 64 | positions in a string where a line must be broken, may be broken, or must not be |
49 | 65 | broken. |
50 | 66 | |
67 | # Monospace Width | |
68 | ||
69 | Monospace width, as referred to in this package, is the width of a string in a | |
70 | monospace font. This is commonly used in terminal user interfaces or text | |
71 | displays or editors that don't support proportional fonts. A width of 1 | |
72 | corresponds to a single character cell. The C function [wcwidth()] and its | |
73 | implementations in other programming languages are in widespread use for the same | |
74 | purpose. However, there is no standard for the calculation of such widths, and | |
75 | this package differs from wcwidth() in a number of ways, presumably to generate | |
76 | more visually pleasing results. | |
77 | ||
78 | To start, we assume that every code point has a width of 1, with the following | |
79 | exceptions: | |
80 | ||
81 | - Code points with grapheme cluster break properties Control, CR, LF, Extend, | |
82 | and ZWJ have a width of 0. | |
83 | - U+2E3A, Two-Em Dash, has a width of 3. | |
84 | - U+2E3B, Three-Em Dash, has a width of 4. | |
85 | - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide" | |
86 | (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both | |
87 | have a width of 1.) | |
88 | - Code points with grapheme cluster break property Regional Indicator have a | |
89 | width of 2. | |
90 | - Code points with grapheme cluster break property Extended Pictographic have | |
91 | a width of 2, unless their Emoji Presentation flag is "No", in which case | |
92 | the width is 1. | |
93 | ||
94 | For Hangul grapheme clusters composed of conjoining Jamo and for Regional | |
95 | Indicators (flags), all code points except the first one have a width of 0. For | |
96 | grapheme clusters starting with an Extended Pictographic, any additional code | |
97 | point will force a total width of 2, except if the Variation Selector-15 | |
98 | (U+FE0E) is included, in which case the total width is always 1. | |
99 | ||
100 | Note that whether these widths appear correct depends on your application's | |
101 | render engine, the extent to which it conforms to the Unicode Standard, and its | |
102 | choice of font. | |
103 | ||
104 | [wcwidth()]: https://man7.org/linux/man-pages/man3/wcwidth.3.html | |
51 | 105 | */ |
52 | 106 | package uniseg |
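The per-code-point rules listed in the "Monospace Width" doc comment above can be illustrated with a small standalone sketch. This is not the package's implementation: `baseWidth` and `sumBaseWidths` are hypothetical names, the code point ranges are a tiny hand-picked subset of the real property tables, and the cluster-level adjustments (Hangul Jamo, flags, Extended Pictographic sequences, Variation Selector-15) are deliberately left out. It uses only the standard library.

```go
package main

import "fmt"

// baseWidth returns a simplified per-code-point width, loosely following the
// exceptions listed in the doc comment. The ranges below are illustrative
// subsets, not the full Unicode property tables.
func baseWidth(r rune) int {
	switch {
	case r < 0x20 || (r >= 0x7F && r < 0xA0):
		return 0 // control characters, including CR and LF
	case r == 0x200D:
		return 0 // zero width joiner (ZWJ)
	case r >= 0x0300 && r <= 0x036F:
		return 0 // combining diacritical marks (a subset of Extend)
	case r == 0x2E3A:
		return 3 // two-em dash
	case r == 0x2E3B:
		return 4 // three-em dash
	case r >= 0x4E00 && r <= 0x9FFF:
		return 2 // CJK unified ideographs (East Asian Width "Wide")
	case r >= 0xFF01 && r <= 0xFF60:
		return 2 // fullwidth forms (East Asian Width "Fullwidth")
	case r >= 0x1F1E6 && r <= 0x1F1FF:
		return 2 // regional indicators (per code point, before cluster rules)
	default:
		return 1
	}
}

// sumBaseWidths adds up the per-code-point widths of s. Because it ignores
// the cluster-level rules, it is only meaningful for strings without flags,
// emoji sequences, or conjoining Jamo.
func sumBaseWidths(s string) int {
	total := 0
	for _, r := range s {
		total += baseWidth(r)
	}
	return total
}

func main() {
	fmt.Println(sumBaseWidths("Hi, 世界"))  // 8
	fmt.Println(sumBaseWidths("e\u0301")) // 1
	fmt.Println(sumBaseWidths("\u2E3A"))  // 3
}
```

For real strings, the cluster rules matter: a flag made of two regional indicators would sum to 4 in this sketch, whereas the rules in the doc comment give the whole cluster a width of 2.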
3 | 3 | |
4 | 4 | // eastAsianWidth are taken from |
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/EastAsianWidth.txt |
6 | // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode | |
6 | // and | |
7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt | |
8 | // ("Extended_Pictographic" only) | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
7 | 10 | // license agreement. |
8 | 11 | var eastAsianWidth = [][3]int{ |
9 | 12 | {0x0000, 0x001F, prN}, // Cc [32] <control-0000>..<control-001F> |
0 | package uniseg | |
1 | ||
2 | // Code generated via go generate from gen_properties.go. DO NOT EDIT. | |
3 | ||
4 | // emojiPresentation are taken from | |
5 | // | |
6 | // and | |
7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt | |
8 | // ("Extended_Pictographic" only) | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
10 | // license agreement. | |
11 | var emojiPresentation = [][3]int{ | |
12 | {0x231A, 0x231B, prEmojiPresentation}, // E0.6 [2] (⌚..⌛) watch..hourglass done | |
13 | {0x23E9, 0x23EC, prEmojiPresentation}, // E0.6 [4] (⏩..⏬) fast-forward button..fast down button | |
14 | {0x23F0, 0x23F0, prEmojiPresentation}, // E0.6 [1] (⏰) alarm clock | |
15 | {0x23F3, 0x23F3, prEmojiPresentation}, // E0.6 [1] (⏳) hourglass not done | |
16 | {0x25FD, 0x25FE, prEmojiPresentation}, // E0.6 [2] (◽..◾) white medium-small square..black medium-small square | |
17 | {0x2614, 0x2615, prEmojiPresentation}, // E0.6 [2] (☔..☕) umbrella with rain drops..hot beverage | |
18 | {0x2648, 0x2653, prEmojiPresentation}, // E0.6 [12] (♈..♓) Aries..Pisces | |
19 | {0x267F, 0x267F, prEmojiPresentation}, // E0.6 [1] (♿) wheelchair symbol | |
20 | {0x2693, 0x2693, prEmojiPresentation}, // E0.6 [1] (⚓) anchor | |
21 | {0x26A1, 0x26A1, prEmojiPresentation}, // E0.6 [1] (⚡) high voltage | |
22 | {0x26AA, 0x26AB, prEmojiPresentation}, // E0.6 [2] (⚪..⚫) white circle..black circle | |
23 | {0x26BD, 0x26BE, prEmojiPresentation}, // E0.6 [2] (⚽..⚾) soccer ball..baseball | |
24 | {0x26C4, 0x26C5, prEmojiPresentation}, // E0.6 [2] (⛄..⛅) snowman without snow..sun behind cloud | |
25 | {0x26CE, 0x26CE, prEmojiPresentation}, // E0.6 [1] (⛎) Ophiuchus | |
26 | {0x26D4, 0x26D4, prEmojiPresentation}, // E0.6 [1] (⛔) no entry | |
27 | {0x26EA, 0x26EA, prEmojiPresentation}, // E0.6 [1] (⛪) church | |
28 | {0x26F2, 0x26F3, prEmojiPresentation}, // E0.6 [2] (⛲..⛳) fountain..flag in hole | |
29 | {0x26F5, 0x26F5, prEmojiPresentation}, // E0.6 [1] (⛵) sailboat | |
30 | {0x26FA, 0x26FA, prEmojiPresentation}, // E0.6 [1] (⛺) tent | |
31 | {0x26FD, 0x26FD, prEmojiPresentation}, // E0.6 [1] (⛽) fuel pump | |
32 | {0x2705, 0x2705, prEmojiPresentation}, // E0.6 [1] (✅) check mark button | |
33 | {0x270A, 0x270B, prEmojiPresentation}, // E0.6 [2] (✊..✋) raised fist..raised hand | |
34 | {0x2728, 0x2728, prEmojiPresentation}, // E0.6 [1] (✨) sparkles | |
35 | {0x274C, 0x274C, prEmojiPresentation}, // E0.6 [1] (❌) cross mark | |
36 | {0x274E, 0x274E, prEmojiPresentation}, // E0.6 [1] (❎) cross mark button | |
37 | {0x2753, 0x2755, prEmojiPresentation}, // E0.6 [3] (❓..❕) red question mark..white exclamation mark | |
38 | {0x2757, 0x2757, prEmojiPresentation}, // E0.6 [1] (❗) red exclamation mark | |
39 | {0x2795, 0x2797, prEmojiPresentation}, // E0.6 [3] (➕..➗) plus..divide | |
40 | {0x27B0, 0x27B0, prEmojiPresentation}, // E0.6 [1] (➰) curly loop | |
41 | {0x27BF, 0x27BF, prEmojiPresentation}, // E1.0 [1] (➿) double curly loop | |
42 | {0x2B1B, 0x2B1C, prEmojiPresentation}, // E0.6 [2] (⬛..⬜) black large square..white large square | |
43 | {0x2B50, 0x2B50, prEmojiPresentation}, // E0.6 [1] (⭐) star | |
44 | {0x2B55, 0x2B55, prEmojiPresentation}, // E0.6 [1] (⭕) hollow red circle | |
45 | {0x1F004, 0x1F004, prEmojiPresentation}, // E0.6 [1] (🀄) mahjong red dragon | |
46 | {0x1F0CF, 0x1F0CF, prEmojiPresentation}, // E0.6 [1] (🃏) joker | |
47 | {0x1F18E, 0x1F18E, prEmojiPresentation}, // E0.6 [1] (🆎) AB button (blood type) | |
48 | {0x1F191, 0x1F19A, prEmojiPresentation}, // E0.6 [10] (🆑..🆚) CL button..VS button | |
49 | {0x1F1E6, 0x1F1FF, prEmojiPresentation}, // E0.0 [26] (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z | |
50 | {0x1F201, 0x1F201, prEmojiPresentation}, // E0.6 [1] (🈁) Japanese “here” button | |
51 | {0x1F21A, 0x1F21A, prEmojiPresentation}, // E0.6 [1] (🈚) Japanese “free of charge” button | |
52 | {0x1F22F, 0x1F22F, prEmojiPresentation}, // E0.6 [1] (🈯) Japanese “reserved” button | |
53 | {0x1F232, 0x1F236, prEmojiPresentation}, // E0.6 [5] (🈲..🈶) Japanese “prohibited” button..Japanese “not free of charge” button | |
54 | {0x1F238, 0x1F23A, prEmojiPresentation}, // E0.6 [3] (🈸..🈺) Japanese “application” button..Japanese “open for business” button | |
55 | {0x1F250, 0x1F251, prEmojiPresentation}, // E0.6 [2] (🉐..🉑) Japanese “bargain” button..Japanese “acceptable” button | |
56 | {0x1F300, 0x1F30C, prEmojiPresentation}, // E0.6 [13] (🌀..🌌) cyclone..milky way | |
57 | {0x1F30D, 0x1F30E, prEmojiPresentation}, // E0.7 [2] (🌍..🌎) globe showing Europe-Africa..globe showing Americas | |
58 | {0x1F30F, 0x1F30F, prEmojiPresentation}, // E0.6 [1] (🌏) globe showing Asia-Australia | |
59 | {0x1F310, 0x1F310, prEmojiPresentation}, // E1.0 [1] (🌐) globe with meridians | |
60 | {0x1F311, 0x1F311, prEmojiPresentation}, // E0.6 [1] (🌑) new moon | |
61 | {0x1F312, 0x1F312, prEmojiPresentation}, // E1.0 [1] (🌒) waxing crescent moon | |
62 | {0x1F313, 0x1F315, prEmojiPresentation}, // E0.6 [3] (🌓..🌕) first quarter moon..full moon | |
63 | {0x1F316, 0x1F318, prEmojiPresentation}, // E1.0 [3] (🌖..🌘) waning gibbous moon..waning crescent moon | |
64 | {0x1F319, 0x1F319, prEmojiPresentation}, // E0.6 [1] (🌙) crescent moon | |
65 | {0x1F31A, 0x1F31A, prEmojiPresentation}, // E1.0 [1] (🌚) new moon face | |
66 | {0x1F31B, 0x1F31B, prEmojiPresentation}, // E0.6 [1] (🌛) first quarter moon face | |
67 | {0x1F31C, 0x1F31C, prEmojiPresentation}, // E0.7 [1] (🌜) last quarter moon face | |
68 | {0x1F31D, 0x1F31E, prEmojiPresentation}, // E1.0 [2] (🌝..🌞) full moon face..sun with face | |
69 | {0x1F31F, 0x1F320, prEmojiPresentation}, // E0.6 [2] (🌟..🌠) glowing star..shooting star | |
70 | {0x1F32D, 0x1F32F, prEmojiPresentation}, // E1.0 [3] (🌭..🌯) hot dog..burrito | |
71 | {0x1F330, 0x1F331, prEmojiPresentation}, // E0.6 [2] (🌰..🌱) chestnut..seedling | |
72 | {0x1F332, 0x1F333, prEmojiPresentation}, // E1.0 [2] (🌲..🌳) evergreen tree..deciduous tree | |
73 | {0x1F334, 0x1F335, prEmojiPresentation}, // E0.6 [2] (🌴..🌵) palm tree..cactus | |
74 | {0x1F337, 0x1F34A, prEmojiPresentation}, // E0.6 [20] (🌷..🍊) tulip..tangerine | |
75 | {0x1F34B, 0x1F34B, prEmojiPresentation}, // E1.0 [1] (🍋) lemon | |
76 | {0x1F34C, 0x1F34F, prEmojiPresentation}, // E0.6 [4] (🍌..🍏) banana..green apple | |
77 | {0x1F350, 0x1F350, prEmojiPresentation}, // E1.0 [1] (🍐) pear | |
78 | {0x1F351, 0x1F37B, prEmojiPresentation}, // E0.6 [43] (🍑..🍻) peach..clinking beer mugs | |
79 | {0x1F37C, 0x1F37C, prEmojiPresentation}, // E1.0 [1] (🍼) baby bottle | |
80 | {0x1F37E, 0x1F37F, prEmojiPresentation}, // E1.0 [2] (🍾..🍿) bottle with popping cork..popcorn | |
81 | {0x1F380, 0x1F393, prEmojiPresentation}, // E0.6 [20] (🎀..🎓) ribbon..graduation cap | |
82 | {0x1F3A0, 0x1F3C4, prEmojiPresentation}, // E0.6 [37] (🎠..🏄) carousel horse..person surfing | |
83 | {0x1F3C5, 0x1F3C5, prEmojiPresentation}, // E1.0 [1] (🏅) sports medal | |
84 | {0x1F3C6, 0x1F3C6, prEmojiPresentation}, // E0.6 [1] (🏆) trophy | |
85 | {0x1F3C7, 0x1F3C7, prEmojiPresentation}, // E1.0 [1] (🏇) horse racing | |
86 | {0x1F3C8, 0x1F3C8, prEmojiPresentation}, // E0.6 [1] (🏈) american football | |
87 | {0x1F3C9, 0x1F3C9, prEmojiPresentation}, // E1.0 [1] (🏉) rugby football | |
88 | {0x1F3CA, 0x1F3CA, prEmojiPresentation}, // E0.6 [1] (🏊) person swimming | |
89 | {0x1F3CF, 0x1F3D3, prEmojiPresentation}, // E1.0 [5] (🏏..🏓) cricket game..ping pong | |
90 | {0x1F3E0, 0x1F3E3, prEmojiPresentation}, // E0.6 [4] (🏠..🏣) house..Japanese post office | |
91 | {0x1F3E4, 0x1F3E4, prEmojiPresentation}, // E1.0 [1] (🏤) post office | |
92 | {0x1F3E5, 0x1F3F0, prEmojiPresentation}, // E0.6 [12] (🏥..🏰) hospital..castle | |
93 | {0x1F3F4, 0x1F3F4, prEmojiPresentation}, // E1.0 [1] (🏴) black flag | |
94 | {0x1F3F8, 0x1F407, prEmojiPresentation}, // E1.0 [16] (🏸..🐇) badminton..rabbit | |
95 | {0x1F408, 0x1F408, prEmojiPresentation}, // E0.7 [1] (🐈) cat | |
96 | {0x1F409, 0x1F40B, prEmojiPresentation}, // E1.0 [3] (🐉..🐋) dragon..whale | |
97 | {0x1F40C, 0x1F40E, prEmojiPresentation}, // E0.6 [3] (🐌..🐎) snail..horse | |
98 | {0x1F40F, 0x1F410, prEmojiPresentation}, // E1.0 [2] (🐏..🐐) ram..goat | |
99 | {0x1F411, 0x1F412, prEmojiPresentation}, // E0.6 [2] (🐑..🐒) ewe..monkey | |
100 | {0x1F413, 0x1F413, prEmojiPresentation}, // E1.0 [1] (🐓) rooster | |
101 | {0x1F414, 0x1F414, prEmojiPresentation}, // E0.6 [1] (🐔) chicken | |
102 | {0x1F415, 0x1F415, prEmojiPresentation}, // E0.7 [1] (🐕) dog | |
103 | {0x1F416, 0x1F416, prEmojiPresentation}, // E1.0 [1] (🐖) pig | |
104 | {0x1F417, 0x1F429, prEmojiPresentation}, // E0.6 [19] (🐗..🐩) boar..poodle | |
105 | {0x1F42A, 0x1F42A, prEmojiPresentation}, // E1.0 [1] (🐪) camel | |
106 | {0x1F42B, 0x1F43E, prEmojiPresentation}, // E0.6 [20] (🐫..🐾) two-hump camel..paw prints | |
107 | {0x1F440, 0x1F440, prEmojiPresentation}, // E0.6 [1] (👀) eyes | |
108 | {0x1F442, 0x1F464, prEmojiPresentation}, // E0.6 [35] (👂..👤) ear..bust in silhouette | |
109 | {0x1F465, 0x1F465, prEmojiPresentation}, // E1.0 [1] (👥) busts in silhouette | |
110 | {0x1F466, 0x1F46B, prEmojiPresentation}, // E0.6 [6] (👦..👫) boy..woman and man holding hands | |
111 | {0x1F46C, 0x1F46D, prEmojiPresentation}, // E1.0 [2] (👬..👭) men holding hands..women holding hands | |
112 | {0x1F46E, 0x1F4AC, prEmojiPresentation}, // E0.6 [63] (👮..💬) police officer..speech balloon | |
113 | {0x1F4AD, 0x1F4AD, prEmojiPresentation}, // E1.0 [1] (💭) thought balloon | |
114 | {0x1F4AE, 0x1F4B5, prEmojiPresentation}, // E0.6 [8] (💮..💵) white flower..dollar banknote | |
115 | {0x1F4B6, 0x1F4B7, prEmojiPresentation}, // E1.0 [2] (💶..💷) euro banknote..pound banknote | |
116 | {0x1F4B8, 0x1F4EB, prEmojiPresentation}, // E0.6 [52] (💸..📫) money with wings..closed mailbox with raised flag | |
117 | {0x1F4EC, 0x1F4ED, prEmojiPresentation}, // E0.7 [2] (📬..📭) open mailbox with raised flag..open mailbox with lowered flag | |
118 | {0x1F4EE, 0x1F4EE, prEmojiPresentation}, // E0.6 [1] (📮) postbox | |
119 | {0x1F4EF, 0x1F4EF, prEmojiPresentation}, // E1.0 [1] (📯) postal horn | |
120 | {0x1F4F0, 0x1F4F4, prEmojiPresentation}, // E0.6 [5] (📰..📴) newspaper..mobile phone off | |
121 | {0x1F4F5, 0x1F4F5, prEmojiPresentation}, // E1.0 [1] (📵) no mobile phones | |
122 | {0x1F4F6, 0x1F4F7, prEmojiPresentation}, // E0.6 [2] (📶..📷) antenna bars..camera | |
123 | {0x1F4F8, 0x1F4F8, prEmojiPresentation}, // E1.0 [1] (📸) camera with flash | |
124 | {0x1F4F9, 0x1F4FC, prEmojiPresentation}, // E0.6 [4] (📹..📼) video camera..videocassette | |
125 | {0x1F4FF, 0x1F502, prEmojiPresentation}, // E1.0 [4] (📿..🔂) prayer beads..repeat single button | |
126 | {0x1F503, 0x1F503, prEmojiPresentation}, // E0.6 [1] (🔃) clockwise vertical arrows | |
127 | {0x1F504, 0x1F507, prEmojiPresentation}, // E1.0 [4] (🔄..🔇) counterclockwise arrows button..muted speaker | |
128 | {0x1F508, 0x1F508, prEmojiPresentation}, // E0.7 [1] (🔈) speaker low volume | |
129 | {0x1F509, 0x1F509, prEmojiPresentation}, // E1.0 [1] (🔉) speaker medium volume | |
130 | {0x1F50A, 0x1F514, prEmojiPresentation}, // E0.6 [11] (🔊..🔔) speaker high volume..bell | |
131 | {0x1F515, 0x1F515, prEmojiPresentation}, // E1.0 [1] (🔕) bell with slash | |
132 | {0x1F516, 0x1F52B, prEmojiPresentation}, // E0.6 [22] (🔖..🔫) bookmark..water pistol | |
133 | {0x1F52C, 0x1F52D, prEmojiPresentation}, // E1.0 [2] (🔬..🔭) microscope..telescope | |
134 | {0x1F52E, 0x1F53D, prEmojiPresentation}, // E0.6 [16] (🔮..🔽) crystal ball..downwards button | |
135 | {0x1F54B, 0x1F54E, prEmojiPresentation}, // E1.0 [4] (🕋..🕎) kaaba..menorah | |
136 | {0x1F550, 0x1F55B, prEmojiPresentation}, // E0.6 [12] (🕐..🕛) one o’clock..twelve o’clock | |
137 | {0x1F55C, 0x1F567, prEmojiPresentation}, // E0.7 [12] (🕜..🕧) one-thirty..twelve-thirty | |
138 | {0x1F57A, 0x1F57A, prEmojiPresentation}, // E3.0 [1] (🕺) man dancing | |
139 | {0x1F595, 0x1F596, prEmojiPresentation}, // E1.0 [2] (🖕..🖖) middle finger..vulcan salute | |
140 | {0x1F5A4, 0x1F5A4, prEmojiPresentation}, // E3.0 [1] (🖤) black heart | |
141 | {0x1F5FB, 0x1F5FF, prEmojiPresentation}, // E0.6 [5] (🗻..🗿) mount fuji..moai | |
142 | {0x1F600, 0x1F600, prEmojiPresentation}, // E1.0 [1] (😀) grinning face | |
143 | {0x1F601, 0x1F606, prEmojiPresentation}, // E0.6 [6] (😁..😆) beaming face with smiling eyes..grinning squinting face | |
144 | {0x1F607, 0x1F608, prEmojiPresentation}, // E1.0 [2] (😇..😈) smiling face with halo..smiling face with horns | |
145 | {0x1F609, 0x1F60D, prEmojiPresentation}, // E0.6 [5] (😉..😍) winking face..smiling face with heart-eyes | |
146 | {0x1F60E, 0x1F60E, prEmojiPresentation}, // E1.0 [1] (😎) smiling face with sunglasses | |
147 | {0x1F60F, 0x1F60F, prEmojiPresentation}, // E0.6 [1] (😏) smirking face | |
148 | {0x1F610, 0x1F610, prEmojiPresentation}, // E0.7 [1] (😐) neutral face | |
149 | {0x1F611, 0x1F611, prEmojiPresentation}, // E1.0 [1] (😑) expressionless face | |
150 | {0x1F612, 0x1F614, prEmojiPresentation}, // E0.6 [3] (😒..😔) unamused face..pensive face | |
151 | {0x1F615, 0x1F615, prEmojiPresentation}, // E1.0 [1] (😕) confused face | |
152 | {0x1F616, 0x1F616, prEmojiPresentation}, // E0.6 [1] (😖) confounded face | |
153 | {0x1F617, 0x1F617, prEmojiPresentation}, // E1.0 [1] (😗) kissing face | |
154 | {0x1F618, 0x1F618, prEmojiPresentation}, // E0.6 [1] (😘) face blowing a kiss | |
155 | {0x1F619, 0x1F619, prEmojiPresentation}, // E1.0 [1] (😙) kissing face with smiling eyes | |
156 | {0x1F61A, 0x1F61A, prEmojiPresentation}, // E0.6 [1] (😚) kissing face with closed eyes | |
157 | {0x1F61B, 0x1F61B, prEmojiPresentation}, // E1.0 [1] (😛) face with tongue | |
158 | {0x1F61C, 0x1F61E, prEmojiPresentation}, // E0.6 [3] (😜..😞) winking face with tongue..disappointed face | |
159 | {0x1F61F, 0x1F61F, prEmojiPresentation}, // E1.0 [1] (😟) worried face | |
160 | {0x1F620, 0x1F625, prEmojiPresentation}, // E0.6 [6] (😠..😥) angry face..sad but relieved face | |
161 | {0x1F626, 0x1F627, prEmojiPresentation}, // E1.0 [2] (😦..😧) frowning face with open mouth..anguished face | |
162 | {0x1F628, 0x1F62B, prEmojiPresentation}, // E0.6 [4] (😨..😫) fearful face..tired face | |
163 | {0x1F62C, 0x1F62C, prEmojiPresentation}, // E1.0 [1] (😬) grimacing face | |
164 | {0x1F62D, 0x1F62D, prEmojiPresentation}, // E0.6 [1] (😭) loudly crying face | |
165 | {0x1F62E, 0x1F62F, prEmojiPresentation}, // E1.0 [2] (😮..😯) face with open mouth..hushed face | |
166 | {0x1F630, 0x1F633, prEmojiPresentation}, // E0.6 [4] (😰..😳) anxious face with sweat..flushed face | |
167 | {0x1F634, 0x1F634, prEmojiPresentation}, // E1.0 [1] (😴) sleeping face | |
168 | {0x1F635, 0x1F635, prEmojiPresentation}, // E0.6 [1] (😵) face with crossed-out eyes | |
169 | {0x1F636, 0x1F636, prEmojiPresentation}, // E1.0 [1] (😶) face without mouth | |
170 | {0x1F637, 0x1F640, prEmojiPresentation}, // E0.6 [10] (😷..🙀) face with medical mask..weary cat | |
171 | {0x1F641, 0x1F644, prEmojiPresentation}, // E1.0 [4] (🙁..🙄) slightly frowning face..face with rolling eyes | |
172 | {0x1F645, 0x1F64F, prEmojiPresentation}, // E0.6 [11] (🙅..🙏) person gesturing NO..folded hands | |
173 | {0x1F680, 0x1F680, prEmojiPresentation}, // E0.6 [1] (🚀) rocket | |
174 | {0x1F681, 0x1F682, prEmojiPresentation}, // E1.0 [2] (🚁..🚂) helicopter..locomotive | |
175 | {0x1F683, 0x1F685, prEmojiPresentation}, // E0.6 [3] (🚃..🚅) railway car..bullet train | |
176 | {0x1F686, 0x1F686, prEmojiPresentation}, // E1.0 [1] (🚆) train | |
177 | {0x1F687, 0x1F687, prEmojiPresentation}, // E0.6 [1] (🚇) metro | |
178 | {0x1F688, 0x1F688, prEmojiPresentation}, // E1.0 [1] (🚈) light rail | |
179 | {0x1F689, 0x1F689, prEmojiPresentation}, // E0.6 [1] (🚉) station | |
180 | {0x1F68A, 0x1F68B, prEmojiPresentation}, // E1.0 [2] (🚊..🚋) tram..tram car | |
181 | {0x1F68C, 0x1F68C, prEmojiPresentation}, // E0.6 [1] (🚌) bus | |
182 | {0x1F68D, 0x1F68D, prEmojiPresentation}, // E0.7 [1] (🚍) oncoming bus | |
183 | {0x1F68E, 0x1F68E, prEmojiPresentation}, // E1.0 [1] (🚎) trolleybus | |
184 | {0x1F68F, 0x1F68F, prEmojiPresentation}, // E0.6 [1] (🚏) bus stop | |
185 | {0x1F690, 0x1F690, prEmojiPresentation}, // E1.0 [1] (🚐) minibus | |
186 | {0x1F691, 0x1F693, prEmojiPresentation}, // E0.6 [3] (🚑..🚓) ambulance..police car | |
187 | {0x1F694, 0x1F694, prEmojiPresentation}, // E0.7 [1] (🚔) oncoming police car | |
188 | {0x1F695, 0x1F695, prEmojiPresentation}, // E0.6 [1] (🚕) taxi | |
189 | {0x1F696, 0x1F696, prEmojiPresentation}, // E1.0 [1] (🚖) oncoming taxi | |
190 | {0x1F697, 0x1F697, prEmojiPresentation}, // E0.6 [1] (🚗) automobile | |
191 | {0x1F698, 0x1F698, prEmojiPresentation}, // E0.7 [1] (🚘) oncoming automobile | |
192 | {0x1F699, 0x1F69A, prEmojiPresentation}, // E0.6 [2] (🚙..🚚) sport utility vehicle..delivery truck | |
193 | {0x1F69B, 0x1F6A1, prEmojiPresentation}, // E1.0 [7] (🚛..🚡) articulated lorry..aerial tramway | |
194 | {0x1F6A2, 0x1F6A2, prEmojiPresentation}, // E0.6 [1] (🚢) ship | |
195 | {0x1F6A3, 0x1F6A3, prEmojiPresentation}, // E1.0 [1] (🚣) person rowing boat | |
196 | {0x1F6A4, 0x1F6A5, prEmojiPresentation}, // E0.6 [2] (🚤..🚥) speedboat..horizontal traffic light | |
197 | {0x1F6A6, 0x1F6A6, prEmojiPresentation}, // E1.0 [1] (🚦) vertical traffic light | |
198 | {0x1F6A7, 0x1F6AD, prEmojiPresentation}, // E0.6 [7] (🚧..🚭) construction..no smoking | |
199 | {0x1F6AE, 0x1F6B1, prEmojiPresentation}, // E1.0 [4] (🚮..🚱) litter in bin sign..non-potable water | |
200 | {0x1F6B2, 0x1F6B2, prEmojiPresentation}, // E0.6 [1] (🚲) bicycle | |
201 | {0x1F6B3, 0x1F6B5, prEmojiPresentation}, // E1.0 [3] (🚳..🚵) no bicycles..person mountain biking | |
202 | {0x1F6B6, 0x1F6B6, prEmojiPresentation}, // E0.6 [1] (🚶) person walking | |
203 | {0x1F6B7, 0x1F6B8, prEmojiPresentation}, // E1.0 [2] (🚷..🚸) no pedestrians..children crossing | |
204 | {0x1F6B9, 0x1F6BE, prEmojiPresentation}, // E0.6 [6] (🚹..🚾) men’s room..water closet | |
205 | {0x1F6BF, 0x1F6BF, prEmojiPresentation}, // E1.0 [1] (🚿) shower | |
206 | {0x1F6C0, 0x1F6C0, prEmojiPresentation}, // E0.6 [1] (🛀) person taking bath | |
207 | {0x1F6C1, 0x1F6C5, prEmojiPresentation}, // E1.0 [5] (🛁..🛅) bathtub..left luggage | |
208 | {0x1F6CC, 0x1F6CC, prEmojiPresentation}, // E1.0 [1] (🛌) person in bed | |
209 | {0x1F6D0, 0x1F6D0, prEmojiPresentation}, // E1.0 [1] (🛐) place of worship | |
210 | {0x1F6D1, 0x1F6D2, prEmojiPresentation}, // E3.0 [2] (🛑..🛒) stop sign..shopping cart | |
211 | {0x1F6D5, 0x1F6D5, prEmojiPresentation}, // E12.0 [1] (🛕) hindu temple | |
212 | {0x1F6D6, 0x1F6D7, prEmojiPresentation}, // E13.0 [2] (🛖..🛗) hut..elevator | |
213 | {0x1F6DD, 0x1F6DF, prEmojiPresentation}, // E14.0 [3] (🛝..🛟) playground slide..ring buoy | |
214 | {0x1F6EB, 0x1F6EC, prEmojiPresentation}, // E1.0 [2] (🛫..🛬) airplane departure..airplane arrival | |
215 | {0x1F6F4, 0x1F6F6, prEmojiPresentation}, // E3.0 [3] (🛴..🛶) kick scooter..canoe | |
216 | {0x1F6F7, 0x1F6F8, prEmojiPresentation}, // E5.0 [2] (🛷..🛸) sled..flying saucer | |
217 | {0x1F6F9, 0x1F6F9, prEmojiPresentation}, // E11.0 [1] (🛹) skateboard | |
218 | {0x1F6FA, 0x1F6FA, prEmojiPresentation}, // E12.0 [1] (🛺) auto rickshaw | |
219 | {0x1F6FB, 0x1F6FC, prEmojiPresentation}, // E13.0 [2] (🛻..🛼) pickup truck..roller skate | |
220 | {0x1F7E0, 0x1F7EB, prEmojiPresentation}, // E12.0 [12] (🟠..🟫) orange circle..brown square | |
221 | {0x1F7F0, 0x1F7F0, prEmojiPresentation}, // E14.0 [1] (🟰) heavy equals sign | |
222 | {0x1F90C, 0x1F90C, prEmojiPresentation}, // E13.0 [1] (🤌) pinched fingers | |
223 | {0x1F90D, 0x1F90F, prEmojiPresentation}, // E12.0 [3] (🤍..🤏) white heart..pinching hand | |
224 | {0x1F910, 0x1F918, prEmojiPresentation}, // E1.0 [9] (🤐..🤘) zipper-mouth face..sign of the horns | |
225 | {0x1F919, 0x1F91E, prEmojiPresentation}, // E3.0 [6] (🤙..🤞) call me hand..crossed fingers | |
226 | {0x1F91F, 0x1F91F, prEmojiPresentation}, // E5.0 [1] (🤟) love-you gesture | |
227 | {0x1F920, 0x1F927, prEmojiPresentation}, // E3.0 [8] (🤠..🤧) cowboy hat face..sneezing face | |
228 | {0x1F928, 0x1F92F, prEmojiPresentation}, // E5.0 [8] (🤨..🤯) face with raised eyebrow..exploding head | |
229 | {0x1F930, 0x1F930, prEmojiPresentation}, // E3.0 [1] (🤰) pregnant woman | |
230 | {0x1F931, 0x1F932, prEmojiPresentation}, // E5.0 [2] (🤱..🤲) breast-feeding..palms up together | |
231 | {0x1F933, 0x1F93A, prEmojiPresentation}, // E3.0 [8] (🤳..🤺) selfie..person fencing | |
232 | {0x1F93C, 0x1F93E, prEmojiPresentation}, // E3.0 [3] (🤼..🤾) people wrestling..person playing handball | |
233 | {0x1F93F, 0x1F93F, prEmojiPresentation}, // E12.0 [1] (🤿) diving mask | |
234 | {0x1F940, 0x1F945, prEmojiPresentation}, // E3.0 [6] (🥀..🥅) wilted flower..goal net | |
235 | {0x1F947, 0x1F94B, prEmojiPresentation}, // E3.0 [5] (🥇..🥋) 1st place medal..martial arts uniform | |
236 | {0x1F94C, 0x1F94C, prEmojiPresentation}, // E5.0 [1] (🥌) curling stone | |
237 | {0x1F94D, 0x1F94F, prEmojiPresentation}, // E11.0 [3] (🥍..🥏) lacrosse..flying disc | |
238 | {0x1F950, 0x1F95E, prEmojiPresentation}, // E3.0 [15] (🥐..🥞) croissant..pancakes | |
239 | {0x1F95F, 0x1F96B, prEmojiPresentation}, // E5.0 [13] (🥟..🥫) dumpling..canned food | |
240 | {0x1F96C, 0x1F970, prEmojiPresentation}, // E11.0 [5] (🥬..🥰) leafy green..smiling face with hearts | |
241 | {0x1F971, 0x1F971, prEmojiPresentation}, // E12.0 [1] (🥱) yawning face | |
242 | {0x1F972, 0x1F972, prEmojiPresentation}, // E13.0 [1] (🥲) smiling face with tear | |
243 | {0x1F973, 0x1F976, prEmojiPresentation}, // E11.0 [4] (🥳..🥶) partying face..cold face | |
244 | {0x1F977, 0x1F978, prEmojiPresentation}, // E13.0 [2] (🥷..🥸) ninja..disguised face | |
245 | {0x1F979, 0x1F979, prEmojiPresentation}, // E14.0 [1] (🥹) face holding back tears | |
246 | {0x1F97A, 0x1F97A, prEmojiPresentation}, // E11.0 [1] (🥺) pleading face | |
247 | {0x1F97B, 0x1F97B, prEmojiPresentation}, // E12.0 [1] (🥻) sari | |
248 | {0x1F97C, 0x1F97F, prEmojiPresentation}, // E11.0 [4] (🥼..🥿) lab coat..flat shoe | |
249 | {0x1F980, 0x1F984, prEmojiPresentation}, // E1.0 [5] (🦀..🦄) crab..unicorn | |
250 | {0x1F985, 0x1F991, prEmojiPresentation}, // E3.0 [13] (🦅..🦑) eagle..squid | |
251 | {0x1F992, 0x1F997, prEmojiPresentation}, // E5.0 [6] (🦒..🦗) giraffe..cricket | |
252 | {0x1F998, 0x1F9A2, prEmojiPresentation}, // E11.0 [11] (🦘..🦢) kangaroo..swan | |
253 | {0x1F9A3, 0x1F9A4, prEmojiPresentation}, // E13.0 [2] (🦣..🦤) mammoth..dodo | |
254 | {0x1F9A5, 0x1F9AA, prEmojiPresentation}, // E12.0 [6] (🦥..🦪) sloth..oyster | |
255 | {0x1F9AB, 0x1F9AD, prEmojiPresentation}, // E13.0 [3] (🦫..🦭) beaver..seal | |
256 | {0x1F9AE, 0x1F9AF, prEmojiPresentation}, // E12.0 [2] (🦮..🦯) guide dog..white cane | |
257 | {0x1F9B0, 0x1F9B9, prEmojiPresentation}, // E11.0 [10] (🦰..🦹) red hair..supervillain | |
258 | {0x1F9BA, 0x1F9BF, prEmojiPresentation}, // E12.0 [6] (🦺..🦿) safety vest..mechanical leg | |
259 | {0x1F9C0, 0x1F9C0, prEmojiPresentation}, // E1.0 [1] (🧀) cheese wedge | |
260 | {0x1F9C1, 0x1F9C2, prEmojiPresentation}, // E11.0 [2] (🧁..🧂) cupcake..salt | |
261 | {0x1F9C3, 0x1F9CA, prEmojiPresentation}, // E12.0 [8] (🧃..🧊) beverage box..ice | |
262 | {0x1F9CB, 0x1F9CB, prEmojiPresentation}, // E13.0 [1] (🧋) bubble tea | |
263 | {0x1F9CC, 0x1F9CC, prEmojiPresentation}, // E14.0 [1] (🧌) troll | |
264 | {0x1F9CD, 0x1F9CF, prEmojiPresentation}, // E12.0 [3] (🧍..🧏) person standing..deaf person | |
265 | {0x1F9D0, 0x1F9E6, prEmojiPresentation}, // E5.0 [23] (🧐..🧦) face with monocle..socks | |
266 | {0x1F9E7, 0x1F9FF, prEmojiPresentation}, // E11.0 [25] (🧧..🧿) red envelope..nazar amulet | |
267 | {0x1FA70, 0x1FA73, prEmojiPresentation}, // E12.0 [4] (🩰..🩳) ballet shoes..shorts | |
268 | {0x1FA74, 0x1FA74, prEmojiPresentation}, // E13.0 [1] (🩴) thong sandal | |
269 | {0x1FA78, 0x1FA7A, prEmojiPresentation}, // E12.0 [3] (🩸..🩺) drop of blood..stethoscope | |
270 | {0x1FA7B, 0x1FA7C, prEmojiPresentation}, // E14.0 [2] (🩻..🩼) x-ray..crutch | |
271 | {0x1FA80, 0x1FA82, prEmojiPresentation}, // E12.0 [3] (🪀..🪂) yo-yo..parachute | |
272 | {0x1FA83, 0x1FA86, prEmojiPresentation}, // E13.0 [4] (🪃..🪆) boomerang..nesting dolls | |
273 | {0x1FA90, 0x1FA95, prEmojiPresentation}, // E12.0 [6] (🪐..🪕) ringed planet..banjo | |
274 | {0x1FA96, 0x1FAA8, prEmojiPresentation}, // E13.0 [19] (🪖..🪨) military helmet..rock | |
275 | {0x1FAA9, 0x1FAAC, prEmojiPresentation}, // E14.0 [4] (🪩..🪬) mirror ball..hamsa | |
276 | {0x1FAB0, 0x1FAB6, prEmojiPresentation}, // E13.0 [7] (🪰..🪶) fly..feather | |
277 | {0x1FAB7, 0x1FABA, prEmojiPresentation}, // E14.0 [4] (🪷..🪺) lotus..nest with eggs | |
278 | {0x1FAC0, 0x1FAC2, prEmojiPresentation}, // E13.0 [3] (🫀..🫂) anatomical heart..people hugging | |
279 | {0x1FAC3, 0x1FAC5, prEmojiPresentation}, // E14.0 [3] (🫃..🫅) pregnant man..person with crown | |
280 | {0x1FAD0, 0x1FAD6, prEmojiPresentation}, // E13.0 [7] (🫐..🫖) blueberries..teapot | |
281 | {0x1FAD7, 0x1FAD9, prEmojiPresentation}, // E14.0 [3] (🫗..🫙) pouring liquid..jar | |
282 | {0x1FAE0, 0x1FAE7, prEmojiPresentation}, // E14.0 [8] (🫠..🫧) melting face..bubbles | |
283 | {0x1FAF0, 0x1FAF6, prEmojiPresentation}, // E14.0 [7] (🫰..🫶) hand with index finger and thumb crossed..heart hands | |
284 | } |
306 | 306 | // Output: First |line. |
307 | 307 | //‖Second |line.‖ |
308 | 308 | } |
309 | ||
310 | func ExampleStringWidth() { | |
311 | fmt.Println(uniseg.StringWidth("Hello, 世界")) | |
312 | // Output: 11 | |
313 | } |
2 | 2 | // This program generates a Go property file from Unicode Character
3 | 3 | // Database auxiliary data files. The command line arguments are as follows: |
4 | 4 | // |
5 | // 1. The name of the Unicode data file (just the filename, without extension). | |
6 | // 2. The name of the locally generated Go file. | |
7 | // 3. The name of the slice mapping code points to properties. | |
8 | // 4. The name of the generator, for logging purposes. | |
9 | // 5. (Optional) Flags, comma-separated. The following flags are available: | |
10 | // - "emojis": include emoji properties (Extended Pictographic only). | |
11 | // - "gencat": include general category properties. | |
5 | // 1. The name of the Unicode data file (just the filename, without extension). | |
6 | // Can be "-" (to skip) if the emoji flag is included. | |
7 | // 2. The name of the locally generated Go file. | |
8 | // 3. The name of the slice mapping code points to properties. | |
9 | // 4. The name of the generator, for logging purposes. | |
10 | // 5. (Optional) Flags, comma-separated. The following flags are available: | |
11 | // - "emojis=<property>": include the specified emoji properties (e.g. | |
12 | // "Extended_Pictographic"). | |
13 | // - "gencat": include general category properties. | |
12 | 14 | // |
13 | //go:generate go run gen_properties.go auxiliary/GraphemeBreakProperty graphemeproperties.go graphemeCodePoints graphemes emojis | |
14 | //go:generate go run gen_properties.go auxiliary/WordBreakProperty wordproperties.go workBreakCodePoints words emojis | |
15 | //go:generate go run gen_properties.go auxiliary/GraphemeBreakProperty graphemeproperties.go graphemeCodePoints graphemes emojis=Extended_Pictographic | |
16 | //go:generate go run gen_properties.go auxiliary/WordBreakProperty wordproperties.go workBreakCodePoints words emojis=Extended_Pictographic | |
15 | 17 | //go:generate go run gen_properties.go auxiliary/SentenceBreakProperty sentenceproperties.go sentenceBreakCodePoints sentences |
16 | 18 | //go:generate go run gen_properties.go LineBreak lineproperties.go lineBreakCodePoints lines gencat |
17 | 19 | //go:generate go run gen_properties.go EastAsianWidth eastasianwidth.go eastAsianWidth eastasianwidth |
20 | //go:generate go run gen_properties.go - emojipresentation.go emojiPresentation emojipresentation emojis=Emoji_Presentation | |
18 | 21 | package main |
19 | 22 | |
20 | 23 | import ( |
37 | 40 | // We want to test against a specific version rather than the latest. When the |
38 | 41 | // package is upgraded to a new version, change these to generate new tests. |
39 | 42 | const ( |
40 | gbpURL = `https://www.unicode.org/Public/14.0.0/ucd/%s.txt` | |
41 | emojiURL = `https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt` | |
43 | propertyURL = `https://www.unicode.org/Public/14.0.0/ucd/%s.txt` | |
44 | emojiURL = `https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt` | |
42 | 45 | ) |
43 | 46 | |
44 | 47 | // The regular expression for a line containing a code point range property. |
54 | 57 | log.SetFlags(0) |
55 | 58 | |
56 | 59 | // Parse flags. |
57 | flags := make(map[string]struct{}) | |
60 | flags := make(map[string]string) | |
58 | 61 | if len(os.Args) >= 6 { |
59 | 62 | for _, flag := range strings.Split(os.Args[5], ",") { |
60 | flags[flag] = struct{}{} | |
63 | flagFields := strings.Split(flag, "=") | |
64 | if len(flagFields) == 1 { | |
65 | flags[flagFields[0]] = "yes" | |
66 | } else { | |
67 | flags[flagFields[0]] = flagFields[1] | |
68 | } | |
61 | 69 | } |
62 | 70 | } |
63 | 71 | |
64 | 72 | // Parse the text file and generate Go source code from it. |
65 | var emojis string | |
66 | if _, ok := flags["emojis"]; ok { | |
67 | emojis = emojiURL | |
68 | } | |
69 | 73 | _, includeGeneralCategory := flags["gencat"] |
70 | src, err := parse(fmt.Sprintf(gbpURL, os.Args[1]), emojis, includeGeneralCategory) | |
74 | var mainURL string | |
75 | if os.Args[1] != "-" { | |
76 | mainURL = fmt.Sprintf(propertyURL, os.Args[1]) | |
77 | } | |
78 | src, err := parse(mainURL, flags["emojis"], includeGeneralCategory) | |
71 | 79 | if err != nil { |
72 | 80 | log.Fatal(err) |
73 | 81 | } |
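The flag handling in this hunk can be tried in isolation. The following is a minimal sketch, not the generator's actual code: `parseFlags` is a hypothetical helper name, and it uses `strings.SplitN` instead of `strings.Split` so that a value containing a further `=` stays intact (an assumption about desired behavior, not something the diff requires).

```go
package main

import (
	"fmt"
	"strings"
)

// parseFlags (hypothetical helper) turns a comma-separated flag list such as
// "emojis=Emoji_Presentation,gencat" into a map. Flags without a value are
// stored as "yes", as in the hunk above.
func parseFlags(arg string) map[string]string {
	flags := make(map[string]string)
	for _, flag := range strings.Split(arg, ",") {
		// Split at the first "=" only; the rest belongs to the value.
		fields := strings.SplitN(flag, "=", 2)
		if len(fields) == 1 {
			flags[fields[0]] = "yes"
		} else {
			flags[fields[0]] = fields[1]
		}
	}
	return flags
}

func main() {
	flags := parseFlags("emojis=Emoji_Presentation,gencat")
	fmt.Println(flags["emojis"]) // prints "Emoji_Presentation"
	fmt.Println(flags["gencat"]) // prints "yes"
	_, ok := flags["missing"]
	fmt.Println(ok) // prints "false"
}
```

Because a missing key yields the empty string, passing `flags["emojis"]` straight to `parse` (as the new code does) doubles as the "no emoji property requested" signal.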
87 | 95 | |
88 | 96 | // parse parses the Unicode Properties text files located at the given URLs and |
89 | 97 | // returns their equivalent Go source code to be used in the uniseg package. If |
90 | // "emojiURL" is an empty string, no emoji code points will be included. If | |
98 | // "emojiProperty" is not an empty string, emoji code points for that emoji | |
99 | // property (e.g. "Extended_Pictographic") will be included. In those cases, you | |
100 | // may pass an empty "propertyURL" to skip parsing the main properties file. If | |
91 | 101 | // "includeGeneralCategory" is true, the Unicode General Category property will |
92 | 102 | // be extracted from the comments and included in the output. |
93 | func parse(gbpURL, emojiURL string, includeGeneralCategory bool) (string, error) { | |
103 | func parse(propertyURL, emojiProperty string, includeGeneralCategory bool) (string, error) { | |
104 | if propertyURL == "" && emojiProperty == "" { | |
105 | return "", errors.New("no properties to parse") | |
106 | } | |
107 | ||
94 | 108 | // Temporary buffer to hold properties. |
95 | 109 | var properties [][4]string |
96 | 110 | |
97 | 111 | // Open the first URL. |
98 | log.Printf("Parsing %s", gbpURL) | |
99 | res, err := http.Get(gbpURL) | |
100 | if err != nil { | |
101 | return "", err | |
102 | } | |
103 | in1 := res.Body | |
104 | defer in1.Close() | |
105 | ||
106 | // Parse it. | |
107 | scanner := bufio.NewScanner(in1) | |
108 | num := 0 | |
109 | for scanner.Scan() { | |
110 | num++ | |
111 | line := strings.TrimSpace(scanner.Text()) | |
112 | ||
113 | // Skip comments and empty lines. | |
114 | if strings.HasPrefix(line, "#") || line == "" { | |
115 | continue | |
116 | } | |
117 | ||
118 | // Everything else must be a code point range, a property and a comment. | |
119 | from, to, property, comment, err := parseProperty(line) | |
112 | if propertyURL != "" { | |
113 | log.Printf("Parsing %s", propertyURL) | |
114 | res, err := http.Get(propertyURL) | |
120 | 115 | if err != nil { |
121 | return "", fmt.Errorf("%s line %d: %v", os.Args[4], num, err) | |
122 | } | |
123 | properties = append(properties, [4]string{from, to, property, comment}) | |
124 | } | |
125 | if err := scanner.Err(); err != nil { | |
126 | return "", err | |
116 | return "", err | |
117 | } | |
118 | in1 := res.Body | |
119 | defer in1.Close() | |
120 | ||
121 | // Parse it. | |
122 | scanner := bufio.NewScanner(in1) | |
123 | num := 0 | |
124 | for scanner.Scan() { | |
125 | num++ | |
126 | line := strings.TrimSpace(scanner.Text()) | |
127 | ||
128 | // Skip comments and empty lines. | |
129 | if strings.HasPrefix(line, "#") || line == "" { | |
130 | continue | |
131 | } | |
132 | ||
133 | // Everything else must be a code point range, a property and a comment. | |
134 | from, to, property, comment, err := parseProperty(line) | |
135 | if err != nil { | |
136 | return "", fmt.Errorf("%s line %d: %v", os.Args[4], num, err) | |
137 | } | |
138 | properties = append(properties, [4]string{from, to, property, comment}) | |
139 | } | |
140 | if err := scanner.Err(); err != nil { | |
141 | return "", err | |
142 | } | |
127 | 143 | } |
128 | 144 | |
129 | 145 | // Open the second URL. |
130 | if emojiURL != "" { | |
146 | if emojiProperty != "" { | |
131 | 147 | log.Printf("Parsing %s", emojiURL) |
132 | res, err = http.Get(emojiURL) | |
148 | res, err := http.Get(emojiURL) | |
133 | 149 | if err != nil { |
134 | 150 | return "", err |
135 | 151 | } |
137 | 153 | defer in2.Close() |
138 | 154 | |
139 | 155 | // Parse it. |
140 | scanner = bufio.NewScanner(in2) | |
141 | num = 0 | |
156 | scanner := bufio.NewScanner(in2) | |
157 | num := 0 | |
142 | 158 | for scanner.Scan() { |
143 | 159 | num++ |
144 | 160 | line := scanner.Text() |
145 | 161 | |
146 | 162 | // Skip comments, empty lines, and everything not containing the
147 | 163 | // requested emoji property.
148 | if strings.HasPrefix(line, "#") || line == "" || !strings.Contains(line, "Extended_Pictographic") { | |
164 | if strings.HasPrefix(line, "#") || line == "" || !strings.Contains(line, emojiProperty) { | |
149 | 165 | continue |
150 | 166 | } |
151 | 167 | |
188 | 204 | // Code generated via go generate from gen_properties.go. DO NOT EDIT. |
189 | 205 | |
190 | 206 | // ` + os.Args[3] + ` are taken from |
191 | // ` + gbpURL + emojiComment + ` | |
207 | // ` + propertyURL + emojiComment + ` | |
192 | 208 | // on ` + time.Now().Format("January 2, 2006") + `. See https://www.unicode.org/license.html for the Unicode |
193 | 209 | // license agreement. |
194 | 210 | var ` + os.Args[3] + ` = [][` + strconv.Itoa(columns) + `]int{ |
3 | 3 | |
4 | 4 | // Graphemes implements an iterator over Unicode grapheme clusters, or |
5 | 5 | // user-perceived characters. While iterating, it also provides information |
6 | // about word boundaries, sentence boundaries, and line breaks. | |
6 | // about word boundaries, sentence boundaries, line breaks, and monospace | |
7 | // character widths. | |
7 | 8 | // |
8 | 9 | // After constructing the class via [NewGraphemes] for a given string "str", |
9 | 10 | // [Graphemes.Next] is called for every grapheme cluster in a loop until it |
10 | 11 | // returns false. Inside the loop, information about the grapheme cluster as |
11 | // well as boundary information is available via the various methods (see | |
12 | // examples below). | |
12 | // well as boundary information and character width is available via the various | |
13 | // methods (see examples below). | |
13 | 14 | // |
14 | 15 | // Using this class to iterate over a string is convenient but it is much slower |
15 | 16 | // than using this package's [Step] or [StepString] functions or any of the |
133 | 134 | return g.boundaries & MaskLine |
134 | 135 | } |
135 | 136 | |
137 | // Width returns the monospace width of the current grapheme cluster. | |
138 | func (g *Graphemes) Width() int { | |
139 | if g.state < 0 { | |
140 | return 0 | |
141 | } | |
142 | return g.boundaries >> ShiftWidth | |
143 | } | |
144 | ||
136 | 145 | // Reset puts the iterator into its initial state such that the next call to |
137 | 146 | // [Graphemes.Next] sets it to the first grapheme cluster again. |
138 | 147 | func (g *Graphemes) Reset() { |
153 | 162 | return |
154 | 163 | } |
155 | 164 | |
165 | // The number of bits the grapheme property must be shifted to make place for | |
166 | // grapheme states. | |
167 | const shiftGraphemePropState = 4 | |
168 | ||
156 | 169 | // FirstGraphemeCluster returns the first grapheme cluster found in the given |
157 | 170 | // byte slice according to the rules of Unicode Standard Annex #29, Grapheme |
158 | 171 | // Cluster Boundaries. This function can be called continuously to extract all |
168 | 181 | // "cluster" byte slice is the sub-slice of the input slice containing the |
169 | 182 | // identified grapheme cluster. |
170 | 183 | // |
184 | // The returned width is the width of the grapheme cluster for most monospace | |
185 | // fonts where a value of 1 represents one character cell. | |
186 | // | |
171 | 187 | // Given an empty byte slice "b", the function returns nil values. |
172 | 188 | // |
173 | 189 | // While slightly less convenient than using the Graphemes class, this function |
174 | 190 | // has much better performance and makes no allocations. It lends itself well to |
175 | 191 | // large byte slices. |
176 | // | |
177 | // The "reserved" return value is a placeholder for future functionality and may | |
178 | // be ignored for the time being. | |
179 | func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, reserved, newState int) { | |
192 | func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, width, newState int) { | |
180 | 193 | // An empty byte slice returns nothing. |
181 | 194 | if len(b) == 0 { |
182 | 195 | return |
185 | 198 | // Extract the first rune. |
186 | 199 | r, length := utf8.DecodeRune(b) |
187 | 200 | if len(b) <= length { // If we're already past the end, there is nothing else to parse. |
188 | return b, nil, 0, grAny | |
201 | var prop int | |
202 | if state < 0 { | |
203 | prop = property(graphemeCodePoints, r) | |
204 | } else { | |
205 | prop = state >> shiftGraphemePropState | |
206 | } | |
207 | return b, nil, runeWidth(r, prop), grAny | (prop << shiftGraphemePropState) | |
189 | 208 | } |
190 | 209 | |
191 | 210 | // If we don't know the state, determine it now. |
211 | var firstProp int | |
192 | 212 | if state < 0 { |
193 | state, _ = transitionGraphemeState(state, r) | |
194 | } | |
213 | state, firstProp, _ = transitionGraphemeState(state, r) | |
214 | } else { | |
215 | firstProp = state >> shiftGraphemePropState | |
216 | } | |
217 | width += runeWidth(r, firstProp) | |
195 | 218 | |
196 | 219 | // Transition until we find a boundary. |
197 | var boundary bool | |
198 | 220 | for { |
221 | var ( | |
222 | prop int | |
223 | boundary bool | |
224 | ) | |
225 | ||
199 | 226 | r, l := utf8.DecodeRune(b[length:]) |
200 | state, boundary = transitionGraphemeState(state, r) | |
227 | state, prop, boundary = transitionGraphemeState(state&maskGraphemeState, r) | |
201 | 228 | |
202 | 229 | if boundary { |
203 | return b[:length], b[length:], 0, state | |
230 | return b[:length], b[length:], width, state | (prop << shiftGraphemePropState) | |
231 | } | |
232 | ||
233 | if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL { | |
234 | width += runeWidth(r, prop) | |
235 | } else if firstProp == prExtendedPictographic { | |
236 | if r == 0xfe0e { | |
237 | width = 1 | |
238 | } else { | |
239 | width = 2 | |
240 | } | |
204 | 241 | } |
205 | 242 | |
206 | 243 | length += l |
207 | 244 | if len(b) <= length { |
208 | return b, nil, 0, grAny | |
245 | return b, nil, width, grAny | (prop << shiftGraphemePropState) | |
209 | 246 | } |
210 | 247 | } |
211 | 248 | } |
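The width accumulation inside the loop above depends on the property of the cluster's first code point. As a rough sketch of that rule only: the property constants, the `codePoint` type and the per-rune widths below are hypothetical stand-ins (the package's `runeWidth` is not shown in this diff), so the numbers are illustrative.

```go
package main

import "fmt"

// Hypothetical stand-ins for the package's grapheme property constants.
const (
	propOther = iota
	propExtendedPictographic
	propRegionalIndicator
	propL
)

// codePoint pairs a rune with an assumed single-rune width.
type codePoint struct {
	r     rune
	width int
}

// clusterWidth applies the accumulation rule from the loop above to one
// grapheme cluster whose first code point has property firstProp.
func clusterWidth(firstProp int, cluster []codePoint) int {
	if len(cluster) == 0 {
		return 0
	}
	width := cluster[0].width
	for _, cp := range cluster[1:] {
		switch firstProp {
		case propExtendedPictographic:
			// A text variation selector (U+FE0E) forces width 1,
			// any other continuation forces width 2.
			if cp.r == 0xfe0e {
				width = 1
			} else {
				width = 2
			}
		case propRegionalIndicator, propL:
			// Flags and Hangul syllable sequences keep the first rune's width.
		default:
			width += cp.width
		}
	}
	return width
}

func main() {
	heart := []codePoint{{0x2764, 1}, {0xfe0e, 0}}
	fmt.Println(clusterWidth(propExtendedPictographic, heart)) // prints 1
	accented := []codePoint{{'e', 1}, {0x0301, 0}}
	fmt.Println(clusterWidth(propOther, accented)) // prints 1
}
```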
212 | 249 | |
213 | 250 | // FirstGraphemeClusterInString is like [FirstGraphemeCluster] but its input and |
214 | 251 | // outputs are strings. |
215 | func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, reserved, newState int) { | |
252 | func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, width, newState int) { | |
216 | 253 | // An empty string returns nothing. |
217 | 254 | if len(str) == 0 { |
218 | 255 | return |
221 | 258 | // Extract the first rune. |
222 | 259 | r, length := utf8.DecodeRuneInString(str) |
223 | 260 | if len(str) <= length { // If we're already past the end, there is nothing else to parse. |
224 | return str, "", 0, grAny | |
261 | var prop int | |
262 | if state < 0 { | |
263 | prop = property(graphemeCodePoints, r) | |
264 | } else { | |
265 | prop = state >> shiftGraphemePropState | |
266 | } | |
267 | return str, "", runeWidth(r, prop), grAny | (prop << shiftGraphemePropState) | |
225 | 268 | } |
226 | 269 | |
227 | 270 | // If we don't know the state, determine it now. |
271 | var firstProp int | |
228 | 272 | if state < 0 { |
229 | state, _ = transitionGraphemeState(state, r) | |
230 | } | |
273 | state, firstProp, _ = transitionGraphemeState(state, r) | |
274 | } else { | |
275 | firstProp = state >> shiftGraphemePropState | |
276 | } | |
277 | width += runeWidth(r, firstProp) | |
231 | 278 | |
232 | 279 | // Transition until we find a boundary. |
233 | var boundary bool | |
234 | 280 | for { |
281 | var ( | |
282 | prop int | |
283 | boundary bool | |
284 | ) | |
285 | ||
235 | 286 | r, l := utf8.DecodeRuneInString(str[length:]) |
236 | state, boundary = transitionGraphemeState(state, r) | |
287 | state, prop, boundary = transitionGraphemeState(state&maskGraphemeState, r) | |
237 | 288 | |
238 | 289 | if boundary { |
239 | return str[:length], str[length:], 0, state | |
290 | return str[:length], str[length:], width, state | (prop << shiftGraphemePropState) | |
291 | } | |
292 | ||
293 | if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL { | |
294 | width += runeWidth(r, prop) | |
295 | } else if firstProp == prExtendedPictographic { | |
296 | if r == 0xfe0e { | |
297 | width = 1 | |
298 | } else { | |
299 | width = 2 | |
300 | } | |
240 | 301 | } |
241 | 302 | |
242 | 303 | length += l |
243 | 304 | if len(str) <= length { |
244 | return str, "", 0, grAny | |
245 | } | |
246 | } | |
247 | } | |
305 | return str, "", width, grAny | (prop << shiftGraphemePropState) | |
306 | } | |
307 | } | |
308 | } |
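Both functions above return a state that carries the grapheme parser state in the low bits and the property of the next code point in the bits above `shiftGraphemePropState`. A small sketch of that packing follows. `packState` and `unpackState` are hypothetical helpers, and the value of `maskGraphemeState` is an assumption (it is not shown in this diff); it is taken here to cover exactly the low four bits.

```go
package main

import "fmt"

// shiftGraphemePropState is the value from the diff above.
const shiftGraphemePropState = 4

// maskGraphemeState is assumed to cover the bits below the shift.
const maskGraphemeState = 1<<shiftGraphemePropState - 1

// packState combines a parser state and a property into one int.
func packState(state, prop int) int {
	return state | (prop << shiftGraphemePropState)
}

// unpackState reverses packState.
func unpackState(packed int) (state, prop int) {
	return packed & maskGraphemeState, packed >> shiftGraphemePropState
}

func main() {
	packed := packState(5, 9)
	state, prop := unpackState(packed)
	fmt.Println(packed, state, prop) // prints "149 5 9"
}
```

This only works while every parser state fits below the shift, which is presumably why the new comment on `prPrepend` asks for grapheme properties to come first in the constant list.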
3 | 3 | |
4 | 4 | // graphemeBreakTestCases are Grapheme testcases taken from |
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakTest.txt |
6 | // on July 25, 2022. See | |
6 | // on September 10, 2022. See | |
7 | 7 | // https://www.unicode.org/license.html for the Unicode license agreement. |
8 | 8 | var graphemeBreakTestCases = []testCase{ |
9 | 9 | {original: "\u0020\u0020", expected: [][]rune{{0x0020}, {0x0020}}}, // ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3] |
6 | 6 | // and |
7 | 7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt |
8 | 8 | // ("Extended_Pictographic" only) |
9 | // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
10 | 10 | // license agreement. |
11 | 11 | var graphemeCodePoints = [][3]int{ |
12 | 12 | {0x0000, 0x0009, prControl}, // Cc [10] <control-0000>..<control-0009> |
26 | 26 | // |
27 | 27 | // This map is queried as follows: |
28 | 28 | // |
29 | // 1. Find specific state + specific property. Stop if found. | |
30 | // 2. Find specific state + any property. | |
31 | // 3. Find any state + specific property. | |
32 | // 4. If only (2) or (3) (but not both) was found, stop. | |
33 | // 5. If both (2) and (3) were found, use state from (3) and breaking instruction | |
34 | // from the transition with the lower rule number, prefer (3) if rule numbers | |
35 | // are equal. Stop. | |
36 | // 6. Assume grAny and grBoundary. | |
29 | // 1. Find specific state + specific property. Stop if found. | |
30 | // 2. Find specific state + any property. | |
31 | // 3. Find any state + specific property. | |
32 | // 4. If only (2) or (3) (but not both) was found, stop. | |
33 | // 5. If both (2) and (3) were found, use state from (3) and breaking instruction | |
34 | // from the transition with the lower rule number, prefer (3) if rule numbers | |
35 | // are equal. Stop. | |
36 | // 6. Assume grAny and grBoundary. | |
37 | 37 | // |
38 | 38 | // Unicode version 14.0.0. |
39 | 39 | var grTransitions = map[[2]int][3]int{ |
91 | 91 | } |
92 | 92 | |
93 | 93 | // transitionGraphemeState determines the new state of the grapheme cluster |
94 | // parser given the current state and the next code point. It also returns | |
95 | // whether a cluster boundary was detected. | |
96 | func transitionGraphemeState(state int, r rune) (newState int, boundary bool) { | |
94 | // parser given the current state and the next code point. It also returns the | |
95 | // code point's grapheme property (the value mapped by the [graphemeCodePoints] | |
96 | // table) and whether a cluster boundary was detected. | |
97 | func transitionGraphemeState(state int, r rune) (newState, prop int, boundary bool) { | |
97 | 98 | // Determine the property of the next character. |
98 | nextProperty := property(graphemeCodePoints, r) | |
99 | prop = property(graphemeCodePoints, r) | |
99 | 100 | |
100 | 101 | // Find the applicable transition. |
101 | transition, ok := grTransitions[[2]int{state, nextProperty}] | |
102 | transition, ok := grTransitions[[2]int{state, prop}] | |
102 | 103 | if ok { |
103 | 104 | // We have a specific transition. We'll use it. |
104 | return transition[0], transition[1] == grBoundary | |
105 | return transition[0], prop, transition[1] == grBoundary | |
105 | 106 | } |
106 | 107 | |
107 | 108 | // No specific transition found. Try the less specific ones. |
108 | 109 | transAnyProp, okAnyProp := grTransitions[[2]int{state, prAny}] |
109 | transAnyState, okAnyState := grTransitions[[2]int{grAny, nextProperty}] | |
110 | transAnyState, okAnyState := grTransitions[[2]int{grAny, prop}] | |
110 | 111 | if okAnyProp && okAnyState { |
111 | 112 | // Both apply. We'll use a mix (see comments for grTransitions). |
112 | 113 | newState = transAnyState[0] |
119 | 120 | |
120 | 121 | if okAnyProp { |
121 | 122 | // We only have a specific state. |
122 | return transAnyProp[0], transAnyProp[1] == grBoundary | |
123 | return transAnyProp[0], prop, transAnyProp[1] == grBoundary | |
123 | 124 | // This branch will probably never be reached because okAnyState will |
124 | 125 | // always be true given the current transition map. But we keep it here |
125 | 126 | // for future modifications to the transition map where this may not be |
128 | 129 | |
129 | 130 | if okAnyState { |
130 | 131 | // We only have a specific property. |
131 | return transAnyState[0], transAnyState[1] == grBoundary | |
132 | return transAnyState[0], prop, transAnyState[1] == grBoundary | |
132 | 133 | } |
133 | 134 | |
134 | 135 | // No known transition. GB999: Any ÷ Any. |
135 | return grAny, true | |
136 | return grAny, prop, true | |
136 | 137 | } |
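The lookup order described for `grTransitions` (specific entry first, then the two wildcard entries, then the GB999 fallback) can be sketched on a toy table. Everything below is illustrative: the state and property constants, the rule numbers and the table contents are made up, and `lookup` is a hypothetical function that follows the documented steps rather than the package's code.

```go
package main

import "fmt"

// Toy states and properties; the zero values act as wildcards.
const (
	stAny = iota
	stCR
	stLF
	stX
)

const (
	prAny = iota
	prCR
	prLF
)

const (
	brkNo = iota
	brkYes
)

// table maps {state, property} to {new state, break instruction, rule number}.
var table = map[[2]int][3]int{
	{stCR, prLF}:  {stAny, brkNo, 30},
	{stCR, prAny}: {stAny, brkYes, 40},
	{stX, prAny}:  {stAny, brkNo, 10},
	{stAny, prCR}: {stCR, brkYes, 50},
	{stAny, prLF}: {stLF, brkYes, 50},
}

// lookup follows steps 1 to 6 of the query description.
func lookup(t map[[2]int][3]int, state, prop int) (newState int, isBoundary bool) {
	if tr, ok := t[[2]int{state, prop}]; ok {
		return tr[0], tr[1] == brkYes
	}
	anyProp, okAnyProp := t[[2]int{state, prAny}]
	anyState, okAnyState := t[[2]int{stAny, prop}]
	switch {
	case okAnyProp && okAnyState:
		// State from the any-state entry, break instruction from the
		// entry with the lower rule number (any-state wins ties).
		newState, isBoundary = anyState[0], anyState[1] == brkYes
		if anyProp[2] < anyState[2] {
			isBoundary = anyProp[1] == brkYes
		}
		return newState, isBoundary
	case okAnyProp:
		return anyProp[0], anyProp[1] == brkYes
	case okAnyState:
		return anyState[0], anyState[1] == brkYes
	}
	return stAny, true // No known transition: always break.
}

func main() {
	fmt.Println(lookup(table, stCR, prLF)) // prints "0 false"
	fmt.Println(lookup(table, stX, prLF))  // prints "2 false"
	fmt.Println(lookup(table, stAny, 7))   // prints "0 true"
}
```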
3 | 3 | |
4 | 4 | // lineBreakTestCases are line break test cases taken from
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/LineBreakTest.txt |
6 | // on July 25, 2022. See | |
6 | // on September 10, 2022. See | |
7 | 7 | // https://www.unicode.org/license.html for the Unicode license agreement. |
8 | 8 | var lineBreakTestCases = []testCase{ |
9 | 9 | {original: "\u0023\u0023", expected: [][]rune{{0x0023, 0x0023}}}, // × [0.3] NUMBER SIGN (AL) × [28.0] NUMBER SIGN (AL) ÷ [0.3] |
3 | 3 | |
4 | 4 | // lineBreakCodePoints are taken from |
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/LineBreak.txt |
6 | // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode | |
6 | // and | |
7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt | |
8 | // ("Extended_Pictographic" only) | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
7 | 10 | // license agreement. |
8 | 11 | var lineBreakCodePoints = [][4]int{ |
9 | 12 | {0x0000, 0x0008, prCM, gcCc}, // [9] <control-0000>..<control-0008> |
2 | 2 | // The Unicode properties as used in the various parsers. Only the ones needed |
3 | 3 | // in the context of this package are included. |
4 | 4 | const ( |
5 | prXX = 0 // Same as prAny. | |
6 | prAny = iota // prAny must be 0. | |
7 | prPrepend | |
5 | prXX = 0 // Same as prAny. | |
6 | prAny = iota // prAny must be 0. | |
7 | prPrepend // Grapheme properties must come first, to reduce the number of bits stored in the state vector. | |
8 | 8 | prCR |
9 | 9 | prLF |
10 | 10 | prControl |
85 | 85 | prW |
86 | 86 | prH |
87 | 87 | prF |
88 | prEmojiPresentation | |
88 | 89 | ) |
89 | 90 | |
90 | 91 | // Unicode General Categories. Only the ones needed in the context of this |
3 | 3 | |
4 | 4 | // sentenceBreakTestCases are sentence break test cases taken from
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/SentenceBreakTest.txt |
6 | // on July 25, 2022. See | |
6 | // on September 10, 2022. See | |
7 | 7 | // https://www.unicode.org/license.html for the Unicode license agreement. |
8 | 8 | var sentenceBreakTestCases = []testCase{ |
9 | 9 | {original: "\u0001\u0001", expected: [][]rune{{0x0001, 0x0001}}}, // ÷ [0.2] <START OF HEADING> (Other) × [998.0] <START OF HEADING> (Other) ÷ [0.3] |
3 | 3 | |
4 | 4 | // sentenceBreakCodePoints are taken from |
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/SentenceBreakProperty.txt |
6 | // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode | |
6 | // and | |
7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt | |
8 | // ("Extended_Pictographic" only) | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
7 | 10 | // license agreement. |
8 | 11 | var sentenceBreakCodePoints = [][3]int{ |
9 | 12 | {0x0009, 0x0009, prSp}, // Cc <control-0009> |
1 | 1 | |
2 | 2 | import "unicode/utf8" |
3 | 3 | |
4 | // The bit masks used to extract boundary information returned by the [Step] | |
5 | // function. | |
4 | // The bit masks used to extract boundary information returned by [Step]. | |
6 | 5 | const ( |
7 | 6 | MaskLine = 3 |
8 | 7 | MaskWord = 4 |
9 | 8 | MaskSentence = 8 |
10 | 9 | ) |
11 | 10 | |
11 | // The number of bits to shift the boundary information returned by [Step] to | |
12 | // obtain the monospace width of the grapheme cluster. | |
13 | const ShiftWidth = 4 | |
14 | ||
12 | 15 | // The bit positions by which boundary flags are shifted by the [Step] function. |
13 | // This must correspond to the Mask constants. | |
16 | // These must correspond to the Mask constants. | |
14 | 17 | const ( |
15 | 18 | shiftWord = 2 |
16 | 19 | shiftSentence = 3 |
20 | // shiftWidth is ShiftWidth above. No mask as these are always the remaining bits. | |
17 | 21 | ) |
18 | 22 | |
19 | 23 | // The bit positions by which states are shifted by the [Step] function. These |
20 | 24 | // values must ensure state values defined for each of the boundary algorithms |
21 | // don't overlap (and that they all still fit in a single int). | |
25 | // don't overlap (and that they all still fit in a single int). These must | |
26 | // correspond to the Mask constants. | |
22 | 27 | const ( |
23 | 28 | shiftWordState = 4 |
24 | 29 | shiftSentenceState = 9 |
25 | 30 | shiftLineState = 13 |
31 | shiftPropState = 21 // No mask as these are always the remaining bits. | |
26 | 32 | ) |
27 | 33 | |
28 | 34 | // The bit mask used to extract the state returned by the [Step] function, after |
53 | 59 | // boundary. |
54 | 60 | // - boundaries&MaskLine == LineCanBreak: You may or may not break the line at |
55 | 61 | // the boundary. |
62 | // - boundaries >> ShiftWidth: The width of the grapheme cluster for most | |
63 | // monospace fonts where a value of 1 represents one character cell. | |
56 | 64 | // |
57 | 65 | // This function can be called continuously to extract all grapheme clusters |
58 | 66 | // from a byte slice, as illustrated in the examples below. |
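To make the layout of the `boundaries` value concrete, here is a small sketch that packs and unpacks it using the constants shown in this diff. `pack` is a hypothetical helper written for illustration, and the line break value `1` is just a sample two-bit value, not a claim about which named constant it corresponds to.

```go
package main

import "fmt"

// Constants as they appear in the diff above.
const (
	MaskLine     = 3
	MaskWord     = 4
	MaskSentence = 8
	ShiftWidth   = 4
)

// pack (hypothetical) builds a boundaries value: line break information in
// bits 0-1, the word flag in bit 2, the sentence flag in bit 3, and the
// monospace width in all remaining bits.
func pack(lineBreak int, word, sentence bool, width int) int {
	b := lineBreak | (width << ShiftWidth)
	if word {
		b |= MaskWord
	}
	if sentence {
		b |= MaskSentence
	}
	return b
}

func main() {
	b := pack(1, true, false, 2)
	fmt.Println(b&MaskLine, b&MaskWord != 0, b&MaskSentence != 0, b>>ShiftWidth)
	// prints "1 true false 2"
}
```

Since the width occupies the highest bits, no mask is needed to read it back; a plain right shift by `ShiftWidth` is enough.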
86 | 94 | // Extract the first rune. |
87 | 95 | r, length := utf8.DecodeRune(b) |
88 | 96 | if len(b) <= length { // If we're already past the end, there is nothing else to parse. |
89 | return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | |
97 | var prop int | |
98 | if state < 0 { | |
99 | prop = property(graphemeCodePoints, r) | |
100 | } else { | |
101 | prop = state >> shiftPropState | |
102 | } | |
103 | return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (runeWidth(r, prop) << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | (prop << shiftPropState) | |
90 | 104 | } |
91 | 105 | |
92 | 106 | // If we don't know the state, determine it now. |
93 | var graphemeState, wordState, sentenceState, lineState int | |
107 | var graphemeState, wordState, sentenceState, lineState, firstProp int | |
94 | 108 | remainder := b[length:] |
95 | 109 | if state < 0 { |
96 | graphemeState, _ = transitionGraphemeState(state, r) | |
110 | graphemeState, firstProp, _ = transitionGraphemeState(state, r) | |
97 | 111 | wordState, _ = transitionWordBreakState(state, r, remainder, "") |
98 | 112 | sentenceState, _ = transitionSentenceBreakState(state, r, remainder, "") |
99 | 113 | lineState, _ = transitionLineBreakState(state, r, remainder, "") |
102 | 116 | wordState = (state >> shiftWordState) & maskWordState |
103 | 117 | sentenceState = (state >> shiftSentenceState) & maskSentenceState |
104 | 118 | lineState = (state >> shiftLineState) & maskLineState |
119 | firstProp = state >> shiftPropState | |
105 | 120 | } |
106 | 121 | |
107 | 122 | // Transition until we find a grapheme cluster boundary. |
108 | var ( | |
109 | graphemeBoundary, wordBoundary, sentenceBoundary bool | |
110 | lineBreak int | |
111 | ) | |
123 | width := runeWidth(r, firstProp) | |
112 | 124 | for { |
125 | var ( | |
126 | graphemeBoundary, wordBoundary, sentenceBoundary bool | |
127 | lineBreak, prop int | |
128 | ) | |
129 | ||
113 | 130 | r, l := utf8.DecodeRune(remainder) |
114 | 131 | remainder = b[length+l:] |
115 | 132 | |
116 | graphemeState, graphemeBoundary = transitionGraphemeState(graphemeState, r) | |
133 | graphemeState, prop, graphemeBoundary = transitionGraphemeState(graphemeState, r) | |
117 | 134 | wordState, wordBoundary = transitionWordBreakState(wordState, r, remainder, "") |
118 | 135 | sentenceState, sentenceBoundary = transitionSentenceBreakState(sentenceState, r, remainder, "") |
119 | 136 | lineState, lineBreak = transitionLineBreakState(lineState, r, remainder, "") |
120 | 137 | |
121 | 138 | if graphemeBoundary { |
122 | boundary := lineBreak | |
139 | boundary := lineBreak | (width << ShiftWidth) | |
123 | 140 | if wordBoundary { |
124 | 141 | boundary |= 1 << shiftWord |
125 | 142 | } |
126 | 143 | if sentenceBoundary { |
127 | 144 | boundary |= 1 << shiftSentence |
128 | 145 | } |
129 | return b[:length], b[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState) | |
146 | return b[:length], b[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState) | (prop << shiftPropState) | |
147 | } | |
148 | ||
149 | if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL { | |
150 | width += runeWidth(r, prop) | |
151 | } else if firstProp == prExtendedPictographic { | |
152 | if r == 0xfe0e { | |
153 | width = 1 | |
154 | } else { | |
155 | width = 2 | |
156 | } | |
130 | 157 | } |
131 | 158 | |
132 | 159 | length += l |
133 | 160 | if len(b) <= length { |
134 | return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | |
161 | return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (width << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | (prop << shiftPropState) | |
135 | 162 | } |
136 | 163 | } |
137 | 164 | } |
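The change above stores the cluster width in the upper bits of the returned boundaries value, next to the line break value and the word and sentence flags. The following is a minimal sketch of that kind of bit packing. The constant values and the `pack` and `unpack` helpers are assumptions made for illustration; they are not taken from the package.

```go
package main

import "fmt"

// Hypothetical bit layout, chosen for illustration only.
const (
	maskLine      = 3 // Bits 0-1: line break value.
	shiftWord     = 2 // Bit 2: word boundary flag.
	shiftSentence = 3 // Bit 3: sentence boundary flag.
	shiftWidth    = 4 // Bits 4 and up: monospace width.
)

// pack combines the line break value, both boundary flags, and the width
// into a single int.
func pack(lineBreak int, word, sentence bool, width int) int {
	boundaries := lineBreak&maskLine | width<<shiftWidth
	if word {
		boundaries |= 1 << shiftWord
	}
	if sentence {
		boundaries |= 1 << shiftSentence
	}
	return boundaries
}

// unpack reverses pack.
func unpack(boundaries int) (lineBreak int, word, sentence bool, width int) {
	lineBreak = boundaries & maskLine
	word = boundaries&(1<<shiftWord) != 0
	sentence = boundaries&(1<<shiftSentence) != 0
	width = boundaries >> shiftWidth
	return
}

func main() {
	b := pack(1, true, false, 2)
	lineBreak, word, sentence, width := unpack(b)
	fmt.Println(b, lineBreak, word, sentence, width) // 37 1 true false 2
}
```

Because the width occupies the highest bits, a caller can recover it with a single right shift, which is what the tests in this change do with `boundaries >> ShiftWidth`.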
146 | 173 | // Extract the first rune. |
147 | 174 | r, length := utf8.DecodeRuneInString(str) |
148 | 175 | if len(str) <= length { // If we're already past the end, there is nothing else to parse. |
149 | return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | |
176 | prop := property(graphemeCodePoints, r) | |
177 | return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (runeWidth(r, prop) << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | |
150 | 178 | } |
151 | 179 | |
152 | 180 | // If we don't know the state, determine it now. |
153 | var graphemeState, wordState, sentenceState, lineState int | |
181 | var graphemeState, wordState, sentenceState, lineState, firstProp int | |
154 | 182 | remainder := str[length:] |
155 | 183 | if state < 0 { |
156 | graphemeState, _ = transitionGraphemeState(state, r) | |
184 | graphemeState, firstProp, _ = transitionGraphemeState(state, r) | |
157 | 185 | wordState, _ = transitionWordBreakState(state, r, nil, remainder) |
158 | 186 | sentenceState, _ = transitionSentenceBreakState(state, r, nil, remainder) |
159 | 187 | lineState, _ = transitionLineBreakState(state, r, nil, remainder) |
162 | 190 | wordState = (state >> shiftWordState) & maskWordState |
163 | 191 | sentenceState = (state >> shiftSentenceState) & maskSentenceState |
164 | 192 | lineState = (state >> shiftLineState) & maskLineState |
193 | firstProp = state >> shiftPropState | |
165 | 194 | } |
166 | 195 | |
167 | 196 | // Transition until we find a grapheme cluster boundary. |
168 | var ( | |
169 | graphemeBoundary, wordBoundary, sentenceBoundary bool | |
170 | lineBreak int | |
171 | ) | |
197 | width := runeWidth(r, firstProp) | |
172 | 198 | for { |
199 | var ( | |
200 | graphemeBoundary, wordBoundary, sentenceBoundary bool | |
201 | lineBreak, prop int | |
202 | ) | |
203 | ||
173 | 204 | r, l := utf8.DecodeRuneInString(remainder) |
174 | 205 | remainder = str[length+l:] |
175 | 206 | |
176 | graphemeState, graphemeBoundary = transitionGraphemeState(graphemeState, r) | |
207 | graphemeState, prop, graphemeBoundary = transitionGraphemeState(graphemeState, r) | |
177 | 208 | wordState, wordBoundary = transitionWordBreakState(wordState, r, nil, remainder) |
178 | 209 | sentenceState, sentenceBoundary = transitionSentenceBreakState(sentenceState, r, nil, remainder) |
179 | 210 | lineState, lineBreak = transitionLineBreakState(lineState, r, nil, remainder) |
180 | 211 | |
181 | 212 | if graphemeBoundary { |
182 | boundary := lineBreak | |
213 | boundary := lineBreak | (width << ShiftWidth) | |
183 | 214 | if wordBoundary { |
184 | 215 | boundary |= 1 << shiftWord |
185 | 216 | } |
189 | 220 | return str[:length], str[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState) | (prop << shiftPropState) |
190 | 221 | } |
191 | 222 | |
223 | if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL { | |
224 | width += runeWidth(r, prop) | |
225 | } else if firstProp == prExtendedPictographic { | |
226 | if r == 0xfe0e { | |
227 | width = 1 | |
228 | } else { | |
229 | width = 2 | |
230 | } | |
231 | } | |
232 | ||
192 | 233 | length += l |
193 | 234 | if len(str) <= length { |
194 | return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | |
235 | return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (width << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | (prop << shiftPropState) | |
195 | 236 | } |
196 | 237 | } |
197 | 238 | } |
0 | package uniseg | |
1 | ||
2 | // runeWidth returns the monospace width for the given rune. The provided | |
3 | // grapheme property is a value mapped by the [graphemeCodePoints] table. | |
4 | // | |
5 | // Every rune has a width of 1, except for runes with the following properties | |
6 | // (evaluated in this order): | |
7 | // | |
8 | // - Control, CR, LF, Extend, ZWJ: Width of 0 | |
9 | // - Regional Indicator: Width of 2 | |
10 | // - Extended Pictographic: Width of 2, or 1 if Emoji Presentation is "No" | |
11 | // - \u2e3a, TWO-EM DASH: Width of 3 | |
12 | // - \u2e3b, THREE-EM DASH: Width of 4 | |
13 | // - East-Asian width Fullwidth and Wide: Width of 2 (Ambiguous and Neutral | |
14 | // have a width of 1) | |
15 | func runeWidth(r rune, graphemeProperty int) int { | |
16 | switch graphemeProperty { | |
17 | case prControl, prCR, prLF, prExtend, prZWJ: | |
18 | return 0 | |
19 | case prRegionalIndicator: | |
20 | return 2 | |
21 | case prExtendedPictographic: | |
22 | if property(emojiPresentation, r) == prEmojiPresentation { | |
23 | return 2 | |
24 | } | |
25 | return 1 | |
26 | } | |
27 | ||
28 | switch r { | |
29 | case 0x2e3a: | |
30 | return 3 | |
31 | case 0x2e3b: | |
32 | return 4 | |
33 | } | |
34 | ||
35 | switch property(eastAsianWidth, r) { | |
36 | case prW, prF: | |
37 | return 2 | |
38 | } | |
39 | ||
40 | return 1 | |
41 | } | |
42 | ||
43 | // StringWidth returns the monospace width for the given string, that is, the | |
44 | // number of same-size cells to be occupied by the string. | |
45 | func StringWidth(s string) (width int) { | |
46 | state := -1 | |
47 | for len(s) > 0 { | |
48 | var w int | |
49 | _, s, w, state = FirstGraphemeClusterInString(s, state) | |
50 | width += w | |
51 | } | |
52 | return | |
53 | } |
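To make the per-rune rules above concrete, here is a simplified, standard-library-only sketch. It is not the package's algorithm: it sums widths rune by rune instead of per grapheme cluster, it ignores emoji and regional indicators, and its list of wide ranges is a small hand-picked assumption rather than the full East Asian Width data. The names `simpleRuneWidth` and `simpleStringWidth` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// simpleRuneWidth approximates the monospace width of a single rune.
// The wide ranges below are an illustrative subset, not the full
// East Asian Width table.
func simpleRuneWidth(r rune) int {
	switch {
	case unicode.IsControl(r), unicode.Is(unicode.Mn, r), r == 0x200d:
		return 0 // Control characters, nonspacing marks, zero width joiner.
	case r == 0x2e3a:
		return 3 // TWO-EM DASH.
	case r == 0x2e3b:
		return 4 // THREE-EM DASH.
	case r >= 0x4e00 && r <= 0x9fff, // CJK Unified Ideographs.
		r >= 0xac00 && r <= 0xd7a3, // Hangul Syllables.
		r >= 0xff01 && r <= 0xff60: // Fullwidth forms.
		return 2
	}
	return 1
}

// simpleStringWidth adds up the widths of all runes in s.
func simpleStringWidth(s string) (width int) {
	for _, r := range s {
		width += simpleRuneWidth(r)
	}
	return
}

func main() {
	fmt.Println(simpleStringWidth("Ka\u0308se"))   // 4
	fmt.Println(simpleStringWidth("\u79f0\u8c13")) // 4
	fmt.Println(simpleStringWidth("a\u2e3a"))      // 4
}
```

A per-rune sum like this gives wrong results for multi-rune clusters such as flags or emoji joined by a zero width joiner, which is why the change in this diff computes the width per grapheme cluster.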
0 | package uniseg | |
1 | ||
2 | import "testing" | |
3 | ||
4 | // widthTestCases is a list of test cases for the calculation of string widths. | |
5 | var widthTestCases = []struct { | |
6 | original string | |
7 | expected int | |
8 | }{ | |
9 | {"", 0}, // Control | |
10 | {"\b", 0}, | |
11 | {"\x00", 0}, | |
12 | {"\x05", 0}, | |
13 | {"\a", 0}, | |
14 | {"\u000a", 0}, // LF | |
15 | {"\u000d", 0}, // CR | |
16 | {"\n", 0}, | |
17 | {"\v", 0}, | |
18 | {"\f", 0}, | |
19 | {"\r", 0}, | |
20 | {"\x0e", 0}, | |
21 | {"\x0f", 0}, | |
22 | {"\u0300", 0}, // Extend | |
23 | {"\u200d", 0}, // ZERO WIDTH JOINER | |
24 | {"a", 1}, | |
25 | {"\u1b05", 1}, // N | |
26 | {"\u2985", 1}, // Na | |
27 | {"\U0001F100", 1}, // A | |
28 | {"\uff61", 1}, // H | |
29 | {"\ufe6a", 2}, // W | |
30 | {"\uff01", 2}, // F | |
31 | {"\u2e3a", 3}, // TWO-EM DASH | |
32 | {"\u2e3b", 4}, // THREE-EM DASH | |
33 | {"\u00a9", 1}, // Extended Pictographic (Emoji Presentation = No) | |
34 | {"\U0001F60A", 2}, // Extended Pictographic (Emoji Presentation = Yes) | |
35 | {"\U0001F1E6", 2}, // Regional Indicator | |
36 | {"\u061c\u061c", 0}, | |
37 | {"\u061c\u000a", 0}, | |
38 | {"\u061c\u000d", 0}, | |
39 | {"\u061c\u0300", 0}, | |
40 | {"\u061c\u200d", 0}, | |
41 | {"\u061ca", 1}, | |
42 | {"\u061c\u1b05", 1}, | |
43 | {"\u061c\u2985", 1}, | |
44 | {"\u061c\U0001F100", 1}, | |
45 | {"\u061c\uff61", 1}, | |
46 | {"\u061c\ufe6a", 2}, | |
47 | {"\u061c\uff01", 2}, | |
48 | {"\u061c\u2e3a", 3}, | |
49 | {"\u061c\u2e3b", 4}, | |
50 | {"\u061c\u00a9", 1}, | |
51 | {"\u061c\U0001F60A", 2}, | |
52 | {"\u061c\U0001F1E6", 2}, | |
53 | {"\u000a\u061c", 0}, | |
54 | {"\u000a\u000a", 0}, | |
55 | {"\u000a\u000d", 0}, | |
56 | {"\u000a\u0300", 0}, | |
57 | {"\u000a\u200d", 0}, | |
58 | {"\u000aa", 1}, | |
59 | {"\u000a\u1b05", 1}, | |
60 | {"\u000a\u2985", 1}, | |
61 | {"\u000a\U0001F100", 1}, | |
62 | {"\u000a\uff61", 1}, | |
63 | {"\u000a\ufe6a", 2}, | |
64 | {"\u000a\uff01", 2}, | |
65 | {"\u000a\u2e3a", 3}, | |
66 | {"\u000a\u2e3b", 4}, | |
67 | {"\u000a\u00a9", 1}, | |
68 | {"\u000a\U0001F60A", 2}, | |
69 | {"\u000a\U0001F1E6", 2}, | |
70 | {"\u000d\u061c", 0}, | |
71 | {"\u000d\u000a", 0}, | |
72 | {"\u000d\u000d", 0}, | |
73 | {"\u000d\u0300", 0}, | |
74 | {"\u000d\u200d", 0}, | |
75 | {"\u000da", 1}, | |
76 | {"\u000d\u1b05", 1}, | |
77 | {"\u000d\u2985", 1}, | |
78 | {"\u000d\U0001F100", 1}, | |
79 | {"\u000d\uff61", 1}, | |
80 | {"\u000d\ufe6a", 2}, | |
81 | {"\u000d\uff01", 2}, | |
82 | {"\u000d\u2e3a", 3}, | |
83 | {"\u000d\u2e3b", 4}, | |
84 | {"\u000d\u00a9", 1}, | |
85 | {"\u000d\U0001F60A", 2}, | |
86 | {"\u000d\U0001F1E6", 2}, | |
87 | {"\u0300\u061c", 0}, | |
88 | {"\u0300\u000a", 0}, | |
89 | {"\u0300\u000d", 0}, | |
90 | {"\u0300\u0300", 0}, | |
91 | {"\u0300\u200d", 0}, | |
92 | {"\u0300a", 1}, | |
93 | {"\u0300\u1b05", 1}, | |
94 | {"\u0300\u2985", 1}, | |
95 | {"\u0300\U0001F100", 1}, | |
96 | {"\u0300\uff61", 1}, | |
97 | {"\u0300\ufe6a", 2}, | |
98 | {"\u0300\uff01", 2}, | |
99 | {"\u0300\u2e3a", 3}, | |
100 | {"\u0300\u2e3b", 4}, | |
101 | {"\u0300\u00a9", 1}, | |
102 | {"\u0300\U0001F60A", 2}, | |
103 | {"\u0300\U0001F1E6", 2}, | |
104 | {"\u200d\u061c", 0}, | |
105 | {"\u200d\u000a", 0}, | |
106 | {"\u200d\u000d", 0}, | |
107 | {"\u200d\u0300", 0}, | |
108 | {"\u200d\u200d", 0}, | |
109 | {"\u200da", 1}, | |
110 | {"\u200d\u1b05", 1}, | |
111 | {"\u200d\u2985", 1}, | |
112 | {"\u200d\U0001F100", 1}, | |
113 | {"\u200d\uff61", 1}, | |
114 | {"\u200d\ufe6a", 2}, | |
115 | {"\u200d\uff01", 2}, | |
116 | {"\u200d\u2e3a", 3}, | |
117 | {"\u200d\u2e3b", 4}, | |
118 | {"\u200d\u00a9", 1}, | |
119 | {"\u200d\U0001F60A", 2}, | |
120 | {"\u200d\U0001F1E6", 2}, | |
121 | {"a\u061c", 1}, | |
122 | {"a\u000a", 1}, | |
123 | {"a\u000d", 1}, | |
124 | {"a\u0300", 1}, | |
125 | {"a\u200d", 1}, | |
126 | {"aa", 2}, | |
127 | {"a\u1b05", 2}, | |
128 | {"a\u2985", 2}, | |
129 | {"a\U0001F100", 2}, | |
130 | {"a\uff61", 2}, | |
131 | {"a\ufe6a", 3}, | |
132 | {"a\uff01", 3}, | |
133 | {"a\u2e3a", 4}, | |
134 | {"a\u2e3b", 5}, | |
135 | {"a\u00a9", 2}, | |
136 | {"a\U0001F60A", 3}, | |
137 | {"a\U0001F1E6", 3}, | |
138 | {"\u1b05\u061c", 1}, | |
139 | {"\u1b05\u000a", 1}, | |
140 | {"\u1b05\u000d", 1}, | |
141 | {"\u1b05\u0300", 1}, | |
142 | {"\u1b05\u200d", 1}, | |
143 | {"\u1b05a", 2}, | |
144 | {"\u1b05\u1b05", 2}, | |
145 | {"\u1b05\u2985", 2}, | |
146 | {"\u1b05\U0001F100", 2}, | |
147 | {"\u1b05\uff61", 2}, | |
148 | {"\u1b05\ufe6a", 3}, | |
149 | {"\u1b05\uff01", 3}, | |
150 | {"\u1b05\u2e3a", 4}, | |
151 | {"\u1b05\u2e3b", 5}, | |
152 | {"\u1b05\u00a9", 2}, | |
153 | {"\u1b05\U0001F60A", 3}, | |
154 | {"\u1b05\U0001F1E6", 3}, | |
155 | {"\u2985\u061c", 1}, | |
156 | {"\u2985\u000a", 1}, | |
157 | {"\u2985\u000d", 1}, | |
158 | {"\u2985\u0300", 1}, | |
159 | {"\u2985\u200d", 1}, | |
160 | {"\u2985a", 2}, | |
161 | {"\u2985\u1b05", 2}, | |
162 | {"\u2985\u2985", 2}, | |
163 | {"\u2985\U0001F100", 2}, | |
164 | {"\u2985\uff61", 2}, | |
165 | {"\u2985\ufe6a", 3}, | |
166 | {"\u2985\uff01", 3}, | |
167 | {"\u2985\u2e3a", 4}, | |
168 | {"\u2985\u2e3b", 5}, | |
169 | {"\u2985\u00a9", 2}, | |
170 | {"\u2985\U0001F60A", 3}, | |
171 | {"\u2985\U0001F1E6", 3}, | |
172 | {"\U0001F100\u061c", 1}, | |
173 | {"\U0001F100\u000a", 1}, | |
174 | {"\U0001F100\u000d", 1}, | |
175 | {"\U0001F100\u0300", 1}, | |
176 | {"\U0001F100\u200d", 1}, | |
177 | {"\U0001F100a", 2}, | |
178 | {"\U0001F100\u1b05", 2}, | |
179 | {"\U0001F100\u2985", 2}, | |
180 | {"\U0001F100\U0001F100", 2}, | |
181 | {"\U0001F100\uff61", 2}, | |
182 | {"\U0001F100\ufe6a", 3}, | |
183 | {"\U0001F100\uff01", 3}, | |
184 | {"\U0001F100\u2e3a", 4}, | |
185 | {"\U0001F100\u2e3b", 5}, | |
186 | {"\U0001F100\u00a9", 2}, | |
187 | {"\U0001F100\U0001F60A", 3}, | |
188 | {"\U0001F100\U0001F1E6", 3}, | |
189 | {"\uff61\u061c", 1}, | |
190 | {"\uff61\u000a", 1}, | |
191 | {"\uff61\u000d", 1}, | |
192 | {"\uff61\u0300", 1}, | |
193 | {"\uff61\u200d", 1}, | |
194 | {"\uff61a", 2}, | |
195 | {"\uff61\u1b05", 2}, | |
196 | {"\uff61\u2985", 2}, | |
197 | {"\uff61\U0001F100", 2}, | |
198 | {"\uff61\uff61", 2}, | |
199 | {"\uff61\ufe6a", 3}, | |
200 | {"\uff61\uff01", 3}, | |
201 | {"\uff61\u2e3a", 4}, | |
202 | {"\uff61\u2e3b", 5}, | |
203 | {"\uff61\u00a9", 2}, | |
204 | {"\uff61\U0001F60A", 3}, | |
205 | {"\uff61\U0001F1E6", 3}, | |
206 | {"\ufe6a\u061c", 2}, | |
207 | {"\ufe6a\u000a", 2}, | |
208 | {"\ufe6a\u000d", 2}, | |
209 | {"\ufe6a\u0300", 2}, | |
210 | {"\ufe6a\u200d", 2}, | |
211 | {"\ufe6aa", 3}, | |
212 | {"\ufe6a\u1b05", 3}, | |
213 | {"\ufe6a\u2985", 3}, | |
214 | {"\ufe6a\U0001F100", 3}, | |
215 | {"\ufe6a\uff61", 3}, | |
216 | {"\ufe6a\ufe6a", 4}, | |
217 | {"\ufe6a\uff01", 4}, | |
218 | {"\ufe6a\u2e3a", 5}, | |
219 | {"\ufe6a\u2e3b", 6}, | |
220 | {"\ufe6a\u00a9", 3}, | |
221 | {"\ufe6a\U0001F60A", 4}, | |
222 | {"\ufe6a\U0001F1E6", 4}, | |
223 | {"\uff01\u061c", 2}, | |
224 | {"\uff01\u000a", 2}, | |
225 | {"\uff01\u000d", 2}, | |
226 | {"\uff01\u0300", 2}, | |
227 | {"\uff01\u200d", 2}, | |
228 | {"\uff01a", 3}, | |
229 | {"\uff01\u1b05", 3}, | |
230 | {"\uff01\u2985", 3}, | |
231 | {"\uff01\U0001F100", 3}, | |
232 | {"\uff01\uff61", 3}, | |
233 | {"\uff01\ufe6a", 4}, | |
234 | {"\uff01\uff01", 4}, | |
235 | {"\uff01\u2e3a", 5}, | |
236 | {"\uff01\u2e3b", 6}, | |
237 | {"\uff01\u00a9", 3}, | |
238 | {"\uff01\U0001F60A", 4}, | |
239 | {"\uff01\U0001F1E6", 4}, | |
240 | {"\u2e3a\u061c", 3}, | |
241 | {"\u2e3a\u000a", 3}, | |
242 | {"\u2e3a\u000d", 3}, | |
243 | {"\u2e3a\u0300", 3}, | |
244 | {"\u2e3a\u200d", 3}, | |
245 | {"\u2e3aa", 4}, | |
246 | {"\u2e3a\u1b05", 4}, | |
247 | {"\u2e3a\u2985", 4}, | |
248 | {"\u2e3a\U0001F100", 4}, | |
249 | {"\u2e3a\uff61", 4}, | |
250 | {"\u2e3a\ufe6a", 5}, | |
251 | {"\u2e3a\uff01", 5}, | |
252 | {"\u2e3a\u2e3a", 6}, | |
253 | {"\u2e3a\u2e3b", 7}, | |
254 | {"\u2e3a\u00a9", 4}, | |
255 | {"\u2e3a\U0001F60A", 5}, | |
256 | {"\u2e3a\U0001F1E6", 5}, | |
257 | {"\u2e3b\u061c", 4}, | |
258 | {"\u2e3b\u000a", 4}, | |
259 | {"\u2e3b\u000d", 4}, | |
260 | {"\u2e3b\u0300", 4}, | |
261 | {"\u2e3b\u200d", 4}, | |
262 | {"\u2e3ba", 5}, | |
263 | {"\u2e3b\u1b05", 5}, | |
264 | {"\u2e3b\u2985", 5}, | |
265 | {"\u2e3b\U0001F100", 5}, | |
266 | {"\u2e3b\uff61", 5}, | |
267 | {"\u2e3b\ufe6a", 6}, | |
268 | {"\u2e3b\uff01", 6}, | |
269 | {"\u2e3b\u2e3a", 7}, | |
270 | {"\u2e3b\u2e3b", 8}, | |
271 | {"\u2e3b\u00a9", 5}, | |
272 | {"\u2e3b\U0001F60A", 6}, | |
273 | {"\u2e3b\U0001F1E6", 6}, | |
274 | {"\u00a9\u061c", 1}, | |
275 | {"\u00a9\u000a", 1}, | |
276 | {"\u00a9\u000d", 1}, | |
277 | {"\u00a9\u0300", 2}, // This is really 1 but we can't handle it. | |
278 | {"\u00a9\u200d", 2}, | |
279 | {"\u00a9a", 2}, | |
280 | {"\u00a9\u1b05", 2}, | |
281 | {"\u00a9\u2985", 2}, | |
282 | {"\u00a9\U0001F100", 2}, | |
283 | {"\u00a9\uff61", 2}, | |
284 | {"\u00a9\ufe6a", 3}, | |
285 | {"\u00a9\uff01", 3}, | |
286 | {"\u00a9\u2e3a", 4}, | |
287 | {"\u00a9\u2e3b", 5}, | |
288 | {"\u00a9\u00a9", 2}, | |
289 | {"\u00a9\U0001F60A", 3}, | |
290 | {"\u00a9\U0001F1E6", 3}, | |
291 | {"\U0001F60A\u061c", 2}, | |
292 | {"\U0001F60A\u000a", 2}, | |
293 | {"\U0001F60A\u000d", 2}, | |
294 | {"\U0001F60A\u0300", 2}, | |
295 | {"\U0001F60A\u200d", 2}, | |
296 | {"\U0001F60Aa", 3}, | |
297 | {"\U0001F60A\u1b05", 3}, | |
298 | {"\U0001F60A\u2985", 3}, | |
299 | {"\U0001F60A\U0001F100", 3}, | |
300 | {"\U0001F60A\uff61", 3}, | |
301 | {"\U0001F60A\ufe6a", 4}, | |
302 | {"\U0001F60A\uff01", 4}, | |
303 | {"\U0001F60A\u2e3a", 5}, | |
304 | {"\U0001F60A\u2e3b", 6}, | |
305 | {"\U0001F60A\u00a9", 3}, | |
306 | {"\U0001F60A\U0001F60A", 4}, | |
307 | {"\U0001F60A\U0001F1E6", 4}, | |
308 | {"\U0001F1E6\u061c", 2}, | |
309 | {"\U0001F1E6\u000a", 2}, | |
310 | {"\U0001F1E6\u000d", 2}, | |
311 | {"\U0001F1E6\u0300", 2}, | |
312 | {"\U0001F1E6\u200d", 2}, | |
313 | {"\U0001F1E6a", 3}, | |
314 | {"\U0001F1E6\u1b05", 3}, | |
315 | {"\U0001F1E6\u2985", 3}, | |
316 | {"\U0001F1E6\U0001F100", 3}, | |
317 | {"\U0001F1E6\uff61", 3}, | |
318 | {"\U0001F1E6\ufe6a", 4}, | |
319 | {"\U0001F1E6\uff01", 4}, | |
320 | {"\U0001F1E6\u2e3a", 5}, | |
321 | {"\U0001F1E6\u2e3b", 6}, | |
322 | {"\U0001F1E6\u00a9", 3}, | |
323 | {"\U0001F1E6\U0001F60A", 4}, | |
324 | {"\U0001F1E6\U0001F1E6", 2}, | |
325 | {"Ka\u0308se", 4}, // Käse (German, "cheese") | |
326 | {"\U0001f3f3\ufe0f\u200d\U0001f308", 2}, // Rainbow flag | |
327 | {"\U0001f1e9\U0001f1ea", 2}, // German flag | |
328 | {"\u0916\u093e", 2}, // खा (Hindi, "eat") | |
329 | {"\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466", 2}, // Family: Man, Woman, Girl, Boy | |
330 | {"\u1112\u116f\u11b6", 2}, // 훯 (Hangul, conjoining Jamo, "h+weo+lh") | |
331 | {"\ud6ef", 2}, // 훯 (Hangul, precomposed, "h+weo+lh") | |
332 | {"\u79f0\u8c13", 4}, // 称谓 (Chinese, "title") | |
333 | {"\u0e1c\u0e39\u0e49", 1}, // ผู้ (Thai, "person") | |
334 | {"\u0623\u0643\u062a\u0648\u0628\u0631", 6}, // أكتوبر (Arabic, "October") | |
335 | {"\ua992\ua997\ua983", 3}, // ꦒꦗꦃ (Javanese, "elephant") | |
336 | {"\u263a", 1}, // White smiling face | |
337 | {"\u263a\ufe0f", 2}, // White smiling face (with variation selector 16 = emoji presentation) | |
338 | {"\u231b", 2}, // Hourglass | |
339 | {"\u231b\ufe0e", 1}, // Hourglass (with variation selector 15 = text presentation) | |
340 | } | |
341 | ||
342 | // String width tests using the StringWidth function. | |
343 | func TestWidthStringWidth(t *testing.T) { | |
344 | for index, testCase := range widthTestCases { | |
345 | actual := StringWidth(testCase.original) | |
346 | if actual != testCase.expected { | |
347 | t.Errorf("StringWidth(%q) is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
348 | } | |
349 | } | |
350 | } | |
351 | ||
352 | // String width tests using the Graphemes type. | |
353 | func TestWidthGraphemes(t *testing.T) { | |
354 | for index, testCase := range widthTestCases { | |
355 | var actual int | |
356 | graphemes := NewGraphemes(testCase.original) | |
357 | for graphemes.Next() { | |
358 | actual += graphemes.Width() | |
359 | } | |
360 | if actual != testCase.expected { | |
361 | t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
362 | } | |
363 | } | |
364 | } | |
365 | ||
366 | // String width tests using the FirstGraphemeCluster function. | |
367 | func TestWidthGraphemesFunctionBytes(t *testing.T) { | |
368 | for index, testCase := range widthTestCases { | |
369 | var actual, width int | |
370 | state := -1 | |
371 | text := []byte(testCase.original) | |
372 | for len(text) > 0 { | |
373 | _, text, width, state = FirstGraphemeCluster(text, state) | |
374 | actual += width | |
375 | } | |
376 | if actual != testCase.expected { | |
377 | t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
378 | } | |
379 | } | |
380 | } | |
381 | ||
382 | // String width tests using the FirstGraphemeClusterInString function. | |
383 | func TestWidthGraphemesFunctionString(t *testing.T) { | |
384 | for index, testCase := range widthTestCases { | |
385 | var actual, width int | |
386 | state := -1 | |
387 | text := testCase.original | |
388 | for len(text) > 0 { | |
389 | _, text, width, state = FirstGraphemeClusterInString(text, state) | |
390 | actual += width | |
391 | } | |
392 | if actual != testCase.expected { | |
393 | t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
394 | } | |
395 | } | |
396 | } | |
397 | ||
398 | // String width tests using the Step function. | |
399 | func TestWidthStepBytes(t *testing.T) { | |
400 | for index, testCase := range widthTestCases { | |
401 | var actual, boundaries int | |
402 | state := -1 | |
403 | text := []byte(testCase.original) | |
404 | for len(text) > 0 { | |
405 | _, text, boundaries, state = Step(text, state) | |
406 | actual += boundaries >> ShiftWidth | |
407 | } | |
408 | if actual != testCase.expected { | |
409 | t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
410 | } | |
411 | } | |
412 | } | |
413 | ||
414 | // String width tests using the StepString function. | |
415 | func TestWidthStepString(t *testing.T) { | |
416 | for index, testCase := range widthTestCases { | |
417 | var actual, boundaries int | |
418 | state := -1 | |
419 | text := testCase.original | |
420 | for len(text) > 0 { | |
421 | _, text, boundaries, state = StepString(text, state) | |
422 | actual += boundaries >> ShiftWidth | |
423 | } | |
424 | if actual != testCase.expected { | |
425 | t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index) | |
426 | } | |
427 | } | |
428 | } |
3 | 3 | |
4 | 4 | // wordBreakTestCases are word break test cases taken from |
5 | 5 | // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/WordBreakTest.txt |
6 | // on July 25, 2022. See | |
6 | // on September 10, 2022. See | |
7 | 7 | // https://www.unicode.org/license.html for the Unicode license agreement. |
8 | 8 | var wordBreakTestCases = []testCase{ |
9 | 9 | {original: "\u0001\u0001", expected: [][]rune{{0x0001}, {0x0001}}}, // ÷ [0.2] <START OF HEADING> (Other) ÷ [999.0] <START OF HEADING> (Other) ÷ [0.3] |
6 | 6 | // and |
7 | 7 | // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt |
8 | 8 | // ("Extended_Pictographic" only) |
9 | // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode | |
9 | // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode | |
10 | 10 | // license agreement. |
11 | 11 | var workBreakCodePoints = [][3]int{ |
12 | 12 | {0x000A, 0x000A, prLF}, // Cc <control-000A> |
623 | 623 | {0x212A, 0x212D, prALetter}, // L& [4] KELVIN SIGN..BLACK-LETTER CAPITAL C |
624 | 624 | {0x212F, 0x2134, prALetter}, // L& [6] SCRIPT SMALL E..SCRIPT SMALL O |
625 | 625 | {0x2135, 0x2138, prALetter}, // Lo [4] ALEF SYMBOL..DALET SYMBOL |
626 | {0x2139, 0x2139, prExtendedPictographic}, // E0.6 [1] (ℹ️) information | |
626 | 627 | {0x2139, 0x2139, prALetter}, // L& INFORMATION SOURCE |
627 | {0x2139, 0x2139, prExtendedPictographic}, // E0.6 [1] (ℹ️) information | |
628 | 628 | {0x213C, 0x213F, prALetter}, // L& [4] DOUBLE-STRUCK SMALL PI..DOUBLE-STRUCK CAPITAL PI |
629 | 629 | {0x2145, 0x2149, prALetter}, // L& [5] DOUBLE-STRUCK ITALIC CAPITAL D..DOUBLE-STRUCK ITALIC SMALL J |
630 | 630 | {0x214E, 0x214E, prALetter}, // L& TURNED SMALL F |