golang-github-rivo-uniseg / db67758

Merge pull request #26 from rivo/width

Monospace string width calculation (à la wcwidth).

rivo authored 1 year, 8 months ago; GitHub committed 1 year, 8 months ago.

20 changed file(s) with 1115 addition(s) and 147 deletion(s).
22 [![Go Reference](https://pkg.go.dev/badge/github.com/rivo/uniseg.svg)](https://pkg.go.dev/github.com/rivo/uniseg)
33 [![Go Report](https://img.shields.io/badge/go%20report-A%2B-brightgreen.svg)](https://goreportcard.com/report/github.com/rivo/uniseg)
44
5 This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/) and Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 14.0.0).
5 This Go package implements Unicode Text Segmentation according to [Unicode Standard Annex #29](https://unicode.org/reports/tr29/), Unicode Line Breaking according to [Unicode Standard Annex #14](https://unicode.org/reports/tr14/) (Unicode version 14.0.0), and monospace font string width calculation similar to [wcwidth](https://man7.org/linux/man-pages/man3/wcwidth.3.html).
66
77 ## Background
88
3030
3131 Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
3232
33 ### Monospace Width
34
35 Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See [here](https://pkg.go.dev/github.com/rivo/uniseg#hdr-Monospace_Width) for more information.
36
3337 ## Installation
3438
3539 ```bash
4448 n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
4549 fmt.Println(n)
4650 // 2
51 ```
52
53 ### Calculating the Monospace String Width
54
55 ```go
56 width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
57 fmt.Println(width)
58 // 5
4759 ```
4860
4961 ### Using the [`Graphemes`](https://pkg.go.dev/github.com/rivo/uniseg#Graphemes) Class
00 /*
1 Package uniseg implements Unicode Text Segmentation and Unicode Line Breaking.
2 Unicode Text Segmentation conforms to Unicode Standard Annex #29
3 (https://unicode.org/reports/tr29/) and Unicode Line Breaking conforms to
4 Unicode Standard Annex #14 (https://unicode.org/reports/tr14/).
1 Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
2 string width calculation for monospace fonts. Unicode Text Segmentation conforms
3 to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
4 Line Breaking conforms to Unicode Standard Annex #14
5 (https://unicode.org/reports/tr14/).
56
67 In short, using this package, you can split a string into grapheme clusters
78 (what people would usually refer to as a "character"), into words, and into
1112 other languages. Additionally, you can use it to implement line breaking (or
1213 "word wrapping"), that is, to determine where text can be broken over to the
1314 next line when the width of the line is not big enough to fit the entire text.
15 Finally, you can use it to calculate the display width of a string for monospace
16 fonts.
1417
15 Grapheme Clusters
18 # Getting Started
19
20 If you just want to count the number of characters in a string, you can use
21 [GraphemeClusterCount]. If you want to determine the display width of a string,
22 you can use [StringWidth]. If you want to iterate over a string, you can use
23 [Step], [StepString], or the [Graphemes] class (more convenient but less
24 performant). This will provide you with all information: grapheme clusters,
25 word boundaries, sentence boundaries, line breaks, and monospace character
26 widths. The specialized functions [FirstGraphemeCluster],
27 [FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
28 [FirstSentence], and [FirstSentenceInString] can be used if only one type of
29 information is needed.
30
31 # Grapheme Clusters
1632
1733 Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
1834 character. But its string representation actually has 14 bytes, so counting
2036 either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function
2137 utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) will both return 4.
2238
23 The uniseg.GraphemeClusterCount(str) function will return 1 for the rainbow flag
24 emoji. The Graphemes class and a variety of functions in this package will allow
25 you to split strings into its grapheme clusters.
39 The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
40 The Graphemes class and a variety of functions in this package will allow you to
41 split strings into their grapheme clusters.
2642
27 Word Boundaries
43 # Word Boundaries
2844
2945 Word boundaries are used in a number of different contexts. The most familiar
3046 ones are selection (double-click mouse selection), cursor movement ("move to
3248 search and replace. This package provides methods for determining word
3349 boundaries.
3450
35 Sentence Boundaries
51 # Sentence Boundaries
3652
3753 Sentence boundaries are often used for triple-click or some other method of
3854 selecting or iterating through blocks of text that are larger than single words.
4056 database queries. This package provides methods for determining sentence
4157 boundaries.
4258
43 Line Breaking
59 # Line Breaking
4460
4561 Line breaking, also known as word wrapping, is the process of breaking a section
4662 of text into lines such that it will fit in the available width of a page,
4864 positions in a string where a line must be broken, may be broken, or must not be
4965 broken.
5066
67 # Monospace Width
68
69 Monospace width, as referred to in this package, is the width of a string in a
70 monospace font. This is commonly used in terminal user interfaces or text
71 displays or editors that don't support proportional fonts. A width of 1
72 corresponds to a single character cell. The C function [wcwidth()] and its
73 implementations in other programming languages are in widespread use for the
74 same purpose. However, there is no standard for the calculation of such
75 widths, and this package differs from wcwidth() in a number of ways, presumably
76 to generate more visually pleasing results.
77
78 To start, we assume that every code point has a width of 1, with the following
79 exceptions:
80
81 - Code points with grapheme cluster break properties Control, CR, LF, Extend,
82 and ZWJ have a width of 0.
83 - U+2E3A, Two-Em Dash, has a width of 3.
84 - U+2E3B, Three-Em Dash, has a width of 4.
85 - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
86 (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
87 have a width of 1.)
88 - Code points with grapheme cluster break property Regional Indicator have a
89 width of 2.
90 - Code points with grapheme cluster break property Extended Pictographic have
91 a width of 2, unless their Emoji Presentation flag is "No", in which case
92 the width is 1.
93
94 For Hangul grapheme clusters composed of conjoining Jamo and for Regional
95 Indicators (flags), all code points except the first one have a width of 0. For
96 grapheme clusters starting with an Extended Pictographic, any additional code
97 point will force a total width of 2, except if the Variation Selector-15
98 (U+FE0E) is included, in which case the total width is always 1.
99
100 Note that whether these widths appear correct depends on your application's
101 render engine, the extent to which it conforms to the Unicode Standard, and its
102 choice of font.
103
104 [wcwidth()]: https://man7.org/linux/man-pages/man3/wcwidth.3.html
51105 */
52106 package uniseg
33
44 // eastAsianWidth are taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/EastAsianWidth.txt
6 // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode
6 // and
7 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
8 // ("Extended_Pictographic" only)
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
710 // license agreement.
811 var eastAsianWidth = [][3]int{
912 {0x0000, 0x001F, prN}, // Cc [32] <control-0000>..<control-001F>
0 package uniseg
1
2 // Code generated via go generate from gen_properties.go. DO NOT EDIT.
3
4 // emojiPresentation are taken from
5 //
6 // and
7 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
8 // ("Extended_Pictographic" only)
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
10 // license agreement.
11 var emojiPresentation = [][3]int{
12 {0x231A, 0x231B, prEmojiPresentation}, // E0.6 [2] (⌚..⌛) watch..hourglass done
13 {0x23E9, 0x23EC, prEmojiPresentation}, // E0.6 [4] (⏩..⏬) fast-forward button..fast down button
14 {0x23F0, 0x23F0, prEmojiPresentation}, // E0.6 [1] (⏰) alarm clock
15 {0x23F3, 0x23F3, prEmojiPresentation}, // E0.6 [1] (⏳) hourglass not done
16 {0x25FD, 0x25FE, prEmojiPresentation}, // E0.6 [2] (◽..◾) white medium-small square..black medium-small square
17 {0x2614, 0x2615, prEmojiPresentation}, // E0.6 [2] (☔..☕) umbrella with rain drops..hot beverage
18 {0x2648, 0x2653, prEmojiPresentation}, // E0.6 [12] (♈..♓) Aries..Pisces
19 {0x267F, 0x267F, prEmojiPresentation}, // E0.6 [1] (♿) wheelchair symbol
20 {0x2693, 0x2693, prEmojiPresentation}, // E0.6 [1] (⚓) anchor
21 {0x26A1, 0x26A1, prEmojiPresentation}, // E0.6 [1] (⚡) high voltage
22 {0x26AA, 0x26AB, prEmojiPresentation}, // E0.6 [2] (⚪..⚫) white circle..black circle
23 {0x26BD, 0x26BE, prEmojiPresentation}, // E0.6 [2] (⚽..⚾) soccer ball..baseball
24 {0x26C4, 0x26C5, prEmojiPresentation}, // E0.6 [2] (⛄..⛅) snowman without snow..sun behind cloud
25 {0x26CE, 0x26CE, prEmojiPresentation}, // E0.6 [1] (⛎) Ophiuchus
26 {0x26D4, 0x26D4, prEmojiPresentation}, // E0.6 [1] (⛔) no entry
27 {0x26EA, 0x26EA, prEmojiPresentation}, // E0.6 [1] (⛪) church
28 {0x26F2, 0x26F3, prEmojiPresentation}, // E0.6 [2] (⛲..⛳) fountain..flag in hole
29 {0x26F5, 0x26F5, prEmojiPresentation}, // E0.6 [1] (⛵) sailboat
30 {0x26FA, 0x26FA, prEmojiPresentation}, // E0.6 [1] (⛺) tent
31 {0x26FD, 0x26FD, prEmojiPresentation}, // E0.6 [1] (⛽) fuel pump
32 {0x2705, 0x2705, prEmojiPresentation}, // E0.6 [1] (✅) check mark button
33 {0x270A, 0x270B, prEmojiPresentation}, // E0.6 [2] (✊..✋) raised fist..raised hand
34 {0x2728, 0x2728, prEmojiPresentation}, // E0.6 [1] (✨) sparkles
35 {0x274C, 0x274C, prEmojiPresentation}, // E0.6 [1] (❌) cross mark
36 {0x274E, 0x274E, prEmojiPresentation}, // E0.6 [1] (❎) cross mark button
37 {0x2753, 0x2755, prEmojiPresentation}, // E0.6 [3] (❓..❕) red question mark..white exclamation mark
38 {0x2757, 0x2757, prEmojiPresentation}, // E0.6 [1] (❗) red exclamation mark
39 {0x2795, 0x2797, prEmojiPresentation}, // E0.6 [3] (➕..➗) plus..divide
40 {0x27B0, 0x27B0, prEmojiPresentation}, // E0.6 [1] (➰) curly loop
41 {0x27BF, 0x27BF, prEmojiPresentation}, // E1.0 [1] (➿) double curly loop
42 {0x2B1B, 0x2B1C, prEmojiPresentation}, // E0.6 [2] (⬛..⬜) black large square..white large square
43 {0x2B50, 0x2B50, prEmojiPresentation}, // E0.6 [1] (⭐) star
44 {0x2B55, 0x2B55, prEmojiPresentation}, // E0.6 [1] (⭕) hollow red circle
45 {0x1F004, 0x1F004, prEmojiPresentation}, // E0.6 [1] (🀄) mahjong red dragon
46 {0x1F0CF, 0x1F0CF, prEmojiPresentation}, // E0.6 [1] (🃏) joker
47 {0x1F18E, 0x1F18E, prEmojiPresentation}, // E0.6 [1] (🆎) AB button (blood type)
48 {0x1F191, 0x1F19A, prEmojiPresentation}, // E0.6 [10] (🆑..🆚) CL button..VS button
49 {0x1F1E6, 0x1F1FF, prEmojiPresentation}, // E0.0 [26] (🇦..🇿) regional indicator symbol letter a..regional indicator symbol letter z
50 {0x1F201, 0x1F201, prEmojiPresentation}, // E0.6 [1] (🈁) Japanese “here” button
51 {0x1F21A, 0x1F21A, prEmojiPresentation}, // E0.6 [1] (🈚) Japanese “free of charge” button
52 {0x1F22F, 0x1F22F, prEmojiPresentation}, // E0.6 [1] (🈯) Japanese “reserved” button
53 {0x1F232, 0x1F236, prEmojiPresentation}, // E0.6 [5] (🈲..🈶) Japanese “prohibited” button..Japanese “not free of charge” button
54 {0x1F238, 0x1F23A, prEmojiPresentation}, // E0.6 [3] (🈸..🈺) Japanese “application” button..Japanese “open for business” button
55 {0x1F250, 0x1F251, prEmojiPresentation}, // E0.6 [2] (🉐..🉑) Japanese “bargain” button..Japanese “acceptable” button
56 {0x1F300, 0x1F30C, prEmojiPresentation}, // E0.6 [13] (🌀..🌌) cyclone..milky way
57 {0x1F30D, 0x1F30E, prEmojiPresentation}, // E0.7 [2] (🌍..🌎) globe showing Europe-Africa..globe showing Americas
58 {0x1F30F, 0x1F30F, prEmojiPresentation}, // E0.6 [1] (🌏) globe showing Asia-Australia
59 {0x1F310, 0x1F310, prEmojiPresentation}, // E1.0 [1] (🌐) globe with meridians
60 {0x1F311, 0x1F311, prEmojiPresentation}, // E0.6 [1] (🌑) new moon
61 {0x1F312, 0x1F312, prEmojiPresentation}, // E1.0 [1] (🌒) waxing crescent moon
62 {0x1F313, 0x1F315, prEmojiPresentation}, // E0.6 [3] (🌓..🌕) first quarter moon..full moon
63 {0x1F316, 0x1F318, prEmojiPresentation}, // E1.0 [3] (🌖..🌘) waning gibbous moon..waning crescent moon
64 {0x1F319, 0x1F319, prEmojiPresentation}, // E0.6 [1] (🌙) crescent moon
65 {0x1F31A, 0x1F31A, prEmojiPresentation}, // E1.0 [1] (🌚) new moon face
66 {0x1F31B, 0x1F31B, prEmojiPresentation}, // E0.6 [1] (🌛) first quarter moon face
67 {0x1F31C, 0x1F31C, prEmojiPresentation}, // E0.7 [1] (🌜) last quarter moon face
68 {0x1F31D, 0x1F31E, prEmojiPresentation}, // E1.0 [2] (🌝..🌞) full moon face..sun with face
69 {0x1F31F, 0x1F320, prEmojiPresentation}, // E0.6 [2] (🌟..🌠) glowing star..shooting star
70 {0x1F32D, 0x1F32F, prEmojiPresentation}, // E1.0 [3] (🌭..🌯) hot dog..burrito
71 {0x1F330, 0x1F331, prEmojiPresentation}, // E0.6 [2] (🌰..🌱) chestnut..seedling
72 {0x1F332, 0x1F333, prEmojiPresentation}, // E1.0 [2] (🌲..🌳) evergreen tree..deciduous tree
73 {0x1F334, 0x1F335, prEmojiPresentation}, // E0.6 [2] (🌴..🌵) palm tree..cactus
74 {0x1F337, 0x1F34A, prEmojiPresentation}, // E0.6 [20] (🌷..🍊) tulip..tangerine
75 {0x1F34B, 0x1F34B, prEmojiPresentation}, // E1.0 [1] (🍋) lemon
76 {0x1F34C, 0x1F34F, prEmojiPresentation}, // E0.6 [4] (🍌..🍏) banana..green apple
77 {0x1F350, 0x1F350, prEmojiPresentation}, // E1.0 [1] (🍐) pear
78 {0x1F351, 0x1F37B, prEmojiPresentation}, // E0.6 [43] (🍑..🍻) peach..clinking beer mugs
79 {0x1F37C, 0x1F37C, prEmojiPresentation}, // E1.0 [1] (🍼) baby bottle
80 {0x1F37E, 0x1F37F, prEmojiPresentation}, // E1.0 [2] (🍾..🍿) bottle with popping cork..popcorn
81 {0x1F380, 0x1F393, prEmojiPresentation}, // E0.6 [20] (🎀..🎓) ribbon..graduation cap
82 {0x1F3A0, 0x1F3C4, prEmojiPresentation}, // E0.6 [37] (🎠..🏄) carousel horse..person surfing
83 {0x1F3C5, 0x1F3C5, prEmojiPresentation}, // E1.0 [1] (🏅) sports medal
84 {0x1F3C6, 0x1F3C6, prEmojiPresentation}, // E0.6 [1] (🏆) trophy
85 {0x1F3C7, 0x1F3C7, prEmojiPresentation}, // E1.0 [1] (🏇) horse racing
86 {0x1F3C8, 0x1F3C8, prEmojiPresentation}, // E0.6 [1] (🏈) american football
87 {0x1F3C9, 0x1F3C9, prEmojiPresentation}, // E1.0 [1] (🏉) rugby football
88 {0x1F3CA, 0x1F3CA, prEmojiPresentation}, // E0.6 [1] (🏊) person swimming
89 {0x1F3CF, 0x1F3D3, prEmojiPresentation}, // E1.0 [5] (🏏..🏓) cricket game..ping pong
90 {0x1F3E0, 0x1F3E3, prEmojiPresentation}, // E0.6 [4] (🏠..🏣) house..Japanese post office
91 {0x1F3E4, 0x1F3E4, prEmojiPresentation}, // E1.0 [1] (🏤) post office
92 {0x1F3E5, 0x1F3F0, prEmojiPresentation}, // E0.6 [12] (🏥..🏰) hospital..castle
93 {0x1F3F4, 0x1F3F4, prEmojiPresentation}, // E1.0 [1] (🏴) black flag
94 {0x1F3F8, 0x1F407, prEmojiPresentation}, // E1.0 [16] (🏸..🐇) badminton..rabbit
95 {0x1F408, 0x1F408, prEmojiPresentation}, // E0.7 [1] (🐈) cat
96 {0x1F409, 0x1F40B, prEmojiPresentation}, // E1.0 [3] (🐉..🐋) dragon..whale
97 {0x1F40C, 0x1F40E, prEmojiPresentation}, // E0.6 [3] (🐌..🐎) snail..horse
98 {0x1F40F, 0x1F410, prEmojiPresentation}, // E1.0 [2] (🐏..🐐) ram..goat
99 {0x1F411, 0x1F412, prEmojiPresentation}, // E0.6 [2] (🐑..🐒) ewe..monkey
100 {0x1F413, 0x1F413, prEmojiPresentation}, // E1.0 [1] (🐓) rooster
101 {0x1F414, 0x1F414, prEmojiPresentation}, // E0.6 [1] (🐔) chicken
102 {0x1F415, 0x1F415, prEmojiPresentation}, // E0.7 [1] (🐕) dog
103 {0x1F416, 0x1F416, prEmojiPresentation}, // E1.0 [1] (🐖) pig
104 {0x1F417, 0x1F429, prEmojiPresentation}, // E0.6 [19] (🐗..🐩) boar..poodle
105 {0x1F42A, 0x1F42A, prEmojiPresentation}, // E1.0 [1] (🐪) camel
106 {0x1F42B, 0x1F43E, prEmojiPresentation}, // E0.6 [20] (🐫..🐾) two-hump camel..paw prints
107 {0x1F440, 0x1F440, prEmojiPresentation}, // E0.6 [1] (👀) eyes
108 {0x1F442, 0x1F464, prEmojiPresentation}, // E0.6 [35] (👂..👤) ear..bust in silhouette
109 {0x1F465, 0x1F465, prEmojiPresentation}, // E1.0 [1] (👥) busts in silhouette
110 {0x1F466, 0x1F46B, prEmojiPresentation}, // E0.6 [6] (👦..👫) boy..woman and man holding hands
111 {0x1F46C, 0x1F46D, prEmojiPresentation}, // E1.0 [2] (👬..👭) men holding hands..women holding hands
112 {0x1F46E, 0x1F4AC, prEmojiPresentation}, // E0.6 [63] (👮..💬) police officer..speech balloon
113 {0x1F4AD, 0x1F4AD, prEmojiPresentation}, // E1.0 [1] (💭) thought balloon
114 {0x1F4AE, 0x1F4B5, prEmojiPresentation}, // E0.6 [8] (💮..💵) white flower..dollar banknote
115 {0x1F4B6, 0x1F4B7, prEmojiPresentation}, // E1.0 [2] (💶..💷) euro banknote..pound banknote
116 {0x1F4B8, 0x1F4EB, prEmojiPresentation}, // E0.6 [52] (💸..📫) money with wings..closed mailbox with raised flag
117 {0x1F4EC, 0x1F4ED, prEmojiPresentation}, // E0.7 [2] (📬..📭) open mailbox with raised flag..open mailbox with lowered flag
118 {0x1F4EE, 0x1F4EE, prEmojiPresentation}, // E0.6 [1] (📮) postbox
119 {0x1F4EF, 0x1F4EF, prEmojiPresentation}, // E1.0 [1] (📯) postal horn
120 {0x1F4F0, 0x1F4F4, prEmojiPresentation}, // E0.6 [5] (📰..📴) newspaper..mobile phone off
121 {0x1F4F5, 0x1F4F5, prEmojiPresentation}, // E1.0 [1] (📵) no mobile phones
122 {0x1F4F6, 0x1F4F7, prEmojiPresentation}, // E0.6 [2] (📶..📷) antenna bars..camera
123 {0x1F4F8, 0x1F4F8, prEmojiPresentation}, // E1.0 [1] (📸) camera with flash
124 {0x1F4F9, 0x1F4FC, prEmojiPresentation}, // E0.6 [4] (📹..📼) video camera..videocassette
125 {0x1F4FF, 0x1F502, prEmojiPresentation}, // E1.0 [4] (📿..🔂) prayer beads..repeat single button
126 {0x1F503, 0x1F503, prEmojiPresentation}, // E0.6 [1] (🔃) clockwise vertical arrows
127 {0x1F504, 0x1F507, prEmojiPresentation}, // E1.0 [4] (🔄..🔇) counterclockwise arrows button..muted speaker
128 {0x1F508, 0x1F508, prEmojiPresentation}, // E0.7 [1] (🔈) speaker low volume
129 {0x1F509, 0x1F509, prEmojiPresentation}, // E1.0 [1] (🔉) speaker medium volume
130 {0x1F50A, 0x1F514, prEmojiPresentation}, // E0.6 [11] (🔊..🔔) speaker high volume..bell
131 {0x1F515, 0x1F515, prEmojiPresentation}, // E1.0 [1] (🔕) bell with slash
132 {0x1F516, 0x1F52B, prEmojiPresentation}, // E0.6 [22] (🔖..🔫) bookmark..water pistol
133 {0x1F52C, 0x1F52D, prEmojiPresentation}, // E1.0 [2] (🔬..🔭) microscope..telescope
134 {0x1F52E, 0x1F53D, prEmojiPresentation}, // E0.6 [16] (🔮..🔽) crystal ball..downwards button
135 {0x1F54B, 0x1F54E, prEmojiPresentation}, // E1.0 [4] (🕋..🕎) kaaba..menorah
136 {0x1F550, 0x1F55B, prEmojiPresentation}, // E0.6 [12] (🕐..🕛) one o’clock..twelve o’clock
137 {0x1F55C, 0x1F567, prEmojiPresentation}, // E0.7 [12] (🕜..🕧) one-thirty..twelve-thirty
138 {0x1F57A, 0x1F57A, prEmojiPresentation}, // E3.0 [1] (🕺) man dancing
139 {0x1F595, 0x1F596, prEmojiPresentation}, // E1.0 [2] (🖕..🖖) middle finger..vulcan salute
140 {0x1F5A4, 0x1F5A4, prEmojiPresentation}, // E3.0 [1] (🖤) black heart
141 {0x1F5FB, 0x1F5FF, prEmojiPresentation}, // E0.6 [5] (🗻..🗿) mount fuji..moai
142 {0x1F600, 0x1F600, prEmojiPresentation}, // E1.0 [1] (😀) grinning face
143 {0x1F601, 0x1F606, prEmojiPresentation}, // E0.6 [6] (😁..😆) beaming face with smiling eyes..grinning squinting face
144 {0x1F607, 0x1F608, prEmojiPresentation}, // E1.0 [2] (😇..😈) smiling face with halo..smiling face with horns
145 {0x1F609, 0x1F60D, prEmojiPresentation}, // E0.6 [5] (😉..😍) winking face..smiling face with heart-eyes
146 {0x1F60E, 0x1F60E, prEmojiPresentation}, // E1.0 [1] (😎) smiling face with sunglasses
147 {0x1F60F, 0x1F60F, prEmojiPresentation}, // E0.6 [1] (😏) smirking face
148 {0x1F610, 0x1F610, prEmojiPresentation}, // E0.7 [1] (😐) neutral face
149 {0x1F611, 0x1F611, prEmojiPresentation}, // E1.0 [1] (😑) expressionless face
150 {0x1F612, 0x1F614, prEmojiPresentation}, // E0.6 [3] (😒..😔) unamused face..pensive face
151 {0x1F615, 0x1F615, prEmojiPresentation}, // E1.0 [1] (😕) confused face
152 {0x1F616, 0x1F616, prEmojiPresentation}, // E0.6 [1] (😖) confounded face
153 {0x1F617, 0x1F617, prEmojiPresentation}, // E1.0 [1] (😗) kissing face
154 {0x1F618, 0x1F618, prEmojiPresentation}, // E0.6 [1] (😘) face blowing a kiss
155 {0x1F619, 0x1F619, prEmojiPresentation}, // E1.0 [1] (😙) kissing face with smiling eyes
156 {0x1F61A, 0x1F61A, prEmojiPresentation}, // E0.6 [1] (😚) kissing face with closed eyes
157 {0x1F61B, 0x1F61B, prEmojiPresentation}, // E1.0 [1] (😛) face with tongue
158 {0x1F61C, 0x1F61E, prEmojiPresentation}, // E0.6 [3] (😜..😞) winking face with tongue..disappointed face
159 {0x1F61F, 0x1F61F, prEmojiPresentation}, // E1.0 [1] (😟) worried face
160 {0x1F620, 0x1F625, prEmojiPresentation}, // E0.6 [6] (😠..😥) angry face..sad but relieved face
161 {0x1F626, 0x1F627, prEmojiPresentation}, // E1.0 [2] (😦..😧) frowning face with open mouth..anguished face
162 {0x1F628, 0x1F62B, prEmojiPresentation}, // E0.6 [4] (😨..😫) fearful face..tired face
163 {0x1F62C, 0x1F62C, prEmojiPresentation}, // E1.0 [1] (😬) grimacing face
164 {0x1F62D, 0x1F62D, prEmojiPresentation}, // E0.6 [1] (😭) loudly crying face
165 {0x1F62E, 0x1F62F, prEmojiPresentation}, // E1.0 [2] (😮..😯) face with open mouth..hushed face
166 {0x1F630, 0x1F633, prEmojiPresentation}, // E0.6 [4] (😰..😳) anxious face with sweat..flushed face
167 {0x1F634, 0x1F634, prEmojiPresentation}, // E1.0 [1] (😴) sleeping face
168 {0x1F635, 0x1F635, prEmojiPresentation}, // E0.6 [1] (😵) face with crossed-out eyes
169 {0x1F636, 0x1F636, prEmojiPresentation}, // E1.0 [1] (😶) face without mouth
170 {0x1F637, 0x1F640, prEmojiPresentation}, // E0.6 [10] (😷..🙀) face with medical mask..weary cat
171 {0x1F641, 0x1F644, prEmojiPresentation}, // E1.0 [4] (🙁..🙄) slightly frowning face..face with rolling eyes
172 {0x1F645, 0x1F64F, prEmojiPresentation}, // E0.6 [11] (🙅..🙏) person gesturing NO..folded hands
173 {0x1F680, 0x1F680, prEmojiPresentation}, // E0.6 [1] (🚀) rocket
174 {0x1F681, 0x1F682, prEmojiPresentation}, // E1.0 [2] (🚁..🚂) helicopter..locomotive
175 {0x1F683, 0x1F685, prEmojiPresentation}, // E0.6 [3] (🚃..🚅) railway car..bullet train
176 {0x1F686, 0x1F686, prEmojiPresentation}, // E1.0 [1] (🚆) train
177 {0x1F687, 0x1F687, prEmojiPresentation}, // E0.6 [1] (🚇) metro
178 {0x1F688, 0x1F688, prEmojiPresentation}, // E1.0 [1] (🚈) light rail
179 {0x1F689, 0x1F689, prEmojiPresentation}, // E0.6 [1] (🚉) station
180 {0x1F68A, 0x1F68B, prEmojiPresentation}, // E1.0 [2] (🚊..🚋) tram..tram car
181 {0x1F68C, 0x1F68C, prEmojiPresentation}, // E0.6 [1] (🚌) bus
182 {0x1F68D, 0x1F68D, prEmojiPresentation}, // E0.7 [1] (🚍) oncoming bus
183 {0x1F68E, 0x1F68E, prEmojiPresentation}, // E1.0 [1] (🚎) trolleybus
184 {0x1F68F, 0x1F68F, prEmojiPresentation}, // E0.6 [1] (🚏) bus stop
185 {0x1F690, 0x1F690, prEmojiPresentation}, // E1.0 [1] (🚐) minibus
186 {0x1F691, 0x1F693, prEmojiPresentation}, // E0.6 [3] (🚑..🚓) ambulance..police car
187 {0x1F694, 0x1F694, prEmojiPresentation}, // E0.7 [1] (🚔) oncoming police car
188 {0x1F695, 0x1F695, prEmojiPresentation}, // E0.6 [1] (🚕) taxi
189 {0x1F696, 0x1F696, prEmojiPresentation}, // E1.0 [1] (🚖) oncoming taxi
190 {0x1F697, 0x1F697, prEmojiPresentation}, // E0.6 [1] (🚗) automobile
191 {0x1F698, 0x1F698, prEmojiPresentation}, // E0.7 [1] (🚘) oncoming automobile
192 {0x1F699, 0x1F69A, prEmojiPresentation}, // E0.6 [2] (🚙..🚚) sport utility vehicle..delivery truck
193 {0x1F69B, 0x1F6A1, prEmojiPresentation}, // E1.0 [7] (🚛..🚡) articulated lorry..aerial tramway
194 {0x1F6A2, 0x1F6A2, prEmojiPresentation}, // E0.6 [1] (🚢) ship
195 {0x1F6A3, 0x1F6A3, prEmojiPresentation}, // E1.0 [1] (🚣) person rowing boat
196 {0x1F6A4, 0x1F6A5, prEmojiPresentation}, // E0.6 [2] (🚤..🚥) speedboat..horizontal traffic light
197 {0x1F6A6, 0x1F6A6, prEmojiPresentation}, // E1.0 [1] (🚦) vertical traffic light
198 {0x1F6A7, 0x1F6AD, prEmojiPresentation}, // E0.6 [7] (🚧..🚭) construction..no smoking
199 {0x1F6AE, 0x1F6B1, prEmojiPresentation}, // E1.0 [4] (🚮..🚱) litter in bin sign..non-potable water
200 {0x1F6B2, 0x1F6B2, prEmojiPresentation}, // E0.6 [1] (🚲) bicycle
201 {0x1F6B3, 0x1F6B5, prEmojiPresentation}, // E1.0 [3] (🚳..🚵) no bicycles..person mountain biking
202 {0x1F6B6, 0x1F6B6, prEmojiPresentation}, // E0.6 [1] (🚶) person walking
203 {0x1F6B7, 0x1F6B8, prEmojiPresentation}, // E1.0 [2] (🚷..🚸) no pedestrians..children crossing
204 {0x1F6B9, 0x1F6BE, prEmojiPresentation}, // E0.6 [6] (🚹..🚾) men’s room..water closet
205 {0x1F6BF, 0x1F6BF, prEmojiPresentation}, // E1.0 [1] (🚿) shower
206 {0x1F6C0, 0x1F6C0, prEmojiPresentation}, // E0.6 [1] (🛀) person taking bath
207 {0x1F6C1, 0x1F6C5, prEmojiPresentation}, // E1.0 [5] (🛁..🛅) bathtub..left luggage
208 {0x1F6CC, 0x1F6CC, prEmojiPresentation}, // E1.0 [1] (🛌) person in bed
209 {0x1F6D0, 0x1F6D0, prEmojiPresentation}, // E1.0 [1] (🛐) place of worship
210 {0x1F6D1, 0x1F6D2, prEmojiPresentation}, // E3.0 [2] (🛑..🛒) stop sign..shopping cart
211 {0x1F6D5, 0x1F6D5, prEmojiPresentation}, // E12.0 [1] (🛕) hindu temple
212 {0x1F6D6, 0x1F6D7, prEmojiPresentation}, // E13.0 [2] (🛖..🛗) hut..elevator
213 {0x1F6DD, 0x1F6DF, prEmojiPresentation}, // E14.0 [3] (🛝..🛟) playground slide..ring buoy
214 {0x1F6EB, 0x1F6EC, prEmojiPresentation}, // E1.0 [2] (🛫..🛬) airplane departure..airplane arrival
215 {0x1F6F4, 0x1F6F6, prEmojiPresentation}, // E3.0 [3] (🛴..🛶) kick scooter..canoe
216 {0x1F6F7, 0x1F6F8, prEmojiPresentation}, // E5.0 [2] (🛷..🛸) sled..flying saucer
217 {0x1F6F9, 0x1F6F9, prEmojiPresentation}, // E11.0 [1] (🛹) skateboard
218 {0x1F6FA, 0x1F6FA, prEmojiPresentation}, // E12.0 [1] (🛺) auto rickshaw
219 {0x1F6FB, 0x1F6FC, prEmojiPresentation}, // E13.0 [2] (🛻..🛼) pickup truck..roller skate
220 {0x1F7E0, 0x1F7EB, prEmojiPresentation}, // E12.0 [12] (🟠..🟫) orange circle..brown square
221 {0x1F7F0, 0x1F7F0, prEmojiPresentation}, // E14.0 [1] (🟰) heavy equals sign
222 {0x1F90C, 0x1F90C, prEmojiPresentation}, // E13.0 [1] (🤌) pinched fingers
223 {0x1F90D, 0x1F90F, prEmojiPresentation}, // E12.0 [3] (🤍..🤏) white heart..pinching hand
224 {0x1F910, 0x1F918, prEmojiPresentation}, // E1.0 [9] (🤐..🤘) zipper-mouth face..sign of the horns
225 {0x1F919, 0x1F91E, prEmojiPresentation}, // E3.0 [6] (🤙..🤞) call me hand..crossed fingers
226 {0x1F91F, 0x1F91F, prEmojiPresentation}, // E5.0 [1] (🤟) love-you gesture
227 {0x1F920, 0x1F927, prEmojiPresentation}, // E3.0 [8] (🤠..🤧) cowboy hat face..sneezing face
228 {0x1F928, 0x1F92F, prEmojiPresentation}, // E5.0 [8] (🤨..🤯) face with raised eyebrow..exploding head
229 {0x1F930, 0x1F930, prEmojiPresentation}, // E3.0 [1] (🤰) pregnant woman
230 {0x1F931, 0x1F932, prEmojiPresentation}, // E5.0 [2] (🤱..🤲) breast-feeding..palms up together
231 {0x1F933, 0x1F93A, prEmojiPresentation}, // E3.0 [8] (🤳..🤺) selfie..person fencing
232 {0x1F93C, 0x1F93E, prEmojiPresentation}, // E3.0 [3] (🤼..🤾) people wrestling..person playing handball
233 {0x1F93F, 0x1F93F, prEmojiPresentation}, // E12.0 [1] (🤿) diving mask
234 {0x1F940, 0x1F945, prEmojiPresentation}, // E3.0 [6] (🥀..🥅) wilted flower..goal net
235 {0x1F947, 0x1F94B, prEmojiPresentation}, // E3.0 [5] (🥇..🥋) 1st place medal..martial arts uniform
236 {0x1F94C, 0x1F94C, prEmojiPresentation}, // E5.0 [1] (🥌) curling stone
237 {0x1F94D, 0x1F94F, prEmojiPresentation}, // E11.0 [3] (🥍..🥏) lacrosse..flying disc
238 {0x1F950, 0x1F95E, prEmojiPresentation}, // E3.0 [15] (🥐..🥞) croissant..pancakes
239 {0x1F95F, 0x1F96B, prEmojiPresentation}, // E5.0 [13] (🥟..🥫) dumpling..canned food
240 {0x1F96C, 0x1F970, prEmojiPresentation}, // E11.0 [5] (🥬..🥰) leafy green..smiling face with hearts
241 {0x1F971, 0x1F971, prEmojiPresentation}, // E12.0 [1] (🥱) yawning face
242 {0x1F972, 0x1F972, prEmojiPresentation}, // E13.0 [1] (🥲) smiling face with tear
243 {0x1F973, 0x1F976, prEmojiPresentation}, // E11.0 [4] (🥳..🥶) partying face..cold face
244 {0x1F977, 0x1F978, prEmojiPresentation}, // E13.0 [2] (🥷..🥸) ninja..disguised face
245 {0x1F979, 0x1F979, prEmojiPresentation}, // E14.0 [1] (🥹) face holding back tears
246 {0x1F97A, 0x1F97A, prEmojiPresentation}, // E11.0 [1] (🥺) pleading face
247 {0x1F97B, 0x1F97B, prEmojiPresentation}, // E12.0 [1] (🥻) sari
248 {0x1F97C, 0x1F97F, prEmojiPresentation}, // E11.0 [4] (🥼..🥿) lab coat..flat shoe
249 {0x1F980, 0x1F984, prEmojiPresentation}, // E1.0 [5] (🦀..🦄) crab..unicorn
250 {0x1F985, 0x1F991, prEmojiPresentation}, // E3.0 [13] (🦅..🦑) eagle..squid
251 {0x1F992, 0x1F997, prEmojiPresentation}, // E5.0 [6] (🦒..🦗) giraffe..cricket
252 {0x1F998, 0x1F9A2, prEmojiPresentation}, // E11.0 [11] (🦘..🦢) kangaroo..swan
253 {0x1F9A3, 0x1F9A4, prEmojiPresentation}, // E13.0 [2] (🦣..🦤) mammoth..dodo
254 {0x1F9A5, 0x1F9AA, prEmojiPresentation}, // E12.0 [6] (🦥..🦪) sloth..oyster
255 {0x1F9AB, 0x1F9AD, prEmojiPresentation}, // E13.0 [3] (🦫..🦭) beaver..seal
256 {0x1F9AE, 0x1F9AF, prEmojiPresentation}, // E12.0 [2] (🦮..🦯) guide dog..white cane
257 {0x1F9B0, 0x1F9B9, prEmojiPresentation}, // E11.0 [10] (🦰..🦹) red hair..supervillain
258 {0x1F9BA, 0x1F9BF, prEmojiPresentation}, // E12.0 [6] (🦺..🦿) safety vest..mechanical leg
259 {0x1F9C0, 0x1F9C0, prEmojiPresentation}, // E1.0 [1] (🧀) cheese wedge
260 {0x1F9C1, 0x1F9C2, prEmojiPresentation}, // E11.0 [2] (🧁..🧂) cupcake..salt
261 {0x1F9C3, 0x1F9CA, prEmojiPresentation}, // E12.0 [8] (🧃..🧊) beverage box..ice
262 {0x1F9CB, 0x1F9CB, prEmojiPresentation}, // E13.0 [1] (🧋) bubble tea
263 {0x1F9CC, 0x1F9CC, prEmojiPresentation}, // E14.0 [1] (🧌) troll
264 {0x1F9CD, 0x1F9CF, prEmojiPresentation}, // E12.0 [3] (🧍..🧏) person standing..deaf person
265 {0x1F9D0, 0x1F9E6, prEmojiPresentation}, // E5.0 [23] (🧐..🧦) face with monocle..socks
266 {0x1F9E7, 0x1F9FF, prEmojiPresentation}, // E11.0 [25] (🧧..🧿) red envelope..nazar amulet
267 {0x1FA70, 0x1FA73, prEmojiPresentation}, // E12.0 [4] (🩰..🩳) ballet shoes..shorts
268 {0x1FA74, 0x1FA74, prEmojiPresentation}, // E13.0 [1] (🩴) thong sandal
269 {0x1FA78, 0x1FA7A, prEmojiPresentation}, // E12.0 [3] (🩸..🩺) drop of blood..stethoscope
270 {0x1FA7B, 0x1FA7C, prEmojiPresentation}, // E14.0 [2] (🩻..🩼) x-ray..crutch
271 {0x1FA80, 0x1FA82, prEmojiPresentation}, // E12.0 [3] (🪀..🪂) yo-yo..parachute
272 {0x1FA83, 0x1FA86, prEmojiPresentation}, // E13.0 [4] (🪃..🪆) boomerang..nesting dolls
273 {0x1FA90, 0x1FA95, prEmojiPresentation}, // E12.0 [6] (🪐..🪕) ringed planet..banjo
274 {0x1FA96, 0x1FAA8, prEmojiPresentation}, // E13.0 [19] (🪖..🪨) military helmet..rock
275 {0x1FAA9, 0x1FAAC, prEmojiPresentation}, // E14.0 [4] (🪩..🪬) mirror ball..hamsa
276 {0x1FAB0, 0x1FAB6, prEmojiPresentation}, // E13.0 [7] (🪰..🪶) fly..feather
277 {0x1FAB7, 0x1FABA, prEmojiPresentation}, // E14.0 [4] (🪷..🪺) lotus..nest with eggs
278 {0x1FAC0, 0x1FAC2, prEmojiPresentation}, // E13.0 [3] (🫀..🫂) anatomical heart..people hugging
279 {0x1FAC3, 0x1FAC5, prEmojiPresentation}, // E14.0 [3] (🫃..🫅) pregnant man..person with crown
280 {0x1FAD0, 0x1FAD6, prEmojiPresentation}, // E13.0 [7] (🫐..🫖) blueberries..teapot
281 {0x1FAD7, 0x1FAD9, prEmojiPresentation}, // E14.0 [3] (🫗..🫙) pouring liquid..jar
282 {0x1FAE0, 0x1FAE7, prEmojiPresentation}, // E14.0 [8] (🫠..🫧) melting face..bubbles
283 {0x1FAF0, 0x1FAF6, prEmojiPresentation}, // E14.0 [7] (🫰..🫶) hand with index finger and thumb crossed..heart hands
284 }
306306 // Output: First |line.
307307 //‖Second |line.‖
308308 }
309
310 func ExampleStringWidth() {
311 fmt.Println(uniseg.StringWidth("Hello, 世界"))
312 // Output: 11
313 }
22 // This program generates a Go property file from Unicode Character
33 // Database auxiliary data files. The command line arguments are as follows:
44 //
5 // 1. The name of the Unicode data file (just the filename, without extension).
6 // 2. The name of the locally generated Go file.
7 // 3. The name of the slice mapping code points to properties.
8 // 4. The name of the generator, for logging purposes.
9 // 5. (Optional) Flags, comma-separated. The following flags are available:
10 // - "emojis": include emoji properties (Extended Pictographic only).
11 // - "gencat": include general category properties.
5 // 1. The name of the Unicode data file (just the filename, without extension).
6 // Can be "-" (to skip) if the emoji flag is included.
7 // 2. The name of the locally generated Go file.
8 // 3. The name of the slice mapping code points to properties.
9 // 4. The name of the generator, for logging purposes.
10 // 5. (Optional) Flags, comma-separated. The following flags are available:
11 // - "emojis=<property>": include the specified emoji properties (e.g.
12 // "Extended_Pictographic").
13 // - "gencat": include general category properties.
1214 //
13 //go:generate go run gen_properties.go auxiliary/GraphemeBreakProperty graphemeproperties.go graphemeCodePoints graphemes emojis
14 //go:generate go run gen_properties.go auxiliary/WordBreakProperty wordproperties.go workBreakCodePoints words emojis
15 //go:generate go run gen_properties.go auxiliary/GraphemeBreakProperty graphemeproperties.go graphemeCodePoints graphemes emojis=Extended_Pictographic
16 //go:generate go run gen_properties.go auxiliary/WordBreakProperty wordproperties.go workBreakCodePoints words emojis=Extended_Pictographic
1517 //go:generate go run gen_properties.go auxiliary/SentenceBreakProperty sentenceproperties.go sentenceBreakCodePoints sentences
1618 //go:generate go run gen_properties.go LineBreak lineproperties.go lineBreakCodePoints lines gencat
1719 //go:generate go run gen_properties.go EastAsianWidth eastasianwidth.go eastAsianWidth eastasianwidth
20 //go:generate go run gen_properties.go - emojipresentation.go emojiPresentation emojipresentation emojis=Emoji_Presentation
1821 package main
1922
2023 import (
3740 // We want to test against a specific version rather than the latest. When the
3841 // package is upgraded to a new version, change these to generate new tests.
3942 const (
40 gbpURL = `https://www.unicode.org/Public/14.0.0/ucd/%s.txt`
41 emojiURL = `https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt`
43 propertyURL = `https://www.unicode.org/Public/14.0.0/ucd/%s.txt`
44 emojiURL = `https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt`
4245 )
4346
4447 // The regular expression for a line containing a code point range property.
5457 log.SetFlags(0)
5558
5659 // Parse flags.
57 flags := make(map[string]struct{})
60 flags := make(map[string]string)
5861 if len(os.Args) >= 6 {
5962 for _, flag := range strings.Split(os.Args[5], ",") {
60 flags[flag] = struct{}{}
63 flagFields := strings.Split(flag, "=")
64 if len(flagFields) == 1 {
65 flags[flagFields[0]] = "yes"
66 } else {
67 flags[flagFields[0]] = flagFields[1]
68 }
6169 }
6270 }
6371
6472 // Parse the text file and generate Go source code from it.
65 var emojis string
66 if _, ok := flags["emojis"]; ok {
67 emojis = emojiURL
68 }
6973 _, includeGeneralCategory := flags["gencat"]
70 src, err := parse(fmt.Sprintf(gbpURL, os.Args[1]), emojis, includeGeneralCategory)
74 var mainURL string
75 if os.Args[1] != "-" {
76 mainURL = fmt.Sprintf(propertyURL, os.Args[1])
77 }
78 src, err := parse(mainURL, flags["emojis"], includeGeneralCategory)
7179 if err != nil {
7280 log.Fatal(err)
7381 }
8795
8896 // parse parses the Unicode Properties text files located at the given URLs and
8997 // returns their equivalent Go source code to be used in the uniseg package. If
90 // "emojiURL" is an empty string, no emoji code points will be included. If
98 // "emojiProperty" is not an empty string, emoji code points for that emoji
99 // property (e.g. "Extended_Pictographic") will be included. In those cases, you
100 // may pass an empty "propertyURL" to skip parsing the main properties file. If
91101 // "includeGeneralCategory" is true, the Unicode General Category property will
92102 // be extracted from the comments and included in the output.
93 func parse(gbpURL, emojiURL string, includeGeneralCategory bool) (string, error) {
103 func parse(propertyURL, emojiProperty string, includeGeneralCategory bool) (string, error) {
104 if propertyURL == "" && emojiProperty == "" {
105 return "", errors.New("no properties to parse")
106 }
107
94108 // Temporary buffer to hold properties.
95109 var properties [][4]string
96110
97111 // Open the first URL.
98 log.Printf("Parsing %s", gbpURL)
99 res, err := http.Get(gbpURL)
100 if err != nil {
101 return "", err
102 }
103 in1 := res.Body
104 defer in1.Close()
105
106 // Parse it.
107 scanner := bufio.NewScanner(in1)
108 num := 0
109 for scanner.Scan() {
110 num++
111 line := strings.TrimSpace(scanner.Text())
112
113 // Skip comments and empty lines.
114 if strings.HasPrefix(line, "#") || line == "" {
115 continue
116 }
117
118 // Everything else must be a code point range, a property and a comment.
119 from, to, property, comment, err := parseProperty(line)
112 if propertyURL != "" {
113 log.Printf("Parsing %s", propertyURL)
114 res, err := http.Get(propertyURL)
120115 if err != nil {
121 return "", fmt.Errorf("%s line %d: %v", os.Args[4], num, err)
122 }
123 properties = append(properties, [4]string{from, to, property, comment})
124 }
125 if err := scanner.Err(); err != nil {
126 return "", err
116 return "", err
117 }
118 in1 := res.Body
119 defer in1.Close()
120
121 // Parse it.
122 scanner := bufio.NewScanner(in1)
123 num := 0
124 for scanner.Scan() {
125 num++
126 line := strings.TrimSpace(scanner.Text())
127
128 // Skip comments and empty lines.
129 if strings.HasPrefix(line, "#") || line == "" {
130 continue
131 }
132
133 // Everything else must be a code point range, a property and a comment.
134 from, to, property, comment, err := parseProperty(line)
135 if err != nil {
136 return "", fmt.Errorf("%s line %d: %v", os.Args[4], num, err)
137 }
138 properties = append(properties, [4]string{from, to, property, comment})
139 }
140 if err := scanner.Err(); err != nil {
141 return "", err
142 }
127143 }
128144
129145 // Open the second URL.
130 if emojiURL != "" {
146 if emojiProperty != "" {
131147 log.Printf("Parsing %s", emojiURL)
132 res, err = http.Get(emojiURL)
148 res, err := http.Get(emojiURL)
133149 if err != nil {
134150 return "", err
135151 }
137153 defer in2.Close()
138154
139155 // Parse it.
140 scanner = bufio.NewScanner(in2)
141 num = 0
156 scanner := bufio.NewScanner(in2)
157 num := 0
142158 for scanner.Scan() {
143159 num++
144160 line := scanner.Text()
145161
146162 // Skip comments, empty lines, and everything not containing the
147163 // requested emoji property.
148 if strings.HasPrefix(line, "#") || line == "" || !strings.Contains(line, "Extended_Pictographic") {
164 if strings.HasPrefix(line, "#") || line == "" || !strings.Contains(line, emojiProperty) {
149165 continue
150166 }
151167
188204 // Code generated via go generate from gen_properties.go. DO NOT EDIT.
189205
190206 // ` + os.Args[3] + ` are taken from
191 // ` + gbpURL + emojiComment + `
207 // ` + propertyURL + emojiComment + `
192208 // on ` + time.Now().Format("January 2, 2006") + `. See https://www.unicode.org/license.html for the Unicode
193209 // license agreement.
194210 var ` + os.Args[3] + ` = [][` + strconv.Itoa(columns) + `]int{
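The flag handling introduced in this generator can be tried in isolation. What follows is a minimal sketch, not the generator itself: `parseFlags` is a hypothetical helper name that mirrors the loop in `main`, and it uses `strings.SplitN` rather than `strings.Split` so that only the first `=` separates name and value (an assumption about the desired behavior).

```go
package main

import (
	"fmt"
	"strings"
)

// parseFlags is a hypothetical helper mirroring the generator's flag loop:
// a bare flag such as "gencat" maps to "yes", while "name=value" maps the
// name to the given value.
func parseFlags(arg string) map[string]string {
	flags := make(map[string]string)
	for _, flag := range strings.Split(arg, ",") {
		fields := strings.SplitN(flag, "=", 2)
		if len(fields) == 1 {
			flags[fields[0]] = "yes"
		} else {
			flags[fields[0]] = fields[1]
		}
	}
	return flags
}

func main() {
	flags := parseFlags("emojis=Extended_Pictographic,gencat")
	_, includeGeneralCategory := flags["gencat"]
	// Prints: Extended_Pictographic true
	fmt.Println(flags["emojis"], includeGeneralCategory)
}
```

A missing "emojis" flag yields the empty string from the map lookup, which is what lets the generator pass `flags["emojis"]` straight through as the emoji property argument.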
33
44 // Graphemes implements an iterator over Unicode grapheme clusters, or
55 // user-perceived characters. While iterating, it also provides information
6 // about word boundaries, sentence boundaries, and line breaks.
6 // about word boundaries, sentence boundaries, line breaks, and monospace
7 // character widths.
78 //
89 // After constructing the class via [NewGraphemes] for a given string "str",
910 // [Graphemes.Next] is called for every grapheme cluster in a loop until it
1011 // returns false. Inside the loop, information about the grapheme cluster as
11 // well as boundary information is available via the various methods (see
12 // examples below).
12 // well as boundary information and character width is available via the various
13 // methods (see examples below).
1314 //
1415 // Using this class to iterate over a string is convenient but it is much slower
1516 // than using this package's [Step] or [StepString] functions or any of the
133134 return g.boundaries & MaskLine
134135 }
135136
137 // Width returns the monospace width of the current grapheme cluster.
138 func (g *Graphemes) Width() int {
139 if g.state < 0 {
140 return 0
141 }
142 return g.boundaries >> ShiftWidth
143 }
144
136145 // Reset puts the iterator into its initial state such that the next call to
137146 // [Graphemes.Next] sets it to the first grapheme cluster again.
138147 func (g *Graphemes) Reset() {
153162 return
154163 }
155164
165 // The number of bits the grapheme property must be shifted to make place for
166 // grapheme states.
167 const shiftGraphemePropState = 4
168
156169 // FirstGraphemeCluster returns the first grapheme cluster found in the given
157170 // byte slice according to the rules of Unicode Standard Annex #29, Grapheme
158171 // Cluster Boundaries. This function can be called continuously to extract all
168181 // "cluster" byte slice is the sub-slice of the input slice containing the
169182 // identified grapheme cluster.
170183 //
184 // The returned width is the width of the grapheme cluster for most monospace
185 // fonts where a value of 1 represents one character cell.
186 //
171187 // Given an empty byte slice "b", the function returns nil values.
172188 //
173189 // While slightly less convenient than using the Graphemes class, this function
174190 // has much better performance and makes no allocations. It lends itself well to
175191 // large byte slices.
176 //
177 // The "reserved" return value is a placeholder for future functionality and may
178 // be ignored for the time being.
179 func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, reserved, newState int) {
192 func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, width, newState int) {
180193 // An empty byte slice returns nothing.
181194 if len(b) == 0 {
182195 return
185198 // Extract the first rune.
186199 r, length := utf8.DecodeRune(b)
187200 if len(b) <= length { // If we're already past the end, there is nothing else to parse.
188 return b, nil, 0, grAny
201 var prop int
202 if state < 0 {
203 prop = property(graphemeCodePoints, r)
204 } else {
205 prop = state >> shiftGraphemePropState
206 }
207 return b, nil, runeWidth(r, prop), grAny | (prop << shiftGraphemePropState)
189208 }
190209
191210 // If we don't know the state, determine it now.
211 var firstProp int
192212 if state < 0 {
193 state, _ = transitionGraphemeState(state, r)
194 }
213 state, firstProp, _ = transitionGraphemeState(state, r)
214 } else {
215 firstProp = state >> shiftGraphemePropState
216 }
217 width += runeWidth(r, firstProp)
195218
196219 // Transition until we find a boundary.
197 var boundary bool
198220 for {
221 var (
222 prop int
223 boundary bool
224 )
225
199226 r, l := utf8.DecodeRune(b[length:])
200 state, boundary = transitionGraphemeState(state, r)
227 state, prop, boundary = transitionGraphemeState(state&maskGraphemeState, r)
201228
202229 if boundary {
203 return b[:length], b[length:], 0, state
230 return b[:length], b[length:], width, state | (prop << shiftGraphemePropState)
231 }
232
233 if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL {
234 width += runeWidth(r, prop)
235 } else if firstProp == prExtendedPictographic {
236 if r == 0xfe0e {
237 width = 1
238 } else {
239 width = 2
240 }
204241 }
205242
206243 length += l
207244 if len(b) <= length {
208 return b, nil, 0, grAny
245 return b, nil, width, grAny | (prop << shiftGraphemePropState)
209246 }
210247 }
211248 }
212249
213250 // FirstGraphemeClusterInString is like [FirstGraphemeCluster] but its input and
214251 // outputs are strings.
215 func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, reserved, newState int) {
252 func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, width, newState int) {
216253 // An empty string returns nothing.
217254 if len(str) == 0 {
218255 return
221258 // Extract the first rune.
222259 r, length := utf8.DecodeRuneInString(str)
223260 if len(str) <= length { // If we're already past the end, there is nothing else to parse.
224 return str, "", 0, grAny
261 var prop int
262 if state < 0 {
263 prop = property(graphemeCodePoints, r)
264 } else {
265 prop = state >> shiftGraphemePropState
266 }
267 return str, "", runeWidth(r, prop), grAny | (prop << shiftGraphemePropState)
225268 }
226269
227270 // If we don't know the state, determine it now.
271 var firstProp int
228272 if state < 0 {
229 state, _ = transitionGraphemeState(state, r)
230 }
273 state, firstProp, _ = transitionGraphemeState(state, r)
274 } else {
275 firstProp = state >> shiftGraphemePropState
276 }
277 width += runeWidth(r, firstProp)
231278
232279 // Transition until we find a boundary.
233 var boundary bool
234280 for {
281 var (
282 prop int
283 boundary bool
284 )
285
235286 r, l := utf8.DecodeRuneInString(str[length:])
236 state, boundary = transitionGraphemeState(state, r)
287 state, prop, boundary = transitionGraphemeState(state&maskGraphemeState, r)
237288
238289 if boundary {
239 return str[:length], str[length:], 0, state
290 return str[:length], str[length:], width, state | (prop << shiftGraphemePropState)
291 }
292
293 if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL {
294 width += runeWidth(r, prop)
295 } else if firstProp == prExtendedPictographic {
296 if r == 0xfe0e {
297 width = 1
298 } else {
299 width = 2
300 }
240301 }
241302
242303 length += l
243304 if len(str) <= length {
244 return str, "", 0, grAny
245 }
246 }
247 }
305 return str, "", width, grAny | (prop << shiftGraphemePropState)
306 }
307 }
308 }
33
44 // graphemeBreakTestCases are Grapheme testcases taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/GraphemeBreakTest.txt
6 // on July 25, 2022. See
6 // on September 10, 2022. See
77 // https://www.unicode.org/license.html for the Unicode license agreement.
88 var graphemeBreakTestCases = []testCase{
99 {original: "\u0020\u0020", expected: [][]rune{{0x0020}, {0x0020}}}, // ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3]
66 // and
77 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
88 // ("Extended_Pictographic" only)
9 // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
1010 // license agreement.
1111 var graphemeCodePoints = [][3]int{
1212 {0x0000, 0x0009, prControl}, // Cc [10] <control-0000>..<control-0009>
2626 //
2727 // This map is queried as follows:
2828 //
29 // 1. Find specific state + specific property. Stop if found.
30 // 2. Find specific state + any property.
31 // 3. Find any state + specific property.
32 // 4. If only (2) or (3) (but not both) was found, stop.
33 // 5. If both (2) and (3) were found, use state from (3) and breaking instruction
34 // from the transition with the lower rule number, prefer (3) if rule numbers
35 // are equal. Stop.
36 // 6. Assume grAny and grBoundary.
29 // 1. Find specific state + specific property. Stop if found.
30 // 2. Find specific state + any property.
31 // 3. Find any state + specific property.
32 // 4. If only (2) or (3) (but not both) was found, stop.
33 // 5. If both (2) and (3) were found, use state from (3) and breaking instruction
34 // from the transition with the lower rule number, prefer (3) if rule numbers
35 // are equal. Stop.
36 // 6. Assume grAny and grBoundary.
3737 //
3838 // Unicode version 14.0.0.
3939 var grTransitions = map[[2]int][3]int{
9191 }
9292
9393 // transitionGraphemeState determines the new state of the grapheme cluster
94 // parser given the current state and the next code point. It also returns
95 // whether a cluster boundary was detected.
96 func transitionGraphemeState(state int, r rune) (newState int, boundary bool) {
94 // parser given the current state and the next code point. It also returns the
95 // code point's grapheme property (the value mapped by the [graphemeCodePoints]
96 // table) and whether a cluster boundary was detected.
97 func transitionGraphemeState(state int, r rune) (newState, prop int, boundary bool) {
9798 // Determine the property of the next character.
98 nextProperty := property(graphemeCodePoints, r)
99 prop = property(graphemeCodePoints, r)
99100
100101 // Find the applicable transition.
101 transition, ok := grTransitions[[2]int{state, nextProperty}]
102 transition, ok := grTransitions[[2]int{state, prop}]
102103 if ok {
103104 // We have a specific transition. We'll use it.
104 return transition[0], transition[1] == grBoundary
105 return transition[0], prop, transition[1] == grBoundary
105106 }
106107
107108 // No specific transition found. Try the less specific ones.
108109 transAnyProp, okAnyProp := grTransitions[[2]int{state, prAny}]
109 transAnyState, okAnyState := grTransitions[[2]int{grAny, nextProperty}]
110 transAnyState, okAnyState := grTransitions[[2]int{grAny, prop}]
110111 if okAnyProp && okAnyState {
111112 // Both apply. We'll use a mix (see comments for grTransitions).
112113 newState = transAnyState[0]
119120
120121 if okAnyProp {
121122 // We only have a specific state.
122 return transAnyProp[0], transAnyProp[1] == grBoundary
123 return transAnyProp[0], prop, transAnyProp[1] == grBoundary
123124 // This branch will probably never be reached because okAnyState will
124125 // always be true given the current transition map. But we keep it here
125126 // for future modifications to the transition map where this may not be
128129
129130 if okAnyState {
130131 // We only have a specific property.
131 return transAnyState[0], transAnyState[1] == grBoundary
132 return transAnyState[0], prop, transAnyState[1] == grBoundary
132133 }
133134
134135 // No known transition. GB999: Any ÷ Any.
135 return grAny, true
136 return grAny, prop, true
136137 }
33
44 // lineBreakTestCases are line break test cases taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/LineBreakTest.txt
6 // on July 25, 2022. See
6 // on September 10, 2022. See
77 // https://www.unicode.org/license.html for the Unicode license agreement.
88 var lineBreakTestCases = []testCase{
99 {original: "\u0023\u0023", expected: [][]rune{{0x0023, 0x0023}}}, // × [0.3] NUMBER SIGN (AL) × [28.0] NUMBER SIGN (AL) ÷ [0.3]
33
44 // lineBreakCodePoints are taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/LineBreak.txt
6 // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode
6 // and
7 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
8 // ("Extended_Pictographic" only)
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
710 // license agreement.
811 var lineBreakCodePoints = [][4]int{
912 {0x0000, 0x0008, prCM, gcCc}, // [9] <control-0000>..<control-0008>
22 // The Unicode properties as used in the various parsers. Only the ones needed
33 // in the context of this package are included.
44 const (
5 prXX = 0 // Same as prAny.
6 prAny = iota // prAny must be 0.
7 prPrepend
5 prXX = 0 // Same as prAny.
6 prAny = iota // prAny must be 0.
7 prPrepend // Grapheme properties must come first, to reduce the number of bits stored in the state vector.
88 prCR
99 prLF
1010 prControl
8585 prW
8686 prH
8787 prF
88 prEmojiPresentation
8889 )
8990
9091 // Unicode General Categories. Only the ones needed in the context of this
33
44 // sentenceBreakTestCases are sentence break test cases taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/SentenceBreakTest.txt
6 // on July 25, 2022. See
6 // on September 10, 2022. See
77 // https://www.unicode.org/license.html for the Unicode license agreement.
88 var sentenceBreakTestCases = []testCase{
99 {original: "\u0001\u0001", expected: [][]rune{{0x0001, 0x0001}}}, // ÷ [0.2] <START OF HEADING> (Other) × [998.0] <START OF HEADING> (Other) ÷ [0.3]
33
44 // sentenceBreakCodePoints are taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/SentenceBreakProperty.txt
6 // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode
6 // and
7 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
8 // ("Extended_Pictographic" only)
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
710 // license agreement.
811 var sentenceBreakCodePoints = [][3]int{
912 {0x0009, 0x0009, prSp}, // Cc <control-0009>
11
22 import "unicode/utf8"
33
4 // The bit masks used to extract boundary information returned by the [Step]
5 // function.
4 // The bit masks used to extract boundary information returned by [Step].
65 const (
76 MaskLine = 3
87 MaskWord = 4
98 MaskSentence = 8
109 )
1110
11 // The number of bits to shift the boundary information returned by [Step] to
12 // obtain the monospace width of the grapheme cluster.
13 const ShiftWidth = 4
14
1215 // The bit positions by which boundary flags are shifted by the [Step] function.
13 // This must correspond to the Mask constants.
16 // These must correspond to the Mask constants.
1417 const (
1518 shiftWord = 2
1619 shiftSentence = 3
20 // shiftWidth is ShiftWidth above. No mask as these are always the remaining bits.
1721 )
1822
1923 // The bit positions by which states are shifted by the [Step] function. These
2024 // values must ensure state values defined for each of the boundary algorithms
21 // don't overlap (and that they all still fit in a single int).
25 // don't overlap (and that they all still fit in a single int). These must
26 // correspond to the Mask constants.
2227 const (
2328 shiftWordState = 4
2429 shiftSentenceState = 9
2530 shiftLineState = 13
31 shiftPropState = 21 // No mask as these are always the remaining bits.
2632 )
2733
2834 // The bit mask used to extract the state returned by the [Step] function, after
5359 // boundary.
5460 // - boundaries&MaskLine == LineCanBreak: You may or may not break the line at
5561 // the boundary.
62 // - boundaries >> ShiftWidth: The width of the grapheme cluster for most
63 // monospace fonts where a value of 1 represents one character cell.
5664 //
5765 // This function can be called continuously to extract all grapheme clusters
5866 // from a byte slice, as illustrated in the examples below.
8694 // Extract the first rune.
8795 r, length := utf8.DecodeRune(b)
8896 if len(b) <= length { // If we're already past the end, there is nothing else to parse.
89 return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
97 var prop int
98 if state < 0 {
99 prop = property(graphemeCodePoints, r)
100 } else {
101 prop = state >> shiftPropState
102 }
103 return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (runeWidth(r, prop) << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | (prop << shiftPropState)
90104 }
91105
92106 // If we don't know the state, determine it now.
93 var graphemeState, wordState, sentenceState, lineState int
107 var graphemeState, wordState, sentenceState, lineState, firstProp int
94108 remainder := b[length:]
95109 if state < 0 {
96 graphemeState, _ = transitionGraphemeState(state, r)
110 graphemeState, firstProp, _ = transitionGraphemeState(state, r)
97111 wordState, _ = transitionWordBreakState(state, r, remainder, "")
98112 sentenceState, _ = transitionSentenceBreakState(state, r, remainder, "")
99113 lineState, _ = transitionLineBreakState(state, r, remainder, "")
102116 wordState = (state >> shiftWordState) & maskWordState
103117 sentenceState = (state >> shiftSentenceState) & maskSentenceState
104118 lineState = (state >> shiftLineState) & maskLineState
119 firstProp = state >> shiftPropState
105120 }
106121
107122 // Transition until we find a grapheme cluster boundary.
108 var (
109 graphemeBoundary, wordBoundary, sentenceBoundary bool
110 lineBreak int
111 )
123 width := runeWidth(r, firstProp)
112124 for {
125 var (
126 graphemeBoundary, wordBoundary, sentenceBoundary bool
127 lineBreak, prop int
128 )
129
113130 r, l := utf8.DecodeRune(remainder)
114131 remainder = b[length+l:]
115132
116 graphemeState, graphemeBoundary = transitionGraphemeState(graphemeState, r)
133 graphemeState, prop, graphemeBoundary = transitionGraphemeState(graphemeState, r)
117134 wordState, wordBoundary = transitionWordBreakState(wordState, r, remainder, "")
118135 sentenceState, sentenceBoundary = transitionSentenceBreakState(sentenceState, r, remainder, "")
119136 lineState, lineBreak = transitionLineBreakState(lineState, r, remainder, "")
120137
121138 if graphemeBoundary {
122 boundary := lineBreak
139 boundary := lineBreak | (width << ShiftWidth)
123140 if wordBoundary {
124141 boundary |= 1 << shiftWord
125142 }
126143 if sentenceBoundary {
127144 boundary |= 1 << shiftSentence
128145 }
129 return b[:length], b[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState)
146 return b[:length], b[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState) | (prop << shiftPropState)
147 }
148
149 if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL {
150 width += runeWidth(r, prop)
151 } else if firstProp == prExtendedPictographic {
152 if r == 0xfe0e {
153 width = 1
154 } else {
155 width = 2
156 }
130157 }
131158
132159 length += l
133160 if len(b) <= length {
134 return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
161 return b, nil, LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (width << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState) | (prop << shiftPropState)
135162 }
136163 }
137164 }
146173 // Extract the first rune.
147174 r, length := utf8.DecodeRuneInString(str)
148175 if len(str) <= length { // If we're already past the end, there is nothing else to parse.
149 return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
176 prop := property(graphemeCodePoints, r)
177 return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (runeWidth(r, prop) << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
150178 }
151179
152180 // If we don't know the state, determine it now.
153 var graphemeState, wordState, sentenceState, lineState int
181 var graphemeState, wordState, sentenceState, lineState, firstProp int
154182 remainder := str[length:]
155183 if state < 0 {
156 graphemeState, _ = transitionGraphemeState(state, r)
184 graphemeState, firstProp, _ = transitionGraphemeState(state, r)
157185 wordState, _ = transitionWordBreakState(state, r, nil, remainder)
158186 sentenceState, _ = transitionSentenceBreakState(state, r, nil, remainder)
159187 lineState, _ = transitionLineBreakState(state, r, nil, remainder)
162190 wordState = (state >> shiftWordState) & maskWordState
163191 sentenceState = (state >> shiftSentenceState) & maskSentenceState
164192 lineState = (state >> shiftLineState) & maskLineState
193 firstProp = state >> shiftPropState
165194 }
166195
167196 // Transition until we find a grapheme cluster boundary.
168 var (
169 graphemeBoundary, wordBoundary, sentenceBoundary bool
170 lineBreak int
171 )
197 width := runeWidth(r, firstProp)
172198 for {
199 var (
200 graphemeBoundary, wordBoundary, sentenceBoundary bool
201 lineBreak, prop int
202 )
203
173204 r, l := utf8.DecodeRuneInString(remainder)
174205 remainder = str[length+l:]
175206
176 graphemeState, graphemeBoundary = transitionGraphemeState(graphemeState, r)
207 graphemeState, prop, graphemeBoundary = transitionGraphemeState(graphemeState, r)
177208 wordState, wordBoundary = transitionWordBreakState(wordState, r, nil, remainder)
178209 sentenceState, sentenceBoundary = transitionSentenceBreakState(sentenceState, r, nil, remainder)
179210 lineState, lineBreak = transitionLineBreakState(lineState, r, nil, remainder)
180211
181212 if graphemeBoundary {
182 boundary := lineBreak
213 boundary := lineBreak | (width << ShiftWidth)
183214 if wordBoundary {
184215 boundary |= 1 << shiftWord
185216 }
189220 return str[:length], str[length:], boundary, graphemeState | (wordState << shiftWordState) | (sentenceState << shiftSentenceState) | (lineState << shiftLineState) | (prop << shiftPropState)
190221 }
191222
223 if firstProp != prExtendedPictographic && firstProp != prRegionalIndicator && firstProp != prL {
224 width += runeWidth(r, prop)
225 } else if firstProp == prExtendedPictographic {
226 if r == 0xfe0e {
227 width = 1
228 } else {
229 width = 2
230 }
231 }
232
192233 length += l
193234 if len(str) <= length {
194 return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
235 return str, "", LineMustBreak | (1 << shiftWord) | (1 << shiftSentence) | (width << ShiftWidth), grAny | (wbAny << shiftWordState) | (sbAny << shiftSentenceState) | (lbAny << shiftLineState)
195236 }
196237 }
197238 }
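The layout of the `boundaries` value returned by `Step` can be illustrated with a small sketch. The mask and shift values below are copied from the diff above; the numeric values of the `Line*` constants are assumptions made for this example, and `packBoundaries` and `unpackBoundaries` are hypothetical helper names that do not exist in the package.

```go
package main

import "fmt"

// Values copied from the diff above.
const (
	MaskLine     = 3
	MaskWord     = 4
	MaskSentence = 8
	ShiftWidth   = 4
)

// Assumed values for illustration only; their definitions are not part of
// this diff.
const (
	LineDontBreak = 0
	LineCanBreak  = 1
	LineMustBreak = 2
)

// packBoundaries is a hypothetical helper that assembles a value the way the
// diff does: line break information in the lowest two bits, one bit each for
// word and sentence boundaries, and the width in the remaining bits.
func packBoundaries(lineBreak int, word, sentence bool, width int) int {
	b := lineBreak | (width << ShiftWidth)
	if word {
		b |= MaskWord // Same as 1 << shiftWord.
	}
	if sentence {
		b |= MaskSentence // Same as 1 << shiftSentence.
	}
	return b
}

// unpackBoundaries is a hypothetical helper that splits the value again.
func unpackBoundaries(b int) (lineBreak int, word, sentence bool, width int) {
	return b & MaskLine, b&MaskWord != 0, b&MaskSentence != 0, b >> ShiftWidth
}

func main() {
	b := packBoundaries(LineCanBreak, true, false, 2)
	lineBreak, word, sentence, width := unpackBoundaries(b)
	// Prints: 1 true false 2
	fmt.Println(lineBreak, word, sentence, width)
}
```

Because the width occupies all bits above the boundary flags, no mask is needed to read it; a plain right shift by `ShiftWidth` is enough.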
0 package uniseg
1
2 // runeWidth returns the monospace width for the given rune. The provided
3 // grapheme property is a value mapped by the [graphemeCodePoints] table.
4 //
5 // Every rune has a width of 1, except for runes with the following properties
6 // (evaluated in this order):
7 //
8 // - Control, CR, LF, Extend, ZWJ: Width of 0
9 // - Regional Indicator: Width of 2
10 // - Extended Pictographic: Width of 2, unless Emoji Presentation is "No".
11 // - \u2e3a, TWO-EM DASH: Width of 3
12 // - \u2e3b, THREE-EM DASH: Width of 4
13 // - East-Asian width Fullwidth and Wide: Width of 2 (Ambiguous and Neutral
14 // have a width of 1)
15 func runeWidth(r rune, graphemeProperty int) int {
16 switch graphemeProperty {
17 case prControl, prCR, prLF, prExtend, prZWJ:
18 return 0
19 case prRegionalIndicator:
20 return 2
21 case prExtendedPictographic:
22 if property(emojiPresentation, r) == prEmojiPresentation {
23 return 2
24 }
25 return 1
26 }
27
28 switch r {
29 case 0x2e3a:
30 return 3
31 case 0x2e3b:
32 return 4
33 }
34
35 switch property(eastAsianWidth, r) {
36 case prW, prF:
37 return 2
38 }
39
40 return 1
41 }
42
43 // StringWidth returns the monospace width for the given string, that is, the
44 // number of same-size cells to be occupied by the string.
45 func StringWidth(s string) (width int) {
46 state := -1
47 for len(s) > 0 {
48 var w int
49 _, s, w, state = FirstGraphemeClusterInString(s, state)
50 width += w
51 }
52 return
53 }
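The order of the checks in `runeWidth` matters: the grapheme property is consulted before the special dashes and the East Asian width. The following simplified sketch shows that precedence without the generated tables. `toyWidth` and the `prop*` constants are hypothetical stand-ins, and the emoji presentation and East Asian width lookups are replaced by boolean parameters.

```go
package main

import "fmt"

// Toy grapheme properties, stand-ins for the package's pr* constants.
const (
	propOther = iota
	propZeroWidth // Control, CR, LF, Extend, ZWJ.
	propRegionalIndicator
	propExtendedPictographic
)

// toyWidth follows the same order of checks as runeWidth in the diff above,
// but receives the table lookups as arguments: emojiPresentation stands for
// the Emoji_Presentation lookup and wide for East Asian width W or F.
func toyWidth(r rune, prop int, emojiPresentation, wide bool) int {
	// 1. The grapheme property decides first.
	switch prop {
	case propZeroWidth:
		return 0
	case propRegionalIndicator:
		return 2
	case propExtendedPictographic:
		if emojiPresentation {
			return 2
		}
		return 1
	}

	// 2. Special cases for the long dashes.
	switch r {
	case 0x2e3a:
		return 3
	case 0x2e3b:
		return 4
	}

	// 3. East Asian width, then the default.
	if wide {
		return 2
	}
	return 1
}

func main() {
	fmt.Println(toyWidth(0x00a9, propExtendedPictographic, false, false)) // 1
	fmt.Println(toyWidth(0x1f60a, propExtendedPictographic, true, true))  // 2
	fmt.Println(toyWidth(0x2e3a, propOther, false, false))                // 3
	fmt.Println(toyWidth(0xff01, propOther, false, true))                 // 2
	fmt.Println(toyWidth(0x0300, propZeroWidth, false, false))            // 0
}
```

With this ordering, an Extended Pictographic rune without emoji presentation gets a width of 1 even if it were flagged as wide, because the East Asian width check is never reached for it.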
0 package uniseg
1
2 import "testing"
3
4 // widthTestCases is a list of test cases for the calculation of string widths.
5 var widthTestCases = []struct {
6 original string
7 expected int
8 }{
9 {"", 0}, // Control
10 {"\b", 0},
11 {"\x00", 0},
12 {"\x05", 0},
13 {"\a", 0},
14 {"\u000a", 0}, // LF
15 {"\u000d", 0}, // CR
16 {"\n", 0},
17 {"\v", 0},
18 {"\f", 0},
19 {"\r", 0},
20 {"\x0e", 0},
21 {"\x0f", 0},
22 {"\u0300", 0}, // Extend
23 {"\u200d", 0}, // ZERO WIDTH JOINER
24 {"a", 1},
25 {"\u1b05", 1}, // N
26 {"\u2985", 1}, // Na
27 {"\U0001F100", 1}, // A
28 {"\uff61", 1}, // H
29 {"\ufe6a", 2}, // W
30 {"\uff01", 2}, // F
31 {"\u2e3a", 3}, // TWO-EM DASH
32 {"\u2e3b", 4}, // THREE-EM DASH
33 {"\u00a9", 1}, // Extended Pictographic (Emoji Presentation = No)
34 {"\U0001F60A", 2}, // Extended Pictographic (Emoji Presentation = Yes)
35 {"\U0001F1E6", 2}, // Regional Indicator
36 {"\u061c\u061c", 0},
37 {"\u061c\u000a", 0},
38 {"\u061c\u000d", 0},
39 {"\u061c\u0300", 0},
40 {"\u061c\u200d", 0},
41 {"\u061ca", 1},
42 {"\u061c\u1b05", 1},
43 {"\u061c\u2985", 1},
44 {"\u061c\U0001F100", 1},
45 {"\u061c\uff61", 1},
46 {"\u061c\ufe6a", 2},
47 {"\u061c\uff01", 2},
48 {"\u061c\u2e3a", 3},
49 {"\u061c\u2e3b", 4},
50 {"\u061c\u00a9", 1},
51 {"\u061c\U0001F60A", 2},
52 {"\u061c\U0001F1E6", 2},
53 {"\u000a\u061c", 0},
54 {"\u000a\u000a", 0},
55 {"\u000a\u000d", 0},
56 {"\u000a\u0300", 0},
57 {"\u000a\u200d", 0},
58 {"\u000aa", 1},
59 {"\u000a\u1b05", 1},
60 {"\u000a\u2985", 1},
61 {"\u000a\U0001F100", 1},
62 {"\u000a\uff61", 1},
63 {"\u000a\ufe6a", 2},
64 {"\u000a\uff01", 2},
65 {"\u000a\u2e3a", 3},
66 {"\u000a\u2e3b", 4},
67 {"\u000a\u00a9", 1},
68 {"\u000a\U0001F60A", 2},
69 {"\u000a\U0001F1E6", 2},
70 {"\u000d\u061c", 0},
71 {"\u000d\u000a", 0},
72 {"\u000d\u000d", 0},
73 {"\u000d\u0300", 0},
74 {"\u000d\u200d", 0},
75 {"\u000da", 1},
76 {"\u000d\u1b05", 1},
77 {"\u000d\u2985", 1},
78 {"\u000d\U0001F100", 1},
79 {"\u000d\uff61", 1},
80 {"\u000d\ufe6a", 2},
81 {"\u000d\uff01", 2},
82 {"\u000d\u2e3a", 3},
83 {"\u000d\u2e3b", 4},
84 {"\u000d\u00a9", 1},
85 {"\u000d\U0001F60A", 2},
86 {"\u000d\U0001F1E6", 2},
87 {"\u0300\u061c", 0},
88 {"\u0300\u000a", 0},
89 {"\u0300\u000d", 0},
90 {"\u0300\u0300", 0},
91 {"\u0300\u200d", 0},
92 {"\u0300a", 1},
93 {"\u0300\u1b05", 1},
94 {"\u0300\u2985", 1},
95 {"\u0300\U0001F100", 1},
96 {"\u0300\uff61", 1},
97 {"\u0300\ufe6a", 2},
98 {"\u0300\uff01", 2},
99 {"\u0300\u2e3a", 3},
100 {"\u0300\u2e3b", 4},
101 {"\u0300\u00a9", 1},
102 {"\u0300\U0001F60A", 2},
103 {"\u0300\U0001F1E6", 2},
104 {"\u200d\u061c", 0},
105 {"\u200d\u000a", 0},
106 {"\u200d\u000d", 0},
107 {"\u200d\u0300", 0},
108 {"\u200d\u200d", 0},
109 {"\u200da", 1},
110 {"\u200d\u1b05", 1},
111 {"\u200d\u2985", 1},
112 {"\u200d\U0001F100", 1},
113 {"\u200d\uff61", 1},
114 {"\u200d\ufe6a", 2},
115 {"\u200d\uff01", 2},
116 {"\u200d\u2e3a", 3},
117 {"\u200d\u2e3b", 4},
118 {"\u200d\u00a9", 1},
119 {"\u200d\U0001F60A", 2},
120 {"\u200d\U0001F1E6", 2},
121 {"a\u061c", 1},
122 {"a\u000a", 1},
123 {"a\u000d", 1},
124 {"a\u0300", 1},
125 {"a\u200d", 1},
126 {"aa", 2},
127 {"a\u1b05", 2},
128 {"a\u2985", 2},
129 {"a\U0001F100", 2},
130 {"a\uff61", 2},
131 {"a\ufe6a", 3},
132 {"a\uff01", 3},
133 {"a\u2e3a", 4},
134 {"a\u2e3b", 5},
135 {"a\u00a9", 2},
136 {"a\U0001F60A", 3},
137 {"a\U0001F1E6", 3},
138 {"\u1b05\u061c", 1},
139 {"\u1b05\u000a", 1},
140 {"\u1b05\u000d", 1},
141 {"\u1b05\u0300", 1},
142 {"\u1b05\u200d", 1},
143 {"\u1b05a", 2},
144 {"\u1b05\u1b05", 2},
145 {"\u1b05\u2985", 2},
146 {"\u1b05\U0001F100", 2},
147 {"\u1b05\uff61", 2},
148 {"\u1b05\ufe6a", 3},
149 {"\u1b05\uff01", 3},
150 {"\u1b05\u2e3a", 4},
151 {"\u1b05\u2e3b", 5},
152 {"\u1b05\u00a9", 2},
153 {"\u1b05\U0001F60A", 3},
154 {"\u1b05\U0001F1E6", 3},
155 {"\u2985\u061c", 1},
156 {"\u2985\u000a", 1},
157 {"\u2985\u000d", 1},
158 {"\u2985\u0300", 1},
159 {"\u2985\u200d", 1},
160 {"\u2985a", 2},
161 {"\u2985\u1b05", 2},
162 {"\u2985\u2985", 2},
163 {"\u2985\U0001F100", 2},
164 {"\u2985\uff61", 2},
165 {"\u2985\ufe6a", 3},
166 {"\u2985\uff01", 3},
167 {"\u2985\u2e3a", 4},
168 {"\u2985\u2e3b", 5},
169 {"\u2985\u00a9", 2},
170 {"\u2985\U0001F60A", 3},
171 {"\u2985\U0001F1E6", 3},
172 {"\U0001F100\u061c", 1},
173 {"\U0001F100\u000a", 1},
174 {"\U0001F100\u000d", 1},
175 {"\U0001F100\u0300", 1},
176 {"\U0001F100\u200d", 1},
177 {"\U0001F100a", 2},
178 {"\U0001F100\u1b05", 2},
179 {"\U0001F100\u2985", 2},
180 {"\U0001F100\U0001F100", 2},
181 {"\U0001F100\uff61", 2},
182 {"\U0001F100\ufe6a", 3},
183 {"\U0001F100\uff01", 3},
184 {"\U0001F100\u2e3a", 4},
185 {"\U0001F100\u2e3b", 5},
186 {"\U0001F100\u00a9", 2},
187 {"\U0001F100\U0001F60A", 3},
188 {"\U0001F100\U0001F1E6", 3},
189 {"\uff61\u061c", 1},
190 {"\uff61\u000a", 1},
191 {"\uff61\u000d", 1},
192 {"\uff61\u0300", 1},
193 {"\uff61\u200d", 1},
194 {"\uff61a", 2},
195 {"\uff61\u1b05", 2},
196 {"\uff61\u2985", 2},
197 {"\uff61\U0001F100", 2},
198 {"\uff61\uff61", 2},
199 {"\uff61\ufe6a", 3},
200 {"\uff61\uff01", 3},
201 {"\uff61\u2e3a", 4},
202 {"\uff61\u2e3b", 5},
203 {"\uff61\u00a9", 2},
204 {"\uff61\U0001F60A", 3},
205 {"\uff61\U0001F1E6", 3},
206 {"\ufe6a\u061c", 2},
207 {"\ufe6a\u000a", 2},
208 {"\ufe6a\u000d", 2},
209 {"\ufe6a\u0300", 2},
210 {"\ufe6a\u200d", 2},
211 {"\ufe6aa", 3},
212 {"\ufe6a\u1b05", 3},
213 {"\ufe6a\u2985", 3},
214 {"\ufe6a\U0001F100", 3},
215 {"\ufe6a\uff61", 3},
216 {"\ufe6a\ufe6a", 4},
217 {"\ufe6a\uff01", 4},
218 {"\ufe6a\u2e3a", 5},
219 {"\ufe6a\u2e3b", 6},
220 {"\ufe6a\u00a9", 3},
221 {"\ufe6a\U0001F60A", 4},
222 {"\ufe6a\U0001F1E6", 4},
223 {"\uff01\u061c", 2},
224 {"\uff01\u000a", 2},
225 {"\uff01\u000d", 2},
226 {"\uff01\u0300", 2},
227 {"\uff01\u200d", 2},
228 {"\uff01a", 3},
229 {"\uff01\u1b05", 3},
230 {"\uff01\u2985", 3},
231 {"\uff01\U0001F100", 3},
232 {"\uff01\uff61", 3},
233 {"\uff01\ufe6a", 4},
234 {"\uff01\uff01", 4},
235 {"\uff01\u2e3a", 5},
236 {"\uff01\u2e3b", 6},
237 {"\uff01\u00a9", 3},
238 {"\uff01\U0001F60A", 4},
239 {"\uff01\U0001F1E6", 4},
240 {"\u2e3a\u061c", 3},
241 {"\u2e3a\u000a", 3},
242 {"\u2e3a\u000d", 3},
243 {"\u2e3a\u0300", 3},
244 {"\u2e3a\u200d", 3},
245 {"\u2e3aa", 4},
246 {"\u2e3a\u1b05", 4},
247 {"\u2e3a\u2985", 4},
248 {"\u2e3a\U0001F100", 4},
249 {"\u2e3a\uff61", 4},
250 {"\u2e3a\ufe6a", 5},
251 {"\u2e3a\uff01", 5},
252 {"\u2e3a\u2e3a", 6},
253 {"\u2e3a\u2e3b", 7},
254 {"\u2e3a\u00a9", 4},
255 {"\u2e3a\U0001F60A", 5},
256 {"\u2e3a\U0001F1E6", 5},
257 {"\u2e3b\u061c", 4},
258 {"\u2e3b\u000a", 4},
259 {"\u2e3b\u000d", 4},
260 {"\u2e3b\u0300", 4},
261 {"\u2e3b\u200d", 4},
262 {"\u2e3ba", 5},
263 {"\u2e3b\u1b05", 5},
264 {"\u2e3b\u2985", 5},
265 {"\u2e3b\U0001F100", 5},
266 {"\u2e3b\uff61", 5},
267 {"\u2e3b\ufe6a", 6},
268 {"\u2e3b\uff01", 6},
269 {"\u2e3b\u2e3a", 7},
270 {"\u2e3b\u2e3b", 8},
271 {"\u2e3b\u00a9", 5},
272 {"\u2e3b\U0001F60A", 6},
273 {"\u2e3b\U0001F1E6", 6},
274 {"\u00a9\u061c", 1},
275 {"\u00a9\u000a", 1},
276 {"\u00a9\u000d", 1},
277 {"\u00a9\u0300", 2}, // This is really 1 but we can't handle it.
278 {"\u00a9\u200d", 2},
279 {"\u00a9a", 2},
280 {"\u00a9\u1b05", 2},
281 {"\u00a9\u2985", 2},
282 {"\u00a9\U0001F100", 2},
283 {"\u00a9\uff61", 2},
284 {"\u00a9\ufe6a", 3},
285 {"\u00a9\uff01", 3},
286 {"\u00a9\u2e3a", 4},
287 {"\u00a9\u2e3b", 5},
288 {"\u00a9\u00a9", 2},
289 {"\u00a9\U0001F60A", 3},
290 {"\u00a9\U0001F1E6", 3},
291 {"\U0001F60A\u061c", 2},
292 {"\U0001F60A\u000a", 2},
293 {"\U0001F60A\u000d", 2},
294 {"\U0001F60A\u0300", 2},
295 {"\U0001F60A\u200d", 2},
296 {"\U0001F60Aa", 3},
297 {"\U0001F60A\u1b05", 3},
298 {"\U0001F60A\u2985", 3},
299 {"\U0001F60A\U0001F100", 3},
300 {"\U0001F60A\uff61", 3},
301 {"\U0001F60A\ufe6a", 4},
302 {"\U0001F60A\uff01", 4},
303 {"\U0001F60A\u2e3a", 5},
304 {"\U0001F60A\u2e3b", 6},
305 {"\U0001F60A\u00a9", 3},
306 {"\U0001F60A\U0001F60A", 4},
307 {"\U0001F60A\U0001F1E6", 4},
308 {"\U0001F1E6\u061c", 2},
309 {"\U0001F1E6\u000a", 2},
310 {"\U0001F1E6\u000d", 2},
311 {"\U0001F1E6\u0300", 2},
312 {"\U0001F1E6\u200d", 2},
313 {"\U0001F1E6a", 3},
314 {"\U0001F1E6\u1b05", 3},
315 {"\U0001F1E6\u2985", 3},
316 {"\U0001F1E6\U0001F100", 3},
317 {"\U0001F1E6\uff61", 3},
318 {"\U0001F1E6\ufe6a", 4},
319 {"\U0001F1E6\uff01", 4},
320 {"\U0001F1E6\u2e3a", 5},
321 {"\U0001F1E6\u2e3b", 6},
322 {"\U0001F1E6\u00a9", 3},
323 {"\U0001F1E6\U0001F60A", 4},
324 {"\U0001F1E6\U0001F1E6", 2},
325 {"Ka\u0308se", 4}, // Käse (German, "cheese")
326 {"\U0001f3f3\ufe0f\u200d\U0001f308", 2}, // Rainbow flag
327 {"\U0001f1e9\U0001f1ea", 2}, // German flag
328 {"\u0916\u093e", 2}, // खा (Hindi, "eat")
329 {"\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466", 2}, // Family: Man, Woman, Girl, Boy
330 {"\u1112\u116f\u11b6", 2}, // 훯 (Hangul, conjoining Jamo, "h+weo+lh")
331 {"\ud6ef", 2}, // 훯 (Hangul, precomposed, "h+weo+lh")
332 {"\u79f0\u8c13", 4}, // 称谓 (Chinese, "title")
333 {"\u0e1c\u0e39\u0e49", 1}, // ผู้ (Thai, "person")
334 {"\u0623\u0643\u062a\u0648\u0628\u0631", 6}, // أكتوبر (Arabic, "October")
335 {"\ua992\ua997\ua983", 3}, // ꦒꦗꦃ (Javanese, "elephant")
336 {"\u263a", 1}, // White smiling face
337 {"\u263a\ufe0f", 2}, // White smiling face (with variation selector 16 = emoji presentation)
338 {"\u231b", 2}, // Hourglass
339 {"\u231b\ufe0e", 1}, // Hourglass (with variation selector 15 = text presentation)
340 }
341
342 // String width tests using the StringWidth function.
343 func TestWidthStringWidth(t *testing.T) {
344 for index, testCase := range widthTestCases {
345 actual := StringWidth(testCase.original)
346 if actual != testCase.expected {
347 t.Errorf("StringWidth(%q) is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
348 }
349 }
350 }
351
352 // String width tests using the Graphemes type.
353 func TestWidthGraphemes(t *testing.T) {
354 for index, testCase := range widthTestCases {
355 var actual int
356 graphemes := NewGraphemes(testCase.original)
357 for graphemes.Next() {
358 actual += graphemes.Width()
359 }
360 if actual != testCase.expected {
361 t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
362 }
363 }
364 }
365
366 // String width tests using the FirstGraphemeCluster function.
367 func TestWidthGraphemesFunctionBytes(t *testing.T) {
368 for index, testCase := range widthTestCases {
369 var actual, width int
370 state := -1
371 text := []byte(testCase.original)
372 for len(text) > 0 {
373 _, text, width, state = FirstGraphemeCluster(text, state)
374 actual += width
375 }
376 if actual != testCase.expected {
377 t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
378 }
379 }
380 }
381
382 // String width tests using the FirstGraphemeClusterInString function.
383 func TestWidthGraphemesFunctionString(t *testing.T) {
384 for index, testCase := range widthTestCases {
385 var actual, width int
386 state := -1
387 text := testCase.original
388 for len(text) > 0 {
389 _, text, width, state = FirstGraphemeClusterInString(text, state)
390 actual += width
391 }
392 if actual != testCase.expected {
393 t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
394 }
395 }
396 }
397
398 // String width tests using the Step function.
399 func TestWidthStepBytes(t *testing.T) {
400 for index, testCase := range widthTestCases {
401 var actual, boundaries int
402 state := -1
403 text := []byte(testCase.original)
404 for len(text) > 0 {
405 _, text, boundaries, state = Step(text, state)
406 actual += boundaries >> ShiftWidth
407 }
408 if actual != testCase.expected {
409 t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
410 }
411 }
412 }
413
414 // String width tests using the StepString function.
415 func TestWidthStepString(t *testing.T) {
416 for index, testCase := range widthTestCases {
417 var actual, boundaries int
418 state := -1
419 text := testCase.original
420 for len(text) > 0 {
421 _, text, boundaries, state = StepString(text, state)
422 actual += boundaries >> ShiftWidth
423 }
424 if actual != testCase.expected {
425 t.Errorf("Width of %q is %d, expected %d (test case %d)", testCase.original, actual, testCase.expected, index)
426 }
427 }
428 }
33
44 // wordBreakTestCases are word break test cases taken from
55 // https://www.unicode.org/Public/14.0.0/ucd/auxiliary/WordBreakTest.txt
6 // on July 25, 2022. See
6 // on September 10, 2022. See
77 // https://www.unicode.org/license.html for the Unicode license agreement.
88 var wordBreakTestCases = []testCase{
99 {original: "\u0001\u0001", expected: [][]rune{{0x0001}, {0x0001}}}, // ÷ [0.2] <START OF HEADING> (Other) ÷ [999.0] <START OF HEADING> (Other) ÷ [0.3]
66 // and
77 // https://unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
88 // ("Extended_Pictographic" only)
9 // on July 25, 2022. See https://www.unicode.org/license.html for the Unicode
9 // on September 10, 2022. See https://www.unicode.org/license.html for the Unicode
1010 // license agreement.
1111 var workBreakCodePoints = [][3]int{
1212 {0x000A, 0x000A, prLF}, // Cc <control-000A>
623623 {0x212A, 0x212D, prALetter}, // L& [4] KELVIN SIGN..BLACK-LETTER CAPITAL C
624624 {0x212F, 0x2134, prALetter}, // L& [6] SCRIPT SMALL E..SCRIPT SMALL O
625625 {0x2135, 0x2138, prALetter}, // Lo [4] ALEF SYMBOL..DALET SYMBOL
626 {0x2139, 0x2139, prExtendedPictographic}, // E0.6 [1] (ℹ️) information
626627 {0x2139, 0x2139, prALetter}, // L& INFORMATION SOURCE
627 {0x2139, 0x2139, prExtendedPictographic}, // E0.6 [1] (ℹ️) information
628628 {0x213C, 0x213F, prALetter}, // L& [4] DOUBLE-STRUCK SMALL PI..DOUBLE-STRUCK CAPITAL PI
629629 {0x2145, 0x2149, prALetter}, // L& [5] DOUBLE-STRUCK ITALIC CAPITAL D..DOUBLE-STRUCK ITALIC SMALL J
630630 {0x214E, 0x214E, prALetter}, // L& TURNED SMALL F