mirror of https://github.com/vlang/v.git synced 2023-08-10 21:13:21 +03:00

scanner: multibyte rune literals now support unicode, hex, and octal escape codes (#13140)

jeffmikels
2022-01-18 19:23:25 -05:00
committed by GitHub
parent bb6c46e1ef
commit 7a2705d8ce
7 changed files with 234 additions and 102 deletions


@@ -476,16 +476,33 @@ d := b + x // d is of type `f64` - automatic promotion of `x`'s value
### Strings
```v
```v nofmt
name := 'Bob'
println(name.len)
println(name[0]) // indexing gives a byte B
println(name[1..3]) // slicing gives a string 'ob'
windows_newline := '\r\n' // escape special characters like in C
assert name.len == 3 // the string is 3 bytes long
assert name[0] == byte(66) // indexing gives a byte, byte(66) == `B`
assert name[1..3] == 'ob' // slicing gives a string 'ob'
// escape codes
windows_newline := '\r\n' // escape special characters like in C
assert windows_newline.len == 2
// arbitrary bytes can be directly specified using `\x##` notation where `#` is
// a hex digit
aardvark_str := '\x61ardvark'
assert aardvark_str == 'aardvark'
assert '\xc0'[0] == byte(0xc0)
// or using octal escape `\###` notation where `#` is an octal digit
aardvark_str2 := '\141ardvark'
assert aardvark_str2 == 'aardvark'
// Unicode can be specified directly as `\u####` where # is a hex digit
// and will be converted internally to its UTF-8 representation
star_str := '\u2605' // ★
assert star_str == '★'
assert star_str == '\xe2\x98\x85' // UTF-8 can be specified this way too.
```
In V, a string is a read-only array of bytes. String data is encoded using UTF-8:
In V, a string is a read-only array of bytes. All Unicode characters are encoded using UTF-8:
```v
s := 'hello 🌎' // emoji takes 4 bytes
assert s.len == 10
@@ -503,11 +520,12 @@ String values are immutable. You cannot mutate elements:
mut s := 'hello 🌎'
s[0] = `H` // not allowed
```
> error: cannot assign to `s[i]` since V strings are immutable
Note that indexing a string will produce a `byte`, not a `rune` nor another `string`.
Indexes correspond to bytes in the string, not Unicode code points. If you want to
convert the `byte` to a `string`, use the `ascii_str()` method:
Note that indexing a string will produce a `byte`, not a `rune` nor another `string`. Indexes
correspond to _bytes_ in the string, not Unicode code points. If you want to convert the `byte` to a
`string`, use the `.ascii_str()` method on the `byte`:
```v
country := 'Netherlands'
@@ -515,20 +533,13 @@ println(country[0]) // Output: 78
println(country[0].ascii_str()) // Output: N
```
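As a minimal sketch of the byte-versus-character distinction described above, the following compares an ASCII string with a string holding one multibyte character. The variable name `star` is only illustrative, and the byte values assume the UTF-8 encoding of `★` shown earlier in this section:

```v
country := 'Netherlands'
// Indexing yields the raw byte value, here the ASCII code for `N`.
assert country[0] == 78
assert country[0].ascii_str() == 'N'

// With non-ASCII text, one character can span several bytes,
// so a single index only sees one byte of the encoded sequence.
star := '★'
assert star.len == 3 // three UTF-8 bytes
assert star[0] == 0xe2 // first byte of the encoding, not the character
assert star.runes().len == 1 // a single Unicode code point
```

If you need whole characters rather than bytes, converting with `.runes()` first is usually the safer route.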
Character literals have type `rune`. To denote them, use `
Both single and double quotes can be used to denote strings. For consistency, `vfmt` converts double
quotes to single quotes unless the string contains a single quote character.
For raw strings, prepend `r`. Escape handling is not done for raw strings:
```v
rocket := `🚀`
assert 'aloha!'[0] == `a`
```
Both single and double quotes can be used to denote strings. For consistency,
`vfmt` converts double quotes to single quotes unless the string contains a single quote character.
For raw strings, prepend `r`. Raw strings are not escaped:
```v
s := r'hello\nworld'
s := r'hello\nworld' // the `\n` will be preserved as two characters
println(s) // "hello\nworld"
```
@@ -537,41 +548,79 @@ Strings can be easily converted to integers:
```v
s := '42'
n := s.int() // 42
// all int literals are supported
assert '0xc3'.int() == 195
assert '0o10'.int() == 8
assert '0b1111_0000_1010'.int() == 3850
assert '-0b1111_0000_1010'.int() == -3850
```
### Runes
A `rune` represents a Unicode character and is an alias for `u32`. Runes can be created like this:
```v
x := `🚀`
```
A string can be converted to runes by the `.runes()` method.
```v
hello := 'Hello World 👋'
hello_runes := hello.runes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `👋`]
```
For more advanced `string` processing and conversions, refer to the
[vlib/strconv](https://modules.vlang.io/strconv.html) module.
### String interpolation
Basic interpolation syntax is pretty simple - use `$` before a variable name.
The variable will be converted to a string and embedded into the literal:
Basic interpolation syntax is pretty simple - use `$` before a variable name. The variable will be
converted to a string and embedded into the literal:
```v
name := 'Bob'
println('Hello, $name!') // Hello, Bob!
```
It also works with fields: `'age = $user.age'`.
If you need more complex expressions, use `${}`: `'can register = ${user.age > 13}'`.
Format specifiers similar to those in C's `printf()` are also supported.
`f`, `g`, `x`, etc. are optional and specify the output format.
The compiler takes care of the storage size, so there is no `hd` or `llu`.
It also works with fields: `'age = $user.age'`. If you need more complex expressions, use `${}`:
`'can register = ${user.age > 13}'`.
Format specifiers similar to those in C's `printf()` are also supported. `f`, `g`, `x`, `o`, `b`,
etc. are optional and specify the output format. The compiler takes care of the storage size, so
there is no `hd` or `llu`.
To use a format specifier, follow this pattern:
`${varname:[flags][width][.precision][type]}`
- flags: may be zero or more of the following: `-` to left-align output within the field, `0` to use
`0` as the padding character instead of the default `space` character. (Note: V does not currently
support the use of `'` or `#` as format flags, and V supports but doesn't need `+` to right-align
since that's the default.)
- width: an integer value describing the minimum total width of the output field.
- precision: an integer value preceded by a `.` will guarantee that many digits after the decimal
point, if the input variable is a float. Ignored if variable is an integer.
- type: one of the following:
  - `f` and `F` specify the input is a float and should be rendered as such.
  - `e` and `E` specify the input is a float and should be rendered as an exponent (partially
    broken).
  - `g` and `G` specify the input is a float; the renderer will use floating point notation for
    small values and exponent notation for large values.
  - `d` specifies the input is an integer and should be rendered in base-10 digits.
  - `x` and `X` require an integer and will render it as hexadecimal digits.
  - `o` requires an integer and will render it as octal digits.
  - `b` requires an integer and will render it as binary digits.
  - `s` requires a string (almost never used).
Note: when a numeric type can render alphabetic characters, such as hex strings or special values
like `infinity`, the lowercase version of the type forces lowercase alphabetics and the uppercase
version forces uppercase alphabetics.
Also note: in most cases, it's best to leave the format type empty. Floats will be rendered by
default as `g`, integers will be rendered by default as `d`, and `s` is almost always redundant.
There are only three cases where specifying a type is recommended:
- format strings are parsed at compile time, so specifying a type can help detect errors then
- format strings default to using lowercase letters for hex digits and the `e` in exponents. Use an
uppercase type to force the use of uppercase hex digits and an uppercase `E` in exponents.
- format strings are the most convenient way to get hex, binary or octal strings from an integer.
See
[Format Placeholder Specification](https://en.wikipedia.org/wiki/Printf_format_string#Format_placeholder_specification)
for more information.
```v
x := 123.4567
println('x = ${x:4.2f}')
println('[${x:10}]') // pad with spaces on the left => [ 123.457]
println('[${int(x):-10}]') // pad with spaces on the right => [123 ]
println('[${x:.2}]') // round to two decimal places => [123.46]
println('[${x:10}]') // right-align with spaces on the left => [ 123.457]
println('[${int(x):-10}]') // left-align with spaces on the right => [123 ]
println('[${int(x):010}]') // pad with zeros on the left => [0000000123]
println('[${int(x):b}]') // output as binary => [1111011]
println('[${int(x):o}]') // output as octal => [173]
println('[${int(x):X}]') // output as uppercase hex => [7B]
```
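The `${varname:[flags][width][.precision][type]}` pattern can also combine several parts at once. The following is a small sketch under that reading of the pattern; the variable names `n`, `h` and `pi` are only illustrative, and the expected strings assume the flag, width and type rules described above:

```v
n := 5
// `0` flag + width 8 + binary type
assert '${n:08b}' == '00000101'
// `-` flag left-aligns within a width of 4; the default is right-aligned
assert '${n:-4}|' == '5   |'
assert '${n:4}|' == '   5|'

h := 255
// the lowercase type gives lowercase hex digits, the uppercase type uppercase ones
assert '${h:x}' == 'ff'
assert '${h:X}' == 'FF'
assert '${h:o}' == '377'

pi := 3.14159
// precision applies to floats; width pads the whole field
assert '${pi:.2f}' == '3.14'
assert '${pi:8.3f}' == '   3.142'
```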
### String operators
@@ -585,13 +634,14 @@ s += 'world' // `+=` is used to append to a string
println(s) // "hello world"
```
All operators in V must have values of the same type on both sides.
You cannot concatenate an integer to a string:
All operators in V must have values of the same type on both sides. You cannot concatenate an
integer to a string:
```v failcompile
age := 10
println('age = ' + age) // not allowed
```
> error: infix expr: cannot use `int` (right expression) as `string`
We have to either convert `age` to a `string`:
@@ -608,6 +658,62 @@ age := 12
println('age = $age')
```
### Runes
A `rune` represents a single Unicode character and is an alias for `u32`. To denote a rune, use `
(backticks):
```v
rocket := `🚀`
```
A `rune` can be converted to a UTF-8 string by using the `.str()` method.
```v
rocket := `🚀`
assert rocket.str() == '🚀'
```
A `rune` can be converted to UTF-8 bytes by using the `.bytes()` method.
```v
rocket := `🚀`
assert rocket.bytes() == [byte(0xf0), 0x9f, 0x9a, 0x80]
```
Hex, Unicode, and Octal escape sequences also work in a `rune` literal:
```v
assert `\x61` == `a`
assert `\141` == `a`
assert `\u0061` == `a`
// multibyte literals work too
assert `\u2605` == `★`
assert `\u2605`.bytes() == [byte(0xe2), 0x98, 0x85]
assert `\xe2\x98\x85`.bytes() == [byte(0xe2), 0x98, 0x85]
assert `\342\230\205`.bytes() == [byte(0xe2), 0x98, 0x85]
```
Note that `rune` literals use the same escape syntax as strings, but they can only hold one Unicode
character. Therefore, if your code does not specify a single Unicode character, you will receive an
error at compile time.
Also remember that strings are indexed as bytes, not runes, so beware:
```v
rocket_string := '🚀'
assert rocket_string[0] != `🚀`
assert 'aloha!'[0] == `a`
```
A string can be converted to runes by the `.runes()` method.
```v
hello := 'Hello World 👋'
hello_runes := hello.runes() // [`H`, `e`, `l`, `l`, `o`, ` `, `W`, `o`, `r`, `l`, `d`, ` `, `👋`]
```
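To tie the conversions together, here is a short sketch that compares the byte length of a string with its rune count and then converts one rune back. It uses only the `.runes()`, `.str()` and `.bytes()` methods shown above; the byte counts assume the waving-hand emoji is encoded as 4 UTF-8 bytes:

```v
hello := 'Hello World 👋'
hello_runes := hello.runes()
// 12 single-byte characters plus 4 bytes for the emoji
assert hello.len == 16
// one rune per character, regardless of how many bytes each one takes
assert hello_runes.len == 13
assert hello_runes[12] == `👋`
// a rune converts back to a UTF-8 string or to its raw bytes
assert hello_runes[12].str() == '👋'
assert hello_runes[12].bytes().len == 4
```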
### Numbers
```v