# V RegEx (Regular expression) 1.0 alpha
[TOC]
## Introduction
Here are the assumptions made while writing the implementation; they are
valid for all the `regex` module features:

1. The matching stops at the end of the string, *not* at newline characters.
2. The basic atomic elements of this regex engine are the tokens.
   In a query string a simple character is a token.

## Differences with PCRE:
NB: We must point out that the **V-Regex module is not PCRE compliant**, and thus
some behaviour will be different. This difference is due to the V philosophy:
have one way to do things and keep it simple.

The main differences can be summarized in the following points:

- The basic element **is the token, not the sequence of symbols**, and the
  simplest token is a single character.

- `|` **the OR operator acts on tokens**; for example, `abc|ebc` is not
  `abc` OR `ebc`. Instead it is evaluated like `ab`, followed by `c OR e`,
  followed by `bc`, because the **token is the base element**,
  not the sequence of symbols.
  Note: **Two char classes with an `OR` in the middle is a syntax error.**

- The **match operation stops at the end of the string**. It does *NOT* stop
  at new line characters.

## Tokens
The tokens are the atomic units used by this regex engine.
They can be one of the following:

### Simple char

This token is a simple single character like `a` or `b` etc.

### Match positional delimiters

`^` Matches the start of the string.

`$` Matches the end of the string.

### Char class (cc)
The character classes match all the chars specified inside. Use square
brackets `[ ]` to enclose them.

The chars in a character class are evaluated with an OR operation.

For example, the cc `[abc]` matches any character that is `a` or `b` or `c`,
but it doesn't match `C` or `z`.

Inside a cc, it is possible to specify a "range" of characters, for example
`[ad-h]` is equivalent to writing `[adefgh]`.

A cc can have different ranges at the same time, for example `[a-zA-Z0-9]`
matches all the latin lowercase, uppercase and numeric characters.

It is possible to negate the meaning of a cc, using the caret char at the
start of the cc like this: `[^abc]`. That matches every char that is NOT
`a` or `b` or `c`.

A cc can contain meta-chars like: `[a-z\d]`, which matches all the lowercase
latin chars `a-z` and all the digits `\d`.

It is possible to mix all the properties of the char class together.

NB: In order to match the `-` (minus) char, it must be preceded by
a backslash in the cc, for example `[\-_\d\a]` will match:

- `-` minus,
- `_` underscore,
- `\d` numeric chars,
- `\a` lower case chars.
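
The char class rules above can be tried with a minimal sketch like the following.
It only uses `regex_opt` and `match_string`, which are described later in this
document; the sample inputs are hypothetical and chosen only for illustration.

```v
import regex

fn main() {
	// `[a-c\d]+` : one or more chars that are `a`, `b`, `c` or a digit
	mut re := regex.regex_opt(r'[a-c\d]+') or { panic(err) }
	// hypothetical sample inputs
	for txt in ['abc123', 'xyz'] {
		start, end := re.match_string(txt)
		if start >= 0 {
			println('${txt} => [${txt[start..end]}]')
		} else {
			println('${txt} => no match')
		}
	}
}
```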
### Meta-chars
A meta-char is specified by a backslash before a character.
For example `\w` is the meta-char `w`.

A meta-char can match different types of characters.

* `\w` matches a word char `[a-zA-Z0-9_]`
* `\W` matches a non word char
* `\d` matches a digit `[0-9]`
* `\D` matches a non digit
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
* `\S` matches a non space char
* `\a` matches only a lowercase char `[a-z]`
* `\A` matches only an uppercase char `[A-Z]`
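
As a small illustration of the meta-chars listed above, the following sketch
combines a few of them in one query. The input string is a hypothetical sample;
the functions used are the ones documented later in this README.

```v
import regex

fn main() {
	// `\A\a+` : one uppercase char followed by one or more lowercase chars
	// `\s`    : one space char
	// `\d+`   : one or more digits
	mut re := regex.regex_opt(r'\A\a+\s\d+') or { panic(err) }
	txt := 'Room 101' // hypothetical sample input
	start, end := re.match_string(txt)
	println('start: ${start}, end: ${end}')
}
```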
### Quantifier
Each token can have a quantifier, which specifies how many times the token
must be matched.

#### **Short quantifiers**

- `?` matches 0 or 1 time, for example, `a?b` matches both `ab` or `b`
- `+` matches *at least* 1 time, for example, `a+` matches both `aaa` or `a`
- `*` matches 0 or more times, for example, `a*b` matches `aaab`, `ab` or `b`

#### **Long quantifiers**

- `{x}` matches exactly x times, `a{2}` matches `aa`, but not `aaa` or `a`
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa`, not `a`
- `{,max}` matches at least 0 times and at maximum max times,
  for example, `a{,2}` matches `a` and `aa`, but doesn't match `aaa`
- `{min,max}` matches from min times, to max times, for example
  `a{2,3}` matches `aa` and `aaa`, but doesn't match `a` or `aaaa`

A long quantifier may have a `greedy off` flag, that is the `?`
character after the brackets. `{2,4}?` means to match the minimum
number of possible tokens, in this case 2.
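
A quick way to check how a quantifier behaves is to run the same query against
several inputs. The sketch below does this for a `{min,max}` quantifier; the
query and the inputs are hypothetical samples, and the program simply prints
whatever indexes `match_string` returns.

```v
import regex

fn main() {
	// `a{2,3}b` : two or three `a` followed by one `b`
	mut re := regex.regex_opt(r'a{2,3}b') or { panic(err) }
	// hypothetical sample inputs
	for txt in ['ab', 'aab', 'aaab'] {
		start, end := re.match_string(txt)
		println('${txt} => start: ${start}, end: ${end}')
	}
}
```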
### Dot char
The dot is a particular meta-char that matches "any char".

It is simpler to explain it with an example:

Suppose you have `abccc ddeef` as a source string that you want to parse
with a regex. The following table shows the query strings and the result of
parsing the source string.

| query string | result      |
|--------------|-------------|
| `.*c`        | `abc`       |
| `.*dd`       | `abccc dd`  |
| `ab.*e`      | `abccc dde` |
| `ab.{3} .*e` | `abccc dde` |

The dot matches any character, until the next token match is satisfied.

**Important Note:** *Consecutive dots, for example `...`, are not allowed.*
*This will cause a syntax error. Use a quantifier instead.*

### OR token
The token `|` means a logic OR operation between two consecutive tokens,
i.e. `a|b` matches a character that is `a` or `b`.

The OR token can work in a "chained way": `a|(b)|cd` means test first `a`,
if the char is not `a`, then test the group `(b)`, and if the group doesn't
match either, finally test the token `c`.

NB: **unlike in PCRE, the OR operation works at token level!**
It doesn't work at concatenation level!

NB2: **Two char classes with an `OR` in the middle is a syntax error.**

That also means that a query string like `abc|bde` is not equal to
`(abc)|(bde)`, but instead to `ab(c|b)de`.
The OR operation works only for `c|b`, not at char concatenation level.
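
To see the token-level OR in practice, the two query strings discussed above can
be run side by side. This is only a sketch: the input strings are hypothetical
samples, and the program prints the indexes returned by `match_string` so the
two behaviours can be compared.

```v
import regex

fn main() {
	// hypothetical sample inputs, chosen only for illustration
	samples := ['abc', 'bde', 'abcde', 'abbde']
	// the first query applies the OR to the tokens `c` and `b` only,
	// the second one applies it to the two groups
	for query in [r'abc|bde', r'(abc)|(bde)'] {
		mut re := regex.regex_opt(query) or { panic(err) }
		for txt in samples {
			start, end := re.match_string(txt)
			println('query: ${query} text: ${txt} => (${start}, ${end})')
		}
	}
}
```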
### Groups
Groups are a method to create complex patterns with repetitions of blocks
of tokens. The groups are delimited by round brackets `( )`. Groups can be
nested. Like all other tokens, groups can have a quantifier too.

`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.

`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`

Let's analyze this last case. First we have the group `#0`, that is the
outermost round brackets `(...)+`. This group has a quantifier `+`, which says to
match its content *at least one time*.

Then we have a simple char token `c`, and a second group `#1`: `(pa)+`.
This group also tries to match the sequence `pa`, *at least one time*,
as specified by the `+` quantifier.

Then, we have another simple token `z` and another simple token ` ?`,
i.e. the space char (ascii code 32) followed by the `?` quantifier,
which means that the preceding space should be matched 0 or 1 time.

This explains why the `(c(pa)+z ?)+` query string
can match `cpaz cpapaz cpapapaz`.

In this implementation the groups are "capture groups". This means that the
last temporal result for each group can be retrieved from the `RE` struct.

The "capture groups" are stored as indexes in the field `groups`,
that is an `[]int` inside the `RE` struct.

**example:**
```v oksyntax
text := 'cpaz cpapaz cpapapaz'
query := r'(c(pa)+z ?)+'
mut re := regex.regex_opt(query) or { panic(err) }
println(re.get_query())
// #0(c#1(pa)+z ?)+
// #0 and #1 are the ids of the groups, are shown if re.debug is 1 or 2
start, end := re.match_string(text)
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]
mut gi := 0
for gi < re.groups.len {
	if re.groups[gi] >= 0 {
		println('${gi / 2} :[${text[re.groups[gi]..re.groups[gi + 1]]}]')
	}
	gi += 2
}
// groups captured
// 0 :[cpapapaz]
// 1 :[pa]
```
**note:** *to show the `group id number` in the result of `get_query()`,*
*the `debug` flag of the RE object must be `1` or `2`*

In order to simplify the use of the captured groups, it is possible to use the
utility function `get_group_list`.
This function returns a list of groups using this support struct:
```v oksyntax
pub struct Re_group {
pub:
	start int = -1
	end   int = -1
}
```
Here is an example of its use:
```v oksyntax
/*
This simple function converts an HTML RGB value with 3 or 6 hex digits to
a u32 value. This function is not optimized and is only for didactic
purposes. Example: #A0B0CC #A9F
*/
fn convert_html_rgb(in_col string) u32 {
	mut n_digit := if in_col.len == 4 { 1 } else { 2 }
	mut col_mul := if in_col.len == 4 { 4 } else { 0 }
	// this is the regex query, it uses the V string interpolation to customize the regex query
	// NOTE: if you want to use escaped codes you must use the r"" (raw) strings,
	// *** please remember that the V interpolation doesn't work on raw strings. ***
	query := '#([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})'
	mut re := regex.regex_opt(query) or { panic(err) }
	start, end := re.match_string(in_col)
	println('start: $start, end: $end')
	mut res := u32(0)
	if start >= 0 {
		group_list := re.get_group_list() // this is the utility function
		r := ('0x' + in_col[group_list[0].start..group_list[0].end]).int() << col_mul
		g := ('0x' + in_col[group_list[1].start..group_list[1].end]).int() << col_mul
		b := ('0x' + in_col[group_list[2].start..group_list[2].end]).int() << col_mul
		println('r: $r g: $g b: $b')
		res = u32(r) << 16 | u32(g) << 8 | u32(b)
	}
	return res
}
```
Other utility functions are `get_group_by_id` and `get_group_bounds_by_id`,
which return the string and the bounds of a group using its `id`:
```v ignore
txt := "my used string...."
for g_index := 0; g_index < re.group_count ; g_index++ {
	println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
	bounds: ${re.get_group_bounds_by_id(g_index)}")
}
```
More helper functions are listed in the **Groups query functions** section.
### Groups Continuous saving
In particular situations, it is useful to have a continuous group saving.
This is possible by initializing the `group_csave` field in the `RE` struct.

This feature allows you to collect data in a continuous/streaming way.

In the example, we can pass a text, followed by an integer list,
that we wish to collect. To achieve this task, we can use the continuous
group saving, by enabling the right flag: `re.group_csave_flag = true`.

The `.group_csave` array will then be filled following this logic
(`n` is the zero-based index of the saved record):

- `re.group_csave[0]` - number of total saved records
- `re.group_csave[1+n*3]` - id of the saved group
- `re.group_csave[2+n*3]` - start index in the source string of the saved group
- `re.group_csave[3+n*3]` - end index in the source string of the saved group

The regex will save groups until it finishes, or finds that the array has no
more space. If the space ends, no error is raised, and further records will
not be saved.

```v ignore
import regex

fn main(){
	txt   := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }
	//println(re.get_code())   // uncomment to see the print of the regex execution code
	re.debug=2  // enable maximum log
	println("String: ${txt}")
	println("Query : ${re.get_query()}")
	re.debug=0  // disable log
	re.group_csave_flag = true
	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
		println("cg: $re.group_csave")
		mut cs_i := 1
		for cs_i < re.group_csave[0]*3 {
			g_id := re.group_csave[cs_i]
			st   := re.group_csave[cs_i+1]
			en   := re.group_csave[cs_i+2]
			println("cg[$g_id] $st $en:[${txt[st..en]}]")
			cs_i += 3
		}
	}
}
```
The output will be:
```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
cg: [8, 0, 0, 4, 1, 7, 11, 1, 11, 16, 1, 16, 22, 1, 22, 28, 1, 28, 37, 1, 37, 42, 1, 42, 46]
cg[0] 0 4:[http]
cg[1] 7 11:[www.]
cg[1] 11 16:[ciao.]
cg[1] 16 22:[mondo/]
cg[1] 22 28:[hello/]
cg[1] 28 37:[pippo12_/]
cg[1] 37 42:[pera.]
cg[1] 42 46:[html]
```
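The flat layout of `group_csave` can also be unpacked into records before use.
The following is a minimal sketch, not part of the `regex` module: the
`CsaveRecord` struct and the `decode_csave` function are hypothetical helpers
introduced only for illustration, and the input array is the `cg:` array
printed in the output above.

```v
// CsaveRecord is a hypothetical helper struct, it is not part of the regex module
struct CsaveRecord {
	id    int
	start int
	end   int
}

// decode_csave (hypothetical helper) unpacks the flat layout:
// [count, id0, start0, end0, id1, start1, end1, ...]
fn decode_csave(csave []int) []CsaveRecord {
	mut res := []CsaveRecord{}
	if csave.len == 0 {
		return res
	}
	for n in 0 .. csave[0] {
		base := 1 + n * 3
		// stop if the array is shorter than the declared number of records
		if base + 2 >= csave.len {
			break
		}
		res << CsaveRecord{
			id: csave[base]
			start: csave[base + 1]
			end: csave[base + 2]
		}
	}
	return res
}

fn main() {
	// the `cg:` array from the output above
	csave := [8, 0, 0, 4, 1, 7, 11, 1, 11, 16, 1, 16, 22, 1, 22, 28, 1, 28, 37, 1, 37, 42, 1,
		42, 46]
	for rec in decode_csave(csave) {
		println('group ${rec.id}: ${rec.start}..${rec.end}')
	}
}
```

In a real program the argument would be `re.group_csave`, read after a
successful `match_string` with `re.group_csave_flag = true`.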
### Named capturing groups
This regex module partially supports the question mark `?` PCRE syntax for groups.

`(?:abcd)` **non capturing group**: the content of the group will not be saved.

`(?P<mygroup>abcdef)` **named group:** the group content is saved and labeled
as `mygroup`.

The label of the groups is saved in the `group_map` of the `RE` struct,
that is a map from `string` to `int`, where the value is the index in
`group_csave` list of indexes.

Here is an example of how to use them:
```v ignore
import regex

fn main(){
	txt   := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }
	//println(re.get_code())   // uncomment to see the print of the regex execution code
	re.debug=2  // enable maximum log
	println("String: ${txt}")
	println("Query : ${re.get_query()}")
	re.debug=0  // disable log
	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	for name in re.group_map.keys() {
		println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
		bounds: ${re.get_group_bounds_by_name(name)}")
	}
}
```
Output:
```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
group:'format' => [http] bounds: (0, 4)
group:'token' => [html] bounds: (42, 46)
```
In order to simplify the use of the named groups, it is possible to
use a name map in the `re` struct, using the function `re.get_group_by_name`.
Here is a more complex example of using them:
```v oksyntax
// This function demonstrates the use of the named groups
fn convert_html_rgb_n(in_col string) u32 {
	mut n_digit := if in_col.len == 4 { 1 } else { 2 }
	mut col_mul := if in_col.len == 4 { 4 } else { 0 }
	query := '#(?P<red>[a-fA-F0-9]{$n_digit})' + '(?P<green>[a-fA-F0-9]{$n_digit})' +
		'(?P<blue>[a-fA-F0-9]{$n_digit})'
	mut re := regex.regex_opt(query) or { panic(err) }
	start, end := re.match_string(in_col)
	println('start: $start, end: $end')
	mut res := u32(0)
	if start >= 0 {
		// get_group_bounds_by_name returns the (start, end) indexes of a named group
		red_s, red_e := re.get_group_bounds_by_name('red')
		r := ('0x' + in_col[red_s..red_e]).int() << col_mul
		green_s, green_e := re.get_group_bounds_by_name('green')
		g := ('0x' + in_col[green_s..green_e]).int() << col_mul
		blue_s, blue_e := re.get_group_bounds_by_name('blue')
		b := ('0x' + in_col[blue_s..blue_e]).int() << col_mul
		println('r: $r g: $g b: $b')
		res = u32(r) << 16 | u32(g) << 8 | u32(b)
	}
	return res
}
```
Other utilities are `get_group_by_name` and `get_group_bounds_by_name`,
which return the string and the bounds of a group using its `name`:
```v ignore
txt := "my used string...."
for name in re.group_map.keys() {
	println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
	bounds: ${re.get_group_bounds_by_name(name)}")
}
```
### Groups query functions
These functions are helpers to query the captured groups
```v ignore
// get_group_bounds_by_name get a group boundaries by its name
pub fn (re RE) get_group_bounds_by_name(group_name string) (int, int)

// get_group_by_name get a group string by its name
pub fn (re RE) get_group_by_name(in_txt string, group_name string) string

// get_group_bounds_by_id get a group boundaries by its id
pub fn (re RE) get_group_bounds_by_id(group_id int) (int,int)

// get_group_by_id get a group string by its id
pub fn (re RE) get_group_by_id(in_txt string, group_id int) string

struct Re_group {
pub:
	start int = -1
	end   int = -1
}

// get_group_list return a list of Re_group for the found groups
pub fn (re RE) get_group_list() []Re_group
```
## Flags
It is possible to set some flags in the regex parser that change
the behavior of the parser itself.

```v ignore
// example of flag settings
mut re := regex.new()
re.flag = regex.f_bin
```

- `f_bin`: parse a string as bytes, utf-8 management disabled.
- `f_efm`: exit on the first char match in the query, used by the
  find function.
- `f_ms`: matches only if the index of the start match is 0,
  same as `^` at the start of the query string.
- `f_me`: matches only if the end index of the match is the last char
  of the input string, same as `$` at the end of the query string.
- `f_nl`: stop the matching if a new line char `\n` or `\r` is found.

## Functions
### Initializer
These functions are helpers that create the `RE` struct;
a `RE` struct can also be created manually if needed.

#### **Simplified initializer**

```v ignore
// regex_opt creates a regex object from the query string and compiles it
pub fn regex_opt(in_query string) ?RE
```

#### **Base initializer**

```v ignore
// new creates a regex of small size, usually sufficient for ordinary use
pub fn new() RE
```

#### **Custom initialization**
For some particular needs, it is possible to initialize a fully customized regex:
```v ignore
pattern := r"ab(.*)(ac)"
// init custom regex
mut re := regex.RE{}
// max program length, can not be longer than the pattern
re.prog = []regex.Token{len: pattern.len + 1}
// there can not be more char classes than the length of the pattern
re.cc = []regex.CharClass{len: pattern.len}

re.group_csave_flag = false          // true enables continuous group saving if needed
re.group_max_nested = 128            // set max 128 group nested possible
re.group_max        = pattern.len>>1 // we can't have more groups than half of the pattern length
re.group_stack = []int{len: re.group_max, init: -1}
re.group_data  = []int{len: re.group_max, init: -1}
```
### Compiling
After an initializer is used, the regex expression must be compiled with:
```v ignore
// compile_opt compiles the regex, returning an error if the compilation fails
pub fn (mut re RE) compile_opt(in_txt string) ?
```
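Putting the base initializer and the compile step together, a minimal sketch
could look like this. The query is the one used in the custom initialization
snippet above; the input string is a hypothetical sample, and error handling
is reduced to printing the error.

```v
import regex

fn main() {
	query := r'ab(.*)(ac)'
	// create the RE struct with the base initializer
	mut re := regex.new()
	// compile the query, stop here if the query has a syntax error
	re.compile_opt(query) or {
		println(err)
		return
	}
	// hypothetical sample input
	start, end := re.match_string('ab__ac')
	println('start: ${start}, end: ${end}')
}
```

When no custom setup is needed, `regex_opt` performs both steps in one call.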
### Matching Functions
These are the matching functions:

```v ignore
// match_string tries to match the input string,
// returns start and end index if found, else start is -1
pub fn (mut re RE) match_string(in_txt string) (int,int)
```

## Find and Replace
There are the following find and replace functions:
#### Find functions
```v ignore
// find tries to find the first match in the input string,
// returns start and end index if found, else start is -1
pub fn (mut re RE) find(in_txt string) (int,int)

// find_all finds all the "non overlapping" occurrences of the matching pattern,
// returns a list of start end indexes like: [3,4,6,8]
// the matches are [3,4] and [6,8]
pub fn (mut re RE) find_all(in_txt string) []int

// find_all_str finds all the "non overlapping" occurrences of the matching pattern,
// returns a list of strings
// the result is like ["first match","second match"]
pub fn (mut re RE) find_all_str(in_txt string) []string
```
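The three find functions can be compared with a short sketch like the one
below. The query and the input text are hypothetical samples; note how the
result of `find_all` is walked two items at a time, since it is a flat list of
start/end pairs.

```v
import regex

fn main() {
	txt := 'cat, dog and cow' // hypothetical sample input
	mut re := regex.regex_opt(r'[a-z]+') or { panic(err) }

	// first match only
	start, end := re.find(txt)
	if start >= 0 {
		println('first: [${txt[start..end]}]')
	}

	// all the matches, as a flat list of start/end pairs
	idx := re.find_all(txt)
	for i := 0; i + 1 < idx.len; i += 2 {
		println('(${idx[i]}, ${idx[i + 1]}) => [${txt[idx[i]..idx[i + 1]]}]')
	}

	// all the matches, as strings
	println(re.find_all_str(txt))
}
```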
#### Replace functions
```v ignore
// replace returns a string where the matches are replaced with the repl string,
// this function supports groups in the replace string
pub fn (mut re RE) replace(in_txt string, repl string) string
```
The replace string can include group references:
```v ignore
txt := "Today it is a good day."
query := r'(a\w)[ ,.]'
mut re := regex.regex_opt(query)?
res := re.replace(txt, r"__[\0]__")
```
In this example we used the group `0` in the replace string: `\0`. The result will be:
```
Today it is a good day. => Tod__[ay]__it is a good d__[ay]__
```
**Note:** only groups from `0` to `9` can be used in the replace string.

If the usage of `groups` in the replace process is not needed, it is possible
to use a quick function:
```v ignore
// replace_simple return a string where the matches are replaced with the replace string
pub fn (mut re RE) replace_simple(in_txt string, repl string) string
```
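A minimal sketch of `replace_simple` follows. The query, the input text and
the replacement string are hypothetical samples chosen for illustration.

```v
import regex

fn main() {
	txt := 'one 1 two 22 three 333' // hypothetical sample input
	mut re := regex.regex_opt(r'\d+') or { panic(err) }
	// every run of digits is replaced with the literal string `#`
	println(re.replace_simple(txt, '#'))
}
```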
#### Custom replace function
For complex find and replace operations, you can use `replace_by_fn`.
The `replace_by_fn` function uses a custom replace callback function, thus
allowing customizations. The custom callback function is called for
every non overlapped find.

The custom callback function must be of the type:
```v ignore
// type of function used for custom replace
// in_txt source text
// start index of the start of the match in in_txt
// end index of the end of the match in in_txt
// --- the match is in in_txt[start..end] ---
fn (re RE, in_txt string, start int, end int) string
```
The following example will clarify its usage:
```v ignore
import regex

// customized replace functions
// it will be called on each non overlapped find
fn my_repl(re regex.RE, in_txt string, start int, end int) string {
	g0 := re.get_group_by_id(in_txt, 0)
	g1 := re.get_group_by_id(in_txt, 1)
	g2 := re.get_group_by_id(in_txt, 2)
	return "*$g0*$g1*$g2*"
}

fn main(){
	txt   := "today [John] is gone to his house with (Jack) and [Marie]."
	query := r"(.)(\A\w+)(.)"

	mut re := regex.regex_opt(query) or { panic(err) }

	result := re.replace_by_fn(txt, my_repl)
	println(result)
}
```
Output:
```
today *[*John*]* is gone to his house with *(*Jack*)* and *[*Marie*]*.
```
## Debugging
This module has a few small utilities to help you write regex patterns.

### **Syntax errors highlight**

The next example code shows how to visualize regex pattern syntax errors
in the compilation phase:
```v oksyntax
query := r'ciao da ab[ab-]'
// there is an error, a range not closed!!
mut re := new()
re.compile_opt(query) or { println(err) }
// output!!
// query: ciao da ab[ab-]
// err : ----------^
// ERROR: ERR_SYNTAX_ERROR
```
### **Compiled code**

It is possible to view the compiled code calling the function `get_code()`.
The result will be something like this:
```
========================================
v RegEx compiler v 1.0 alpha output:
PC: 0 ist: 92000000 ( GROUP_START #:0 { 1, 1}
PC: 1 ist: 98000000 . DOT_CHAR nx chk: 4 { 1, 1}
PC: 2 ist: 94000000 ) GROUP_END #:0 { 1, 1}
PC: 3 ist: 92000000 ( GROUP_START #:1 { 1, 1}
PC: 4 ist: 90000000 [\A] BSLS { 1, 1}
PC: 5 ist: 90000000 [\w] BSLS { 1,MAX}
PC: 6 ist: 94000000 ) GROUP_END #:1 { 1, 1}
PC: 7 ist: 92000000 ( GROUP_START #:2 { 1, 1}
PC: 8 ist: 98000000 . DOT_CHAR nx chk: -1 last! { 1, 1}
PC: 9 ist: 94000000 ) GROUP_END #:2 { 1, 1}
PC: 10 ist: 88000000 PROG_END { 0, 0}
========================================
```
`PC`:`int` is the program counter or step of execution, each single step is a token.

`ist`:`hex` is the token instruction id.

`[a]` is the char used by the token.

`query_ch` is the type of token.

`{m,n}` is the quantifier, the greedy off flag `?` will be shown if present in the token

### **Log debug**
The log debugger allows printing the status of the regex parser while the
parser is running. It is possible to have two different levels of
debug information: 1 is normal, while 2 is verbose.

Here is an example:

*normal* - list only the token instruction with their values
```ignore
// re.debug = 1 // log level normal
flags: 00000000
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 10 PROG_END
```
*verbose* - list all the instructions and states of the parser
```ignore
flags: 00000000
# 0 s: start PC: NA
# 1 s: ist_next PC: NA
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 3 s: ist_quant_p PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
# 4 s: ist_next PC: NA
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 6 s: ist_quant_p PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 8 s: ist_quant_p PC: i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
# 9 s: ist_next PC: NA
# 10 PROG_END
# 11 PROG_END
```
The columns have the following meaning:

`# 2` number of actual steps from the start of parsing

`s: ist_next` state of the present step

`PC: 1` program counter of the step

`=>7fffffff` hex code of the instruction

`i,ch,len:[ 0,'a',1]` `i` index in the source string, `ch` the char parsed,
`len` the length in bytes of the char parsed

`f.m:[ 0, 1]` `f` index of the first match in the source string, `m` index that is actually matching

`query_ch: [b]` token in use and its char

`{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition,
`?` is the greedy off flag if present.

### **Custom Logger output**
The debug functions output to `stdout` by default;
it is possible to provide an alternative output by setting a custom
output function:

```v oksyntax
// custom print function, the input will be the regex debug string
fn custom_print(txt string) {
	println('my log: $txt')
}

mut re := new()
re.log_func = custom_print
// every debug output from now will call this function
```

## Example code
Here is an example that performs some basic string matching:

```v ignore
import regex

fn main(){
	txt   := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }

	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
		for g_index := 0; g_index < re.group_count ; g_index++ {
			println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
			bounds: ${re.get_group_bounds_by_id(g_index)}")
		}
		for name in re.group_map.keys() {
			println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
			bounds: ${re.get_group_bounds_by_name(name)}")
		}
	} else {
		println("No Match")
	}
}
```
Here is an example of total customization of the regex environment creation:
```v ignore
import regex

fn main(){
	txt   := "today John is gone to his house with Jack and Marie."
	query := r"(?:(?P<word>\A\w+)|(?:\a\w+)[\s.]?)+"

	// init regex
	mut re := regex.RE{}
	// max program length, can not be longer than the query
	re.prog = []regex.Token{len: query.len + 1}
	// there can not be more char classes than the length of the query
	re.cc = []regex.CharClass{len: query.len}
	// enable continuous group saving
	re.group_csave_flag = true
	// set max 128 group nested
	re.group_max_nested = 128
	// we can't have more groups than half of the query length
	re.group_max = query.len>>1

	// compile the query
	re.compile_opt(query) or { panic(err) }

	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	// show results for continuous group saving
	if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
		println("cg: $re.group_csave")
		mut cs_i := 1
		for cs_i < re.group_csave[0]*3 {
			g_id := re.group_csave[cs_i]
			st   := re.group_csave[cs_i+1]
			en   := re.group_csave[cs_i+2]
			println("cg[$g_id] $st $en:[${txt[st..en]}]")
			cs_i += 3
		}
	}

	// show results for captured groups
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
		for g_index := 0; g_index < re.group_count ; g_index++ {
			println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
			bounds: ${re.get_group_bounds_by_id(g_index)}")
		}
		for name in re.group_map.keys() {
			println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
			bounds: ${re.get_group_bounds_by_name(name)}")
		}
	} else {
		println("No Match")
	}
}
```
More examples are available in the test code for the `regex` module,
see `vlib/regex/regex_test.v`.