mirror of https://github.com/vlang/v.git synced 2023-08-10 21:13:21 +03:00

docs_ci: check all md files except thirdparty (#6855)

Lukas Neubert
2020-11-18 18:28:28 +01:00
committed by GitHub
parent d8f64f516b
commit df4165c7ee
20 changed files with 373 additions and 221 deletions


@@ -8,10 +8,12 @@ Write here the introduction... not today!! -_-
## Basic assumption
In this release, some assumptions were made while writing the code,
and they are valid for all the features.
1. The matching stops at the end of the string, not at the newline chars.
2. The basic elements of this regex engine are the tokens.
In a query string, a simple char is a token. The token is the atomic unit of this regex engine.
## Match positional limiter
@@ -37,19 +39,26 @@ The cc matches all the chars specified inside, it is delimited by square bracket
the sequence of chars in the class is evaluated with an OR operation.
For example, the cc `[abc]` matches any char that is `a` or `b` or `c`
but doesn't match `C` or `z`.

Inside a cc it is possible to specify a "range" of chars,
for example `[ad-f]` is equivalent to writing `[adef]`.

A cc can have several ranges at the same time, like `[a-zA-Z0-9]`, which matches all the lowercase,
uppercase and numeric chars.

It is possible to negate the cc using the caret char at the start of the cc, like `[^abc]`,
which matches every char that is not `a` or `b` or `c`.

A cc can contain meta-chars, like `[a-z\d]`, which matches all the lowercase latin chars `a-z`
and all the digits `\d`.

It is possible to mix all the properties of the char class together.

**Note:** In order to match the `-` (minus) char, it must be located at the first position
in the cc, for example `[-_\d\a]` will match `-` minus, `_` underscore, `\d` numeric chars,
`\a` lower case chars.
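
The char class rules above can be tried with a minimal sketch like the following. It assumes the `regex.regex_opt` and `match_string` calls used in the examples of this document; `matches` is a hypothetical helper introduced only for illustration.

```v
import regex

// matches is a hypothetical helper (not part of the regex module):
// it returns true when `query` matches `text` from its first char.
fn matches(query string, text string) bool {
	mut re := regex.regex_opt(query) or { panic(err) }
	start, _ := re.match_string(text)
	// match_string returns a negative start index when nothing is found
	return start >= 0
}

fn main() {
	println(matches(r'[abc]', 'b')) // `b` is listed in the cc
	println(matches(r'[abc]', 'C')) // `C` is not listed, the cc is case sensitive
	println(matches(r'[ad-f]', 'e')) // `e` falls inside the range `d-f`
	println(matches(r'[^abc]', 'z')) // negated cc, `z` is not `a`, `b` or `c`
}
```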
### Meta-chars
@@ -63,7 +72,7 @@ A meta-char can match different type of chars.
* `\D` matches a non digit
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
* `\S` matches a non space char
* `\a` matches only a lowercase char `[a-z]`
* `\A` matches only an uppercase char `[A-Z]`
### Quantifier
@@ -80,16 +89,21 @@ Each token can have a quantifier that specify how many times the char can or mus
- `{x}` matches exactly x times, `a{2}` matches `aa` but doesn't match `aaa` or `a`
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa` but doesn't match `a`
- `{,max}` matches at least 0 times and at most max times,
`a{,2}` matches `a` and `aa` but doesn't match `aaa`
- `{min,max}` matches from min times to max times,
`a{2,3}` matches `aa` and `aaa` but doesn't match `a` or `aaaa`

A long quantifier may have a `greedy off` flag, that is the `?` char after the brackets:
`{2,4}?` means to match the minimum possible number of tokens, in this case 2.
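
As a rough sketch of how the long quantifiers could be explored, the snippet below prints the text matched by a few queries. It assumes the `regex.regex_opt` and `match_string` calls shown in the examples of this document; `matched` is a hypothetical helper, and the printed ranges depend on the engine version, so no output is claimed here.

```v
import regex

// matched is a hypothetical helper (not part of the regex module):
// it returns the matched part of `text`, or an empty string on no match.
fn matched(query string, text string) string {
	mut re := regex.regex_opt(query) or { panic(err) }
	start, end := re.match_string(text)
	if start < 0 {
		return ''
	}
	return text[start..end]
}

fn main() {
	println('[' + matched(r'a{2}', 'a') + ']') // fewer than 2 `a`
	println('[' + matched(r'a{2,}', 'aaa') + ']') // at least 2 `a`
	println('[' + matched(r'a{2,4}', 'aaaa') + ']') // greedy
	println('[' + matched(r'a{2,4}?', 'aaaa') + ']') // greedy off
}
```

Comparing the last two lines of output is a quick way to see the effect of the `greedy off` flag.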
### dot char
The dot is a particular meta-char that matches "any char".
It is simpler to explain it with an example:

Suppose we have `abccc ddeef` as the source string to parse with a regex.
The following table shows the query strings and the result of parsing the source string.

| query string | result |
| ------------ | ------ |
@@ -102,39 +116,50 @@ the dot char matches any char until the next token match is satisfied.
### OR token
The token `|` is a logic OR operation between two consecutive tokens,
`a|b` matches a char that is `a` or `b`.

The OR token can work in a "chained way": `a|(b)|cd` tests first `a`; if the char is not `a`,
it then tests the group `(b)`, and if the group doesn't match, it tests the token `c`.

**Note: the OR works at token level! It doesn't work at concatenation level!**

A query string like `abc|bde` is not equal to `(abc)|(bde)`!
The OR works only on `c|b`, not at char concatenation level.
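
The difference between the token-level OR and the grouped form can be checked with a small sketch. It assumes the `regex.regex_opt` and `match_string` calls used in the examples of this document; `matches` is a hypothetical helper, and the results are printed rather than stated because they depend on the engine version.

```v
import regex

// matches is a hypothetical helper (not part of the regex module):
// it returns true when `query` matches `text` from its first char.
fn matches(query string, text string) bool {
	mut re := regex.regex_opt(query) or { panic(err) }
	start, _ := re.match_string(text)
	return start >= 0
}

fn main() {
	// token level OR: only `c|b` is alternated, the other chars are concatenated
	println(matches(r'abc|bde', 'abcde'))
	println(matches(r'abc|bde', 'bde'))
	// groups make the OR work on whole sequences
	println(matches(r'(abc)|(bde)', 'abc'))
	println(matches(r'(abc)|(bde)', 'bde'))
}
```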
### Groups
Groups are a method to create complex patterns with repetition of blocks of tokens.
Groups are delimited by round brackets `( )`;
they can be nested and can have a quantifier, like all the tokens.

`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.

`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`.

Let's analyze this last case. First we have the group `#0`,
that is the outermost round brackets `(...)+`;
this group has a quantifier `+` that says to match its content at least one time.

Then we have a simple char token `c` and a second group, number `#1`: `(pa)+`.
This group tries to match the sequence `pa` at least one time, as specified by the `+` quantifier.

Then we have another simple token `z` and another simple token ` ?`,
that is the space char (ascii code 32) followed by the `?` quantifier,
which says to capture the space char 0 or 1 time.

This explains why the `(c(pa)+z ?)+` query string can match `cpaz cpapaz cpapapaz`.

In this implementation the groups are "capture groups":
the last temporal result for each group can be retrieved from the `RE` struct.

The "capture groups" are stored as couples of indexes in the field `groups`,
that is an `[]int` inside the `RE` struct.
**example:**
```v oksyntax
text := "cpaz cpapaz cpapapaz"
query:= r"(c(pa)+z ?)+"
mut re := regex.regex_opt(query) or { panic(err) }
@@ -157,16 +182,19 @@ for gi < re.groups.len {
// 1 :[pa]
```
**note:** *to show the `group id number` in the result of `get_query()`,*
*the flag `debug` of the RE object must be `1` or `2`*
### Groups Continuous saving
In particular situations it is useful to have a continuous save of the groups;
this is possible by initializing the saving array field in the `RE` struct: `group_csave`.

This feature allows collecting data in a continuous way.

In the example we pass a text followed by an integer list that we want to collect.
To achieve this task we can use the continuous saving of the group,
which saves each captured group in an array that we set with: `re.group_csave = [-1].repeat(3*20+1)`.
The array will be filled with the following logic:
@@ -176,9 +204,10 @@ The array will be filled with the following logic:
`re.group_csave[2+n*3]` start index in the source string of the saved group

`re.group_csave[3+n*3]` end index in the source string of the saved group

The regex saves until it finishes or finds that the array has no more space.
If the space runs out no error is raised; further records will not be saved.
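
A minimal sketch of how the saved records could be read back is shown below. It assumes the layout suggested by the raw array printed in the named capturing groups output of this document: the record count at index 0, followed by triples of group id, start index and end index. `SavedGroup` and `decode_csave` are hypothetical names, not part of the `regex` module.

```v
// SavedGroup is a hypothetical struct holding one decoded record
struct SavedGroup {
	id    int // capture group id
	start int // start index in the source string
	end   int // end index in the source string
}

// decode_csave is a hypothetical helper that turns the flat
// `group_csave` array into a list of records.
fn decode_csave(csave []int) []SavedGroup {
	mut res := []SavedGroup{}
	// index 0 holds the number of saved records, -1 if never filled
	if csave.len == 0 || csave[0] <= 0 {
		return res
	}
	for n in 0 .. csave[0] {
		base := 1 + n * 3
		if base + 2 >= csave.len {
			break // incomplete record, stop reading
		}
		res << SavedGroup{
			id: csave[base]
			start: csave[base + 1]
			end: csave[base + 2]
		}
	}
	return res
}

fn main() {
	// first records of the raw array shown in this document
	raw := [3, 0, 0, 4, 1, 7, 11, 1, 11, 16]
	for g in decode_csave(raw) {
		println('cg id: ${g.id} [${g.start}, ${g.end}]')
	}
}
```

Keeping the decoding in a separate function avoids repeating the `1+n*3` index arithmetic wherever the saved groups are used.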
```v oksyntax
fn example2() {
test_regex()
@@ -234,7 +263,7 @@ cg id: 0 [4, 8] => [ 01,]
cg id: 0 [8, 11] => [23,]
cg id: 0 [11, 15] => [45 ,]
cg id: 0 [15, 19] => [56, ]
cg id: 0 [19, 21] => [78]
```
### Named capturing groups
@@ -245,13 +274,14 @@ This regex module support partially the question mark `?` PCRE syntax for groups
`(?P<mygroup>abcdef)` **named group:** the group content is saved and labeled as `mygroup`
The label of each group is saved in the `group_map` of the `RE` struct,
this is a map from `string` to `int` where the value is the index in the `group_csave` list of indexes.

Have a look at the example to see how to use them.
example:
```v oksyntax
import regex
fn main() {
test_regex()
@@ -270,8 +300,8 @@ fn main() {
q_str := re.get_query()
println("O.Query: $query")
println("Query : $q_str")
re.debug = 0
start, end := re.match_string(text)
if start < 0 {
err_str := re.get_parse_error_string(start)
@@ -331,7 +361,7 @@ cg id: 1 [22, 28] => [hello/]
cg id: 1 [28, 37] => [pippo12_/]
cg id: 1 [37, 42] => [pera.]
cg id: 1 [42, 46] => [html]
raw array: [8, 0, 0, 4, 1, 7, 11, 1, 11, 16, 1, 16, 22, 1, 22, 28, 1, 28, 37, 1, 37, 42, 1, 42, 46]
named capturing groups:
'format':[0, 4] => 'http'
'token':[42, 46] => 'html'
@@ -341,25 +371,27 @@ named capturing groups:
It is possible to set some flags in the regex parser that change the behavior of the parser itself.
```v oksyntax
// example of flag settings
mut re := regex.new()
re.flag = regex.F_BIN
```
- `F_BIN`: parse a string as bytes, utf-8 management disabled.
- `F_EFM`: exit on the first char matches in the query, used by the find function.
- `F_MS`: matches only if the index of the start match is 0,
same as `^` at the start of the query string.
- `F_ME`: matches only if the end index of the match is the last char of the input string,
same as `$` at the end of the query string.
- `F_NL`: stops the matching if a new line char `\n` or `\r` is found.
## Functions
### Initializer
These functions are helpers that create the `RE` struct;
a `RE` struct can also be created manually if needed.
#### **Simplified initializer**
@@ -378,7 +410,7 @@ pub fn new() RE
pub fn new_by_size(mult int) RE
```
After a base initializer is used, the regex expression must be compiled with:
```v oksyntax
// compile compiles the REgex returning an error if the compilation fails
pub fn (re mut RE) compile_opt(in_txt string) ?
```
@@ -387,7 +419,7 @@ pub fn (re mut RE) compile_opt(in_txt string) ?
These are the operative functions
```v oksyntax
// match_string try to match the input string, return start and end index if found else start is -1
pub fn (re mut RE) match_string(in_txt string) (int,int)
@@ -409,7 +441,7 @@ This module has few small utilities to help the writing of regex expressions.
The following example code shows how to visualize the syntax errors in the compilation phase:
```v oksyntax
query:= r"ciao da ab[ab-]" // there is an error, a range not closed!!
mut re := new()
@@ -425,7 +457,8 @@ re.compile_opt(query) or { println(err) }
### **Compiled code**
It is possible to view the compiled code by calling the function `get_query()`.
The result will be something like this:
```
========================================
@@ -495,21 +528,24 @@ the columns have the following meaning:
`PC: 1` program counter of the step
`=>7fffffff ` hex code of the instruction

`i,ch,len:[ 0,'a',1]` `i` index in the source string, `ch` the char parsed,
`len` the length in bytes of the char parsed

`f.m:[ 0, 1]` `f` index of the first match in the source string, `m` index currently being matched

`query_ch: [b]` token in use and its char

`{2,3}:1?` quantifier `{min,max}`, `:1` is the current repetition counter,
`?` is the greedy off flag if present.
### **Custom Logger output**
The debug functions write to `stdout` by default;
it is possible to provide an alternative output by setting a custom output function:
```v oksyntax
// custom print function, the input will be the regex debug string
fn custom_print(txt string) {
println("my log: $txt")
@@ -524,7 +560,7 @@ re.log_func = custom_print // every debug output from now will call this functi
Here is some simple code that performs basic string matching:
```v oksyntax
struct TestObj {
source string // source string to parse
query string // regex query string
@@ -545,18 +581,18 @@ fn example() {
for c,tst in tests {
mut re := regex.new()
re.compile_opt(tst.query) or { println(err) continue }
// print the query parsed with the groups ids
re.debug = 1 // set debug on at minimum level
println("#${c:2d} query parsed: ${re.get_query()}")
re.debug = 0
// do the match
start, end := re.match_string(tst.source)
if start >= 0 && end > start {
println("#${c:2d} found in: [$start, $end] => [${tst.source[start..end]}]")
}
// print the groups
mut gi := 0
for gi < re.groups.len {
@@ -564,7 +600,7 @@ fn example() {
println("group ${gi/2:2d} :[${tst.source[re.groups[gi]..re.groups[gi+1]]}]")
}
gi += 2
}
println("")
}
}
@@ -575,4 +611,3 @@ fn main() {
```
More example code is available in the test code for the `regex` module: `vlib\regex\regex_test.v`.