regex: bug fixes, docs

2023-08-10 21:13:21 +03:00 · 2020-01-18 07:38:00 +01:00 · 36660ce749
commit 36660ce749
parent ad7bc37672
3 changed files with 95 additions and 68 deletions
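
A minimal sketch of how one of the cases added to `regex_test.v` in this commit could be run by hand. It assumes the V compiler and `vlib/regex` as they were at this commit. `new_regex()`, `compile()` and `COMPILE_OK` all appear in the README part of this diff; `match_string()` does not appear in this diff and is an assumption about the module's API.

```v
import regex

fn main() {
	// query and input copied from a test item added in this commit
	query := r"\s(.*)"
	text := " abb"

	mut re := regex.new_regex()
	// compile returns (return code, index of the error in the query string)
	re_err, err_pos := re.compile(query)
	if re_err != regex.COMPILE_OK {
		println("compile error $re_err at position $err_pos")
		return
	}

	// ASSUMPTION: match_string(string) (int,int) is not shown in this diff
	start, end := re.match_string(text)
	println("start: $start end: $end")
}
```

The corresponding entry in the test suite, `TestItem{" abb",r"\s(.*)",0,4}`, records 0 and 4 as the expected start and end for this input.
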
--- a/vlib/regex/README.md
+++ b/vlib/regex/README.md
@@ -4,14 +4,14 @@

 ## introduction

-Write here the introduction
+Write here the introduction... not today!! -_-

 ## Basic assumption

-In this release, during the writing of the code some assumption are made and are valid for all the features.
+In this release, during the writing of the code some assumptions are made and are valid for all the features.

 1. The matching stops at the end of the string not at the newline chars.
-2. The basic element of this regex engine are the tokens, in query string a simple char is a token. The token is the atomic unit of this regex engine.
+2. The basic elements of this regex engine are the tokens, in a query string a simple char is a token. The token is the atomic unit of this regex engine.

 ## Match positional limiter

@@ -21,11 +21,11 @@ The module supports the following features:

 `^` (Caret.) Matches at the start of the string

-`?` Matches at the end of the string
+`$` Matches at the end of the string

 ## Tokens

-The tokens are the atomic unit used by this regex engine and can be ones of the following:
+The tokens are the atomic units used by this regex engine and can be ones of the following:

 ### Simple char

@@ -33,11 +33,11 @@ this token is a simple single character like `a`.

 ### Char class (cc)

-The cc match all the chars specified in its inside, it is delimited by square brackets `[ ]`
+The cc match all the chars specified inside, it is delimited by square brackets `[ ]`

 the sequence of chars in the class is evaluated with an OR operation.

-For example the following cc `[abc]` match any char that is `a` or `b` or `c` but doesn't match `C` or `z`.
+For example, the following cc `[abc]` match any char that is `a` or `b` or `c` but doesn't match `C` or `z`.

 Inside a cc is possible to specify a "range" of chars, for example `[ad-f]` is equivalent to write `[adef]`. 

@@ -68,17 +68,17 @@ A meta-char can match different type of chars.

 Each token can have a quantifier that specify how many times the char can or must be matched.

-**Short quantifier**
+#### **Short quantifier**

 - `?` match 0 or 1 time, `a?b` match both `ab` or `b`
 - `+` match at minimum 1 time, `a+` match both `aaa` or `a`
 - `*` match 0 or more time, `a*b` match both `aaab` or `ab` or `b`

-**Long quantifier**
+#### **Long quantifier**

 - `{x}` match exactly x time, `a{2}` match `aa` but doesn't match `aaa` or `a`
 - `{min,}` match at minimum min time, `a{2,}` match `aaa` or `aa` but doesn't match `a`
- `{,max}` match at least 1 time and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa`
+- `{,max}` match at least 0 time and maximum max time, `a{,2}` match `a` and `aa` but doesn't match `aaa`
 - `{min,max}` match from min times to max times, `a{2,3}` match `aa` and `aaa` but doesn't match `a` or `aaaa`

 a long quantifier may have a `greedy off` flag that is the `?` char after the brackets, `{2,4}?` means to match the minimum number possible tokens in this case 2.
@@ -102,7 +102,7 @@ the dot char match any char until the next token match is satisfied.

 the token `|` is a logic OR operation between two consecutive tokens, `a|b` match a char that is `a` or `b`.

-The or token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a` the test the group `(b)` and if the group doesn't match test the token `c`.
+The OR token can work in a "chained way": `a|(b)|cd ` test first `a` if the char is not `a` then test the group `(b)` and if the group doesn't match test the token `c`.

 **note: The OR work at token level! It doesn't work at concatenation level!**

@@ -181,16 +181,16 @@ re.flag = regex.F_BIN

 ### Initializer

-These function are helper that create the `RE` struct, a `RE` struct can be created manually if you needed.
+These functions are helper that create the `RE` struct, a `RE` struct can be created manually if you needed.

-**Simplified initializer**
+#### **Simplified initializer**

 ```v
 // regex create a regex object from the query string and compile it
 pub fn regex(in_query string) (RE,int,int)
 ```

-**Base initializer**
+#### **Base initializer**

 ```v
 // new_regex create a REgex of small size, usually sufficient for ordinary use
@@ -199,13 +199,13 @@ pub fn new_regex() RE
 // new_regex_by_size create a REgex of large size, mult specify the scale factor of the memory that will be allocated
 pub fn new_regex_by_size(mult int) RE
 ```
-After the base initializer use, the regex expression must be compiled with:
+After a base initializer is used, the regex expression must be compiled with:
 ```v
 // compile return (return code, index) where index is the index of the error in the query string if return code is an error code
 pub fn (re mut RE) compile(in_txt string) (int,int)
 ```

-### Functions
+### Operative Functions

 These are the operative functions

@@ -227,7 +227,7 @@ pub fn (re mut RE) replace(in_txt string, repl string) string

 This module has few small utilities to help the writing of regex expressions.

-**Syntax errors highlight**
+### **Syntax errors highlight**

 the following example code show how to visualize the syntax errors in the compilation phase:

@@ -256,7 +256,7 @@ if re_err != COMPILE_OK {

 ```

-**Compiled code**
+### **Compiled code**

 It is possible view the compiled code calling the function `get_query()` the result will be something like this:

@@ -279,7 +279,7 @@ PC:  2 ist: 88000000 PROG_END {  0,  0}

 `{m,n}` is the quantifier, the greedy off flag  `?`  will be showed if present in the token

-**Log debug**
+### **Log debug**

 The log debugger allow to print the status of the regex parser when the parser is running.

@@ -338,6 +338,21 @@ the columns have the following meaning:

 `{2,3}:1?` quantifier `{min,max}`, `:1` is the actual counter of repetition, `?` is the greedy off flag if present

+### **Custom Logger output**
+
+The debug functions output uses the `stdout` as default, it is possible to  provide an alternative output setting a custom output function:
+
+```v
+// custom print function, the input will be the regex debug string
+fn custom_print(txt string) {
+	println("my log: $txt")
+}
+
+mut re := new_regex()
+re.log_func = custom_print  // every debug output from now will call this function
+
+```
+
 ## Example code

 Here there is a simple code to perform some basically match of strings
--- a/vlib/regex/regex.v
+++ b/vlib/regex/regex.v
@@ -200,7 +200,6 @@ pub fn (re RE) get_parse_error_string(err int) string {
 	}
 }

-
 // utf8_str convert and utf8 sequence to a printable string
 [inline]
 fn utf8_str(ch u32) string {
@@ -231,7 +230,7 @@ mut:
 	ist u32 = u32(0)

 	// char
-	ch u32                 = u32(0)// char of the token if any
+	ch u32                 = u32(0)  // char of the token if any
 	ch_len byte            = byte(0) // char len

 	// Quantifiers / branch
@@ -245,7 +244,7 @@ mut:
 	// counters for quantifier check (repetitions)
 	rep int = 0

-	// validator function pointer and control char
+	// validator function pointer
 	validator fn (byte) bool

 	// groups variables
@@ -280,9 +279,9 @@ pub const (

 struct StateDotObj{
 mut:
-	i  int                = 0   // char index in the input buffer
-	pc int                = 0   // program counter saved
-	mi int                = 0   // match_index saved
+	i  int                = -1  // char index in the input buffer
+	pc int                = -1   // program counter saved
+	mi int                = -1   // match_index saved
 	group_stack_index int = -1  // group index stack pointer saved
 }

@@ -648,7 +647,7 @@ fn (re RE) parse_quantifier(in_txt string, in_i int) (int, int, int, bool) {

 		// min parsing skip if comma present
 		if status == .start && ch == `,` {
-			q_min = 1 // default min in a {} quantifier is 1
+			q_min = 0 // default min in a {} quantifier is 0
 			status = .comma_checked
 			i++
 			continue
@@ -998,6 +997,7 @@ pub fn (re mut RE) compile(in_txt string) (int,int) {
 	// Post processing
 	//******************************************

+
 	// count IST_DOT_CHAR to set the size of the state stack
 	mut pc1 := 0
 	mut tmp_count := 0
@@ -1007,9 +1007,9 @@ pub fn (re mut RE) compile(in_txt string) (int,int) {
 		}
 		pc1++
 	}
+
 	// init the state stack
-	re.state_stack = [StateDotObj{}].repeat(tmp_count+1)
-	
+	re.state_stack = [StateDotObj{}].repeat(tmp_count+1)	
 	
 	// OR branch
 	// a|b|cd
@@ -1279,7 +1279,8 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {

 	mut pc := -1                     // program counter
 	mut state := StateObj{}          // actual state
-	mut ist := u32(0)                // Program Counter
+	mut ist := u32(0)                // actual instruction
+	mut l_ist := u32(0)              // last matched instruction

 	mut group_stack      := [-1].repeat(re.group_max)
 	mut group_data       := [-1].repeat(re.group_max)
@@ -1359,7 +1360,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 								tmp_gr := re.prog[re.prog[pc].goto_pc].group_rep
 								buf2.write("GROUP_START #:${tmp_gi} rep:${tmp_gr} ")
 							} else if ist == IST_GROUP_END {
-								buf2.write("GROUP_END   #:${re.prog[pc].group_id} deep:${group_index} ")
+								buf2.write("GROUP_END   #:${re.prog[pc].group_id} deep:${group_index}")
 							}
 						}
 						if re.prog[pc].rep_max == MAX_QUANTIFIER {
@@ -1417,17 +1418,10 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 			}

 			// manage IST_DOT_CHAR
-			if re.state_stack_index >= 0 {
-				//C.printf("DOT CHAR text end management!\n")
-				// if DOT CHAR is not the last instruction and we are still going, then no match!!
-				if pc < re.prog.len && re.prog[pc+1].ist != IST_PROG_END {
-					return NO_MATCH_FOUND,0
-				}
-			}

 			m_state == .end
 			break
-			return NO_MATCH_FOUND,0
+			//return NO_MATCH_FOUND,0
 		}

 		// starting and init
@@ -1475,12 +1469,13 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 		// check if stop 
 		if m_state == .stop {
 			// if we are in restore state ,do it and restart
-			if re.state_stack_index >= 0 {	
+			//C.printf("re.state_stack_index %d\n",re.state_stack_index )
+			if re.state_stack_index >=0 && re.state_stack[re.state_stack_index].pc >= 0 {
 				i = re.state_stack[re.state_stack_index].i
 				pc = re.state_stack[re.state_stack_index].pc
 				state.match_index =	re.state_stack[re.state_stack_index].mi
 				group_index = re.state_stack[re.state_stack_index].group_stack_index
-				
+
 				m_state = .ist_load
 				continue
 			}
@@ -1499,12 +1494,22 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 			// program end
 			if ist == IST_PROG_END {
 				// if we are in match exit well
+				
 				if group_index >= 0 && state.match_index >= 0 {
 					group_index = -1
 				}
-								
+
+				// we have a DOT MATCH on going
+				//C.printf("IST_PROG_END l_ist: %08x\n", l_ist)
+				if re.state_stack_index>=0 && l_ist == IST_DOT_CHAR {
+					m_state = .stop
+					continue
+				}
+
+				re.state_stack_index = -1
 				m_state = .stop
 				continue
+				
 			}

 			// check GROUP start, no quantifier is checkd for this token!!
@@ -1527,7 +1532,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 					//C.printf("g.id: %d group_index: %d\n", re.prog[pc].group_id, group_index)
 					if group_index >= 0 {
 	 					start_i   := group_stack[group_index]
-	 					group_stack[group_index]=-1
+	 					//group_stack[group_index]=-1

 	 					// save group results
 						g_index := re.prog[pc].group_id*2
@@ -1537,6 +1542,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 							re.groups[g_index] = 0
 						}
 						re.groups[g_index+1] = i
+						//C.printf("GROUP %d END [%d, %d]\n", re.prog[pc].group_id, re.groups[g_index], re.groups[g_index+1])
 					}
 					
 					re.prog[pc].group_rep++ // increase repetitions
@@ -1568,6 +1574,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 			else if ist == IST_DOT_CHAR {
 				//C.printf("IST_DOT_CHAR rep: %d\n", re.prog[pc].rep)
 				state.match_flag = true
+				l_ist = u32(IST_DOT_CHAR)

 				if first_match < 0 {
 					first_match = i
@@ -1575,12 +1582,23 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 				state.match_index = i
 				re.prog[pc].rep++	

-				if re.prog[pc].rep == 1 {
+				//if re.prog[pc].rep >= re.prog[pc].rep_min && re.prog[pc].rep <= re.prog[pc].rep_max {
+				if re.prog[pc].rep >= 0 && re.prog[pc].rep <= re.prog[pc].rep_max {
+					//C.printf("DOT CHAR save state : %d\n", re.state_stack_index)
 					// save the state
-					re.state_stack_index++
+					
+					// manage first dot char
+					if re.state_stack_index < 0 {
+						re.state_stack_index++
+					}
+
 					re.state_stack[re.state_stack_index].pc = pc
 					re.state_stack[re.state_stack_index].mi = state.match_index
 					re.state_stack[re.state_stack_index].group_stack_index = group_index
+				} else {
+					re.state_stack[re.state_stack_index].pc = -1
+					re.state_stack[re.state_stack_index].mi = -1
+					re.state_stack[re.state_stack_index].group_stack_index = -1
 				}

 				if re.prog[pc].rep >= 1 && re.state_stack_index >= 0 {
@@ -1590,19 +1608,11 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 				// manage * and {0,} quantifier
 				if re.prog[pc].rep_min > 0 {
 					i += char_len // next char
+					l_ist = u32(IST_DOT_CHAR)
 				}
-				
-				if re.prog[pc+1].ist !=  IST_GROUP_END {
-					m_state = .ist_next
-					continue
-				} 
-				// IST_DOT_CHAR is the last instruction, get all
-				else {
-					//C.printf("We are the last one!\n")
-					pc-- 
-					m_state = .ist_next_ks
-					continue
-				}
+
+				m_state = .ist_next
+				continue

 			}

@@ -1622,6 +1632,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {

 				if cc_res {
 					state.match_flag = true
+					l_ist = u32(IST_CHAR_CLASS_POS)
 					
 					if first_match < 0 {
 						first_match = i
@@ -1645,6 +1656,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 				//C.printf("BSLS in_ch: %c res: %d\n", ch, tmp_res)
 				if tmp_res {
 					state.match_flag = true
+					l_ist = u32(IST_BSLS_CHAR)
 					
 					if first_match < 0 {
 						first_match = i
@@ -1669,6 +1681,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 				if re.prog[pc].ch == ch
 				{
 					state.match_flag = true
+					l_ist = u32(IST_SIMPLE_CHAR)
 					
 					if first_match < 0 {
 						first_match = i
@@ -1857,7 +1870,7 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {
 			}

 			// no other options
-			//C.printf("NO_MATCH_FOUND\n")
+			//C.printf("ist_quant_n NO_MATCH_FOUND\n")
 			result = NO_MATCH_FOUND
 			m_state = .stop
 			continue
@@ -1873,12 +1886,6 @@ pub fn (re mut RE) match_base(in_txt byteptr, in_txt_len int ) (int,int) {

 			rep := re.prog[pc].rep
 			
-			// clear the actual dot char capture state
-			if re.state_stack_index >= 0 {
-				//C.printf("Drop the DOT_CHAR state!\n")
-				re.state_stack_index--
-			}
-
 			// under range
 			if rep > 0 && rep < re.prog[pc].rep_min {
 				//C.printf("ist_quant_p UNDER RANGE\n")
--- a/vlib/regex/regex_test.v
+++ b/vlib/regex/regex_test.v
@@ -33,15 +33,13 @@ match_test_suite = [
 	TestItem{"this is a good sample.",r"( ?\w+){,4}",0,14},
 	TestItem{"this is a good sample.",r"( ?\w+){,5}",0,21},
 	TestItem{"this is a good sample.",r"( ?\w+){2,3}",0,9},
-	TestItem{"this is a good sample.",r"(\s?\w+){2,3}",0,9},
-	TestItem{"this is a good sample.",r".*i(\w)+",0,4},
+	TestItem{"this is a good sample.",r"(\s?\w+){2,3}",0,9},	
 	TestItem{"this these those.",r"(th[ei]se?\s|\.)+",0,11},
 	TestItem{"this these those ",r"(th[eio]se? ?)+",0,17},
 	TestItem{"this these those ",r"(th[eio]se? )+",0,17},
 	TestItem{"this,these,those. over",r"(th[eio]se?[,. ])+",0,17},
 	TestItem{"soday,this,these,those. over",r"(th[eio]se?[,. ])+",6,23},
-	TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23},
-	TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29},
+	
 	TestItem{"cpapaz",r"(c(pa)+z)",0,6},
 	TestItem{"this is a cpapaz over",r"(c(pa)+z)",10,16},
 	TestItem{"this is a cpapapez over",r"(c(p[ae])+z)",10,18},
@@ -56,16 +54,23 @@ match_test_suite = [
 	TestItem{"this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}",5,21},
 	TestItem{"1234this cpapaz adce aabe",r"(c(pa)+z)(\s[\a]+){2}$",9,25},
 	TestItem{"this cpapaz adce aabe third",r"(c(pa)+z)(\s[\a]+){2}",5,21},
+	TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20},
+	
+	TestItem{"this is a good sample.",r".*i(\w)+",0,4},
+	TestItem{"soday,this,these,those. over",r".*,(th[eio]se?[,. ])+",0,23},
+	TestItem{"soday,this,these,thesa.thesi over",r".*,(th[ei]se?[,. ])+(thes[ai][,. ])+",0,29},
 	TestItem{"cpapaz ole. pippo,",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18},
 	TestItem{"cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,17},
 	TestItem{"cpapaz ole. pippo, 852",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,18},
 	TestItem{"123cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20},
 	TestItem{"...cpapaz ole. pippo",r".*(c(pa)+z)(\s+\a+[\.,]?)+",0,20},
-	TestItem{"123cpapaz ole. pippo",r"(c(pa)+z)(\s+\a+[\.,]?)+",3,20},
+	
 	TestItem{"cpapaz ole. pippo,",r".*c.+ole.*pi",0,14},
 	TestItem{"cpapaz ole. pipipo,",r".*c.+ole.*p([ip])+o",0,18},
 	TestItem{"cpapaz ole. pipipo",r"^.*c.+ol?e.*p([ip])+o$",0,18},
 	TestItem{"abbb",r"ab{2,3}?",0,3},
+	TestItem{" pippo pera",r"\s(.*)pe(.*)",0,11},
+	TestItem{" abb",r"\s(.*)",0,4},

 	// negative
 	TestItem{"zthis ciao",r"((t[hieo]+se?)\s*)+",-1,0},