Implementing Kleene closure
I'm writing a parser that is suppose to deal with context sensitive grammar a la XBNF, and so far it can deal with context free grammar (and does not fail where spirit parser library does on simple grammars like (ab|a)b due to its greedy algo.) However before I go on extending to XBNF, I wanted to implement the Kleene closure for convenience as it is quite inconvenient to deal with left recursion, and besides its suppose to do semantic actions in the same way spirit parser library does. The question is how can I make the implementation efficient? The closure can terminate at any time, even when its not suppose to, so it would seem that the greedy method is the most efficient at first, but then again how do you deal with something like a*ab? It could back off one step and exit the closure but then I'm not sure when is it suppose to try other alternatives as in (a|b)*. Supposedly I could do it by terminating the closure at first and then if it fails, go with the other alternatives, but does that not defeat the greediness of the algo?
Re: Implementing Kleene closure
Typically (as in, Regex libraries usually do this), you go greedy first, but backtrack one element at a time if the following fails.
The (a|b)* case is tricky in so far as a and b really shouldn't be able to match the same sequence - if they are, I believe that tends to fall under the heading of pathological expressions, i.e. those that require extreme time and memory overheads to solve. For each iteration of the kleene star, you'd have to record which alternatives you already tried, and on backtracking try the remaining ones instead. This means quite a lot of memory overhead for the backtracking stack.
Re: Implementing Kleene closure
Well, apparantley thats something you would need to do to parse say an integer, but is it really that resource consuming? Thanks for the suggestion though although I figured I would have to do something of the sort. I already have a backtracking stack since otherwise I could impossibly handle grammars like (ab|a)b which spirit doesn't seem to be able to - well actually its a vector since you have to go back and forth on it without popping to continue in a hierarchy of rules(but I guess that is something that you don't have to with regular expressions) which holds all productions made and if it comes to a Kleene star, it pushes a Kleene begin on the stack and a Kleene end as soon as it has tried all possible rules in the Kleene closure, however if a rule matches, it will greedily try to add more. When deep backtracking the Kleene end is popped off and it will continue on remaining alternatives on the previously matched rule, and go greedy again if it more matches are found.
Re: Implementing Kleene closure
What's wrong with rewriting "(ab|a)b" as "abb?"? It seems to me that most pathological grammars can be re-written in a better way, although possibly not by a machine.
Quote:
Well, apparantley thats something you would need to do to parse say an integer, but is it really that resource consuming?
No, you wouldn't need it, unless you want the integer to be immediately followed by more digits that are part of the next match. And then you still don't need the complex thing I mentioned, because each alternative in the repetition of integers (i.e. each digit) is completely mutually exclusive.
Re: Implementing Kleene closure
it could be rewritten as
"abb|ab"
but that last b could be involved in a more complex expression which would make it hard if not impossible to extract it and likewise (ab|a) could be a very complex expression. Besides altering the grammar will make it impossible to deal with semantic actions; one reason why there is no dealing with left recursion in spirit parser library.
Quote:
No, you wouldn't need it, unless you want the integer to be immediately followed by more digits that are part of the next match. And then you still don't need the complex thing I mentioned, because each alternative in the repetition of integers (i.e. each digit) is completely mutually exclusive.
How about parsing mathematical expressions like 2+3*5? There is bound to be backtracking if it believes 3 is a term at first.
Re: Implementing Kleene closure
Backtracking is easy, yes. But not past fully matched expressions.
Example: let's say that a = ('t'+ 'x') while b = ('t' 'x'+).
As can easily be seen, both a and b can match the string "tx". Let's now take the expression (a|b) - call it c - and apply it to "txxxx".
Using Spirit's rules, i.e. not backtracking past successful submatches, means that the string is processed like this.
Code:
+ c.parse("txxxx")
++ a.parse("txxxx")
+++ ('t'+).parse("txxxx")
++++ ('t').parse("txxxx")
---- ('t') : match "t", remainder "xxxx"
++++ ('t').parse("xxxx")
---- ('t') : fail
--- ('t'+) : match "t", remainder "xxxx"
+++ ('x').parse("xxxx")
--- ('x') : match "x", remainder "xxx"
-- a : match "tx", remainder "xxx"
- c : match "tx", remainder "xxx"
At this point, whoever called c.parse() may notice that not all of the string has been matched, but that doesn't concern c anymore, because it matched successfully. It never tries b, because a matches, and Spirit never backtracks across successful matches. (If a larger expression containing a successful match fails, it will backtrack to the start of that, but that's different.)
However, mathematical expressions usually don't require ANY backtracking.
Let's take a look at your expression, using the grammar in Spirits docs:
Code:
group ::= '(' expression ')'
factor ::= integer | group
term ::= factor (('*' factor) | ('/' factor))*
expression ::= term (('+' term) | ('-' term))*
The string to match is "2+3*5". The grammar does not define integer, let's define it as digit+, with digit being [0-9].
Code:
+ expression.parse("2+3*5")
++ term.parse("2+3*5")
+++ factor.parse("2+3*5")
++++ integer.parse("2+3*5")
+++++ digit.parse("2+3*5")
----- digit : match "2", remainder "+3*5"
+++++ digit.parse("+3*5")
----- digit: fail
---- integer: match "2", remainder "+3*5"
--- factor: match "2", remainder "+3*5"
+++ (('*' factor) | ('/' factor))*.parse("+3*5")
<skipping>
+++++ '*'.parse("+3*5")
------ '*' : fail
+++++ '/'.parse("+3*5")
------ '/' : fail
<skipping>
--- (('*' factor) | ('/' factor))* : fail
-- term : match "2", remainder "+3*5"
++ (('+' term) | ('-' term))*.parse("+3*5")
<skipping>
+++++ '+'.parse("+3*5")
----- '+' : match "+", remainder "3*5"
+++++ term.parse("3*5")
++++++ factor.parse("3*5")
+++++++ integer.parse("3*5")
<skipping>
------- integer: match "3", remainder "*5"
------ factor: match "3", remainder "*5"
++++++ (('*' factor) | ('/' factor))*.parse("*5")
<skipping>
+++++++++ '*'.parse("*5")
--------- '*' : match "*", remainder "5"
+++++++++ factor.parse("5")
<skipping>
--------- factor: match "5", remainder ""
<skipping>
+++++++ ('*' factor) | ('/' factor).parse("")
<skipping>
------- ('*' factor) | ('/' factor) : fail
------ (('*' factor) | ('/' factor))* : match "*5", remainder ""
----- term: match "3*5", remainder ""
<skipping>
+++ ('+' term) | ('-' term).parse("")
--- ('+' term) | ('-' term) : fail
-- (('+' term) | ('-' term))* : match "+3*5"
- expression : match "2+3*5"
As you can see, this extremely simple grammar never requires any backtracking at all, only multiple examination of a single character. The resulting AST (I'm too lazy to draw the whole parse tree again) is:
Code:
expression
+ '+'
+ term
+ factor
+ integer
+ term
+ '*'
+ factor
+ integer
+ factor
+ integer
Re: Implementing Kleene closure
Yeah that makes sense, as a factor in fact is a term and so it tries to match as much as possible while in factor, it gets the whole 3*5 with it.
In the first example, can't you put some sort of end of text and have it try to match '\0' in "txxxx"? If so then it would fail, since it doesn't backtrack wouldn't it?
I'm starting to wonder how useful this backtracking thing actually is, do you know any cases in which its actually used?
Re: Implementing Kleene closure
Quote:
Originally Posted by kedaman
If so then it would fail, since it doesn't backtrack wouldn't it?
Depends on what exactly you mean. Let's introduce the matcher eos, which matches the end of the sequence. In Spirit, I think it has a different name.
There are two things we can do. First, we can reuse the definitions of a and b from my previous post, but now define c as `(a|b)eos`. This will fail to parse "txxxx": a will match "tx", but eos cannot match "xxx", and Spirit doesn't backtrack past success, so it won't try b instead.
However, if we redefine a and b:
a = `'t'+ 'x' eos`
b = `'t' 'x'+ eos`
and keep c:
c = `(a|b)`
then the following happens.
a starts to match "tx". Then it comes to `eos` and fails to match "xxx" with it. It fails. Spirit backtracks to where a started and tries b instead. b first matches "t", then `'x'+` against "xxxx" completely, and finally successfully matches `eos` against "". The entire expression is successful.
This behaviour is entirely natural for a recursive-descent parser, by the way. The program's stack itself provides the backtracking. A function can only fail as long as it hasn't completely matched, i.e. execution is still within the function. If it succeeds, execution leaves the function and the stack frame is destroyed. That's why you can't backtrack in it.
Or let's hand-write the above two parsers as pseudo-code. This is the first, which has the eos in c:
Code:
parser c(front, back)
(t, success) = a(front, back)
if(not success) then (t, success) = b(front, back)
if(not success) then return (t, false)
return eos(t, back)
end
parser a(front, back)
(t, success) = `'t'+`(front, back)
if(not success) then return (t, false)
return `'x'`(t, back)
end
parser b(front, back)
(t, success) = `'t'`(front, back)
if(not success) then return (t, false)
return `'x'+`(t, back)
end
As you can see, by the time eos fails to match, the code is already past the b call.
Here's the other one:
Code:
parser c(front, back)
(t, success) = a(front, back)
if(not success) then (t, success) = b(front, back)
if(not success) then return (t, false)
return (t, true)
end
parser a(front, back)
(front, success) = `'t'+`(front, back)
if(not success) then return (front, false)
(front, success) = `'x'`(front, back)
if(not success) then return (front, false)
return eos(front, back)
end
parser b(front, back)
(front, success) = `'t'`(front, back)
if(not success) then return (front, false)
(front, success) = `'x'+`(front, back)
if(not success) then return (front, false)
return eos(front, back)
end
This time, a will fail, causing b to be tried, which will succeed.
The system behind the rdp is simple: in a sequence of alternatives, if a prefix of all matches of one alternative is a valid complete match for an alternative further to the left, the alternative will never be tried.
By the way, what does that Haskell snippet in your sig actually do? I can't figure it out.
Re: Implementing Kleene closure
Ok so there was a way to sneak around that too, but I guess thats only because the example is so simple. But if we have something very complicated there instead of eos, perhaps even have c* then I'm sure there will be backtracking needed, that is of the sort that don't use the program stack.
Once again, thanks for the help, CB.
The snippet is suppose to prove that 6*9=42 (as in the hitchhikers guide to the galaxy) where 42 is actually in base 13 i.e. 54. What I've done there is define a recursive function p, so that p(0) ="" and p(x) is a concatenation of x%13 and p(x/13) but I've used divmod here instead of doing those calculations separately. Basically it does something like itoa with base 13. Since it will append the letters in the front, I had to reverse the output string.
Re: Implementing Kleene closure
Right, I had forgotten what the : operator does.
Re: Implementing Kleene closure
It doesn't actually concatenate, that's not the right word, concatenation is what ++ does. Lists (and thus also strings) in haskell are like stacks and p:q means p is the topmost element and q is the rest so [1,2,3] is the same as 1:[2,3] as well as 1:2:[3] and 1:2:3:[]
Re: Implementing Kleene closure
So it's the cons operator. Whatever.