1. Recursive Decent Parser (example)

Recursive Decent Parser (example) Just the parser bit is about 100 lines of code. Comments add a little bulk.

Update:

  • I just noticed I forgot to include unary -/+. So, I fixed it.
  • Noticed tabs and spaces both used for indentation. Fixed it. tab = space * 4
  • Fixed "pyramid" code by reducing block level nesting.
 
forward function main() 
forward function opAddSub() 
forward function opMultDiv() 
forward function opPrefix() 
forward function opPower() 
forward function parseTerm() 
forward function advance_ptr() 
 
-- ---    This parser is VERY unforgiving by way of expression format. --- 
-- Most notably the spaces between terms, operations, and parentheses are mandatory as written. 
-- I didn't write a robust tokenizer for this.    It's just an expression parser as simple as  
-- I know how to make it.     
 
-- These were the test cases I used.    Very light weight testing.    Exactly -- Zero (0) -- tests  
-- for malformed expressions.    This parser should not be considered production-ready without 
-- some real tests.    But it's probably close. 
 
--sequence str = "2 + 3 * ( 4 + 5 )" 
--sequence str = "2 + 3 * 4" 
--sequence str = "2 * 3 + 4 * 6" 
--sequence str = "2 + 2 ^ 3" 
--sequence str = "4 ^ 3" 
--sequence str = "( 2 + 2 ) ^ 3" 
--sequence str = "1 + 2 + 4"  
--sequence str = "1 + 2 - 4"  
--sequence str = "4 ^ 2 ^ 2" 
--sequence str = "2 * 3 ^ 2" 
--sequence str = "( 2 + 3 ) * ( 4 + 5 )" 
--sequence str = "( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 ) * - 1" 
--sequence str = "( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 )" 
--sequence str = "- 2 ^ 2" 
sequence str = "- 3 + ( 4 * 2 ) ^ ( 1 + 1 ) / 8 - - 1" 
 
-- The simplest tokenizer I know of. 
sequence expression = split(str, " ") 
 
-- simulated pointer. Because for me, it kept things simple. 
integer ptr = 1 
 
-- But even if my fake pointer is "simple," a little safety doesn't hurt. 
function advance_ptr()  
    if ptr < length(expression) then 
        ptr += 1 
    return 1  
    else 
        return 0 
    end if 
end function 
 
-- In this parser, the lowest operation is addition and subtraction.     
-- So, this is the entry point to parsing expressions. 
function opAddSub() 
    atom left = opMultDiv() -- Order of operations states multiply and divide come before add and subtract. 
    sequence op = expression[ptr] 
    sequence ops = "+-" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then 
            exit -- if we couldn't advance the pointer, bail. 
        end if 
        atom right = opMultDiv()    -- this moves the pointer to the next op. 
        if op = "+" then 
            left = left + right 
        else 
            left = left - right 
        end if 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- This is very similar to opAddSub.    The difference is this one is  
-- doing different operations and has a divide by 0 check. 
function opMultDiv() 
    atom left = opPrefix() -- Order of operations states exponents come before multiply and divide. 
    sequence op = expression[ptr] 
    sequence ops = "*/" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then 
            exit 
        end if 
         
        atom right = opPrefix()    -- this moves the pointer to the next op. 
        if op = "*" then 
            left = left * right 
        elsif right = 0 then 
            puts(1, "Error, cannot divide by 0.\n") 
            abort(1) 
        else 
            left = left / right 
        end if 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- This is how unary signs are handled.    Essentially, what's happening here is that there is a sign  
-- where a number should be.    So, we'll handle it. 
function opPrefix() 
    integer sign = 1 
    sequence tok = expression[ptr] 
    while tok = "+" or tok = "-" do 
        if tok = "-" then sign = -sign end if --    in case someone does something like ----4  
        if not advance_ptr() then exit end if 
        tok = expression[ptr] 
    end while 
    return sign * opPower() 
end function 
 
-- This looks the same as the other two, but it isn't.    Exponents are evaluated 
-- Right to Left.    The other operations are evaluated from Left to Right.     
-- This tripped me up a little.    Notice this is the first time  
-- We're actually doing recursion. The right side of the binary expression comes 
-- from calling itself.  I'll point it out with a comment.     
function opPower() 
    atom left = parseTerm() 
    sequence op = expression[ptr] 
    sequence ops = "^" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then 
            exit 
        end if 
         
        atom right = opPower()    -- this is calling itself and will cause 4 ^ 2 ^ 2 to be evaluated like 4 ^ ( 2 ^ 2 ) 
        left = power(left, right) 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- NOTE -- This function moves the pointer.    Be aware of that, please. 
-- The objective of this function is to return a single atom value. 
-- We'll deal with params here.    If I were going to add variables or 
-- function calls to this parser, this is where I would do it. 
function parseTerm() 
    sequence token = expression[ptr]; 
    if atom(to_number(token)) then 
        atom success = advance_ptr() -- <<-- This is why the pointer advances in a seemingly sneaky way.    Sorry. 
        return to_number(token); 
    elsif token = "(" then 
        atom success = advance_ptr() -- consume the (    <<-- pointer moves  
        if not success then 
            printf(1, "No term after open ( ", {token}) 
            abort(0) 
        end if 
        atom retval = opAddSub() 
        if expression[ptr] != ")" then 
            printf(1, "expected ) but got %s", {expression[ptr]})      
            abort(0)         
        end if 
        success = advance_ptr() -- consume the )    <<-- pointer moves  
        return retval 
    else 
        printf(1, "expected number or ( but got %s", {token}) 
        abort(0); 
    end if 
end function 
 
function main() 
    atom num = opAddSub(); 
    return num 
end function 
 
printf (1, "main exited with %f\n", {main()}) 

Ronald Weidner

new topic     » topic index » view message » categorize

2. Re: Recursive Decent Parser (example)

Just so everyone knows, that's a Phix program, btw (obvs I approve), and won't run as-is on Euphoria, not that it would be difficult to tweak (eg token="(" ==> equal(token,"(")).

new topic     » goto parent     » topic index » view message » categorize

3. Re: Recursive Decent Parser (example)

Now, pure Euphoria. Slightly less tested. :)

include std/convert.e 
include std/sequence.e 
 
-- ---    This parser is VERY unforgiving by way of expression format. --- 
-- Most notably, the spaces between terms, operations, and parentheses are mandatory as written. 
-- I didn't write a robust tokenizer for this.    It's just an expression parser as simple as  
-- I know how to make it.     
 
-- These were the test cases I used.    Very light weight testing.    Exactly -- Zero (0) -- tests  
-- for malformed expressions.    This parser should not be considered production-ready without 
-- some real tests.    But it's probably close. 
 
--sequence str = "2 + 3 * ( 4 + 5 )" 
--sequence str = "2 + 3 * 4" 
--sequence str = "2 * 3 + 4 * 6" 
--sequence str = "2 + 2 ^ 3" 
--sequence str = "4 ^ 3" 
--sequence str = "( 2 + 2 ) ^ 3" 
--sequence str = "1 + 2 + 4"  
--sequence str = "1 + 2 - 4"  
--sequence str = "4 ^ 2 ^ 2" 
--sequence str = "2 * 3 ^ 2" 
--sequence str = "( 2 + 3 ) * ( 4 + 5 )" 
--sequence str = "( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 ) * - 1" 
--sequence str = "( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 )" 
--sequence str = "- 2 ^ 2" 
sequence str = "- 3 + ( 4 * 2 ) ^ ( 1 + 1 ) / 8 - - 1" 
--sequence str = "( 4 + 5 )" 
 
 
-- The simplest tokenizer I know of. 
sequence expression = split(str, " ") 
 
-- simulated pointer. Because for me, it kept things simple. 
integer ptr = 1 
 
-- But even if my fake pointer is "simple," a little safety doesn't hurt. 
function advance_ptr()  
    if ptr < length(expression) then 
        ptr += 1 
        return 1  
    end if 
    return 0 
end function 
 
-- In this parser, the lowest operation is addition and subtraction.     
-- So, this is the entry point to parsing expressions. 
function opAddSub() 
    atom left = opMultDiv() -- Order of operations states multiply and divide come before add and subtract. 
    sequence op = expression[ptr] 
    sequence ops = "+-" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then -- Move to the next token, probably a number. 
            exit -- if we couldn't advance the pointer, bail. 
        end if 
        atom right = opMultDiv()    -- fetch an number and move the pointer to the next token.  Probably an op. 
        if equal(op, "+") then 
            left = left + right 
        else 
            left = left - right 
        end if 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- This is very similar to opAddSub.    The difference is this one is  
-- doing different operations and has a divide by 0 check. 
function opMultDiv() 
    atom left = opPrefix() -- Order of operations states exponents come before multiply and divide. 
    sequence op = expression[ptr] 
    sequence ops = "*/" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then 
            exit 
        end if 
         
        atom right = opPrefix()    -- this moves the pointer to the next op. 
        if equal(op, "*") then 
            left = left * right 
        elsif right = 0 then 
            puts(1, "Error, cannot divide by 0.\n") 
            abort(1) 
        else 
            left = left / right 
        end if 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- This is how unary signs are handled.    Essentially, what's happening here is that there is a sign  
-- where a number should be.    So, we'll handle it. 
function opPrefix() 
    integer sign = 1 
    sequence tok = expression[ptr] 
    while equal(tok, "+") or equal(tok, "-") do 
        if equal(tok, "-") then sign = -sign end if --    in case someone does something like ----4  
        if not advance_ptr() then exit end if 
        tok = expression[ptr] 
    end while 
    return sign * opPower() 
end function 
 
-- This looks the same as the other two, but it isn't.    Exponents are evaluated 
-- Right to Left.    The other operations are evaluated from Left to Right.     
-- This tripped me up a little.    Notice this is the first time  
-- We're actually doing recursion. The right side of the binary expression comes 
-- from calling itself.  I'll point it out with a comment.     
function opPower() 
    atom left = parseTerm() 
    sequence op = expression[ptr] 
    sequence ops = "^" 
    printf(1, "Looking for %s in %s\n", {op[1], ops}) 
    while find(op[1], ops) do  
        if not advance_ptr() then 
            exit 
        end if 
         
        atom right = opPower()    -- this is calling itself and will cause 4 ^ 2 ^ 2 to be evaluated like 4 ^ ( 2 ^ 2 ) 
        left = power(left, right) 
        op = expression[ptr] -- now we're ready to read the next op 
    end while 
    return left 
end function 
 
-- NOTE -- This function moves the pointer.    Be aware of that, please. 
-- The objective of this function is to return a single atom value. 
-- We'll deal with parentheses here.    If I were going to add variables or 
-- function calls to this parser, this is where I would do it. 
function parseTerm() 
    sequence token = expression[ptr] 
    if is_number(token) then 
        atom success = advance_ptr() -- <<-- This is why the pointer advances in a seemingly sneaky way.    Sorry. 
        return to_number(token) 
    elsif equal(token, "(") then 
        atom success = advance_ptr() -- consume the (    <<-- pointer moves  
        if not success then 
            printf(1, "No term after open ( ", {token}) 
            abort(0) 
        end if 
        atom retval = opAddSub() -- <<-- This is recursion 
        if not equal(expression[ptr], ")") then 
            printf(1, "expected ) but got %s", {expression[ptr]})      
            abort(0)         
        end if 
        success = advance_ptr() -- consume the )    <<-- pointer moves  
        return retval 
    else 
        printf(1, "expected number or ( but got %s", {token}) 
        abort(0) 
    end if 
end function 
 
function is_number(sequence maybe) 
    integer found_dot = 0 
    integer i = 1 
    integer c = 0 
    integer retval = 1 
    while i < (length(maybe) + 1) do 
        c = maybe[i] 
        if equal(c, '.') and found_dot then 
            return 0 
        end if 
 
        if equal(c, '.') then  
            found_dot = 1 
        elsif not find(c, "0123456789") then 
            return 0 
        end if 
        i+=1 
    end while 
    return retval 
end function 
 
function main() 
    atom num = opAddSub() 
    return num 
end function 
 
printf (1, "main exited with %f\n", {main()}) 
new topic     » goto parent     » topic index » view message » categorize

4. Re: Recursive Decent Parser (example)

FWIW. I replaced main() in both Phix and Euphoria code with:

function main() 
    sequence tests = { 
        {"2 + 3 * ( 4 + 5 )", "29.000000"}, 
        {"2 + 3 * 4", "14.000000"}, 
        {"2 * 3 + 4 * 6", "30.000000"}, 
        {"2 + 2 ^ 3", "10.000000"}, 
        {"4 ^ 3", "64.000000"}, 
        {"( 2 + 2 ) ^ 3", "64.000000"}, 
        {"1 + 2 + 4" , "7.000000"}, 
        {"1 + 2 - 4" , "-1.000000"}, 
        {"4 ^ 2 ^ 2", "256.000000"}, 
        {"2 * 3 ^ 2", "18.000000"}, 
        {"( 2 + 3 ) * ( 4 + 5 )", "45.000000"}, 
        {"( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 ) * - 1", "-3.000122"}, 
        {"( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 )", "3.000122"}, 
        {"- 2 ^ 2", "-4.000000"}, 
        {"- 3 + ( 4 * 2 ) ^ ( 1 + 1 ) / 8 - - 1", "6.000000"} 
    } 
     
    sequence results = {} 
     
    integer i = 1 
    while i < (length(tests) + 1) do 
        --The simplest tokenizer I know of. 
        expression = split(tests[i][1], " ") 
        ptr = 1 
        sequence result = sprintf("%f", {opAddSub()}) 
        sequence pass_fail = "PASS" 
        if not equal(tests[i][2], result) then 
            pass_fail = "FAIL" 
        end if 
        results = append(results, sprintf("%s: Test: [%s] Expected: [%s] Got: [%s]\n", {pass_fail, tests[i][1], tests[i][2], result})) 
        i += 1 
    end while 
     
    i = 1 
    while i < length(results) + 1 do 
        puts(1, results[i]) 
        i += 1 
    end while 
     
    return 0 
end function 

And this is the result: (Answers limited to 6 decimal places and verified by google)

PASS: Test: [2 + 3 * ( 4 + 5 )] Expected: [29.000000] Got: [29.000000] 
PASS: Test: [2 + 3 * 4] Expected: [14.000000] Got: [14.000000] 
PASS: Test: [2 * 3 + 4 * 6] Expected: [30.000000] Got: [30.000000] 
PASS: Test: [2 + 2 ^ 3] Expected: [10.000000] Got: [10.000000] 
PASS: Test: [4 ^ 3] Expected: [64.000000] Got: [64.000000] 
PASS: Test: [( 2 + 2 ) ^ 3] Expected: [64.000000] Got: [64.000000] 
PASS: Test: [1 + 2 + 4] Expected: [7.000000] Got: [7.000000] 
PASS: Test: [1 + 2 - 4] Expected: [-1.000000] Got: [-1.000000] 
PASS: Test: [4 ^ 2 ^ 2] Expected: [256.000000] Got: [256.000000] 
PASS: Test: [2 * 3 ^ 2] Expected: [18.000000] Got: [18.000000] 
PASS: Test: [( 2 + 3 ) * ( 4 + 5 )] Expected: [45.000000] Got: [45.000000] 
PASS: Test: [( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 ) * - 1] Expected: [-3.000122] Got: [-3.000122] 
PASS: Test: [( 3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3 )] Expected: [3.000122] Got: [3.000122] 
PASS: Test: [- 2 ^ 2] Expected: [-4.000000] Got: [-4.000000] 
PASS: Test: [- 3 + ( 4 * 2 ) ^ ( 1 + 1 ) / 8 - - 1] Expected: [6.000000] Got: [6.000000] 
main exited with 0.000000 
 

new topic     » goto parent     » topic index » view message » categorize

5. Re: Recursive Decent Parser (example)

I think I've saved myself a lot of trouble by using RPN for my Hypatia calculator ...

new topic     » goto parent     » topic index » view message » categorize

6. Re: Recursive Decent Parser (example)

Did you write it in Euphoria? Is it shareable?

UPDATE: I see it is shareable. Cool. Reverse Polish Notation Calculator

new topic     » goto parent     » topic index » view message » categorize

7. Re: Recursive Decent Parser (example)

xecronix said...

Did you write it in Euphoria? Is it shareable?

Originally it was in Euphoria, but there were a few problems which I couldn't resolve - I don't fully remember what they were, but they went away when I tried Phix, so it's been Phix ever since.

Yes, you can freely download it from hypatia-rpn.net, also the source code, though this is regrettably poorly documented.

If you have a look at it, I'd be happy to hear what you think!

new topic     » goto parent     » topic index » view message » categorize

8. Re: Recursive Decent Parser (example)

Great work! There are quite a few gold nuggets in your code-some of which are directly relevant to BzScript, a language I'm currently working on. Honestly, it's probably the most digestible example of a REPL implementation I've seen so far.

To be truthful, I've spent about 30 minutes reviewing your code. With dedicated time, it would likely take a few days to fully digest it all. That said, it's absolutely worth the effort. There are clear study pieces in there that stand out.

Thanks again for sharing this solid example.

By the way, I'm curious about your comment that reading my RDP example made you feel good about choosing RPN. What was it that helped reinforce your decision? BZScript did some sort of hybrid Pratt/RDP to create an AST. Which, honestly, is only one step away from linear eval, RPN as I see it. I've strongly considered starting execution at this point (AST). But, one more iteration and I could have a linear representation (RPN). I'm wondering if it's worth the effort. (tree vs linear)

new topic     » goto parent     » topic index » view message » categorize

9. Re: Recursive Decent Parser (example)

Thank you for your praise! As I've said and you have now seen for yourself, the code is poorly documented, it has grown slowly and not always systematically (to put it mildly) over a period of almost 20 years, and I tend to use variable names that are short and convey little information ... If you have any questions please feel free to ask, you can reach me by mail, robert@drs.at

xecronix said...

By the way, I'm curious about your comment that reading my RDP example made you feel good about choosing RPN. What was it that helped reinforce your decision?

Oh, there was no deep meaning to those words - looking at your code just showed me again that it isn't trivial to parse infix notation, and made me glad that I never tried it. :) Even though laziness wasn't my only motive to avoid it, I really like RPN.

new topic     » goto parent     » topic index » view message » categorize

Search



Quick Links

User menu

Not signed in.

Misc Menu