5.2 Euphoria Internals

The interpreter has four binary components:

  • Interpreter
  • Translator
  • Backend
  • Library

The Euphoria interpreter has two parts: the frontend and the backend. The frontend is a parser that converts source-code into a set of Intermediate Language (IL) instructions. The backend then takes the IL instructions and executes the program.

When the interpreter executes source-code, the frontend parses and prepares the code, and then the backend executes the code.

When the shrouder executes source-code, only the frontend is run producing an .il file. This .il file may be run by the backend as an independent step to execute the program.

When the binder executes source-code, the .il instructions produced by the frontend are combined with the backend to produce a stand-alone executable program. The executable program may then be run independetly at any time.

When the translator executes source-code, the .il instructions are translated into C-code. This C-code is compiled with an installed C compiler producing an executable program.

The library is called by the backend for the many builtins included in Euphoria.

5.2.1 The Euphoria Data Structures

5.2.1.1 The Euphoria representation of a Euphoria Object

Every Euphoria object is stored as-is. A special unlikely floating point value is used for NOVALUE. NOVALUE signifies that a variable has not been assigned a value or the end of a sequence.

5.2.1.2 The C Representation of a Euphoria Object

Every Euphoria object is either stored as is, or as an encoded pointer. A Euphoria integer is stored in a 32-bit signed integer. If the number is too big for a Euphoria integer, it is assigned to a 64-bit double float in a structure and an encoded pointer to that structure is stored in the said 32-bit memory space. Sequences are stored in a similar way.

 32 bit number range:
  0X8      0XA      0XC      0XE      0X0      0X2      0X4      0X6      0X8
-4*2^29  -3*2^29  -2*2^29-1  -2^29   0*2^29   1*2^29   2*2^29   3*2^29 4*2^29 
   *--------*--------*--------*--------*--------*--------*--------*--------o
                     o NOVALUE = -2*2^29-1
		     o<-----------ATOM_INT---------[-2*2^29..4*2^29)------>o
	    |<----------------ATOM_DBL-------[-3*2^29..4*2^29)------------>o
-->|        |<-- IS_SEQUENCE [-4*2^29..-3*2^29)
-->|                 o<--- IS_DBL_OR_SEQUENCE [-4*2^29..-2*2^29-1)
-->|sequence|<-------
            |<------------------  atom   --------------->|
    ------->| double |<--------  
                     |<--------     integer    --------->|
   |<--------------------- object ---------------------->|		     

Euphoria integers are stored in object variables as-is. An object variable is a four byte signed integer. Legal integer values for Euphoria integers are between -1,073,741,824 ( -230 ) and +1,073,741,823 ( 230-1 ). Unsigned hexadecimal numbers from C000_0000 to FFFF_FFFF are the negative integers and numbers from 0000_0000 to 3FFF_FFFF are the positive integers. The hexadecimal values not used as integers are thus 4000_0000 to BFFF_FFFF. Other values are for encoded pointers. Pointers are always 8 byte aligned. So a pointer is stored in 29-bits instead of 32 and can fit in a hexadecimal range 0x2000_0000 long. The pointers are encoded in such a way that their encoded values will never be in the range of the integers. Pointers to sequence structures (struct s1) are encoded into a range between 8000_0000 to 9FFF_FFFF. Pointers to structures for doubles (struct d) are encoded into a range between A000_0000 to BFFF_FFFF. A special value NOVALUE is at the end of the range of encoded pointers is BFFF_FFFF and it signifies that there is no value yet assigned to a variable and it also signifies the end of a sequence. In C, values of this type are stored in the 'object' type. The range 4000_0000 to 7FFF_FFFF is unused.

A double structure 'struct d' could indeed contain a value that is legally in the range of a Euphoria integer. So the encoded pointer to this structure is recognized by the interpreter as an 'integer' but in this internals document when we say Euphoria integer we mean it actually is a C integer in the legal Euphoria integer range.

5.2.2 The C Representations of a Euphoria Sequence and a Euphoria Atom

// Sequence Header 
struct s1
{
 object_ptr base;     // base is such that base[1] is the first element
 long length;         // this is the sequence length
 long ref;            // ref is the number of as virtual copies of this sequence
 long postfill;       // is how many extra objects could fit at the end of base
 cleanup_ptr cleanup; // this is a pointer to a Euphoria routine that is run 
                      // just before the sequence is freed.
}

However, we allocate more than this structure. Inside the allocated data but past the structure, there also is an area of 'pre free space'; sequence data pointed to by base[1] to base[$], $ being the length; a NOVALUE terminator for the sequence, and an area of post fill space. In memory, immediately following the structure there is the following data stored:

 object pre_fill_space[]; // could have 0 (not exist) or more elements before used data
 object base[1..$];       // sequence members pointed to by base
 object base[$+1];        // a magic number terminating the sequence members (NOVALUE)
 object post_fill_space[];// could have 0 (not exist) or more elements after used data

Taken together these are what get represented in memory.

base length ref postfill cleanup pre fill space base[1..$] NOVALUE post fill space

By their nature, sequences are variable length, dynamic entities and so the C structure needs to cater for this. When a sequence is created, we allocate enough RAM for the combined header and the initial storage for the elements.

Field Description
base This contains the address of the first element less the length of one element. Thus base[1] points to the first element and base[0] points to a fictitious element just before the first one, which is never used.
Initially, base contains the address of the last member of the sequence header but as the sequence is resized, it can point to the last member or anywhere after.
length Contains the current number of elements in the sequence.
ref Contains the count of references to this sequence. Only when this is zero, can the RAM used by the sequence be returned to the system for reuse.
postfill The size of 'post fill space' in element spaces. Rather than using bytes, postfill is measured in objects which are each address wide elements. If this is non-zero, we can append to the sequence with at most postfill new elements before needing to reallocate RAM.
cleanup If not null, it points to a routine that is called immediately before the sequence is deleted.
pre fill space There are 0 or more spaces before base[1]. We can calculate the free space in *objects* at the front of a sequence, s1, in C by
(&s1.base[1] - (object_ptr)(1+&s1)).
In EUPHORIA, you will have to divide by the size of a C_POINTER on the difference. When elements are removed from the front of a sequence, we simply adjust the address in base to point to the new first element and reduce the length count. If we want to prepend and this pre fill space has some positive size, then we make room by decrementing base and increment the length. The new data is then assigned to base[1].
base[1]..base[length] sequence data This is actual data.
base[$+1] This is always set to NOVALUE.
post fill space There are 0 or more spaces after base[length+1]. The number of spaces is stored in postfill. If postfill is non-zero we can append by incrementing the length, decrementing postfill and assigning the new data to base[$]. When we remove from the end of the sequence, we increment postfill and decrement the length.

// Atom Header
struct d
{
 double dbl;          // the actual value of a double number.
 long ref;            // ref is the number of virtual copies of this double
 cleanup_ptr cleanup; // this is a pointer to a Euphoria routine that is run 
                      // just before the sequence is freed.
}

Now offset of the 'ref' in struct d must be the same as the offset of the 'ref' in struct s1. To this end, the 64bit implementation of 4.1 has these members in a different order.

5.2.3 The Euphoria Object Macros and Functions

5.2.3.1 Description

The macros are imperfect. For example, IS_SEQUENCE(NOVALUE) returns TRUE and IS_ATOM_DBL() will return TRUE for integer values as well as encoded pointers to 'struct d's. This is why there is an order that these tests are made: We test IS_ATOM_INT and if that fails we can use IS_ATOM_DBL and then that will only be true if we pass an encoded pointer to a double. We must be sure that something is not NOVALUE before we use IS_SEQUENCE on it.

Often we know foo is not NOVALUE before getting into this:

// object foo
if (IS_ATOM_INT(foo)) {
 // some code for a Euphoria integer
} else if (IS_ATOM_DBL(foo)) {
 // some code for a double
} else {
 // code for a sequence foo
}

A sequence is held in a 'struct s1' type and a double is contained in a 'struct d'.

5.2.4 Type Value Functions and Macros

5.2.4.1 IS_ATOM_INT

<internal> int IS_ATOM_INT( object o )
Returns

true if object is a Euphoria integer and not an encoded pointer.

Note

IS_ATOM_INT() will return true even though the argument is out of the Euphoria integer range when the argument is positive. These values are not possible encoded pointers.

5.2.4.2 IS_ATOM_DBL

<internal> int IS_ATOM_DBL( object o )
Returns

true if the object is an encoded pointer to a double struct.

Assumption

o must not be a Euphoria integer.

5.2.4.3 IS_ATOM

<internal> int IS_ATOM( object o )
Returns

true if the object is a Euphoria integer or an encoded pointer to a 'struct d'.

5.2.4.4 IS_SEQUENCE

<internal> int IS_SEQUENCE( object o )
Returns

true if the object is an encoded pointer to a 'struct s1'.

Assumption

o is not NOVALUE.

5.2.4.5 IS_DBL_OR_SEQUENCE

<internal> int IS_DBL_OR_SEQUENCE( object o )
Returns

true if the object is an encoded pointer of either kind of structure.

5.2.5 Type Conversion Functions and Macros

5.2.5.1 MAKE_INT

<internal> object MAKE_INT( signed int x )
Returns

an object with the same value as x. x must be with in the integer range of a legal Euphoria integer type.

5.2.5.2 MAKE_UINT

<internal> object MAKE_UINT( unsigned int x )
Returns

an object with the same value as x.

Assumption

x must be an unsigned integer with in the integer range of a C unsigned int type.

Example

MAKE_UINT(4*1000*1000*1000) will make a Euphoria value of four billion by creating a double.

5.2.5.3 MAKE_SEQ

<internal> object MAKE_SEQ( struct s1 * sptr )
Returns

an object with an argument of a pointer to a 'struct s1' The pointer is encoded into a range for sequences and returned.

5.2.5.4 NewString

<internal> object NewString(char *s)
Returns

an object representation of a Euphoria byte string s. The returned encoded pointer is a sequence with all of the bytes from s copied over.

5.2.5.5 MAKE_DBL

<internal> object MAKE_DBL( struct d * dptr )
Returns

an object with an argument of a pointer to a 'struct d' The pointer is encoded into a range for doubles and returned.

5.2.5.6 NewDouble

<internal> object NewDouble( double dbl )
Returns

an object with an argument a double dbl. A struct d is allocated and dbl is assigned to the value part of that structure. The pointer is encoded into the range for doubles and returned.

5.2.5.7 DBL_PTR

<internal> struct d * DBL_PTR( object o )
Returns

The pointer to a 'struct d' from the object o.

Assumption

IS_ATOM_INT(o) is FALSE and IS_ATOM_DBL(o) is TRUE.

5.2.5.8 SEQ_PTR

<internal> struct s1 * SEQ_PTR( object o )
Returns

The pointer to a 'struct s1' from the object o.

Assumption

IS_SEQUENCE(o) is TRUE and o is not NOVALUE.

get_pos_int
<internal> uintptr_t get_pos_int(char *where, object x)
Returns

a unsigned long value by truncating what x's value is to an integer

Comment

Any object may be passed. A sequence results in a runtime failure. There may be a cast of a double to a smaller ranged long type.

5.2.6 Creating Objects

5.2.6.1 NewS1

<internal> object NewS1 ( long size )
Returns

A sequence object with size members which are not yet set to a value.

5.2.7 Object Constants

Use MAXINT and MININT to check for overflow and underflow, NOVALUE to check if a variable has not been assigned, and use NOVALUE to terminate a sequence.

5.2.7.1 NOVALUE

<internal> object NOVALUE

Indicates that a variable has not been assigned and also terminates a sequence.

5.2.7.2 MININT

<internal> signed int MININT

The minimal Euphoria integer. This is -(230).

5.2.7.3 MAXINT

<internal> signed int MAXINT

The maximal Euphoria integer. This is 230-1.

5.2.7.4 HIGH_BITS

<internal> signed int HIGH_BITS

HIGH_BITS is an integer value such that if another integer value c lies outside of the range between MININT and MAXINT, c+HIGH_BITS will be non-negative.

Proof that HIGH_BITS is #C000_0000 on 32-bit version of EUPHORIA.
  • In the following expressions powers have higher precedence than unuary minus.* if c is a non-ATOM-INT value, then

c belongs to the set [-231,-230-1(=NOVALUE)] U [230,231].

c+-230 belongs to the set [-231-230,-230-1-230] U [230-230,230] which is [-3*230,-231-1] U [0,230]. However the lower values wrap around to non-negative numbers:

-231-1 wraps to 231-1. -3*230 wraps around to 230.

c+-230 belongs to the set [230,231-1] U [0,230] = [0,231-1]

This is the set of all non-negative numbers that can fit into 32-bit signed longs. -230 is the unsigned version of #C000_0000. QED.

A visual way of looking at it is, adding #C000_0000 to the set of non-ATOM_INTS rotates the set to the negative side by -MININT (2^30). The already negative ones wrap around to the positive; the positive numbers stay positive and hug the zero. Since adding #C000_0000 on registers is 1-1 and onto, we also know that ATOM_INTs will all be mapped to negative signed longs.

Testing for Overflow:

There are two ways to test for overflow:

  1. (c > MAXINT) || (c < MININT)
  2. (c + HIGH_BITS) >= 0

5.2.7.5 Parser

Inserting tokens into the token buffer is the easiest way to add features to the EUPHORIA parser. The tokens are two-element sequences one of the class of token and the other the token's value:

{<class>,<value>}

Each of the class values are capitalized words for some keyword or VARIABLE. The list of constants is in reswords.e. Often it is enough to only examin the class. In the case of variables, it is important to know which variable. In this case the second element, comes into play.

You can use putback() to put tokens into the token buffer. The tokens will be pulled out by the parser in a filo manner, like a stack.

5.2.7.6 Backend Instructions

After the Parser processes the instructions. It creates Backend instructions that are easily translated or interpreted. The system uses opcodes and some parameters which are put on a stack. This backend language is similar to assembler. You have opcodes (instructions) and parameters. These parameters must be integers themselves but some may serve as pointers to arbitrary EUPHORIA objects. As a developer of EUPHORIA itself, rather than a developer that uses EUPHORIA, it is important to know exactly what these opcodes do and what they are for. In this section we will document what they are for, and how they manipulate the instruction pointer, and stack.

IF instruction:

The IF instruction is used for making runtime branch statements. The IF instruction takes the top of the stack as the condition value, if the condition is 0, it passes control to the address stored just below the top of the stack. If the condition is non-zero and an atom the instruction pointer just past the failure address.

[ IF instruction ] [ test value ] [ failure address ]

INTEGER_CHECK instruction:

The INTEGER_CHECK is used to ensure that something has a value considered to be 'integer' to the EUPHORIA language definition. The instruction takes the next argument as a pointer to a value and determines whether this value is in the legal integer range, regardless of how that number is represented. If not in legal range, then the program ends execution in a type-check failure error message.

[ INTEGER_CHECK instruction ] [ test pointer ]

ATOM_CHECK instruction:

The ATOM_CHECK is used to determine whether something has a numeric value rather than a sequence. The instruction takes an argument as a pointer to a value and determines whether the value is an atom. If it is not an atom, then the program ends execution in a type-check failure error message.

[ ATOM_CHECK instruction ] [ test pointer ]

IS_AN_INTEGER instruction:

The IS_AN_INTEGER instruction is used to determine whether something has a value considered to be 'integer' to the EUPHORIA language definition. The instruction takes the argument as a pointer to a value and determines whether this value is in the legal integer range, regardless of how that number is represented. If it is in the 'integer' range then the value pointed by the second argument will be 1 otherwise it will be 0.

[ IS_AN_INTEGER instruction ] [ test pointer ][ return value pointer ]