Special Feature: ColorForth Commentary
COLOR.ASM: Compiler
NOTICE: This is a work in progress. Parts of my commentary are still very rough. However, I have it up on the web because even in this rough form it may be useful to ColorForth enthusiasts. Expect the contents of these files to change frequently.
ColorForth's compiler works by adding bytes of machine code to the end of a buffer called the dictionary. (This buffer starts at address 0x100000 — the one-megabyte mark — and extends upward in memory.) ColorForth maintains a pointer to the end of this buffer in the variable h.
Words in the MACRO wordlist are executed at compile-time in order to add bytes of machine code to the definition of the word currently being defined. These words also provide opportunities for other macro words to optimize the machine code generated. Sometimes a word will leave in list the address of the machine code that the word adds to the definition. Another word will check list to see if a specific instruction is in the definition, and if so, the word may take the instruction out or substitute another one.
DROP
The word drop removes the top item from the data stack. Specifically, it moves the second item on the stack into the TOS register, overwriting the original top item, and adjusts the NOS register to point to what was the third item on the stack. The result is one cell fewer on the stack.
ColorForth implements drop as the one-byte LODSD instruction (machine-code value: 0xAD). Therefore the routine to compile a drop simply adds the machine-code value for LODSD at the end of the dictionary and increments the end-of-dictionary pointer by one byte. However, this version also stores the old end-of-dictionary pointer (the location of the LODSD instruction being added) into list, so that if another routine (such as qdup) wants to optimize away the LODSD, it can do so.
cdrop: ;"drop" ;[Refs: qlit, macro2] mov edx, h mov list, edx mov byte ptr [edx], 0adh ; lodsd inc h ret
DUP
The word dup pushes a copy of TOS onto the data stack. Specifically, it adjusts the NOS register to make room for an extra cell on the stack and copies TOS into this extra cell. The result is one cell more on the stack.
The ColorForth kernel defines two routines for compiling dup: qdup and cdup.
Optimized DUP (qdup)
The word ?dup is often used at the beginning of a macro word to compile a dup only if the definition of the word being defined does not already end with a drop (i.e., a LODSD instruction). If the drop is there, it is removed, and dup is not compiled.
qdup: ;"?dup" ;[Refs: literal, macro2] mov edx, h ; See if the last optimizable instruc- dec edx ; tion was one byte back from the end cmp list, edx ; of the dictionary. jnz cdup ; If not, just compile a DUP. cmp byte ptr [edx], 0adh ; If so, was the instruction LODSD? jnz cdup ; If not, just compile a DUP. mov h, edx ; If so, remove the LODSD and compile ret ; nothing.
Non-optimized DUP (cdup)
If the programmer knows that the preceding instruction was not a drop, he can just compile a dup directly.
Note that the dword and byte being added to the dictionary correspond to the five bytes (in hex) 8D 76 FC 89 06. The first three bytes are the machine code for the instruction lea esi, [esi-4]; the last two correspond to mov [esi], eax. (ESI is NOS, the pointer to the next [second] item on the stack. The stack grows downward, and the size of each item on the stack is four bytes, so lea esi, [esi-4] makes room on the stack for another item. EAX is TOS, the register containing the top item on the stack, so mov [esi], eax copies TOS into the new space.)
cdup: ;"dup" ;[Refs: qdup, macro2] mov edx, h mov dword ptr [edx], 89fc768dh mov byte ptr [4+edx], 06 add h, 5 ret
The interpreter
This isn't the entire interpreter; the complete list of routines called by inter is at spaces/adefine.
adup: ;[Refs: forthd, alit, lit] dup_ ret var1: ;[Refs: variable] dup_ mov eax, [4+forth0+ecx*4] ret
variable
The interpreter calls variable when it encounters a magenta word (a variable to be defined).
Recall that ColorForth defines two wordlists — FORTH and MACRO. If a pre-parsed word is green, then what the interpreter does with the word depends on the wordlist in which the word is found. If the word is in the MACRO wordlist, then the word's definition (machine code) is executed immediately. If the word is in the FORTH wordlist, then a CALL to the word's definition is compiled at the end of the dictionary (machine-code buffer).
Recall also that ColorForth defines each wordlist as a pair of arrays — a word array and an address array. A pre-parsed word is sought in the word array. If a match is found, then the address of the word's definition will be found at the same offset in the address array.
Defining a new word usually creates one new entry — one new word and one new address — in one of the two wordlists. Defining a variable is more expensive; a variable definition uses four wordlist entries — two entries in each wordlist. The data added to the wordlists are as follows:
|
|
When the interpreter finds a magenta word, it calls variable to handle the word. Variable calls forthd to add the word to the FORTH wordlist, set the ECX register to the address of the word-array cell that contains the new word, and set the word's address field to the address of the end of the dictionary as usual. Variable stores into this address field the address of var1, which does nothing except push [4+forth0+ecx*4] onto the stack. Variable then creates a dummy entry in the FORTH wordlist, into which is stored the address of the pre-parsed word after the magenta word being handled. This address goes into the word array; the corresponding cell in the address array is left unused. (This word-array cell for the dummy entry is the address to which "[4+forth0+ecx*4]" refers.)
variable: ;Used by inter to handle magenta (color=12, variable) words. ;[Refs: adefine] call forthd mov [forth2-forth0+ecx], offset var1 inc forths ; dummy entry for source address mov [4+ecx], edi
Then variable creates parallel entries in the MACRO wordlist. This operation is the same as above, except that variable calls macrod instead of forthd, and it stores, not the address of var1, but the address of the routine following variable (marked in the code with a local label "@@"), which I choose to call "var2".
call macrod mov [forth2-forth0+ecx], offset @f inc macros mov [4+ecx], edi
The wordlist entries created, variable increments the pre-parsed-word pointer, so that the interpreter will not interpret the value following the magenta word (which of course is the value of the variable).
inc edi ret
So how is a variable used after it is defined?
Chuck Moore states that yellow variables are preferred on the Pentium — that is, you use a variable by having its address pushed onto the stack at compile time. When the interpreter finds a yellow word, it looks up the word in the FORTH wordlist and executes the code at the address found in the word's entry. In this case, the address is that of var1, which grabs the address from the word field of the next (dummy) entry in the wordlist and pushes it onto the stack at compile time. This address is of course that of the variable's value, stored within the pre-parsed source code.
What if a variable name is green? Chuck Moore states: "A green variable compiles a call to code executed at run time to put the address on the stack." When the interpreter finds a green word, it looks up the word first in the MACRO wordlist, then in the FORTH wordlist. In this case, the interpreter will find the variable name in the MACRO wordlist (since the name was added to both wordlists). The address in the MACRO wordlist is that of "var2" below — which generates code to push the address of the variable's value onto the stack at run time.
"var2"
-- routine called if variable is invoked as a MACRO --
@@: call [lit] mov eax, [4+macro0+ecx*4] jmp @f
cnum
Cnum ("compile number") takes a 32-bit number from the pre-parsed source code and compiles it into the machine code needed to push that number onto the stack.
A 32-bit number is stored in the pre-parsed code as two 32-bit cells. Only the bottom five bits of the first cell are used — the bottom four bits are the "color," in this case color five (a "green" number to be compiled), and the other bit indicates whether the number should be displayed as decimal or hexadecimal. The second cell contains the 32-bit number.
Whereas most routines called by inter (the interpreter loop) access the current pre-parsed word via
cnum: ;Used by inter to handle green (color=5, 32-bit number) words. ;[Refs: abort1, adefine] call [lit] mov eax, [edi*4] ; Get cell AFTER "prefix." inc edi jmp @f
cshort
Cshort ("compile short number") takes a 27-bit number from the pre-parsed source code and compiles it into the machine code needed to push that number onto the stack.
A number small enough to be represented in 27 or fewer bits is stored in the pre-parsed source code as a single cell. The bottom four bits contain the "color," in this case color six (a "green" number to be compiled), the next bit indicates whether to display the number as decimal or hexadecimal, and the other 27 bits contain the number.
cshort: ;Used by inter to handle green (color=6, 27-bit number) words. ;[Refs: abort1, adefine] call [lit] mov eax, [-4+edi*4] sar eax, 5 @@: call literal drop ret alit: ;[Refs: execute, adup, short_, num] mov lit, offset adup
literal
Literal takes the 32-bit number at the top of the stack and compiles from it the machine code needed to push that number onto the stack later. (In most cases it generates a dup followed by a MOV EAX, <32_bit_number> instruction.) Note that this routine does not drop the number from the stack after compiling it.
literal: ;[Refs: cshort] call qdup ; Compile code to preserve top of stack. mov edx, list ; Preserve previous instruction address mov list+4, edx ; saved for optimization. mov edx, h ; Save address of the "MOV EAX" mov list, edx ; instruction about to be compiled. mov byte ptr [edx], 0b8h ; Compile "MOV EAX, <literal>" -- mov [1+edx], eax ; using the literal on the stack. add h, 5 ; Adjust end-of-dictionary pointer. ret
qcompile
Qcompile retrieves the next pre-parsed word from the source code and looks it up, first in the MACRO wordlist, then in the FORTH wordlist. If the word is found in the MACRO wordlist, the word's definition is executed. If the word is found in the FORTH wordlist, a CALL to the word's definition is compiled into the dictionary. If the word is found in neither, the interpreter aborts.
qcompile: ;Used by inter to handle green (color=4, compile) words. ;[Refs: abort1, adefine] call [lit] mov eax, [-4+edi*4] ; Get next pre-parsed word. and eax, -20o ; Mask out color bits (bits 0..3). call mfind ; Look it up in MACRO wordlist. jnz @f drop ; If found in MACRO wordlist, jmp [macro2+ecx*4] ; execute its definition immediately. @@: call find ; Otherwise, find it in FORTH wordlist. mov eax, [forth2+ecx*4] @@: jnz abort
call_
Call_ takes the routine address at the top of the stack and compiles from it the machine code needed to call that routine.
call_: ;[Refs: none] mov edx, h mov list, edx mov byte ptr [edx], 0e8h ; Store CALL opcode into dictionary. add edx, 5 ; Calculate correct offset from the CALL sub eax, edx ; instruction to the routine called. mov [-4+edx], eax ; Store offset into dictionary. mov h, edx drop ret compile: ;Used by inter to handle cyan (color=7, compile macro) words. ;[Refs: adefine] call [lit] mov eax, [-4+edi*4] and eax, -20o call mfind mov eax, [macro2+ecx*4] jmp @b short_: ;Used by inter to handle yellow (color=8, 27-bit number) words. ;[Refs: adefine] mov lit, offset alit dup_ mov eax, [-4+edi*4] sar eax, 5 ret num: ;Used by inter to handle yellow (color=2, 32-bit number) words. ;[Refs: spaces] mov lit, offset alit dup_ mov eax, [edi*4] inc edi ret
Writing machine code into the dictionary
These words allow the ColorForth programmer to create MACRO words that write machine code into the dictionary, from one to four bytes at a time.
comma: ;"," ;Moves a cell (four bytes) of machine code to the end of the dictionary. ;[Refs: forth2] mov ecx, 4 @@: mov edx, h mov [edx], eax mov eax, [esi] ; drop lea edx, [edx+ecx] lea esi, [esi+4] mov h, edx ; drop ret comma1: ;"1," ;Moves one byte of machine code to the end of the dictionary. ;[Refs: forth2] mov ecx, 1 jmp @b comma2: ;"2," ;Moves two bytes of machine code to the end of the dictionary. ;[Refs: forth2] mov ecx, 2 jmp @b comma3: ;"3," ;Moves three bytes of machine code to the end of the dictionary. ;[Refs: forth2] mov ecx, 3 jmp @b
Macro words
The semicolon (;) is another optimizing word. It works like this: If the last optimizable instruction compiled was a CALL, change this to a JuMP; otherwise add a RETurn instruction to the end of the dictionary.
One nice side effect of this, as Chuck Moore has observed, is that this allows tail recursion of words in the FORTH wordlist, like this:
word ... condition if word ; then ... ;
If a word calls itself, it would normally fill the return stack with copies of the same return address — but if it jumps to the beginning of its own definition instead, the return stack is not overfilled.
semi: ;";" ;Compiles code to return to caller (either JMP or RET). ;[Refs: macro2] mov edx, h sub edx, 5 cmp list, edx jnz @f cmp byte ptr [edx], 0e8h jnz @f inc byte ptr [edx] ; jmp ret @@: mov byte ptr [5+edx], 0c3h ; ret inc h ret then: ;[Refs: macro2] mov list, esp mov edx, h sub edx, eax mov [-1+eax], dl drop ret begin: ;[Refs: macro2] mov list, esp here: ;Returns end-of-dictionary pointer on data stack. ;[Refs: forth2] dup_ mov eax, h ret
qlit: If the last instruction in the dictionary is "MOV EAX, literal", move the literal onto the data stack and remove the instruction from the dictionary. If the instruction was preceded by a DUP, remove that also. Returns with zero flag set if nothing placed on stack, cleared if literal on stack.
?lit if do-something-with-literal drop then ;
qlit: ;"?lit" ;[Refs: forth2] mov edx, h lea edx, [edx-5] cmp list, edx jnz @f cmp byte ptr [edx], 0b8h jnz @f dup_ mov eax, list+4 mov list, eax mov eax, [1+edx] cmp dword ptr [edx-5], 89fc768dh ; dup jz q1 mov h, edx jmp cdrop q1: ;[Refs: qlit] add h, -10 ; flag nz ret @@: xor edx, edx ; flag z ret less: ;[Refs: forth2] cmp [esi], eax jl @f ; flag nz ; Bug-fix! Original was "js" xor ecx, ecx ; flag z @@: ret
qignore
If the interpreter has found the last pre-parsed word (a "null" cell), qignore pops the return stack twice. The effect is to return to the caller of load. Qignore has to pop two items from the return stack — the address to return to load, and the value that was in EDI before load pushed it onto the return stack. The RET instruction returns control to the routine that called load.
(MY THEORY: Popping a return address and returning to the caller's caller might save a code-cache flush-and-refill, because if you return to a caller, this potentially causes the processor to flush the code cache and refill it if the caller's code is, for whatever reason, not in the cache -- not likely, but possible -- This is especially wasteful if the caller just executes another RET immediately afterward, because that's potentially another flush-and-refill. This is a VERY small concern in this case, of course, but Chuck Moore has implied that you keep improving your software by handling these small concerns because the effects of doing so build up eventually, and you end up with smaller, faster code.)
qignore: ;Used by inter to handle extension (color=0) words. ;[Refs: spaces] test dword ptr [-4+edi*4], -20o ; Unless word's top 28 bits are jnz nul ; null, return to interpreter. pop edi ; [RST] ; Otherwise, exit interpreter. pop edi ; [RST] nul: ;[Refs: start1, qignore, adefine, anumber, display, ekeys, ekbd0, eout] ret
jump
The word jump introduces a jump table within ColorForth code, like this:
word ... number jump word0 word1 word2 ... ;
Here, number causes some number N to be pushed onto the stack, and jump takes that number and jumps to wordN (the Nth word afterward).
Note that the code here expects that each entry in the jump table is five bytes long, with the address of the word's definition in the latter four bytes — which is what compiling a word in the FORTH wordlist generates: a CALL to an address. A jump table containing MACRO words, generally speaking, will not work.
jump: ;[Refs: forth2] pop edx ; [RST] add edx, eax lea edx, [5+eax*4+edx] add edx, [-4+edx] drop jmp edx
load
Load does not load blocks from disk; the blocks are assumed to be in memory already. Load simply multiplies the block number by 256 (0x100) to get the block's address. (Thus block 18 (0x12) is at address (0x12 shl 8) or 0x1200.) The interpreter begins grabbing the pre-parsed words already in memory and using each word's color bits to determine the routine to use to handle that word.
load: ;( b -- ) Interprets pre-parsed words in the block given. ;[Refs: start1, forth2] shl eax, 10-2 push edi ; [RST] mov edi, eax drop
inter
Note that inter is an endless loop. The only way to return from the interpreter (e.g., when the interpreter has no more pre-parsed words to interpret) is for a routine called by the interpreter (such as qignore) to pop an item from the return stack before returning (and thus to return to inter's caller). (Only qignore checks for the last word in the pre-parsed code, so there is no need for inter to do it.)
inter: ;[Refs: inter] mov edx, [edi*4] inc edi ; (Thus all routines work on [-4+edi*4].) and edx, 17o ; Clear all bits except color bits (0..3). call spaces[edx*4] ; Use result as offset to routine to run. jmp inter
16 interpreter vectors
align 4 spaces ;[Refs: abort1, inter, sps] dd offset qignore, offset execute, offset num ; colors 0..2 adefine ;[Refs: sdefine] dd 5+offset macro_ ; color 3 ; This is altered by sdefine, which stores the ; address of either macrod (to define a MACRO ; word) or forthd (to define a FORTH word) here. dd offset qcompile, offset cnum, offset cshort, offset compile ; 4..7 dd offset short_, offset nul, offset nul, offset nul ; 8..11 dd offset variable, offset nul, offset nul, offset nul ; 12..15
Other variables
lit
This code vector points to either of two routines:
The routines that set lit to point to alit are all routines that handle yellow cells (words or numbers), so alit is needed only to handle immediate execution of words, and not for defining or compiling. Lit usually points to adup — it is reset to adup's address not only whenever a word is being defined but also whenever alit is called.
(But is [lit] called during handling of current word, or between words? I remember something about the transition between yellow and green words, or vice versa, causing something special with the compiler.)
It is helpful here to check out what each of the interpreter subroutines does with the stack. Nul of course does nothing, and qignore, macrod, forthd, and variable all scrupulously avoid altering the stack. Execute uses the stack but then restores it to what it was. Num and short_ both leave something on the stack, while the remaining routines — qcompile, cnum, cshort, and compile — all execute "call [lit]" first thing, use the TOS without first executing a DUP, and then DROP later.
Now look at what the two lit routines do. Adup duplicates the TOS and returns, while alit, like qignore and the word-definition routines, scrupulously avoid altering the stack.
Now we begin to see how these routines work together. Num and short_ both leave something on the stack, and they both set lit to alit. The following invocation of any of qcompile, cnum, cshort, or compile does a call [lit], which results in alit being executed, which avoids altering the stack but does create a mov eax, <literal> instruction using the TOS (the number placed on the stack by either num and short_) as the literal. Once alit is called, lit is restored to adup, so that if qcompile or any other routine that does a call [lit] is called again, adup is called, which just preserves the TOS and returns, so that the caller can safely overwrite the TOS.
Note that execute does not leave anything on the stack itself, but it is invoked when the interpreter finds a yellow word, whose definition is supposed to leave something (a single number, per Chuck Moore) on the stack. Other than that, it acts like num and short_, in that it sets lit to alit, so that the item left on the stack can be compiled into a mov eax, <item> instruction.
lit ;[Refs: execute, forthd, variable, cnum, cshort, alit, qcompile, compile, ; short_, num] dd offset adup mk ;[Refs: mark, empty] dd 0, 0, 0
h
This is the pointer to the end of the code space — the next byte to be overwritten when a word needs to be compiled. The code space begins at address 0x100000 (the one-megabyte mark), so if the code space is empty, h is set at 0x100000. As the compiler adds bytes of machine code to the code space, h advances.
h ;[Refs: mark, empty, forthd, cdrop, qdup, cdup, literal, call_, comma, semi, ; then, here, qlit, q1] dd 40000h*4
last
(forthd stores into "last" the DWORD pointer to last word defined -- the "colorless" pre-parsed word stored in the word array of whichever of the two wordlists was last updated -- presumably "last" is used for optimizing code)
last ;[Refs: forthd, last_] dd 0
class
From what I can tell, class always contains zero. Empty stores a zero here. Forthd will use this as a pointer to code if it is nonzero, but it doesn't store anything here. I have found no other references to class anywhere in the kernel, and class doesn't seem to be exposed in any way via the two wordlists. I'm guessing that this was intended as a way to customize the interpreter by having some custom code executed whenever a word is defined, except that I haven't found any way to do that except by altering the assembly-language source and reassembling ColorForth.
class ;[Refs: empty, forthd] dd 0
list
This is a list of up to two addresses within the dictionary. Each address is that of a previously compiled instruction — an instruction that could later be changed or removed in order to optimize the machine code.
list ;[Refs: forthd, cdrop, qdup, literal, call_, semi, then, begin, qlit] dd 0, 0
Check the index for other entries.