ColorForth Commentary: COLOR.ASM--Compiler

COLOR.ASM: Compiler

NOTICE: This is a work in progress. Parts of my commentary are still very rough. However, I have it up on the web because even in this rough form it may be useful to ColorForth enthusiasts. Expect the contents of these files to change frequently.

ColorForth's compiler works by adding bytes of machine code to the end of a buffer called the dictionary. (This buffer starts at address 0x100000 — the one-megabyte mark — and extends upward in memory.) ColorForth maintains a pointer to the end of this buffer in the variable h.

Words in the MACRO wordlist are executed at compile-time in order to add bytes of machine code to the definition of the word currently being defined. These words also provide opportunities for other macro words to optimize the machine code generated. Sometimes a word will leave in list the address of the machine code that the word adds to the definition. Another word will check list to see if a specific instruction is in the definition, and if so, the word may take the instruction out or substitute another one.

DROP

The word drop removes the top item from the data stack. Specifically, it moves the second item on the stack into the TOS register, overwriting the original top item, and adjusts the NOS register to point to what was the third item on the stack. The result is one cell fewer on the stack.

ColorForth implements drop as the one-byte LODSD instruction (machine-code value: 0xAD). Therefore the routine to compile a drop simply adds the machine-code value for LODSD at the end of the dictionary and increments the end-of-dictionary pointer by one byte. However, this version also stores the old end-of-dictionary pointer (the location of the LODSD instruction being added) into list, so that if another routine (such as qdup) wants to optimize away the LODSD, it can do so.

cdrop: ;"drop"
;[Refs: qlit, macro2]
	mov	edx, h
	mov	list, edx
	mov	byte ptr [edx], 0adh ; lodsd
	inc	h
	ret

DUP

The word dup pushes a copy of TOS onto the data stack. Specifically, it adjusts the NOS register to make room for an extra cell on the stack and copies TOS into this extra cell. The result is one cell more on the stack.

The ColorForth kernel defines two routines for compiling dup: qdup and cdup.

Optimized DUP (qdup)

The word ?dup is often used at the beginning of a macro word to compile a dup only if the definition of the word being defined does not already end with a drop (i.e., a LODSD instruction). If the drop is there, it is removed, and dup is not compiled.

qdup: ;"?dup"
;[Refs: literal, macro2]
	mov	edx, h               ; See if the last optimizable instruc-
	dec	edx                  ;   tion was one byte back from the end
	cmp	list, edx            ;   of the dictionary.
	jnz	cdup                 ; If not, just compile a DUP.
	cmp	byte ptr [edx], 0adh ; If so, was the instruction LODSD?
	jnz	cdup                 ; If not, just compile a DUP.
	mov	h, edx               ; If so, remove the LODSD and compile
	ret	                     ;   nothing.

Non-optimized DUP (cdup)

If the programmer knows that the preceding instruction was not a drop, he can just compile a dup directly.

Note that the dword and byte being added to the dictionary correspond to the five bytes (in hex) 8D 76 FC 89 06. The first three bytes are the machine code for the instruction lea esi, [esi-4]; the last two correspond to mov [esi], eax. (ESI is NOS, the pointer to the next [second] item on the stack. The stack grows downward, and the size of each item on the stack is four bytes, so lea esi, [esi-4] makes room on the stack for another item. EAX is TOS, the register containing the top item on the stack, so mov [esi], eax copies TOS into the new space.)

cdup: ;"dup"
;[Refs: qdup, macro2]
	mov	edx, h
	mov	dword ptr [edx], 89fc768dh
	mov	byte ptr [4+edx], 06
	add	h, 5
	ret

The interpreter

This isn't the entire interpreter; the complete list of routines called by inter is at spaces/adefine.

adup:
;[Refs: forthd, alit, lit]
	dup_
	ret

var1:
;[Refs: variable]
	dup_
	mov	eax, [4+forth0+ecx*4]
	ret

variable

The interpreter calls variable when it encounters a magenta word (a variable to be defined).

Recall that ColorForth defines two wordlists — FORTH and MACRO. If a pre-parsed word is green, then what the interpreter does with the word depends on the wordlist in which the word is found. If the word is in the MACRO wordlist, then the word's definition (machine code) is executed immediately. If the word is in the FORTH wordlist, then a CALL to the word's definition is compiled at the end of the dictionary (machine-code buffer).

Recall also that ColorForth defines each wordlist as a pair of arrays — a word array and an address array. A pre-parsed word is sought in the word array. If a match is found, then the address of the word's definition will be found at the same offset in the address array.

Defining a new word usually creates one new entry — one new word and one new address — in one of the two wordlists. Defining a variable is more expensive; a variable definition uses four wordlist entries — two entries in each wordlist. The data added to the wordlists are as follows:

FORTH wordlist
Word (name)	Address
word	`var1`
[4+ecx]

MACRO wordlist
Word (name)	Address
word	`var2`
[4+ecx]

When the interpreter finds a magenta word, it calls variable to handle the word. Variable calls forthd to add the word to the FORTH wordlist, set the ECX register to the address of the word-array cell that contains the new word, and set the word's address field to the address of the end of the dictionary as usual. Variable stores into this address field the address of var1, which does nothing except push [4+forth0+ecx*4] onto the stack. Variable then creates a dummy entry in the FORTH wordlist, into which is stored the address of the pre-parsed word after the magenta word being handled. This address goes into the word array; the corresponding cell in the address array is left unused. (This word-array cell for the dummy entry is the address to which "[4+forth0+ecx*4]" refers.)

variable:
;Used by inter to handle magenta (color=12, variable) words.
;[Refs: adefine]
	call	forthd
	mov	[forth2-forth0+ecx], offset var1
	inc	forths ; dummy entry for source address
	mov	[4+ecx], edi

Then variable creates parallel entries in the MACRO wordlist. This operation is the same as above, except that variable calls macrod instead of forthd, and it stores, not the address of var1, but the address of the routine following variable (marked in the code with a local label "@@"), which I choose to call "var2".

	call	macrod
	mov	[forth2-forth0+ecx], offset @f
	inc	macros
	mov	[4+ecx], edi

The wordlist entries created, variable increments the pre-parsed-word pointer, so that the interpreter will not interpret the value following the magenta word (which of course is the value of the variable).

	inc	edi
	ret

So how is a variable used after it is defined?

Chuck Moore states that yellow variables are preferred on the Pentium — that is, you use a variable by having its address pushed onto the stack at compile time. When the interpreter finds a yellow word, it looks up the word in the FORTH wordlist and executes the code at the address found in the word's entry. In this case, the address is that of var1, which grabs the address from the word field of the next (dummy) entry in the wordlist and pushes it onto the stack at compile time. This address is of course that of the variable's value, stored within the pre-parsed source code.

What if a variable name is green? Chuck Moore states: "A green variable compiles a call to code executed at run time to put the address on the stack." When the interpreter finds a green word, it looks up the word first in the MACRO wordlist, then in the FORTH wordlist. In this case, the interpreter will find the variable name in the MACRO wordlist (since the name was added to both wordlists). The address in the MACRO wordlist is that of "var2" below — which generates code to push the address of the variable's value onto the stack at run time.

"var2"

-- routine called if variable is invoked as a MACRO --

@@:	call	[lit]
	mov	eax, [4+macro0+ecx*4]
	jmp	@f

cnum

Cnum ("compile number") takes a 32-bit number from the pre-parsed source code and compiles it into the machine code needed to push that number onto the stack.

A 32-bit number is stored in the pre-parsed code as two 32-bit cells. Only the bottom five bits of the first cell are used — the bottom four bits are the "color," in this case color five (a "green" number to be compiled), and the other bit indicates whether the number should be displayed as decimal or hexadecimal. The second cell contains the 32-bit number.

Whereas most routines called by inter (the interpreter loop) access the current pre-parsed word via [-4+edi*4], cnum needs to access the following pre-parsed word (the 32-bit number itself) via [edi*4].

cnum:
;Used by inter to handle green (color=5, 32-bit number) words.
;[Refs: abort1, adefine]
	call	[lit]
	mov	eax, [edi*4]         ; Get cell AFTER "prefix."
	inc	edi
	jmp	@f

cshort

Cshort ("compile short number") takes a 27-bit number from the pre-parsed source code and compiles it into the machine code needed to push that number onto the stack.

A number small enough to be represented in 27 or fewer bits is stored in the pre-parsed source code as a single cell. The bottom four bits contain the "color," in this case color six (a "green" number to be compiled), the next bit indicates whether to display the number as decimal or hexadecimal, and the other 27 bits contain the number.

cshort:
;Used by inter to handle green (color=6, 27-bit number) words.
;[Refs: abort1, adefine]
	call	[lit]
	mov	eax, [-4+edi*4]
	sar	eax, 5
@@:	call	literal
	drop
	ret

alit:
;[Refs: execute, adup, short_, num]
	mov	lit, offset adup

literal

Literal takes the 32-bit number at the top of the stack and compiles from it the machine code needed to push that number onto the stack later. (In most cases it generates a dup followed by a MOV EAX, <32_bit_number> instruction.) Note that this routine does not drop the number from the stack after compiling it.

literal:
;[Refs: cshort]
	call	qdup                 ; Compile code to preserve top of stack.
	mov	edx, list            ; Preserve previous instruction address
	mov	list+4, edx          ;   saved for optimization.
	mov	edx, h               ; Save address of the "MOV EAX"
	mov	list, edx            ;   instruction about to be compiled.
	mov	byte ptr [edx], 0b8h ; Compile "MOV EAX, <literal>" --
	mov	[1+edx], eax         ;   using the literal on the stack.
	add	h, 5                 ; Adjust end-of-dictionary pointer.
	ret

qcompile

Qcompile retrieves the next pre-parsed word from the source code and looks it up, first in the MACRO wordlist, then in the FORTH wordlist. If the word is found in the MACRO wordlist, the word's definition is executed. If the word is found in the FORTH wordlist, a CALL to the word's definition is compiled into the dictionary. If the word is found in neither, the interpreter aborts.

qcompile:
;Used by inter to handle green (color=4, compile) words.
;[Refs: abort1, adefine]
	call	[lit]
	mov	eax, [-4+edi*4]      ; Get next pre-parsed word.
	and	eax, -20o            ; Mask out color bits (bits 0..3).
	call	mfind                ; Look it up in MACRO wordlist.
	jnz	@f
	drop	                     ; If found in MACRO wordlist,
	jmp	[macro2+ecx*4]       ;   execute its definition immediately.
@@:	call	find                 ; Otherwise, find it in FORTH wordlist.
	mov	eax, [forth2+ecx*4]
@@:	jnz	abort

call_

Call_ takes the routine address at the top of the stack and compiles from it the machine code needed to call that routine.

call_:
;[Refs: none]
	mov	edx, h
	mov	list, edx
	mov	byte ptr [edx], 0e8h ; Store CALL opcode into dictionary.
	add	edx, 5               ; Calculate correct offset from the CALL
	sub	eax, edx             ;   instruction to the routine called.
	mov	[-4+edx], eax        ; Store offset into dictionary.
	mov	h, edx
	drop
	ret

compile:
;Used by inter to handle cyan (color=7, compile macro) words.
;[Refs: adefine]
	call	[lit]
	mov	eax, [-4+edi*4]
	and	eax, -20o
	call	mfind
	mov	eax, [macro2+ecx*4]
	jmp	@b

short_:
;Used by inter to handle yellow (color=8, 27-bit number) words.
;[Refs: adefine]
	mov	lit, offset alit
	dup_
	mov	eax, [-4+edi*4]
	sar	eax, 5
	ret

num:
;Used by inter to handle yellow (color=2, 32-bit number) words.
;[Refs: spaces]
	mov	lit, offset alit
	dup_
	mov	eax, [edi*4]
	inc	edi
	ret

Writing machine code into the dictionary

These words allow the ColorForth programmer to create MACRO words that write machine code into the dictionary, from one to four bytes at a time.

comma: ;","
;Moves a cell (four bytes) of machine code to the end of the dictionary.
;[Refs: forth2]
	mov	ecx, 4
@@:	mov	edx, h
	mov	[edx], eax
	mov	eax, [esi] ; drop
	lea	edx, [edx+ecx]
	lea	esi, [esi+4]
	mov	h, edx
;	drop
	ret

comma1: ;"1,"
;Moves one byte of machine code to the end of the dictionary.
;[Refs: forth2]
	mov	ecx, 1
	jmp	@b

comma2: ;"2,"
;Moves two bytes of machine code to the end of the dictionary.
;[Refs: forth2]
	mov	ecx, 2
	jmp	@b

comma3: ;"3,"
;Moves three bytes of machine code to the end of the dictionary.
;[Refs: forth2]
	mov	ecx, 3
	jmp	@b

Macro words

The semicolon (;) is another optimizing word. It works like this: If the last optimizable instruction compiled was a CALL, change this to a JuMP; otherwise add a RETurn instruction to the end of the dictionary.

One nice side effect of this, as Chuck Moore has observed, is that this allows tail recursion of words in the FORTH wordlist, like this:

word ... condition if word ; then ... ;

If a word calls itself, it would normally fill the return stack with copies of the same return address — but if it jumps to the beginning of its own definition instead, the return stack is not overfilled.

semi: ;";"
;Compiles code to return to caller (either JMP or RET).
;[Refs: macro2]
	mov	edx, h
	sub	edx, 5
	cmp	list, edx
	jnz	@f
	cmp	byte ptr [edx], 0e8h
	jnz	@f
	inc	byte ptr [edx] ; jmp
	ret
@@:	mov	byte ptr [5+edx], 0c3h ; ret
	inc	h
	ret

then:
;[Refs: macro2]
	mov	list, esp
	mov	edx, h
	sub	edx, eax
	mov	[-1+eax], dl
	drop
	ret

begin:
;[Refs: macro2]
	mov	list, esp

here:
;Returns end-of-dictionary pointer on data stack.
;[Refs: forth2]
	dup_
	mov	eax, h
	ret

qlit: If the last instruction in the dictionary is "MOV EAX, literal", move the literal onto the data stack and remove the instruction from the dictionary. If the instruction was preceded by a DUP, remove that also. Returns with zero flag set if nothing placed on stack, cleared if literal on stack.

?lit if do-something-with-literal drop then ;

qlit: ;"?lit"
;[Refs: forth2]
	mov	edx, h
	lea	edx, [edx-5]
	cmp	list, edx
	jnz	@f
	cmp	byte ptr [edx], 0b8h
	jnz	@f
	dup_
	mov	eax, list+4
	mov	list, eax
	mov	eax, [1+edx]
	cmp	dword ptr [edx-5], 89fc768dh ; dup
	jz	q1
	mov	h, edx
	jmp	cdrop

q1:
;[Refs: qlit]
	add	h, -10 ; flag nz
	ret
@@:	xor	edx, edx ; flag z
	ret

less:
;[Refs: forth2]
	cmp	[esi], eax
	jl	@f ; flag nz        ; Bug-fix! Original was "js"
	xor	ecx, ecx ; flag z
@@:	ret

qignore

If the interpreter has found the last pre-parsed word (a "null" cell), qignore pops the return stack twice. The effect is to return to the caller of load. Qignore has to pop two items from the return stack — the address to return to load, and the value that was in EDI before load pushed it onto the return stack. The RET instruction returns control to the routine that called load.

(MY THEORY: Popping a return address and returning to the caller's caller might save a code-cache flush-and-refill, because if you return to a caller, this potentially causes the processor to flush the code cache and refill it if the caller's code is, for whatever reason, not in the cache -- not likely, but possible -- This is especially wasteful if the caller just executes another RET immediately afterward, because that's potentially another flush-and-refill. This is a VERY small concern in this case, of course, but Chuck Moore has implied that you keep improving your software by handling these small concerns because the effects of doing so build up eventually, and you end up with smaller, faster code.)

qignore:
;Used by inter to handle extension (color=0) words.
;[Refs: spaces]
	test	dword ptr [-4+edi*4], -20o ; Unless word's top 28 bits are
	jnz	nul                        ;   null, return to interpreter.
	pop	edi ; [RST]                ; Otherwise, exit interpreter.
	pop	edi ; [RST]

nul:
;[Refs: start1, qignore, adefine, anumber, display, ekeys, ekbd0, eout]
	ret

jump

The word jump introduces a jump table within ColorForth code, like this:

word ... number jump word0 word1 word2 ... ;

Here, number causes some number N to be pushed onto the stack, and jump takes that number and jumps to wordN (the Nth word afterward).

Note that the code here expects that each entry in the jump table is five bytes long, with the address of the word's definition in the latter four bytes — which is what compiling a word in the FORTH wordlist generates: a CALL to an address. A jump table containing MACRO words, generally speaking, will not work.

jump:
;[Refs: forth2]
	pop	edx ; [RST]
	add	edx, eax
	lea	edx, [5+eax*4+edx]
	add	edx, [-4+edx]
	drop
	jmp	edx

load

Load does not load blocks from disk; the blocks are assumed to be in memory already. Load simply multiplies the block number by 256 (0x100) to get the block's address. (Thus block 18 (0x12) is at address (0x12 shl 8) or 0x1200.) The interpreter begins grabbing the pre-parsed words already in memory and using each word's color bits to determine the routine to use to handle that word.

load:
;( b -- ) Interprets pre-parsed words in the block given.
;[Refs: start1, forth2]
	shl	eax, 10-2
	push	edi ; [RST]
	mov	edi, eax
	drop

inter

Note that inter is an endless loop. The only way to return from the interpreter (e.g., when the interpreter has no more pre-parsed words to interpret) is for a routine called by the interpreter (such as qignore) to pop an item from the return stack before returning (and thus to return to inter's caller). (Only qignore checks for the last word in the pre-parsed code, so there is no need for inter to do it.)

inter:
;[Refs: inter]
	mov	edx, [edi*4]
	inc	edi           ; (Thus all routines work on [-4+edi*4].)
	and	edx, 17o      ; Clear all bits except color bits (0..3).
	call	spaces[edx*4] ; Use result as offset to routine to run.
	jmp	inter

16 interpreter vectors

align 4
spaces
;[Refs: abort1, inter, sps]
	dd offset qignore, offset execute, offset num           ; colors 0..2

adefine
;[Refs: sdefine]
	dd 5+offset macro_                                          ; color 3
		; This is altered by sdefine, which stores the
		; address of either macrod (to define a MACRO
		; word) or forthd (to define a FORTH word) here.
	dd offset qcompile, offset cnum, offset cshort, offset compile ; 4..7
	dd offset short_, offset nul, offset nul, offset nul          ; 8..11
	dd offset variable, offset nul, offset nul, offset nul       ; 12..15

Other variables

lit

This code vector points to either of two routines:

adup. The routines forthd/macrod and alit store the address of adup into lit.
alit. The routines execute, short_, and num store the address of alit into lit.

The routines that set lit to point to alit are all routines that handle yellow cells (words or numbers), so alit is needed only to handle immediate execution of words, and not for defining or compiling. Lit usually points to adup — it is reset to adup's address not only whenever a word is being defined but also whenever alit is called.

(But is [lit] called during handling of current word, or between words? I remember something about the transition between yellow and green words, or vice versa, causing something special with the compiler.)

It is helpful here to check out what each of the interpreter subroutines does with the stack. Nul of course does nothing, and qignore, macrod, forthd, and variable all scrupulously avoid altering the stack. Execute uses the stack but then restores it to what it was. Num and short_ both leave something on the stack, while the remaining routines — qcompile, cnum, cshort, and compile — all execute "call [lit]" first thing, use the TOS without first executing a DUP, and then DROP later.

Now look at what the two lit routines do. Adup duplicates the TOS and returns, while alit, like qignore and the word-definition routines, scrupulously avoid altering the stack.

Now we begin to see how these routines work together. Num and short_ both leave something on the stack, and they both set lit to alit. The following invocation of any of qcompile, cnum, cshort, or compile does a call [lit], which results in alit being executed, which avoids altering the stack but does create a mov eax, <literal> instruction using the TOS (the number placed on the stack by either num and short_) as the literal. Once alit is called, lit is restored to adup, so that if qcompile or any other routine that does a call [lit] is called again, adup is called, which just preserves the TOS and returns, so that the caller can safely overwrite the TOS.

Note that execute does not leave anything on the stack itself, but it is invoked when the interpreter finds a yellow word, whose definition is supposed to leave something (a single number, per Chuck Moore) on the stack. Other than that, it acts like num and short_, in that it sets lit to alit, so that the item left on the stack can be compiled into a mov eax, <item> instruction.

lit
;[Refs: execute, forthd, variable, cnum, cshort, alit, qcompile, compile,
;       short_, num]
	dd offset adup

mk
;[Refs: mark, empty]
	dd 0, 0, 0

This is the pointer to the end of the code space — the next byte to be overwritten when a word needs to be compiled. The code space begins at address 0x100000 (the one-megabyte mark), so if the code space is empty, h is set at 0x100000. As the compiler adds bytes of machine code to the code space, h advances.

h
;[Refs: mark, empty, forthd, cdrop, qdup, cdup, literal, call_, comma, semi,
;       then, here, qlit, q1]
	dd 40000h*4

last

(forthd stores into "last" the DWORD pointer to last word defined -- the "colorless" pre-parsed word stored in the word array of whichever of the two wordlists was last updated -- presumably "last" is used for optimizing code)

last
;[Refs: forthd, last_]
	dd 0

class

From what I can tell, class always contains zero. Empty stores a zero here. Forthd will use this as a pointer to code if it is nonzero, but it doesn't store anything here. I have found no other references to class anywhere in the kernel, and class doesn't seem to be exposed in any way via the two wordlists. I'm guessing that this was intended as a way to customize the interpreter by having some custom code executed whenever a word is defined, except that I haven't found any way to do that except by altering the assembly-language source and reassembling ColorForth.

class
;[Refs: empty, forthd]
	dd 0

list

This is a list of up to two addresses within the dictionary. Each address is that of a previously compiled instruction — an instruction that could later be changed or removed in order to optimize the machine code.

list
;[Refs: forthd, cdrop, qdup, literal, call_, semi, then, begin, qlit]
	dd 0, 0

Check the index for other entries.