The Preface to “Macro SPITBOL: The High-Performance SNOBOL4 Language,” by Mark B. Emmer and Edward K. Quillen, includes the following:
Robert Dewar created Macro SPITBOL in the mid 1970’s, and showed the world that a high-performance version of SNOBOL4 was possible. His Macro SPITBOL compiler forms the kernel of this implementation. As an example of “aggressive” assembly-language-programming, it remains a programs tour de force.
I was reminded of this when I came upon a comment by Mark in the source for Macro SPITBOL:
note: as an experiment, we tried aligning all ent and prc’s on a dword boundary, in an attempt to minimize instruction stall time on the first opcode of a procedure. the resultant exe file waslarger by 592 bytes, and actually ran 0.1% slower. we will continue to use odd alignment of block entry routines so that the .cepp conditional may be used.
592 bytes? 0.1% difference in performance?
Can you think of another programmer who pays this level of attention to performance?
Here is another:
* move chars from xl (esi) to xr (edi), count in wa (ecx)
* the following sequence "old method" is shorter than the "new method"
* shown below, but is much slower because of the conditional jumps that
* cause the instruction cache to be flushed for 3 out of 4 count values.
* old method:
* shr ecx,1
* jnc tmp1
* movsb ; move odd byte
* tmp1 shr ecx,1
* jnc tmp2
* movsw ; move odd word
* tmp2 rep movsd ; move string as double words
* genop('jnc','short ' (t1 = genlab()))
* genopl(t1 label.delim,'shr','ecx','1')
* genop('jnc','short ' (t1 = genlab()))
* genopl(t1 label.delim,'rep','movsd') :(opdone)
* new method:
* shrd eax,ecx,1 ; preserve ecx in eax
* shr ecx,2 ; preserve ecx in cy, divide by 4
* rep movsd ; move dwords, leaves ecx=0
* adc ecx,ecx ; copy cy to ecx
* rep movsw ; copy 1 or 0 words, leaves ecx=0
* shld ecx,eax,1 ; copy eax to ecx
* rep movsb ; copy 1 or 0 bytes
Some knowledge — indeed more than I have — of programming for Intel’s x86 architecture is needed to fully appreciate the above, but the attention to detail is obvious.
Dewar’s MACRO SPITBOL was preceded by SPITBOL/360, written in IBM360 Assembly Language in the late 1960’s by Dewar and Ken Belcher. 
It must be the most efficient 360 assembly language program ever written. Robert once remarked that it was the only program known to him where the authors kept a copy of the instruction timing/cycle details for each model of the 360 in front of them as they wrote the code, and they frequently consulted these details to decide on best code sequence.
SPITBOL/360 demonstrated an astounding level of performance. SPITBOL/360 routinely compiled hundreds of thousands of lines of source code per second.
I once heard, perhaps from Robert himself, of one of the great tricks in SPITBOL/360.
SNOBOL has an integer variable
&STLIMIT. A counter is initially set to this value. Whenever a new statement is executed, the counter is decremented, and if the value reaches zero then an exception is raised. This is used to avoid runaway loops and other code that would result in execution going on forever.
The obvious method is to use an integer counter, decrement it when a new statement is executed, and raise the exception when the value zero is reached.
Dewar and Belcher’s device was to maintain the counter as a floating-point (real) variable. It was set by computing a “magic” value and initializing the counter to that value. When a new statement was executed, the value 1.0 was added to the floating point counter. The magic was in that a floating-point exception, due to overflow of the value, would be raised at the appropriate time.
Andy Koenig — a fellow contributor to Macro SPITBOL — made mention of this device in his column Some Programs Are Poorly Designed On Purpose. The post includes several other examples from SPITBOL of what Andy calls “nefarious purposes” in the use of floating point arithmetic.
Another device, one of much greater generality, was the use of indirect threaded code (ITC), described in Dewar’s paper Indirect Threaded Code.
Most interpreters use a “switch on opcode” statement when a new statement is to be execute. However, this involves testing the value, and doing a switch instruction to the appropriate code, which tends to slow things down. Indeed this is one of the main costs of writing an interpreter instead of generating actual code for the target machine. After all, the hardware is designed to extract opcodes and take the appropriate execution as fast as it is possible.
When using ITC, the transition from one abstract instruction to the next is made not by a switch, but by doing an “indirect branch” to the address contained in the the code block for the next statement.
By the way, this is the reason you cannot write code in C to execute the Minimal assembly language level code in which SPITBOL is written. C has no such contruct (what is needed is something like
By the way, Dewar is equally aggressive when it comes to documenting his code. See for example the source code for SPITBOL in file
spitbol.min available at Github Hardbol SPITBOL. I consider it a tour de force in program documentation.
For example, while working together on the port of Macro SPITBOL to the IBM PC in the early 80’s, I remember particularly an instance where Dewar saved a file, and then immediately went back to edit the source when he noticed there was an extraneous space..
1. SPITBOL/360 is available in open source form under the GPL license. See SPITBOL 360.
Dewar and Belcher applied their aggressive programing techniques in the early 80’s by writing a COBOL compiler in COBOL! The wikipedia entry for Robert Dewar includes:
In the 1970s Dewar was a principal author of the Realia COBOL compiler, widely used in commercial environments to this day (marketed by Computer Associates).
Written in COBOL, a remarkable feat in itself, I learned later from colleagues at IBM that Realia COBOL generated code that was better than that produced by IBM’s product COBOL compiler