I've been writing 80386 assembly language for about ten years now, and I've gotten rather used to it -- enough, at least, to write some fairly long assembly language fragments for some obscure video program. 80386 assembly has a lot of goofy syntax and instruction set idiosyncrasies, and it's almost the ugliest and nastiest assembly language I know, second only to 8086 real mode. It probably didn't help that I had the joy of learning the latter with the original IBM Macro Assembler 1.00.
I know a couple of other assembly languages, too, such as 6502 -- of which my main abuses can be summed up with BIT $C030 -- and enough MIPS that I can squeak by when reading a PowerPC listing. My favorite assembly language of all time, however, is that of the MC68000. Not only was that the CPU on which I did most of my early major assembly language coding, but the assembly code is very readable, and the instruction set rather flexible.
The 68000 has the rare (unique?) characteristic of having separate address and data registers. There are eight data registers, D0-D7, and eight address registers, A0-A7. A7 is also the stack pointer, SP. This means that 68000 assembly language tends to be easier to follow because you can easily tell which registers hold data and which hold addresses. For example, this is 68000 assembly to compute the sum of an array of words:
        moveq #0, d0
        moveq #0, d1
    loop:
        move.w (a0)+, d0
        add.l d0, d1
        dbra d2, loop
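For anyone who doesn't read 68K, here is a rough C model of what that loop does. The function name `sum_words` is made up for illustration, and the variables are simply named after the registers. One thing the model makes visible is that `dbra` runs the body once more than the count in d2.

```c
#include <stddef.h>
#include <stdint.h>

/* C model of the summing loop. a0 walks the array, d0 holds each word
   zero-extended to 32 bits, d1 accumulates, and d2 is the dbra counter. */
uint32_t sum_words(const uint16_t *a0, uint16_t d2)
{
    uint32_t d0 = 0;   /* moveq #0, d0: the upper 16 bits stay clear */
    uint32_t d1 = 0;   /* moveq #0, d1: running total */

    do {
        d0 = (d0 & 0xFFFF0000u) | *a0++;   /* move.w (a0)+, d0 */
        d1 += d0;                          /* add.l d0, d1 */
    } while (d2-- != 0);                   /* dbra d2, loop: runs d2 + 1 times */

    return d1;
}
```

So to sum four words you pass 3 as the counter, which is the usual off-by-one to remember when setting up a `dbra` loop.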
It also had an instruction to quickly push or pop a series of registers, which made function prologues and epilogues very compact:
movem.l a0-a4/a6/d0-d3/d5/d7, -(sp)
...and it had both memory-to-memory moves and predecrement/postincrement modes:
    copyloop:
        move.l (a0)+, (a1)+
        dbra d0, copyloop
With as many as fifteen registers available for use, the 68000 naturally lends itself to a register passing scheme for function call parameters and keeping most working variables within the register set. I hate writing spill code, so I tended to use most or all of the registers, often to the point of reusing address registers for data or using the upper 16 bits of data registers for temporary storage.
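The upper-half trick can be sketched in C like this. It is only a model: `swap_halves` stands in for the 68000's `swap` instruction, `write_low_word` for a word-sized move into a data register, and `stash_and_work` is a hypothetical name for the overall pattern.

```c
#include <stdint.h>

/* Model of swap: exchange the two 16-bit halves of a data register. */
static uint32_t swap_halves(uint32_t d)
{
    return (d << 16) | (d >> 16);
}

/* Model of a word-sized move into a data register: only the low 16 bits
   change, and the upper half is left alone. */
static uint32_t write_low_word(uint32_t d, uint16_t w)
{
    return (d & 0xFFFF0000u) | w;
}

/* Park `temp` in the upper half of d0, then reuse the lower half. */
uint32_t stash_and_work(uint16_t temp, uint16_t work)
{
    uint32_t d0 = write_low_word(0, temp);   /* move.w temp, d0 */
    d0 = swap_halves(d0);                    /* swap d0: temp now sits in the upper half */
    d0 = write_low_word(d0, work);           /* move.w work, d0 */
    return d0;                               /* another swap brings temp back */
}
```

This only holds up as long as everything done to the register in between is word or byte sized, since any longword operation would clobber the stashed half.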
For the most part, the 68000 treats all data and address registers equivalently, so register selection is easy. One hidden trap is that:
    move.b d0, -(a7)
    move.b (a7)+, d0
...actually use a full word of stack each, since byte moves through A7 with predecrement or postincrement are special-cased to keep the stack aligned. Of course, if you were a real 68K programmer, you abused this to quickly shift a word by 8 bits.
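Here is a small C model of the shift-by-8 abuse, as I understand it: push a byte, then pop a word. The `Stack68k` struct, the 16-byte stack area, and the function names are all invented for the sketch. It assumes the pushed byte lands at the new, even stack address, which is the high byte of a big-endian word.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  mem[16];   /* tiny hypothetical stack area */
    uint32_t a7;        /* stack pointer, as an index into mem */
} Stack68k;

/* move.b d0, -(a7): a7 drops by 2, not 1, and the byte is stored at the
   new even address, i.e. the high byte of a big-endian word. */
static void push_byte(Stack68k *s, uint8_t b)
{
    s->a7 -= 2;
    s->mem[s->a7] = b;
}

/* move.w (a7)+, d0: read a big-endian word, then bump a7 by 2. */
static uint16_t pop_word(Stack68k *s)
{
    uint16_t w = (uint16_t)((s->mem[s->a7] << 8) | s->mem[s->a7 + 1]);
    s->a7 += 2;
    return w;
}

uint16_t shift_left_8_via_stack(uint8_t b)
{
    Stack68k s;
    memset(s.mem, 0xAA, sizeof s.mem);   /* pretend the stack holds old junk */
    s.a7 = sizeof s.mem;

    push_byte(&s, b);
    /* The low byte is whatever was already on the stack, so mask it off. */
    return (uint16_t)(pop_word(&s) & 0xFF00u);
}
```

The mask at the end is part of the model rather than the trick itself. On real hardware you would either clear the low byte separately or not care about it.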
Most of the coding I did was on an Amiga 2000 with a 28MHz cached 68000 (don't ask). A little bit was done to optimize programs written in Lattice C — which stubbornly used longwords for ints — but most of the 68K I wrote was actually in a language extension for a BASIC variant called AMOS Professional, which is best described as BASIC on steroids. I added statements for things like asynchronous disk I/O, smooth slider UI widgets, nearly instant maze generation, and LFSR noise generation. This might sound insane, but you have to realize that this was a BASIC interpreter that had verbs for disabling interrupts and running code from vertical blank.
A major annoyance of the 68000 is that it is incapable of accessing word or longword data misaligned, i.e. on an odd address. You would want to align variables anyway for performance, but a slow read is still better than having code blow up with an address error when you forget. The Amiga didn't have a way to cleanly kill a task and reclaim its resources; the usual mode of development was to keep dragging the program errors off screen until you ran out of memory and then restart your machine, which was optimized to boot quickly. (This is the same thing you did in Windows 95 when programs kept crashing in the code to dismiss the crash dialog.)
The other annoyance was discovering that the assembler had helpfully encoded an equality branch as beq.l, which is fine except that branch instructions with 32-bit offsets required a 68020.
I only did a little 68000 coding on the Macintosh for a class project, and I hated it, because classic MacOS insists on being able to move your application's data segment (pointed to by A5) during many system calls. This meant you couldn't keep pointers to mutable data unless you relocated them somehow. I ended up littering my assembly with junk like this:
    sub.l a5, a2
    sub.l a5, a3
    jsr randomcall
    add.l a5, a3
    add.l a5, a2
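The same dance in C, as a sketch: turn the pointer into an offset from the base, make the call, then rebuild the pointer from wherever the base ended up. Everything here is a stand-in. The global `a5` models the register, `random_call` is a hypothetical routine that moves the block the way a system call might, and `relocate_demo` just exercises it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static char *a5;   /* base of the movable data segment (models register A5) */

/* Hypothetical stand-in for a call that relocates the data segment. */
static int random_call(size_t size)
{
    char *moved = malloc(size);
    if (moved == NULL)
        return -1;
    memcpy(moved, a5, size);
    free(a5);
    a5 = moved;
    return 0;
}

int relocate_demo(void)
{
    const size_t size = 64;
    a5 = calloc(size, 1);
    if (a5 == NULL)
        return -1;

    char *a2 = a5 + 10;
    *a2 = 42;

    ptrdiff_t offset = a2 - a5;      /* sub.l a5, a2 */
    int status = random_call(size);  /* jsr randomcall */
    a2 = a5 + offset;                /* add.l a5, a2 */

    int value = (status == 0) ? *a2 : -1;
    free(a5);
    a5 = NULL;
    return value;
}
```

Using the old pointer after the call would read freed memory in this model, which is roughly the kind of bug the subtract-and-add pairs were there to prevent.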
Besides that little annoyance, during that class I tried to optimize my programs before submission, which probably confused the poor TAs who thought that movea.b d0, a0 was a valid instruction.
I did a little bit of 68000 later on a Palm device, which used a 68328 (Dragonball), an embedded MCPU with a 68K core. Using gas for an assembler was a little annoying because I had to prefix all register references with %, but that was more than made up for by being able to use the C preprocessor.
A long time ago, when BYTE magazine was still good, they had a green-covered issue on the 68000 which had an article with goofy tricks and tips for writing fast 68000 code. The 68000 is often instruction fetch bottlenecked, so the goal was usually to pick instructions that ran as close as possible to the 4 clocks/instruction limit. The simplest and most useful trick was that the straightforward way of clearing a register:
    clr.l d0
...was actually two clocks slower than using the move-quick instruction:
    moveq #0, d0
Such tricks were the basis of knowing how to abuse the instruction set. Some of my favorites are (forgive me for errors, since I haven't done 68K for a while):
The movem (move multiple) instruction, while usually for saving or restoring registers from the stack, was also the fastest way to move blocks of memory. Using 15 registers, you could nearly reach the limit of the bus. It was significantly faster than the standard copy loop, possibly even on a 68010 with its loop mode.
movep was an obscure instruction to read or write data from an 8-bit device that had been connected to the 68000's 16-bit bus. It would transfer a word or longword on a series of byte locations whose addresses were all even or odd. You could also use this to slightly optimize a series of sparse, unaligned moves. I'm proud to say I broke someone's 68000 emulator with this instruction once, presumably because its use is so rare.
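A C model of the longword form, going by my memory of the instruction: the bytes go out most significant first, to every other address. The function names are made up, and `a0` is just a byte pointer standing in for the address register.

```c
#include <stdint.h>

/* Model of movep.l d0, 0(a0): the four bytes of d0, most significant
   first, are stored at a0, a0+2, a0+4 and a0+6. */
void movep_long_store(uint8_t *a0, uint32_t d0)
{
    for (int i = 0; i < 4; i++)
        a0[2 * i] = (uint8_t)(d0 >> (24 - 8 * i));
}

/* Model of movep.l 0(a0), d0: gather the same four bytes back. */
uint32_t movep_long_load(const uint8_t *a0)
{
    uint32_t d0 = 0;
    for (int i = 0; i < 4; i++)
        d0 = (d0 << 8) | a0[2 * i];
    return d0;
}
```

The bytes in between are never touched, which is what makes it usable for an 8-bit peripheral wired to only one half of the data bus.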
Unsigned clamping (for which I got sworn at by a fellow programmer):
        cmp.w d1, d0
        bcs.b noclamp
        spl d0
        ext.w d0
        and.w d1, d0
    noclamp:
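Since that sequence leans entirely on condition codes, here is my reading of it as a C model. `clamp_word` is a name I made up. The carry and negative flags from `cmp.w` are modeled with an unsigned compare and the top bit of the 16-bit difference.

```c
#include <stdint.h>

/* Model of the flag-based clamp: d0 is the value, d1 the upper limit. */
uint16_t clamp_word(uint16_t d0, uint16_t d1)
{
    uint16_t diff = (uint16_t)(d0 - d1);   /* cmp.w d1, d0: flags come from d0 - d1 */

    if (d0 < d1)                           /* bcs.b noclamp: unsigned d0 < d1 */
        return d0;

    /* spl d0 sets the low byte to $FF if the N flag is clear, else $00;
       ext.w d0 then widens that to $FFFF or $0000. */
    uint16_t mask = (diff & 0x8000u) ? 0x0000u : 0xFFFFu;

    return (uint16_t)(mask & d1);          /* and.w d1, d0 */
}
```

Read this way, one unsigned compare catches both out-of-range cases: values above the limit come out as the limit, and negative values, which look huge when treated as unsigned, come out as zero. The model assumes the subtraction doesn't overflow as a signed word. With a limit of 100, for example, an input of $8000 ends up at 100 rather than 0.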
Jump tables were surprisingly difficult to do quickly. Unlike the 386, which treats addressing modes in a call or jump instruction the same way as for others, the 68000 uses the effective address of the memory operand instead, which means you can't do a table lookup in the JSR instruction. So I ended up actually making the table a list of branches... with some optimized cases:
        add.w d0, d0
        add.w d0, d0
        jmp jumptable(pc, d0.w)
    jumptable:
        bra.w target1
        moveq #17, d2
        bra.b target2
        moveq #-19, d2
        bra.b target2
        bra.b target3
        nop
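In C terms, the table above behaves roughly like the switch below. Each table entry is four bytes, which is why the index gets doubled twice, and two of the entries spend their spare two bytes preloading d2 before sharing a target. The `Target` enum and `dispatch` are invented for the sketch, and `TARGET_NONE` exists only because C needs something to return for an out-of-range index. The assembly does no range check at all.

```c
#include <stdint.h>

typedef enum { TARGET_NONE = 0, TARGET1, TARGET2, TARGET3 } Target;

/* Model of the branch table: d0 selects the entry, d2 may be preloaded. */
Target dispatch(unsigned d0, int32_t *d2)
{
    switch (d0) {
    case 0:                 /* bra.w target1 */
        return TARGET1;
    case 1:                 /* moveq #17, d2 / bra.b target2 */
        *d2 = 17;
        return TARGET2;
    case 2:                 /* moveq #-19, d2 / bra.b target2 */
        *d2 = -19;
        return TARGET2;
    case 3:                 /* bra.b target3 / nop (padding to 4 bytes) */
        return TARGET3;
    default:                /* not present in the assembly version */
        return TARGET_NONE;
    }
}
```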