One feature/limitation of my computer design is that it uses an Arduino to replace a lot of extra logic and interface components. Specifically, the Arduino performs the following functions:
- Serial I/O
  - The computer piggy-backs off of the Arduino's built-in serial to USB interface chips to communicate with the host laptop.
- Power-up reset circuitry
  - The 6502 needs to run for 50 or so full clock cycles with the RESET line held low, and then RESET is sent high for normal operation.
- System clock
  - Because the computer uses the Arduino's USB interface, the Arduino must be kept synchronous to the computer. If the computer runs too quickly, it could outpace the Arduino's ability to read and send data.
The last point is the most important of all. The Arduino must be kept in lock-step with the computer at all times to ensure that I/O data isn't missed. Thus, the clock output is just a digital output from the Arduino. This means that the computer's maximum speed depends on how quickly the Arduino (or more specifically, the ATmega328P) can do its internal operations.
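The reset and clock duties described above can be modelled on a host machine. This is only a sketch under stated assumptions: the `Bus` struct, `pulse_clock`, and `power_up_reset` are hypothetical names, and the pins are plain struct fields rather than real Arduino outputs. It shows that the 6502 only ever sees clock cycles the Arduino chooses to generate, and that RESET is released after a fixed number of them.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the two control lines the Arduino drives.
   Hypothetical names; on real hardware these would be output pins. */
typedef struct {
    uint8_t clk;              /* current level of the clock line */
    uint8_t reset;            /* current level of RESET (0 = held in reset) */
    unsigned cycles_in_reset; /* full clock cycles issued while RESET was low */
    unsigned cycles_running;  /* full clock cycles issued after release */
} Bus;

/* One full clock cycle: drive CLK low, then high again. */
static void pulse_clock(Bus *bus)
{
    bus->clk = 0;
    bus->clk = 1;
    if (bus->reset == 0)
        bus->cycles_in_reset++;
    else
        bus->cycles_running++;
}

/* Hold RESET low for `cycles` clock cycles, then send it high. */
static void power_up_reset(Bus *bus, unsigned cycles)
{
    assert(cycles > 0);
    bus->reset = 0;
    for (unsigned i = 0; i < cycles; i++)
        pulse_clock(bus);
    bus->reset = 1;
}
```

Calling `power_up_reset(&bus, 50)` matches the "50 or so" cycles mentioned in the list; every later call to `pulse_clock` is one cycle of normal operation.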
This presents a series of interesting challenges. This is one of those use cases where every cycle the ATmega spends matters and directly impacts the performance of the entire computer. Reducing the total number of clock cycles used by the ATmega becomes the challenge. And, since we're dealing directly with hardware, the Arduino core and even the code avr-gcc generates by default are too slow! The solution comes down to direct port manipulation and AVR assembly.
Beginning to Optimize
The first iteration of the Arduino's code used the Arduino core exclusively, meaning lots of digitalRead and digitalWrite function calls. The Arduino core is nice for higher-level abstracted interfaces, but falls short when it comes to speed. The first alternative, instead of calling something like digitalWrite(3, HIGH), is to replace that code with direct port manipulation. This involves looking over the ATmega328P datasheet and cross-referencing the Arduino's pinout with the ATmega's I/O ports.
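As a rough illustration of that cross-referencing, the sketch below maps an Arduino digital pin number to its ATmega328P port and bit mask. It assumes the standard Uno/Nano pin layout (pins 0 to 7 on port D, 8 to 13 on port B, and A0 to A5 as pins 14 to 19 on port C). `PinLocation` and `locate_pin` are hypothetical names, not part of the Arduino core.

```c
#include <stdint.h>

typedef struct {
    char port;    /* 'B', 'C', or 'D'; 0 if the pin number is out of range */
    uint8_t mask; /* single-bit mask for the pin within that port */
} PinLocation;

/* Assumes the Uno/Nano mapping; other boards lay their pins out differently. */
static PinLocation locate_pin(uint8_t pin)
{
    PinLocation loc = {0, 0};
    if (pin <= 7) {
        loc.port = 'D';
        loc.mask = (uint8_t)(1u << pin);
    } else if (pin <= 13) {
        loc.port = 'B';
        loc.mask = (uint8_t)(1u << (pin - 8));
    } else if (pin <= 19) {
        loc.port = 'C';
        loc.mask = (uint8_t)(1u << (pin - 14));
    }
    return loc;
}
```

Under this mapping, digital pin 3 is bit 3 of port D, which gives the mask 0x08.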
The ports themselves are represented as 8-bit values (labeled PORTB, PORTC, and PORTD on the ATmega328P). Each bit in that byte corresponds to one of the GPIO pins for that port. Thus, modifying or reading a value from the port can be done through direct assignment in C or bitwise operations. You could do something like PORTC ^= 0x03 to toggle the two lowest pins of port C, for example.
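A minimal sketch of those bitwise operations, written against a plain byte so it can run on a host machine. On the ATmega the same expressions would be applied to a register such as PORTD; the helper names here are hypothetical.

```c
#include <stdint.h>

/* Return `port` with every bit in `mask` driven high. */
static uint8_t set_bits(uint8_t port, uint8_t mask)
{
    return (uint8_t)(port | mask);
}

/* Return `port` with every bit in `mask` driven low. */
static uint8_t clear_bits(uint8_t port, uint8_t mask)
{
    return (uint8_t)(port & (uint8_t)~mask);
}

/* Return `port` with every bit in `mask` inverted. */
static uint8_t toggle_bits(uint8_t port, uint8_t mask)
{
    return (uint8_t)(port ^ mask);
}
```

Toggling with the same mask twice returns the original value, which is why two exclusive-or statements in a row produce one full pulse on a pin.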
A statement like that, however, still needs to be compiled into assembly code for the AVR chip. Usually avr-gcc is pretty good at optimizing code. But we're in an odd position: every single cycle the AVR chip spends reduces the speed of the computer.
Counting Cycles
Arduino Core
When compiled against the Arduino core, the statements:
digitalWrite(CLK, LOW);
digitalWrite(CLK, HIGH);
into:
448 ldi r22,0
449 ldi r24,lo8(3)
450 call digitalWrite
451 .LVL48:
452 .loc 1 89 0
453 ldi r22,lo8(1)
454 ldi r24,lo8(3)
455 call digitalWrite
456 .LVL49:
457 rjmp .L26
This uses two registers, four loads, and two function calls. The function calls themselves use a large number of cycles, since digitalWrite uses the value in register r24 to look up the specific port and pin. This runs slow. Too slow.
Testing with the generic Arduino functions had the computer running at a maximum speed of around 60kHz. The computer does work at this speed, but it's practically unusable.
Port Manipulation
The next step would be to try and remove the function calls entirely. The CLK pin is never going to change, so we can hard-code the port values. The simplest approach would be to use a bitwise exclusive-or on the specific bit that corresponds with the CLK pin.
PORTD = PORTD ^ 0x8;
PORTD = PORTD ^ 0x8;
This compiles into the following assembly:
317 in r25,0xb
318 eor r25,r18
319 out 0xb,r25
320 .loc 1 89 0
321 in r25,0xb
322 eor r25,r18
323 out 0xb,r25
This is a lot better than before: both function calls have been removed, so no time is taken up by subroutine jumps or by navigating through the expensive digitalWrite function. But here's the kicker: even with avr-gcc's optimizations, this assembly is still extremely unoptimized!
This isn't entirely avr-gcc's fault. The compiler has no idea what the pins are going to be connected to, so it needs to be prepared for the unexpected. This is the behavior of the volatile keyword in C, which avr-libc applies to the port registers: the compiler must assume that the port's value could change without warning, which means the value cannot be cached and must be fetched every time. Thus, on line 321, the compiler is doing another read of the port, even though the value shouldn't have changed since the previous instruction (line 319).
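To make the point about volatile concrete, here is a host-side sketch. avr-libc defines I/O registers such as PORTD as dereferences of volatile pointers; the `FAKE_PORTD` macro below imitates that pattern over an ordinary variable. `FAKE_PORTD`, `fake_portd_storage`, `CLK_MASK`, and `toggle_clk` are made-up names for this example only.

```c
#include <stdint.h>

/* Stands in for the hardware register when running on a host machine. */
static uint8_t fake_portd_storage = 0x00;

/* Same shape as an avr-libc register definition: a volatile byte access. */
#define FAKE_PORTD (*(volatile uint8_t *)&fake_portd_storage)

/* Bit 3, matching the CLK pin used in this post. */
#define CLK_MASK 0x08u

/* One toggle is a read, an exclusive-or, and a write. Because the access
   is volatile, two calls in a row cannot be merged or cancelled out. */
static uint8_t toggle_clk(void)
{
    FAKE_PORTD = (uint8_t)(FAKE_PORTD ^ CLK_MASK);
    return FAKE_PORTD;
}
```

Without the volatile qualifier, an optimizer would be free to notice that two exclusive-ors with the same mask cancel and drop both writes, which is exactly what must not happen on a clock pin.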
Port Manipulation with Pre-Computed Values
Looking at the problem a bit more, we could drop the exclusive-or operations entirely. In fact, all six instructions in the previous compilation can be done in two instructions and zero registers in total. The solution is using pre-computed values on the ports.
Specifically, changing the above to:
PORTD = PORTD & 0xF7;
PORTD = PORTD | 0x8;
Functionally, this is exactly the same as the bitwise operations above: it's still toggling one bit off and on. But now the produced assembly changes to:
315 cbi 0xb,3
316 sbi 0xb,3
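The two constants in that snippet are a single-bit mask and its complement. A small host-side sketch, using a plain byte in place of PORTD and hypothetical names throughout, shows the relationship and confirms that the other seven pins are left alone.

```c
#include <stdint.h>

#define CLK_BIT  3u
#define CLK_HIGH ((uint8_t)(1u << CLK_BIT)) /* 0x08: OR with this to set CLK */
#define CLK_LOW  ((uint8_t)~CLK_HIGH)       /* 0xF7: AND with this to clear CLK */

/* Drive only the CLK bit low; every other bit keeps its value. */
static uint8_t clk_low(uint8_t port)
{
    return (uint8_t)(port & CLK_LOW);
}

/* Drive only the CLK bit high; every other bit keeps its value. */
static uint8_t clk_high(uint8_t port)
{
    return (uint8_t)(port | CLK_HIGH);
}
```

Because each statement changes exactly one known bit of a low I/O register, the compiler can express it as a single bit instruction instead of a read, an operation, and a write.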
The clear-bit and set-bit instructions above take two CPU cycles each to complete, for a total of four CPU cycles. Thus, in the context of the loop this code lives in, the shortest possible loop for the ATmega chip, while maintaining all functionality from above, becomes this:
302 .L11:
303 cbi 0xb,3 // set clk pin low
304 sbi 0xb,3 // set clk pin high
305 subi r24,lo8(-(-1)) // check if arduino is being accessed
306 brne .L11 // if not, goto .L11
In total, this section of code uses seven CPU cycles: two each for cbi and sbi, one for subi, and two for brne when the branch is taken. Thus, with the Arduino's crystal oscillator clocked at 16MHz, the maximum possible switching speed of the CLK pin becomes 16MHz / 7, or roughly 2.28MHz.
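The arithmetic can be checked with a few lines of C. The per-instruction cycle counts are the standard ones for this class of AVR (cbi and sbi two cycles each, subi one, brne two when taken). The function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define F_CPU_HZ 16000000UL /* 16MHz crystal on the Arduino */

/* Cycle cost of each instruction in the tight loop. */
enum { CYCLES_CBI = 2, CYCLES_SBI = 2, CYCLES_SUBI = 1, CYCLES_BRNE_TAKEN = 2 };

/* Total CPU cycles for one pass through the loop. */
static unsigned tight_loop_cycles(void)
{
    return CYCLES_CBI + CYCLES_SBI + CYCLES_SUBI + CYCLES_BRNE_TAKEN;
}

/* Clock output frequency in Hz when one full CLK cycle is produced
   every `cycles_per_loop` CPU cycles. */
static unsigned long clk_frequency_hz(unsigned cycles_per_loop)
{
    assert(cycles_per_loop > 0);
    return F_CPU_HZ / cycles_per_loop;
}

/* The reverse: CPU cycles per CLK cycle implied by a measured
   output frequency, rounded to the nearest whole cycle. */
static unsigned long cycles_for_frequency(unsigned long clk_hz)
{
    assert(clk_hz > 0);
    return (F_CPU_HZ + clk_hz / 2) / clk_hz;
}
```

By the same arithmetic, the roughly 60kHz measured with digitalWrite corresponds to about 267 CPU cycles per clock period, against seven for the tight loop.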
Limited, but Faster
2.28MHz seems to be the magic number. I can't think of a way to reduce the number of instructions in the loop above any further, except possibly by using some type of frequency doubler in hardware. For a 6502 computer, that is a respectable speed. The Apple I and II ran at about 1MHz, for example. I've tested the computer with the Arduino removed, using a function generator as a clock source, and the computer is stable up to about 12MHz.
Unfortunately, because the Arduino needs to stay in sync with the computer at all times, some operations performed by the Arduino take longer than others. So, the output clock signal is not stable. But 2.28MHz is a lot nicer to work with than 60kHz!