
TPC65 - Arduino Optimization

One feature/limitation of my computer design is that it uses an Arduino to replace a lot of extra logic and interface components. Specifically, the Arduino performs the following functions:

  • Serial I/O
    • The computer piggy-backs off of the Arduino's built-in serial to USB interface chips to communicate with the host laptop.
  • Power-up reset circuitry
    • The 6502 needs to run for 50 or so full clock cycles with the RESET line held low, and then RESET is sent high for normal operation.
  • System clock
    • Because the computer uses the Arduino's USB interface, the Arduino must be kept synchronous to the computer. If the computer runs too quickly, it could outpace the Arduino's ability to read and send data.
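
The reset and clock duties in the list above can be sketched in plain C. This is a host-side model rather than the project's actual firmware: `bus_t`, `clock_pulse`, and `power_up_reset` are hypothetical names, and the struct fields merely stand in for the real CLK and RESET pins. The figure of 50 cycles is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two lines the Arduino drives. */
typedef struct {
    bool clk;         /* current level of the clock line */
    bool reset_n;     /* RESET line: false = 6502 held in reset */
    uint32_t pulses;  /* full clock cycles generated so far */
} bus_t;

/* Generate one full clock cycle: low phase, then high phase. */
void clock_pulse(bus_t *bus)
{
    bus->clk = false;
    bus->clk = true;
    bus->pulses++;
}

/* Hold RESET low for `cycles` full clock cycles, then release it. */
void power_up_reset(bus_t *bus, uint32_t cycles)
{
    assert(bus != NULL);
    bus->reset_n = false;
    for (uint32_t i = 0; i < cycles; i++) {
        clock_pulse(bus);
    }
    bus->reset_n = true;  /* normal operation starts here */
}
```

Calling `power_up_reset(&bus, 50)` on a zeroed `bus_t` leaves RESET high after 50 counted clock cycles. On real hardware the two assignments in `clock_pulse` would be pin writes.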
 

The last point is the most important of all. The Arduino must be kept in lock-step with the computer at all times to ensure that I/O data isn't missed. Thus, the clock output is just a digital output from the Arduino. This means that the computer's maximum speed depends on how quickly the Arduino (or more specifically, the ATmega328P) can do its internal operations.

This presents a series of interesting challenges. This is one of those use cases where every cycle spent by the ATmega matters and directly impacts the performance of the entire computer. Reducing the total number of clock cycles used by the ATmega becomes the challenge. And, since we're dealing directly with hardware, the Arduino core and even the code that avr-gcc generates by default are too slow! The solution comes down to direct port manipulation and AVR assembly.



Beginning to Optimize

The first iteration of the Arduino's code used the Arduino core exclusively, meaning lots of digitalRead and digitalWrite function calls. The Arduino core is nice for higher-level abstracted interfaces, but falls short when it comes to speed. The first alternative, instead of calling something like digitalWrite(3, HIGH), is to replace that code with direct port manipulation. This involves looking over the ATmega328P datasheet and cross-referencing the Arduino's pinout with the ATmega's I/O ports.

The ports themselves are represented as 8-bit registers (on the ATmega328P these are PORTB, PORTC, and PORTD). Each bit in that byte corresponds to one of the GPIO pins for that port. Thus, writing a value to the port can be done through direct assignment in C or bitwise operations, and input levels can be read from the matching PINx register. You could do something like PORTC ^= 0x01 to toggle the state of a single pin (bit 0 of port C), for example.
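
As a rough illustration of those bitwise operations, the sketch below uses an ordinary byte as a stand-in for a port register so that it runs on any machine. The helper names (`pin_mask`, `port_set`, `port_clear`, `port_toggle`, `port_read`) are made up for this example. On the real chip the same expressions would be applied to a register such as PORTD.

```c
#include <assert.h>
#include <stdint.h>

/* Build a one-bit mask for pin `bit` (0..7) of an 8-bit port. */
uint8_t pin_mask(uint8_t bit)
{
    assert(bit < 8);
    return (uint8_t)(1u << bit);
}

/* Drive one pin high, leaving the other seven bits untouched. */
uint8_t port_set(uint8_t port, uint8_t bit)
{
    return (uint8_t)(port | pin_mask(bit));
}

/* Drive one pin low, leaving the other seven bits untouched. */
uint8_t port_clear(uint8_t port, uint8_t bit)
{
    return (uint8_t)(port & (uint8_t)~pin_mask(bit));
}

/* Flip one pin, whatever its current level is. */
uint8_t port_toggle(uint8_t port, uint8_t bit)
{
    return (uint8_t)(port ^ pin_mask(bit));
}

/* Return 1 if the pin is high, 0 if it is low. */
uint8_t port_read(uint8_t port, uint8_t bit)
{
    return (uint8_t)((port >> bit) & 1u);
}
```

For bit 3, the masks work out to 0x08 for setting and toggling and 0xF7 for clearing, which are the constants that appear later in this post.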

A statement like that, however, still needs to be compiled into assembly code for the AVR chip. Usually avr-gcc is pretty good at optimizing code. But, we're in an odd position: every single cycle the AVR chip spends reduces the speed of the computer.

Counting Cycles

Arduino Core

When built against the Arduino core, the statements:

digitalWrite(CLK, LOW); 
digitalWrite(CLK, HIGH);
 
into:
 
     ldi r22,0
     ldi r24,lo8(3)
     call digitalWrite
 .LVL48:
     .loc 1 89 0
     ldi r22,lo8(1)
     ldi r24,lo8(3)
     call digitalWrite
 .LVL49:
     rjmp .L26
 
This uses two registers, four loads, and two function calls. The function calls themselves use a large number of cycles, because digitalWrite uses the value in register r24 to look up the specific port and pin at run time. This runs slow. Too slow.
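
To give a feel for where those cycles go, here is a heavily simplified host-side model of the kind of lookup a generic pin-write routine has to perform. It is not the Arduino core's source: `model_digital_write`, `pin_to_port`, `pin_to_mask`, and `sim_ports` are illustrative names, and the real function does additional work that this model leaves out. The pin grouping (0-7 on port D, 8-13 on port B, 14-19 on port C) follows the Uno pinout.

```c
#include <assert.h>
#include <stdint.h>

enum { PORT_B = 0, PORT_C = 1, PORT_D = 2, PORT_COUNT = 3 };

/* Simulated output registers, indexed by the enum above. */
uint8_t sim_ports[PORT_COUNT];

/* Map an Uno-style pin number (0..19) to the port it lives on. */
uint8_t pin_to_port(uint8_t pin)
{
    assert(pin < 20);
    if (pin < 8)  return PORT_D;  /* D0..D7  */
    if (pin < 14) return PORT_B;  /* D8..D13 */
    return PORT_C;                /* A0..A5  */
}

/* Map an Uno-style pin number to its bit mask within that port. */
uint8_t pin_to_mask(uint8_t pin)
{
    assert(pin < 20);
    if (pin < 8)  return (uint8_t)(1u << pin);
    if (pin < 14) return (uint8_t)(1u << (pin - 8));
    return (uint8_t)(1u << (pin - 14));
}

/* Every call repeats both lookups before it can touch the port. */
void model_digital_write(uint8_t pin, uint8_t level)
{
    uint8_t port = pin_to_port(pin);
    uint8_t mask = pin_to_mask(pin);
    if (level) {
        sim_ports[port] |= mask;
    } else {
        sim_ports[port] &= (uint8_t)~mask;
    }
}
```

Since the CLK pin never changes, all of this work produces the same port and mask on every call, which is why hard-coding them pays off.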
 
Testing with the generic Arduino functions had the computer running at a maximum speed of around 60kHz. The computer does work at this speed, but it's practically unusable.

Port Manipulation

The next step would be to try and remove the function calls entirely. The CLK pin is never going to change, so we can hard-code the port values. The simplest approach would be to use a bitwise exclusive-or on the specific bit that corresponds with the CLK pin. 

PORTD = PORTD ^ 0x8; 
PORTD = PORTD ^ 0x8;


This compiles into the following assembly:

 317     in r25,0xb
 318     eor r25,r18
 319     out 0xb,r25
 320     .loc 1 89 0
 321     in r25,0xb
 322     eor r25,r18
 323     out 0xb,r25

This is a lot better than before: both function calls have been removed, so no time will be taken up by doing subroutine jumps or navigating through the expensive digitalWrite function. But, here's the kicker: even with avr-gcc's optimizations, this assembly is still far from optimal!

This isn't entirely avr-gcc's fault: the compiler has no idea what the pins are going to be connected to, so it needs to be prepared for the unexpected. In fact, avr-libc declares the port registers as volatile, so the compiler has to assume that the port's value could change without warning. This means that the port's value cannot be cached and must be fetched every time. Thus, on line 321, the compiler does another read of the port, even though the value shouldn't have changed since the previous instruction (line 319).
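
The repeated read is easier to see when each statement is written out as its three steps. The sketch below is a host-side model: `sim_port_t`, `port_in`, and `port_out` are hypothetical stand-ins for the register and the `in`/`out` instructions, with counters added so the accesses can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated port register with access counters. */
typedef struct {
    uint8_t value;
    unsigned reads;
    unsigned writes;
} sim_port_t;

/* Stand-in for `in`: every call is a real read of the port. */
uint8_t port_in(sim_port_t *p)
{
    p->reads++;
    return p->value;
}

/* Stand-in for `out`: every call is a real write to the port. */
void port_out(sim_port_t *p, uint8_t v)
{
    p->writes++;
    p->value = v;
}

/* One `PORTD = PORTD ^ mask;` statement, step by step. */
void toggle_once(sim_port_t *p, uint8_t mask)
{
    uint8_t r = port_in(p);  /* in  r25,0xb  */
    r ^= mask;               /* eor r25,r18  */
    port_out(p, r);          /* out 0xb,r25  */
}

/* Two statements in a row: the second read cannot be skipped. */
void toggle_twice(sim_port_t *p, uint8_t mask)
{
    assert(p != NULL);
    toggle_once(p, mask);
    toggle_once(p, mask);
}
```

Two toggles cost two reads and two writes here, matching the six-instruction listing: nothing read in the first statement is reused by the second.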
 

Port Manipulation with Pre-Computed Values

Looking at the problem a bit more, we could drop the exclusive-or operations entirely. In fact, all six instructions in the previous compilation can be replaced by two instructions that use no general-purpose registers at all. The solution is using pre-computed values on the ports.

Specifically, changing the above to:

PORTD = PORTD & 0xF7;   
PORTD = PORTD | 0x8;
 
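Functionally, this matches the exclusive-or version above as long as the CLK pin starts out high: it still pulses one bit low and then high again. But, because each statement now changes a single, constant bit of a register in the low I/O address range, avr-gcc can use the single-bit clear and set instructions, and the produced assembly changes to: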

    cbi 0xb,3
    sbi 0xb,3
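
A small host-side comparison of the two C forms, again using a plain byte in place of PORTD. The names `pulse_t`, `pulse_xor`, and `pulse_and_or` are invented for this sketch. It records the port value after each of the two statements so the pulses can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Port value after the first and after the second statement. */
typedef struct {
    uint8_t mid;
    uint8_t end;
} pulse_t;

/* PORTD = PORTD ^ 0x8; twice: flips bit 3, then flips it back. */
pulse_t pulse_xor(uint8_t port)
{
    pulse_t p;
    p.mid = (uint8_t)(port ^ 0x08);
    p.end = (uint8_t)(p.mid ^ 0x08);
    return p;
}

/* PORTD = PORTD & 0xF7; then PORTD = PORTD | 0x8; low, then high. */
pulse_t pulse_and_or(uint8_t port)
{
    pulse_t p;
    p.mid = (uint8_t)(port & 0xF7);
    p.end = (uint8_t)(p.mid | 0x08);
    return p;
}

/* Self-check: with CLK starting high, both forms give the same pulse. */
void check_equivalent_when_high(void)
{
    pulse_t a = pulse_xor(0x08);
    pulse_t b = pulse_and_or(0x08);
    assert(a.mid == b.mid && a.end == b.end);
}
```

The two forms only differ if the pin starts low: the exclusive-or version would then pulse high and end low, while the AND/OR version always ends with the pin high. Neither form disturbs the other seven bits.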
 
The clear and set bit instructions above take two CPU cycles each to complete, for a total of four CPU cycles. Thus, with the context of the loop this function exists in, the shortest possible loop for the ATmega chip, while maintaining all functionality from above, becomes this:
 
.L11:
    cbi 0xb,3           // set clk pin low
    sbi 0xb,3           // set clk pin high
    subi r24,lo8(-(-1)) // check if arduino is being accessed
    brne .L11           // if not, goto .L11
 
In total, this section of code uses seven CPU cycles: two each for cbi and sbi, one for subi, and two for brne when the branch is taken. Thus, with the Arduino's crystal oscillator clocked at 16MHz, the maximum possible switching speed of the CLK pin becomes 16MHz / 7, or roughly 2.28MHz.
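
The arithmetic can be checked with a few lines of C. This is only a worked calculation: the function names are made up for the example, and the per-instruction cycle counts are the standard ones for these AVR instructions (with brne counted as a taken branch).

```c
#include <assert.h>
#include <stdint.h>

/* Cycle costs of the four instructions in the loop. */
enum {
    CYCLES_CBI = 2,
    CYCLES_SBI = 2,
    CYCLES_SUBI = 1,
    CYCLES_BRNE_TAKEN = 2
};

/* CPU cycles for one pass through the loop (one CLK pulse). */
uint32_t loop_cycles(void)
{
    return CYCLES_CBI + CYCLES_SBI + CYCLES_SUBI + CYCLES_BRNE_TAKEN;
}

/* CLK frequency in Hz when each loop pass emits one pulse. */
uint32_t clk_hz(uint32_t f_cpu_hz, uint32_t cycles_per_pass)
{
    assert(cycles_per_pass > 0);
    return f_cpu_hz / cycles_per_pass;  /* integer division truncates */
}

/* CPU cycles spent per CLK period at a given output frequency. */
uint32_t cycles_per_period(uint32_t f_cpu_hz, uint32_t f_out_hz)
{
    assert(f_out_hz > 0);
    return f_cpu_hz / f_out_hz;
}
```

With a 16MHz CPU clock, `clk_hz(16000000, loop_cycles())` gives 2,285,714 Hz. For comparison, `cycles_per_period(16000000, 60000)` gives 266, so the digitalWrite version was spending on the order of 266 CPU cycles per clock period.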
 

Limited, but Faster

2.28MHz seems to be the magic number. I can't think of a way to reduce the number of instructions in the loop above any further, except for possibly using some type of frequency doubler in hardware. For a 6502 computer, that is a respectable speed. The Apple 1 and 2 ran at 1MHz, for example. I've tested the computer with the Arduino removed, using a function generator as a clock source, and the computer is stable up to about 12MHz.

Unfortunately, because the Arduino needs to be in sync with the computer at all times, some operations performed by the Arduino will take longer than others. So, the output clock signal is not stable. But, 2.28MHz is a lot nicer to work with than 60kHz!


