
Touch Develop in 208 Bits

The Prehistory

You probably know that you can run Touch Develop scripts in a browser, on a tablet, phone or laptop. You may also know we have a way of running them in the cloud with node.js. But microcontrollers? As a matter of fact, yes, we can!
We started, over two years ago, with my proof of concept Touch Develop to C compiler which was written as a Touch Develop script itself. Later Peli extended it greatly and converted it into an Arduino Upload plugin, so that it can easily be accessed from the editing environment. This works great, but still requires quite a bit of installation – you need a C compiler on the local machine, as well as special drivers to flash the Arduino device. If you want to use it in a school environment this is quite a barrier.
Enter ARM mbed. It’s an online embedded C++ prototyping environment. You write your program in the browser, send it to the cloud for compilation, and then flash the board, which acts as a USB mass storage device, by just copying the file over. No installation required, everything works in the browser. Simply awesome!
The BBC micro:bit (prototype)

The Bit

Given this extreme ease of use and ARM’s involvement in the partnership, the mbed platform became the base for the BBC micro:bit. The micro:bit is an education-minded, low-cost device to be given to every 11-year-old in the UK to excite kids about programming. It follows in the tradition of the BBC Micro, an 8-bit computer released in 1981 (a very good vintage, I have to tell you), which inspired a generation of geeks. Interestingly, the specs of the micro:bit are just a slight upgrade over the BBC Micro – the CPU is maybe 20 times faster and the memory size is similar. This is fully intentional – the device is meant to be simple and easy to understand. The main change, in my opinion, is the technology advance making it cheap enough to give out to a million kids. Then, of course, it’s also a Thing, as in Internet of Things, with Bluetooth Low Energy (BLE) and very long battery life.
Back to software: to go with the new shiny hardware, we needed a more capable Touch Develop to C++ compiler. Jonathan set out to write one. It uses a number of more recent C++ features, like lambda functions, to produce highly readable C++ code from the Touch Develop abstract syntax tree (AST). Moreover, he also designed an interface to hook up external editors to Touch Develop, and let them use the same compiler. We have three of these right now – Block Editor (also by Jonathan), based on Google Blockly, Code Kingdoms, and MicroPython. The first two use this AST route to get a binary to be put on the device.

The Inspiration

MicroPython, however, does something different. It takes a pre-compiled Python interpreter binary, appends the text of the script, and flashes the entire thing on the device. The best thing about this process is that it happens completely in the browser! You hit a button and immediately get a binary, regardless of your network connectivity.
There are two reasons why this is a big deal, and both relate to compilation speed. Compiling a simple C++ program and linking it into a .hex file (the format used to flash the device) takes about 1 second of CPU time. With at least some virtualization and communication overheads, it gets us into 2s territory. Now, if the process is generic enough (can handle arbitrary changes to programs etc.), this can easily get into 5-10s. This means at least a couple seconds of waiting for the user hitting the ‘compile’ button, which is not great. Moreover, this is at least 2 to 3 orders of magnitude more than the CPU time of, say, saving your script in the cloud or searching for published scripts, forcing us to limit the number of users in the system and throttle the rate of compilation requests per user.
However, there is a reason there was no Python on BBC Micro.
The Python programs are transferred as text to the device, compiled there into bytecode, and then executed with the help of a runtime utilizing a garbage collector. I’m super-impressed the MicroPython folks managed to fit all of that in the 16KB of RAM available on the board after disabling BLE! A more space-efficient approach, allowing for bigger programs, would be to generate the bytecode in the browser and put it in the flash memory (of which there is “plenty” – 256KB). This would however jeopardize the dynamic features of Python.
But what if we had a simple language that could be compiled to bytecode and didn’t have dynamic features? Oh wait!
 Staples of debugging: serial console and a hex table.

The Bytecode

There are many different kinds of bytecode. The most popular are Java bytecode, .NET IL, and Python bytecode. In fact, it is quite easy to roll your own.
Typically, the bytecode is executed on some sort of a stack machine. Operations (say, addition of two numbers, or a function call) pop arguments from the stack and push the results there. It is then quite easy to translate the source language. For example, to translate e0 + e1, where e0 and e1 are arbitrary expressions, you first translate e0, which at execution time will result in the value of e0 being pushed on the stack, then translate e1, and then emit an ADD instruction, which will push the result of the addition on the stack. The primitive expressions, like number literals or local variable references, would usually have a specialized instruction.
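To make the scheme concrete, here is a minimal sketch in TypeScript – the expression type, instruction names, and helper functions are all invented for illustration, not the actual Touch Develop compiler:

```typescript
// Hypothetical expression tree and instruction set, just to illustrate
// the post-order translation scheme described above.
type Expr =
  | { kind: "num"; value: number }
  | { kind: "add"; left: Expr; right: Expr };

type Instr =
  | { op: "PUSH"; value: number } // push a number literal
  | { op: "ADD" };                // pop two values, push their sum

// Post-order walk: code for e0, then e1, then the operation itself.
function compile(e: Expr, out: Instr[] = []): Instr[] {
  switch (e.kind) {
    case "num":
      out.push({ op: "PUSH", value: e.value });
      break;
    case "add":
      compile(e.left, out);    // value of e0 ends up on the stack...
      compile(e.right, out);   // ...then the value of e1...
      out.push({ op: "ADD" }); // ...and ADD replaces both with the sum
      break;
  }
  return out;
}

// A matching stack-machine evaluator.
function run(prog: Instr[]): number {
  const stack: number[] = [];
  for (const ins of prog) {
    if (ins.op === "PUSH") stack.push(ins.value);
    else {
      const b = stack.pop()!, a = stack.pop()!;
      stack.push(a + b);
    }
  }
  return stack.pop()!;
}
```

Compiling (1 + 2) + 3 yields PUSH 1, PUSH 2, ADD, PUSH 3, ADD, and running that program leaves 6 on the stack.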
Now, my bytecode for Touch Develop to be run on the micro:bit was exceptionally simple. There were about 25 instructions, some of them coming in a few variants. You can take a look at the bytecode interpreter on GitHub.
You’ll note a few things:
  • there are in fact no ADD or MULTIPLY instructions; all of these are handled through the CALLFUNC instruction, which calls into the runtime
  • there are a number of instructions like CALL<N>FUNC and CALL<N>PROC - these are for calling functions (which return a value) or procedures (which don’t) with different numbers of arguments
  • there are for example LDLOC and LDLOCREF instructions – the first one loads a local variable from a numbered local, and the second one additionally increments the reference count of the value pointed to by the local
  • the REFMASK is also about maintaining reference counts after the call
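The shape of such a dispatch loop can be sketched as follows – the real interpreter is C++ running on the device, and the operand encoding below (one number per opcode or operand) is invented for illustration, but the CALLFUNC-into-the-runtime idea is the one described above:

```typescript
// Toy model of the bytecode dispatch loop.
enum Op { LDLOC, LDCONST, CALL2FUNC, RET }

// The "runtime": CALL2FUNC indexes into this table instead of the
// bytecode having dedicated ADD/MULTIPLY instructions.
const runtime: ((a: number, b: number) => number)[] = [
  (a, b) => a + b,
  (a, b) => a * b,
];

function exec(code: number[], locals: number[]): number {
  const stack: number[] = [];
  let pc = 0;
  for (;;) {
    switch (code[pc++]) {
      case Op.LDLOC:     // push a numbered local
        stack.push(locals[code[pc++]]);
        break;
      case Op.LDCONST:   // push an inline constant
        stack.push(code[pc++]);
        break;
      case Op.CALL2FUNC: { // call a 2-argument runtime function
        const f = runtime[code[pc++]];
        const b = stack.pop()!, a = stack.pop()!;
        stack.push(f(a, b));
        break;
      }
      case Op.RET:
        return stack.pop()!;
    }
  }
}
```

With `locals = [5]`, the program `LDLOC 0, LDCONST 10, CALL2FUNC 0, RET` computes 5 + 10 by calling into the runtime table.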
The actual translator from Touch Develop to this bytecode turned out to be quite simple – just over 1000 lines of TypeScript. It uses a JSON meta-data file generated from the runtime source, which lists the addresses of the runtime functions to call and their stack behavior. This JSON file also contains the base .hex file with the pre-compiled bytecode interpreter, which is then patched up with the actual bytecode to execute. This all happens in the browser, just like for MicroPython, and is just as fast.
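The patching step relies on the Intel HEX format being line-oriented: each record is `:`, a byte count, a 16-bit address, a record type, the data bytes, and a checksum (the two’s complement of the sum of all preceding bytes). The helper below is a simplified sketch of rewriting the payload of a single record and fixing its checksum; the real patcher locates the right records in the base .hex first, which is omitted here:

```typescript
// Rewrite the data bytes of one Intel HEX record and recompute its
// checksum. Simplified: assumes a well-formed record and same-length data.
function patchHexRecord(record: string, data: number[]): string {
  const len = parseInt(record.slice(1, 3), 16);
  if (data.length !== len) throw new Error("length mismatch");
  const bytes: number[] = [];
  // byte count, 16-bit address, and record type stay as they are
  for (let i = 1; i < 9; i += 2) bytes.push(parseInt(record.slice(i, i + 2), 16));
  bytes.push(...data); // the new payload
  const sum = bytes.reduce((a, b) => a + b, 0);
  bytes.push((-sum) & 0xff); // two's-complement checksum byte
  return ":" + bytes.map(b => b.toString(16).toUpperCase().padStart(2, "0")).join("");
}
```

Patching a record with its own payload reproduces it exactly, checksum included, which makes the checksum rule easy to verify.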

The Counting

Now, as for ref-counting, I have implemented String, Collection, user-defined Record, and user-defined inline Action (aka higher-order function) as ref-counted classes.
Collection comes in two flavors, depending on whether it’s a collection of ref-counted or primitive (Number/Boolean) objects. When a ref-counted element is loaded from a collection and pushed on the stack, its ref-count has to be incremented. When an element is overwritten, the ref-count of the old element has to be decremented. Finally, when the collection is deleted, the ref-counts of all elements have to be decremented.
Records have a number of ref-counted fields first, followed by a number of primitive fields (which means the fields are in a different order than they were in the source code, but it doesn’t really matter). They also have a fixed size.
Actions are similar to records – the fields are local variables captured at the point of creation - but they also pack a pointer to the code to be executed.
If I wasn’t on such a memory-constrained device, it would be possible to have a single type for collections and records, but as things stand, every byte counts, so the classes are optimized for particular usage.

The Space

After getting this all to work, I went on to write the infamous recursive fib() function to benchmark the thing. BTW, it was a bit of a pointless exercise, because people are rather unlikely to run machine-learning kinds of programs on the micro:bit. Anyway, I noticed two things:
  • the bytecode was instantaneous to generate, but much (say 30x) slower to execute than the optimized C++ code (still, it was a couple of times faster than MicroPython, which is expected given the dynamic – in the sense of typing – nature of Python and the completely static nature of my bytecode)
  • I could fit maybe 20-50% of the stack frames that C++ was able to fit
In fact, I wasn’t so concerned about the speed (even though it was a bit embarrassing), as much as about the memory. The micro:bit runs at 16MHz and has 16KB of RAM. It means it can fill the entire memory several thousand times in a second. Your computer needs several seconds to fill its entire memory. This is at least 4 orders of magnitude of difference. It also means that if you have to trade-off optimizing space vs time you always go for optimizing space.
This is a real problem – for example if you have too many strings in your program you run out of memory. If you start too many threads (we have co-operative multi-threading, based on light-weight fibers, thanks to Joe), you run out of memory. If you go too deep in recursion, you run out of memory. It’s sort of unpredictable and difficult to explain to someone who’s just starting to program.
 Sample program and its Thumb assembly translation.

The Machine

The main reason for the higher memory usage is that I was generating a C++ stack frame for every bytecode function, and the C++ stack frame had a number of C++ local variables (program counter, an array for TD locals, some auxiliary stuff, etc.). These stack frames were much bigger than strictly needed, but I could only optimize their size if I had access to very low-level code.
Now, given that I only had 25 instructions (give or take), how hard could it be to generate the machine code directly, and have all the control I could possibly want (and, unfortunately, more)? I know this sounds a bit like “famous last words”…
The micro:bit runs an ARM Cortex-M0 core, which supports the ARM Thumb instruction set. A reference card for these is only 3 pages long. There are maybe 50 instructions, and about 100 different instruction encodings (many instructions come in variants depending on whether they operate on a register or on an immediate, etc.). This is remarkably small for a modern architecture, but still quite a few bits to get right.
Luckily, it turned out that I only needed 12 different instructions to compile Touch Develop. 11 of these are 16 bits long and one is 32 bits long. Still, it was a very challenging 208 bits.
On the plus side:
  • I already had a working and tested compiler from Touch Develop -> bytecode; the native compiler was an almost 1:1 translation
  • I already had an elaborate test script with a hundred or so asserts making sure things were right
  • the Thumb instruction encoding is quite regular (almost all instructions are 16 bits; most use the same pattern to encode registers/offsets etc.), and well documented in “ARMv6-M Architecture Reference Manual”
  • with one little driver you can get a serial console for printf() debugging
Now, as good as printf() debugging is, the lack of a real debugger means that finding these took some time:
  • the only 32-bit instruction of interest, BL (branch with link, or jump to subroutine for the rest of us), has somewhat crazy encoding
  • the PC offsets are with respect to the address of the current instruction + 4 (I would have really expected either + 0, or + 2, as this is the address of the next instruction); this is all good provided you know about it – I didn’t
  • M0 only does aligned memory access; I knew that, but you still sometimes run into it
  • the function pointers need to have the lowest bit set; this is to mark that they are in Thumb mode (as opposed to full ARM mode, which isn’t present on M0); you need to take that into account when constructing them
  • the string literals from C++ programs are not aligned
  • gcc generates legacy assembly syntax (supposedly gcc 5.0 has an option to disable that, but yotta uses 4.9); as, the GNU assembler, expects the legacy syntax; objdump uses the new unified syntax
  • gcc passes structs by pointer, never by value, despite the ARM calling convention mandating passing structs in registers; the ARM compiler by default does the same, but has an annotation to change that
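To give a taste of the “somewhat crazy” BL encoding: per the ARMv6-M manual, the 25-bit signed offset S:I1:I2:imm10:imm11:0 is split across two halfwords, with J1 = NOT(I1) XOR S and J2 = NOT(I2) XOR S. A sketch of an encoder (my own illustration, not the code from the compiler):

```typescript
// Encode BL with a byte offset relative to PC, i.e. relative to the
// address of the BL instruction + 4 (the quirk mentioned above).
function encodeBL(offset: number): [number, number] {
  if (offset & 1) throw new Error("offset must be halfword-aligned");
  const imm = (offset >> 1) & 0xffffff; // 24-bit field (implied low 0 bit)
  const s = (imm >> 23) & 1;
  const i1 = (imm >> 22) & 1;
  const i2 = (imm >> 21) & 1;
  const j1 = (~i1 & 1) ^ s; // J1 = NOT(I1) XOR S
  const j2 = (~i2 & 1) ^ s; // J2 = NOT(I2) XOR S
  const imm10 = (imm >> 11) & 0x3ff;
  const imm11 = imm & 0x7ff;
  const hw1 = 0xf000 | (s << 10) | imm10;          // 11110 S imm10
  const hw2 = 0xd000 | (j1 << 13) | (j2 << 11) | imm11; // 11 J1 1 J2 imm11
  return [hw1, hw2];
}
```

Two sanity checks: a BL with offset 0 (branch to PC) encodes as F000 F800, and a BL with offset -4 (branch to itself) encodes as the familiar F7FF FFFE.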
Anyway, in the end, the native code generator was only about 50% bigger than the bytecode one.

The Assembler

My code generator was producing machine code directly. I was also producing comments that looked like assembly, but the primary output was a bunch of numbers. Other compilers typically generate textual assembly code, which is later converted into machine code by a separate tool. This allows for some separation of concerns – making both the compiler and assembler easier to modify - as well as inline assembly embedded in the source code (the compiler just copies a string from the source language to the assembly).
Just like with the machine code, I first thought about it, and decided it was too much work. And then the “how difficult can it be?” part came.
It turned out that:
  • there are only 6 ways to encode register numbers, 8 ways to encode immediates and 3 ways of encoding register lists; it is actually quite reasonable
  • there are about 100 instruction encodings, but you can pretty much copy these verbatim from the architecture manual with a little help of a hex table
  • you don’t have to do much to parse assembly – it’s line-oriented; you split the line into words and there are only a few forms that are allowed
  • it is smart to check, double-check, and triple check the encoding of the 12 instructions that the compiler actually uses
  • .align N means to align to 1 << N bytes
The way the assembler works is by splitting the input into lines, and for every line:
  • record a label if any
  • see if the line is a directive, say .balign 16 - if so do the special processing
  • otherwise, try to match the line against the table of instruction encoders; if there is a match, generate the 16-bit encoding, and otherwise display the near hits to the user, so they can learn which instructions can be used
It does two passes over input – in the second pass the labels are all supposed to be resolved. It is not the most efficient way of doing this, but with the ~64K limit on the size of the output it doesn’t actually matter.
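The whole pipeline fits in a few dozen lines when shrunk to a toy. The sketch below (my own miniature, handling just two made-up instruction forms rather than the real encoder table) shows the line splitting, the label recording, and the two passes – in pass 1 a branch target may be unknown, so a placeholder offset is emitted and only the addresses matter; pass 2 produces the final halfwords:

```typescript
// Two-pass mini-assembler for a 2-instruction subset of Thumb.
function assemble(src: string): number[] {
  const labels = new Map<string, number>();
  const encode = (words: string[], addr: number, final: boolean): number => {
    const [op, ...args] = words;
    if (op === "movs")           // movs rN, #imm  ->  00100 nnn iiiiiiii
      return 0x2000 | (parseInt(args[0].slice(1)) << 8) |
             parseInt(args[1].replace(/[#,]/g, ""));
    if (op === "b") {            // b label  ->  11100 + 11-bit halfword offset
      const target = final ? labels.get(args[0])! : addr; // placeholder in pass 1
      return 0xe000 | (((target - (addr + 4)) >> 1) & 0x7ff);
    }
    throw new Error("unknown instruction: " + op);
  };
  const lines = src.split("\n").map(l => l.trim()).filter(l => l);
  for (let pass = 1; pass <= 2; pass++) {
    const out: number[] = [];
    let addr = 0;
    for (const line of lines) {
      if (line.endsWith(":")) { labels.set(line.slice(0, -1), addr); continue; }
      out.push(encode(line.split(/\s+/), addr, pass === 2));
      addr += 2; // both handled instructions are one halfword
    }
    if (pass === 2) return out;
  }
  return [];
}
```

For example, a loop consisting of `movs r0, #5` followed by `b loop` assembles to 0x2005 and 0xE7FD, with the backward branch offset computed relative to PC + 4.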
In the end, the full assembler is about 1000 lines of TypeScript. The code generator didn’t really shrink much, but it’s less tricky to change.

The End

All of this will soon ship on the BBC micro:bit site. The source code is available under the MIT license in the Touch Develop GitHub repo. I guess the assembler in particular might be of use to others, as it has few dependencies on the rest of Touch Develop.