Saturday, January 27, 2007

The Quintessential MIPS Assembly Language Script

This small program demonstrates what you need to get started developing assembly language scripts for the MIPS architecture. You need only two tools: a SPIM simulator (e.g. PCSpim v2.03) and a text editor (e.g. Notepad).

This script demonstrates:
  1. How to declare integer and string variables, and arrays.
  2. How to manipulate arrays.
  3. Using the la (load address), lw (load word), sw (store word), add (addition), addi (add immediate), and sub (subtract) instructions.
  4. How to print strings and integers to the console.
  5. How to organize a basic assembly language script.
  6. How to exit the program.
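
A minimal sketch of a program covering these six points might look like the following. It is not the original script: the label 'msg' and the particular arithmetic are assumptions chosen for illustration, while 'num', 'array', and the "Result: " string follow the data described in this post. It also uses the li and move pseudo-instructions, which SPIM accepts with its default settings.

```mips
# Illustrative sketch only; assumes SPIM/PCSpim with pseudo-instructions enabled.
        .data
num:    .word 7                  # integer variable
array:  .word 2, 4, 6, 8, 10     # five-element integer array
msg:    .asciiz "Result: "       # null-terminated string (label name is an assumption)

        .text
        .globl main
main:
        lw   $t0, num            # $t0 = 7 (value of num)
        la   $s0, array          # $s0 = address of array[0]
        lw   $t1, 0($s0)         # $t1 = array[0] = 2
        lw   $t2, 4($s0)         # $t2 = array[1] = 4

        add  $t3, $t0, $t1       # $t3 = 7 + 2 = 9
        addi $t3, $t3, 1         # $t3 = 9 + 1 = 10
        sub  $t3, $t3, $t2       # $t3 = 10 - 4 = 6
        sw   $t3, 16($s0)        # array[4] = 6 (overwrites the 10)

        li   $v0, 4              # syscall 4: print string
        la   $a0, msg            # $a0 = address of the string
        syscall

        li   $v0, 1              # syscall 1: print integer
        move $a0, $t3            # $a0 = value to print
        syscall

        li   $v0, 10             # syscall 10: exit
        syscall
```

Run in the simulator, this should print "Result: 6" on the console and then stop.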
When you load the script into PCSpim, the .data section is not executed; the assembler lays its contents out in the data segment, an area of memory where the program's data is stored. In the script, I declare a variable called 'num' with the .word directive and give it the value 7. An array is declared by listing a sequence of values after .word, and a string by using the .asciiz directive with the message in double quotes. PCSpim shows the data segment for our program like this:
[0x10010000]
0x00000007 0x00000002 0x00000004 0x00000006
[0x10010010]
0x00000008 0x0000000a 0x75736552 0x203a746c
These numbers are in hex. The value in brackets ([0x10010000]) is the memory address of the first word on that line. MIPS is a 32-bit architecture, so data is stored in 32-bit words (thirty-two 1's and 0's). Since 8 bits = 1 byte, 32 bits (1 word) = 4 bytes, and each byte is written as two hex digits. That is why the data appears in groups of eight hex digits: "0x00000007 0x00000002 0x00000004 0x00000006" is four words, or 16 bytes, which is also why the next line starts at 0x10010010.
The first word (0x00000007) is the number 7, the value of the variable 'num' in the script. The second word (0x00000002) is the number 2, which is the first value of the variable named 'array', and so on. The second line, "[0x10010010] 0x00000008 0x0000000a 0x75736552 0x203a746c", starts at address 0x10010010 and contains 8 (0x00000008), 10 (0x0000000a), 'Resu' (0x75736552), and 'lt: ' (0x203a746c). The characters look reversed inside each word (0x52 is 'R', 0x65 is 'e', 0x73 is 's', 0x75 is 'u') because the dump comes from a little-endian machine, where the lowest-addressed byte is the rightmost one in the printed word. Notice that all data is loaded sequentially into memory in the order it was written in the script.
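
To see the byte order for yourself, a small sketch like the one below loads the string's first byte and its first whole word and prints both as integers. The labels 'msg' and 'nl' are assumptions, and the data is declared so that it matches the dump above. SPIM takes its byte order from the host machine, so the second number depends on where you run it.

```mips
# Illustrative sketch only; labels msg and nl are hypothetical.
        .data
num:    .word 7
array:  .word 2, 4, 6, 8, 10
msg:    .asciiz "Result: "       # starts at 0x10010018, right after the six words
nl:     .asciiz "\n"

        .text
        .globl main
main:
        la   $s0, msg            # $s0 = address of the string's first byte
        lb   $t0, 0($s0)         # lowest-addressed byte: 'R' = 0x52
        lw   $t1, 0($s0)         # the first four characters as one word

        li   $v0, 1              # syscall 1: print integer
        move $a0, $t0            # the byte value
        syscall

        li   $v0, 4              # syscall 4: print string
        la   $a0, nl             # newline between the two numbers
        syscall

        li   $v0, 1              # syscall 1: print integer
        move $a0, $t1            # the word value
        syscall

        li   $v0, 10             # syscall 10: exit
        syscall
```

The first line printed is 82 (0x52, the letter 'R'). On a little-endian host, which is what the dump above reflects, the second line is 1970496850, the decimal form of 0x75736552.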

Other observations:
  • la of a variable loads the address of its first element into a register. lw of a variable such as 'array' loads the value of the first element instead.
  • Array elements are accessed in 4-byte increments (4 bytes = 32 bits = 1 word). So if the array's address is loaded in register $s0, the first element's value is at 0($s0), the second at 4($s0), the third at 8($s0), and so on.
  • In the above example, 'num' occupies the word at 0x10010000, so the first array element (2) is at 0x10010000 + 4 bytes = 0x10010004, and the next one (4) is at 0x10010008. That's basic address arithmetic for you.
  • Each character is stored as 8 bits, so a word (32 bits) holds only 4 characters. For example, 'Results' would have to be split into 'Resu' and 'lts' and stored in two words; with .asciiz, the terminating null byte fills the last byte of the second word.
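
The 4-byte stride generalizes to any index: element i lives at the base address plus i * 4. The sketch below walks the array that way and sums it. It goes beyond the instruction list in this post by using sll (shift left by 2, i.e. multiply by 4), beq, and j, and the labels 'msg', 'loop', and 'done' are assumptions. Because 'array' is the first item in this sketch's data segment, it starts at 0x10010000 here rather than 0x10010004.

```mips
# Illustrative sketch only; assumes SPIM defaults (no delayed branches).
        .data
array:  .word 2, 4, 6, 8, 10
msg:    .asciiz "Sum: "          # hypothetical label

        .text
        .globl main
main:
        la   $s0, array          # $s0 = base address of the array
        li   $t0, 0              # i = 0
        li   $t1, 0              # sum = 0
        li   $t2, 5              # number of elements
loop:
        beq  $t0, $t2, done      # stop once i == 5
        sll  $t3, $t0, 2         # byte offset = i * 4
        add  $t3, $s0, $t3       # address of array[i]
        lw   $t4, 0($t3)         # $t4 = array[i]
        add  $t1, $t1, $t4       # sum += array[i]
        addi $t0, $t0, 1         # i++
        j    loop
done:
        li   $v0, 4              # syscall 4: print string
        la   $a0, msg
        syscall

        li   $v0, 1              # syscall 1: print integer
        move $a0, $t1            # the sum
        syscall

        li   $v0, 10             # syscall 10: exit
        syscall
```

This should print "Sum: 30", since 2 + 4 + 6 + 8 + 10 = 30.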
Why the heck would anyone learn assembly language? If you are a serious programmer, you want to know that assembly is the last human-readable stage your C code (or, after JIT compilation, your Java code) passes through before it is assembled into machine language (binary). Learning the efficiencies that can be achieved at this stage may help with compiler design and debugging. I like it for my endeavors in reverse engineering: this is where you learn what a program really does when you don't have its source code or know the language it was written in.