Assembly Languages
COMS W4995-02
Prof. Stephen A. Edwards
Fall 2002
Columbia University
Department of Computer Science

Assembly Languages
One step up from machine language
Originally a more user-friendly way to program
Now mostly a compiler target
Model of computation: stored program computer

Assembly Language Model

            ...
            add r1,r2
            sub r2,r3
            cmp r3,r4
    PC →    bne I1
            sub r4,1
        I1: jmp I3
            ...

    ALU ↔ Registers ↔ Memory

Assembly Language Instructions
Built from two pieces:

    add R1, R3, 3

Opcode (add): what to do with the data
Operands (R1, R3, 3): where to get the data

Types of Opcodes
Arithmetic, logical
  • add, sub, mult
  • and, or
  • cmp
Memory load/store
  • ld, st
Control transfer
  • jmp
  • bne
Complex
  • movs

Operands
Each operand taken from a particular addressing mode:

    Register      add r1, r2, r3
    Immediate     add r1, r2, 10
    Indirect      mov r1, (r2)
    Offset        mov r1, 10(r3)
    PC Relative   beq 100

Reflect processor data pathways

Types of Assembly Languages
Assembly language closely tied to processor architecture
At least four main types:
  CISC: Complex Instruction-Set Computer
  RISC: Reduced Instruction-Set Computer
  DSP: Digital Signal Processor
  VLIW: Very Long Instruction Word

CISC Assembly Language
Developed when people wrote assembly language
Complicated, often specialized instructions with many effects
Examples from x86 architecture
  • String move
  • Procedure enter, leave
Many, complicated addressing modes
So complicated, often executed by a little program (microcode)
Examples: Intel x86, 68000, PDP-11

RISC Assembly Language
Response to growing use of compilers
Easier-to-target, uniform instruction sets
"Make the most common operations as fast as possible"
Load-store architecture:
  • Arithmetic only performed on registers
  • Memory load/store instructions for memory-register transfers
Designed to be pipelined
Examples: SPARC, MIPS, HP-PA, PowerPC

DSP Assembly Language
Digital signal processors designed specifically for signal processing algorithms
Lots of regular arithmetic on vectors
Often written by hand
Irregular architectures to save power, area
Substantial instruction-level parallelism
Examples: TI 320, Motorola 56000, Analog Devices

VLIW Assembly Language
Response to growing desire for instruction-level parallelism
Using more transistors cheaper than running them faster
Many parallel ALUs
Objective: keep them all busy all the time
Heavily pipelined
More regular instruction set
Very difficult to program by hand
Looks like parallel RISC instructions
Examples: Itanium, TI 320C6000

Example: Euclid's Algorithm

    int gcd(int m, int n)
    {
      int r;
      while ((r = m % n) != 0) {
        m = n;
        n = r;
      }
      return n;
    }

i386 Programmer's Model

    General registers (bits 31-0)
    eax, ebx, ecx, edx   Mostly general-purpose registers
    esi                  Source index
    edi                  Destination index
    ebp                  Base pointer
    esp                  Stack pointer
    eflags               Status word
    eip                  Instruction pointer

    Segment registers (bits 15-0)
    cs   Code segment
    ds   Data segment
    ss   Stack segment
    es   Extra segment
    fs   Data segment
    gs   Data segment

Euclid on the i386

            .file "euclid.c"        # Boilerplate
            .version "01.01"
    gcc2_compiled.:
            .text                   # Executable
            .align 4                # Start on 16-byte boundary
            .globl gcd              # Make "gcd" linker-visible
            .type gcd,@function
    gcd:
            pushl %ebp
            movl %esp,%ebp
            pushl %ebx
            movl 8(%ebp),%eax
            movl 12(%ebp),%ecx
            jmp .L6                 # Jump to local label .L6
            .p2align 4,,7           # Skip ≤ 7 bytes to a multiple of 16
    .L4:
            movl %ecx,%eax          # m = n
            movl %ebx,%ecx          # n = r
    .L6:
            cltd                    # Sign-extend eax to edx:eax
            idivl %ecx              # Compute edx:eax / ecx
            movl %edx,%ebx
            testl %edx,%edx         # AND of edx and edx
            jne .L4                 # Branch if edx was ≠ 0
            movl %ecx,%eax          # Return n
            movl -4(%ebp),%ebx
            leave                   # Move ebp to esp, pop ebp
            ret                     # Pop return address and branch

    Stack Before Call
               n         8(%esp)
               m         4(%esp)
    %esp →     R.A.      0(%esp)

    Stack After Entry
               n         12(%ebp)
               m          8(%ebp)
               R.A.       4(%ebp)
    %ebp →     old ebp    0(%ebp)
    %esp →     old ebx   −4(%ebp)

SPARC Programmer's Model

    Registers (bits 31-0)
    r0                   Always 0
    r1 ... r7            Global registers
    r8/o0 ... r15/o7     Output registers (r14/o6 is the stack pointer)
    r16/l0 ... r23/l7    Local registers
    r24/i0 ... r31/i7    Input registers (r30/i6 is the frame pointer,
                         r31/i7 is the return address)
    PSW                  Status word
    PC                   Program counter
    nPC                  Next PC

SPARC Register Windows
The output registers of the calling procedure become the inputs to the called procedure
The global registers remain unchanged
The local registers are not visible across procedures
[Figure: three overlapping register windows; each window's r8/o0 ... r15/o7 is the same set of registers as the next window's r24/i0 ... r31/i7]

Euclid on the SPARC

            .file "euclid.c"        # Boilerplate
    gcc2_compiled.:
            .global .rem            # Make .rem linker-visible
            .section ".text"        # Executable code
            .align 4
            .global gcd             # Make gcd linker-visible
            .type gcd, #function
            .proc 04
    gcd:
            save %sp, -112, %sp     # Next window, move SP
            mov %i0, %o1            # Move m into o1
            b .LL3                  # Unconditional branch
            mov %i1, %i0            # Move n into i0
    .LL5:
            mov %o0, %i0            # n = r
    .LL3:
            mov %o1, %o0            # Compute the remainder of
            call .rem, 0            # m / n, result in o0
            mov %i0, %o1
            cmp %o0, 0
            bne .LL5
            mov %i0, %o1            # m = n (always executed)
            ret                     # Return (actually jmp i7 + 8)
            restore                 # Restore previous window

Digital Signal Processor Apps.
Low-cost embedded systems
  • Modems, cellular telephones, disk drives, printers
High-throughput applications
  • Halftoning, base stations, 3-D sonar, tomography
PC based multimedia
  • Compression/decompression of audio, graphics, video

Embedded Processor Requirements
Inexpensive with small area and volume
Deterministic interrupt service routine latency
Low power: ≈50 mW (TMS320C54x uses 0.36 µA/MIPS)

Conventional DSP Architecture
Harvard architecture
  • Separate data memory/bus and program memory/bus
  • Three reads and one or two writes per instruction cycle
Deterministic interrupt service routine latency
Multiply-accumulate in single instruction cycle
Special addressing modes supported in hardware
  • Modulo addressing for circular buffers for FIR filters
  • Bit-reversed addressing for fast Fourier transforms
Instructions to keep the pipeline (3-4 stages) full
  • Zero-overhead looping (one pipeline flush to set up)
  • Delayed branches

Conventional DSPs

                     Fixed-Point              Floating-Point
    Cost/Unit        $5–$79                   $5–$381
    Architecture     Accumulator              load-store
    Registers        2–4 data, 8 address      8–16 data, 8–16 address
    Data Words       16 or 24 bit             32 bit
    Chip Memory      2–64K data+program       8–64K data+program
    Address Space    16–128K data,            16M–4G data,
                     16–64K program           16M–4G program
    Compilers        Bad C                    Better C, C++
    Examples         TI TMS320C5x,            TI TMS320C3x,
                     Motorola 56000           Analog Devices SHARC

Market share: 95% fixed-point, 5% floating-point
Each processor comes in dozens of configurations
  • Data and program memory size
  • Peripherals: A/D, D/A, serial, parallel ports, timers
Drawbacks
  • No byte addressing (needed for image and video)
  • Limited on-chip memory
  • Limited addressable memory on most fixed-point DSPs
  • Non-standard C extensions to support fixed-point data

Example
Finite Impulse Response filter (FIR)
Can be used for lowpass, highpass, bandpass, etc.
Basic DSP operation
For each sample, computes

    $y_n = \sum_{i=0}^{k} a_i x_{n+i}$

where
  $a_0, \ldots, a_k$ are filter coefficients,
  $x_n$ is the nth input sample,
  $y_n$ is the nth output sample.
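
A minimal C sketch of the FIR sum above, assuming 16-bit samples and coefficients with a 32-bit accumulator. That is a common fixed-point arrangement, not something the slides specify, and the names fir_sample and fir_block are hypothetical. The inner loop body is the multiply-accumulate step that a conventional DSP performs in a single instruction cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes y[n] = sum over i = 0..k of a[i] * x[n + i].
 * Assumption: k is small enough that the 32-bit accumulator
 * cannot overflow for the data being filtered. */
int32_t fir_sample(const int16_t *a, size_t k, const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i <= k; i++)
        acc += (int32_t)a[i] * x[n + i];   /* one multiply-accumulate per tap */
    return acc;
}

/* Filters a block of len input samples and writes len - k outputs to y.
 * Returns the number of outputs written (0 if the block is too short). */
size_t fir_block(const int16_t *a, size_t k,
                 const int16_t *x, size_t len, int32_t *y)
{
    if (len <= k)
        return 0;
    size_t count = len - k;
    for (size_t n = 0; n < count; n++)
        y[n] = fir_sample(a, k, x, n);
    return count;
}
```

With k = 2 there are three taps, so a block of five samples yields three outputs. Each output reads k + 1 consecutive samples, which is why DSPs provide modulo addressing: the sample window can live in a circular buffer instead of being shifted.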
56000 Programmer's Model
[Figure: register set]
Source registers: x1, x0, y1, y0 (24 bits each, paired as 48-bit x and y)
Accumulators: a = a2:a1:a0 and b = b2:b1:b0 (bits 55-48, 47-24, 23-0)
Address registers: pointers r0 ... r7, offsets n0 ... n7, modifiers m0 ... m7 (16 bits each)
Program Counter, Status Register, Loop Address, Loop Count (16 bits each)
PC Stack, SR Stack, Stack pointer

56001 Memory Spaces
Three memory regions, each 64K:
  • 24-bit Program memory
  • 24-bit X data memory
  • 24-bit Y data memory
Idea: enable simultaneous access of program, sample, and coefficient memory
Three on-chip memory spaces can be used this way
One off-chip memory pathway connected to all three memory spaces
Only one off-chip access per cycle maximum

56001 Address Generation
Addresses come from pointer register r0 ... r7
Offset registers n0 ... n7 can be added to pointer
Modifier registers cause the address to wrap around
Zero modifier causes reverse-carry arithmetic

    Address    Notation    Next value of r0
    r0         (r0)        r0
    r0 + n0    (r0+n0)     r0
    r0         (r0)+       (r0 + 1) mod m0
    r0 - 1     -(r0)       (r0 - 1) mod m0
    r0         (r0)-       (r0 - 1) mod m0
    r0         (r0)+n0     (r0 + n0) mod m0
    r0         (r0)-n0     (r0 - n0) mod m0

FIR Filter in 56001

    n        equ 20             # Define symbolic constants
    start    equ $40
    samples  equ $0
    coeffs   equ $0
    input    equ $ffe0          # Memory-mapped I/O
    output   equ $ffe1

             org p:start        # Locate in prog. memory
             move #samples, r0  # Pointers to samples
             move #coeffs, r4   # and coefficients
             move #n-1, m0      # Prepare circular buffer
             move m0, m4

             movep y:input, x:(r0)      # Load sample into memory

             # Clear accumulator A, load a sample into x0,
             # load a coefficient
             clr a          x:(r0)+, x0   y:(r4)+, y0

             rep #n-1                   # Repeat next instruction n-1 times

             # a = x0 × y0, next sample, next coefficient
             mac x0,y0,a    x:(r0)+, x0   y:(r4)+, y0

             macr x0,y0,a   (r0)-
             movep a, y:output          # Write output sample

TI TMS320C6000 VLIW DSP
Eight instruction units dispatched by one very long instruction word
Designed for DSP applications
Orthogonal instruction set
Big, uniform register file (16 32-bit registers)
Better compiler target than 56001
Deeply pipelined (up to 15 levels)
Complicated, but more regular, datapath

Pipelining on the C6
One instruction issued per clock cycle
Very deep pipeline
  • 4 fetch cycles
  • 2 decode cycles
  • 1-10 execute cycles
Branch in pipeline disables interrupts
Conditional instructions avoid branch-induced stalls
No hardware to protect against hazards
  • Assembler or compiler's responsibility

FIR in One 'C6 Assembly Instruction

    FIRLOOP:
            LDH  .D1   *A1++, A2      ; Fetch next sample
    ||      LDH  .D2   *B1++, B2      ; Fetch next coeff.
    || [B0] SUB  .L2   B0, 1, B0      ; Decrement count
    || [B0] B    .S2   FIRLOOP        ; Branch if non-zero
    ||      MPY  .M1X  A2, B2, A3     ; Sample × Coeff.
    ||      ADD  .L1   A4, A3, A4     ; Accumulate result

    LDH         Load a halfword (16 bits)
    .D1         Do this on unit D1
    X in .M1X   Use the cross path
    [B0]        Predicated instruction (only if B0 non-zero)
    ||          Run these instructions in parallel

Peripherals
Often the whole point of the system
Memory-mapped I/O
  • Magical memory locations that make something happen or change on their own
Typical meanings:
  • Configuration (write)
  • Status (read)
  • Address/Data (access more peripheral state)
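
A minimal C sketch of memory-mapped I/O as described on the Peripherals slide. The register layout, the bit meanings, and the names periph_regs, periph_enable, and periph_try_read are all hypothetical, invented for illustration and not taken from any real device. On real hardware the pointer would be a fixed address (such as the 56001's input location $ffe0); here it can point at an ordinary struct so the logic can be exercised on a host machine.

```c
#include <stdint.h>

/* Hypothetical peripheral: one register per "typical meaning". */
typedef struct {
    volatile uint32_t config;   /* Configuration (write) */
    volatile uint32_t status;   /* Status (read) */
    volatile uint32_t data;     /* Data (access peripheral state) */
} periph_regs;

#define CONFIG_ENABLE 0x1u      /* assumed "turn the device on" bit */
#define STATUS_READY  0x1u      /* assumed "a sample is available" bit */

/* Writing the configuration register makes something happen. */
void periph_enable(periph_regs *p)
{
    p->config |= CONFIG_ENABLE;
}

/* Reads one sample if the device reports it is ready.
 * Returns 1 and stores the sample in *out, or returns 0 and leaves
 * *out untouched. The volatile qualifier forces a fresh read of
 * status each call, since the value can change on its own. */
int periph_try_read(periph_regs *p, uint32_t *out)
{
    if ((p->status & STATUS_READY) == 0)
        return 0;
    *out = p->data;
    return 1;
}
```

Declaring the fields volatile is the design choice that matters: without it, a compiler may assume memory only changes when the program writes it and cache or drop the status read.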