The Resurgence of Parallelism

There’s an article in the CACM trying to restore historical context to research in parallel computation: The Resurgence of Parallelism. (One of) the answer(s), it seems, is functional programming:

Parallelism is not new; the realization that it is essential for continued progress in high-performance computing is. Parallelism is not yet a paradigm, but may become so if enough people adopt it as the standard practice and standard way of thinking about computation.

The new era of research in parallel processing can benefit from the results of the extensive research in the 1960s and 1970s, avoiding rediscovery of ideas already documented in the literature: shared memory multiprocessing, determinacy, functional programming, and virtual memory.

Via LtU.

Nested functions in GCC

GCC supports “nested functions” using the -fnested-functions flag. When I first saw this, I was excited: closures in C! In the famous words of Admiral Ackbar, “it’s a trap!”

#include <stdio.h>

typedef int (*fptr)(int);

fptr f(int arg) {
  int nested_function(int nested_arg) {
    return arg + nested_arg;
  }

  return &nested_function;
}

void smash(int arg) {
  return;
}

int main(void) {
  fptr g = f(10);
  printf("%d\n", (*g)(5));
  smash(12);
  // printf("%d\n", (*g)(5));
  fptr h = f(12);
  printf("%d\n", (*g)(5));
  printf("%d\n", (*h)(5));

  return 0;
}

Try compiling (gcc -fnested-functions). What does the second call to g produce—15 or 17? Try uncommenting the commented-out call through g. What happens? Does commenting out the call to smash affect this? What if the first printf is commented out, but the calls to smash and the previously commented-out printf are both enabled?

I’m not sure this feature is worth it.
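What goes wrong: GCC implements pointers to nested functions as trampolines, small snippets of code written into the stack frame of the enclosing function. Once f returns, that frame is dead, and any later call (here, smash) can overwrite it. For contrast, here is a rough sketch of the same program shape in OCaml, where closures are heap-allocated and happily outlive the call that created them:

```ocaml
(* The analogue of f from the C example: the closure returned by f
   captures arg on the heap, so it stays valid after f returns and
   after later calls reuse the stack. *)
let f arg = fun nested_arg -> arg + nested_arg

let () =
  let g = f 10 in
  Printf.printf "%d\n" (g 5);   (* 15 *)
  let h = f 12 in
  Printf.printf "%d\n" (g 5);   (* still 15, not 17 *)
  Printf.printf "%d\n" (h 5)    (* 17 *)
```

No smashing possible: g and h are independent heap values, not pointers into a stack frame.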

Contracts Made Manifest: final version

We’ve sent off the final version of Contracts Made Manifest. There have been quite a few improvements since submission, the most important of which is captured by Figure 1 from our paper:

The axis of blame

Our submission only addressed lax λC, where we had an inexact translation φ into λH and an exact translation ψ out of λH. We show a dual situation for picky λC, where φ is exact and ψ is inexact. Intuitively, languages farther to the right on the “axis of blame” are pickier. Translating from right to left preserves behavior exactly, but left-to-right translations generate terms that can blame more than their pre-images. (There are examples in the paper.) I should note that lax and picky λC seem to be the extremes of the axis of blame: I don’t see a natural way to be laxer than lax λC or pickier than picky λC.

We also show that restricting these calculi to first-order dependency leaves them exactly equivalent; before, we could only show an exact equivalence by eliminating all dependency.

Locally installing LLVM with OCaml bindings

We can’t install software into the /usr tree at my office, so I end up having local installs of lots of software. Some things, like GODI, play well with this. I had some trouble finding the right way to get LLVM’s OCaml bindings to work, so I figured I’d share the wealth. The following instructions will put an install into the directory $PREFIX/llvm-install.

Here are the steps; they’re followed by a plain English explanation.

cd $PREFIX
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
wget http://llvm.org/releases/2.5/llvm-gcc4.2-2.5-x86-linux-RHEL4.tar.gz
tar xzf llvm-gcc4.2-2.5-x86-linux-RHEL4.tar.gz
mkdir llvm-objects llvm-install
cd llvm-objects
../llvm/configure --with-llvmgccdir=$PREFIX/llvm-gcc4.2-2.5-x86-linux-RHEL4 --enable-optimized --enable-jit --prefix=$PREFIX/llvm-install --with-ocaml-libdir=$GODI_PATH/lib/ocaml/std-lib
make
make install

My PREFIX is my home directory, and GODI_PATH = ~/godi. First, we check out the latest LLVM from SVN (step 2). Then we download and extract the latest release (2.5, as of writing) of LLVM-gcc (steps 3 and 4). (I couldn’t get the SVN version of LLVM-gcc to work with the SVN version of LLVM.) Notably, LLVM does not support in-place builds, so we create the llvm-objects directory to actually build LLVM; we’ll install it into llvm-install (step 5). We configure the software from the llvm-objects directory (steps 6 and 7). The long configure line is necessary; the only optional flag is --enable-jit. You may have to adjust --with-ocaml-libdir to point to wherever your OCaml libraries live. Then make and make install (steps 8 and 9). Voilà!

To test it out, we can use the “Hello, World!” program written by Gordon Henrikson. I had to change it a little to bring it up to date with the latest APIs (in particular, the global context had to be added). You can download it as llvm_test.ml.

open Printf
open Llvm

let main filename =
   let c = create_context () in

   let i8_t  = i8_type c in
   let i32_t = i32_type c in

   let m = create_module c filename in

   (* @greeting = global [14 x i8] c"Hello, world!\00" *)
   let greeting =
     define_global "greeting" (const_string c "Hello, world!\000") m in

   (* declare i32 @puts(i8* ) *)
   let puts =
     declare_function "puts"
       (function_type i32_t [|pointer_type i8_t|]) m in

   (* define i32 @main() { entry: *)
   let main = define_function "main" (function_type i32_t [| |]) m in
   let at_entry = builder_at_end c (entry_block main) in

   (* %tmp = getelementptr [14 x i8]* @greeting, i32 0, i32 0 *)
   let zero = const_int i32_t 0 in
   let str = build_gep greeting [| zero; zero |] "tmp" at_entry in

   (* call i32 @puts( i8* %tmp ) *)
   ignore (build_call puts [| str |] "" at_entry);

   (* ret i32 0 *)
   ignore (build_ret (const_null i32_t) at_entry);

   (* write the module to a file *)
   if not (Llvm_bitwriter.write_bitcode_file m filename) then exit 1;
   dispose_module m

let () = match Sys.argv with
  | [|_; filename|] -> main filename
  | _ -> main "a.out"

Now we can compile:

ocamlopt -cc g++ llvm.cmxa llvm_bitwriter.cmxa llvm_test.ml -o llvm_test
./llvm_test hello.bc # generates bitcode
$PREFIX/llvm-install/bin/llvm-dis hello.bc # disassembles bitcode into hello.ll
$PREFIX/llvm-install/bin/lli hello.bc # outputs "Hello, world!"

If interpretation via lli isn’t your bag, you can also compile to native code:

$PREFIX/llvm-install/bin/llc hello.bc # generates assembly, hello.s
gcc -o hello hello.s
./hello # outputs "Hello, world!"

Flapjax: A Programming Language for Ajax Applications

I am immensely pleased to report that our paper on Flapjax was accepted to OOPSLA 2009.

This paper presents Flapjax, a language designed for contemporary Web applications. These applications communicate with servers and have rich, interactive interfaces. Flapjax provides two key features that simplify writing these applications. First, it provides event streams, a uniform abstraction for communication within a program as well as with external Web services. Second, the language itself is reactive: it automatically tracks data dependencies and propagates updates along those dataflows. This allows developers to write reactive interfaces in a declarative and compositional style.

Flapjax is built on top of JavaScript. It runs on unmodified browsers and readily interoperates with existing JavaScript code. It is usable as either a programming language (that is compiled to JavaScript) or as a JavaScript library, and is designed for both uses. This paper presents the language, its design decisions, and illustrative examples drawn from several working Flapjax applications.

The real heroes of this story are my co-authors. Leo, Arjun, and Greg were there for the initial, heroic-effort-based implementation. Jacob and Aleks wrote incredible applications with our dog food. Shriram, of course, saw the whole thing through. Very few of my contributions remain: the original compiler is gone (thank goodness); my thesis work is discussed briefly in How many DOMs? on page 15. Here’s to a great team and a great experience (and a great language)!

Contracts Made Manifest

Benjamin Pierce, Stephanie Weirich, and I submitted a paper to POPL 2010; it’s about contracts. Here’s the abstract:

Since Findler and Felleisen introduced higher-order contracts, many variants of their system have been proposed. Broadly, these fall into two groups: some follow Findler and Felleisen in using latent contracts, purely dynamic checks that are transparent to the type system; others use manifest contracts, where refinement types record the most recent check that has been applied. These two approaches are generally assumed to be equivalent—different ways of implementing the same idea, one retaining a simple type system, and the other providing more static information. Our goal is to formalize and clarify this folklore understanding.

Our work extends that of Gronski and Flanagan, who defined a latent calculus λC and a manifest calculus λH, gave a translation φ from λC to λH, and proved that if a λC term reduces to a constant, then so does its φ-image. We enrich their account with a translation ψ in the opposite direction and prove an analogous theorem for ψ.

More importantly, we generalize the whole framework to dependent contracts, where the predicates in contracts can mention variables from the local context. This extension is both pragmatically crucial, supporting a much more interesting range of contracts, and theoretically challenging. We define dependent versions of λC (following Findler and Felleisen’s semantics) and λH, establish type soundness—a challenging result in itself, for λH—and extend φ and ψ accordingly. Interestingly, the intuition that the two systems are equivalent appears to break down here: we show that ψ preserves behavior exactly, but that a natural extension of φ to the dependent case will sometimes yield terms that blame more because of a subtle difference in the treatment of dependent function contracts when the codomain contract itself abuses the argument.

Edit on 2009-11-03: there’s a newer version, as will appear in POPL 2010.

Edit on 2010-01-22: I have removed the link to the submission, since it is properly subsumed by our published paper.

Debounce and other callback combinators

It is serendipitous that I noticed a blog post about a callback combinator while adding a few drops to the Flapjax bucket.

Flapjax is nothing more than a coherent set of callback combinators. The key insight behind this set of callback combinators is the “Event” abstraction — a Node in FJ’s implementation. Once callbacks are Nodes, you get two things:

  1. a handle that allows you to multiply operate on a single (time-varying) data source, and
  2. a whole host of useful abstractions for manipulating handles: mergeE, calmE, switchE, etc.

The last time I saw the implementations of Resume and Continue, they were built using this idea. The more I think about it, the more the FJ-language seems like the wrong approach: the FJ-library is an awesome abstraction, in theory and in practice.
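To make the insight concrete, here is a minimal sketch of the Event idea (the names and representation are hypothetical, not Flapjax’s actual implementation): an event is just a handle holding its listeners, and combinators like mergeE fall out in a few lines.

```ocaml
(* A minimal "event node": a handle you can subscribe to many
   times, so one time-varying source can feed many consumers. *)
type 'a event = { mutable listeners : ('a -> unit) list }

let make_event () = { listeners = [] }

let subscribe e f = e.listeners <- f :: e.listeners

(* Fire the event: push the value to every listener. *)
let send e v = List.iter (fun f -> f v) e.listeners

(* A mergeE-style combinator: a new event that fires whenever
   either input fires. *)
let merge_e a b =
  let out = make_event () in
  subscribe a (send out);
  subscribe b (send out);
  out

let () =
  let clicks = make_event () and keys = make_event () in
  let any = merge_e clicks keys in
  subscribe any (fun s -> print_endline ("got " ^ s));
  send clicks "click";
  send keys "key"
```

Once callbacks are reified as handles like this, combinators such as calmE or switchE are just more functions from events to events.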

Practical OCaml

Suppose you were trying to run some experiments about L1 D-caches. (You may also suppose that this is a homework problem, but that’s life.) You’re given a trace of loads and stores at certain addresses. These addresses are 32-bits wide, and the trace is in a textual format:
1A2B3C4D L
DEADBEEF S
1B2B3C4D L
represents a load to 0x1a2b3c4d, followed by a store to 0xdeadbeef, followed by a load to 0x1b2b3c4d. (You might notice the two loads may be in conflict, depending on the block size, cache size, and degree of associativity. In that case, you might be in my computer architecture class…)

This is problematic. Naturally, you’d like to process the trace in OCaml. But did I mention that the trace is rather large — some 600MB uncompressed? And that some of the addresses require all 32 bits? And some of the statistics you need to collect require 32 bits (or more)? OCaml could process the entire trace in under a minute, but the boxing and unboxing of int32s and int64s adds more than twenty minutes (even with -unsafe). I felt bad about this until a classmate using Haskell had a runtime of about two and a half hours. Yeesh. C can do this in a minute or less. And apparently the traces that real architecture researchers use are gigabytes in size. Writing the simulator in OCaml was a joy; testing and running it was not.
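One workaround I didn’t take, so treat this as a hypothetical sketch: on a 64-bit machine, OCaml’s native int has 63 bits, so a 32-bit address (and 32-bit-plus counters) fits without boxing, and parsing straight into int sidesteps the int32/int64 overhead entirely.

```ocaml
(* Sketch: parse one trace line into an unboxed native int plus the
   memop kind. On a 64-bit platform OCaml's int is 63 bits wide, so
   a full 32-bit address fits unboxed (unlike int32/int64). *)
type memop = Load | Store

let parse_line (line : string) : int * memop =
  match String.split_on_char ' ' line with
  | [addr; op] ->
      let a = int_of_string ("0x" ^ addr) in
      let m = match op with
        | "L" -> Load
        | "S" -> Store
        | _ -> invalid_arg "parse_line: bad memop" in
      (a, m)
  | _ -> invalid_arg "parse_line: bad line"

let () =
  let (a, _m) = parse_line "1A2B3C4D L" in
  Printf.printf "%x\n" a   (* prints 1a2b3c4d *)
```

On a 32-bit machine this doesn’t work (int is only 31 bits there), which is presumably why the boxed types were needed in the first place.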

There were some optimizations I didn’t do. I read the trace memop-by-memop rather than in blocks of memops, and I ran all of my simulations in lockstep: read in a memop, simulate it in each type of cache, repeat. I could have improved cache locality by reading in a block of memops and then simulating each cache in sequence; I’m also not sure how the compiler laid out my statistics structures. I could’ve written the statistics functions in C on unboxed unsigned longs, but I didn’t have the patience; besides, I’d still have to pay for boxing and unboxing the C structure on every call. Still: one lazy summer week, I may give the code generation for boxed integers a glance.