In the last posting, Heterogeneous Computing using GPGPUs: NVidia GT200, I promised the next post would be a follow-on look at the AMD/ATI RV770. However, over the weekend, Niraj Tolia of HP Labs sent this my way as a follow-up on the set of articles on GPGPU Programming. Prior to reading this note, I hadn't really been interested in virtualizing GPUs, but the paper caught my interest, and I'm posting my notes on it just ahead of the RV770 architectural review that I'll get up later in the week.
The paper GViM: GPU-accelerated Virtual Machines tackles the problem of supporting GPGPU programming in a virtual machine environment. The basic problem is this: if you are running N virtual machines, each of which is running one or more GPGPU jobs, and you have fewer than N GPGPUs physically attached to the server, then you need to virtualize the GPGPU. As covered in the last two postings, GPUs are large, very high-state devices and, consequently, hard to virtualize efficiently.
The approach discussed in this paper extends a trick that I first saw used in Virtual Interface Adapter communications and that is also supported by InfiniBand. I'm sure this model appeared elsewhere earlier, but these are two good examples. In this networking interface model, the cost of passing each send and receive through the operating system communication path is avoided, without giving up security, by first making operating system calls to set up a communication path and to register buffers and doorbells. The doorbell is a memory location that, when written to, causes the adapter to send the contents of the send buffer. At this point, the communications channel is set up, and all sends and receives can be done directly in user space without further operating system interaction. It's a nice, secure implementation of Remote Direct Memory Access (RDMA).
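To make the setup-once, send-many pattern concrete, here's a minimal C sketch of the doorbell idea. None of the names below come from the paper or from a real adapter API; the registration call, doorbell mapping, and channel structure are hypothetical placeholders for whatever the operating system and adapter actually provide.

    /* Hypothetical user-space send path in the VIA/InfiniBand style.
     * Setup (privileged, via OS calls) happens once; each send is then
     * a pure user-space copy and store with no kernel transition. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        void              *send_buf;   /* OS-registered, pinned send buffer        */
        size_t             buf_len;
        volatile uint32_t *doorbell;   /* adapter doorbell mapped into user space  */
    } channel_t;

    /* Assumed setup call: asks the OS to pin the buffer, program the
     * adapter, and map the doorbell register into this process. */
    extern int os_register_channel(channel_t *ch, size_t buf_len);

    static int send_message(channel_t *ch, const void *msg, size_t len)
    {
        if (len > ch->buf_len)
            return -1;
        memcpy(ch->send_buf, msg, len);  /* fill the registered buffer       */
        *ch->doorbell = (uint32_t)len;   /* ring the doorbell: the adapter   */
                                         /* DMAs the buffer and sends it     */
        return 0;
    }

The point is simply that, after the privileged setup, the data path never enters the kernel.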
This technique of virtualizing part of a communications adapter and mapping it into the address space of the application program can be played out in the GPGPU world as well, allowing efficient sharing of GPUs between guest operating systems in a virtual machine environment.
The approach to this problem proposed in the paper is based upon three observations: 1) GPU calls are coarse-grained with considerable work done between each call, so overhead on the calls themselves doesn't dominate, 2) data transfer in and out of the device is very important and can dominate if not done efficiently, and 3) high-level API access to GPUs is common. Building on the third observation, they chose to virtualize at the CUDA API level and implement CUDA over what is called, in the virtual machine world, a split driver model. In the split driver model, a front-end, or client, device driver is loaded into the guest O/S and it makes calls to the management domain (called dom0 in Xen). The other half of the driver is implemented in dom0 and makes standard CUDA calls against the physical GPU device(s).
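As a rough illustration of the dom0 half of the split driver, here's a C sketch of a back end that replays marshaled guest requests against the real CUDA runtime. The request structure, opcodes, and transport are invented for illustration and are not the GViM implementation; only the cudaMalloc() and cudaMemcpy() calls are real CUDA APIs.

    /* Hypothetical dom0 back end: replays guest requests against the
     * real CUDA runtime. Only cudaMalloc/cudaMemcpy are real APIs; the
     * request format and how it arrives are invented for illustration. */
    #include <cuda_runtime_api.h>
    #include <stddef.h>

    typedef enum { REQ_MALLOC, REQ_MEMCPY_H2D } req_op_t;

    typedef struct {
        req_op_t op;
        size_t   size;
        void    *dev_ptr;    /* device pointer (filled in for REQ_MALLOC) */
        void    *host_data;  /* guest data already visible to dom0        */
    } gpu_req_t;

    static cudaError_t handle_request(gpu_req_t *req)
    {
        switch (req->op) {
        case REQ_MALLOC:
            /* Allocate device memory on behalf of the guest. */
            return cudaMalloc(&req->dev_ptr, req->size);
        case REQ_MEMCPY_H2D:
            /* Copy guest-supplied data out to the device. */
            return cudaMemcpy(req->dev_ptr, req->host_data,
                              req->size, cudaMemcpyHostToDevice);
        default:
            return cudaErrorInvalidValue;
        }
    }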
The approach taken by this paper is to implement all calls to CUDA via an interposer library that makes calls to the guest O/S driver, which makes calls to the dom0 component, which makes calls to the GPU. This effectively virtualizes the GPU device, but the required call path is very inefficient. The authors note that calls to CUDA are coarse-grained and do considerable work, so the per-call inefficiency does get amortized out nicely as long as the data is brought to and from the device efficiently. This latter point is the tough one, and it is where the memory-mapping tricks I introduced above are used.
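The guest-side interposer presents the normal CUDA entry points but forwards them rather than touching hardware. The sketch below, with an invented forward_to_dom0() transport and opcode, shows roughly how a cudaMalloc() call might be intercepted and marshaled; GViM's actual marshaling details differ.

    /* Hypothetical guest-side interposer: exports the CUDA symbol but
     * forwards the call to the front-end driver / dom0 back end.
     * forward_to_dom0() and the opcode are invented placeholders. */
    #include <cuda_runtime_api.h>
    #include <stddef.h>

    extern cudaError_t forward_to_dom0(int opcode, void *args, size_t args_len);

    #define OP_CUDA_MALLOC 1  /* invented opcode */

    typedef struct {
        size_t size;
        void  *dev_ptr;   /* returned opaque device handle */
    } malloc_args_t;

    /* The application links against this library instead of the real libcudart. */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        malloc_args_t args = { .size = size, .dev_ptr = NULL };
        cudaError_t rc = forward_to_dom0(OP_CUDA_MALLOC, &args, sizeof(args));
        if (rc == cudaSuccess)
            *devPtr = args.dev_ptr;  /* handle is only meaningful to dom0 */
        return rc;
    }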
The authors proposed three solutions to getting data to and from the GPU:
1. 2-copy: the user program allocates memory in the guest O/S using malloc. Memory transferred to the GPGPU must first be copied to the host O/S kernel, and then dom0 writes it to the GPU.
2. 1-copy: the user program and the device driver in the guest O/S kernel address space share a mapped memory region, avoiding one of the two copies above.
3. Bypass: exploit the fact that the GPU is 100% managed by the dom0 component of the device driver and have it call cudaMallocHost() at start-up time, mapping all GPU memory into its address space. Then employ the mapping trick of point 2 above to selectively map this space into the guest application's address space (see the sketch after this list). This has the upside of avoiding copies but the downside of statically partitioning the GPU memory space: each application gets access to only a portion of it. Less copying and less cost on context switch, but much less memory is available to each application program.
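To give a flavor of the bypass approach, here's a rough C sketch of the dom0-side setup: one large cudaMallocHost() allocation at start-up, statically partitioned into per-guest slices that would then be mapped into guest address spaces. The pool size, partitioning scheme, and export_to_guest() call are invented for illustration; cudaMallocHost() is the only real CUDA API used, and the Xen mapping machinery GViM relies on is not shown.

    /* Hypothetical dom0 start-up for the bypass scheme: one big pinned,
     * host-mapped allocation, statically split across guests. Only
     * cudaMallocHost() is a real CUDA call; export_to_guest() stands in
     * for the mapping machinery that actually hands a slice to a guest. */
    #include <cuda_runtime_api.h>
    #include <stddef.h>

    #define NUM_GUESTS 4
    #define POOL_BYTES (256UL * 1024 * 1024)   /* illustrative pool size */

    extern int export_to_guest(int guest_id, void *base, size_t len);

    int setup_bypass_pool(void)
    {
        void *pool = NULL;
        /* Pinned host memory the GPU can DMA to and from directly. */
        if (cudaMallocHost(&pool, POOL_BYTES) != cudaSuccess)
            return -1;

        /* Static partitioning: each guest gets a fixed slice, so no
         * copies are needed later, but each application sees less memory. */
        size_t slice = POOL_BYTES / NUM_GUESTS;
        for (int id = 0; id < NUM_GUESTS; id++) {
            if (export_to_guest(id, (char *)pool + id * slice, slice) != 0)
                return -1;
        }
        return 0;
    }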
Summary: By choosing to virtualize at the API layer rather than at the hardware layer, the task of virtualization was made easier, with the downside that only one API is supported in this model. The authors use the split driver model to implement this level of virtualization easily on Xen, exploiting the fact that considerable work is done per CUDA call. Finally, they manage memory efficiently using the three techniques described above.
If you are interested in virtualization and GPGPU programming, it’s a good read with a simple and practical approach to virtualizing GPUs: http://www.cc.gatech.edu/~vishakha/files/GViM.pdf.
–jrh
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | james@amazon.com
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com