Sunday, July 27, 2014

Amazing


from Wikipedia:

Andromeda Software Development (or simply ASD) is a Greek demogroup that was formed in 1992. They produced a number of small intros and demos in the mid-1990s for the PC, most notably CounterFactual (winner of the first Greek demo party ever, The Gardening 1995) and Beyond (Placed 4th in The Gardening 1996). ASD was quiet for the following years until 2001, when they presented Cadence & Cascade - their first accelerated demo - and won the Digital Nexus demoparty, held in Athens, Greece.

Wednesday, July 16, 2014

Testing unified memory

I just tested CUDA6's unified memory. This code worked as I expected.

#include <iostream>
#include <string.h>

#include <cuda_runtime.h>

__global__ void set(char* buf, unsigned int num)
{
 int i = blockDim.x * blockIdx.x + threadIdx.x;
 if (i < num)
 {
  buf[i] = 1;
 }
}

int main(void)
{
 unsigned int num = 16384;
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //##1## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##2## H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //##3## D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data transfer.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //##4## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##5## H->D(4096bytes).
 cudaFree(buf);
 cudaDeviceReset();
   return 0;
}

I used a profiler and added comments where data transfers have occurred. Whenever an un-transferred host memory is accessed for either read or write, Data transfer D->H occurs, not only before read but before write. In my understanding it is because unless it copies the memory D->H on write, once you wrote a value to a memory and then read a memory near the address where you have modified the value, data transfer D->H on read would overwrite your modification. It produces a bit of unnecessary data transfer (##1##).
You can also see when ##5##, only 4096 bytes of data transfer occurs, not full 16384 bytes. It is because it keep tracks of the page you have written (I have asked it to a NVIDIA guy).

I modified the main function a little bit and created another buffer buf2.
int main(void)
{
 unsigned int num = 16384;

 //Below code adds D->H(16384bytes), H->D(16384bytes). 1 page fault.
 char *buf2;
 cudaMallocManaged(&buf2, num);
 memset(buf2, 0, num);

 //Same as above.
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data migration.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(4096bytes).
 cudaFree(buf);
 cudaFree(buf2);
 cudaDeviceReset();
   return 0;
}

In this code I allocated a unified memory to buf2 and called memset() but buf2 is not used elsewhere. It adds D->H(16384bytes) when memset() is called and it is natural, but it also adds H->D(16384bytes). We know the kernel doesn't use buf2 but it is too complicated for CUDA to know that. So before the first kernel call it also transfers buf2, which is another unnecessary data transfer. I have concluded unified memory makes the program simpler but not as efficient as I have expected. I heard Pascal has cleverer memory management. I will wait and see if is good enough.