Sunday, September 28, 2014

Kd-tree implementation

I have implemented my own kd-tree library based on Accelerating kd-tree searches for all k-nearest neighbours (pdf) and compared its output with ANN to confirm that it returns correct results. Surprisingly, my implementation was several hundred times faster (in the graph below, for the same queries, ANN gets 14728 profiler samples while my code gets 15). I still suspect there is something wrong with my code or with the profiling, because it seems too fast even considering that my kd-tree is specialized in finding the nearest *one* point and uses Eigen::Vector4f for SSE optimisation.
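
For context, here is a minimal sketch of the kind of nearest-one-point query I mean. This is not the code I benchmarked, and KdNode, build and nearest are just illustrative names, but it shows how using Eigen::Vector4f as the point type lets the squared-distance computation map onto SSE:

//Illustrative sketch only, not the benchmarked library.
#include <Eigen/Dense>
#include <algorithm>
#include <limits>
#include <memory>
#include <vector>

struct KdNode
{
    EIGEN_MAKE_ALIGNED_OPERATOR_NEW
    Eigen::Vector4f point;               //3D point padded with 0 in w.
    int axis;                            //Splitting axis: 0, 1 or 2.
    std::unique_ptr<KdNode> left, right;
};

typedef std::vector<Eigen::Vector4f, Eigen::aligned_allocator<Eigen::Vector4f> > PointVec;

//Build the tree by splitting on the median of the current axis.
std::unique_ptr<KdNode> build(PointVec& pts, int begin, int end, int axis)
{
    if (begin >= end) return nullptr;
    int mid = (begin + end) / 2;
    std::nth_element(pts.begin() + begin, pts.begin() + mid, pts.begin() + end,
                     [axis](const Eigen::Vector4f& a, const Eigen::Vector4f& b)
                     { return a[axis] < b[axis]; });
    std::unique_ptr<KdNode> node(new KdNode);
    node->point = pts[mid];
    node->axis = axis;
    node->left  = build(pts, begin, mid, (axis + 1) % 3);
    node->right = build(pts, mid + 1, end, (axis + 1) % 3);
    return node;
}

//Find the nearest single point. The squared distance is computed on Vector4f,
//which Eigen evaluates with SSE.
void nearest(const KdNode* node, const Eigen::Vector4f& query,
             const KdNode*& best, float& bestDist2)
{
    if (!node) return;
    float d2 = (node->point - query).squaredNorm();
    if (d2 < bestDist2) { bestDist2 = d2; best = node; }
    float diff = query[node->axis] - node->point[node->axis];
    const KdNode* nearSide = diff < 0.0f ? node->left.get()  : node->right.get();
    const KdNode* farSide  = diff < 0.0f ? node->right.get() : node->left.get();
    nearest(nearSide, query, best, bestDist2);
    if (diff * diff < bestDist2)                  //Visit the far side only if the
        nearest(farSide, query, best, bestDist2); //splitting plane is close enough.
}

int main()
{
    PointVec pts;
    pts.push_back(Eigen::Vector4f(0.0f, 0.0f, 0.0f, 0.0f));
    pts.push_back(Eigen::Vector4f(1.0f, 2.0f, 3.0f, 0.0f));
    pts.push_back(Eigen::Vector4f(-1.0f, 0.5f, 2.0f, 0.0f));
    pts.push_back(Eigen::Vector4f(4.0f, 4.0f, 4.0f, 0.0f));
    std::unique_ptr<KdNode> root = build(pts, 0, (int)pts.size(), 0);

    Eigen::Vector4f query(1.0f, 1.0f, 1.0f, 0.0f);
    const KdNode* best = nullptr;
    float bestDist2 = std::numeric_limits<float>::max();
    nearest(root.get(), query, best, bestDist2);
    //best now points at the tree node closest to the query.
    return 0;
}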




I tested several sampling profilers: perf, gperftools, OProfile and CodeXL. It took me some time to get a correct call graph. Many profilers output a call graph where malloc/new has no parent, probably because libstdc++ is not compiled with -fno-omit-frame-pointer and they cannot find the caller function. Newer profilers such as recent perf and gperftools handle this correctly. I couldn't find a way, though, to make perf report's "-G" option work when I took a profile with

perf record --call-graph dwarf ...

so I used gperftools instead (the graph above). It still took a while before I noticed that pprof's 'callgrind' output (for KCachegrind) was not generated correctly; the 'web' output worked fine.
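
In case it is useful, the gperftools workflow I ended up with was roughly the following (binary and file names are placeholders; pprof is sometimes installed as google-pprof):

g++ -O2 -g -fno-omit-frame-pointer main.cpp -o myprog -lprofiler
CPUPROFILE=myprog.prof ./myprog
pprof --web ./myprog myprog.prof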

Sunday, July 27, 2014

Amazing


from Wikipedia:

Andromeda Software Development (or simply ASD) is a Greek demogroup that was formed in 1992. They produced a number of small intros and demos in the mid-1990s for the PC, most notably CounterFactual (winner of the first Greek demo party ever, The Gardening 1995) and Beyond (Placed 4th in The Gardening 1996). ASD was quiet for the following years until 2001, when they presented Cadence & Cascade - their first accelerated demo - and won the Digital Nexus demoparty, held in Athens, Greece.

Wednesday, July 16, 2014

Testing unified memory

I just tested CUDA 6's unified memory. The code below worked as I expected.

#include <iostream>
#include <string.h>

#include <cuda_runtime.h>

__global__ void set(char* buf, unsigned int num)
{
 int i = blockDim.x * blockIdx.x + threadIdx.x;
 if (i < num)
 {
  buf[i] = 1;
 }
}

int main(void)
{
 unsigned int num = 16384;
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //##1## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##2## H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //##3## D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data transfer.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //##4## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##5## H->D(4096bytes).
 cudaFree(buf);
 cudaDeviceReset();
 return 0;
}

I ran the code under a profiler and added comments where data transfers occurred. Whenever not-yet-transferred memory is accessed from the host, a D->H transfer occurs, before a read and also before a write. My understanding is that if the runtime did not copy D->H on a write, then writing a value and later reading an address near the one you modified would trigger a D->H transfer on the read that overwrites your modification. The downside is some unnecessary data transfer (##1##).
You can also see that at ##5## only 4096 bytes are transferred, not the full 16384 bytes. This is because the runtime keeps track of which pages you have written (I asked an NVIDIA guy about this).
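
A variant I have not profiled: the D->H transfer at ##1## should be avoidable by clearing the buffer on the device with cudaMemset instead of touching it from the host first, i.e. replacing the first part of main() with something like this:

 char *buf;
 cudaMallocManaged(&buf, num);
 cudaMemset(buf, 0, num);                     //Clear on the device; no host page fault expected.
 set<<<4, 128>>>(buf, num);                   //Data should already be resident on the device.
 cudaDeviceSynchronize();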

I modified the main function a little and created another buffer, buf2:
int main(void)
{
 unsigned int num = 16384;

 //Below code adds D->H(16384bytes), H->D(16384bytes). 1 page fault.
 char *buf2;
 cudaMallocManaged(&buf2, num);
 memset(buf2, 0, num);

 //Same as above.
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data migration.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(4096bytes).
 cudaFree(buf);
 cudaFree(buf2);
 cudaDeviceReset();
 return 0;
}

In this code I allocated unified memory for buf2 and called memset() on it, but buf2 is not used anywhere else. This adds a D->H transfer of 16384 bytes when memset() is called, which is expected, but it also adds an H->D transfer of 16384 bytes. We know the kernel doesn't use buf2, but that is too hard for CUDA to figure out, so before the first kernel launch it transfers buf2 as well, which is another unnecessary data transfer. My conclusion is that unified memory makes the program simpler but not as efficient as I had expected. I heard Pascal will have cleverer memory management; I will wait and see whether it is good enough.
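
If I read the unified memory documentation correctly, you can hint that an allocation such as buf2 is used only by the host by attaching it to the host, which should stop it from being migrated before every kernel launch. Something like this (an untested sketch):

 char *buf2;
 cudaMallocManaged(&buf2, num, cudaMemAttachHost); //Host-only hint; kernels are not expected to touch buf2.
 memset(buf2, 0, num);                             //Host access without a migration, if I understand the docs correctly.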

Sunday, May 18, 2014

CPU/GPU memory abstraction

When you write a program that uses the GPU (either CUDA or OpenCL), you may want to implement both a CPU path and a GPU path and use one of them depending on the user's choice. For this reason I wanted a way to treat the different kinds of memory generically, i.e. to abstract host and device memory.

The easiest option is to follow the approach of (or just use) Thrust. It has host_vector and device_vector, whose interfaces are quite similar to std::vector. Given one of them, you can copy it to the other with a simple assignment; the host/device copy is done automatically under the hood.
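
For example (a minimal sketch, not code from my project):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main()
{
    thrust::host_vector<float> h(16384, 1.0f);  //Host memory.
    thrust::device_vector<float> d = h;         //The assignment copies H->D under the hood.
    d[0] = 2.0f;                                //Writes a single element in device memory.
    h = d;                                      //Copies D->H.
    return 0;
}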

I wanted a different kind of abstraction so that I could debug the program easily. What interested me was a mechanism where the input data can live in either host or device memory, and the calling code does not have to know where it is. After some trial and error I implemented one like this:

//HOST and DEVICE are bit flags so the clean state can be a bitwise OR of the two.
enum MemoryType
{
    HOST   = 1,
    DEVICE = 2
};

template <typename T>
class Buffer
{
public:
    T* get(MemoryType mType);                            //Pointer to the host or device copy.
    const T* get(MemoryType mType) const;
    void setClean(MemoryType mType, bool isClean=true);  //Mark one copy as up to date (or not).
    void sync(MemoryType mType) const;                    //Make the requested copy up to date.
    void allocate(MemoryType mType) const;
    void free(MemoryType mType);

private:
    mutable MemoryType m_cleanState; //Clean state. Bitwise OR of HOST and DEVICE.
    mutable void* m_addrs[2];        //Host and device addresses.
};

It can hold both a host and a device copy of the data and knows whether each copy is up to date. If the host copy is up to date and the device copy is not, calling sync(DEVICE) copies the data from host to device; if the device copy is already up to date, sync() does nothing.
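
A minimal sketch of how sync() could be implemented with CUDA (this is not my exact implementation; error handling is omitted and m_size is an assumed extra member holding the buffer size in bytes):

template <typename T>
void Buffer<T>::sync(MemoryType mType) const
{
    if (m_cleanState & mType)
        return;                                   //The requested copy is already up to date.
    allocate(mType);                              //Make sure the destination memory exists.
    if (mType == DEVICE)
        cudaMemcpy(m_addrs[1], m_addrs[0], m_size, cudaMemcpyHostToDevice); //Host -> device.
    else
        cudaMemcpy(m_addrs[0], m_addrs[1], m_size, cudaMemcpyDeviceToHost); //Device -> host.
    m_cleanState = static_cast<MemoryType>(m_cleanState | mType); //Both copies are clean now.
}
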
You can use this class like this:

void someCalculationCpu(const Buffer<float>* input, Buffer<float>* output)
{
    input->sync(HOST);                  //Make sure the host copy of the input is up to date.
    const float* ip = input->get(HOST);
    float* op = output->get(HOST);
    op[0] = ip[0]; /*Do some calculation with the CPU.*/
    output->setClean(HOST);             //Tell the buffer that the data stored in the host memory is up to date.
}

void someCalculationGpu(const Buffer<float>* input, Buffer<float>* output)
{
    input->sync(DEVICE);                //Make sure the device copy of the input is up to date.
    const float* ip = input->get(DEVICE);
    float* op = output->get(DEVICE);
    /*Launch a kernel here that reads ip and writes op (both are device pointers).*/
    output->setClean(DEVICE);           //Tell the buffer that the data stored in the device memory is up to date.
}

void anotherCalculationCpu(const Buffer<float>* input, Buffer<float>* output)
{
    /*Same style as someCalculationCpu() with another calculation.*/
}

void anotherCalculationGpu(const Buffer<float>* input, Buffer<float>* output)
{
    /*Same style as someCalculationGpu() with another calculation.*/
}

Now all of these combinations are valid:

   someCalculationCpu(input, output);
   anotherCalculationCpu(input, output);

   someCalculationCpu(input, output);
   anotherCalculationGpu(input, output);

   someCalculationGpu(input, output);
   anotherCalculationCpu(input, output);

   someCalculationGpu(input, output);
   anotherCalculationGpu(input, output);

I've already implemented it, so I'll keep using it, but I wonder whether there is already a tool for this, or a way to do it with Thrust. Please leave a comment if you know of one.