Saturday, March 17, 2018

New tab homepage and Firefox 59

It looks like Firefox 59 has a nice feature that kindly disables the New Tab homepage add-on in favor of their cool bookmark page.
If it's not working, check the add-on manager.

Monday, August 28, 2017

VILKU VILKA Crafts

A friend of mine has opened an online shop.

Vancouver based Latvian Lifestyle Shop
Latvian Zakka* + Original Handcrafts


Check out their crafts here!
http://vilkuvilka.ca/onlineshop/

Saturday, July 1, 2017

Canada 150

Happy Canada Day!
Joyeuse Fête du Canada!

Wednesday, March 25, 2015

Fracture

Real example here.


Sunday, September 28, 2014

Kd-tree implementation

I have implemented my own kd-tree library based on Accelerating kd-tree searches for all k-nearest neighbours (pdf) and compared the results with ANN to confirm that it outputs correct answers. Surprisingly, my implementation was several hundred times faster (see the graph below: for the same queries, the profiler records 14728 samples in ANN versus 15 in my code). I still suspect there is something wrong with my code or my profiling, because it seems too fast even after accounting for the fact that my kd-tree is specialized in finding the nearest *one* point and uses Eigen::Vector4f for SSE optimisation.

[graph: gperftools call-graph profile of the same queries, 14728 samples in ANN vs. 15 in my kd-tree]

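To give an idea of what the nearest-*one*-point specialization looks like, here is a minimal sketch of such a query over a median-split kd-tree. This is illustrative code, not my actual library, and the only SSE it gets is whatever Eigen::Vector4f's arithmetic provides:

#include <Eigen/Dense>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <vector>

// 3D points kept in Eigen::Vector4f (w unused) so distance computations map
// to SSE; the aligned allocator keeps them 16-byte aligned.
typedef std::vector<Eigen::Vector4f, Eigen::aligned_allocator<Eigen::Vector4f> > Points;

struct Node { int pidx, axis, left, right; };

struct KdTree
{
 Points pts;               // points, permuted during build
 std::vector<Node> nodes;
 int root;

 explicit KdTree(const Points& p) : pts(p) { root = build(0, (int)pts.size(), 0); }

 int build(int lo, int hi, int depth)
 {
  if (lo >= hi) return -1;
  int axis = depth % 3;
  int mid = (lo + hi) / 2;
  // Median split along the current axis.
  std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
   [axis](const Eigen::Vector4f& a, const Eigen::Vector4f& b) { return a[axis] < b[axis]; });
  int id = (int)nodes.size();
  nodes.push_back(Node{mid, axis, -1, -1});
  int l = build(lo, mid, depth + 1);     // recurse into locals first:
  int r = build(mid + 1, hi, depth + 1); // push_back may reallocate 'nodes'
  nodes[id].left = l;
  nodes[id].right = r;
  return id;
 }

 // Returns the index (into pts) of the single nearest point to q.
 int nearest(const Eigen::Vector4f& q) const
 {
  int best = -1;
  float bestD2 = std::numeric_limits<float>::max();
  search(root, q, best, bestD2);
  return best;
 }

 void search(int id, const Eigen::Vector4f& q, int& best, float& bestD2) const
 {
  if (id < 0) return;
  const Node& n = nodes[id];
  float d2 = (pts[n.pidx] - q).squaredNorm(); // one SSE subtract + dot product
  if (d2 < bestD2) { bestD2 = d2; best = n.pidx; }
  float diff = q[n.axis] - pts[n.pidx][n.axis];
  search(diff < 0.f ? n.left : n.right, q, best, bestD2); // near side first
  if (diff * diff < bestD2)                               // prune the far side
   search(diff < 0.f ? n.right : n.left, q, best, bestD2);
 }
};

int main()
{
 Points pts;
 for (int i = 0; i < 10000; ++i)
  pts.push_back(Eigen::Vector4f(std::rand() / (float)RAND_MAX,
                                std::rand() / (float)RAND_MAX,
                                std::rand() / (float)RAND_MAX, 0.f));
 KdTree tree(pts);
 Eigen::Vector4f q(0.5f, 0.5f, 0.5f, 0.f);
 std::cout << tree.pts[tree.nearest(q)].transpose() << std::endl;
 return 0;
}
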
I tested several sampling profilers: perf, gperftools, OProfile, and CodeXL. It took some time to get a correct call graph. Many profilers output a call graph where malloc/new has no parent, probably because libstdc++ is not compiled with -fno-omit-frame-pointer, so the profiler cannot find the caller function. Newer profilers such as recent perf and gperftools handle this correctly. I couldn't, however, find a way to make perf report's "-G" option work when I took a profile with

perf record --call-graph dwarf ...

so I used gperftools instead (the graph above). It also took a while to discover that pprof's 'callgrind' output (for KCachegrind) did not render the result correctly; 'web' worked fine.
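
For reference, collecting a profile with gperftools looks roughly like this; the binary name and the libprofiler path are illustrative and vary by system:

CPUPROFILE=kdtree.prof LD_PRELOAD=/usr/lib/libprofiler.so ./kdtree_benchmark
pprof --web ./kdtree_benchmark kdtree.prof

(pprof --callgrind is the option that did not render correctly for me; --web opens the graph in a browser.)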

Sunday, July 27, 2014

Amazing


from Wikipedia:

Andromeda Software Development (or simply ASD) is a Greek demogroup that was formed in 1992. They produced a number of small intros and demos in the mid-1990s for the PC, most notably CounterFactual (winner of the first Greek demo party ever, The Gardening 1995) and Beyond (Placed 4th in The Gardening 1996). ASD was quiet for the following years until 2001, when they presented Cadence & Cascade - their first accelerated demo - and won the Digital Nexus demoparty, held in Athens, Greece.

Wednesday, July 16, 2014

Testing unified memory

I just tested CUDA 6's unified memory. The code below worked as I expected.

#include <iostream>
#include <string.h>

#include <cuda_runtime.h>

__global__ void set(char* buf, unsigned int num)
{
 int i = blockDim.x * blockIdx.x + threadIdx.x;
 if (i < num)
 {
  buf[i] = 1;
 }
}

int main(void)
{
 unsigned int num = 16384;
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //##1## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##2## H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //##3## D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data transfer.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //##4## D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //##5## H->D(4096bytes).
 cudaFree(buf);
 cudaDeviceReset();
 return 0;
}

I ran the code under a profiler and added comments where data transfers occurred. Whenever host code accesses memory that has not yet been migrated, a D->H transfer occurs for either a read or a write; the copy happens not only before reads but also before writes. My understanding is that this is necessary: if the runtime did not copy D->H on a write, then writing a value and later reading a nearby address on the same page would trigger a D->H transfer on the read that overwrites your modification. It does produce a bit of unnecessary data transfer, though (##1##).
You can also see that at ##5## only 4096 bytes are transferred, one page rather than the full 16384 bytes. This is because the runtime keeps track of which pages you have written, and only the page containing buf[0] was dirtied (I asked an NVIDIA engineer about this).

I modified the main function a little and created another buffer, buf2.

int main(void)
{
 unsigned int num = 16384;

 //Below code adds D->H(16384bytes), H->D(16384bytes). 1 page fault.
 char *buf2;
 cudaMallocManaged(&buf2, num);
 memset(buf2, 0, num);

 //Same as above.
 char *buf;
 cudaMallocManaged(&buf, num);
 memset(buf, 0, num);                         //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(16384bytes).
 cudaDeviceSynchronize();
 std::cout << (int)buf[0] << std::endl;       //D->H(16384bytes). page fault.
 std::cout << (int)buf[10000] << std::endl;   //No data migration.
 set<<<4, 128>>>(buf, num);
 cudaDeviceSynchronize();
 buf[0] = 5;                                  //D->H(16384bytes). page fault.
 set<<<4, 128>>>(buf, num);                   //H->D(4096bytes).
 cudaFree(buf);
 cudaFree(buf2);
 cudaDeviceReset();
 return 0;
}

In this code I allocated unified memory for buf2 and called memset() on it, but buf2 is not used anywhere else. This adds a D->H transfer (16384 bytes) when memset() is called, which is natural, but it also adds an H->D transfer (16384 bytes). We know the kernel doesn't use buf2, but that is too complicated for CUDA to figure out, so before the first kernel launch it transfers buf2 as well, which is another unnecessary data transfer. I have concluded that unified memory makes the program simpler but is not as efficient as I had expected. I hear Pascal will have cleverer memory management; I will wait and see if it is good enough.
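
For comparison, here is a sketch of the first example rewritten with explicit cudaMalloc()/cudaMemcpy(). It is more verbose, but every transfer is spelled out, so neither of the unnecessary transfers above happens (illustrative code, not a tuned implementation):

#include <iostream>
#include <string.h>

#include <cuda_runtime.h>

__global__ void set(char* buf, unsigned int num)
{
 int i = blockDim.x * blockIdx.x + threadIdx.x;
 if (i < num)
 {
  buf[i] = 1;
 }
}

int main(void)
{
 unsigned int num = 16384;
 char* h_buf = new char[num];
 char* d_buf;
 cudaMalloc(&d_buf, num);
 memset(h_buf, 0, num);                                 //Host memory only. No transfer.
 cudaMemcpy(d_buf, h_buf, num, cudaMemcpyHostToDevice); //One explicit H->D(16384bytes).
 set<<<4, 128>>>(d_buf, num);
 cudaMemcpy(h_buf, d_buf, 1, cudaMemcpyDeviceToHost);   //Copy back only the byte we read.
 std::cout << (int)h_buf[0] << std::endl;
 cudaFree(d_buf);
 delete[] h_buf;
 cudaDeviceReset();
 return 0;
}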