***Random Ball Cover (RBC) v0.2***
Lawrence Cayton
lcayton@tuebingen.mpg.de

(C) Copyright 2010, Lawrence Cayton [lcayton@tuebingen.mpg.de]

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

---------------------------------------------------------------------
SUMMARY

This is a C and CUDA implementation of the Random Ball Cover data
structure for nearest neighbor search described in

L. Cayton, A nearest neighbor data structure for graphics hardware.
ADMS, 2010.


---------------------------------------------------------------------
FILES

* brute.{h,cu} -- implementation of brute force search (CPU and GPU
  versions)
* defs.h -- definitions of constants and macros, including the
  distance metric.
* driver.cu -- example code for using the RBC data structure.
* kernels.{h,cu} -- implementation of all the (device) kernel functions,
  except those related to the scan (see sKernel below)
* kernelWrap.{h,cu} -- CPU wrapper code around the kernels.
* rbc.{h,cu} -- the core of the RBC data structure.  Includes the
  implementation of the build and search algorithms.
* sKernel.{h,cu} -- implementation of the kernel functions related to
  the parallel scan algorithm (used within the build method).
* sKernelWrap.{h,cu} -- wrappers for the kernels in sKernel.
* utils.{h,cu} -- misc utilities used in the code.
* utilsGPU.{h,cu} -- misc utilities related to the GPU.


---------------------------------------------------------------------
COMPILATION

Type make in a shell.  Requires GCC and NVCC (CUDA).  The code has
been tested under GCC 4.4 and CUDA 3.1.


---------------------------------------------------------------------
USE

To use the RBC data structure, you will likely need to integrate this
code into your own.  The driver.cu file provides an example of how to
use the RBC implementation.  To try it out, type
> testRBC
at the prompt and a list of options will be displayed.  Currently, the
test program assumes that the input is a single binary file, which it
then splits randomly into a query set and a database.  Clearly, such a
setup is only useful for testing the performance of the data
structure.  To use the data structure in a more useful fashion, you
may wish to call the readData function on separate files.  There is
also a readDataText function in driver.cu for your convenience.

The core of the implementation is in rbc.cu and in the kernel files.
There is a buildRBC function, a queryRBC function, and a kqueryRBC
function, which together should suffice for basic use of the data
structure.

Currently, the kernel functions are reasonably optimized, but can be
improved.  Indeed, the results appearing in the ADMS paper came from a
slightly more optimized version than this one.


---------------------------------------------------------------------
MISC NOTES ON THE CODE

* The code currently computes distance using the L_1 (Manhattan)
  metric.  If you wish to use a different notion of distance, you must
  modify defs.h.  It is quite simple to switch to any metric that
  operates along the coordinates independently (eg, any L_p metric),
  but more complex metrics will require some additional work.  The L_2
  metric (standard Euclidean distance) is already defined in defs.h.

* The k-NN code is currently hard-coded for K=32.  The k-NN code
  contains a manual implementation of a sorting network, which is why
  this value is hard-coded.  This design allows all sorting to take
  place in on-chip (shared) memory, and is highly efficient.  Note
  that the NNs are returned in sorted order, so that if one wants
  only, say, 5 NNs, one can simply ignore the last 27 returned
  indices.  For k>32, contact the author.

* The code requires that the entire DB and query set fit into the
  device memory.

* For the most part, device variables (ie arrays residing on the GPU)
  begin with a lowercase d.  For example, the device version of the
  DB variable x is dx.

* The computePlan code is a bit more complex than is needed for the
  version of the RBC search algorithm described in the paper.  The
  search algorithm described in the paper has two steps: (1) find the
  closest representative to the query; (2) explore the points owned
  by that representative (ie the s closest points to the
  representative in the DB).  The computePlan code is more complex in
  order to make it easy to try out other options.  For example, one
  could instead search the points owned by the *two* closest
  representatives to the query.  This would require only minor
  changes to the code, though it is currently untested.

* Currently the software works in single precision.  If you wish to
  switch to double precision, you must edit the defs.h file.  Simply
  uncomment the lines

  typedef double real;
  #define MAX_REAL DBL_MAX

  and comment out the lines

  typedef float real;
  #define MAX_REAL FLT_MAX

  Then, you must do a

  make clean

  followed by another make.

* This software has been tested on the following graphics cards:
  NVIDIA GTX 285
  NVIDIA Tesla C2050

* This software has been tested under the following software setup:
  Ubuntu 10.04 (Linux)
  GCC 4.4
  CUDA 3.1

  Please share your experience getting it to work under Windows and
  Mac OS X!

* If you are running this code on a GPU which is also driving your
  display: a well-known issue with CUDA code in this situation is
  that the operating system will automatically kill kernels that have
  been running for more than 5-10 seconds or so.  You can get around
  this in Linux by switching out of X-Windows (often CTRL-ALT-F1 does
  the trick) and running the code directly from the terminal.

* If you have a graphics card with more than 4GB of memory (eg a
  Tesla C2070), and use a recent version of CUDA (at least 3.2), then
  an optimization is possible.  In particular, removing the

  -DCUDA_FORCE_API_VERSION=3010

  flag and making some very minor code adjustments will improve
  performance.