<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta content="en-us" http-equiv="Content-Language"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="no-cache, no-store, must-revalidate" http-equiv="Cache-Control"/>
<meta content="no-cache" http-equiv="Pragma"/>
<meta content="0" http-equiv="Expires"/>
<title>
Index parameters
</title>
<link href="stylesx.css" rel="stylesheet" type="text/css"/>
<style type="text/css">
body.c4 {background-color:#c0c0c0;}
div.c3 {position:absolute; top:45px; left:20px; width:830px; background-color:#ffffff; border-width:10px; border-style:solid;border-color:white;}
span.c2 {font-weight: bold}
div.c1 {position:absolute; top:10px; left:20px; width:850px; height:60px;}
.TopButtonPara { color:white; background-color:rgb(50,100,150); border-color:rgb(50,100,150); font-family:Arial, Helvetica, sans-serif; font-weight:normal; font-size:9pt; text-align:center; border-width:4px; border-style:solid; }
.TopButton { color:white; }
a.TopButton:link { text-decoration:none; }
a.TopButton:visited { text-decoration:none; }
a.TopButton:hover { color:orange; }
.NewButtonPara { color:white; background-color:rgb(50,100,150); border-color:rgb(50,100,150); font-family:Arial, Helvetica, sans-serif; font-weight:normal; font-size:9pt; text-align:center; border-width:4px; border-style:solid; }
.NewButton { color:white; }
a.NewButton:link { text-decoration:none; }
a.NewButton:visited { text-decoration:none; }
a.NewButton:hover { color:orange; }
.SideButtonPara { color:white; font-family:Arial, Helvetica, sans-serif; font-size:9pt; font-weight:normal; text-align:center; line-height:18px; }
.SideButton { color:white; }
a.SideButton:link { text-decoration:none; }
a.SideButton:visited { text-decoration:none; }
a.SideButton:hover { color:orange; }
</style>
</head>
<body style="background-color:#c0c0c0;">
<div>
<a href="https://drive5.com/usearch">
<img alt="USEARCH v12" src="usearch12_banner.jpg" style="position:absolute; top:40px; left:10px; padding:0px; border:0px;"/>
</a>
</div>
<div style="position:absolute; top:115px; left:10px; width:850px; background-color:#ffffff; min-height:500px">
<div style="position:relative; float:left; background-color:#696969; width:125px; left: 0px; min-height:500px; padding:5px; height: 125px;">
<div class="SideButtonPara" style="text-align:center; padding-top:5px;">
<a class="SideButton" href="index.html">
Docs home
</a>
<br/>
<hr style="border:0; border-bottom: 1px solid white;"/>
<a class="SideButton" href="cmds.html">
Commands
</a>
<br/>
<a class="SideButton" href="topics.html">
Topics
</a>
<br/>
<a class="SideButton" href="citation.html">
Publications
</a>
<br/>
</div>
</div>
<div class="ManText" style="left:20px; position: absolute; left:135px; width:695px; background-color:white; padding:10px">
<h1>
Index parameters
</h1>
<span class="ManText">
<b>
See also
</b>
<br/>
<a href="default_index_params.html">
Default index parameter values
</a>
<br/>
<a href="memory.html">
Memory requirements
</a>
<br/>
<br/>
Most USEARCH commands use a database index to enable fast searching. There are two types of index: one for finding matching seeds for the
<a href="ublast_algo.html">
UBLAST algorithm
</a>
, and another for fast calculation of common word counts for the
<a href="usearch_algo.html">
USEARCH algorithm
</a>
. Clustering uses a USEARCH-style index. Indexing parameters apply to both types of index.
</span>
<p>
<span class="ManText">
During search and clustering, indexes are always accessed directly in memory rather than being retrieved from a disk file, in order to maximize speed. The amount of RAM required to store the index is approximately the same as the size of a UDB file created with the same sequences and options. The computer's physical RAM should be larger than the index; otherwise, virtual memory paging will make execution much slower.
</span>
</p>
<p>
<span class="ManText">
Indexes are constructed in three different ways:
</span>
</p>
<p>
<span class="ManText">
(1) Loaded from a UDB file.
</span>
</p>
<p>
<span class="ManText">
(2) Built from a FASTA file.
</span>
</p>
<p>
<span class="ManText">
(3) Built dynamically during clustering. The index is initially empty, then grows as centroid sequences are added to the database.
</span>
</p>
<p>
<span class="ManText">
<b>
Indexing options
<br/>
</b>
In the following table, "word" refers generically to the fixed-length segment of the database sequence that is indexed. It may be a k-mer or a
<a href="patterns.html">
pattern
</a>
. The effective word length is the length of the k-mer or the number of 1s in the pattern.
</span>
</p>
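<p>
<span class="ManText">
As an illustration of the definition above, the following minimal Python sketch extracts words from a sequence using a pattern. It assumes that a 1 marks a position that is kept in the word and a 0 marks a position that is skipped; the function names effective_word_length and extract_words are hypothetical and are not part of USEARCH.
</span>
</p>
<pre>
```python
def effective_word_length(pattern):
    """Number of 1s in a pattern such as '10111011'."""
    return pattern.count("1")


def extract_words(seq, pattern):
    """Return one word per start position, keeping letters under the 1s."""
    span = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    words = []
    for start in range(len(seq) - span + 1):
        words.append("".join(seq[start + i] for i in keep))
    return words


pattern = "10111011"
print(effective_word_length(pattern))
print(extract_words("MKVLAAGIVG", pattern))

# An all-ones pattern gives ordinary k-mers of that length.
print(extract_words("ACGTAC", "1111"))
```
</pre>
<p>
<span class="ManText">
In this sketch the pattern spans 8 letters but only the letters under the 1s contribute to each word, so the effective word length is shorter than the span.
</span>
</p>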
<div align="left">
<table border="0" cellpadding="5" cellspacing="0" summary="Table" width="78%">
<tr>
<td align="right" class="c3" width="112">
<span class="ManText c2">
Option
</span>
</td>
<td align="center" class="c3" width="29">
<span class="ManText c2">
Value
</span>
</td>
<td align="left" class="c4">
<span class="ManText">
Description
</span>
</td>
</tr>
<tr>
<td align="right" class="c5" valign="top" width="112">
<span class="ManText">
-wordlength
</span>
</td>
<td align="center" class="c5" valign="top" width="29">
<span class="ManText">
N
</span>
</td>
<td align="left" class="c5" valign="top">
<span class="ManText">
<a href="word_length.html">
Word length
</a>
. If this is given, an all-ones
<a href="patterns.html">
pattern
</a>
is assumed and the -pattern option may not be given. For long word lengths, the -slots option can be used to reduce memory use.
</span>
</td>
</tr>
<tr>
<td align="right" class="c3" valign="top" width="112">
<span class="ManText">
-pattern
</span>
</td>
<td align="center" class="c3" valign="top" width="29">
<span class="ManText">
string
</span>
</td>
<td align="left" class="c3" valign="top">
<span class="ManText">
A
<a href="patterns.html">
pattern
</a>
specified as a string of 1s and 0s. A pattern of all ones is equivalent to a k-mer of that length and can also be specified by the -wordlength option. It is not valid to specify both -wordlength and -pattern. The default for protein
<a href="DELETE_URL">
ublast
</a>
is 10111011. For long patterns, the -slots option can be used to reduce memory use.
</span>
</td>
</tr>
<tr>
<td align="right" class="c5" valign="top" width="112">
<span class="ManText">
-alpha
</span>
</td>
<td align="center" class="c5" valign="top" width="29">
<span class="ManText">
string
</span>
</td>
<td align="left" class="c5" valign="top">
<span class="ManText">
Alphabet. Either nt (nucleotide), aa (20-letter amino acid alphabet), or a
<a href="DELETE_URL">
compressed amino acid alphabet
</a>
expressed as a string containing the 20 standard letters with groups separated by commas. Default for protein
<a href="DELETE_URL">
ublast
</a>
is the 10-group alphabet A,KR,DENQ,C,G,H,ILVM,FYW,P,ST. Other indexed search and clustering commands default to the full 20-letter alphabet but support compressed alphabets as an option.
</span>
</td>
</tr>
<tr>
<td align="right" class="c3" valign="top" width="112">
<span class="ManText">
-
<a href="DELETE_URL">
dbstep
</a>
</span>
</td>
<td align="center" class="c3" valign="top" width="29">
<span class="ManText">
N
</span>
</td>
<td align="left" class="c3" valign="top">
<span class="ManText">
Specifies that every Nth database word should be indexed. Default is N=1, meaning that all words are indexed. Similar to the stride parameter of
<a href="http://bioinformatics.oxfordjournals.org/content/24/16/1757.full">
MEGABLAST
</a>
. Setting N>1 saves memory by reducing the size of the index, roughly by a factor of N for large databases.
</span>
</td>
</tr>
<tr>
<td align="right" class="c5" valign="top" width="112">
<span class="ManText">
-
<a href="DELETE_URL">
dbaccelpct
</a>
</span>
</td>
<td align="center" class="c5" valign="top" width="29">
<span class="ManText">
N
</span>
</td>
<td align="left" class="c5" valign="top">
<span class="ManText">
Specifies an acceleration parameter in the range 0 to 100, similar to the -
<a href="DELETE_URL">
accel
</a>
parameter of
<a href="DELETE_URL">
ublast
</a>
. Expressed as an integer percentage. Usually it is more effective to use -
<a href="DELETE_URL">
accel
</a>
than -dbaccelpct, though this may depend on the database. The main advantage of -dbaccelpct is reduced memory and UDB file size. This parameter can only be used for
<a href="db_files.html">
database file
</a>
indexes; it is not valid for clustering.
</span>
</td>
</tr>
<tr>
<td align="right" class="c3" valign="top" width="112">
<span class="ManText">
-
<a href="opt_dbmask.html">
dbmask
</a>
</span>
</td>
<td align="center" class="c3" valign="top" width="29">
<span class="ManText">
method
</span>
</td>
<td align="left" class="c3" valign="top">
<span class="ManText">
See
<a href="masking_options.html">
masking options
</a>
for supported methods. A word with one or more masked letters is not indexed. Default is fastnucleo or fastamino.
</span>
</td>
</tr>
<tr>
<td align="right" class="c5" valign="top" width="112">
<span class="ManText">
-slots
</span>
</td>
<td align="center" class="c5" valign="top" width="29">
<span class="ManText">
N
</span>
</td>
<td align="left" class="c5" valign="top">
<span class="ManText">
Use a hashed index with the given number of slots (table entries at the top level of the index). It is recommended to use a
<a href="http://en.wikipedia.org/wiki/Prime_number">
prime number
</a>
as this reduces the frequency of hash collisions. Each slot requires a minimum of several bytes, even if the word corresponding to that slot is not found in the database. By default, if the alphabet size is A and the effective word length is w, the index has A
<sup>
w
</sup>
slots. This is the fastest way to do a word lookup, but can use too much memory for long word lengths. For example, a word length of 16 for proteins would require about 10
<sup>
21
</sup>
slots. A
<a href="http://en.wikipedia.org/wiki/Hash_table">
hash table
</a>
index can save memory by using fewer slots, enabling longer word lengths to be used. Index operations become somewhat slower, though the difference in overall search speed is often negligible.
</span>
</td>
</tr>
</table>
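<p>
<span class="ManText">
The -alpha and -dbstep descriptions in the table can be sketched in Python as follows. This is an illustration only: the helper names parse_alpha, compress and indexed_positions are hypothetical, and the assumption that -dbstep N selects positions 0, N, 2N and so on is made for the example and is not confirmed by this page.
</span>
</p>
<pre>
```python
# Default compressed alphabet for protein ublast, as given in the table.
UBLAST_ALPHA = "A,KR,DENQ,C,G,H,ILVM,FYW,P,ST"


def parse_alpha(alpha):
    """Map each amino acid letter to the number of its group."""
    groups = alpha.split(",")
    return {letter: g for g, group in enumerate(groups) for letter in group}


def compress(seq, letter_to_group):
    """Rewrite a sequence as a list of group numbers."""
    return [letter_to_group[c] for c in seq]


def indexed_positions(seq_len, word_span, dbstep=1):
    """Start positions of indexed words when every dbstep-th word is kept."""
    return list(range(0, seq_len - word_span + 1, dbstep))


groups = parse_alpha(UBLAST_ALPHA)
print(len(groups), len(set(groups.values())))
print(compress("MKVLAAGIVG", groups))
print(len(indexed_positions(1000, 8)), len(indexed_positions(1000, 8, dbstep=4)))
```
</pre>
<p>
<span class="ManText">
Letters in the same group become indistinguishable in the index, and a larger step leaves proportionally fewer indexed words, which is where the memory saving comes from.
</span>
</p>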
</div>
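<p>
<span class="ManText">
The slot arithmetic described for -slots can be checked with a short Python sketch. The functions direct_slots, word_code and hashed_slot are hypothetical names, and taking the word code modulo the number of slots is only an assumed hash function chosen for illustration; the hash used by USEARCH is not described on this page. The value 65521 is simply an example of a prime slot count.
</span>
</p>
<pre>
```python
def direct_slots(alphabet_size, word_length):
    """Slots in the default (non-hashed) index: A to the power w."""
    return alphabet_size ** word_length


def word_code(word, alphabet):
    """Encode a word as a base-A integer, where A = len(alphabet)."""
    code = 0
    for letter in word:
        code = code * len(alphabet) + alphabet.index(letter)
    return code


def hashed_slot(word, alphabet, slots):
    """Illustrative hash: the word code modulo the slot count."""
    return word_code(word, alphabet) % slots


NT = "ACGT"
print(direct_slots(4, 8))      # nucleotide words of length 8
print(direct_slots(20, 16))    # amino acid words of length 16
print(hashed_slot("ACGTACGTACGTACGT", NT, 65521))
```
</pre>
<p>
<span class="ManText">
With a hashed index, different words can map to the same slot, so a lookup has to resolve such collisions; this is the extra work that makes index operations somewhat slower.
</span>
</p>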
</div>
</div>
</body>
</html>