vg
tools for working with variation graphs
|
#include <stream_index.hpp>
Public Types | |
using | bin_t = make_unsigned< id_t >::type |
using | window_t = make_unsigned< id_t >::type |
Public Member Functions | |
StreamIndexBase ()=default | |
void | load (istream &from) |
void | save (ostream &to) const |
Save an index to a file. More... | |
vector< pair< int64_t, int64_t > > | find (id_t node_id) const |
void | find (id_t node_id, const function< bool(int64_t, int64_t)> scan_callback) const |
void | find (id_t min_node, id_t max_node, const function< bool(int64_t, int64_t)> scan_callback) const |
void | scan_backward (const function< bool(int64_t, int64_t)> scan_callback) const |
void | add_group (id_t min_id, id_t max_id, int64_t virtual_start, int64_t virtual_past_end) |
bool | used_bins_of_range (id_t min_id, id_t max_id, const function< bool(bin_t)> &iteratee) const |
Static Public Member Functions | |
static BitString | bin_to_prefix (bin_t bin) |
Get the ID prefix bits corresponding to a bin. More... | |
static BitString | id_to_prefix (id_t id) |
Get the given ID as a bit string. More... | |
static bin_t | common_bin (id_t a, id_t b) |
Get the most specific bin that contains both of the given node IDs. More... | |
static window_t | window_of_id (id_t id) |
Static Public Attributes | |
const static uint32_t | MAX_INPUT_VERSION = 1 |
What's the maximum index version number we can read with this code? More... | |
const static uint32_t | OUTPUT_VERSION = 1 |
What's the version we serialize? More... | |
const static string | MAGIC_BYTES = "GAI!" |
Static Protected Member Functions | |
static bool | is_in_range (const vector< pair< id_t, id_t >> &ranges, id_t id) |
Protected Attributes | |
unordered_map< bin_t, vector< pair< int64_t, int64_t > > > | bin_to_ranges |
BitStringTree< bin_t > | bins_by_id_prefix |
map< window_t, int64_t > | window_to_start |
id_t | last_group_min_id = numeric_limits<id_t>::min() |
Static Protected Attributes | |
const static size_t | WINDOW_SHIFT = 8 |
Private Member Functions | |
StreamIndexBase (const StreamIndexBase &other)=delete | |
StreamIndexBase & | operator= (const StreamIndexBase &other)=delete |
An index for a node-ID-sorted VPKG-formatted Protobuf file, such as GAM or VG.
Works on a BAI-like concept of bins partitioning node ID space.
Files are serialized as count-prefixed groups of Protobuf messages. These groups are the smallest unit that can be deserialized.
Every group of messages gets assigned to a bin which is the longest bin that completely contains the ID range used in the group.
We define runs of adjacent groups which have the same bin, which are the basic subject of the index.
We then store an index from bin to the virtual offset ranges (start and past-the-end), in order, of runs that are assigned to the bin.
You will get non-contiguous virtual offset ranges for a node ID range when some messages run into the range from the left, then messages that start later don't, and then messages that start even later do again.
We also have a BAI-style linear index, mapping from tiling windows in node ID space to the lowest virtual offset of a group that overlaps the window.
The bin structure is that we partition all of node ID space into bins of power-of-2 size, starting with size 2 nodes. We number the bins such that 0 is the whole-ID-space bin, divided into 1 and 2, then into 3, 4, 5, and 6, then into 7, 8, 9, 10, 11, 12, 13, and 14, and so on.
The tiling windows are just the node IDs down-shifted by a few bits.
Messages that use no nodes (i.e. unmapped reads) are considered to visit node ID 0. The maximum and minimum id_t values are used as sentinels, so they can't be real nodes.
All find operations are thread-safe with respect to each other. Simultaneous adds or finds and adds are prohibited.
Most of the basic index API doesn't depend on the message type. So we put it in this base class and inherit form it in templates that provide the high-level interface in terms of message instances.
using vg::StreamIndexBase::bin_t = make_unsigned<id_t>::type |
using vg::StreamIndexBase::window_t = make_unsigned<id_t>::type |
|
default |
|
privatedelete |
auto vg::StreamIndexBase::add_group | ( | id_t | min_id, |
id_t | max_id, | ||
int64_t | virtual_start, | ||
int64_t | virtual_past_end | ||
) |
Add a group into the index, based on its minimum and maximum (inclusive) used node IDs. Must be called for all groups in virtual offset order.
|
static |
Get the ID prefix bits corresponding to a bin.
Subtract out the offset to get the bin index, and use it as a prefix with the appropriate number of bits based on what the offset was.
Get the most specific bin that contains both of the given node IDs.
auto vg::StreamIndexBase::find | ( | id_t | min_node, |
id_t | max_node, | ||
const function< bool(int64_t, int64_t)> | scan_callback | ||
) | const |
Find all the ranges of run virtual offsets to check for reads visiting the given inclusive node ID range. Relies on a scanning callback, which will be called repeatedly with the start and past-the-end virtual offsets of runs which may contain groups touching the given node ID. When called, the callback should scan the run and return either true if it wants the next run, or false if it encountered a group with an out-of-range start and wants to stop iteration. Runs will be emitted in order, and truncated on the left to either the appropriate lower bound from the linear index, or the past-the-end of the previous run scanned.
auto vg::StreamIndexBase::find | ( | id_t | node_id | ) | const |
Find all the ranges of run virtual offsets from the first position that might be relevant for the given node ID to the ends of all the bins it is in. Trims ranges by the linear index on the low end, and returns a series of potentially abutting but non-overlapping virtual offset ranges. Does not stop early (because it has no access to the actual reads to tell when it should stop looking at runs in a bin). So you will get ranges covering all runs in a bin that follow the runs you are interested in as well.
auto vg::StreamIndexBase::find | ( | id_t | node_id, |
const function< bool(int64_t, int64_t)> | scan_callback | ||
) | const |
Find all the ranges of run virtual offsets to check for reads visiting the given node ID. Relies on a scanning callback, which will be called repeatedly with the start and past-the-end virtual offsets of runs which may contain groups touching the given node ID. When called, the callback should scan the run and return either true if it wants the next run, or false if it encountered a group with an out-of-range start and wants to stop iteration. Runs will be emitted in order, and truncated on the left to either the appropriate lower bound from the linear index, or the past-the-end of the previous run scanned (which should be moot, because runs should not overlap in the index).
|
static |
Get the given ID as a bit string.
|
staticprotected |
Return true if the given ID is in any of the sorted, coalesced, inclusive ranges in the vector, and false otherwise. TODO: Is repeated binary search on the ranges going to be better than an unordered_set of all the individual IDs?
auto vg::StreamIndexBase::load | ( | istream & | from | ) |
Load an index from a file. File holds the index, not the actual data being indexed. Index file format doesn't care what type of message is being indexed.
|
privatedelete |
auto vg::StreamIndexBase::save | ( | ostream & | to | ) | const |
Save an index to a file.
auto vg::StreamIndexBase::scan_backward | ( | const function< bool(int64_t, int64_t)> | scan_callback | ) | const |
Iterate over ranges of virtual offsets from the end of the file to the start. The ranges to not necessarily correspond to runs. The ending VO of the first range iterated may be numeric_limits<int64_t>::max(). The start VO of each range is guaranteed to be a valid VO of a group. The past-end VO may be the start of the previously iterated range. We have no scan_forward because you can just do that with a cursor. Stops when the callback returns false.
auto vg::StreamIndexBase::used_bins_of_range | ( | id_t | min_id, |
id_t | max_id, | ||
const function< bool(bin_t)> & | iteratee | ||
) | const |
Iterate over the populated bins in the index, in in-order bin tree traversal order, that any of the node IDs in the given inclusive range occur in. Returns false if asked to stop.
|
static |
Get the linear index window that the given node ID falls in. The window range for a group is its min nodes' window through its max node's window.
|
protected |
Maps from bin number to all the ranges of virtual offsets, in order, for runs that land in the given bin. A run lands in a bin if that bin is the most specific bin that includes both its lowest and highest nodes it uses.
|
protected |
Maps from the bit string representing the prefix of node IDs that a bin matches to the bin's bin number. Only contains entries for nonempty bins.
What was the minimum node ID of the last group added? If this isn't strictly increasing, we're trying to index data that is not sorted.
|
static |
What magic value do we embed in the compressed index data? TODO: Make this depend on type of message being indexed so we can't mix up index files.
|
static |
What's the maximum index version number we can read with this code?
|
static |
What's the version we serialize?
|
staticprotected |
|
protected |
Maps from linear index window to the virtual offset of the first group that overlaps that window (taking the group as a min-to-max node range). If you are looking for reads that visit a node, they can't possibly occur in a group before the first offset stored for the node's window (or any greater window). TODO: Should we make this a vector instead and hope nobody uses high/sparse node IDs?