vg
tools for working with variation graphs
Public Types | Public Member Functions | Static Public Member Functions | Static Public Attributes | Static Protected Member Functions | Protected Attributes | Static Protected Attributes | Private Member Functions | List of all members
vg::StreamIndexBase Class Reference

#include <stream_index.hpp>

Inheritance diagram for vg::StreamIndexBase:
vg::StreamIndex< vg::algorithms::Graph > vg::StreamIndex< Message >

Public Types

using bin_t = make_unsigned< id_t >::type
 
using window_t = make_unsigned< id_t >::type
 

Public Member Functions

 StreamIndexBase ()=default
 
void load (istream &from)
 
void save (ostream &to) const
 Save an index to a file. More...
 
vector< pair< int64_t, int64_t > > find (id_t node_id) const
 
void find (id_t node_id, const function< bool(int64_t, int64_t)> scan_callback) const
 
void find (id_t min_node, id_t max_node, const function< bool(int64_t, int64_t)> scan_callback) const
 
void scan_backward (const function< bool(int64_t, int64_t)> scan_callback) const
 
void add_group (id_t min_id, id_t max_id, int64_t virtual_start, int64_t virtual_past_end)
 
bool used_bins_of_range (id_t min_id, id_t max_id, const function< bool(bin_t)> &iteratee) const
 

Static Public Member Functions

static BitString bin_to_prefix (bin_t bin)
 Get the ID prefix bits corresponding to a bin. More...
 
static BitString id_to_prefix (id_t id)
 Get the given ID as a bit string. More...
 
static bin_t common_bin (id_t a, id_t b)
 Get the most specific bin that contains both of the given node IDs. More...
 
static window_t window_of_id (id_t id)
 

Static Public Attributes

const static uint32_t MAX_INPUT_VERSION = 1
 What's the maximum index version number we can read with this code? More...
 
const static uint32_t OUTPUT_VERSION = 1
 What's the version we serialize? More...
 
const static string MAGIC_BYTES = "GAI!"
 

Static Protected Member Functions

static bool is_in_range (const vector< pair< id_t, id_t >> &ranges, id_t id)
 

Protected Attributes

unordered_map< bin_t, vector< pair< int64_t, int64_t > > > bin_to_ranges
 
BitStringTree< bin_tbins_by_id_prefix
 
map< window_t, int64_t > window_to_start
 
id_t last_group_min_id = numeric_limits<id_t>::min()
 

Static Protected Attributes

const static size_t WINDOW_SHIFT = 8
 

Private Member Functions

 StreamIndexBase (const StreamIndexBase &other)=delete
 
StreamIndexBaseoperator= (const StreamIndexBase &other)=delete
 

Detailed Description

An index for a node-ID-sorted VPKG-formatted Protobuf file, such as GAM or VG.

Works on a BAI-like concept of bins partitioning node ID space.

Files are serialized as count-prefixed groups of Protobuf messages. These groups are the smallest unit that can be deserialized.

Every group of messages gets assigned to a bin which is the longest bin that completely contains the ID range used in the group.

We define runs of adjacent groups which have the same bin, which are the basic subject of the index.

We then store an index from bin to the virtual offset ranges (start and past-the-end), in order, of runs that are assigned to the bin.

You will get non-contiguous virtual offset ranges for a node ID range when some messages run into the range from the left, then messages that start later don't, and then messages that start even later do again.

We also have a BAI-style linear index, mapping from tiling windows in node ID space to the lowest virtual offset of a group that overlaps the window.

The bin structure is that we partition all of node ID space into bins of power-of-2 size, starting with size 2 nodes. We number the bins such that 0 is the whole-ID-space bin, divided into 1 and 2, then into 3, 4, 5, and 6, then into 7, 8, 9, 10, 11, 12, 13, and 14, and so on.

The tiling windows are just the node IDs down-shifted by a few bits.

Messages that use no nodes (i.e. unmapped reads) are considered to visit node ID 0. The maximum and minimum id_t values are used as sentinels, so they can't be real nodes.

All find operations are thread-safe with respect to each other. Simultaneous adds or finds and adds are prohibited.

Most of the basic index API doesn't depend on the message type. So we put it in this base class and inherit form it in templates that provide the high-level interface in terms of message instances.

Member Typedef Documentation

◆ bin_t

using vg::StreamIndexBase::bin_t = make_unsigned<id_t>::type

◆ window_t

using vg::StreamIndexBase::window_t = make_unsigned<id_t>::type

Constructor & Destructor Documentation

◆ StreamIndexBase() [1/2]

vg::StreamIndexBase::StreamIndexBase ( )
default

◆ StreamIndexBase() [2/2]

vg::StreamIndexBase::StreamIndexBase ( const StreamIndexBase other)
privatedelete

Member Function Documentation

◆ add_group()

auto vg::StreamIndexBase::add_group ( id_t  min_id,
id_t  max_id,
int64_t  virtual_start,
int64_t  virtual_past_end 
)

Add a group into the index, based on its minimum and maximum (inclusive) used node IDs. Must be called for all groups in virtual offset order.

◆ bin_to_prefix()

auto vg::StreamIndexBase::bin_to_prefix ( bin_t  bin)
static

Get the ID prefix bits corresponding to a bin.

Subtract out the offset to get the bin index, and use it as a prefix with the appropriate number of bits based on what the offset was.

◆ common_bin()

auto vg::StreamIndexBase::common_bin ( id_t  a,
id_t  b 
)
static

Get the most specific bin that contains both of the given node IDs.

◆ find() [1/3]

auto vg::StreamIndexBase::find ( id_t  min_node,
id_t  max_node,
const function< bool(int64_t, int64_t)>  scan_callback 
) const

Find all the ranges of run virtual offsets to check for reads visiting the given inclusive node ID range. Relies on a scanning callback, which will be called repeatedly with the start and past-the-end virtual offsets of runs which may contain groups touching the given node ID. When called, the callback should scan the run and return either true if it wants the next run, or false if it encountered a group with an out-of-range start and wants to stop iteration. Runs will be emitted in order, and truncated on the left to either the appropriate lower bound from the linear index, or the past-the-end of the previous run scanned.

◆ find() [2/3]

auto vg::StreamIndexBase::find ( id_t  node_id) const

Find all the ranges of run virtual offsets from the first position that might be relevant for the given node ID to the ends of all the bins it is in. Trims ranges by the linear index on the low end, and returns a series of potentially abutting but non-overlapping virtual offset ranges. Does not stop early (because it has no access to the actual reads to tell when it should stop looking at runs in a bin). So you will get ranges covering all runs in a bin that follow the runs you are interested in as well.

◆ find() [3/3]

auto vg::StreamIndexBase::find ( id_t  node_id,
const function< bool(int64_t, int64_t)>  scan_callback 
) const

Find all the ranges of run virtual offsets to check for reads visiting the given node ID. Relies on a scanning callback, which will be called repeatedly with the start and past-the-end virtual offsets of runs which may contain groups touching the given node ID. When called, the callback should scan the run and return either true if it wants the next run, or false if it encountered a group with an out-of-range start and wants to stop iteration. Runs will be emitted in order, and truncated on the left to either the appropriate lower bound from the linear index, or the past-the-end of the previous run scanned (which should be moot, because runs should not overlap in the index).

◆ id_to_prefix()

auto vg::StreamIndexBase::id_to_prefix ( id_t  id)
static

Get the given ID as a bit string.

◆ is_in_range()

auto vg::StreamIndexBase::is_in_range ( const vector< pair< id_t, id_t >> &  ranges,
id_t  id 
)
staticprotected

Return true if the given ID is in any of the sorted, coalesced, inclusive ranges in the vector, and false otherwise. TODO: Is repeated binary search on the ranges going to be better than an unordered_set of all the individual IDs?

◆ load()

auto vg::StreamIndexBase::load ( istream &  from)

Load an index from a file. File holds the index, not the actual data being indexed. Index file format doesn't care what type of message is being indexed.

◆ operator=()

StreamIndexBase& vg::StreamIndexBase::operator= ( const StreamIndexBase other)
privatedelete

◆ save()

auto vg::StreamIndexBase::save ( ostream &  to) const

Save an index to a file.

◆ scan_backward()

auto vg::StreamIndexBase::scan_backward ( const function< bool(int64_t, int64_t)>  scan_callback) const

Iterate over ranges of virtual offsets from the end of the file to the start. The ranges to not necessarily correspond to runs. The ending VO of the first range iterated may be numeric_limits<int64_t>::max(). The start VO of each range is guaranteed to be a valid VO of a group. The past-end VO may be the start of the previously iterated range. We have no scan_forward because you can just do that with a cursor. Stops when the callback returns false.

◆ used_bins_of_range()

auto vg::StreamIndexBase::used_bins_of_range ( id_t  min_id,
id_t  max_id,
const function< bool(bin_t)> &  iteratee 
) const

Iterate over the populated bins in the index, in in-order bin tree traversal order, that any of the node IDs in the given inclusive range occur in. Returns false if asked to stop.

◆ window_of_id()

auto vg::StreamIndexBase::window_of_id ( id_t  id)
static

Get the linear index window that the given node ID falls in. The window range for a group is its min nodes' window through its max node's window.

Member Data Documentation

◆ bin_to_ranges

unordered_map<bin_t, vector<pair<int64_t, int64_t> > > vg::StreamIndexBase::bin_to_ranges
protected

Maps from bin number to all the ranges of virtual offsets, in order, for runs that land in the given bin. A run lands in a bin if that bin is the most specific bin that includes both its lowest and highest nodes it uses.

◆ bins_by_id_prefix

BitStringTree<bin_t> vg::StreamIndexBase::bins_by_id_prefix
protected

Maps from the bit string representing the prefix of node IDs that a bin matches to the bin's bin number. Only contains entries for nonempty bins.

◆ last_group_min_id

id_t vg::StreamIndexBase::last_group_min_id = numeric_limits<id_t>::min()
protected

What was the minimum node ID of the last group added? If this isn't strictly increasing, we're trying to index data that is not sorted.

◆ MAGIC_BYTES

const string vg::StreamIndexBase::MAGIC_BYTES = "GAI!"
static

What magic value do we embed in the compressed index data? TODO: Make this depend on type of message being indexed so we can't mix up index files.

◆ MAX_INPUT_VERSION

const static uint32_t vg::StreamIndexBase::MAX_INPUT_VERSION = 1
static

What's the maximum index version number we can read with this code?

◆ OUTPUT_VERSION

const static uint32_t vg::StreamIndexBase::OUTPUT_VERSION = 1
static

What's the version we serialize?

◆ WINDOW_SHIFT

const static size_t vg::StreamIndexBase::WINDOW_SHIFT = 8
staticprotected

◆ window_to_start

map<window_t, int64_t> vg::StreamIndexBase::window_to_start
protected

Maps from linear index window to the virtual offset of the first group that overlaps that window (taking the group as a min-to-max node range). If you are looking for reads that visit a node, they can't possibly occur in a group before the first offset stored for the node's window (or any greater window). TODO: Should we make this a vector instead and hope nobody uses high/sparse node IDs?


The documentation for this class was generated from the following files: