In this chapter we will be going through how database indexes work and how they are implemented.

## What is an Index?

- An index is a data structure to speed up retrieval of data records based on some search key.
- A search key is a sequence of
`k`

data attributes.- It can be the primary key of the database
- It can also be the composite key of the database

- Indexes are also stored at files on the computer.
- A unique index is an index where the search key is unique.
- Otherwise it is a non-unique index.

- Indexes are stored as file
- Records in an index file are referred as data entries

## Types of Indexes

In databases, there are 2 main types of database indexes:

- Tree-based index
- These are based on the sorting of search trees
- IE: Binary Search Trees, B+ Trees, etc

- Hash-based index
- Data entries are accessed using hash functions
- IE: Static hashing, extendible hashing, linear hashing, etc

### How to choose an index

Considerations | Description |
---|---|

Search Performance | How fast can we find a record in the index for a given query (Can be ranged or equality search |

Storage overhead | How much space does the index take up in the disk |

Update performance | How much does it cost to update a record in the index |

## Format of data entries

In a database index, there are various ways to store data entries.

Format | Description | Example |
---|---|---|

Format 1 | The actual record with the search key is stored at the leaf pages | `(k, Record data)` |

Format 2 | The records are stored in tuples of `(k, RID)` at the leaf pages | `(k, RID)` |

Format 3 | The records are stored as tuples of (k, List of RIDs). All entries with the key `k` will be stored together | `(k, List of RIDs)` |

**Note**

- For format 1, finding the relevant leaf pages is equivalent to finding the actual records.
- In formats 2 & 3, there is 1 additional lookup to find the actual page containing the record if it is required.

## B+ Tree Index

Some terminologies for a B+ Tree

- Leaf nodes: Store sorted data entries
`k*`

denote a data entry in the form of (k, RID)`k`

= search key value corresponding data record`RID`

=`RID`

of corresponding data record

- Leaf nodes are doubly linked
- Internal nodes: Store sorted pointers to child nodes
- In the form of (
`p1`

,`k1`

,`p2`

,`k2`

,……,`pn`

,`kn`

)- Where
`k1`

<`k2`

<`k3`

, etc `pi`

= disk page address (Root node of subtree)

- Where

- In the form of (
- For each data entry in the B+ tree, the value of
`k`

must be between the 2 keys in the parent node adjacent to it. - Each
`(ki, pi)`

is an index entry,`ki`

serves as a separator between 2 pointers between the contents pointed by`pi-1`

and`pi`

It has the following properties

- Dynamic structure, it adapts to data updates gracefully.
- Height-balanced index structure
- Order of index tree,
`d`

is a subset of positive integers`d`

controls the space utilization of index nodes- Each non-root node contains
`m`

entries where`d <= m <= 2d`

- The root node contains
`m`

entries where`1 <= m <= 2d`

If you find the above definition confusing, here is a diagram of a B+ tree:

In the diagram you can see:

`d`

value is 1- So each internal node can have a maximum of 2 entries
- The root can have less but currently does not.

- Each internal node (root + 1 layer below root) consists of 2 numbers (
`k`

values) and 3 pointers (`p`

values). - Each data entry in the leaf node lies between the 2 keys in the parent node adjacent to it.
- IE:
`0007`

is between the node`0007`

and`0008`

.

- IE:
- For nodes at the edge, the missing k value is assumed to be -infinity or +infinity or inferred from further parents.
- IE:
`0006`

is between -infinity and`0007`

,`0017`

is between`0017`

and +infinity. - IE:
`0008`

is between`0008`

and`0009`

and`0011`

is between`0011`

and`0012`

- IE:

To visualize it yourself, visit B+ Tree Visualization

## B+ Tree Index Search

There are 2 types of searches which are supported in a B+ Tree index.

- Equality search
- Ranged search

### Equality queries

To search for a record in a B+ tree, we need to do the following:

- Start at the root node
- Find the largest key value that is smaller than the search key,
`ki`

- If the search key exists
- search the subtree at pointer
`pi`

- otherwise search the subtree at
`p0`

- search the subtree at pointer

- If the search key exists
- We can repeat this recursively to find the value we want.

### Ranged queries

To search for a range of records in a B+ tree, we need to do the following:

- Start at the root node
- Find the largest key value that is smaller than the search key,
`ki`

- If the search key exists
- search the subtree at pointer
`pi`

- otherwise search the subtree at
`p0`

- search the subtree at pointer

- If the search key exists
- When we arrive at the leaf node
- We collect all values and follow the pages until the first record that is larger than the search key is found

## Insertion into B+ Tree

If we want to insert a value `(k, RID)`

into the B+ Tree, we need to do the following:

- Find the leaf node (
`l`

) where the value should be inserted. - If the leaf node is not full (IE:
`m < 2d`

)- Insert the value into the leaf node
- Done

- Otherwise
- Allocate a new leaf node (
`n*`

), - Put the last
`d+1`

entries of`l`

into`n*`

- Update the sibling pointers in the parent node (Parent of root is null)
- Update the pointers of
`l`

and`n*`

to point to each other

- Allocate a new leaf node (

To see a detailed visualization, visit B+ Tree Visualization

There are optimizations that can be done to reduce the number of disk accesses.

We can check the page to the left and right of the current page.

If there is space, we can move 1 entry over and update the pointer between the 2 pages in the parent node.

## B+ Tree deletion

To delete a value `(k, RID)`

from the B+ Tree, we need to do the following:

- Find the leaf node (
`l`

) where the value should be deleted. - Remove the entry
- If the current leaf does not fall below the minimum threshold (IE:
`m >= d`

)- Done

- Otherwise
- If a sibling has additional nodes over the minimum threshold
- Move 1 node from the sibling to the current node and update the internal node
- done

- Otherwise
- Merge the current node with the sibling
- Update the parent node to remove the pointer to the current node

- If a sibling has additional nodes over the minimum threshold

Similar to insertion, a visualization can be found at B+ Tree Visualization which can help us better understand the process.

## Bulk loading a B+ Tree

If we have a large collection of records that we want to make a B+ tree for, we can do it in 2 ways.

### Insert records into B+ tree 1 at a time

This is the same as the insertion process we have seen before. The process is time consuming, especially for large datasets.

### Bulk loading

We can also bulk load the B+ tree.

This are the steps to bulk load a tree:

- Sort the data entries by search keys
- Load the leaf pages of B+ Tree with sorted entries
- Initialize the B+ tree with an empty root page
- For each leaf page
- Insert its index entry into the rightmost parent of leaf level page of B+ Tree

This is a video explanation for bulk loading in a B+ Tree which can help us better understand the process.