Advanced Distributed System - Big data

Big Data

Posted by UNKNOWN SPACE on Monday, November 6, 2023

In a shared nothing infrastructure, data storage, update, access and processing takes place across a collection of networked machines. To enhance the performance, processing needs to be take place in parallel, which we have to distribute data, and that’s partition. This article will gives an rough idea of partition techniques and details needs to be consider.

Partition

By definition, partition involves deciding how to assign data from a sources to distributed nodes.Take the above image as an example, we could do partition in these ways:

  1. Random or Round Robin:
    Data is assigned as it is inserted across different nodes.

    Pros Cons
    Uniform initial distribution Querying involves accessing all notes
    Easy to rebalance on update
  2. Range based Partition:
    Partitioning often assumes some form of key-value data model. Then partitioning relates to the key Key range: each partition is associated with a range of values for the key (e.g. alphabetical or numerical).

    Pros Cons
    Focused range queries on key Risk of hot spots for popular keys
    Focused direct access on key Uniform distribution not intrinsic
    Rebalancing can be expensive
    Only helps for key-based requests
  3. Hash based Partition:
    Partition based on the hash of the keys.
    Assume we have Hash(String) -> Integer, if there is n parition, partition(key) = return(Hash(key) mod n)

    Pros Cons
    Uniform distribution with certain keys Risk of hot spots for popular keys
    Focused direct access on key No support for range queries
    Rebalancing can be expensive
    Only helps for the requests that have the key
  4. Meaning based Partition:
    Partitioning by key means choosing a suitable key

    Desirable Examples
    Useful for range access requests Size
    Useful for direct access requests Name, Town
    Diverse values Id, Name, Town, Size
    Not subject to skew Id, Name, Town, Size

Repartitioning

Skew and Hot Spots

Skew is when partitioning is uneven. For example, a partitioning on the first letter of Country in the University table.
A hot spot arises when the load is unevenly balanced. For example, many more people may look up some universities than others.

Method

Repartitioning involves moving the data from one partition to another.
Example: Partition by Id, use a hash function Hash(x) = Id(x) mod #nodes How many tuples need to change node?
In this case of uniform distribution and mode-based partitioning, the number of data need to be repartitioning is:
To-Distribute = 1 - 1 / #final-nodes

In this example, the number of data need to be distribute is:
1 - 1/3 = 0.666, which is 0.666 * 6 = 4

In order to reduce the cost of repartitioning, it is possible to have more partitions than nodes. Then repartitioning for a new node involves moving selected partitions to the new node.

Two steps need to be taken to find a partition for value v:

  1. Hash v to a partition number
  2. Use a lookup table to identify the node of this partition

In this case the fraction of data need to be moved is: 1 / #nodes + 1
In the previous example the fraction is 1/4

Secondary Indexes

The Ranging and Hashing methods are better performed when key is widely used. NoSQL are always designed to support applications use Id or names as primary key. However, it’s better for us to have multiple access routes to avoid scanning. In this case we use secondary index - the index on an attribute that has not been used for partitioning.

Example: In this table, the data is partitioned by name. Does the hash index on the partition helps?

Query Does partition index help? Useful Additional Index
How many students attend UCL? Yes N/A
Which Uni have > 40,000 students No Range index on size
Which Uni are in Manchester No Hash or range index on town

Local Secondary Index

A key question is that where should we store these secondary index? Here are few options to do it:

  1. Store the secondary index on the same node as the data it is indexing.

    Pros Cons
    Updating a document leads to local index changes (so no distributed transactions) A lookup needs to go to every transaction, leading to many index lookups
  2. Store the secondary index partitioned by term: a single distributed index.

    Pros Cons
    Index lookups no longer need to go to all nodes (note some index lookups will have no hits) Updates to a document on a node now also lead to index updates (and thus potentially to distributed transactions)

Partitioning for Evaluation

After we talked about how to do partitioning, we should discuss about when data need partitioning. As an example, Uni-Sizes is hash partitioned on ‘Name’, Uni-Places is hash partitioned on ‘Town’, and we want these two partition to join on ‘Name’.

Alt text

We need to repartition Uni-Places as it is not partition on ‘Name’, and the steps are:

  1. Partition Uni-Places on Name to give P1’, P2’.
  2. Join(P1:Uni-Sizes, P1’: Uni-Places) on Node 1
  3. Join(P2:Uni-Sizes, P2’: Uni-Places) on Node 2