Explain HDFS Concepts in detail

Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. HDFS has the concept of a block, but it is a much larger unit—64 MB by default. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. Simplicity is something to strive for in all systems, but it is especially important for a distributed system in which the failure modes are so varied. The storage subsystem deals with blocks, simplifying storage management and eliminating metadata concerns.
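To make this concrete, the block size HDFS records for a file can be read back through the Java FileSystem API. The sketch below is illustrative rather than part of the original text: the class name and path are hypothetical, and it assumes fs.defaultFS points at a reachable cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path p = new Path("/user/sample.txt");         // hypothetical path
            FileStatus status = fs.getFileStatus(p);

            // Block size is recorded per file by the name node, so a file written
            // with a non-default block size reports its own value here.
            System.out.println("Block size of " + p + ": " + status.getBlockSize() + " bytes");
            System.out.println("Default block size: " + fs.getDefaultBlockSize(p) + " bytes");
        }
    }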
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a name node (the master) and a number of data nodes (workers). The name node manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree.
The name node also knows the data nodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the data nodes when the system starts.
A client accesses the file system on behalf of the user by communicating with the name node and the data nodes. Data nodes are the workhorses of the file system: they store and retrieve blocks when they are told to (by clients or the name node), and they report back to the name node periodically with lists of the blocks they are storing. Because the name node's metadata is essential to the operation of the cluster, Hadoop can be configured so that the name node writes its persistent state to multiple file systems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as to a remote NFS mount.
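The split between metadata and data traffic can be seen from the client's point of view. The following minimal sketch, with hypothetical paths and the assumption of a reachable cluster, lists a directory (a metadata operation answered by the name node) and then reads a file (whose bytes are streamed from the data nodes holding its blocks).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListAndRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Metadata-only operation: served by the name node from its in-memory namespace.
            for (FileStatus s : fs.listStatus(new Path("/user"))) {
                System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
            }

            // Data operation: the client asks the name node for block locations,
            // then pulls the block contents directly from the data nodes.
            try (FSDataInputStream in = fs.open(new Path("/user/example.txt"))) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("Read " + n + " bytes");
            }
        }
    }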
It is also possible to run a secondary name node, which despite its name does not act as a name node. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary name node usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the name node to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the name node failing.
HDFS Federation
The name node keeps a reference to every file and block in the file system in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding name nodes, each of which manages a portion of the file system namespace. For example, one name node might manage all the files rooted under /user, and a second name node might handle files under /share. Under federation, each name node manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace. Namespace volumes are independent of each other, which means the name nodes do not communicate with one another, and the failure of one name node does not affect the availability of the namespaces managed by the others.
Block pool storage is not partitioned, however, so data nodes register with each name node in the cluster and store blocks from multiple block pools.
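From a client's perspective, each federated name node is simply addressed by its own URI. The sketch below is an assumption-laden illustration: the host names nn1.example.com and nn2.example.com and the paths are hypothetical.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FederatedClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Each name node manages its own namespace volume, so the client
            // addresses whichever one is responsible for the part of the tree it needs.
            FileSystem userFs  = FileSystem.get(URI.create("hdfs://nn1.example.com:8020"), conf);
            FileSystem shareFs = FileSystem.get(URI.create("hdfs://nn2.example.com:8020"), conf);

            System.out.println(userFs.exists(new Path("/user")));
            System.out.println(shareFs.exists(new Path("/share")));
        }
    }

In practice a client-side mount table (viewfs) is typically layered on top, so that applications see a single unified namespace rather than per-name-node URIs.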
HDFS High-Availability
The combination of replicating name node metadata on multiple file systems and using the secondary name node to create checkpoints protects against data loss, but it does not provide high availability of the file system. The name node is still a single point of failure (SPOF): if it failed, all clients—including MapReduce jobs—would be unable to read, write, or list files, because the name node is the sole repository of the metadata and the file-to-block mapping. In such an event the whole Hadoop system would effectively be out of service until a new name node could be brought online. HDFS High-Availability (HA) remedies this by running a pair of name nodes in an active-standby configuration. In the event of the failure of the active name node, the standby takes over its duties and continues servicing client requests without a significant interruption.
A few architectural changes are needed to allow this to happen:
- The name nodes must use highly-available shared storage to share the edit log. When a standby name node comes up it reads up to the end of the shared edit log to synchronize its state with the active name node, and then continues to read new entries as they are written by the active name node.
- Data nodes must send block reports to both name nodes since the block mappings are stored in a name node’s memory, and not on disk.
- Clients must be configured to handle name node failover, using a mechanism that is transparent to users.
If the active name node fails, the standby can take over very quickly, since it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The actual observed failover time will be longer in practice (around a minute or so), because the system needs to be conservative in deciding that the active name node has failed.
Failover and fencing
The transition from the active name node to the standby is managed by a new entity in the system called the failover controller. Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one name node is active. Each name node runs a lightweight failover controller process whose job it is to monitor its name node for failures and trigger a failover should the name node fail.
Failover may also be initiated manually by an administrator, in the case of routine maintenance, for example. This is known as a graceful failover, since the failover controller arranges an orderly transition for the two name nodes to switch roles.
In the case of an ungraceful failover, however, it is impossible to be sure that the failed name node has stopped running. The HA implementation goes to great lengths to ensure that the previously active name node is prevented from doing any damage and causing corruption—a method known as fencing. The system employs a range of fencing mechanisms, including killing the name node’s process, revoking its access to the shared storage directory, and disabling its network port via a remote management command. As a last resort, the previously active name node can be fenced with a technique rather graphically known as STONITH, or “shoot the other node in the head”, which uses a specialized power distribution unit to forcibly power down the host machine.
Client failover is handled transparently by the client library. The simplest implementation uses client-side configuration to control failover. The HDFS URI uses a logical hostname which is mapped to a pair of name node addresses, and the client library tries each name node address until the operation succeeds.
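As an illustration of the logical-hostname approach, the sketch below wires the client-side failover settings up in code; the nameservice ID "mycluster" and the name node host names are hypothetical, and in a real deployment these properties would normally live in hdfs-site.xml rather than being set programmatically.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // Proxy provider that retries the operation against the other name node on failover.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // The URI names the logical cluster, not a physical name node, so the
            // client library can switch name nodes without the application noticing.
            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
            System.out.println(fs.exists(new Path("/")));
        }
    }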