The nodes in a Hadoop cluster are interconnected through the network. Typically, one or more of the following phases of MapReduce jobs transfers data over the network:
1. Writing data: This phase occurs when the initial data is either streamed or bulk-delivered to HDFS. Data blocks of the loaded files are replicated, transferring additional data over the network.
2. Workload execution: The MapReduce algorithm is run.
a. Map phase: In the map phase of the algorithm, almost no traffic is sent over the network. The network is used at the beginning of the map phase only if an HDFS locality miss occurs (the data block is not locally available and has to be requested from another data node).
b. Shuffle phase: This is the phase of workload execution in which traffic is sent over the network, to a degree that depends on the workload. Data is transferred over the network when the output of the mappers is shuffled to the reducers.
c. Reduce phase: In this phase, almost no traffic is sent over the network because the reducers have all the data they need from the shuffle phase.
d. Output replication: MapReduce output is stored as a file in HDFS. The network is used when the blocks of the result file have to be replicated by HDFS for redundancy.
3. Reading data: This phase occurs when the final data is read from HDFS for consumption by the end application, such as a website, an indexing system, or a SQL database.
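The phases above can be turned into a back-of-the-envelope traffic estimate. The sketch below is illustrative only: the replication factor of 3 and the assumption that all map output crosses the network (the worst case; reducers co-located with mappers transfer less) are assumptions for the example, not guarantees about any particular cluster.

```python
# Rough per-job network traffic estimate by MapReduce phase.
# Assumptions (not Hadoop guarantees): replication factor 3, and the
# first replica of each block is written locally, so only the extra
# replicas cross the network.

def ingest_traffic_gb(input_gb, replication=3):
    """Writing data: (replication - 1) copies of the input cross the network."""
    return input_gb * (replication - 1)

def shuffle_traffic_gb(map_output_gb, remote_fraction=1.0):
    """Shuffle: map output pulled by reducers; remote_fraction models how
    much of it leaves the originating node (1.0 = worst case)."""
    return map_output_gb * remote_fraction

def output_replication_gb(result_gb, replication=3):
    """Output replication: as with ingest, only extra replicas cross the network."""
    return result_gb * (replication - 1)

# Hypothetical job: 1000 GB of input, 200 GB of map output, 50 GB of results.
total = (ingest_traffic_gb(1000)
         + shuffle_traffic_gb(200)
         + output_replication_gb(50))
print(f"Estimated network traffic: {total} GB")  # -> Estimated network traffic: 2300 GB
```

Note that even for this I/O-heavy hypothetical job, the write and output-replication phases dominate the shuffle, which matches the phase descriptions above.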
In addition, the network is crucial for the Hadoop control plane: the signaling and operations of HDFS and the MapReduce infrastructure.
Be sure to consider the benefits and costs of the choices available when designing a network: network architectures, network devices, resiliency, oversubscription ratios, etc. The following section discusses some of these parameters in more detail.
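One of the parameters mentioned above, the oversubscription ratio, can be computed directly: it is the total server-facing (downlink) bandwidth of an access switch divided by its total uplink bandwidth. The port counts and speeds below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Oversubscription ratio = total downlink bandwidth / total uplink bandwidth.
# Port counts and link speeds are hypothetical examples.

def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# 48 data nodes at 1 Gbps each, with 4 x 10-Gbps uplinks:
ratio = oversubscription(48, 1, 4, 10)
print(f"{ratio:.1f}:1")  # -> 1.2:1
```

A ratio near 1:1 means the uplinks can absorb all downlink traffic at once; higher ratios trade potential congestion during heavy shuffle or replication phases for lower cost.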
Impact of Network Characteristics on Job Completion Times
- A functional and resilient network is a crucial part of a good Hadoop cluster.
- However, an analysis of relative importance shows that other factors in a cluster have a greater influence on its performance than the network does.
- Nevertheless, you should consider some of the relevant network characteristics and their potential effects.
- Variations in switch and router latency have been shown to have only limited impact on cluster performance.
- From a network point of view, any latency-related optimization should start with a network-wide analysis.
- “Architecture first, and device next” is an effective strategy.
- Architectures that deliver consistently lower latency at scale are better than architectures with higher overall latency but lower individual device latency.
- The latency contribution at the application level, arising from the application logic (the Java Virtual Machine software stack, socket buffers, etc.), is much higher than the network latency.
- In any case, slightly more or less network latency will not noticeably affect job completion times.
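A quick latency budget makes the point above concrete. All of the figures below are illustrative assumptions (per-hop switch latency, path length, and application-stack overhead vary widely in practice); the takeaway is the ratio, not the absolute numbers.

```python
# Back-of-the-envelope latency budget: what fraction of per-transfer
# latency does the network contribute? All inputs are assumed figures.

def network_share(switch_latency_us, hops, app_stack_latency_us):
    """Fraction of total latency attributable to switching."""
    network_us = switch_latency_us * hops
    return network_us / (network_us + app_stack_latency_us)

# Assumed: 3 us per switch hop, 5 hops across the fabric,
# 500 us of JVM / socket-buffer / application overhead.
print(f"{network_share(3, 5, 500):.1%}")  # -> 2.9%
```

Even doubling the assumed per-hop switch latency leaves the network contributing only a few percent of the total, which is why shaving microseconds off individual devices rarely moves job completion times.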
Data Node Network Speed
- Data nodes must be provisioned with enough bandwidth for efficient job completion. Weigh the price-to-performance trade-off entailed in adding more bandwidth to nodes.
- Recommendations for a cluster depend on workload characteristics.
- Typical clusters are provisioned with one or two 1-Gbps uplinks per data node. Cluster management is made easier by choosing network architectures that are proven to be resilient and easy to manage and that can scale as your data grows.
- The use of 10-Gbps server access is largely dependent on the cost/performance trade-off.
- Workload characteristics and the business requirement to complete jobs within a required time will drive the adoption of 10-Gbps server connectivity.
- For example, as 10-Gbps Ethernet LAN-on-motherboard (LOM) connectors become more commonly available on servers, more clusters will likely be built with 10 Gigabit Ethernet data node uplinks.
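The cost/performance trade-off above can be framed as a simple transfer-time comparison. The sketch assumes ideal line-rate throughput (real throughput is lower due to protocol overhead) and a hypothetical per-node shuffle volume.

```python
# Ideal transfer time for a given data volume over a data node uplink.
# Assumes full line rate; real-world throughput will be lower.

def transfer_seconds(data_gb, link_gbps):
    # Convert gigabytes to gigabits, then divide by the line rate.
    return data_gb * 8 / link_gbps

shuffle_gb = 100  # hypothetical per-node shuffle volume
print(f"1 Gbps : {transfer_seconds(shuffle_gb, 1):.0f} s")   # -> 800 s
print(f"10 Gbps: {transfer_seconds(shuffle_gb, 10):.0f} s")  # -> 80 s
```

Whether that tenfold reduction in transfer time justifies the cost of 10-Gbps access depends on how network-bound the workload is and how tight the job-completion deadline is.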