Thursday, September 19, 2013

Hadoop – Play with replication factor


HDFS stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. Block size and replication factor are per-file properties: an application can specify the number of replicas of a file at creation time, and the replication factor can be changed later.

In Cloudera Manager (CM), you can set “dfs.replication”, which controls the default block replication: the number of replicas made when a file is created, used whenever no replication factor is specified explicitly. The default value is 3. Changing this value does not affect the replication factor of existing blocks/files. Also, after you change “dfs.replication”, make sure you redeploy the client configuration, since this is a client-side property.
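If you manage hdfs-site.xml by hand instead of through CM, the same default can be set there. A minimal sketch (the value 2 is just an example, not a recommendation):

```xml
<!-- hdfs-site.xml: default replication factor for newly created files -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Because the property is read by the client at file-creation time, the updated file must be distributed to every client machine, not just the NameNode.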

To update an existing file's replication factor, follow these steps. First, check the current factor (the third column of the listing):
$ hadoop fs -ls /user/username/input/file0
Found 1 items
-rw-r--r--   3 username supergroup         22 2013-05-30 12:11 /user/username/input/file0

$ hadoop fs -setrep -w 2 /user/username/input/file0
Replication 2 set: /user/username/input/file0
Waiting for /user/username/input/file0 .... done

$ hadoop fs -ls /user/username/input/file0
Found 1 items
-rw-r--r--   2 username supergroup         22 2013-05-30 12:11 /user/username/input/file0

You can also do this recursively. For example, to change the replication factor of the entire HDFS namespace to 1:
$ hadoop fs -setrep -R -w 1 /
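Replication multiplies raw disk usage, so lowering it reclaims space proportionally. A back-of-the-envelope sketch (the 10 GiB figure is a made-up example, not from any real cluster):

```shell
# Raw HDFS space consumed = logical file size x replication factor.
LOGICAL_BYTES=$((10 * 1024 * 1024 * 1024))      # hypothetical 10 GiB of data
echo "rep=3 uses $((LOGICAL_BYTES * 3)) raw bytes"
echo "rep=1 uses $((LOGICAL_BYTES * 1)) raw bytes"
# Dropping from 3 to 1 frees two thirds of the raw space -- but leaves
# no redundancy: losing a single DataNode then loses blocks permanently.
```

In other words, replication factor 1 should only be used for data you can afford to lose or regenerate.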


Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next heartbeat transfers this information to the DataNode, which then removes the corresponding blocks, and the freed space appears in the cluster. Note that there may be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
