Title: Hooked on Hadoop to Read And Write Sequence File Using Map Reduce
Publisher: Guru Nanak Publications
Series: Volume 7 Issue 1
Authors: Ravi Aavula, D. Saidulu, B. Harichandana
To enjoy the full benefits of a Hadoop-based big data system, the first step is to get the data into HDFS, a step known as the ingestion process. Hadoop does not work well with large numbers of small files, i.e., files smaller than a typical HDFS block, because the NameNode must hold metadata for every file in memory, so huge numbers of small files impose a significant memory overhead. In addition, each map task processes one block of data at a time, and a map task with too little data to process is inefficient; starting up many such map tasks is itself an overhead. To solve this problem, sequence files are used as containers for the small files. A sequence file is a flat file containing key/value pairs. A common pattern when designing ingestion systems is to use sequence files as containers, storing file metadata (filename, path, creation time, etc.) as the key and the file contents as the value. Sequence files are splittable: each map task processes one split containing one or more key/value pairs, and each call to the Mapper's map() method retrieves the next key and value in the corresponding split. Even when a split boundary falls in the middle of a record, the sequence file reader reads on until a sync marker is reached, ensuring that every record is read in whole.
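The container idea described above — a flat file of key/value records delimited by sync markers so that a reader starting mid-split can resynchronize on a record boundary — can be illustrated with a small, self-contained sketch. This is not Hadoop's actual binary SequenceFile format (which uses a random 16-byte sync marker, version headers, and optional compression); the marker bytes, length encoding, and function names below are simplified assumptions made purely to show the mechanism.

```python
import io
import struct

# Hypothetical fixed sync marker; real SequenceFiles use a random
# 16-byte marker recorded in the file header.
SYNC = b"\x00SYNC\x00"

def write_container(records):
    """Pack (key, value) byte-string pairs into one flat container,
    writing a sync marker before each record. The key would typically
    hold file metadata (e.g. the path) and the value the file contents."""
    buf = io.BytesIO()
    for key, value in records:
        buf.write(SYNC)
        buf.write(struct.pack(">I", len(key)))   # 4-byte key length
        buf.write(key)
        buf.write(struct.pack(">I", len(value))) # 4-byte value length
        buf.write(value)
    return buf.getvalue()

def read_from(data, offset):
    """Read every whole record at or after `offset`: first scan forward
    to the next sync marker, then decode (key, value) pairs. This mirrors
    how a split that starts mid-record is handled -- the partial record
    is skipped here and read in whole by the previous split's reader."""
    records = []
    pos = data.find(SYNC, offset)
    while pos != -1:
        pos += len(SYNC)
        (klen,) = struct.unpack(">I", data[pos:pos + 4])
        key = data[pos + 4:pos + 4 + klen]
        pos += 4 + klen
        (vlen,) = struct.unpack(">I", data[pos:pos + 4])
        value = data[pos + 4:pos + 4 + vlen]
        pos += 4 + vlen
        records.append((key, value))
        pos = data.find(SYNC, pos)  # next record begins with a marker
    return records
```

Reading from offset 0 recovers all records, while reading from an offset inside the first record skips ahead to the next sync marker and recovers only the records that start after it — the same guarantee that makes sequence file splits safe to process independently.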
Keywords: Hadoop, MapReduce, sequence files.