Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Facebook and LinkedIn currently use the Hadoop framework. Hadoop was inspired by Google's papers on MapReduce and the Google File System. Hadoop is primarily based on HDFS (the Hadoop Distributed File System) and MapReduce.
HDFS works like a typical distributed file system in which a cluster's data is spread across different nodes. A NameNode holds the metadata about the data stored on the other nodes, called DataNodes, and the NameNode and DataNodes are interconnected. After completing a given job, the DataNodes report back to the NameNode, and the end user collects the information or result through the NameNode. The DataNodes hold redundant copies (replicas) of the data, which can be used if one DataNode or another fails. What if the NameNode itself fails? How do we then access the data? Hadoop mitigates this risk by checkpointing the NameNode's metadata to a secondary NameNode (and, in later versions, by running a standby NameNode for high availability). Processing the data stored in HDFS is the job of MapReduce, described next.
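As a concrete illustration, the short Java sketch below reads a file from HDFS through the standard client API. It is only a minimal example, assuming the Hadoop client libraries are on the classpath; the NameNode address and file path are placeholders rather than values from this article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // Open a file stored in HDFS and print it line by line.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note that the client contacts the NameNode only for metadata (which blocks make up the file and which DataNodes hold them); the block contents are then streamed directly from the DataNodes.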
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. The key property of the MapReduce model is that every Map and every Reduce is independent of all the other Maps and Reduces in flight, so the operation can run in parallel on different keys and lists of data. The distributed file system spreads multiple copies of the data across different machines, which not only offers reliability without the need for RAID-controlled disks but also offers multiple locations to run the mapping: if a machine holding one copy of the data is busy or offline, another machine can be used. A job scheduler keeps track of which MapReduce jobs are executing, schedules individual Maps, Reduces, or intermediate merging operations onto specific machines, monitors the success and failure of these individual tasks, and works to complete the entire batch job. In this way, users can read and process the data through their programs without having to know which machines actually hold it.
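To make the map/reduce split concrete, here is the classic word-count job written against Hadoop's Java MapReduce API. It is a minimal sketch: the input and output paths are supplied on the command line and are assumed to be HDFS directories, and the class names are illustrative rather than taken from this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: each call handles one line of input independently and emits
    // (word, 1) pairs, which is what lets many map tasks run in parallel.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: all counts for the same word are routed to one reduce call and summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because each map call handles one line in isolation and each reduce call handles one word's counts in isolation, the framework is free to run many map and reduce tasks in parallel on whichever machines hold copies of the input blocks, exactly as described above.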