What is Hadoop?

Highlights

What is Hadoop? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.

Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.

As the World Wide Web grew at a dizzying pace in the late 1900s and early 2000s, search engines and indexes were created to help people find relevant information amid all of that text-based content. During the early years, search results were returned by humans. It’s true! But as the number of web pages grew from dozens to millions, automation was required. Web crawlers were created, many as university-led research projects, and search engine startups took off (Yahoo, AltaVista, etc.).

One such project was Nutch – an open-source web search engine – and the brainchild of Doug Cutting and Mike Cafarella. Their goal was to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. Also during this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that more relevant web search results could be returned faster.

In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided. The web crawler portion remained as Nutch. The distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). Some advantages of Hadoop are low cost, more computing power, scalability, storage flexibility and inherent data protection and self-healing capabilities.

Print Article

Just In

What is Hadoop?

What is Hadoop? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware.

News

Company

Entertainment

All News