Duration 20:9

CSCI-E63 Big Data Analytics (2015) - Final Project “Impala a Realtime Query Engine”

295 watched
Published 14 May 2015

The summary is at this link: /watch/sYNvNnrJAkMJv
The demo is at this link: /watch/wXc0E9b1S-Y10

This project introduces Cloudera Impala as a real-time query execution engine and describes its position within the Hadoop ecosystem. First, it briefly describes the background that inspired the development of Impala. Next, it describes the pitfalls of the conventional Relational Database Management System (RDBMS) and the reasons that Hadoop uses a “schema on read” data analysis strategy. It then gives a general description of the new concepts that Hadoop introduces and that shape data management from the big data point of view. After that, it discusses the Hadoop ecosystem and how each component of this ecosystem contributes to the processing of large data sets in a distributed computing environment. Next, it describes the data analytics engines available for data summarization, query, and analysis within the different Hadoop distributions, with an emphasis on Cloudera Impala. The project then describes the daemons and components of Cloudera Impala and illustrates the interaction between them. The life cycle of Hive query execution is compared to that of Impala query execution. The advantages and disadvantages of Impala's in-memory data access, in contrast to Hive's disk-based MapReduce processing, are listed next, and data access in the two models is briefly highlighted. To illustrate the effect of the different memory models and query execution life cycles, the project demonstrates the performance advantages of Cloudera Impala over Hive for real-time queries.
