Microsoft Azure HDInsight is Microsoft’s 100 percent compliant distribution of Apache Hadoop on Microsoft Azure. This means that standard Hadoop concepts and technologies apply, so learning the Hadoop stack helps you learn the HDInsight service. At the time of this writing, HDInsight (version 3.0) uses Hadoop version 2.2 and Hortonworks Data Platform 2.0.

In Introducing Microsoft Azure HDInsight, we cover what big data really means, how you can use it to your advantage in your company or organization, and one of the services you can use to do that quickly—specifically, Microsoft’s HDInsight service. We start with an overview of big data and Hadoop, but we don’t emphasize only concepts in this book—we want you to jump in and get your hands dirty working with HDInsight in a practical way. To help you learn and even implement HDInsight right away, we focus on a specific use case that applies to almost any organization and demonstrate a process that you can follow along with.

We also help you learn more. In the last chapter, we look ahead at the future of HDInsight and give you recommendations for self-learning so that you can dive deeper into important concepts and round out your education on working with big data.

Who should read this book

This book is intended to help database and business intelligence (BI) professionals, programmers, Hadoop administrators, researchers, technical architects, operations engineers, data analysts, and data scientists understand the core concepts of HDInsight and related technologies. It is especially useful for those looking to deploy their first data cluster and run MapReduce jobs to discover insights and for those trying to figure out how HDInsight fits into their technology infrastructure.

Assumptions

Many readers will have no prior experience with HDInsight, but even some familiarity with earlier versions of HDInsight or with Apache Hadoop and the MapReduce framework will provide a solid base for using this book. Introducing Microsoft Azure HDInsight assumes you have experience with web technology, programming on Windows machines, and basic data analysis principles and practices, as well as an understanding of Microsoft Azure cloud technology.

Who should not read this book

Not every book is aimed at every possible audience. This book is not intended for data mining engineers.

Avkash Chauhan

Avkash Chauhan is the founder and principal at Big Data Perspective, working to build a product that makes Hadoop accessible to mainstream enterprises by simplifying the adoption, customization, management, and support of Hadoop clusters. While recently at Platfora, he participated in building big data analytics software that runs natively on Hadoop. Previously he worked eight years at Microsoft building cloud and big data products and providing assistance to enterprise partners worldwide. Avkash has more than 15 years of software development experience in cloud and big data disciplines. He is an accomplished author, blogger, and technical speaker and loves the outdoors.

Valentine Fontama

Valentine Fontama is a principal data scientist in the Data and Decision Sciences Group at Microsoft. Val has more than eight years of data science experience. After obtaining his PhD in neural networks, he was a new technology consultant at Equifax in London, where he pioneered the application of data mining in the consumer credit industry. Over the last seven years, Val was a senior product marketing manager for big data and predictive analytics in SQL Server marketing, responsible for machine learning, HDInsight, Parallel Data Warehouse, and Fast Track Data Warehouse. Val also holds an MBA in strategic management and marketing from the Wharton School, an MS in computing, and a BS in mathematics and electronics. He has published 11 academic papers and is an accomplished speaker about big data.

Michele Hart

Michele Hart is a senior technical writer with more than 20 years of writing experience, the last six at Microsoft. She has written countless knowledgeable words for various industries, including finance, entertainment, Internet, telecom, and education. She spent several years as a manager and director of writing, training, and support teams, several more years as a stay-at-home mom, and the last eight or so as an individual contributor focusing on SQL Server and Power BI articles and videos.

Wee-Hyong Tok

Wee-Hyong Tok is a senior program manager on the SQL Server team at Microsoft. Wee-Hyong has a range of experiences working with data, with more than six years of data platform experience in industry and six years of academic experience. After obtaining his PhD in data streaming systems from the National University of Singapore, he joined Microsoft and worked on SQL Server Integration Services (SSIS). He was responsible for shaping the SSIS Server, bringing it from concept to its inclusion in SQL Server 2012. Wee-Hyong has published 20 academic papers and speaks regularly at technology conferences.

Buck Woody

Buck Woody is a senior technical specialist for Microsoft, working with enterprise-level clients to develop computing platform architecture solutions within their organizations. With more than 25 years of professional and practical experience in computer technology, he is also a popular speaker at TechEd, PASS, and many other conferences. Buck is the author of more than 500 articles and five books on databases and teaches a database design course at the University of Washington.

Organization of this book

This book consists of one conceptual chapter and four hands-on chapters. Chapter 1, “Big data, quick overview,” introduces the topic of big data, with definitions of terms and descriptions of tools and technologies. Chapter 2, “Getting started with HDInsight,” takes you through the steps to deploy a cluster and shows you how to use the HDInsight Emulator. After your cluster is deployed, it’s time for Chapter 3, “Programming HDInsight,” which continues where Chapter 2 left off, showing you how to run MapReduce jobs and turn your data into insights. Chapter 4, “Working with HDInsight data,” teaches you how to work more effectively with your data using Apache Hive, Apache Pig, Excel and Power BI, and Sqoop. Finally, Chapter 5, “What next?,” covers practical topics such as integrating HDInsight into the rest of your stack and the different options for Hadoop deployment on Windows. Chapter 5 finishes up with a discussion of future plans for HDInsight and provides links to additional learning resources.