Book Introduction

Hadoop: The Definitive Guide (English), 4th Edition [2025 | PDF | EPUB | MOBI | Kindle ebook, Baidu cloud download]

Hadoop: The Definitive Guide (English), 4th Edition
  • Author: Tom White (US)
  • Publisher: Southeast University Press, Nanjing
  • ISBN: 9787564159177
  • Publication date: 2015
  • Listed page count: 730
  • File size: 77 MB
  • File page count: 756
  • Subjects: Data processing software; guides; English

PDF Download


Click here for the online PDF download of this book [recommended: cloud unzip, quick and convenient]. Downloads the book directly in PDF format; works on both mobile and PC.
Torrent download [BT downloads are fast]. Tip: please use the BT client FDM for downloading. [Software download page] [Direct download (convenient but slower)] [Read this book online] [Get the unzip code online]

Download Notes

Hadoop: The Definitive Guide (English), 4th Edition, PDF ebook download

The downloaded file is a RAR archive; unpack it with an archiving tool to obtain the book in PDF format.

We recommend downloading with Free Download Manager (FDM), a free, ad-free, cross-platform BT client. All resources on this site are packaged as BT torrents, so a dedicated BT client is required; other options include BitComet, qBittorrent, and uTorrent. Because this site's resources are not yet popular seeds, Xunlei is currently not recommended; once a resource becomes popular, Xunlei will work as well.

(The file page count is larger than the listed page count, except for ebooks split into multiple volumes.)

Note: every archive on this site has an unzip code. Click here to download an archive-extraction tool.

Table of Contents

Part I. Hadoop Fundamentals 3

1. Meet Hadoop 3
Data! 3
Data Storage and Analysis 5
Querying All Your Data 6
Beyond Batch 7
Comparison with Other Systems 8
Relational Database Management Systems 8
Grid Computing 10
Volunteer Computing 11
A Brief History of Apache Hadoop 12
What's in This Book? 15

2. MapReduce 19
A Weather Dataset 19
Data Format 19
Analyzing the Data with Unix Tools 21
Analyzing the Data with Hadoop 22
Map and Reduce 22
Java MapReduce 24
Scaling Out 30
Data Flow 30
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40

3. The Hadoop Distributed Filesystem 43
The Design of HDFS 43
HDFS Concepts 45
Blocks 45
Namenodes and Datanodes 46
Block Caching 47
HDFS Federation 48
HDFS High Availability 48
The Command-Line Interface 50
Basic Filesystem Operations 51
Hadoop Filesystems 53
Interfaces 54
The Java Interface 56
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 58
Writing Data 61
Directories 63
Querying the Filesystem 63
Deleting Data 68
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 74
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 77

4. YARN 79
Anatomy of a YARN Application Run 80
Resource Requests 81
Application Lifespan 82
Building YARN Applications 82
YARN Compared to MapReduce 1 83
Scheduling in YARN 85
Scheduler Options 86
Capacity Scheduler Configuration 88
Fair Scheduler Configuration 90
Delay Scheduling 94
Dominant Resource Fairness 95
Further Reading 96

5. Hadoop I/O 97
Data Integrity 97
Data Integrity in HDFS 98
LocalFileSystem 99
ChecksumFileSystem 99
Compression 100
Codecs 101
Compression and Input Splits 105
Using Compression in MapReduce 107
Serialization 109
The Writable Interface 110
Writable Classes 113
Implementing a Custom Writable 121
Serialization Frameworks 126
File-Based Data Structures 127
SequenceFile 127
MapFile 135
Other File Formats and Column-Oriented Formats 136

Part II. MapReduce 141

6. Developing a MapReduce Application 141
The Configuration API 141
Combining Resources 143
Variable Expansion 143
Setting Up the Development Environment 144
Managing Configuration 146
GenericOptionsParser, Tool, and ToolRunner 148
Writing a Unit Test with MRUnit 152
Mapper 153
Reducer 156
Running Locally on Test Data 156
Running a Job in a Local Job Runner 157
Testing the Driver 158
Running on a Cluster 160
Packaging a Job 160
Launching a Job 162
The MapReduce Web UI 165
Retrieving the Results 167
Debugging a Job 168
Hadoop Logs 172
Remote Debugging 174
Tuning a Job 175
Profiling Tasks 175
MapReduce Workflows 177
Decomposing a Problem into MapReduce Jobs 177
JobControl 178
Apache Oozie 179

7. How MapReduce Works 185
Anatomy of a MapReduce Job Run 185
Job Submission 186
Job Initialization 187
Task Assignment 188
Task Execution 189
Progress and Status Updates 190
Job Completion 192
Failures 193
Task Failure 193
Application Master Failure 194
Node Manager Failure 195
Resource Manager Failure 196
Shuffle and Sort 197
The Map Side 197
The Reduce Side 198
Configuration Tuning 201
Task Execution 203
The Task Execution Environment 203
Speculative Execution 204
Output Committers 206

8. MapReduce Types and Formats 209
MapReduce Types 209
The Default MapReduce Job 214
Input Formats 220
Input Splits and Records 220
Text Input 232
Binary Input 236
Multiple Inputs 237
Database Input (and Output) 238
Output Formats 238
Text Output 239
Binary Output 239
Multiple Outputs 240
Lazy Output 245
Database Output 245

9. MapReduce Features 247
Counters 247
Built-in Counters 247
User-Defined Java Counters 251
User-Defined Streaming Counters 255
Sorting 255
Preparation 256
Partial Sort 257
Total Sort 259
Secondary Sort 262
Joins 268
Map-Side Joins 269
Reduce-Side Joins 270
Side Data Distribution 273
Using the Job Configuration 273
Distributed Cache 274
MapReduce Library Classes 279

Part III. Hadoop Operations 283

10. Setting Up a Hadoop Cluster 283
Cluster Specification 284
Cluster Sizing 285
Network Topology 286
Cluster Setup and Installation 288
Installing Java 288
Creating Unix User Accounts 288
Installing Hadoop 289
Configuring SSH 289
Configuring Hadoop 290
Formatting the HDFS Filesystem 290
Starting and Stopping the Daemons 290
Creating User Directories 292
Hadoop Configuration 292
Configuration Management 293
Environment Settings 294
Important Hadoop Daemon Properties 296
Hadoop Daemon Addresses and Ports 304
Other Hadoop Properties 307
Security 309
Kerberos and Hadoop 309
Delegation Tokens 312
Other Security Enhancements 313
Benchmarking a Hadoop Cluster 314
Hadoop Benchmarks 314
User Jobs 316

11. Administering Hadoop 317
HDFS 317
Persistent Data Structures 317
Safe Mode 322
Audit Logging 324
Tools 325
Monitoring 330
Logging 330
Metrics and JMX 331
Maintenance 332
Routine Administration Procedures 332
Commissioning and Decommissioning Nodes 334
Upgrades 337

Part IV. Related Projects 345

12. Avro 345
Avro Data Types and Schemas 346
In-Memory Serialization and Deserialization 349
The Specific API 351
Avro Datafiles 352
Interoperability 354
Python API 354
Avro Tools 355
Schema Resolution 355
Sort Order 358
Avro MapReduce 359
Sorting Using Avro MapReduce 363
Avro in Other Languages 365

13. Parquet 367
Data Model 368
Nested Encoding 370
Parquet File Format 370
Parquet Configuration 372
Writing and Reading Parquet Files 373
Avro, Protocol Buffers, and Thrift 375
Parquet MapReduce 377

14. Flume 381
Installing Flume 381
An Example 382
Transactions and Reliability 384
Batching 385
The HDFS Sink 385
Partitioning and Interceptors 387
File Formats 387
Fan Out 388
Delivery Guarantees 389
Replicating and Multiplexing Selectors 390
Distribution: Agent Tiers 390
Delivery Guarantees 393
Sink Groups 395
Integrating Flume with Applications 398
Component Catalog 399
Further Reading 400

15. Sqoop 401
Getting Sqoop 401
Sqoop Connectors 403
A Sample Import 404
Text and Binary File Formats 406
Generated Code 407
Additional Serialization Systems 408
Imports: A Deeper Look 408
Controlling the Import 410
Imports and Consistency 411
Incremental Imports 411
Direct-Mode Imports 411
Working with Imported Data 412
Imported Data and Hive 413
Importing Large Objects 415
Performing an Export 417
Exports: A Deeper Look 419
Exports and Transactionality 420
Exports and SequenceFiles 421
Further Reading 422

16. Pig 423
Installing and Running Pig 424
Execution Types 424
Running Pig Programs 426
Grunt 426
Pig Latin Editors 427
An Example 427
Generating Examples 429
Comparison with Databases 430
Pig Latin 432
Structure 432
Statements 433
Expressions 438
Types 439
Schemas 441
Functions 445
Macros 447
User-Defined Functions 448
A Filter UDF 448
An Eval UDF 452
A Load UDF 453
Data Processing Operators 457
Loading and Storing Data 457
Filtering Data 457
Grouping and Joining Data 459
Sorting Data 465
Combining and Splitting Data 466
Pig in Practice 467
Parallelism 467
Anonymous Relations 467
Parameter Substitution 468
Further Reading 469

17. Hive 471
Installing Hive 472
The Hive Shell 473
An Example 474
Running Hive 475
Configuring Hive 475
Hive Services 478
The Metastore 480
Comparison with Traditional Databases 482
Schema on Read Versus Schema on Write 482
Updates, Transactions, and Indexes 483
SQL-on-Hadoop Alternatives 484
HiveQL 485
Data Types 486
Operators and Functions 488
Tables 489
Managed Tables and External Tables 490
Partitions and Buckets 491
Storage Formats 496
Importing Data 500
Altering Tables 502
Dropping Tables 502
Querying Data 503
Sorting and Aggregating 503
MapReduce Scripts 503
Joins 505
Subqueries 508
Views 509
User-Defined Functions 510
Writing a UDF 511
Writing a UDAF 513
Further Reading 518

18. Crunch 519
An Example 520
The Core Crunch API 523
Primitive Operations 523
Types 528
Sources and Targets 531
Functions 533
Materialization 535
Pipeline Execution 538
Running a Pipeline 538
Stopping a Pipeline 539
Inspecting a Crunch Plan 540
Iterative Algorithms 543
Checkpointing a Pipeline 545
Crunch Libraries 545
Further Reading 548

19. Spark 549
Installing Spark 550
An Example 550
Spark Applications, Jobs, Stages, and Tasks 552
A Scala Standalone Application 552
A Java Example 554
A Python Example 555
Resilient Distributed Datasets 556
Creation 556
Transformations and Actions 557
Persistence 560
Serialization 562
Shared Variables 564
Broadcast Variables 564
Accumulators 564
Anatomy of a Spark Job Run 565
Job Submission 565
DAG Construction 566
Task Scheduling 569
Task Execution 570
Executors and Cluster Managers 570
Spark on YARN 571
Further Reading 574

20. HBase 575
HBasics 575
Backdrop 576
Concepts 576
Whirlwind Tour of the Data Model 576
Implementation 578
Installation 581
Test Drive 582
Clients 584
Java 584
MapReduce 587
REST and Thrift 589
Building an Online Query Application 589
Schema Design 590
Loading Data 591
Online Queries 594
HBase Versus RDBMS 597
Successful Service 598
HBase 599
Praxis 600
HDFS 600
UI 601
Metrics 601
Counters 601
Further Reading 601

21. ZooKeeper 603
Installing and Running ZooKeeper 604
An Example 606
Group Membership in ZooKeeper 606
Creating the Group 607
Joining a Group 609
Listing Members in a Group 610
Deleting a Group 612
The ZooKeeper Service 613
Data Model 614
Operations 616
Implementation 620
Consistency 622
Sessions 624
States 625
Building Applications with ZooKeeper 627
A Configuration Service 627
The Resilient ZooKeeper Application 630
A Lock Service 634
More Distributed Data Structures and Protocols 636
ZooKeeper in Production 637
Resilience and Performance 637
Configuration 639
Further Reading 640

Part V. Case Studies 643

22. Composable Data at Cerner 643
From CPUs to Semantic Integration 643
Enter Apache Crunch 644
Building a Complete Picture 644
Integrating Healthcare Data 647
Composability over Frameworks 650
Moving Forward 651

23. Biological Data Science: Saving Lives with Software 653
The Structure of DNA 655
The Genetic Code: Turning DNA Letters into Proteins 656
Thinking of DNA as Source Code 657
The Human Genome Project and Reference Genomes 659
Sequencing and Aligning DNA 660
ADAM, A Scalable Genome Analysis Platform 661
Literate programming with the Avro interface description language (IDL) 662
Column-oriented access with Parquet 663
A simple example: k-mer counting using Spark and ADAM 665
From Personalized Ads to Personalized Medicine 667
Join In 668

24. Cascading 669
Fields, Tuples, and Pipes 670
Operations 673
Taps, Schemes, and Flows 675
Cascading in Practice 676
Flexibility 679
Hadoop and Cascading at ShareThis 680
Summary 684

A. Installing Apache Hadoop 685
B. Cloudera's Distribution Including Apache Hadoop 691
C. Preparing the NCDC Weather Data 693
D. The Old and New Java MapReduce APIs 697

Index 701