White Paper

Realize the value of your data using Hadoop
Evaluation, adoption and value of data and analytics


Table of contents

Hadoop realities
A logical adoption cycle
Methods and tools
Deriving business value
Partner considerations
Seek real value
About the authors

As organizations strive to identify and realize the value locked inside their enterprise data, including new data sources, many now seek more agile and capable analytics system options. The Apache Hadoop ecosystem is a rapidly maturing technology framework that promises measurable value and savings and is enjoying significant uptake in the enterprise environment. This ecosystem brings a modern data processing platform with storage redundancy and a rich set of capabilities for data integration and analytics, from query engines to advanced machine learning and artificial intelligence (AI).

However, very real challenges remain before organizations can adopt this evolving data and analytics model effectively and leverage it, along with other advanced analytics services from cloud providers, to deliver business value. Companies and public institutions may struggle to acquire the skills, tools and capabilities needed to successfully implement Hadoop analytics projects in both cloud and on-premises environments. Others are working to clarify a logical path to value for data- and analytics-oriented deployments.

In this paper, DXC Technology describes a proven strategy to extract value from Hadoop-oriented investments. The paper examines the opportunities and obstacles many organizations face and explores the processes, tools and best practice methods needed to achieve business objectives.

Hadoop realities

As organizations of all kinds embrace the use of data and analytics, most are now moving beyond traditional business intelligence (BI) to a more advanced and comprehensive analytics environment with much richer data sources. Forward-looking executives now view data and analytics involving machine learning and AI capabilities as vital tools needed to improve customer relationships, accelerate speed to market, survive in an increasingly dynamic marketplace and drive sustainable value.

The value of data

The very nature of the data revolution, driven by the four V's (volume, variety, velocity and veracity), now poses unique challenges to many organizations. Informed observers expect the global volume of data to swell to 163 zettabytes by 2025, 10 times the amount today.1 Unstructured data now comprises 80 percent of enterprise data,2 with growing volumes of data flowing from increasingly ubiquitous sensors, mobile devices, video streams and social networks. This is in addition to the varieties of data already being stored within enterprises but not easily available for analytics. Given the speed and reach of this data revolution, it is perhaps not surprising that many organizations are less than prepared to meet these challenges.

Hadoop ecosystem emerges and matures

As businesses, public agencies and other organizations struggle with these challenges, many now view the incrementally maturing Hadoop ecosystem and related complementary technology offerings as a logical way to process and analyze big data as well as smaller datasets (see Figure 1). Driven by the needs associated with broadening adoption, the ecosystem is evolving with more enterprise integration and operational features.

1. IDC White Paper, sponsored by Seagate, "Data Age 2025," April 2017. http://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf
2. "The Big (Unstructured) Data Problem," Forbes CommunityVoice, June 5, 2017. https://www.forbes.com/sites/forbestechcouncil/2017/06/05/the-big-unstructured-data-problem

[Figure 1. A framework for advanced data and analytics. The figure relates business outcomes and value themes (data-driven products and services, customer experience, operations and IT optimization, corporate and financial performance, managing company risks) to two delivery approaches (modernize/leverage and operationalize analytics), supported by discovery, development and integration, and implementation processes, tools and best practices; data governance; a Hadoop platform; and data from sources such as CRM, SCM and ERP systems, documents, social media, texts, video, open data, GPS, email, audio, transactional data, images, mobile, sensor, weather and machine data.]

The Apache Hadoop ecosystem is an open source software framework for distributed storage and processing of extensive datasets, using simple program modules across clusters of commodity-level computing hardware. Reliable and scalable, Hadoop is designed to run on anything from a single server to thousands of machines. Most ecosystem tools are inherently scalable as well, although early releases of tools may not start as mature enough for demanding production uses.

The primary Hadoop project incorporates common utilities, a distributed file system, frameworks for resource management and job scheduling, and technology for parallel processing of large data volumes. It also provides complementary ecosystem tools, such as the Spark family of components and database tools (e.g., Hive and HBase) for different types of analytics processing. These tools extend to advanced analytics with machine learning, including AI-oriented functional components.

Most enterprises have deployed, or are considering the deployment of, Hadoop environments. Business and IT leaders expect Hadoop to help them extract value from their data and to reduce their total cost of ownership (TCO) for BI and analytics. The Hadoop stack continues to evolve rapidly and now incorporates solution features that allow organizations to build and deploy IT and business solutions that can operate at production grade and across multiple corporate locations. It is not uncommon that some business uses involve more on-demand or periodic scale-out clusters for specific processing needs, deployed in the public cloud, in addition to more permanent multitenant core clusters that would be on premises, or in the private or public cloud. Also, as the analytics uses become increasingly business critical, more deployments are starting to incorporate disaster recovery (DR) cluster solutions with commensurate recovery time objectives.
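To make the ecosystem described above more concrete, here is a minimal sketch (not from the original paper) of how an analytics job might use Spark on a Hadoop cluster: it reads raw event files from HDFS, runs a SQL aggregation and writes the result back for downstream BI tools. PySpark availability, the HDFS paths and the table and column names are all assumptions for illustration.

    # Minimal PySpark sketch: read raw clickstream files from HDFS, expose them as a
    # temporary table and run a Spark SQL aggregation. Paths and column names are
    # illustrative only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("clickstream-daily-summary")
        .enableHiveSupport()   # allows reading/writing Hive metastore tables if present
        .getOrCreate()
    )

    # Raw JSON events previously landed in the data lake (hypothetical path)
    events = spark.read.json("hdfs:///data/raw/clickstream/")
    events.createOrReplaceTempView("clickstream")

    # Aggregate page views per customer with plain SQL
    daily_summary = spark.sql("""
        SELECT customer_id, COUNT(*) AS page_views
        FROM clickstream
        GROUP BY customer_id
    """)

    # Persist the result as Parquet for downstream BI and reporting tools
    daily_summary.write.mode("overwrite").parquet("hdfs:///data/curated/daily_summary/")
    spark.stop()

The same job can run unchanged on a single test node or on a large multi-node cluster, which is the scalability property described above.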


Challenges remain

While Hadoop, with its rich ecosystem, has quickly gained traction as a viable open source technology in the data and analytics marketplace, a number of significant challenges have emerged, as has been seen with the broader digital revolution. A Hadoop implementation presents very complex planning, deployment and long-term management challenges. There is still a general shortage of Hadoop skills in the marketplace. Although the Hadoop technology stack continues to evolve, it is still maturing and thus poses a higher degree of difficulty and uncertainty. This is especially true, and risky, around the evolving domains of data ingestion and data governance, where the ground is still shifting and independent vendors are attempting to provide a measure of distribution/release independence and stability. Also, distribution vendors are exploring different approaches to dynamic cluster management, which can pose challenges for companies that have, or intend to have, different types of clusters to manage as the business grows.

The business-oriented drivers for many data and analytics projects are often unclear, or less than precise. Discovery projects often lack focus or a clear enough view of the business benefits expected against the use cases being evaluated. Estimating benefits and prioritizing use cases may require thorough preparations with many key stakeholders in the organization for these evaluations to be viewed as representative and robust enough for investments and for mobilizing the resources needed to execute. Not surprisingly, many organizations are struggling to identify a clear path to value for their current Hadoop and other data, analytics and BI investments.

For many, identifying end-state value is the core challenge for data- and analytics-oriented initiatives. DXC Technology believes the correct response is a robust and proven approach to the evaluation and adoption of Hadoop and similar efforts.

A logical adoption cycle

A carefully planned, phased and cost/benefit-balanced adoption environment is crucial for the successful implementation and adoption of Hadoop or any other complex technology system. DXC Technology recognizes a proven three-step approach to implementing sophisticated data or analytics systems:

1. Discovery. In this initial exploratory phase, organizations consider potential project candidates, build and evaluate the business case for potential Hadoop and analytics initiatives and reject, when appropriate, poor project candidates. Many Hadoop projects fall into the discovery category. Some include new analytics use cases; others may be about optimizing workloads and off-loading legacy systems.

2. Development and integration. Once a project has proven its value, it must be built and integrated with existing applications and with the larger BI and analytics landscape. The enterprise integration needs to include connectivity, security and operational support, as well as application interfaces.


3. Implementation. In this final, critical phase, organizations may roll out multiple, industrial-strength Hadoop applications. Depending on the nature of the organization, those deployments may occur across a complex IT landscape, in multiple clusters and on a global scale. The rollout needs to enable adoption by the targeted user communities, which may be addressed in incremental waves and may require tailored training.

Methods and tools

The good news is that a growing range of techniques and technologies is now available to support the Hadoop framework environment. The following outlines some of the major tools and resources organizations can use to ensure more successful data and analytics project outcomes.

Expert guidance

Managing information as a strategic asset requires unique organizational and governance structures. Detailed planning and the use of consistent methods are needed to reduce the cost and risk of data and analytics projects. Not surprisingly, forward-looking organizations increasingly seek expert advice and counsel in the use of Hadoop and other advanced data and analytics technologies. An experienced partner can offer end-to-end guidance for BI modernization, from initial assessments to scope, planning and strategy, proof of value and actionable roadmaps. Competent advisory services may include preliminary designs, realistic cost estimates and prioritized phased planning for building a sustainable, Hadoop-enabled BI architecture. Organizations can use this input to make better decisions about technical architectures, required skills and competencies, governance models and delivery platforms.

Discovering value-driven Hadoop uses

Exploring the business value and possible avenues for data and analytics advances can be difficult. Few organizations possess a structured methodology, specialized data visualization and sharing tools as part of an integrated platform, skilled resources, collaboration methods and best practices to support experimentation that is both ambitious and cost-effective. To solve this problem, DXC recommends a formal and structured approach to data and analytics discovery. This collaborative model provides a number of experimental options and allows organizations to explore, test and learn about Hadoop, as well as other data and analytics opportunities, in a safe, cost-efficient environment. This phased, expert-based discovery model allows organizations to deploy analytics solutions more quickly and with lower, more predictable investments. It reduces disruptions to existing processes and data while increasing productivity and long-term revenue. A robust discovery environment opens a clear path to business value, identifying weak projects and quickly proving the value of strong data and analytics opportunities.


Leveraging the right platform

Given the complex, rapidly changing nature of data and analytics technologies, organizations are naturally reluctant to invest heavily in capital-intensive systems that may too quickly become obsolete. Fortunately, a new generation of consumption-based services now allows enterprises of all kinds to quickly pursue data- and analytics-driven opportunities without the high cost and obsolescence risks of traditional infrastructure. As-a-service solutions are now available for Hadoop enterprise data analytics, real-time BI, and cloud-based data and analytics capabilities. These cost-effective models allow organizations to more swiftly derive valuable insights, without investing in hardware, software licenses, procurement and installation, or the ongoing management of what is often a one-off test and deployment environment. This model lets companies consume and pay only for what they use, and eliminates much of the concern over technology refreshes. It invites teams to more quickly discover insights and develop and deploy winning analytics applications in a production-ready environment.

Even with these options and reduced financial and technology management risks, many companies are still challenged with daunting choices stemming from the rapidly evolving technology and integration complexities. Again, expert guidance and experienced system integrator support, with proven in-practice deployment, integration blueprints and application design patterns, can help accelerate solutioning and reduce the downstream risk and costs of the chosen platform and initially targeted applications.

Data management considerations

One of the most daunting issues facing any BI strategy is how to fully harness all available data, whether well-known structured systems, the burgeoning universe of unstructured information, or the data from virtually any internal and external source. The answer is what some might call a truly modern analytics platform: one that incorporates hybrid data management and workload optimization capabilities, seamlessly bridging traditional and more advanced data and BI technologies. DXC envisions a hybrid approach that optimizes existing BI technologies, and that integrates analytics into business systems and processes across data centers, secure value chains, and public and private cloud environments.

Deployment and integration

Implementation of the data lake is a vital step in an analytics initiative. This is where the different types of data are stored to enable the analytics. Strong integration to and from the data lake requires the development and execution of the technical, process and organizational architectures needed to capture, manage, retain and deliver information from the data lake across the enterprise. Robust content management allows organizations to give employees, customers and value chain partners secure yet convenient access to information and processes.
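As one illustration of such integration, the sketch below (not from the original paper) shows a hypothetical batch ingestion step that promotes a raw ERP extract into a curated, date-partitioned zone of the data lake; the zone layout, paths and column names are assumptions.

    # Hypothetical data lake ingestion step: read a CSV extract from an ERP system
    # out of the raw zone, normalize one column, and write it to the curated zone
    # partitioned by load date so downstream queries can prune files by date.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("erp-orders-ingest").getOrCreate()

    raw_orders = spark.read.csv(
        "hdfs:///lake/raw/erp/orders/",   # illustrative raw-zone path
        header=True,
        inferSchema=True,
    )

    curated_orders = (
        raw_orders
        .withColumn("order_amount", F.col("order_amount").cast("double"))
        .withColumn("load_date", F.current_date())
    )

    (
        curated_orders.write
        .mode("append")
        .partitionBy("load_date")
        .parquet("hdfs:///lake/curated/erp/orders/")  # illustrative curated-zone path
    )

    spark.stop()

Keeping raw and curated zones separate in this way also gives the governance processes described next a clear point at which to apply ownership, retention and access rules.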


Data governance

DXC also believes that to fully leverage data and analytics as strategic assets, organizations must have a formal and effective structure for information governance of the data lake. As in other areas of a business, strong governance can reduce costs and risk while ensuring greater efficiency from the governed activities. Any good governance approach should serve to classify, archive and manage physical and electronic data in a reliable and cost-effective way. That requires careful planning and should encompass data ownership, storage and flows, visibility, confidentiality and security. A robust governance structure addresses policies, processes and systems, and should meet data-oriented regulatory requirements and business objectives.

Deployment alternatives

In their efforts to fully exploit data and analytics, businesses and public agencies are often limited by talent deficiencies, the constraints of available resources and financial pressures to limit capital spending. For these and related reasons, growing numbers of organizations are now accessing consumption-based, managed analytics and BI services. While next-generation analytics are gaining ground, traditional BI will be viable for the foreseeable future. By working with a competent provider for managed services, companies can reduce data warehouse and BI costs while improving service levels. Managed analytics services can eliminate the need to build internal statistical bureaus, as well as the need to hire and pay on-staff data scientists. This approach enables companies to acquire expertise and capabilities without adding soon-to-be-obsolete infrastructure. It supports service levels that adjust quickly and easily, allowing organizations to better meet changing business and technology demands. It can also be an add-on option to more traditional on-premises approaches, to handle peaks in demand or the needs of new innovation projects, thus leading to hybrid models.

Deriving business value

Based on decades of experience in enterprise-class data and analytics environments, DXC sees the Hadoop stack driving business value along two broad themes:

• Modernizing existing BI environments. Hadoop can be implemented to allow existing BI systems to better handle the volume, variety, velocity and veracity of growing enterprise data. Traditional BI tools are reaching the upper limits of their capabilities. By introducing Hadoop ecosystem-based solutions, organizations can lower their analytics TCO and run costs while deploying modernized versions of Oracle, SAP and other BI stacks.


• Operationalizing analytics. Hadoop supports the execution of use cases for a vast array of data emanating from numerous internal and external sources. Its ecosystem toolset is very rich and supports different data ingestion patterns, including batch and real time, and different types of advanced analytics, including machine learning and AI in addition to traditional BI. Consumer-oriented firms could, in just one example, better target buyers based on their historical preferences and social media activities. By operationalizing those applications into existing business processes, organizations can drive business value and competitive advantages.
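As a hedged illustration of what operationalizing such a use case might look like in code, the sketch below (not part of the original paper) trains a simple propensity model with Spark MLlib on historical customer behavior and scores the current customer base; the feature names, table paths and choice of logistic regression are assumptions.

    # Illustrative sketch: train a propensity-to-buy model on historical customer
    # behavior, then score current customers so the results can feed an existing
    # marketing process. All names and paths are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("customer-propensity").getOrCreate()

    # Historical records with a known outcome column ("purchased": 0 or 1)
    history = spark.read.parquet("hdfs:///lake/curated/customer_history/")

    assembler = VectorAssembler(
        inputCols=["past_purchases", "social_mentions", "days_since_last_visit"],
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol="purchased")
    model = Pipeline(stages=[assembler, lr]).fit(history)

    # Score the current customer base (must expose the same feature columns)
    current = spark.read.parquet("hdfs:///lake/curated/customer_current/")
    scored = model.transform(current).select("customer_id", "prediction", "probability")
    scored.write.mode("overwrite").parquet("hdfs:///lake/curated/customer_scores/")

    spark.stop()

Operationalizing then means scheduling this scoring step and wiring its output into the business process that acts on it, rather than leaving the model as a one-off experiment.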

Partner considerations

Given the complexity of the still-emerging Hadoop environment, it comes as no surprise that many organizations now recognize the value of working with partners who are experienced in enterprise-grade data and analytics systems. CIOs and others should consider several qualities when assessing potential Hadoop allies. A helpful partner should, at the very least, offer demonstrated experience and expertise in the current Apache Hadoop framework and its ecosystem of tools, and have experience with multiple deployment models and best practice methodologies. Look for teams with extensive integration skills, and teams who understand how to not only deploy around existing BI infrastructure but also extend its useful life.

For organizations across the industrial and public spectrum, DXC can be that partner. The company has emerged as a leading source of guidance, expert skills, analytics platforms, systems and services to support Hadoop implementations. DXC Analytics services bring the right people, processes and technologies to enterprise-class challenges. DXC specializes in evaluating and aligning analytics investments with business objectives, allowing organizations to explore big data opportunities, glean insights and derive value from information. DXC has invested globally to enable the delivery and management of the information ecosystem, and stands ready to discuss Hadoop and other analytics needs with enterprises.

Seek real value

The importance of data is undeniable in today's enterprise environment, and forward-looking organizations are correctly seeking more practical and agile analytics systems. A Hadoop framework presents both challenges and opportunities. It is important to chart the right path from discovery through development and implementation, and through long-term governance and management of the data lakes. DXC has a distinct point of view on how organizations can map a clear journey to more productive, cost-effective outcomes. By adopting these proven methods, leaders can better identify, leverage and monetize the inherent value in enterprise information.


About the authors

Ashim Bose leads the Analytics Portfolio team at DXC Technology. With over 20 years of industry experience in automotive, industrial, airlines and space exploration, Ashim holds a Ph.D. in computer science and a master's degree in mechanical engineering, both with a specialization in artificial intelligence.

Learn more at www.dxc.technology/analytics

Jan Jonak leads Analytics Platform Engineering at DXC Technology. He has over 10 years of experience with BI and data warehousing delivery for major clients, and with offerings development for big data, data discovery and production platforms (Hadoop, Spark, Vertica and Haven — on-premises and cloud).

About DXC Technology

DXC Technology (NYSE: DXC) is the world's leading independent, end-to-end IT services company, helping clients harness the power of innovation to thrive on change. Created by the merger of CSC and the Enterprise Services business of Hewlett Packard Enterprise, DXC Technology serves nearly 6,000 private and public sector clients across 70 countries. The company's technology independence, global talent and extensive partner network combine to deliver powerful next-generation IT services and solutions. DXC Technology is recognized among the best corporate citizens globally. For more information, visit www.dxc.technology.

© 2017 DXC Technology Company. All rights reserved.

MD_7102a-18. November 2017