3 Challenges to Unleashing Your Data Preparation Potential

by Zachary Zeus
June 14, 2017

There is a growing trend wherein business analysts have more control than ever to prepare and analyze a variety of valuable data, often circumventing IT processes and policies.  What are IT leaders to do?  Can they liberate data to be used by analysts, yet provide checks and balances when needed?

At BizCubed, we’ve seen many organizations face these issues. What follows is a discussion of some of the relevant challenges in data preparation and management, with examples of what’s working and what’s not in these heady times. In order to help you overcome the most common data analysis challenges, we’ll cover the following areas:

  1. Celebrating Data Diversity: Embracing The Biggest V with a Dynamic Approach
  2. Being Braggadocious is Not an Asset: Avoiding the Macho Coder & Keeping Data Preparation Simple
  3. Power to the People: Reinventing Data Governance, End-to-End

1. Celebrating Data Diversity: Embracing The Biggest V with a Dynamic Approach

In recent years, there’s been much focus on “big data.” Most readers are well aware of the 3 big data V’s that Doug Laney coined in his 2001 META Group article, “3-D Data Management: Controlling Data Volume, Velocity and Variety.”

While there’s been concern around the “Volume” aspect of big data, there’s a different “V” that is more broadly challenging to organizations: Variety.  Laney recognized this in his article, noting: “No greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics.”

Today, we see an array of platforms collecting data relevant for analysis. At a recent conference, one of the analytics leaders at Ford Motor Company mentioned that their team navigates over 4,600 different systems to gather data for analytics. That’s an astounding amount of data variety within just a single company.

This data variety issue is endemic. One productive approach to confronting it is implementing dynamic, reusable processes that accommodate different data.  For example, an insurance brokerage we’ve worked with needed to implement standardized data processes in support of the Health Insurance Portability and Accountability Act (HIPAA). Specifically, they needed to consume and integrate diverse data from across systems to provide complex EDI enrolment data to their insurance partners.

We created a dynamic environment that consumed info about the required data sources and targets, such as directory paths, field names, and data types. Then, we automatically injected this metadata into the data transformation process, drastically reducing the number of workflows needed to handle a variety of formats.
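As a rough illustration of this pattern (a minimal sketch, not the actual implementation), the Python below reads everything feed-specific from a metadata descriptor and runs one generic workflow. The descriptor file names, fields, and types are all hypothetical.

```python
import csv
import json

def load_metadata(path):
    """Read a feed descriptor: source/target paths, field names, data types."""
    with open(path) as f:
        return json.load(f)

def cast(value, dtype):
    """Coerce a raw string to the type declared in the metadata."""
    return {"int": int, "float": float, "str": str}[dtype](value)

def transform(meta):
    """One generic workflow: the metadata, not the code, describes each feed."""
    fields = meta["fields"]
    with open(meta["source_path"], newline="") as src, \
         open(meta["target_path"], "w", newline="") as tgt:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(tgt, fieldnames=[f["name"] for f in fields])
        writer.writeheader()
        for row in reader:
            writer.writerow({f["name"]: cast(row[f["name"]], f["type"])
                             for f in fields})

# The same transform handles every feed; only the metadata changes.
for descriptor in ("members.json", "claims.json"):  # hypothetical descriptors
    transform(load_metadata(descriptor))
```

Adding a new feed then becomes a metadata change rather than a new workflow, which is what drove the drastic reduction in workflow count.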

Today’s IT leaders must welcome the diversity of the data available. But in order to do this, they must also embrace platforms that provide a high degree of process reusability and automation.  This can simplify data preparation, helping teams keep up with the unrelenting demand for new data sets.

2. Being Braggadocious is Not an Asset: Avoiding the Macho Coder & Keeping Data Preparation Simple

IT leaders are enthused by the growing crop of talent coming from university programs and technical schools these days. Many of these millennial developers have learned a variety of programming languages.

For this generation, coding is “cool.” There’s a certain machismo in this cohort around writing raw code rather than leveraging user interfaces: the command line is always better than a GUI, and hand coding always beats visual development.

This bravado is not new. We’ve seen it before. In the early ’90s, the only way to manage databases was with hand-written procedural SQL. In response, companies developed visual ETL technologies to provide a reusable framework and eliminate redundant development. Yet many developers shunned these innovations, claiming they needed more control over their algorithms. In the end, the productivity of the tools won out.

Today, there is a similar pattern among new coders, who are often more focused on creating something of their own than on the value of their work to the business. Experienced IT leaders recognize that a balance must be struck between creativity and productivity. Indeed, amassing custom code will likely result in lost productivity and an IT maintenance nightmare in the long run.

We encountered this situation recently at one of our clients. They needed a process to export data to their partners. When presented with this problem, the developer in charge reviewed visual ETL tools, pronounced that they were too rigid, and boasted that he could code a superior process.

The developer proceeded to write a huge amount of Python code to pull data from databases, format and cleanse it, and then generate and transmit files to specific partners. Once the code was deployed and in need of support, the developer decided to leave the company, taking all his know-how with him.

The client was left poring through megabytes of code any time there was an error or an extension request. The flexibility the developer had proclaimed wound up stifling the client’s innovation process and slowing their service levels to a crawl.

While it may be appealing to eschew existing processes for something brand new, IT leaders must remain skeptical of vagabond coders more interested in padding their development portfolios than in pursuing business goals.

Our experience shows that teams should seek out data preparation tools built around an intuitive visual design paradigm: they accelerate time to insight and open access beyond the rogue coder, while still letting programmers create extensions and collaborate with wider development communities.
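To make the contrast with the monolithic export script concrete, here is a hedged sketch, in the same Python the developer above reached for, of what an extension-friendly process can look like: small, documented stages that any team member can test, replace, or hand over. The file name, field, and partner code are hypothetical.

```python
import csv
from datetime import date
from pathlib import Path

def extract(path):
    """Read raw records from a source extract (one stage, one concern)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def cleanse(rows):
    """Cleansing rules live here, in one reviewable place."""
    for row in rows:
        row["partner_id"] = row["partner_id"].strip().upper()
        yield row

def export(rows, partner, out_dir="exports"):
    """Write one partner file; the naming convention is documented once."""
    rows = list(rows)
    if not rows:
        return
    Path(out_dir).mkdir(exist_ok=True)
    target = Path(out_dir) / f"{partner}_{date.today():%Y%m%d}.csv"
    with open(target, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Each stage can be tested, extended, or handed over independently.
export(cleanse(extract("enrolments.csv")), partner="ACME")
```

The point is not these specific functions but the shape: when the author leaves, each stage still explains itself.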

3. Power to the People: Reinventing Data Governance, End-to-End

One lie that IT organizations tell themselves is that there are two kinds of data sources: 1) governed (that IT controls) and 2) ungoverned (that IT does not control). The story goes that data consumers can trust governed sources, but can’t trust ungoverned sources.

This may be true for cases related to regulatory compliance, privacy, and certain executive KPIs.  However, the more fine-grained the questions are, the more difficult it is to conceive of a “single truth” for almost any metric. Further, it’s a fallacy that only IT can provide this absolute truth.

We have a customer in the logistics industry that had many well-designed legacy reports showing transit times and on-time rates for carriers. These reports were developed by skilled IT developers and were governed to ensure that they contained “true” data.

Of course, analysts would regularly find that the reports didn’t have the required rollups or calculations, or were based on invalid assumptions. So did they wait a few months for IT to extend the reports so they could answer their questions?

Of course not. Instead, they took the “perfect” reports, dumped them into a spreadsheet, and manipulated the data at will. The rules for this manipulation were either stored in the analyst’s head or lost in the ether of copy-and-paste operations. Welcome to data preparation, cowboy-style.

There’s much fretting amongst IT professionals that allowing analysts to do data preparation in some “ungoverned” way is the end of days. However, we see this new age of analyst-controlled data preparation as one where IT experts are even more relevant than before.

IT data managers, though, need to transform from gatekeepers into flight controllers. IT has the technical chops and the data structure knowledge to help analysts navigate data and to create shared, reusable data environments for the good of all. It’s the duty of IT leaders to provide a platform for collaborating with analysts on their data discovery journey.
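To make “shared and reusable” concrete, here is a hedged sketch of one small move in that direction: capturing a rule that would otherwise live in an analyst’s spreadsheet, such as a carrier on-time definition, as a versioned function that both IT and analysts can read and review. The records and the two-hour grace period are hypothetical.

```python
from datetime import datetime

def on_time_rate(shipments, grace_hours=2):
    """Shared definition of 'on time': versioned and reviewable by IT and
    analysts alike, instead of living in one analyst's head or a pasted
    spreadsheet formula. The two-hour grace period is a hypothetical rule."""
    if not shipments:
        return 0.0
    hits = sum(1 for s in shipments
               if (s["delivered"] - s["promised"]).total_seconds()
               <= grace_hours * 3600)
    return hits / len(shipments)

# Hypothetical transit records: one on-time delivery, one late delivery.
shipments = [
    {"promised": datetime(2017, 6, 1, 12), "delivered": datetime(2017, 6, 1, 13)},
    {"promised": datetime(2017, 6, 2, 12), "delivered": datetime(2017, 6, 2, 18)},
]

print(f"On-time rate: {on_time_rate(shipments):.0%}")  # -> On-time rate: 50%
```

Once a rule like this lives in a shared, versioned place, the cowboy-style manipulation described above becomes a reviewable change rather than a mystery.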

As such, look to establish business rules and technology practices that bridge the divide between traditional IT-led data engineering, business-oriented analytic exploration, and the middle ground of data preparation.  Gaps between these activities continue to be a source of frustration in analytics projects.

Follow the path from raw data to business insight carefully, but ensure the right levels of input and collaboration from all relevant stakeholders.  United we stand, divided we fall!

Conclusion – Check Your Data Preparation

To wrap up, we’ve covered several factors that can make – or break – the data preparation processes so crucial to driving value in many organizations.  If you aren’t sure where you stand, ask yourself the following questions:

  1. Are we set up to exploit the growing data variety that’s becoming crucial to driving insights and competitive advantage? Is our approach static or is it dynamic and automated, ready for the next challenge?
  2. Are we (perhaps unknowingly) dependent on custom coded data preparation routines, and how will we respond if the author(s) leave?  Can we equip a wider array of team members to easily drive scalable data processes?
  3. Are we so rigid in our governance approach that IT and business analysts are working more at odds than together?  Have we identified how to maintain a trusted, holistic process to fuel analytics, and where might there be technology or collaboration gaps?

Next, I’d encourage you to take a moment to learn more about how Pentaho helps businesses meet the challenges of modern data preparation through a flexible, intuitive, and data-agnostic platform.

Finally, if you want to dive deeper into taking your data preparation processes to the next level, take a look at the recent research piece from TDWI, “Improving Data Preparation for Business Analytics.”

Zachary Zeus

Zachary Zeus is the Co-CEO & Founder of BizCubed. He brings more than 20 years’ engineering experience and a solid background in providing large financial services firms with data capability. He maintains a passion for applying engineering solutions to real-world problems, lending his considerable experience to enabling people to make better data-driven decisions.
