Big data software, hardware, application suites, business analytics solutions ... suddenly, it seems, IT enterprises are deluged with vendor offerings that solve problems they didn't know they had. Competitors are gathering data to determine customer needs, define new product categories, and increase profits. Common applications include storage and analysis of customer sales data, web interactions, machine sensor readings, and much more. As you dive into what will most likely be your largest IT project of the year, ensure that you have planned and budgeted for the following items that are unique to big data implementations.
Budgeting for Scaling Up
Scaling up describes the ability of a system to handle larger volumes of data, faster data transmissions, and increasing numbers of users or data-consuming applications. Scale-up problems typically manifest themselves as elongated transaction times, long job run times, and poor response times as perceived by users. The most common response is to treat this as a resource or capacity issue: IT teams add more memory, more and faster CPUs, larger and faster disk drives, and high-speed network cabling.
In a big data environment, the scale of data movement and processing is already large. IT technical teams have already provided a high-capacity, high-speed data store, usually in the form of a hybrid hardware/software appliance offered by a vendor. (One common solution is the IBM DB2 Analytics Accelerator, or IDAA.) With such a device, why would one worry about scaling up?
The answer lies in the way that your big data is used over time. The most common use is to store large amounts of time-dependent data, such as multiple months of customer transactions. This data is then analyzed by sophisticated software called business intelligence analytics. The combination of advanced analytical software and high-speed processing delivers fast queries and actionable results.
As the data and analytics become more valuable, more users issue larger and more complex queries. Quick response time leads to submission of many times more queries than normal. Management turns the resulting analyses into better customer service, better product availability, and higher profits. Ad hoc queries become regular monthly reports, then weekly, then daily. The amount of work multiplies. Suddenly, you have reached the limits of your current big data application.
Realizing this, IT must budget for an eventual scaling up. The success of the project will naturally lead to the requirements for more capacity and resources. Anticipate how soon this will happen, and include expansion estimates in your follow-up budgets.
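One way to anticipate how soon expansion will be needed is a simple compound-growth projection. The sketch below is illustrative only: the function name, the choice of compound monthly growth, and the example figures are assumptions, not measurements from any real installation. Substitute load and capacity numbers from your own monitoring.

```python
def months_until_capacity(current_load, capacity, monthly_growth, horizon=120):
    """Return the first month in which projected load exceeds capacity.

    current_load and capacity share any unit (queries per day, terabytes).
    monthly_growth is a compound rate, e.g. 0.10 for 10% per month.
    Returns None if capacity is not exceeded within `horizon` months.
    All inputs are hypothetical planning figures.
    """
    load = current_load
    for month in range(horizon + 1):
        if load > capacity:
            return month
        # Apply one more month of compound growth.
        load *= 1 + monthly_growth
    return None


# Example with assumed figures: 100 units of load today, 200 units of capacity.
for rate in (0.05, 0.10, 0.20):
    print(f"{rate:.0%} monthly growth -> month {months_until_capacity(100, 200, rate)}")
```

Running the projection for several growth rates gives a range of dates rather than a single guess, which is usually easier to defend in a follow-up budget request.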
Budgeting for Scaling Out
Large IT projects rarely exist in a vacuum, and a big data implementation is no different. Big data isn't simply a large, compressed version of simple business data. Most operational systems of today contain new and complex data types such as large objects (LOBs) that contain audio, image, video, and scanned documents. Many systems store data that contains its own description in extensible markup language (XML). Other systems contain web click-streams or machine sensor readings. Big data means re-visiting current business data models and integrating data with dissimilar data types. These data will also exist across multiple hardware platforms on your network.
Budgeting for this scaling out means budgeting for staff time. You will need to understand data across architectures, and this means standardizing documentation, developing best practices, and potentially launching an enterprise-wide data integration effort.
Budgeting for Non-Production Environments
Most IT shops have multiple non-production environments. Some are used for application development, some for testing, and others for user acceptance testing. How many of these will contain a non-production version of the big data application?
Big data hardware and software tend to be expensive, and testing analytics usually requires a full-sized data store. For example, you won't get a useful analysis of customer purchasing habits in a test system having fewer than ten customers.
Budget for non-production by choosing a single environment to contain your big data application. The most common choice is the pre-production or user acceptance testing environment. These environments usually contain nearly full-sized databases. Another alternative is to re-purpose one non-production environment as a disaster recovery location, allowing your big data application to do double duty.
Budgeting for Staff Training
Big data applications come with new technologies and, hence, requirements for new areas of staff expertise. Some of these requirements include:
- Support for installing, configuring, tuning, and upgrading any special-purpose hardware or software such as a big data appliance.
- Support for business intelligence analytics, including query performance tuning, debugging, and perhaps experience in new analytics software.
- Knowledge of several related operational systems, especially any from which source data will be pulled into your big data application.
Here, staff budgeting will include not only training for current staff, but perhaps also acquisition of new employees or consulting services. Early requirements include the specifics of implementing the new big data hardware, software, and application. Soon after that you will need a support organization to assist analytics users, as well as staff to handle performance monitoring, tuning, and capacity planning.
Budgeting for Operational Systems
Why worry about current operational applications? Don't they already work? Perhaps they do, but the important point is that these systems are the primary data source for your big data application. Data extracted from operational systems can be incomplete and inaccurate. Enterprise data warehouse architects know this: they have already put in place so-called extract, transform, and load (ETL) processes that retrieve operational data and remove or "clean" invalid data element values.
The same or similar processes must be implemented to populate the big data application; therefore, to be successful you must have a complete understanding of your data. Unfortunately, most operational systems staff will be unfamiliar with big data or analytics, and big data staff experts will not know current systems.
Budget for this by planning regular collaborative meetings. Review current systems and store architecture and data flow documentation in a shared place for continual review. Pay particular attention to your data warehouse staff, if any. These professionals must already be aware of most operational systems data, since this is their source of data for the warehouse. They will also be aware of operational data issues, and usually have implemented fixes as part of the extract, transform, and load process.
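To make the cleaning step of an extract, transform, and load process concrete, here is a minimal sketch. It assumes records arrive as dictionaries; the field names in REQUIRED_FIELDS and the rules (trim whitespace, drop incomplete rows, drop rows with a non-numeric amount) are hypothetical examples, not rules from any particular warehouse. Real rules come from the staff who know the operational systems.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical schema: fields every record must carry to be usable.
REQUIRED_FIELDS = ("customer_id", "amount", "sale_date")


def clean_record(raw: dict) -> Optional[dict]:
    """Return a cleaned copy of `raw`, or None if it cannot be repaired."""
    # Trim stray whitespace from every string value.
    record = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    # Reject incomplete records: a required field is missing or empty.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return None
    # Reject inaccurate records: the amount must parse as a number.
    try:
        record["amount"] = float(record["amount"])
    except (TypeError, ValueError):
        return None
    return record


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only the records that survive cleaning."""
    for raw in rows:
        cleaned = clean_record(raw)
        if cleaned is not None:
            yield cleaned
```

Counting how many records such a step rejects in a trial run is a quick way to gauge how much data-quality work, and therefore staff time, the project will need.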
Budgeting for Purge/Archive
Eventually the data in your big data application will become old, stale or unusable simply because of its age. Even proprietary high-speed data stores can get bogged down with too much data. Further, the big data application is not the system of record; compliance or regulatory requirements for data retention apply to the operational systems and possibly the data warehouse, but not the big data application.
Budget for a robust purge or archive process. This will take time to develop, and purge execution time may be significant, as many vendor solutions are tuned for fast data loading and fast queries rather than fast data removal.
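The core decision in any purge or archive job is an age cutoff. The sketch below shows that decision in isolation, assuming in-memory rows with a hypothetical sale_date field and a hypothetical retention period. An actual job would run against the data store using whatever delete or unload facilities the vendor provides.

```python
from datetime import date, timedelta


def split_for_purge(rows, today, retention_days):
    """Split rows into (keep, archive) lists based on an age cutoff.

    Rows dated before `today - retention_days` go to the archive list.
    The `sale_date` key and the retention period are illustrative assumptions.
    """
    cutoff = today - timedelta(days=retention_days)
    keep, archive = [], []
    for row in rows:
        if row["sale_date"] < cutoff:
            archive.append(row)
        else:
            keep.append(row)
    return keep, archive


# Example with assumed data: keep 90 days of transactions.
rows = [
    {"id": 1, "sale_date": date(2024, 3, 31)},
    {"id": 2, "sale_date": date(2024, 4, 1)},
    {"id": 3, "sale_date": date(2024, 6, 1)},
]
keep, archive = split_for_purge(rows, today=date(2024, 6, 30), retention_days=90)
print(len(keep), "rows kept;", len(archive), "rows archived")
```

Writing the archived rows somewhere before deleting them keeps the purge reversible while the process is still being tuned.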
Budgeting for Disaster Recovery
In general, a disaster recovery environment provides for running operational systems when one or more primary sites are not available. Many IT shops do not include a big data application in disaster recovery planning. Why? Generally speaking, priority is always given to operational systems such as order processing, product delivery, and other customer-facing applications. While big data queries are nice to run, they are not considered mission-critical.
Think again! When IT implements a big data application, there is an initial period of training and low usage. As more analysts run more queries, they see how fast things happen. This gives them the confidence to run even more queries. Valuable one-time queries rapidly expand into daily reports. Multiple lines of business create their own versions of these reports. At some point, enough internal users are so dependent upon their analytics that they label them mission-critical. Suddenly, IT management must provide a disaster recovery solution.
Budget for this ahead of time! The disaster recovery environment should already be configured to run current operational systems. Re-size the configuration by adding in your big data implementation, including an appliance if you are using one. Since appliances are expensive, as noted previously, consider re-purposing your disaster recovery environment as a non-production environment during normal times. This will allow you to get some value from the appliance, rather than letting it sit idle.
Budgeting is a tricky process, balancing current and available staffing, money, and return on investment. The unique budgeting needs for a big data implementation have a common pattern: new processes, new data storage, new analytics. Newness means change, and some IT staff may be resistant to change.
Plan for this by looking ahead. Review the items mentioned above, figure out how they may manifest themselves in your organization, and budget accordingly. Many of the issues mentioned revolve around changes to staff expertise, perhaps even retaining consulting services in the short or long term.