Understanding Data Warehousing: ASP Alliance


Understanding Data Warehousing

page

by Arindam Ghosh
Feedback

Average Rating:

Views (Total / Last 10 Days): 35249/ 59

Introduction

Since the beginning of time reports and information to the masses have been provided. Whether that information was in the form of smoke signals, stone tablets, or hand-written records, information sharing has been around for ages. Today we call this collecting and sharing of information data warehousing. Data warehousing technology comprises a set of new concepts and tools which support the knowledge worker (executive, manager, and analyst) with informational material for decision-making. The fundamental reason for building a data warehouse is to improve the quality of information in an organization. The key issue is the provision of access to a company-wide view of data whenever it resides. Data coming from internal and external sources, existing in a variety of forms from traditional structural data to unstructured data like text files or multimedia is cleaned and integrated into a single repository. A data warehouse (DWH) is the consistent store of this data which is made available to end users in a way they can understand and use in a business context.

A data warehouse is a data repository designed to support the decision-making process for an organization. Unlike with its operational system counterpart, the information can be stored many times in many different locations. Its primary purpose is to provide management with the information it needs in order to make intelligent business decisions. In data warehousing, data is integrated from various, heterogeneous operational systems (like database systems, flat files, etc.) and further external data sources (like demographic and statistical databases etc.). Before the integration; structural and semantic differences have to be reconciled, i.e., data have to be “homogenized” according to a uniform data model. Furthermore, data values from operational systems have to be cleaned in order to get correct data into the data warehouse.

Data warehouses have four distinct characteristics that differentiate them from operational systems. Data warehouses are

Subject oriented: The warehouse is organized around a specific business process such as purchasing.

Integrated: The warehouse is integrated so that we can relate one subject against another, so that, for instance, we could perform purchasing versus sales analysis.

Nonvolatile: The warehouse is static it is not changing like an operational system. We load data on a regular basis into the warehouse, but we do not change data that already exists in the database.

Time based: The basic power of the warehouse is that the information contained in it is based on specific point-in-time loads. The loads may be daily, weekly, monthly, or based on some other time period, but whatever that time is, we include this with the data in the warehouse. The warehouse shows your business information at many points in time, whereas your operational system shows you information at the time you look at the information.

Accessible: The primary purpose of a data warehouse is to provide readily accessible information to end-users.

Process-Oriented: It is important to view data warehousing as a process for the delivery of information. The maintenance of a data warehouse is ongoing and iterative in nature.

A data warehouse system (DWS) comprises the data warehouse and all components used for building, accessing and maintaining the DWH. The center of a data warehouse system is the data warehouse itself. The data import and preparation component is responsible for data acquisition. It includes all programs, applications and legacy systems interfaces that are responsible for extracting data from operational sources, preparing and loading it into the warehouse. The access component includes all different applications (OLAP or data mining applications) that make use of the information stored in the warehouse.

In data warehousing, there are various types of metadata (metadata is defined as data about data or data describing the meaning of data ), e.g., information about the operational sources, the structure and semantics of the DWH data, the tasks performed during the construction, the maintenance and access of a DWH, etc. The need for metadata is well known. Statements like, “A data warehouse without adequate metadata is like a filing cabinet stuffed with papers, but without any folders or labels”, characterize this situation. Thus, the quality of metadata and the resulting quality of information gained using a data warehouse solution are tightly linked.

Comparison of data warehouse and operational data

The data warehouse is distinctly different from the operational data used and maintained by day-to-day operational systems. Data warehousing is not simply an “access wrapper” for operational data, where data is simply “dumped” into tables for direct access. Among the differences:

OPERATIONAL DATA	DW DATA
Application Oriented	Subject Oriented
Detailed	Summarized (or otherwise refined)
Accurate, as of the moment of access	Represents values over time, snapshots
Updated	Not Updated
Run repetitively and non-reflectively	Run heuristically
Requirements for processing not completely understood before development	Requirements for processing understood before initial development
Compatible with the Software Development Life Cycle	Completely different life cycle
Performance sensitive (immediate response required when entering a transaction)	Performance relaxed (immediacy not required)
Accessed a unit at a time (limited number of data elements for a single record)	Accessed a set at a time (many records of many data elements)
Transaction driven	Analysis driven
High availability and not redundant	Relaxed availability and redundancy is a fact of life
Managed in its entirety	Managed by subsets
Static structure; variable contents	Flexible structure
Small amount of data used in a process	Large amount of data used in a process

The Data Warehousing Process

Determine Informational Requirements

Identify and analyze existing informational capabilities.

Identify from key users the significant business questions and key metrics that the target user group regards as their most important requirements for information.

Decompose these metrics into their component parts with specific definitions.

Map the component parts to the informational model and systems of record.

Evolutionary and Iterative Development Process

Start with one subject area (or subset or superset) and one target user group.

Continue and add subject areas, user groups and informational capabilities to the architecture based on the organization’s requirements for information, not technology.

Improvements are made from what was learned from previous increments.

Improvements are made from what was learned about warehouse operation and support.

The technical environment may have changed.

Results are seen very quickly; after each iteration.

The end user requirements are refined after each iteration.

The warehouse populating process

A data warehouse is populated through a series of steps that

Remove data from the source environment (extract).

Change the data to have desired warehouse characteristics like subject-orientation and time-variance (transform).

Place the data into a target environment (load).

Complexity of Transformation and Integration

The extraction of data from the operational environment to the data warehouse environment requires a change in technology.

The selection of data from the operational environment may be very complex.

Data is reformatted.

Data is cleansed.

Multiple input sources of data exist.

Default values need to be supplied.

Summarization of data often needs to be done.

The input records that must be read have “exotic” or nonstandard formats.

Data format conversion must be done.

Massive volumes of input must be accounted for.

Perhaps the worst of all: Data relationships that have been built into old legacy program logic must be understood and unraveled before those files can be used as input.

Implementing a concrete Data Warehousing Software (DWS) is a complex task comprising two major phases. In the DWS configuration phase, a conceptual view of the warehouse is first specified according to user requirements (data warehouse design). Then, the involved data sources and the methods in which data will be extracted and loaded into the warehouse (data acquisition) are determined.

Finally, decisions about persistent storage of the warehouse using database technology and the various ways data will be accessed during analysis are made. After the initial load during the DWS operation phase, warehouse data must be regularly refreshed, i.e., modifications of operational data since the last DWH refreshment must be propagated into the warehouse such that data stored in the DWH reflect the state of the underlying operational systems. Besides DWH refreshment, DWS operation includes further tasks like archiving and purging of DWH data or DWH monitoring.

Data warehouse design

Data warehouse design methods consider the read-oriented characteristic of warehouse data and enable the efficient query processing over large amounts of data. A special type of relational database schemas, called star schema, is often used to model the multiple dimensions of warehouse data (in contrast to the two-dimensional representation of normal relational schemas). In this case, the database consists of a central fact table and several dimension tables. The fact table contains tuples that represent business facts (measures) to be analyzed. Each fact table tuple references multiple dimensional table tuples each one representing a dimension of interest. Since dimension tables are not normalized, joining the fact table with the dimension tables provides different dimensional views of the warehouse data in an efficient way. A variant of the star schema, called the snowflake schema, is commonly used to explicitly represent the dimensional hierarchies by normalizing the dimension tables.

A more natural way to consider multidimensionality of warehouse data is provided by the multidimensional data model. Thereby, the data cube is the basic underlying modeling construct. Special operations like pivoting (rotate the cube), slicing- dicing (select a subset of the cube), roll-up and drill-down (increasing and decreasing the level of aggregation) have been proposed in this context. For the implementation of multidimensional databases, there are two main approaches:

In the first approach, extended relational DBMSs, called relational OLAP (ROLAP) servers, use a relational database to implement the multidimensional model and operations. ROLAP servers provide SQL extensions and translate data cube operations to relational queries.

In the second approach, multidimensional OLAP (MOLAP) servers store multidimensional data in non-relational specialized storage structures. These systems usually pre-compute the results of complex operations in order to increase performance.

Back End Tools and Utilities

Data warehousing systems use a variety of data extraction and cleaning tools, and load and refresh utilities for populating warehouses. Data extraction from “foreign” sources is usually implemented via gateways and standard interfaces (such as Information Builders EDA/SQL, ODBC, Oracle Open Connect, Sybase Enterprise Connect, Informix Enterprise Gateway).

Data Cleaning

Since a data warehouse is used for decision making, it is important that the data in the warehouse be correct. However, since large volumes of data from multiple sources are involved, there is a high probability of errors and anomalies in the data. Therefore, tools that help to detect data anomalies and correct them can have a high payoff. Some examples where data cleaning becomes necessary are: inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violation of integrity constraints. Optional fields in data entry forms are significant sources of inconsistent data. There are three related, classes of data cleaning tools:

Data migration tools allow simple transformation rules to be specified.

Data scrubbing tools use domain-specific knowledge to do the scrubbing of data. Tools such as Integrity and Trillum fall in this category.

Data auditing tools make it possible to discover rules and relationships (or to signal violation of stated rules) by scanning data. Thus, such tools may be considered variants of data mining tools.

Load

After extracting, cleaning and transforming, data must be loaded into the warehouse. Additional preprocessing may still be required: checking integrity constraints; sorting; summarization, aggregation and other computation to build the derived tables stored in the warehouse; building indices and other access paths; and partitioning to multiple target storage areas. Typically, batch load utilities are used for this purpose. In addition to populating the warehouse, a load utility must allow the system administrator to monitor status, to cancel, suspend and resume a load, and to restart after failure with no loss of data integrity. The load utilities for data warehouses have to deal with much larger data volumes than for operational databases.

Refresh

Refreshing a warehouse consists in propagating updates on source data to correspondingly update the base data and derived data stored in the warehouse. There are two sets of issues to consider: when to refresh, and how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily or weekly). Only if some OLAP queries need current data (e.g., up to the minute stock quotes), is it necessary to propagate every update. The refresh policy is set by the warehouse administrator, depending on user needs and may be different for different sources.

Refresh techniques may also depend on the characteristics of the source and the capabilities of the database servers. Extracting an entire source file or database is usually too expensive, but may be the only choice for legacy data sources. Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. Such replication servers can be used to incrementally refresh a warehouse when the sources change. There are two basic replication techniques: data shipping and transaction shipping.

Conclusion

So we can summarize by saying that data warehouse is the distinction between data and information. Data is composed of observable and recordable facts that are often found in operational or transactional systems. In a data warehouse environment, data only comes to have value to end-users when it is organized and presented as information. Information is an integrated collection of facts and is used as the basis for decision making. The data warehouse is that portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information.

Product Spotlight