Data warehousing systems use a variety of data extraction
and cleaning tools, and load and refresh utilities for populating warehouses.
Data extraction from “foreign” sources is usually implemented via gateways and
standard interfaces (such as Information Builders EDA/SQL, ODBC, Oracle Open
Connect, Sybase Enterprise Connect, Informix Enterprise Gateway).
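As a minimal sketch of extraction through such a standard SQL interface — using Python's built-in sqlite3 as a stand-in for a real gateway connection (the table and column names are hypothetical, not from any particular product):

```python
import sqlite3

# In a real deployment this would be an ODBC/gateway connection to the
# "foreign" operational source; an in-memory sqlite3 database stands in here.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# Extraction: pull rows through the standard SQL interface, ready for
# cleaning and transformation downstream.
rows = source.execute("SELECT id, amount FROM orders").fetchall()
print(rows)  # [(1, 9.5), (2, 20.0)]
```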
Data Cleaning
Since a data warehouse is used for decision making, it is
important that the data in the warehouse be correct. However, since large
volumes of data from multiple sources are involved, there is a high probability
of errors and anomalies in the data. Therefore, tools that help to detect data
anomalies and correct them can have a high payoff. Some examples where data cleaning
becomes necessary are inconsistent field lengths, inconsistent descriptions,
inconsistent value assignments, missing entries, and violations of integrity
constraints. Optional fields in data entry forms are significant sources of
inconsistent data. There are three related classes of data cleaning tools:
Data migration tools allow simple
transformation rules to be specified.
Data scrubbing tools use
domain-specific knowledge to scrub the data. Tools such as Integrity
and Trillium fall into this category.
Data auditing tools make it possible
to discover rules and relationships (or to signal violation of stated rules) by
scanning data. Thus, such tools may be considered variants of data mining
tools.
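As an illustration of the first and third classes above — a simple migration-style transformation rule together with an auditing-style check for a stated rule (the field names, value map, and records are hypothetical):

```python
# Hypothetical source records exhibiting two anomalies named above:
# inconsistent value assignments ("M" vs. "male") and a missing entry.
records = [
    {"name": "Ann", "gender": "female"},
    {"name": "Bob", "gender": "M"},
    {"name": "Carol", "gender": None},
]

# Migration-style transformation rule: map inconsistent codes onto one vocabulary.
GENDER_MAP = {"M": "male", "F": "female", "male": "male", "female": "female"}

def clean(record):
    record = dict(record)
    record["gender"] = GENDER_MAP.get(record["gender"])  # unknown/missing -> None
    return record

cleaned = [clean(r) for r in records]

# Auditing-style check: signal violations of a stated rule (gender must be present).
violations = [r["name"] for r in cleaned if r["gender"] is None]
print(violations)  # ['Carol']
```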
Load
After it has been extracted, cleaned, and transformed, the data must be
loaded into the warehouse. Additional preprocessing may still be required:
checking integrity constraints; sorting; summarization, aggregation and other
computation to build the derived tables stored in the warehouse; building
indices and other access paths; and partitioning to multiple target storage
areas. Typically, batch load utilities are used for this purpose. In addition
to populating the warehouse, a load utility must allow the system administrator
to monitor status, to cancel, suspend and resume a load, and to restart after
failure with no loss of data integrity. The load utilities for data warehouses
have to deal with much larger data volumes than for operational databases.
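The restart requirement might be sketched as follows — a toy checkpointing scheme in which each chunk commits atomically and the checkpoint advances only after a successful commit (the table, chunk size, and simulated failure point are illustrative assumptions):

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

batch = [(i, float(i)) for i in range(1, 101)]  # rows to load
checkpoint = 0  # persisted by a real load utility so a failed load can resume

def load(rows, start):
    """Load rows[start:] in committed chunks; return the new checkpoint."""
    CHUNK = 25
    for i in range(start, len(rows), CHUNK):
        with warehouse:  # each chunk commits atomically
            warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                                  rows[i:i + CHUNK])
        start = i + CHUNK  # advance only after a successful commit
    return start

# Simulate a failure after half the batch, then restart from the checkpoint:
# the load resumes at row 51 with no duplicates and no loss of integrity.
checkpoint = load(batch[:50], checkpoint)
checkpoint = load(batch, checkpoint)
count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 100
```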
Refresh
Refreshing a warehouse consists of propagating updates from the source data
to the corresponding base data and derived data stored in
the warehouse. There are two sets of issues to consider: when to refresh, and
how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily
or weekly). Only if some OLAP queries need current data (e.g., up-to-the-minute
stock quotes) is it necessary to propagate every update. The refresh policy is
set by the warehouse administrator, depending on user needs, and may be
different for different sources.
Refresh techniques may also depend on the characteristics of
the source and the capabilities of the database servers. Extracting an entire
source file or database is usually too expensive, but may be the only choice
for legacy data sources. Most contemporary database systems provide replication
servers that support incremental techniques for propagating updates from a
primary database to one or more replicas. Such replication servers can be used
to incrementally refresh a warehouse when the sources change. There are two
basic replication techniques: data shipping and transaction shipping.
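Data shipping might be sketched as applying a log of changed rows, captured at the primary, to the warehouse's copy of a source table (the schema, keys, and change-log format here are hypothetical):

```python
# Warehouse copy of a source table, keyed by row id.
warehouse_copy = {1: {"price": 10.0}, 2: {"price": 20.0}}

# Change log captured at the primary since the last refresh. With data
# shipping, the modified row images themselves are shipped (with transaction
# shipping, the transactions would be replayed instead).
changes = [
    ("update", 2, {"price": 21.5}),
    ("insert", 3, {"price": 5.0}),
    ("delete", 1, None),
]

def refresh(copy, change_log):
    """Incrementally apply shipped rows instead of re-extracting the source."""
    for op, key, row in change_log:
        if op == "delete":
            copy.pop(key, None)
        else:  # insert or update: install the shipped row image
            copy[key] = row
    return copy

refresh(warehouse_copy, changes)
print(warehouse_copy)  # {2: {'price': 21.5}, 3: {'price': 5.0}}
```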