Data Warehouse

Published on December 2016

Data Warehouse Definition
Different people define a data warehouse differently. The most popular definition came from Bill Inmon:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even further back. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. Historical data in a data warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

This is a functional view of a data warehouse. Unlike Inmon, Kimball did not address how the data warehouse is built; he focused on what it is used for.
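The time-variant and non-volatile properties can be illustrated with a minimal sketch. The class name, method names and sample addresses below are hypothetical and chosen only for illustration; the point is that rows are appended and never overwritten, so every address a customer has had stays retrievable.

```python
class AddressHistory:
    """Append-only store: rows are added but never updated or deleted."""

    def __init__(self):
        self._rows = []  # each row is (customer_id, valid_from, address)

    def record(self, customer_id, valid_from, address):
        # valid_from is an ISO 'YYYY-MM-DD' string, so string order is date order.
        self._rows.append((customer_id, valid_from, address))

    def all_addresses(self, customer_id):
        """Every address ever recorded for the customer, oldest first."""
        rows = sorted(r for r in self._rows if r[0] == customer_id)
        return [address for _, _, address in rows]

    def address_as_of(self, customer_id, date):
        """The address in effect on the given date, or None if none yet."""
        current = None
        for cid, valid_from, address in sorted(self._rows):
            if cid == customer_id and valid_from <= date:
                current = address
        return current


history = AddressHistory()
history.record("C1", "2015-01-10", "12 Oak St")
history.record("C1", "2016-06-01", "98 Elm Ave")
```

A transaction system would typically keep only the latest address; here a call such as `history.address_as_of("C1", "2015-12-31")` can still answer a question about the past.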
There are three types of data warehouses:
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity. One typical application of the ODS is to hold recent data before migration to the data warehouse. An ODS is not conceptually equivalent to a data warehouse, although it does store data with a deeper history than the OLTP data.
3. Data Mart - A data mart is a subset of a data warehouse that supports a particular region, business unit or business function.

Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables. This is most useful for users accessing the data, since the database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse. In an independent data mart, data is collected directly from sources.
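As a rough illustration of a fact table connected to dimension tables, the sketch below builds a tiny star schema in an in-memory SQLite database and "slices" it along one dimension. All table names, column names and figures are invented for the example and do not come from any particular product.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, region   TEXT);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        region_key  INTEGER REFERENCES dim_region(region_key),
        amount      REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
    INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales  VALUES
        (1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0), (2, 2, 15.0);
""")


def slice_by_region(region):
    """Slice the cube: fix one region and total the sales per category."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_region  r ON r.region_key  = f.region_key
        WHERE r.region = ?
        GROUP BY p.category
        ORDER BY p.category
        """,
        (region,),
    ).fetchall()
    return dict(rows)


north = slice_by_region("North")
```

Fixing a different dimension value, or grouping by region instead of category, gives another view of the same cube without changing the stored facts.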

Data warehouse

From Wikipedia, the free encyclopedia

A data warehouse is a repository (a collection of resources that can be accessed to retrieve information) of an organization's electronically stored data, designed to facilitate reporting and analysis [1]. This definition of the data warehouse focuses on data storage. Data from the main sources are cleaned, transformed and cataloged, and made available to managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated analysis and reporting of its data, at different levels of aggregation. The practical reality of most organizations is that their data infrastructure is made up of a collection of heterogeneous systems. For example, an organization might have one system that handles customer relationships, a system that handles employees, systems that handle sales data or production data, yet another system for finance and budgeting data, and so on. In practice, these systems are often poorly integrated or not integrated at all, and simple questions like "How much time did salesperson A spend on customer C, how much did we sell to customer C, was customer C happy with the provided service, did customer C pay his bills?" can be very hard to answer, even though the information is available "somewhere" in the different data systems.

Another problem is that enterprise resource planning (ERP) systems are designed to support relevant operations. For example, a finance system might keep track of every single stamp bought: when it was ordered, when it was delivered and when it was paid for. The system might also enforce accounting principles (like double-entry bookkeeping) that further complicate the data model. Such information is great for the person in charge of buying stamps or the accountant trying to sort out an irregularity, but the CEO is not interested in such detail. The CEO wants answers to questions like "What's the cost?", "What's the revenue?" and "Did our latest initiative reduce costs?", and wants this information at an aggregated level.

Yet another problem might be that the organization is, internally, in disagreement about which data are correct. For example, the sales department might have one view of its costs, while the finance department has another view of that cost. In such cases the organization can spend unlimited time discussing who has the correct view of the data.

It is partly the purpose of data warehousing to bridge such problems. It is important to note that in data warehousing the source data systems are considered as given. Even though a data source system might have been built in such a manner that it is difficult to extract integrated information, the "data warehousing answer" is not to redesign the data source systems but rather to make the data appear consistent, integrated and consolidated despite the problems in the underlying source systems. Data warehousing achieves this by employing different data warehousing techniques, creating one or more new data repositories (i.e. the data warehouse) whose data model(s) support the needed reporting and analysis.

History
The concept of data warehousing dates back to the late 1980s [2] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

Key developments in the early years of data warehousing were:

- 1960s: General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3] (Alfred Hitchcock's movie Psycho is released) (Echo I, the first communications satellite, launched) (Muhammad Ali wins the gold medal at the Olympics) (Assassination of John Kennedy) (The Internet's symbolic birth date: publication of RFC 1) (Neil Armstrong lands on the moon) (Creation of the early computer game Spacewar!) (The creation of the computer mouse)
- 1970s: ACNielsen and IRI provide dimensional data marts for retail sales.[3] (6th of October War & Camp David Treaty) (Watergate scandal & Richard Nixon's resignation) (Billie Jean King beats Bobby Riggs in the "Battle of the Sexes" tennis match) (OPEC oil embargo) (Intel enters the scene with the first dynamic RAM chip)
- 1988: Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse". (Microsoft releases Windows 2.1) (Iran-Iraq war ends) (The Morris worm, the first computer worm distributed via the Internet, written by Robert Tappan Morris, launched from MIT) (Naguib Mahfouz wins the Nobel Prize in Literature)
- 1991: Bill Inmon publishes the book Building the Data Warehouse. (The first Gulf war) (Germany formally regains full sovereignty) (Tim Berners-Lee announces the WWW project & software on the alt.hypertext newsgroup) (Collapse of the Soviet Union) (Salem Express ferry sinks in the Red Sea)
- 1996: Ralph Kimball publishes the book The Data Warehouse Toolkit. (The first version of the Java programming language is released) (Chess computer Deep Blue defeats world chess champion Garry Kasparov for the first time) (General Motors' first electric car, the EV1, is launched)
- 2000: Daniel Linstedt releases the Data Vault, enabling real-time auditable data warehouses. (Y2K problem) (Microsoft is ruled to have violated US antitrust laws by keeping an oppressive thumb on its competitors) (405 The Movie, the first short film widely distributed on the Internet, is released)

Architecture
Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture, but rather there are multiple architectures that exist to support various environments and situations. The worthiness of the architecture can be judged from how the conceptualization aids in the building, maintenance, and usage of the data warehouse. One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer: The source data for the data warehouse. An organization's Enterprise Resource Planning systems fall into this layer.
Data access layer: The interface between the operational and informational access layer. Tools to extract, transform, load data into the warehouse fall into this layer.
Metadata layer: The data directory. This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Informational access layer: The data accessed for reporting and analyzing and the tools for reporting and analyzing data. Business intelligence tools fall into this layer. The Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.

Conforming information
Another important decision in designing a data warehouse is which data to conform and how to conform them. For example, one operational system feeding data into the data warehouse may use "M" and "F" to denote the sex of an employee while another operational system may use "Male" and "Female". Though this is a simple example, much of the work in implementing a data warehouse is devoted to making data with similar meaning consistent when they are stored in the data warehouse. Typically, extract, transform, load tools are used in this work. Master Data Management has the aim of conforming data that could be considered "dimensions".
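The "M"/"F" versus "Male"/"Female" case can be sketched as a small transform step. The mapping table, the function names and the "Unknown" fallback are assumptions made for this illustration, not part of any specific ETL tool.

```python
# Hypothetical mapping from source-system codes to one conformed value.
CONFORMED_SEX = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}


def conform_sex(raw):
    """Return the conformed value, or 'Unknown' for missing or unmapped codes."""
    if raw is None:
        return "Unknown"
    return CONFORMED_SEX.get(raw.strip().lower(), "Unknown")


def conform_rows(rows):
    """Return copies of the rows with the 'sex' field conformed."""
    return [{**row, "sex": conform_sex(row.get("sex"))} for row in rows]


system_a = [{"employee": "E1", "sex": "M"}]
system_b = [{"employee": "E2", "sex": "Female"}]
conformed = conform_rows(system_a + system_b)
```

Routing unmapped codes to an explicit "Unknown" value, rather than dropping the row, is one possible design choice; it keeps the inconsistency visible so it can be resolved before loading.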

Normalized versus dimensional approach for storage of data
There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach.

In a dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
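To make the split between facts and dimensions concrete, the sketch below breaks one flat sales record into a fact row and dimension lookups. The field names, the choice of three dimensions and the simple integer surrogate keys are assumptions for illustration only.

```python
DIMENSIONS = ("order_date", "customer_name", "product_number")


def split_transaction(txn, dimension_tables):
    """Split one flat sales record into a fact row plus dimension entries.

    dimension_tables maps a dimension name to {value: surrogate_key} and is
    updated in place as new dimension values are seen.
    """
    # Facts: the numeric measures of the transaction.
    fact = {"quantity": txn["quantity"], "price_paid": txn["price_paid"]}
    for name in DIMENSIONS:
        table = dimension_tables.setdefault(name, {})
        # Assign the next integer key the first time a value is seen.
        key = table.setdefault(txn[name], len(table) + 1)
        fact[name + "_key"] = key
    return fact


dims = {}
fact1 = split_transaction(
    {"quantity": 2, "price_paid": 19.98, "order_date": "2016-12-01",
     "customer_name": "Acme", "product_number": "P-100"},
    dims,
)
fact2 = split_transaction(
    {"quantity": 1, "price_paid": 5.00, "order_date": "2016-12-01",
     "customer_name": "Acme", "product_number": "P-200"},
    dims,
)
```

Both fact rows point at the same customer and date keys, which is what lets a query group the numeric facts by any of the dimensions.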

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to:
1. join data from different sources into meaningful information and then
2. access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree.

Top-down versus bottom-up design methodologies
Bottom-up design

Ralph Kimball, a well-known author on data warehousing,[4] is a proponent of an approach to data warehouse design which he describes as bottom-up.[5] In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. It is important to note, though, that in the Kimball methodology the bottom-up process is the result of an initial business-oriented top-down analysis of the relevant business processes to be modelled.

Data marts contain, primarily, dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. A single data mart often models a specific business area such as "Sales" or "Production". These data marts can eventually be integrated to create a comprehensive data warehouse. The integration of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture".[6] The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions, which are dimensions that are shared (in a specific way) between facts in two or more data marts.

The integration of the data marts in the data warehouse is centered on the conformed dimensions (residing in "the bus") that define the possible integration "points" between data marts. The actual integration of two or more data marts is then done by a process known as "drill across". A drill-across works by grouping (summarizing) the data along the keys of the (shared) conformed dimensions of each fact participating in the drill-across, followed by a join on the keys of these grouped (summarized) facts.

Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".

Some consider it an advantage of the Kimball method that the data warehouse ends up being "segmented" into a number of logically self-contained (up to and including the bus) and consistent data marts, rather than a big and often complex centralized model. Business value can be returned as quickly as the first data marts can be created, and the method lends itself well to an exploratory and iterative approach to building data warehouses. For example, the data warehousing effort might start in the "Sales" department, by building a Sales data mart. Upon completion of the Sales data mart, the business might then decide to expand the warehousing activities into, say, the "Production" department, resulting in a Production data mart. The requirement for the Sales data mart and the Production data mart to be integrable is that they share the same bus, that is, that the data warehousing team has made the effort to identify and implement the conformed dimensions in the bus, and that the individual data marts link to that information from the bus. Note that this does not require 100% awareness from the onset of the data warehousing effort; no master plan is required upfront. The Sales data mart is good as it is (assuming that the bus is complete) and the Production data mart can be constructed virtually independently of the Sales data mart (but not independently of the bus).

If integration via the bus is achieved, the data warehouse, through its two data marts, will not only be able to deliver the specific information that the individual data marts are designed to deliver, in this example either "Sales" or "Production" information, but can also deliver integrated Sales-Production information, which is often of critical business value. This integration can be achieved in a flexible and iterative fashion.
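The drill-across described above, grouping each fact along a conformed dimension key and then joining the grouped results, can be sketched as follows. The fact rows, the `product_key` dimension and the measure names are hypothetical, and the join shown keeps only keys present in both marts; other join types are equally possible.

```python
from collections import defaultdict


def summarize(fact_rows, key, measure):
    """Group fact rows by a conformed dimension key and total one measure."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def drill_across(sales_rows, production_rows, key="product_key"):
    """Summarize each mart on the shared key, then join the summaries."""
    sold = summarize(sales_rows, key, "units_sold")
    produced = summarize(production_rows, key, "units_produced")
    shared_keys = sorted(sold.keys() & produced.keys())
    return {
        k: {"units_sold": sold[k], "units_produced": produced[k]}
        for k in shared_keys
    }


sales_mart = [
    {"product_key": 1, "units_sold": 3},
    {"product_key": 1, "units_sold": 4},
    {"product_key": 2, "units_sold": 1},
]
production_mart = [
    {"product_key": 1, "units_produced": 10},
    {"product_key": 3, "units_produced": 6},
]
combined = drill_across(sales_mart, production_mart)
```

The join only makes sense because both marts use the same `product_key` values for the same products, which is exactly what a conformed dimension guarantees.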
Top-down design

Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise.[6] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. Inmon states that the data warehouse is:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Non-volatile: Data in the data warehouse are never over-written or deleted; once committed, the data are static, read-only, and retained for future reporting.
Integrated: The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent.
Time-variant: Changes to the data in the data warehouse are tracked and recorded so that reports can show changes over time.

The top-down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the time from the start of the project to the point at which end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.[6]
Hybrid design

Over time it has become apparent to proponents of bottom-up and top-down data warehouse design that both methodologies have benefits and risks. Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.

Data warehouses versus operational systems
Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow database normalization rules, which originate in Codd's work on the relational model, in order to ensure data integrity. Normalization is usually described as a series of increasingly stringent normal forms. Fully normalized database designs often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.

Data warehouses are optimized for speed of data analysis. Frequently data in data warehouses are denormalized via a dimension-based model. Also, to speed data retrieval, data warehouse data are often stored multiple times: in their most granular form and in summarized forms called aggregates. Data warehouse data are gathered from the operational systems and held in the data warehouse even after the data have been purged from the operational systems.
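The idea of storing data both in granular form and as aggregates can be shown with a short sketch. The daily rows and the monthly grain are made-up examples; a real warehouse would usually build such summaries during loading.

```python
from collections import defaultdict


def build_monthly_aggregate(daily_rows):
    """Summarize granular (date, amount) rows into per-month totals."""
    totals = defaultdict(float)
    for date, amount in daily_rows:
        totals[date[:7]] += amount  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(totals)


# Granular form: one row per day.
daily_sales = [
    ("2016-11-30", 5.0),
    ("2016-12-01", 10.0),
    ("2016-12-02", 2.5),
]

# Summarized form, stored alongside the granular rows.
monthly_sales = build_monthly_aggregate(daily_sales)
```

A monthly report can then read `monthly_sales` directly instead of scanning every daily row, while the granular rows remain available for detailed questions.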

Evolution in organization use
Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

Off line Operational Database: Data warehouses in this initial stage are developed by simply copying the data off an operational system to another server where the processing load of reporting against the copied data does not impact the operational system's performance.
Off line Data Warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data are stored in a data structure designed to facilitate reporting.
Real Time Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction (e.g. an order or a delivery or a booking).
Integrated Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction.

Benefits
Some of the benefits that a data warehouse provides are as follows:[7][8]

- A data warehouse provides a common data model for all data of interest regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
- Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users so that, even if the source system data are purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

Disadvantages
There are also disadvantages to using a data warehouse. Some of them are:

- Data warehouses are not the optimal environment for unstructured data.
- Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
- Over their life, data warehouses can have high costs.
- Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
- There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems.

Sample applications
Some of the applications data warehousing can be used for are:

- Credit card churn analysis
- Insurance fraud analysis
- Call record analysis
- Logistics management
- Agriculture [9]

The future
Data warehousing, like any technology, has a history of innovations that did not receive market acceptance.[10] A 2009 Gartner Group paper predicted these developments in the business intelligence/data warehousing market:[11]

- Because of lack of information, processes, and tools, through 2012, more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.
- By 2012, business units will control at least 40 percent of the total budget for business intelligence.
- By 2010, 20 percent of organizations will have an industry-specific analytic application delivered via software as a service as a standard component of their business intelligence portfolio.
- In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities.
- By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.

A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.[1] Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.

Terminology
In practice, the data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. In general, a data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need. One writer, Marc Demarest, suggests combining the ideas into a Universal Data Architecture (UDA). In practice, many products and companies offering data warehouse services also tend to offer data mart capabilities or services.

There can be multiple data marts inside a single corporation, each one relevant to one or more business units for which it was designed. Data marts may or may not be dependent on or related to other data marts in a single corporation. If the data marts are designed using conformed facts and dimensions, then they will be related. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.[2] This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

The related term spreadmart describes the situation that occurs when one or more business analysts develop a system of linked spreadsheets to perform a business analysis, then grow it to a size and degree of complexity that makes it nearly impossible to maintain.

Design schemas

- Star schema, or dimensional model, is a fairly popular design choice, as it enables a relational database to emulate the analytical functionality of a multidimensional database.
- Snowflake schema

Reasons for creating a data mart

- Easy access to frequently needed data
- Creates a collective view for a group of users
- Improves end-user response time
- Ease of creation
- Lower cost than implementing a full data warehouse
- Potential users are more clearly defined than in a full data warehouse

Dependent data mart
According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:

- A need for a special data model or schema: e.g., to restructure for OLAP
- Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.
- Security: to separate an authorized data subset selectively
- Expediency: to bypass the data governance and authorizations required to incorporate a new application on the Enterprise Data Warehouse
- Proving Ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the Enterprise Data Warehouse
- Politics: a coping strategy for IT (Information Technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.
- Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.
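The difference between a logical subset (view) and a physical subset (extract) can be sketched with an in-memory SQLite database. The table names, columns and rows are invented for the example; the behavior shown is standard SQL view and CREATE TABLE AS semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (dept TEXT, product TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('marketing', 'A', 10.0),
        ('finance',   'B', 7.0),
        ('marketing', 'C', 3.0);

    -- Logical subset: a view that always reflects the warehouse.
    CREATE VIEW mart_marketing_view AS
        SELECT product, amount FROM warehouse_sales WHERE dept = 'marketing';

    -- Physical subset: an extract copied at a point in time.
    CREATE TABLE mart_marketing_extract AS
        SELECT product, amount FROM warehouse_sales WHERE dept = 'marketing';
""")

# A new row arrives in the warehouse after the extract was taken.
conn.execute("INSERT INTO warehouse_sales VALUES ('marketing', 'D', 1.0)")

view_count = conn.execute(
    "SELECT COUNT(*) FROM mart_marketing_view").fetchone()[0]
extract_count = conn.execute(
    "SELECT COUNT(*) FROM mart_marketing_extract").fetchone()[0]
```

The view sees the new row immediately, while the extract stays as it was until it is refreshed; that trade-off is part of why an extract can be moved to a separate computer for performance.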

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data.

Data mining is the process of extracting patterns from data. Modern businesses increasingly see it as an important tool for transforming data into an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. The related terms data dredging, data fishing and data snooping refer to the use of data mining techniques to sample portions of the larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered (see also data-snooping bias). These techniques can, however, be used to create new hypotheses to test against the larger data populations.

Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[1]

It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)

A primary reason for using data mining is to assist in the analysis of collections of observations of behaviour. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the subset(s) of data being analysed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviours that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as Choice Modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.

There have been some efforts to define standards for data mining, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving standards; later versions of these standards are under development. Independent of these standardization efforts, freely available open-source software systems like the R Project, Weka, KNIME, RapidMiner and others have become an informal standard for defining data-mining processes. Notably, all these systems are able to import and export models in PMML (Predictive Model Markup Language), which provides a standard way to represent data mining models so that these can be shared between different statistical applications.[2] PMML is an XML-based language developed by the Data Mining Group (DMG),[3] an independent group composed of many data mining companies. PMML version 4.0 was released in June 2009.[3][4][5]
Research and evolution

In addition to industry-driven demand for standards and interoperability, professional and academic activity have also made considerable contributions to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarises the results of a literature survey which traces and analyzes this evolution.[6]

The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[7] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[8] Other computer science conferences on data mining include:
- DMIN: International Conference on Data Mining[9]
- DMKD: Research Issues on Data Mining and Knowledge Discovery
- ECML-PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
- ICDM: IEEE International Conference on Data Mining[10]
- MLDM: Machine Learning and Data Mining in Pattern Recognition
- SDM: SIAM International Conference on Data Mining
- EDM: International Conference on Educational Data Mining
- ECDM: European Conference on Data Mining
- PAKDD: The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining

Process
Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse. Pre-processing is essential for analysing multivariate datasets before clustering or data mining.

The target set is then cleaned. Cleaning removes the observations with noise and missing data. The clean data are reduced into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. For example, a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The features selected will depend on the objectives; selecting the "right" features is fundamental to successful data mining.

The feature vectors are divided into two sets, the "training set" and the "test set". The training set is used to "train" the data mining algorithms, while the test set is used to verify the accuracy of any patterns found.
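The cleaning and splitting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None, noise filtering is omitted, and the function names, the 25% test fraction and the sample observations are all hypothetical.

```python
import random

def clean(observations):
    """Drop observations that contain a missing value (None)."""
    return [obs for obs in observations if None not in obs]

def split(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the feature vectors and return (training, test)."""
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical two-feature observations; two of them have missing values.
raw = [(1.0, 2.0), (None, 3.0), (4.0, 5.0), (6.0, None), (7.0, 8.0), (9.0, 1.0)]
cleaned = clean(raw)
training_set, test_set = split(cleaned)
print(len(cleaned), len(training_set), len(test_set))
```

Shuffling before splitting avoids a test set that merely reflects the order in which the data happened to be collected.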
Data mining

Data mining commonly involves four classes of tasks:[11]
- Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
- Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
- Regression: attempts to find a function which models the data with the least error.
- Association rule learning: searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
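As an illustration of the supermarket example, the sketch below computes two measures commonly used in association rule learning: support (how often a set of items appears together) and confidence (how often the consequent appears in baskets that contain the antecedent). The baskets and product names are hypothetical, and real systems use dedicated algorithms to search for candidate rules rather than checking a single rule by hand.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate how often `consequent` is bought given `antecedent` is bought."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets, one set of products per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"jam", "milk"},
]
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(round(rule_confidence, 2))
```

Here four of the five baskets that contain bread also contain butter, so the rule "bread implies butter" holds for four out of five bread buyers in this toy data.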

See also structured data analysis.

Results validation

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained. The accuracy of these patterns can then be measured from how many emails they correctly classify. A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.

If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards, then the final step is to interpret the learnt patterns and turn them into knowledge.
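The spam example can be sketched as follows. The "algorithm" here is a deliberately naive stand-in chosen for brevity (it flags any word seen only in spam training emails), and the emails, labels and function names are hypothetical. The point is only the evaluation pattern: measure accuracy on the training set and on a held-out test set, and compare.

```python
def train(training_set):
    """Learn the set of words that appear only in spam training emails."""
    spam_words, ham_words = set(), set()
    for text, is_spam in training_set:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words

def predict(spam_only_words, text):
    """Flag an email as spam if it contains any spam-only word."""
    return any(word in spam_only_words for word in text.lower().split())

def accuracy(spam_only_words, labelled_emails):
    """Fraction of emails whose predicted label matches the known label."""
    correct = sum(predict(spam_only_words, text) == is_spam
                  for text, is_spam in labelled_emails)
    return correct / len(labelled_emails)

# Hypothetical labelled emails: (text, is_spam).
training_set = [
    ("win a free prize now", True),
    ("free offer click now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
test_set = [
    ("claim your free prize", True),
    ("agenda for lunch", False),
    ("limited offer ends today", True),
    ("project prize ceremony on monday", False),
]

model = train(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, test_set)
print(train_accuracy, test_accuracy)
```

The model classifies every training email correctly but misjudges one legitimate test email that happens to contain the word "prize", so its test accuracy is lower than its training accuracy. That gap is the symptom of overfitting that the held-out test set is meant to expose.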

Notable uses
Games

Since the early 1960s, the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes and hex), has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), is used to yield insightful patterns. Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.
Business

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimise resources across campaigns so that one may predict which channel and which offer an individual is most likely to respond to, across all potential offers. Additionally, sophisticated applications could be used to automate the mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can automatically send either an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognise that the number of predictive models can quickly become very large. Rather than one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers. And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.

Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.[12]

Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.

Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people who play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalog marketing industry. Catalogers have a rich history of customer transactions on millions of customers dating back several years. Data mining tools can identify patterns among customers and help identify the customers most likely to respond to upcoming mailing campaigns.

Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing."[13] The paper describes the application of data mining and decision analysis to the problem of die-level functional test. The experiments it mentions demonstrate the ability of a system that mines historical die-test data to create a probabilistic model of patterns of die failure, which is then used to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.
Science and engineering

In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human DNA sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. This is very important to help improve the diagnosis, prevention and treatment of the diseases. The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.[14]

In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering such as the self-organizing map (SOM) has been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions will naturally generate different signals. However, there was considerable variability amongst normal condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.[15] Data mining techniques have also been applied for dissolved gas analysis (DGA) on power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as the Duval Triangle.[15]

A fourth area of application for data mining in science and engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning[16] and to understand the factors influencing university student retention.[17] A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalised and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of applying data mining techniques are biomedical data facilitated by domain ontologies,[18] mining clinical trial data,[19] and traffic analysis using SOM.[20]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents.[21]

Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.[22]
Spatial data mining

Spatial data mining is the application of data mining techniques to spatial data. Spatial data mining follows along the same functions in data mining, with the end objective to find patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data driven inductive approaches to geographical analysis and modeling. Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organisations possessing huge databases with thematic and geographically referenced data begin to realise the huge potential of the information hidden there. Among those organisations are:
- offices requiring analysis or dissemination of geo-referenced statistical data
- public health services searching for explanations of disease clusters
- environmental agencies assessing the impact of changing land-use patterns on climate change
- geo-marketing companies doing customer segmentation based on spatial location.

Challenges

Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components, that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management [23]. Related to this is the range and diversity of geographic data formats, that also presents unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data such as imagery and geo-referenced multi-media [24]. There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han [25] offer the following list of emerging research topics in the field:
- Developing and supporting geographic data warehouses: Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues in spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy and position.
- Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) techniques generally use very simple representations of geographic objects and spatial relationships. Geographic data mining techniques should recognise more complex geographic objects (lines and polygons) and relationships (non-Euclidean distances, direction, connectivity and interaction through attributed geographic space such as terrain). Time needs to be more fully integrated into these geographic representations and relationships.
- Geographic knowledge discovery using diverse data types: GKD techniques should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

Surveillance

Previous U.S. government data mining programs intended to stop terrorism include the Total Information Awareness (TIA) program, Secure Flight (formerly known as Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE[26]), and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[27] These programs have been discontinued due to controversy over whether they violate the Fourth Amendment to the US Constitution, although many programs that were formed under them continue to be funded by different organisations, or under different names.[28] Two plausible data mining techniques in the context of combating terrorism are "pattern mining" and "subject-based data mining".
Pattern mining

"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often means association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer ⇒ crisps (80%)" states that four out of five customers that bought beer also bought crisps. In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity - these patterns might be regarded as small signals in a large ocean of noise."[29][30][31] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search techniques.
Subject-based data mining

"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[30]
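One simple way to picture the search for data "related to that initiating datum" is a breadth-first traversal over a graph of recorded associations. The sketch below is an assumption made for illustration, not a description of any actual program: the link table, the node labels and the hop limit are all hypothetical.

```python
from collections import deque

def related_within(links, start, max_hops):
    """Breadth-first search: map each node reachable from `start` within
    `max_hops` links to its hop distance (the start node is excluded)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in links.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance

# Hypothetical association links between abstract data items.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
print(related_within(links, "A", 2))
```

The hop limit matters in practice because the number of reachable items tends to grow quickly with each additional hop, which is one reason such searches raise the privacy questions discussed in the next section.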

Privacy concerns and ethics
Some people believe that data mining itself is ethically neutral.[32] However, the ways in which data mining can be used can raise questions regarding privacy, legality, and ethics.[33] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[34][35] Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation is when the data are accrued, possibly from various sources, and put together so that they can be analyzed.[36] This is not data mining per se, but a result of the preparation of data before and for the purposes of the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when originally the data were anonymous. It is recommended that an individual is made aware of the following before data are collected:
- the purpose of the data collection and any data mining projects,
- how the data will be used,
- who will be able to mine the data and use them,
- the security surrounding access to the data, and in addition,
- how collected data can be updated.[36]

In the United States, privacy concerns have been somewhat addressed by Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding any information that they provide and its intended future uses by the facility receiving that information. According to an article in Biotech Business Week, "In practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena, says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[37] This underscores the necessity for data anonymity in data aggregation practices. One may additionally modify the data so that they are anonymous, so that individuals may not be readily identified.[36] However, even de-identified data sets can contain enough information to identify individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[38]

http://www.computerworld.com/s/article/70102/The_Story_So_Far?taxonomyId=009

Data Mart Does Not Equal Data Warehouse
Published 07/18/00 on DMReview.com
By William Inmon

"...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.

"You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.

The single most important issue facing the information technology manager this year is whether to build the data warehouse first or the data mart first. The data mart vendors have said that data warehouses are difficult and expensive to build, take a long time to design and develop, require thought and investment, and mandate that the corporation face difficult issues such as integration of legacy data, managing massive volumes of data and cost justifying the entire DSS/data warehouse effort to the management committee. The picture painted by the data mart advocates for building the data warehouse is gloomy. It is also self-serving and incorrect.

The data mart vendors look upon the data warehouse as an obstacle between themselves and the revenue that comes from making sales. Of course, they want to shun the data warehouse. The data warehouse lengthens their sales cycle, regardless of the long-term effect of building a bunch of data marts with no data warehouse. The data mart vendors are selling a very short-term perspective at the expense of long-term architectural success.

The data mart advocates suggest that there may be alternate, much easier paths to DSS success than building a data warehouse. One of those paths is to build several data marts and when they grow big enough, call them a data warehouse, rather than build an actual data warehouse. The data mart advocates argue that the data mart can be built much more quickly and cheaply than a warehouse.
When you build the data mart there is no need for a great amount of organizational hassle or discipline and no concern for the long-term architecture that is created by the data marts. Unfortunately, by avoiding the visceral organizational and design issues of warehousing, the data mart advocates miss much of the point of warehousing. By building an architecture consisting entirely of data marts, the data mart advocates lead the organization into an even larger mess. Instead of messy legacy operational systems, now we have messy legacy operational systems AND messy data marts. Stovepipe data marts and stovepipe DSS applications are what result from building nothing but data marts. There is no integration when all that you build is data marts. And a DSS environment without integration is like a man without a skeletal system: hardly a useful, viable entity.
A Change of Approaches

In the early days of the data warehouse marketplace, the data mart vendors tried to jump on the warehouse gravy train by proclaiming that a data warehouse was the same thing as a data mart. In trade show after trade show, the data mart vendors confused people with what a data warehouse is and what a data mart is. The data mart vendors spread half truths and misinformation about data warehousing. The result was confusion. The obfuscation sowed by the data mart vendors caused a few confused customers to build data marts with no actual warehouse. After about the third data mart, the customer discovered something was rotten in Denmark. The architectural deficiency of building nothing but data marts was unmasked. The customer discovered that when you don't build a data warehouse, there is:
- massive redundancy of detailed and historical data from one data mart to another,
- inconsistent and irreconcilable results from one data mart to the next,
- an unmanageable interface between the data marts and the legacy application environment, etc.

In short order, the world discovered that a DSS environment without a data warehouse was an extremely unsatisfactory thing. Now that the world has found that building data marts is not the proper way to proceed in DSS, the data mart vendors and their spokesmen are back again and are sowing a different brand of confusion. This time they have altered their original words a little and have promised a new and improved path to easy success. In a slight twist of concept from the first time around, the notion now being spread is that a data warehouse is merely a collection of integrated data marts (whatever that is). The notion that multiple data marts can be integrated is oxymoronic. The whole essence of data marts is that mart users do their own thing so that they don't have to integrate with other marts.

Simply stated, for a variety of very powerful reasons, you cannot build data marts, watch them grow and magically turn them into a data warehouse when they reach a certain size. And by the same token, integrating data across data marts is equally unthinkable because each department that owns its own data mart has its own unique specifications. In order to understand why one or more data marts cannot be transformed into a data warehouse, you must first understand what a data mart is and what a data warehouse is.
Different Architectural Structures

A data mart and a data warehouse are essentially different architectural structures, even though when viewed from afar and superficially, they look to be very similar.
What is a Data Mart?

A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Finance has their data mart, marketing has theirs, sales has theirs and so on. And the data mart for marketing only faintly resembles anyone else's data mart.

Perhaps most importantly, the individual departments OWN the hardware, software, data and programs that constitute the data mart. The rights of ownership allow the departments to bypass any means of control or discipline that might coordinate the data found in the different departments. Each department has its own interpretation of what a data mart should look like and each department's data mart is peculiar to and specific to its own needs.

Typically, the database design for a data mart is built around a star-join structure that is optimal for the needs of the users found in the department. In order to shape the star join, the requirements of the users for the department must be gathered. The data mart contains only a modicum of historical information and is granular only to the point that it suits the needs of the department. The data mart is typically housed in multidimensional technology which is great for flexibility of analysis but is not optimal for large amounts of data. Data found in data marts is highly indexed.

There are two kinds of data marts--dependent and independent. A dependent data mart is one whose source is a data warehouse. An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source--the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment. Dependent data marts are architecturally and structurally sound. Independent data marts are unstable and architecturally unsound, at least for the long haul. The problem with independent data marts is that their deficiencies do not make themselves manifest until the organization has built multiple independent data marts.
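The star-join structure mentioned above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (fact_sales, dim_product, dim_month), the columns and the sample rows are all hypothetical, and a real data mart would normally live in a dedicated multidimensional or relational product rather than an in-memory database.

```python
import sqlite3

# Hypothetical sales mart: one fact table that references two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_month   (month_id   INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        month_id   INTEGER REFERENCES dim_month(month_id),
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'widgets'), (2, 'gadgets');
    INSERT INTO dim_month   VALUES (1, '1998-04'), (2, '1998-05');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 2, 75.0);
""")


def revenue_by_category_and_month(connection):
    """Star join: the fact table is joined to each dimension it references."""
    return connection.execute("""
        SELECT p.category, m.label, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_month   AS m ON m.month_id   = f.month_id
        GROUP BY p.category, m.label
        ORDER BY p.category, m.label
    """).fetchall()


star_rows = revenue_by_category_and_month(conn)
for category, month, revenue in star_rows:
    print(category, month, revenue)
```

The shape of the query follows the shape of the department's questions (revenue by product category by month), which is why the requirements of the department's users have to be gathered before the star join can be designed.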
What is a Data Warehouse?

Data warehouses are significantly different from data marts. Data warehouses are arranged around the corporate subject areas found in the corporate data model. Usually the data warehouse is built and owned by centrally coordinated organizations, such as the classic IT organization. The data warehouse represents a truly corporate effort. There may or may not be a relationship between any department's subject areas and the corporation's subject areas.

The data warehouse contains the most granular data the corporation has. Data mart data is usually much less granular than data warehouse data (i.e., data warehouses contain more detailed information while most data marts contain more summarized or aggregated data). The data warehouse data structure is an essentially normalized structure. The structure and the content of the data in the data warehouse do not reflect the bias of any particular department, but represent the corporation's needs for data.

The volume of data found in the data warehouse is significantly different from the volume found in the data mart. Because of the volume of data found in the data warehouse, the data warehouse is indexed very lightly. The data warehouse contains a robust amount of historical data. The technology housing the data warehouse is optimized for handling an industrial-strength amount of data. The data warehouse data is integrated from the many legacy sources.

In short, there are very significant differences between the structure and content of data that resides in a data warehouse and the structure and content of data that resides in a data mart.
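What "integrated from the many legacy sources" might look like in practice can be sketched as follows. The two legacy extracts, their field names, and the PRODUCT_KEY cross-reference are hypothetical. The point is only that each source identifies a product in its own way, while the warehouse stores a single corporate key.

```python
# Hypothetical legacy extracts: each application identifies products differently.
billing_rows = [{"sku": "W-100", "amount": 100.0},
                {"sku": "G-200", "amount": 75.0}]
orders_rows = [{"item_code": "widget_100", "amount": 150.0}]

# Hypothetical cross-reference from (source, source identifier) to one warehouse key.
PRODUCT_KEY = {
    ("billing", "W-100"): 1,
    ("billing", "G-200"): 2,
    ("orders", "widget_100"): 1,
}


def integrate(source_name, rows, id_field):
    """Return detailed warehouse rows keyed by the single corporate product key."""
    return [
        {
            "product_key": PRODUCT_KEY[(source_name, row[id_field])],
            "source": source_name,
            "amount": row["amount"],
        }
        for row in rows
    ]


# The warehouse keeps every detailed row, but under one way of identifying a product.
warehouse_rows = (integrate("billing", billing_rows, "sku")
                  + integrate("orders", orders_rows, "item_code"))

for row in warehouse_rows:
    print(row)
```

Because the mapping is applied once, on the way into the warehouse, every dependent data mart that is later fed from these rows inherits the same product identity.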

Figure 1 shows some of the differences between a data mart and a data warehouse. Because data that is granular, integrated and historical resides in a data warehouse, the data warehouse attracts a significant volume of data. Because the warehouse attracts a significant amount of data, it is advisable to build the warehouse iteratively. If you don't build the warehouse iteratively, you will spend years building the warehouse.

From the very first literature that was ever written on data warehousing, it has been recognized that there was a need to get concrete, tangible results in front of the end user as quickly as possible. The best advice of the writers and consultants of the data warehousing industry has consistently been to build the warehouse quickly and to avoid large, lengthy efforts. Interestingly, the data mart advocates and their spokesmen claim that data warehouses take a long time to build. It is only in the hype issued by the data mart advocates that it has ever been suggested that the warehouse be built in "galactic" proportions. Figure 2 shows the recommended construction path for data warehouses.

The most recent theory of the data mart advocates is that you can build one or more data marts, integrate them (although no one is very clear as to what that means) and then when they grow to a certain size, they can be (magically!) turned into a warehouse. This suggestion is sadly mistaken for a variety of reasons:

* The data mart is designed to suit the needs of a department. Many departments with very different objectives must be satisfied. That is why there are many different data marts in the corporation, each with its own distinctive look and feel. The data warehouse is designed to suit the collective needs of the entire corporation. A given design can be optimal for a single department or the corporation, but not both. The design objectives for the corporation are very different from the design objectives for a given department.
* The granularity of data in the data mart is very different from the granularity of data in the data warehouse. The data mart contains aggregated or summarized data. The data warehouse contains the most detailed data that is found in the corporation. Since the data mart granularity is much higher than that found in the data warehouse, you cannot easily decompose the data mart granularity into data warehouse granularity. But you can always go the other direction and summarize detailed units of data into summarizations.
* The structure of the data in the data mart (commonly a star join structure) is only faintly compatible with the structure of the data in the warehouse (a normalized structure).
* The amount of historical data found in the data mart is very different from the history of the data found in the warehouse. Data warehouses contain robust amounts of history. Data marts contain only modest amounts of history.
* The subject areas found in the data mart are only faintly related to the subject areas found in the data warehouse.
* The relationships found in the data mart are not those relationships that are found in the data warehouse.
* The types of queries satisfied in the data mart are quite different from those queries found in the data warehouse.
* The kind of users (farmers) that are found in the marts are quite different from the type of users (explorers) that are found in the data warehouse.
* The key structures found in the data mart are significantly different from those key structures found in the data warehouse, and so forth.
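The granularity point in the list above, that detail can always be summarized but a summary cannot be decomposed back into detail, can be shown with a small sketch. The rows and the summarize_by_category helper are hypothetical. The second data set exists only to demonstrate that two different sets of detail can collapse to the same summary.

```python
from collections import defaultdict

# Hypothetical warehouse detail: (day, product category, amount).
detail = [
    ("1998-05-01", "widgets", 100.0),
    ("1998-05-02", "widgets", 50.0),
    ("1998-05-02", "gadgets", 75.0),
]


def summarize_by_category(rows):
    """Warehouse-to-mart direction: roll detailed rows up into category totals."""
    totals = defaultdict(float)
    for _day, category, amount in rows:
        totals[category] += amount
    return dict(totals)


mart_summary = summarize_by_category(detail)

# A different (hypothetical) set of detailed rows that yields the same summary.
other_detail = [
    ("1998-05-03", "widgets", 150.0),
    ("1998-05-03", "gadgets", 75.0),
]
same_summary = summarize_by_category(other_detail) == mart_summary

print(mart_summary)
print("different detail, identical summary:", same_summary)
```

Since more than one set of detailed rows maps to the same totals, nothing in the summary alone says which days or how many transactions produced it. That lost information is what a data mart would need in order to grow into a warehouse.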

Reality

There are simply MAJOR, MAJOR differences between the data mart and the data warehouse environment. The assertion that a data mart can be turned into a data warehouse when it reaches a certain size, or that data marts can be integrated together, is no more valid than saying that a tumbleweed can be turned into an oak tree when it grows large enough. Reality and genetics being what they are, it is true that a tumbleweed and an oak tree are, at one point in their lives, green living organisms planted in the soil and approximately the same size. But just because those two plants share a few basic characteristics at some moment in time does not mean that a tumbleweed can be turned into an oak tree. Only a misinformed person would mistake a tumbleweed for an oak tree at any stage in the life of the plants.

This article was originally published in the May 1998 issue of DM Review magazine.

William Inmon
Bill Inmon is universally recognized as the father of the data warehouse. He can be reached at (303) 221-4000.
