Analytics


Clickstream analytics: On a Web site, clickstream analysis (sometimes called clickstream analytics) is the process of collecting, analyzing, and reporting aggregate data about which pages visitors visit and in what order - the result of the succession of mouse clicks each visitor makes (that is, the clickstream).

There are two levels of clickstream analysis: traffic analysis and e-commerce analysis. Traffic analysis operates at the server level by collecting clickstream data related to the path the user takes when navigating through the site. It tracks how many pages are served to the user, how long it takes pages to load, how often the user hits the browser's back or stop button, and how much data is transmitted before a user moves on. E-commerce-based analysis uses clickstream data to determine the effectiveness of the site as a channel-to-market by quantifying the user's behavior while on the Web site. It is used to keep track of what pages the user lingers on, what the user puts in or takes out of their shopping cart, and what items the user purchases. Because a large volume of data can be gathered through clickstream analysis, many e-businesses rely on pre-programmed applications to help interpret the data and generate reports on specific areas of interest. Clickstream analysis is considered to be most effective when used in conjunction with other, more traditional, market evaluation resources.

A clickstream is the recording of what a computer user clicks on while Web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on the client or inside the Web server, as well as possibly the Web browser, routers, proxy servers, and ad servers. Clickstream analysis is useful for Web activity analysis,[1] software testing, market research, and for analyzing employee productivity.

A small observation on the evolution of clickstream tracking: initial clickstream or click-path data had to be gleaned from server log files. Because human and machine traffic were not differentiated, the study of human clicks took a substantial effort. Subsequently, JavaScript technologies were developed that use a tracking cookie to generate a series of signals from browsers.

In other words, information was only collected from "real humans" clicking on sites through browsers. A clickstream is a series of page requests; every page requested generates a signal. These signals can be graphically represented for clickstream reporting. The main point of clickstream tracking is to give webmasters insight into what visitors on their site are doing. The data itself is "neutral" in the sense that any dataset is neutral; it can be used in various scenarios, one of which is marketing. Additionally, any webmaster, researcher, blogger or person with a website can use it to learn how to improve their site.

Use of clickstream data can raise privacy concerns, especially since some Internet service providers have resorted to selling users' clickstream data as a way to enhance revenue. There are 10-12 companies that purchase this data, typically for about $0.40/month per user.[1] While this practice may not directly identify individual users, it is often possible to indirectly identify specific users, an example being the AOL search data scandal. Most consumers are unaware of this practice and its potential for compromising their privacy, and few ISPs publicly admit to it.[2]

Since the business world is quickly evolving into a state of e-commerce, analyzing the data of clients that visit a company website is becoming a necessity in order to remain competitive. This analysis can be used to generate two findings for the company. The first is an analysis of a user's clickstream while using a website to reveal usage patterns, which in turn gives a heightened understanding of customer behaviour. This use of the analysis creates a user profile that aids in understanding the types of people that visit a company's website. As discussed in Van den Poel & Buckinx (2005), clickstream analysis can be used to predict whether a customer is likely to purchase from an e-commerce website. Clickstream analysis can also be used to improve customer satisfaction with the website and with the company itself. Both of these uses generate a significant business advantage. It can also be used to assess the effectiveness of advertising on a web page or site.[2]

With the growing corporate knowledge of the importance of clickstreams, the way that they are being monitored and used to build business intelligence is evolving. Data mining,[3] column-oriented DBMSs, and integrated OLAP systems are being used in conjunction with clickstreams to better record and analyze this data. Clickstreams can also be used to allow the user to see where they have been and to return easily to a page they have already visited, a function that is already incorporated in most browsers.

Unauthorized clickstream data collection is considered to be spyware. Authorized clickstream data collection, by contrast, comes from organizations that use opt-in panels to generate market research, using panelists who agree to share their clickstream data with other companies by downloading and installing specialized clickstream collection agents.
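To make the notion of a clickstream as a series of page requests concrete, here is a minimal Python sketch that groups hypothetical page-request records into per-visitor sessions and counts the most common page-to-page transitions. The record fields and the 30-minute session timeout are illustrative assumptions, not features of any particular analytics product.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical page-request records: (visitor_id, timestamp, url)
requests = [
    ("v1", "2017-03-01 09:00:05", "/home"),
    ("v1", "2017-03-01 09:00:40", "/products"),
    ("v1", "2017-03-01 09:02:10", "/cart"),
    ("v2", "2017-03-01 09:01:00", "/home"),
    ("v2", "2017-03-01 10:15:00", "/home"),      # new session after a long gap
    ("v2", "2017-03-01 10:16:30", "/checkout"),
]

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff between sessions

def sessionize(records):
    """Group each visitor's page requests into sessions (ordered click paths)."""
    by_visitor = defaultdict(list)
    for visitor, ts, url in records:
        by_visitor[visitor].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url))
    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()
        current, last_time = [], None
        for ts, url in hits:
            if last_time is not None and ts - last_time > SESSION_TIMEOUT:
                sessions.append((visitor, current))
                current = []
            current.append(url)
            last_time = ts
        sessions.append((visitor, current))
    return sessions

# Count the most common page-to-page transitions across all sessions.
transitions = Counter()
for _, path in sessionize(requests):
    transitions.update(zip(path, path[1:]))

for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")
```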

The Semantic Web is a term coined by World Wide Web Consortium (W3C) director Tim Berners-Lee.[1] It describes methods and technologies that allow machines to understand the meaning - or "semantics" - of information on the World Wide Web.[2] According to the original vision, the availability of machine-readable metadata would enable automated agents and other software to access the Web more intelligently. The agents would be able to perform tasks automatically and locate related information on behalf of the user.

While the term "Semantic Web" is not formally defined, it is mainly used to describe the model and technologies[3] proposed by the W3C. These technologies include the Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle, N-Triples), and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain.

Many of the technologies proposed by the W3C already exist and are used in various projects. The Semantic Web as a global vision, however, has remained largely unrealized, and its critics have questioned the feasibility of the approach. In addition, other technologies with similar goals, such as microformats, have evolved, which are not always described as "Semantic Web".
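As a small illustration of the machine-readable statements these technologies are meant to enable, the sketch below builds a three-triple RDF graph and serializes it as Turtle using the rdflib Python library (assumed to be installed); the example.org vocabulary is invented for the example.

```python
# A minimal RDF example; assumes the rdflib library is installed (pip install rdflib).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")          # illustrative namespace, not a real vocabulary
g = Graph()

book = URIRef("http://example.org/book/moby-dick")
g.add((book, RDF.type, EX.Book))               # the resource is a Book
g.add((book, EX.title, Literal("Moby Dick")))  # it has a title
g.add((book, EX.author, Literal("Herman Melville")))

# Serialize the graph as Turtle, one of the interchange formats mentioned above.
print(g.serialize(format="turtle"))
```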

Purpose

Humans are capable of using the Web to carry out tasks such as finding the Irish word for "directory", reserving a library book, and searching for a low price for a DVD. However, one computer cannot accomplish all of these tasks without human direction, because web pages are designed to be read by people, not machines. The semantic web is a vision of information that is understandable by computers, so that computers can perform more of the tedious work involved in finding, combining, and acting upon information on the web.

Tim Berners-Lee originally expressed the vision of the semantic web as follows:[4]

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A 'Semantic Web', which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The 'intelligent agents' people have touted for ages will finally materialize. - Tim Berners-Lee, 1999

Semantic Web application areas are experiencing intensified interest due to the rapid growth in the use of the Web, together with the innovation and renovation of information content technologies. The Semantic Web is regarded as an integrator across different content and information applications and systems, and it provides mechanisms for the realisation of Enterprise Information Systems. The rapidity of this growth provides the impetus for researchers to focus on the creation and dissemination of innovative Semantic Web technologies, even though the envisaged Semantic Web is long overdue.

Often the terms 'semantics', 'metadata', 'ontologies' and 'Semantic Web' are used inconsistently. In particular, these terms are used as everyday terminology by researchers and practitioners, spanning a vast landscape of different fields, technologies, concepts and application areas. Furthermore, there is confusion with regard to the current status of the enabling technologies envisioned to realise the Semantic Web. In a paper presented by Gerber, Barnard and Van der Merwe,[5] the Semantic Web landscape is charted and a brief summary of related terms and enabling technologies is presented.

The architectural model proposed by Tim Berners-Lee is used as a basis to present a status model that reflects current and emerging technologies.[6]

Semantic Publishing

Semantic publishing will benefit greatly from the semantic web. In particular, the semantic web is expected to revolutionize scientific publishing, for example through the real-time publishing and sharing of experimental data on the Internet. This simple but radical idea is now being explored by the W3C HCLS group's Scientific Publishing Task Force.

Web 3.0

Tim Berners-Lee has described the semantic web as a component of 'Web 3.0':[7]

People keep asking what Web 3.0 is. I think maybe when you've got an overlay of scalable vector graphics - everything rippling and folding and looking misty - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource...


Basic Notions of Semantics

A perennial problem in semantics is the delineation of its subject matter. The term meaning can be used in a variety of ways, and only some of these correspond to the usual understanding of the scope of linguistic or computational semantics. We shall take the scope of semantics to be restricted to the literal interpretations of sentences in a context, ignoring phenomena like irony, metaphor, or conversational implicature [Gri75,Lev83].

A standard assumption in computationally oriented semantics is that knowledge of the meaning of a sentence can be equated with knowledge of its truth conditions: that is, knowledge of what the world would be like if the sentence were true. This is not the same as knowing whether a sentence is true, which is (usually) an empirical matter, but knowledge of truth conditions is a prerequisite for such verification to be possible. Meaning as truth conditions needs to be generalized somewhat for the case of imperatives or questions, but it is common ground among all contemporary theories, in one form or another, and has an extensive philosophical justification, e.g., [Dav69,Dav73].

A semantic description of a language is some finitely stated mechanism that allows us to say, for each sentence of the language, what its truth conditions are. Just as for grammatical description, a semantic theory will characterize complex and novel sentences on the basis of their constituents: their meanings, and the manner in which they are put together. The basic constituents will ultimately be the meanings of words and morphemes.

The modes of combination of constituents are largely determined by the syntactic structure of the language. In general, to each syntactic rule combining some sequence of child constituents into a parent constituent, there will correspond some semantic operation combining the meanings of the children to produce the meaning of the parent.

A corollary of knowledge of the truth conditions of a sentence is knowledge of what inferences can be legitimately drawn from it. Valid inference is traditionally within the province of logic (as is truth), and mathematical logic has provided the basic tools for the development of semantic theories. One particular logical system, first-order predicate calculus (FOPC), has played a special role in semantics (as it has in many areas of computer science and artificial intelligence). FOPC can be seen as a small model of how to develop a rigorous semantic treatment for a language, in this case an artificial one developed for the unambiguous expression of some aspects of mathematics. The set of sentences or well-formed formulae of FOPC is specified by a grammar, and a rule of semantic interpretation is associated with each syntactic construct permitted by this grammar. The interpretations of constituents are given by associating them with set-theoretic constructions (their denotation) from a set of basic elements in some universe of discourse. Thus for any of the infinitely large set of FOPC sentences we can give a precise description of its truth conditions, with respect to that universe of discourse. Furthermore, we can give a precise account of the set of valid inferences to be drawn from some sentence or set of sentences, given these truth conditions, or (equivalently, in the case of FOPC) given a set of rules of inference for the logic.

Practical Applications of Semantics

Some natural language processing tasks (e.g., message routing, textual information retrieval, translation) can be carried out quite well using statistical or pattern-matching techniques that do not involve semantics in the sense assumed above. However, performance on some of these tasks improves if semantic processing is involved. (Not enough progress has been made to see whether this is true for all of them.) Some tasks, however, cannot be carried out at all without semantic processing of some form. One important example application is database query, of the type chosen for the Air Travel Information Service (ATIS) task [DAR89]. For example, if a user asks, "Does every flight from London to San Francisco stop over in Reykjavik?", then the system needs to be able to deal with some simple semantic facts. Relational databases do not store propositions of the form "every X has property P", so a logical inference from the meaning of the sentence is required. In this case, "every X has property P" is equivalent to "there is no X that does not have property P", and a system that knows this will also know that the answer to the question is no if a non-stopping flight is found and yes otherwise.
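The equivalence between "every X has property P" and "there is no X that does not have property P" can be made concrete with a small sketch. The flight records and the stops field below are invented purely for illustration; they are not part of the ATIS data.

```python
# Illustrative flight records; the schema is invented for this example.
flights = [
    {"origin": "London", "destination": "San Francisco", "stops": ["Reykjavik"]},
    {"origin": "London", "destination": "San Francisco", "stops": []},  # non-stop flight
]

def stops_in(flight, city):
    return city in flight["stops"]

# "Every flight stops over in Reykjavik" ...
every_flight_stops = all(stops_in(f, "Reykjavik") for f in flights)

# ... is equivalent to "there is no flight that does not stop over in Reykjavik".
no_flight_skips = not any(not stops_in(f, "Reykjavik") for f in flights)

assert every_flight_stops == no_flight_skips
print("yes" if every_flight_stops else "no")   # prints "no": a non-stopping flight exists
```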

Any kind of generation of natural language output (e.g., summaries of financial data, traces of KBS system operations) usually requires semantic processing. Generation requires the construction of an appropriate meaning representation, and then the production of a sentence or sequence of sentences which express the same content in a way that is natural for a reader to comprehend, e.g., [MKS94]. To illustrate, if a database lists a 10 a.m. flight from London to Warsaw on the 1st-14th and 16th-30th of November, then it is more helpful to answer the question "What days does that flight go?" with "Every day except the 15th" rather than with a list of 30 days of the month. But to do this the system needs to know that the semantic representations of the two propositions are equivalent.

Future Directions

Currently, the most pressing needs for semantic theory are to find ways of achieving wider and more robust coverage of real data. This will involve progress in several directions. (i) Further exploration of the use of underspecified representations, so that some level of semantic processing can be achieved even where complete meaning representations cannot be constructed (either because of lack of coverage or inability to carry out contextual resolution). (ii) Closer cooperation with work in lexicon construction. The tradition in semantics has been to assume that word meanings can by and large simply be plugged in to semantic structures. This is a convenient and largely correct assumption when dealing with structures like "every X is P", but it becomes less tenable as more complex phenomena are examined. However, the relevant semantic properties of individual words or groups of words are seldom to be found in conventional dictionaries, and closer cooperation between semanticists and computationally aware lexicographers is required. (iii) More integration between sentence- or utterance-level

semantics and theories of text or dialogue structure. Recent work in semantics has shifted emphasis away from the purely sentence-based approach, but the extent to which the interpretations of individual sentences can depend on dialogue or text settings, or on the goals of speakers, is much greater than had been suspected.

Intelligent Agents

Introduction: A key enabling technology for personalization, intelligent agent technology provides a mechanism for information systems to act on behalf of their users. Specifically, intelligent agents can be programmed to search for, acquire, and store information according to the wants and needs of users.

Attributes

Intelligent agents are task-oriented. Examples of intelligent agent tasks include data mining, profile management, privacy management, rules management, and application management. Data mining agents seek data and information based on the profile of the user and instructions carried out by the rules manager agent. The profile management agent's role is to maintain the user profile, constantly adding, deleting and modifying profile information based on new information. The privacy management agent's role is to safeguard the privacy of the user, including identity, location, and other personal information. An application agent manages application(s) and interactions between the network, other networks, content, and other applications.

Agent Locations and Technologies

Intelligent agent software may consist of embedded technology within the mobile device, servers on the Internet, and/or programs within applications on or off the mobile network.

Agency vs. Intelligence

Agency is the degree of authority given to a program - how much it is allowed to perform on its own accord, based on the instructions given by the user.

Intelligence is the degree of reasoning and independent "learning" enabled within the program. A truly intelligent agent has the ability to adapt itself to its environment, sensing events and objects, evaluating circumstances, and acting over time in pursuit of a specific agenda.

Intelligent Agent Applications

The number of applications that can benefit from intelligent agent technology is limited only by the imagination. One of the key goals is to enable the autonomous delivery of information for the benefit of various applications. For example, a location-detection intelligent agent in a mobile device would inform application middleware once a user is in a certain location, perhaps engaging additional positioning services and/or applications. The exploitation of intelligent agent software is expected to rise over time as other enabling personalization technologies also grow in prominence.
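The component roles sketched above can be illustrated with a toy example. The class names and the keyword-matching rule below are invented to suggest how a profile-driven agent might filter information on a user's behalf; they do not describe any real agent framework.

```python
# A toy illustration of profile-driven filtering by an "agent"; all names are invented.
class ProfileManager:
    """Maintains the user profile (interests) on the user's behalf."""
    def __init__(self):
        self.interests = set()

    def add_interest(self, topic):
        self.interests.add(topic.lower())

class DataMiningAgent:
    """Selects items that match the profile, using a simple keyword rule."""
    def __init__(self, profile):
        self.profile = profile

    def filter(self, items):
        return [item for item in items
                if any(topic in item.lower() for topic in self.profile.interests)]

profile = ProfileManager()
profile.add_interest("analytics")
agent = DataMiningAgent(profile)
print(agent.filter(["Clickstream analytics primer", "Gardening tips", "Web analytics news"]))
```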

Market Basket Analysis

What is it?

Market basket analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in an English pub and you buy a pint of beer and don't buy a bar meal, you are more likely to buy crisps (US: chips) at the same time than somebody who didn't buy beer. The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases. Typically the relationship will be in the form of a rule: IF {beer, no bar meal} THEN {crisps}.

The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer will purchase crisps, given that the antecedent holds, is referred to as the confidence. The algorithms for performing market basket analysis are fairly straightforward (Berry and Linoff is a reasonable introductory resource for this). The complexities mainly arise in exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or more line items), and dealing with the large amounts of transaction data that may be available.

A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business. Although the volume of data has been reduced, we are still asking the user to find a needle in a haystack. Requiring rules to have a high minimum support level and a high confidence level risks missing any exploitable result we might have found. One partial solution to this problem is differential market basket analysis, as described below.

How is it used?

In retailing, most purchases are made on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. (For some real insights into consumer behavior, see Why We Buy: The Science of Shopping by Paco Underhill.) As a first step, therefore, market basket analysis can be used in deciding the location and promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls are more likely to buy candy, then high-margin candy can be placed near the Barbie doll display. Customers who would have bought candy with their Barbie dolls had they thought of it will now be suitably tempted. But this is only the first level of analysis. Differential market basket analysis can find interesting results and can also eliminate the problem of a potentially high volume of trivial results.
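A minimal sketch of how the support and confidence of the rule above might be computed over a list of transactions follows. The toy baskets are invented, and, as in the text, support is taken here as the probability of the antecedent.

```python
# Toy transactions (sets of items bought together); data invented for illustration.
baskets = [
    {"beer", "crisps"},
    {"beer", "bar meal"},
    {"beer", "crisps", "peanuts"},
    {"wine", "crisps"},
    {"beer"},
]

# Rule: IF {beer, no bar meal} THEN {crisps}
antecedent = [b for b in baskets if "beer" in b and "bar meal" not in b]
both = [b for b in antecedent if "crisps" in b]

support = len(antecedent) / len(baskets)      # P(beer and no bar meal), as defined above
confidence = len(both) / len(antecedent)      # P(crisps | beer and no bar meal)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.67
```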

In differential analysis, we compare results between different stores, between customers in different demographic groups, between different days of the week, different seasons of the year, and so on. If we observe that a rule holds in one store but not in any other (or does not hold in one store but holds in all others), then we know that there is something interesting about that store. Perhaps its clientele are different, or perhaps it has organized its displays in a novel and more lucrative way. Investigating such differences may yield useful insights which will improve company sales.

Other Application Areas

Although market basket analysis conjures up pictures of shopping carts and supermarket shoppers, it is important to realize that there are many other areas in which it can be applied. These include:
• Analysis of credit card purchases.
• Analysis of telephone calling patterns.
• Identification of fraudulent medical insurance claims. (Consider cases where common rules are broken.)
• Analysis of telecom service purchases.



How the Data Dictionary Is Used

The data dictionary has three primary uses:


• Oracle accesses the data dictionary to find information about users, schema objects, and storage structures.
• Oracle modifies the data dictionary every time that a data definition language (DDL) statement is issued.
• Any Oracle user can use the data dictionary as a read-only reference for information about the database.

How Oracle Uses the Data Dictionary

Data in the base tables of the data dictionary is necessary for Oracle to function. Therefore, only Oracle should write or change data dictionary information. Oracle provides scripts to modify the data dictionary tables when a database is upgraded or downgraded.

Caution: No data in any data dictionary table should be altered or deleted by any user.

During database operation, Oracle reads the data dictionary to ascertain that schema objects exist and that users have proper access to them. Oracle also updates the data dictionary continuously to reflect changes in database structures, auditing, grants, and data. For example, if user Kathy creates a table named parts, then new rows are added to the data dictionary that reflect the new table, columns, segment, extents, and the privileges that Kathy has on the table. This new information is then visible the next time the dictionary views are queried.

Public Synonyms for Data Dictionary Views

Oracle creates public synonyms for many data dictionary views so users can access them conveniently. The security administrator can also create additional public synonyms for schema objects that are used systemwide. Users should avoid naming their own schema objects with the same names as those used for public synonyms.

Cache the Data Dictionary for Fast Access

Much of the data dictionary information is kept in the SGA in the dictionary cache, because Oracle constantly accesses the data dictionary during database operation to validate user access and to verify the state of schema objects. All information is stored in memory using the least recently used (LRU) algorithm. Parsing information is typically kept in the caches. The COMMENTS columns describing the tables and their columns are not cached unless they are accessed frequently.

Other Programs and the Data Dictionary

Other Oracle products can reference existing views and create additional data dictionary tables or views of their own. Application developers who write programs that refer to the data dictionary should refer to the public synonyms rather than the underlying tables: the synonyms are less likely to change between software releases.

How to Use the Data Dictionary

The views of the data dictionary serve as a reference for all database users. Access the data dictionary views with SQL statements. Some views are accessible to all Oracle users, and others are intended for database administrators only. The data dictionary is always available when the database is open. It resides in the SYSTEM tablespace, which is always online. The data dictionary consists of sets of views. In many cases, a set consists of three views containing similar information and distinguished from each other by their prefixes:

Table 7-1 Data Dictionary View Prefixes

Prefix  Scope
USER    User's view (what is in the user's schema)
ALL     Expanded user's view (what the user can access)
DBA     Database administrator's view (what is in all users' schemas)

The set of columns is identical across views, with these exceptions:


• Views with the prefix USER usually exclude the column OWNER. This column is implied in the USER views to be the user issuing the query.
• Some DBA views have additional columns containing information useful to the administrator.
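Because the dictionary views are ordinary views, they can be queried like any other table. The sketch below assumes the python-oracledb driver and placeholder connection details; it lists the current user's tables through USER_TABLES and a sample of what is visible through ALL_TABLES.

```python
# Querying data dictionary views; connection details are placeholders.
import oracledb  # assumes the python-oracledb driver is installed

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
cur = conn.cursor()

# Tables in the current user's schema (no OWNER column in USER_ views).
cur.execute("SELECT table_name FROM user_tables ORDER BY table_name")
for (table_name,) in cur:
    print("mine:", table_name)

# Tables the user can access; ALL_ views add the OWNER column.
cur.execute("SELECT owner, table_name FROM all_tables WHERE ROWNUM <= 10")
for owner, table_name in cur:
    print("accessible:", owner, table_name)

cur.close()
conn.close()
```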



Dynamic Performance Tables

Throughout its operation, Oracle maintains a set of virtual tables that record current database activity. These tables are called dynamic performance tables. Dynamic performance tables are not true tables, and they should not be accessed by most users. However, database administrators can query and create views on the tables and grant access to those views to other users. These views are sometimes called fixed views because they cannot be altered or removed by the database administrator.

SYS owns the dynamic performance tables; their names all begin with V_$. Views are created on these tables, and then public synonyms are created for the views. The synonym names begin with V$. For example, the V$DATAFILE view contains information about the database's datafiles, and the V$FIXED_TABLE view contains information about all of the dynamic performance tables and views in the database.

Database Object Metadata

The DBMS_METADATA package provides interfaces for extracting complete definitions of database objects. The definitions can be expressed either as XML or as SQL DDL. Two styles of interface are provided:
• A flexible, sophisticated interface for programmatic control
• A simplified interface for ad hoc querying
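As an example of the simplified ad hoc interface, DBMS_METADATA.GET_DDL can be called from an ordinary query. The sketch below reuses the placeholder python-oracledb connection details from the earlier example and an illustrative table name (EMP).

```python
# Extracting the DDL for a table through DBMS_METADATA's ad hoc query interface.
import oracledb  # assumes the python-oracledb driver is installed

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
cur = conn.cursor()

# GET_DDL returns a CLOB containing the CREATE TABLE statement.
cur.execute("SELECT DBMS_METADATA.GET_DDL('TABLE', 'EMP') FROM DUAL")
ddl_lob, = cur.fetchone()
print(ddl_lob.read())   # read() materializes the CLOB as a string

cur.close()
conn.close()
```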


Summary: The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made creating and maintaining accurate and complete master data a business imperative. This paper covers the reasons for adopting master-data management, the process of developing a solution, and several options for the technological implementation of the solution.

Contents
Introduction
Deciding What to Manage
Why Should I Manage Master Data?
What Is Master Data Management?
Conclusion

Introduction

The pain that organizations are experiencing around consistent reporting, regulatory compliance, strong interest in Service-Oriented Architecture (SOA), and Software as a Service (SaaS) has prompted a great deal of interest in Master Data Management (MDM). This paper explains what MDM is, why it is important, and how to manage it, while identifying some of the key MDM management patterns and best practices that are emerging. This paper is a high-level treatment of the problem space. In subsequent papers, we will drill down into the technical and procedural issues involved in Master Data Management.

What Is Master Data?

Most software systems have lists of data that are shared and used by several of the applications that make up the system. For example, a typical ERP system will have, at a minimum, a Customer Master, an Item Master, and an Account Master. This master data is often one of the key assets of a company. It's not unusual for a company to be acquired primarily for access to its Customer Master data.

Rudimentary Definitions

There are some very well-understood and easily identified master-data items, such as "customer" and "product." In fact, many define master data by simply reciting a commonly agreed upon master-data item list, such as: customer, product, location, employee, and asset. But how you identify elements of data that should be managed by a master-data management system is much more complex and defies such rudimentary definitions. In fact, there is a lot of confusion around what master data is and how it is qualified, necessitating a more comprehensive treatment. There are essentially five types of data in corporations:


• Unstructured: This is data found in e-mail, white papers like this one, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.

• Transactional: This is data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.

• Metadata: This is data about other data and may reside in a formal repository or in various other forms, such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.

• Hierarchical: Hierarchical data stores the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data.

• Master: Master data are the critical nouns of a business and fall generally into four groupings: people, things, places, and concepts. Further categorizations within those groupings are called subject areas, domain areas, or entity types. For example, within people, there are customer, employee, and salesperson. Within things, there are product, part, store, and asset. Within concepts, there are things like contract, warranty, and licenses. Finally, within places, there are office locations and geographic divisions. Some of these domain areas may be further divided. Customer may be further segmented, based on incentives and history. A company may have normal customers, as well as premiere and executive customers. Product may be further segmented by sector and industry. The requirements, life cycle, and CRUD cycle for a product in the Consumer Packaged Goods (CPG) sector are likely very different from those of the clothing industry. The granularity of domains is essentially determined by the magnitude of differences between the attributes of the entities within them.

Deciding What to Manage

While identifying master data entities is pretty straightforward, not all data that fits the definition of master data should necessarily be managed as such. This paper narrows the definition of master data to the following criteria, all of which should be considered together when deciding whether a given entity should be treated as master data.

Behavior

Master data can be described by the way that it interacts with other data. For example, in transaction systems, master data is almost always involved with transactional data. A customer buys a product. A vendor sells a part, and a partner delivers a crate of materials to a location. An employee is hierarchically related to their manager, who reports up through a manager (another employee). A product may be a part of multiple hierarchies describing its placement within a store. This relationship between master data and transactional data may be fundamentally viewed as a noun/verb relationship. Transactional data capture the verbs, such as sale, delivery, purchase, email, and revocation; master data are the nouns. This is the same relationship that data-warehouse facts and dimensions share.

Life Cycle

Master data can be described by the way that it is created, read, updated, deleted, and searched. This life cycle is called the CRUD cycle and differs for different master-data element types and companies. For example, how a customer is created depends largely upon a company's business rules, industry segment, and data systems. One company may have multiple customer-creation vectors, such as through the Internet, directly through account representatives, or through outlet stores. Another company may only allow customers to be created through direct contact over the phone with its call center.

Further, how a customer element is created is certainly different from how a vendor element is created. The following table illustrates the differing CRUD cycles for four common master-data subject areas.

Sample CRUD cycle

Create
  Customer: Customer visit, such as to Web site or facility; account created
  Product: Product purchased or manufactured; SCM involvement
  Asset: Unit acquired by opening a PO; approval process necessary
  Employee: HR hires, numerous forms, orientation, benefits selection, asset allocations, office assignments

Read
  Customer: Contextualized views based on credentials of viewer
  Product: Periodic inventory catalogues
  Asset: Periodic reporting purposes, figuring depreciation, verification
  Employee: Office access, reviews, insurance claims, immigration

Update
  Customer: Address, discounts, phone number, preferences, credit accounts
  Product: Packaging changes, raw materials changes
  Asset: Transfers, maintenance, accident reports
  Employee: Immigration status, marriage status, level increase, raises, transfers

Destroy
  Customer: Death, bankruptcy, liquidation, do-not-call
  Product: Canceled, replaced, no longer available
  Asset: Obsolete, sold, destroyed, stolen, scrapped
  Employee: Termination, death

Search
  Customer: CRM system, call-center system, contact-management system
  Product: ERP system, orders-processing system
  Asset: GL tracking, asset DB management
  Employee: HR LOB system

Cardinality

As cardinality (the number of elements in a set) decreases, the likelihood of an element being treated as a master-data element—even a commonly accepted subject area, such as customer—

decreases. For example, if a company has only three customers, most likely it would not consider those customers master data - at least, not in the context of supporting them with a master-data management solution, simply because there is no benefit to managing those customers with a master-data infrastructure. Yet a company with thousands of customers would consider Customer an important subject area, because of the concomitant issues and benefits around managing such a large set of entities. The customer value to each of these companies is the same: both rely upon their customers for business. One needs a customer master-data solution; the other does not. Cardinality does not change the classification of a given entity type; however, the importance of having a solution for managing an entity type increases as the cardinality of the entity type increases.

Lifetime

Master data tends to be less volatile than transactional data. As it becomes more volatile, it typically is considered more transactional. For example, some might consider "contract" a master-data element; others might consider it a transaction. Depending on the lifespan of a contract, it can go either way. An agency promoting professional athletes might consider their contracts master data: each is different from the others and typically has a lifetime of greater than a year. It may be tempting to simply have one master-data item called "athlete." However, athletes tend to have more than one contract at any given time: one with their team and others with companies for endorsing products. The agency would need to manage all those contracts over time, as elements of the contract are renegotiated or athletes are traded. Other contracts - for example, contracts for detailing cars or painting a house - are more like a transaction. They are one-time, short-lived agreements to provide services for payment and are typically fulfilled and destroyed within hours.

Complexity

Simple entities, even valuable entities, are rarely a challenge to manage and are rarely considered master-data elements. The less complex an element, the less likely the need to manage change for that element. Typically, such assets are simply collected and tallied. For example, Fort Knox likely would not track information on each individual gold bar stored there, but rather only keep

a count of them. The value of each gold bar is substantial, the cardinality high, and the lifespan long; yet the complexity is low.

Value

The more valuable the data element is to the company, the more likely it will be considered a master-data element. Value and complexity work together.

Volatility

While master data is typically less volatile than transactional data, entities with attributes that do not change at all typically do not require a master-data solution. For example, rare coins would seem to meet many of the criteria for a master-data treatment. A rare-coin collector would likely have many rare coins, so cardinality is high. They are valuable. They are also complex: rare coins have a history and description, with attributes such as condition of obverse, reverse, legend, inscription, rim, and field, as well as designer initials, edge design, layers, and portrait. Yet rare coins do not need to be managed as a master-data item, because they don't change over time - or, at least, they don't change enough. There may need to be more information added, as the history of a particular coin is revealed or if certain attributes must be corrected. But, generally speaking, rare coins would not be managed through a master-data management system, because they are not volatile enough to warrant a solution.

Reuse

One of the primary drivers of master-data management is reuse. For example, in a simple world, the CRM system would manage everything about a customer and never need to share any information about the customer with other systems. However, in today's complex environments, customer information needs to be shared across multiple applications. That's where the trouble begins. Because - for a number of reasons - access to a master datum is not always available, people start storing master data in various locations, such as spreadsheets and application private stores. There are still reasons, such as data-quality degradation and decay, to manage master data

that is not reused across the enterprise. However, if a master-data entity is reused in multiple systems, it's a sure bet that it should be managed with a master-data management system.

To summarize: while it is simple to enumerate the various master-data entity types, it is sometimes more challenging to decide which data items in a company should be treated as master data. Often, data that does not normally comply with the definition of master data may need to be managed as such, and data that does comply with the definition may not. Ultimately, when deciding which entity types should be treated as master data, it is better to categorize them in terms of their behavior and attributes within the context of the business needs than to rely on simple lists of entity types.

Why Should I Manage Master Data?

Because it is used by multiple applications, an error in master data can cause errors in all the applications that use it. For example, an incorrect address in the customer master might mean orders, bills, and marketing literature are all sent to the wrong address. Similarly, an incorrect price on an item master can be a marketing disaster, and an incorrect account number in an account master can lead to huge fines or even jail time for the CEO - a career-limiting move for the person who made the mistake.

Here is a typical master-data horror story: A credit-card customer moves from 2847 North 9th St. to 1001 11th St. North. The customer changed his billing address immediately, but did not receive a bill for several months. One day, the customer received a threatening phone call from the credit-card billing department, asking why the bill had not been paid. The customer verified that they had the new address, and the billing department verified that the address on file was 1001 11th St. N. The customer asked for a copy of the bill, to settle the account. After two more weeks without a bill, the customer called back and found that the account had been turned over to a collection agency. This time, they found out that even though the address in the file was 1001 11th St. N, the billing address was 101 11th St. N. After a bunch of phone calls and letters between lawyers, the bill finally got resolved, and the credit-card company had lost a customer for life. In this case, the master copy of the data was accurate, but another copy of it was flawed. Master data must be both correct and consistent.

Even if the master data has no errors, few organizations have just one set of master data. Many companies grow through mergers and acquisitions. Each company you acquire comes with its own customer master, item master, and so forth. This would not be bad if you could just union the new master data with your current master data, but unless the company you acquire is in a completely different business in a faraway country, there's a very good chance that some customers and products will appear in both sets of master data - usually with different formats and different database keys. If both companies used the Dun & Bradstreet number or Social Security number as the customer identifier, discovering which customer records are for the same customer would be a straightforward issue; but that seldom happens. In most cases, customer numbers and part numbers are assigned by the software that creates the master records, so the chances of the same customer or the same product having the same identifier in both databases are pretty remote. Item masters can be even harder to reconcile, if equivalent parts are purchased from different vendors with different vendor numbers.

Merging master lists together can be very difficult. The same customer may have different names, customer numbers, addresses, and phone numbers in different databases. For example, William Smith might appear as Bill Smith, Wm. Smith, and William Smithe. Normal database joins and searches will not be able to resolve these differences. A very sophisticated tool that understands nicknames, alternate spellings, and typing errors will be required. The tool will probably also have to recognize that different name variations can be resolved if they all live at the same address or have the same phone number.
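A minimal sketch of the kind of approximate matching such a tool performs is shown below, using Python's standard difflib similarity measure and a tiny, invented nickname table; a production matching engine would be far more sophisticated and would also weigh addresses and phone numbers.

```python
from difflib import SequenceMatcher

# Tiny, invented nickname table; real tools ship large curated dictionaries.
NICKNAMES = {"bill": "william", "wm.": "william", "wm": "william"}

def normalize(name):
    parts = name.lower().replace(",", " ").split()
    return " ".join(NICKNAMES.get(p, p) for p in parts)

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

candidates = ["Bill Smith", "Wm. Smith", "William Smithe", "John Doe"]
target = "William Smith"

for name in candidates:
    score = similarity(target, name)
    match = "probable match" if score > 0.85 else "no match"   # illustrative threshold
    print(f"{name!r:20} score={score:.2f} -> {match}")
```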
While creating a clean master list can be a daunting challenge, there are many positive benefits to your bottom line from a common master list:

• A single, consolidated bill saves money and improves customer satisfaction.
• Sending the same marketing literature to a customer from multiple customer lists wastes money and irritates the customer.
• Before you turn a customer account over to a collection agency, it would be good to know if they owe other parts of your company money or, more importantly, that they are another division's biggest customer.
• Stocking the same item under different part numbers is not only a waste of money and shelf space, but can potentially lead to artificial shortages.

The recent movements toward SOA and SaaS make Master Data Management a critical issue. For example, if you create a single customer service that communicates through well-defined XML messages, you may think you have defined a single view of your customers. But if the same customer is stored in five databases, with three different addresses and four different phone numbers, what will your customer service return? Similarly, if you decide to subscribe to a CRM service provided through SaaS, the service provider will need a list of customers for their database. Which one will you send them? For all these reasons, maintaining a high-quality, consistent set of master data for your organization is rapidly becoming a necessity. The systems and processes required to maintain this data are known as Master Data Management.

What Is Master Data Management?

For purposes of this article, we define Master Data Management (MDM) as the technology, tools, and processes required to create and maintain consistent and accurate lists of master data. There are a couple of things worth noting in this definition. One is that MDM is not just a technological problem. In many cases, fundamental changes to business processes will be required to maintain clean master data, and some of the most difficult MDM issues are more political than technical. The second thing to note is that MDM includes both creating and maintaining master data. Investing a lot of time, money, and effort in creating a clean, consistent set of master data is a wasted effort unless the solution includes tools and processes to keep the master data clean and consistent as it is updated and expanded.

While MDM is most effective when applied to all the master data in an organization, in many cases the risk and expense of an enterprise-wide effort are difficult to justify. It may be easier to start with a few key sources of master data and expand the effort once success has been demonstrated and lessons have been learned. If you do start small, you should include an analysis of all the master data that you might eventually want to include, so that you do not make design decisions or tool choices that will force you to start over when you try to incorporate a new data source. For example, if your initial Customer master implementation only includes the

10,000 customers your direct-sales force deals with, you don't want to make design decisions that will preclude adding your 10,000,000 Web customers later. An MDM project plan will be influenced by requirements, priorities, resource availability, time frame, and the size of the problem. Most MDM projects include at least these phases:
1. Identify sources of master data. This step is usually a very revealing exercise. Some

companies find they have dozens of databases containing customer data that the IT department did not know existed.
2. Identify the producers and consumers of the master data. Which applications produce

the master data identified in the first step, and - generally more difficult to determine - which applications use the master data. Depending on the approach you use for maintaining the master data, this step might not be necessary. For example, if all changes are detected and handled at the database level, it probably does not matter where the changes come from.
3. Collect and analyze metadata about your master data. For all the sources

identified in step one, what are the entities and attributes of the data, and what do they mean? This should include attribute name, datatype, allowed values, constraints, default values, dependencies, and who owns the definition and maintenance of the data. The owner is the most important and often the hardest to determine. If you have a repository loaded with all your metadata, this step is an easy one. If you have to start from database tables and source code, this could be a significant effort.
4. Appoint data stewards. These should be the people with the knowledge of the current

source data and the ability to determine how to transform the source into the master-data format. In general, stewards should be appointed from the owners of each master-data source, the architects responsible for the MDM systems, and representatives from the business users of the master data.
5. Implement a data-governance program and data-governance council. This group

must have the knowledge and authority to make decisions on how the master data is maintained, what it contains, how long it is kept, and how changes are authorized and audited. Hundreds of decisions must be made in the course of a master-data project, and

if there is not a well-defined decision-making body and process, the project can fail, because the politics prevent effective decision making.
6. Develop the master-data model. Decide what the master records look like: what

attributes are included, what size and datatype they are, what values are allowed, and so forth. This step should also include the mapping between the master-data model and the current data sources. This is normally both the most important and most difficult step in the process. If you try to make everybody happy by including all the source attributes in the master entity, you often end up with master data that is too complex and cumbersome to be useful. For example, if you cannot decide whether weight should be in pounds or kilograms, one approach would be to include both (WeightLb and WeightKg). While this might make people happy, you are wasting megabytes of storage for numbers that can be calculated in microseconds, as well as running the risk of creating inconsistent data (WeightLb = 5 and WeightKg = 5). While this is a pretty trivial example, a bigger issue would be maintaining multiple part numbers for the same part. As in any committee effort, there will be fights and deals resulting in sub-optimal decisions. It's important to work out the decision process, priorities, and final decision maker in advance, to make sure things run smoothly.
7. Choose a toolset. You will need to buy or build tools to create the master lists by cleaning, transforming, and merging the source data. You will also need an infrastructure to use and maintain the master list. These functions are covered in detail later in the paper. You can use a single toolset from a single vendor for all of these functions, or you might want to take a best-of-breed approach. In general, the techniques to clean and merge data are different for different types of data, so there are not a lot of tools that span the whole range of master data. The two main categories of tools are Customer Data Integration (CDI) tools for creating the customer master and Product Information Management (PIM) tools for creating the product master. Some tools will do both, but generally they are better at one or the other.

The toolset should also have support for finding and fixing data-quality issues and maintaining versions and hierarchies. Versioning is a critical feature, because understanding the history of a master-data record is vital to maintaining its quality and accuracy over time. For example, if a merge tool combines two records for John Smith in Boston, and you decide there really are two different John Smiths in Boston, you need to know what the records looked like before they were merged, in order to "unmerge" them.
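To make the unmerge requirement concrete, here is a rough Python sketch (an illustration, not a feature of any named CDI or PIM tool) that keeps pre-merge snapshots of the records so a bad merge can be reversed.

    import copy

    merge_history = []  # each entry keeps the records exactly as they were before a merge

    def merge_records(master, surviving_id, duplicate_id):
        """Merge duplicate into survivor, keeping pre-merge snapshots for unmerge."""
        merge_history.append({
            "surviving_id": surviving_id,
            "before": {
                surviving_id: copy.deepcopy(master[surviving_id]),
                duplicate_id: copy.deepcopy(master[duplicate_id]),
            },
        })
        # Naive merge rule for the sketch: survivor keeps its values, fills gaps from duplicate.
        for key, value in master[duplicate_id].items():
            master[surviving_id].setdefault(key, value)
        del master[duplicate_id]

    def unmerge_last(master):
        """Restore both original records if a steward decides the merge was wrong."""
        entry = merge_history.pop()
        for rec_id, snapshot in entry["before"].items():
            master[rec_id] = snapshot

    master = {
        "C1": {"name": "John Smith", "city": "Boston", "phone": "617-555-0101"},
        "C2": {"name": "John Smith", "city": "Boston"},
    }
    merge_records(master, "C1", "C2")
    unmerge_last(master)   # both original John Smith records are back, unchanged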
8. Design the infrastructure. Once you have clean, consistent master data, you will need to expose it to your applications and provide processes to manage and maintain it. This step is a big enough issue that I devote a section to it later in the document. When this infrastructure is implemented, you will have a number of applications that depend on it being available, so reliability and scalability are important considerations to include in your design. In most cases, you will have to implement significant parts of the infrastructure yourself, because it will be designed to fit into your current infrastructure, platforms, and applications.
9. Generate and test the master data. This step is where you use the tools you have developed or purchased to merge your source data into your master-data list. This is often an iterative process requiring tinkering with rules and settings to get the matching right. It also requires a lot of manual inspection to ensure that the results are correct and meet the requirements established for the project. No tool will get the matching right 100 percent of the time, so you will have to weigh the consequences of false matches versus missed matches to determine how to configure the matching tools. False matches can lead to customer dissatisfaction if bills are inaccurate or the wrong person is arrested. Too many missed matches make the master data less useful, because you are not realizing the benefits you invested in MDM to obtain.
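One way to support that trade-off during testing is to score the matcher against a small hand-labeled sample of record pairs at several thresholds. The sketch below assumes a hypothetical score_pair() similarity function and a labeled sample list; it illustrates the tuning loop rather than any specific tool's API.

    def evaluate_thresholds(labeled_pairs, score_pair, thresholds):
        """labeled_pairs: list of (record_a, record_b, is_same_entity) judged by hand."""
        results = {}
        for t in thresholds:
            false_matches = missed_matches = 0
            for a, b, is_same in labeled_pairs:
                predicted_match = score_pair(a, b) >= t
                if predicted_match and not is_same:
                    false_matches += 1          # wrong bill, wrong person
                elif not predicted_match and is_same:
                    missed_matches += 1         # duplicate survives in the master
            results[t] = (false_matches, missed_matches)
        return results

    # Example: raising the threshold trades false matches for missed matches.
    # for t, (fm, mm) in evaluate_thresholds(sample, score_pair, [0.80, 0.90, 0.95]).items():
    #     print(t, fm, mm)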
10. Modify the producing and consuming systems. Depending on how your MDM implementation is designed, you might have to change the systems that produce, maintain, or consume master data to work with the new source of master data. If the master data is used in a system separate from the source systems—a data warehouse, for example—the source systems might not have to change. If the source systems are going to use the master data, however, there will likely be changes required. Either the source systems will have to access the new master data, or the master data will have to be synchronized with the source systems, so that the source systems have a copy of the cleaned-up master data to use. If it's not possible to change one or more of the source systems, either that source system might not be able to use the master data, or the master data will have to be integrated with the source system's database through external processes, such as triggers and SQL commands.

The source systems generating new records should be changed to look up existing master records before creating new records or updating existing ones. This ensures that the quality of data being generated upstream is good, so that the MDM system can function more efficiently and the applications themselves help manage data quality. MDM should be leveraged not only as a system of record, but also as an application that promotes cleaner and more efficient handling of data across all applications in the enterprise. As part of an MDM strategy, all three pillars of data management need to be addressed: data origination, data management, and data consumption. It is not possible to have a robust enterprise-level MDM strategy if any one of these aspects is ignored.
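A minimal sketch of the look-up-before-create change described in this step, assuming a hypothetical master-data service that exposes search and create operations; the function and parameter names are placeholders, not a real product API.

    def get_or_create_customer(mdm, candidate, review_queue, auto_threshold=0.95):
        """Called by a source system before it inserts a new customer record.

        mdm is assumed to expose search(candidate) -> list of (master_id, confidence)."""
        matches = mdm.search(candidate)
        if matches:
            best_id, confidence = max(matches, key=lambda m: m[1])
            if confidence >= auto_threshold:
                return best_id                      # reuse the existing master record
            review_queue.append((candidate, best_id, confidence))  # let a steward decide
            return None
        return mdm.create(candidate)                # genuinely new: create a master record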
11. Implement the maintenance processes. As we stated earlier, any MDM implementation must incorporate tools, processes, and people to maintain the quality of the data. All data must have a data steward who is responsible for ensuring the quality of the master data. The data steward is normally a business person who has knowledge of the data, can recognize incorrect data, and has the knowledge and authority to correct the issues. The MDM infrastructure should include tools that help the data steward recognize issues and simplify corrections. A good data-stewardship tool should point out questionable matches that were made—customers with different names and customer numbers that live at the same address, for example. The steward might also want to review items that were added as new, because the match criteria were close but below the threshold. It is important for the data steward to see the history of changes made to the data by the MDM systems, to isolate the source of errors and undo incorrect changes. Maintenance also includes the processes to pull changes and additions into the MDM system, and to distribute the cleansed data to the required places.
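As one illustration of the kind of questionable-match report a stewardship tool might produce, the sketch below groups customer records by a crudely normalized address and flags addresses that carry more than one distinct name. The record layout is an assumption for the example.

    from collections import defaultdict

    def questionable_matches(customers):
        """customers: iterable of dicts with 'customer_id', 'name', and 'address' keys."""
        by_address = defaultdict(list)
        for c in customers:
            key = " ".join(c["address"].lower().split())   # crude address normalization
            by_address[key].append(c)
        # Flag addresses where different names coexist, for steward review.
        return {
            addr: recs for addr, recs in by_address.items()
            if len({r["name"].lower() for r in recs}) > 1
        }

    flags = questionable_matches([
        {"customer_id": 101, "name": "J. Smith",   "address": "12 Main St, Boston"},
        {"customer_id": 205, "name": "Jane Smyth", "address": "12  Main St,  Boston"},
    ])
    # Both records share one address but carry different names -> surfaced for review.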

As you can see, MDM is a complex process that can go on for a long time. Like most things in software, the key to success is to implement MDM incrementally, so that the business realizes a series of short-term benefits while the complete project remains a long-term process. No MDM project can be successful without the support and participation of the business users. IT professionals do not have the domain knowledge to create and maintain high-quality master data. Any MDM project that does not include changes to the processes that create, maintain, and validate master data is likely to fail. The rest of this paper covers the details of the technology and processes for creating and maintaining master data.

How Do I Create a Master List?

Whether you buy a tool or decide to roll your own, there are two basic steps to creating master data: clean and standardize the data, and match data from all the sources to consolidate duplicates.

Before you can start cleaning and normalizing your data, you must understand the data model for the master data. As part of the modeling process, the contents of each attribute were defined, and a mapping was defined from each source system to the master-data model. This information is used to define the transformations necessary to clean your source data. Cleaning the data and transforming it into the master-data model is very similar to the Extract, Transform, and Load (ETL) processes used to populate a data warehouse. If you already have ETL tools and transformations defined, it might be easier just to modify these as required for the master data, instead of learning a new tool. Here are some typical data-cleansing functions; a brief sketch of the first two follows the list:


- Normalize data formats. Make all the phone numbers look the same, transform addresses (and so on) to a common format.
- Replace missing values. Insert defaults, look up ZIP codes from the address, look up the Dun & Bradstreet number.
- Standardize values. Convert all measurements to metric, convert prices to a common currency, change part numbers to an industry standard.
- Map attributes. Parse the first name and last name out of a contact-name field, move Part# and partno to the PartNumber field.
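The snippet below is a minimal, hand-rolled sketch of the first two functions in the list (format normalization and default values) for a US-style phone number and a country code. The specific rules are illustrative assumptions; real cleansing tools are far more thorough.

    import re

    def normalize_phone(raw: str) -> str:
        """Make all the phone numbers look the same: digits only, formatted XXX-XXX-XXXX."""
        digits = re.sub(r"\D", "", raw or "")
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]                     # drop a leading country code
        if len(digits) != 10:
            return ""                               # route to the error table for hand processing
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

    def fill_missing_country(record: dict, default: str = "US") -> dict:
        """Replace missing values with a default."""
        record.setdefault("country", default)
        return record

    row = fill_missing_country({"name": "Acme Corp", "phone": "(425) 555-0187"})
    row["phone"] = normalize_phone(row["phone"])
    # row -> {'name': 'Acme Corp', 'phone': '425-555-0187', 'country': 'US'}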
Most tools will cleanse the data that they can, and put the rest into an error table for hand processing. Depending on how the matching tool works, the cleansed data will be put into a master table or a series of staging tables. As each source is cleansed, the output should be examined to ensure the cleansing process is working correctly.

Matching master-data records to eliminate duplicates is both the hardest and most important step in creating master data. False matches can actually lose data (two Acme Corporations become one, for example), and missed matches reduce the value of maintaining a common list. The matching accuracy of MDM tools is one of the most important purchase criteria. Some matches are pretty trivial to do: if you have Social Security numbers for all your customers, or if all your products use a common numbering scheme, a database JOIN will find most of the matches. This hardly ever happens in the real world, however, so matching algorithms are normally very complex and sophisticated. Customers can be matched on name, maiden name, nickname, address, phone number, credit-card number, and so on, while products are matched on name, description, part number, specifications, and price. The more attributes that match and the closer the match, the higher the degree of confidence the MDM system has in the match. This confidence factor is computed for each match, and if it surpasses a threshold, the records are considered a match. The threshold is normally adjusted depending on the consequences of a false match. For example, you might specify that if the confidence level is over 95 percent, the records are merged automatically, and if the confidence is between 80 percent and 95 percent, a data steward should approve the match before the records are merged.

Most merge tools merge one set of input into the master list, so the best procedure is to start the list with the data in which you have the most confidence, and then merge the other sources in one at a time. If you have a lot of data and a lot of problems with it, this process can take a long time. You might want to start with the data you expect to benefit most from consolidating; run a pilot project with that data, to ensure your processes work and you are seeing the business benefits you expect; and then start adding other sources, as time and resources permit. This approach means your project will take longer and possibly cost more, but the risk is lower. It also lets you start with a few organizations and add more as the project demonstrates success, instead of trying to get everybody on board from the start.
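A minimal sketch of the confidence-and-threshold logic described above, using the 95 percent and 80 percent bands from the example. The attribute weights, record layout, and exact-compare scoring are assumptions for illustration; real matching engines use far more sophisticated similarity measures.

    # Illustrative attribute weights; a real tool derives these statistically.
    WEIGHTS = {"name": 0.4, "address": 0.3, "phone": 0.2, "credit_card": 0.1}

    def match_confidence(a: dict, b: dict) -> float:
        """Sum the weights of attributes that agree (exact comparison, for simplicity)."""
        score = 0.0
        for attr, weight in WEIGHTS.items():
            if a.get(attr) and a.get(attr) == b.get(attr):
                score += weight
        return score

    def route_match(a: dict, b: dict, merge, review_queue):
        confidence = match_confidence(a, b)
        if confidence >= 0.95:
            merge(a, b)                              # merge automatically
        elif confidence >= 0.80:
            review_queue.append((a, b, confidence))  # data steward approves first
        # below 0.80: treat the records as different entities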

Another factor to consider when merging your source data into the master list is privacy. When customers become part of the customer master, their information might be visible to any of the applications that have access to the customer master. If the customer data was obtained under a privacy policy that limited its use to a particular application, you might not be able to merge it into the customer master. You might want to add a lawyer to your MDM planning team.

At this point, if your goal was to produce a list of master data, you are done. Print it out or burn it to a CD, and move on. If you want your master data to stay current as data is added and changed, you will have to develop infrastructure and processes to manage the master data over time. The next section provides some options on how to do just that.

How Do I Maintain a Master List?

There are many different tools and techniques for managing and using master data. We will cover three of the more common scenarios here:


Single-copy approach—In this approach, there is only one master copy of the master data. All additions and changes are made directly to the master data. All applications that use master data are rewritten to use the new data instead of their current data. This approach guarantees consistency of the master data, but in most cases it's not practical. Modifying all your applications to use a new data source with a different schema and different data is, at least, very expensive; if some of your applications are purchased, it might even be impossible.



Multiple copies, single maintenance—In this approach, master data is added or changed in the single master copy of the data, but changes are sent out to the source systems in which copies are stored locally. Each application can update the parts of the data that are not part of the master data, but it cannot change or add master data. For example, the inventory system might be able to change quantities and locations of parts, but new parts cannot be added, and the attributes that are included in the product master cannot be changed. This reduces the number of application changes that will be required, but the applications will minimally have to disable functions that add or update master data. Users will have to learn new applications to add or modify master data, and some of the things they normally do will not work anymore.
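A rough sketch of the change-propagation half of this approach, assuming an in-memory hub and direct callbacks purely for illustration; in practice the distribution would typically run over replication, queues, or an integration bus.

    class MasterHub:
        """Single place where master data is changed; copies are pushed to source systems."""
        def __init__(self):
            self.records = {}        # master copy, keyed by master id
            self.subscribers = []    # source systems holding local copies

        def subscribe(self, apply_change):
            """apply_change(master_id, record) refreshes a source system's local copy."""
            self.subscribers.append(apply_change)

        def update(self, master_id, changes: dict):
            record = self.records.setdefault(master_id, {})
            record.update(changes)                      # change made only in the master
            for apply_change in self.subscribers:
                apply_change(master_id, dict(record))   # push the new version out

    hub = MasterHub()
    inventory_copy = {}
    hub.subscribe(lambda mid, rec: inventory_copy.update({mid: rec}))
    hub.update("P-100", {"description": "M4 hex bolt", "weight_kg": 0.01})
    # inventory_copy now holds the refreshed master attributes for part P-100.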


Continuous merge—In this approach, applications are allowed to change their copy of the master data. Changes made to the source data are sent to the master, where they are merged into the master list. The changes to the master are then sent to the source systems and applied to the local copies. This approach requires few changes to the source systems; if necessary, the change propagation can be handled in the database, so no application code is changed. On the surface, this seems like the ideal solution. Application changes are minimized, and no retraining is required. Everybody keeps doing what they are doing, but with higher-quality, more complete data. This approach does have several issues:
- Update conflicts are possible and difficult to reconcile. What happens if two of the source systems change a customer's address to different values? There's no way for the MDM system to decide which one to keep, so intervention by the data steward is required; in the meantime, the customer has two different addresses. This must be addressed by creating data-governance rules and standard operating procedures, to ensure that update conflicts are reduced or eliminated.
- Additions must be remerged. When a customer is added, there is a chance that another system has already added the customer. To deal with this situation, all data additions must go through the matching process again to prevent new duplicates in the master.
- Maintaining consistent values is more difficult. If the weight of a product is converted from pounds to kilograms and then back to pounds, rounding can change the original weight. This can be disconcerting to a user who enters a value and then sees it change a few seconds later.

In general, all these things can be planned for and dealt with, making the user's life a little easier, at the expense of a more complicated infrastructure to maintain and more work for the data stewards. This might be an acceptable trade-off, but it's one that should be made consciously.
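To make the first issue concrete, here is a minimal Python sketch of detecting an update conflict when two source systems propose different values for the same master attribute. The record and queue shapes are assumptions for illustration, not a prescribed design.

    def apply_source_change(master, review_queue, master_id, source, attr, new_value):
        """Apply a change coming from a source system, or park it if it conflicts."""
        record = master[master_id]
        pending = record.setdefault("_pending", {})     # attr -> (source, proposed value)
        if attr in pending and pending[attr][1] != new_value:
            # Two sources disagree within the same cycle: a data steward must decide.
            review_queue.append((master_id, attr, pending[attr], (source, new_value)))
            return
        pending[attr] = (source, new_value)

    master = {"C-42": {"address": "12 Main St, Boston", "_pending": {}}}
    queue = []
    apply_source_change(master, queue, "C-42", "CRM",     "address", "99 Elm St, Cambridge")
    apply_source_change(master, queue, "C-42", "Billing", "address", "7 Oak Ave, Somerville")
    # queue now holds the conflicting proposals instead of silently overwriting one of them.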

Versioning and Auditing

No matter how you manage your master data, it's important to be able to understand how the data got to its current state. For example, if a customer record was created by merging two source records, you might need to know what the original records looked like, in case a data steward determines that the records were merged by mistake and really should be two different customers. Version management should include a simple interface for displaying versions and reverting all or part of a change to a previous version. The normal branching of versions and grouping of changes that source-control systems use can also be very useful for maintaining different derivation changes and reverting groups of changes to a previous branch.

Data stewardship and compliance requirements will often include a way to determine who made each change and when it was made. To support these requirements, an MDM system should include a facility for auditing changes to the master data. In addition to keeping an audit log, the MDM system should include a simple way to find the particular change you are looking for. An MDM system can audit thousands of changes a day, so search and reporting facilities for the audit log are important.

Hierarchy Management

In addition to the master data itself, the MDM system must maintain data hierarchies—for example, bill of materials for products, sales territory structure, organization structure for customers, and so forth. It's important for the MDM system to capture these hierarchies, but it's also useful for an MDM system to be able to modify the hierarchies independently of the underlying systems. For example, when an employee moves to a different cost center, there might be impacts to the Travel and Expense system, payroll, time reporting, reporting structures, and performance management. If the MDM system manages hierarchies, a change to the hierarchy in a single place can propagate to all the underlying systems. There might also be reasons to maintain hierarchies in the MDM system that do not exist in the source systems. For example, revenue and expenses might need to be rolled up into territory or organizational structures that do not exist in any single source system. Planning and forecasting might also require temporary hierarchies to calculate "what if" numbers for proposed organizational changes. Historical hierarchies are also required in many cases to roll up financial information into structures that existed in the past, but not in the current structure. For these reasons, a powerful, flexible hierarchy-management feature is an important part of an MDM system.

Conclusion

The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made creating and maintaining accurate and complete master data a business imperative. Both large and small businesses must develop data-maintenance and governance processes and procedures to obtain and maintain accurate master data. While it's easy to think of master-data management as a technological issue, a purely technological solution without corresponding changes to business processes and controls will likely fail to produce satisfactory results. This paper has covered the reasons for adopting master-data management, the process of developing a solution, and several options for the technological implementation of the solution. Future papers in this series will explain the technological and procedural issues that must be resolved to implement an MDM system.
