KU 5TH SEM ASSIGNMENT - BSIT (TB) - 53 (DATA WAREHOUSING & DATA MINING)
PART - A
I. Note: Answer all the questions.

a) What is Normalization? What are the different forms of Normalization?
The usual approach to normalization in database applications is to divide the data into two or more tables, such that when the data in one of them is updated it does not lead to anomalies (the student is advised to refer to any book on database management systems for details, if interested). The idea is to ensure that, when combined, the data available is consistent. In data warehousing, however, one may even tend to break a large table into several "denormalized" smaller tables. This may consume a lot of extra space, but it helps in an indirect way: it avoids the overhead of joining the data during queries.

b) Define Data warehouse. What are the roles of education in a data warehousing delivery process?
Data Warehouse: In its simplest form, a data warehouse is a collection of key pieces of information used to manage and direct the business for the most profitable outcome. It would decide the amount of inventory to be held, the number of employees to be hired, the amount to be procured on loan, etc.
The above definition may not be precise - there are different definitions given by different authors - but we keep this idea of a data warehouse in mind and proceed. It is a large collection of data and a set of process managers that use this data to make information available. The data can be metadata, facts, dimensions and aggregations. The process managers can be load managers, warehouse managers or query managers. The information made available is such that it allows the end users to make informed decisions.

Roles of education in a data warehousing delivery process: Education has two roles to play - one is to make people, especially top-level policy makers, comfortable with the concept; the second is to aid the prototyping activity. To take care of both, an initial (usually scaled-down) prototype is created and people are encouraged to interact with it. This helps achieve both the activities listed above: the users become comfortable with the use of the system, and the warehouse developer becomes aware of the limitations of his prototype, which can then be improved upon.

c) What are process managers? What are the different types of process managers?
Process Managers: These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database. The main types of process managers are:
i) Load manager: to take care of source interaction, data transformation and data load.
ii) Warehouse manager: to take care of data movement, metadata management and performance monitoring.
iii) Query manager: to control query scheduling and monitoring.
We shall look into each of them briefly. Before that, we look at a schematic diagram that defines the boundaries of the three types of managers.
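As a rough illustrative sketch (not from the text), the boundaries of the three managers can be expressed in Python; all class and method names here are hypothetical, and real warehouse managers do far more than this:

```python
class LoadManager:
    """Boundary: source interaction, transformation and data load."""
    def extract(self, source):
        return list(source)                       # pull rows from the source
    def transform(self, rows):
        return [r.strip().lower() for r in rows]  # simple mapping transformations
    def load(self, rows, staging):
        staging.extend(rows)                      # load into temporary storage

class WarehouseManager:
    """Boundary: data movement, metadata management, monitoring."""
    def move(self, staging, warehouse):
        warehouse.extend(staging)                 # move staged rows into the warehouse
        staging.clear()
    def metadata(self, warehouse):
        return {"row_count": len(warehouse)}      # bookkeeping about the stored data

class QueryManager:
    """Boundary: query scheduling and monitoring."""
    def run(self, warehouse, predicate):
        return [r for r in warehouse if predicate(r)]

staging, warehouse = [], []
lm, wm, qm = LoadManager(), WarehouseManager(), QueryManager()
lm.load(lm.transform(lm.extract(["  Pen ", "INK"])), staging)
wm.move(staging, warehouse)
print(qm.run(warehouse, lambda r: r.startswith("p")))   # ['pen']
```

The point of the sketch is only the division of duties: each manager touches the data at a different stage and nowhere else.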
d) Give the architectures of data mining systems.
e) What are the guidelines for a KDD environment?
It is customary in the computer industry to formulate rules of thumb that help information technology (IT) specialists apply new developments. In setting up a reliable data mining environment, we may follow these guidelines so that the KDD system works in the manner we desire:
i) Support extremely large data sets
ii) Support hybrid learning
iii) Establish a data warehouse
iv) Introduce data cleaning facilities
v) Facilitate working with dynamic coding
vi) Integrate with decision support systems
PART - B
II. Answer any FIVE full questions.

1. a) With the help of a diagram explain the architecture of a data warehouse.
The architecture of a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture: it only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.
Before looking into the details of each of the managers, we can get a broad idea of their functionality by mapping the processes studied in the previous chapter to the managers. The extracting and loading processes are taken care of by the load manager. The processes of cleanup and transformation of data, as also of backup and archiving, are the duties of the warehouse manager, while the query manager, as the name implies, takes care of query management.

b) Indicate the important functions of a Load Manager and a Warehouse Manager.
Important functions of the Load Manager:
i) To extract data from the source(s)
ii) To load the data into a temporary storage device
iii) To perform simple transformations to map it to the structures of the data warehouse
Important functions of the Warehouse Manager:
i) Analyze the data to confirm data consistency and data integrity.
ii) Transform and merge the source data from the temporary data storage into the warehouse.
iii) Create indexes, cross references, partition views, etc.
iv) Check for normalization.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations.
vii) Create backups of data.
viii) Archive the data that needs to be archived.
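The extract/transform/load and merge/index/aggregate duties listed above can be sketched in a few lines of Python. This is an illustration only (not from the text); the source rows, field names and aggregation are hypothetical:

```python
from collections import defaultdict

# Hypothetical source rows, as the load manager might extract them.
SOURCE = [
    {"item": "phone", "qty": "2", "price": "300"},
    {"item": "phone", "qty": "1", "price": "300"},
    {"item": "tablet", "qty": "4", "price": "150"},
]

def load_manager(source):
    """Extract rows and apply simple transformations (type conversion,
    a derived column) into temporary storage."""
    staging = []
    for row in source:
        staging.append({
            "item": row["item"],
            "qty": int(row["qty"]),
            "amount": int(row["qty"]) * int(row["price"]),
        })
    return staging

def warehouse_manager(staging):
    """Merge the staged data into the warehouse, create an index on
    'item', and generate a simple aggregation (total amount per item)."""
    warehouse = list(staging)          # merged fact rows
    index = defaultdict(list)          # item -> row positions
    totals = defaultdict(int)          # aggregation table
    for pos, row in enumerate(warehouse):
        index[row["item"]].append(pos)
        totals[row["item"]] += row["amount"]
    return warehouse, dict(index), dict(totals)

staging = load_manager(SOURCE)
warehouse, index, totals = warehouse_manager(staging)
print(totals)   # {'phone': 900, 'tablet': 600}
```

Note how the load manager only touches the source and the staging area, while the warehouse manager only touches the staging area and the warehouse, mirroring the boundaries described above.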
2. a) Differentiate between vertical partitioning and horizontal partitioning.
In horizontal partitioning, we simply put the first few thousand entries in one partition, the second few thousand in the next, and so on. This can be done by partitioning by time, wherein all data pertaining to the first month/year is put in the first partition, that of the second in the second partition, and so on. The other alternatives are partitioning based on different-sized dimensions, partitioning on other dimensions, partitioning on the size of the table, and round-robin partitioning. Each of them has certain advantages as well as disadvantages.
In vertical partitioning, some columns are stored in one partition and certain other columns of the same row in a different partition. This can be achieved either by normalization or by row splitting. We will look into their relative trade-offs.

b) What is a schema? Distinguish between facts and dimensions.
A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is interested only in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made.
The star schema looks a good solution to the problem of warehousing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding them. Whereas the dimensions are liable to change, the facts are not. But given a set of
raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction:
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.

3. a) What is an event in data warehousing? List any five events.
An event is defined as a measurable, observable occurrence of a defined action. If this definition seems quite vague, it is because it encompasses a very large set of operations. The event manager is a piece of software that continuously monitors the system for the occurrence of an event and then takes any action that is suitable (note that the event is a "measurable and observable" occurrence). The action to be taken is also normally specific to the event. A partial list of the common events that need to be monitored is as follows:
i) Running out of memory space
ii) A process dying
iii) A process using excessive resources
iv) I/O errors
v) Hardware failure

b) What is a summary table? Describe the aspects to be looked into while designing a summary table.
The main purpose of using summary tables is to cut down the time taken to execute a specific query.
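A minimal sketch of the idea (not from the text, with hypothetical sales data): the aggregate is computed once ahead of time, so a later query reads a few summary rows instead of rescanning every detailed fact row.

```python
from collections import defaultdict

# Hypothetical detailed fact rows: (month, region, amount)
SALES = [
    ("jan", "north", 100), ("jan", "south", 50),
    ("feb", "north", 70),  ("feb", "south", 30),
]

# Build the summary table once, ahead of query time:
# total amount per month, aggregated over the region dimension.
summary = defaultdict(int)
for month, _region, amount in SALES:
    summary[month] += amount

# A later query consults the small summary table directly.
print(summary["jan"])   # 150
```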
The main methodology involves minimizing the volume of data being scanned each time the query is to be answered; in other words, partial answers to the query are made available in advance. For example, in the earlier cited example of the mobile market, suppose one expects the potential customers to be (i) citizens above 18 years of age, (ii) with salaries greater than 15,000, and (iii) with professions that involve travelling. Then, every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised.
Summary tables are designed by following the steps given below:
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.

4. a) List the significant issues in automatic cluster detection.
Most of the issues related to automatic cluster detection are connected to the kinds of questions we want answered in the data mining project, or to data preparation for their successful application.
i) Distance measure
Most clustering techniques use the Euclidean distance formula (the square root of the sum of the squares of distances along each attribute axis) as the distance measure. Non-numeric variables must be transformed and scaled before the clustering can take place. Depending on these transformations, the categorical variables may dominate the clustering results or they may even be completely ignored.
ii)
Choice of the right number of clusters
If the number of clusters k in the K-means method is not chosen so as to match the natural structure of the data, the results will not be good. The proper way to alleviate this is to experiment with different values of k. In principle, the best k value will exhibit the smallest intra-cluster distances and the largest inter-cluster distances.
iii) Cluster interpretation
Once the clusters are discovered, they have to be interpreted in order to have some value for the data mining project.

b) Define data marting. List the reasons for data marting.
A data mart stores a subset of the data available in the warehouse, so that one need not always scan through the entire content of the warehouse. It is similar to a retail outlet. A data mart speeds up queries, since the volume of data to be scanned is much less. It also helps to have tailor-made processes for different access tools, to impose control strategies, etc. The following are the reasons for which data marts are created:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.

5. a) Explain how to categorize data mining systems.
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among others are the following classifications:
a) Classification according to the type of data source mined: this categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.
b) Classification according to the data model drawn on: this categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.
c) Classification according to the kind of knowledge discovered: this categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities offered, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
d) Classification according to the mining techniques used: data mining systems employ and
provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc.

b) List and explain the different kinds of data that can be mined.
The different kinds of data that can be mined are listed below:
i) Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level.
ii) Relational Databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships.
iii) Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from multiple (often heterogeneous) data sources, intended to be used as a whole under the same unified schema.
iv) Multimedia Databases: Multimedia databases include video, image, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.
v) Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical information like maps and global or regional positioning.
vi) Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis.
vii) World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily.

6. a) Give the syntax for task-relevant data specification.
Syntax for task-relevant data specification:
The first step in defining a data mining task is the specification of the task-relevant data, that
is, the data on which mining is to be performed. This involves specifying the database and tables or data warehouse containing the relevant data, the conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. DMQL provides clauses for the specification of such information, as follows:
i) use database (database_name) or use data warehouse (data_warehouse_name): The use clause directs the mining task to the database or data warehouse specified.
ii) from (relation(s)/cube(s)) [where (condition)]: The from and where clauses respectively specify the database tables or data cubes involved, and the conditions defining the data to be retrieved.
iii) in relevance to (attribute_or_dimension_list): This clause lists the attributes or dimensions for exploration.
iv) order by (order_list): The order by clause specifies the sorting order of the task-relevant data.
v) group by (grouping_list): The group by clause specifies criteria for grouping the data.
vi) having (conditions): The having clause specifies the conditions under which groups of data are considered relevant.

b) Explain the design of a GUI based on a data mining query language.
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to
use and the syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition - This component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns - This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
c) Hierarchy specification and manipulation - This component allows for concept hierarchy specification, either manually by the user or automatically. In addition, it should allow concept hierarchies to be modified by the user or adjusted automatically based on a given data set distribution.
d) Manipulation of data mining primitives - This component may allow the dynamic adjustment of data mining thresholds, as well as the selection, display and modification of concept hierarchies. It may also allow the modification of previous data mining queries or conditions.
e) Interactive multilevel mining - This component should allow roll-up or drill-down operations on discovered patterns.
f) Other miscellaneous information - This component may include on-line help manuals, indexed search, debugging and other interactive graphical facilities.

7. a) Explain how decision trees are useful in data mining.
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that
decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved.

b) Identify an application and also explain the techniques that can be incorporated in solving the problem using data mining techniques.
Write yourself...

8. Write short notes on: i) Data Mining Query Language ii) Schedule Manager iii) Data Formatting.
i) Data Mining Query Language
A data mining query language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis. Each task has different requirements. The design of an effective data mining query language requires a
deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.
ii) Schedule Manager
Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse. Hence it is desirable to have specially designed schedulers to manage the operations.
iii) Data Formatting
Data formatting is the final data preparation step. It consists of syntactic modifications to the data that do not change its meaning, but are required by the particular modelling tool chosen for the data mining task. These include:
a) Reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, e.g. putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example).
b) Changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, or replacing special characters with an allowed set of special characters.