AN OVERVIEW OF KNOWLEDGE DISCOVERY IN DATABASE (KDD) PROCESS TOWARDS DATA MINING

1. INTRODUCTION

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information gathering, data archaeology, and data pattern processing. The rapid emergence of electronic data management methods has led some to call recent times the “Information Age.” Powerful database systems for collecting and managing data are in use in virtually all large and mid-range companies; there is hardly a transaction that does not generate a computer record somewhere. Each year more operations are automated, collecting information about activities and performance. All these data hold valuable information, e.g., trends and patterns, that could be used to improve business decisions and optimize success. However, today’s databases contain so much data that it is practically impossible to analyze them manually for valuable decision-making information. In many cases, hundreds of independent attributes need to be considered simultaneously to model system behavior accurately.

The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases (KDD) was coined at the first KDD workshop in 1989 [1] (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine learning fields. In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article.
The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

2. THE INTERDISCIPLINARY NATURE OF KDD

KDD has evolved, and continues to evolve, from the intersection of research fields including machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets. The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns in data in the data-mining step of the KDD process. A natural question is how KDD differs from pattern recognition or machine learning (and related fields). The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD.
KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes the scaling and robustness properties of modeling algorithms for large, noisy data sets. Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation [10] (Shrager and Langley 1990), and causal modeling for the inference of causal models from data [11] (Spirtes, Glymour, and Scheines 1993).

Statistics in particular has much in common with KDD. Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and a framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s, when computer-based data-analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics, and much of this work is directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD).
Indeed, the problem of effective data manipulation when the data cannot fit in main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume the data are in main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and for providing access paths to data that were previously difficult to reach (e.g., stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for the analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd [12] (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions.
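
To make "summaries and breakdowns along many dimensions" concrete, the following is a minimal Python sketch, not taken from the article and not modeled on any particular OLAP product. The SALES records and the helper names rollup and cube are hypothetical; the sketch simply sums one measure for every subset of three dimensions, which is the kind of small data cube an OLAP tool would let an analyst browse interactively.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: three dimensions (region, product, quarter)
# and one numeric measure (amount). The values are made up.
SALES = [
    {"region": "north", "product": "tea", "quarter": "Q1", "amount": 120},
    {"region": "north", "product": "coffee", "quarter": "Q1", "amount": 200},
    {"region": "south", "product": "tea", "quarter": "Q1", "amount": 80},
    {"region": "south", "product": "tea", "quarter": "Q2", "amount": 95},
    {"region": "north", "product": "coffee", "quarter": "Q2", "amount": 210},
]


def rollup(records, dims, measure="amount"):
    """Sum `measure` for each combination of values of the given dimensions."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        totals[key] += rec[measure]
    return dict(totals)


def cube(records, all_dims, measure="amount"):
    """Compute a roll-up for every subset of the dimensions (a tiny data cube)."""
    result = {}
    for k in range(len(all_dims) + 1):
        for dims in combinations(all_dims, k):
            result[dims] = rollup(records, dims, measure)
    return result


if __name__ == "__main__":
    summaries = cube(SALES, ("region", "product", "quarter"))
    print(summaries[()])                     # grand total
    print(summaries[("region",)])            # breakdown by region
    print(summaries[("region", "quarter")])  # breakdown by region and quarter
```

With the sample data above, the breakdown by region is {('north',): 530, ('south',): 175} and the grand total is 705. A real warehouse would compute such aggregates inside the database rather than in application code; the sketch only shows what is being computed.
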
OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

3. DATA MINING AND KNOWLEDGE DISCOVERY IN THE REAL WORLD

Much of the current interest in KDD is the result of the media attention surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images [2] (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See [3] Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior.
Business Week [4] (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reported a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis [5] (Agrawal et al. 1996) systems, which find patterns such as, “If a customer bought X, he/she is also likely to buy Y and Z.” Such buying patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural networks, and genetic algorithms to manage portfolios totaling 600 million U.S. dollars; since its start in 1993, the system has outperformed the broad stock market [6] (Hall, Mani, and Barr 1996).

Fraud detection: The HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts. The FAIS system [7] (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. Clustering methods are used to derive families of faults. CASSIOPEE received the European first prize for innovative applications.

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks [8] (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools that support interactivity and iteration.
In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims [9] (Hernandez and Stolfo 1995). It was used successfully on data from the Social Services of Washington State.

In other areas, a well-documented system is IBM’s ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several NBA teams in 1996, including the Seattle SuperSonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, Firefly is a personal music-recommendation agent: It asks a user for his or her opinion of several music pieces and then suggests other music that the user might like.

4. KNOWLEDGE DISCOVERY AND DATA MINING

This section provides an introduction to the area of knowledge discovery and to data mining tasks.

The Knowledge Discovery Process

There is still some confusion about the terms knowledge discovery in databases (KDD) and data mining. Often these two terms are used interchangeably. We use the term KDD to denote the overall process of turning low-level data into high-level knowledge. A simple definition of KDD is as follows: knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
We adopt the commonly used definition of data mining as the extraction of patterns or models from observed data. Although it is at the core of the knowledge discovery process, this step usually takes only a small part (estimated at 15% to 25%) of the overall effort. Hence data mining is just one step in the overall KDD process. Other steps include: developing an understanding of the application domain and the goals of the data mining process; acquiring or selecting a target data set; integrating and checking the data set; data cleaning, preprocessing, and transformation; model development and hypothesis building; choosing suitable data mining algorithms; result interpretation and visualization; result testing and verification; and using and maintaining the discovered knowledge.

Data Mining Tasks

At the core of the KDD process are the data mining methods for extracting patterns from data. These methods can have different goals, depending on the intended outcome of the overall KDD process. It should also be noted that several methods with different goals may be applied in succession to achieve a desired result. For example, to determine which customers are likely to buy a new product, a business analyst might first use clustering to segment the customer database and then apply regression to predict buying behavior for each cluster. Most data mining goals fall under the following categories:

Data processing: Depending on the goals and requirements of the KDD process, analysts may select, filter, aggregate, sample, clean, and/or transform data. Automating some of the most typical data processing tasks and integrating them seamlessly into the overall process may eliminate or at least reduce the need for programming specialized routines and for data export/import, thus improving the analyst’s productivity.

Prediction: Given a data item and a predictive model, predict the value for a specific attribute of the data item.
For example, given a predictive model of credit card transactions, predict the likelihood that a specific transaction is fraudulent.

Regression: Given a set of data items, regression is the analysis of the dependency of some attribute values upon the values of other attributes in the same item, and the automatic production of a model that can predict these attribute values for new records. For example, given a data set of credit card transactions, build a model that can predict the likelihood of fraud for new transactions.

Classification: Given a set of predefined categorical classes, determine to which of these classes a specific data item belongs. For example, given classes of patients that correspond to medical treatment responses, identify the form of treatment to which a new patient is most likely to respond.

Clustering: Given a set of data items, partition this set into a set of classes such that items with similar characteristics are grouped together. Clustering is best used for finding groups of items that are similar. For example, given a data set of customers, identify subgroups of customers that have similar buying behavior.

Link analysis (associations): Given a set of data items, identify relationships between attributes and items such that the presence of one pattern implies the presence of another pattern. These relationships may be associations between attributes within the same data item. The investigation of relationships between items over a period of time is often referred to as “sequential pattern analysis.”

Model visualization: Visualization plays an important role in making the discovered knowledge understandable and interpretable by humans. Besides, the human eye-brain system itself still remains the best pattern-recognition device known. Visualization techniques may range from simple scatter plots and histogram plots over parallel coordinates to 3D movies.

5. THE DATA-MINING STEP OF THE KDD PROCESS

The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods. The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user’s hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process, where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst.
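
As one small illustration of how these basic techniques combine, the clustering-then-regression example from Section 4 (segment the customers, then predict buying behavior per segment) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the CUSTOMERS data are made up, the helper names kmeans_1d, fit_line, and segment_then_regress are hypothetical, and a plain one-dimensional k-means with ordinary least squares stands in for whatever methods a real analyst would choose.

```python
from statistics import mean

# Hypothetical customers: (age, store_visits_per_month, monthly_spend).
CUSTOMERS = [
    (22, 1, 12.0), (25, 2, 22.0), (24, 3, 32.0), (23, 4, 42.0),
    (61, 1, 55.0), (65, 2, 60.0), (63, 3, 65.0), (64, 4, 70.0),
]


def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on scalars (assumes k >= 2).

    Centers start spread evenly over the value range, so the result
    is deterministic for a given input.
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def segment_then_regress(customers, k=2):
    """Cluster on age, then fit spend ~ visits separately in each cluster."""
    labels = kmeans_1d([age for age, _, _ in customers], k)
    models = {}
    for c in set(labels):
        rows = [cust for cust, lab in zip(customers, labels) if lab == c]
        models[c] = fit_line([r[1] for r in rows], [r[2] for r in rows])
    return labels, models


if __name__ == "__main__":
    labels, models = segment_then_regress(CUSTOMERS)
    for c, (slope, intercept) in sorted(models.items()):
        print(f"cluster {c}: spend = {slope:.1f} * visits + {intercept:.1f}")
```

On this toy data the two age groups get different fitted lines (a slope of 10 for the younger cluster and 5 for the older one), which is the point of segmenting first: a single regression over all customers would blur the two behaviors together.
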
It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques.

6. RESEARCH AND APPLICATION CHALLENGES

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problem that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables, millions of records, and a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms, sampling, approximation, and massively parallel processing.

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables), so that the dimensionality of the problem is high. A high-dimensional data set creates problems by increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.

Prior knowledge: Prior knowledge is important in all steps of the KDD process. Bayesian approaches [13] (e.g., Cheeseman 1990) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search [14] (e.g., Simoudis, Livezey, and Kerber 1995).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (e.g., through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Examples of integrated KDD systems are described by [14] Simoudis, Livezey, and Kerber (1995).

7. CONCLUSION

This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD. We hope that this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

9. REFERENCES

[1] Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Magazine 11(5): 68-70.
[2] Fayyad, U. M., Djorgovski, S. G., and Weir, N. 1996. From Digitized Images to On-Line Catalogs: Data Mining a Sky Survey. AI Magazine 17(2): 51-66.
[3] Fayyad, U. M., Haussler, D., and Stolorz, P. 1996. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 50-56. Menlo Park, Calif.: American Association for Artificial Intelligence.
[4] Berry, J. 1994. Database Marketing. Business Week, September 5, 56-62.
[5] Agrawal, R., and Psaila, G. 1995. Active Data Mining.
In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3-8. Menlo Park, Calif.: American Association for Artificial Intelligence.
[6] Hall, J., Mani, G., and Barr, D. 1996. The Application of Computational Intelligence to the Investment Process. In Proceedings of CIFEr-96: Computational Intelligence in Financial Engineering. Washington, D.C.: IEEE Computer Society.
[7] Senator, T., Goldberg, H. G., Wooton, J., Cottini, M. A., Umar Khan, A. F., Klinger, C. D., Llamas, W. M., Marrone, M. P., and Wong, R. W. H. 1995. The Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine 16(4): 21-39.
[8] Mannila, H., Toivonen, H., and Verkamo, A. I. 1995. Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 210-215. Menlo Park, Calif.: American Association for Artificial Intelligence.

Mr. G. Pandiyan, Lecturer, Dept. of MCA, RVS College of Arts & Science, Sulur, Coimbatore-641662
