Real-life data rarely complies with the requirements of data mining tools. It is usually inconsistent and noisy, and it may contain redundant attributes, unsuitable formats and so on. Hence data has to be prepared carefully before data mining actually starts. It is a well-known fact that the success of a data mining algorithm depends heavily on the quality of data pre-processing, which makes pre-processing one of the most important tasks in data mining. It is also a complicated task, since it often involves large data sets. Sometimes data pre-processing takes more than 50% of the total time spent on solving a data mining problem. It is therefore crucial for data miners to choose an efficient pre-processing technique for the specific data set, one that not only saves processing time but also retains the quality of the data for the data mining process.
A data pre-processing tool should help miners with many data mining activities. For example, data may be provided in different formats, as discussed in the previous chapter (flat files, database files etc.). Data files may also use different value formats and may require calculation of derived attributes, data filters, joined data sets etc. The data mining process generally starts with understanding the data. In this stage pre-processing tools may help with data exploration and data discovery tasks. Data pre-processing includes a lot of tedious work:
* Data Cleaning
* Data Integration
* Data Transformation
* Data Reduction
In this chapter we will study all these data pre-processing activities.
In the data understanding phase the first task is to collect the initial data and then to proceed with activities that help to become familiar with the data, to discover data quality problems, to gain first insights into the data or to identify interesting subsets from which to form hypotheses about hidden information. The data understanding phase according to the CRISP model is outlined below.
The initial collection of data includes loading the data if this is required for data understanding. For instance, if a specific tool is used for data understanding, it makes sense to load the data into this tool. This step may lead to initial data preparation steps. If the data is obtained from multiple data sources, integration is an additional issue.
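The collection and integration step can be illustrated with a minimal sketch. It assumes two hypothetical sources that are not part of the original text: a flat file with customer ages (given here as CSV text) and a database table of purchase totals (an in-memory SQLite table). The field names and the `integrate` helper are illustrative only.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: customer ages as CSV text.
CSV_TEXT = """customer_id,age
1,34
2,51
3,27
"""


def load_flat_file(text):
    """Read CSV text into a dict keyed by customer_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["customer_id"]): {"age": int(row["age"])} for row in reader}


def load_database():
    """Read rows from a hypothetical in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(1, 120.5), (2, 80.0), (4, 15.0)],
    )
    rows = conn.execute("SELECT customer_id, total FROM purchases").fetchall()
    conn.close()
    return {cid: {"total": total} for cid, total in rows}


def integrate(left, right):
    """Inner-join two keyed sources on the key they share."""
    shared = sorted(left.keys() & right.keys())
    return {key: {**left[key], **right[key]} for key in shared}


merged = integrate(load_flat_file(CSV_TEXT), load_database())
print(merged)
```

Only customers present in both sources survive the join. Whether unmatched records should be dropped or kept with missing values is a design decision that depends on the data mining goal.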
In the data description task, the gross or surface properties of the gathered data are examined.
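As a rough illustration of what examining surface properties might look like, the sketch below reports the number of records, the field names and a guessed type per field. The sample data and the `infer_type` heuristic are assumptions made for this example, not a prescribed method.

```python
import csv
import io

# Hypothetical sample; in practice this would be the collected data.
SAMPLE = """id,age,city
1,34,Pune
2,51,Delhi
3,,Mumbai
"""


def infer_type(values):
    """Guess a column type from its non-empty values (simple heuristic)."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


def describe(text):
    """Summarise record count, field names and guessed field types."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys()) if rows else []
    return {
        "records": len(rows),
        "fields": fields,
        "types": {f: infer_type([r[f] for r in rows]) for f in fields},
    }


summary = describe(SAMPLE)
print(summary)
```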
Data exploration handles the data mining questions that can be addressed using querying, visualization and reporting. These include:
* Distribution of key attributes, for instance the target attribute of a prediction task
* Relations between pairs or small numbers of attributes
* Results of simple aggregations
* Properties of important sub-populations
* Simple statistical analyses
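The exploration questions listed above can be sketched with a few lines of plain Python. The records, the attribute names (`age`, `plan`, `churn`) and the choice of `churn` as the target attribute are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical records; 'churn' plays the role of the target attribute.
records = [
    {"age": 25, "plan": "basic", "churn": "yes"},
    {"age": 40, "plan": "premium", "churn": "no"},
    {"age": 31, "plan": "basic", "churn": "no"},
    {"age": 58, "plan": "premium", "churn": "no"},
    {"age": 22, "plan": "basic", "churn": "yes"},
]

# Distribution of the target attribute.
target_distribution = Counter(r["churn"] for r in records)

# Simple aggregation per sub-population: mean age for each plan.
ages_by_plan = defaultdict(list)
for r in records:
    ages_by_plan[r["plan"]].append(r["age"])
mean_age_by_plan = {plan: mean(ages) for plan, ages in ages_by_plan.items()}

# Relation between a pair of attributes: cross-tabulate plan against churn.
plan_vs_churn = Counter((r["plan"], r["churn"]) for r in records)

# Simple statistical analysis of one numeric attribute.
ages = [r["age"] for r in records]
age_stats = {"min": min(ages), "median": median(ages), "max": max(ages)}

print(target_distribution, mean_age_by_plan, plan_vs_churn, age_stats)
```

In this toy data every churned customer is on the basic plan, which is the kind of first insight that may suggest a hypothesis worth testing on the full data set.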
In the data quality verification step, the quality of the data is examined. It answers questions such as:
* Is the data complete (does it cover all the cases required)?
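A completeness check of this kind might be sketched as follows. The example assumes that missing values are marked with `None` and that the set of required cases (here, regions) is known in advance; both are assumptions made for illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "region": "north", "income": 42000},
    {"id": 2, "region": "south", "income": None},
    {"id": 3, "region": "north", "income": 38000},
    {"id": 4, "region": None, "income": 51000},
]

# Assumed set of cases that the analysis needs to cover.
REQUIRED_REGIONS = {"north", "south", "east"}


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def uncovered_cases(rows, field, required):
    """Return the required values of `field` that never occur in the data."""
    seen = {row[field] for row in rows if row[field] is not None}
    return required - seen


missing = missing_counts(records)
uncovered = uncovered_cases(records, "region", REQUIRED_REGIONS)
print(missing, uncovered)
```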