Technology has advanced to the point where vast amounts of data can be stored at scale. As a result, it has become difficult to pick out relationships and patterns in the data and to turn those findings into sound business decisions. Tools have therefore been developed to find these patterns automatically.
Despite what the name suggests, data mining is the process of analysing data in a database, identifying patterns and relationships, and then using them to process new information. For example, data mining can be particularly useful for businesses with a strong consumer focus.
By finding patterns in the way people shop, businesses can decide how to price products, which sales to run, and even where to place a product in a store. Using data mining techniques, products and promotions can be tailored to consumer needs and likely spending habits.
Data mining is the key component of Knowledge Discovery in Databases (KDD), the process of discovering useful information in data. KDD encompasses several functions: data storage and access, scaling algorithms to massive data sets, and interpreting results. The data cleansing and data access processes included in data warehousing facilitate KDD, and artificial intelligence also supports it. The patterns recognised in the data must remain valid on new data and hold with some degree of certainty; such patterns are considered new knowledge.
Traditional methods of turning data into knowledge relied on manual analysis, which made testing a data set for patterns slow, expensive and, above all, highly subjective. As databases grew and far more data was stored, analysis had to become automated to overcome these problems; the KDD process was developed in response.
First, the data to be used in the data mining process is selected.
A target data set must be created before the data mining algorithms can be used to identify patterns in the data. The target data set must be large enough to contain these patterns while still being detailed enough to be mined. Data warehouses are typical data sources, as they provide improved access to the information and are purpose-built to retrieve the necessary data quickly. Pre-processing is essential before searching for relationships: the data set must be correct and ready for use before the data mining stage. The target data set is then cleaned, i.e. errors, missing values and noisy or inconsistent data are removed.
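The cleaning step described above can be sketched in a few lines. This is a minimal illustration, not a production routine: the record layout and the rules (drop missing or negative amounts, drop duplicates) are hypothetical examples of the kinds of errors and inconsistencies being removed.

```python
# A minimal sketch of cleaning a target data set: records with missing
# values, inconsistent values or duplicates are removed before mining.
raw_records = [
    {"customer": "A", "amount": 19.99},
    {"customer": "B", "amount": None},    # missing value - dropped
    {"customer": "C", "amount": -5.00},   # inconsistent (negative) - dropped
    {"customer": "A", "amount": 19.99},   # duplicate - dropped
]

def clean(records):
    """Drop records with missing or negative amounts, then de-duplicate."""
    seen, cleaned = set(), []
    for r in records:
        if r["amount"] is None or r["amount"] < 0:
            continue
        key = (r["customer"], r["amount"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

target_data_set = clean(raw_records)
print(target_data_set)  # only the first record survives
```

Real pipelines apply many more rules (range checks, type checks, outlier handling), but the shape is the same: filter and repair until the target data set is fit for the mining stage.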
After the data has been cleaned, it must be converted into a format ready for mining. The system detects the data's original format, determines what it needs to be translated into to be usable by the new system, and then transforms it.
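A common example of such a transformation is rescaling numeric values so that different attributes are comparable. The sketch below shows min-max normalisation, one widely used option; the price figures are made up for illustration.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [10.0, 20.0, 30.0]
print(min_max_scale(prices))  # [0.0, 0.5, 1.0]
```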
Data mining algorithms are now applied to discover the patterns and relationships in the data. The following functions are applied to the data sets:
Detecting anomalies – identifies unusual or unexpected data that does not follow the common pattern of the other results. Any results found to be anomalies could be incorrect and require investigation.
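One simple way to detect anomalies is to flag values that lie far from the mean of the data. This is only a sketch of the idea using standard deviations; real anomaly detection uses many other techniques, and the sensor-style readings here are invented for illustration.

```python
import statistics

def find_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

readings = [10, 11, 9, 10, 12, 10, 95]
print(find_anomalies(readings, threshold=2.0))  # [95] - does not follow the pattern
```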
Association rule learning – uses strict rules to identify relationships between the variables in the data. This is similar to machine learning in that algorithms are used to find the solutions; where it differs is that machine learning determines the algorithms itself and does not require the strict rules to be set in advance.
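The core of association rule learning is measuring how often items occur together. The sketch below computes the confidence of a rule over a tiny invented set of shopping baskets, the standard retail example; full algorithms such as Apriori automate the search for all high-confidence rules.

```python
# Confidence of the rule "antecedent -> consequent": of the transactions
# containing the antecedent, what fraction also contain the consequent?
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]

def confidence(antecedent, consequent):
    has_ante = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_ante if consequent <= t]
    return len(has_both) / len(has_ante)

print(confidence({"bread"}, {"butter"}))  # 2 of 3 bread baskets also contain butter
```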
Clustering – the machine groups together pieces of data with similar properties while separating out the data without those properties.
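A classic clustering technique is k-means, which repeatedly assigns each value to its nearest cluster centre and then recomputes the centres. This is a deliberately small one-dimensional sketch with naive initialisation and made-up data, not a general implementation.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny k-means sketch for one-dimensional data."""
    centroids = sorted(values)[:k]  # naive initialisation
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # [[1, 2, 3], [10, 11, 12]]
```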
Classification – the system learns a function that takes data that has not yet been categorised and assigns it to a pre-defined class. The user defines a structure and the machine categorises the data according to the rules of that structure.
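One of the simplest classifiers is nearest neighbour: a new sample is assigned the class of the most similar training example. The feature (spend per visit) and class labels below are hypothetical, chosen to echo the consumer example earlier in the article.

```python
def classify(sample, training_data):
    """Assign `sample` the class of its nearest training example (1-NN)."""
    nearest = min(training_data, key=lambda ex: abs(ex[0] - sample))
    return nearest[1]

# (feature, class) pairs for a made-up spend-per-visit attribute
training = [(5.0, "low spender"), (8.0, "low spender"),
            (40.0, "high spender"), (55.0, "high spender")]

print(classify(6.0, training))   # low spender
print(classify(48.0, training))  # high spender
```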
Regression – finds the function that models the data with the least error, by estimating how the dependent variable responds when the independent variables are changed.
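The most common form is ordinary least squares, which picks the line whose squared prediction errors are smallest. The sketch below fits a single-variable line using the closed-form formulas; the sample points are contrived so they fall exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # recovers the slope 2 and intercept 1
```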
Summarisation – presents the data in a way that is more understandable to a user, for example through data visualisation techniques.
The patterns that have been found can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics.
The final step in the KDD process is to evaluate the patterns identified by the data mining algorithms. Data mining may produce results that are not valid, do not actually predict future behaviour, and cannot be reproduced on a different data sample: the algorithms may find relationships that do not exist in the general body of data. This is known as overfitting.
To prevent this, the evaluation uses a test data set on which the mining algorithm was not trained, a practice borrowed from machine learning. The newly learned patterns are applied to this test data set and the results are compared with the expected results.
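This held-out evaluation can be sketched as follows. The "model" here is a hypothetical learned pattern (values above 20 are "high") and the test set is invented; the point is only the mechanics of scoring learned patterns against data the algorithm never saw.

```python
def accuracy(model, test_set):
    """Fraction of held-out samples the model classifies correctly."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Hypothetical learned pattern: values above 20 are "high".
model = lambda x: "high" if x > 20 else "low"

# Held-out test data the pattern was not learned from.
held_out = [(5, "low"), (30, "high"), (18, "low"), (25, "high"), (22, "low")]

print(accuracy(model, held_out))  # 4 of 5 correct -> 0.8
```

If the score falls short of the expected standard, the earlier steps are revisited, as described next.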
If the learned patterns do not meet expected standards, the data mining steps must be re-evaluated and the data set created in the pre-processing step must be changed.
If the learned patterns do meet expected standards, they are interpreted and turned into system knowledge.
The knowledge is then organised and presented so that it is understandable to the user, for example as a report or a graph.