can be viewed as a result of the natural evolution of information technology.
Data mining
The database system industry has witnessed an evolutionary path in the development of the following functionalities:
data collection and database creation
data management
advanced data analysis
With numerous database systems offering query and transaction processing as common practice, _________________ has naturally become the next target.
advanced data analysis
Efficient methods for __________, where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.
on-line transaction processing (OLTP)
One data repository architecture that has emerged is the __________, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making.
data warehouse
Data warehouse technology includes:
data cleaning
data integration
on-line analytical processing (OLAP)
analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles.
on-line analytical processing (OLAP)
Data flow in and out like streams, as in applications like video surveillance, telecommunication, and sensor networks.
data streams
The abundance of data, coupled with the need for powerful data analysis tools, has been described as a _______________ situation.
data rich but information poor
As a result, data collected in large data repositories become “__________”—data archives that are seldom visited.
data tombs
Consider expert system technologies, which typically rely on users or domain experts to _________ input knowledge into knowledge bases.
manually
The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into “__________” of knowledge.
golden nuggets
refers to extracting or “mining” knowledge from large amounts of data.
data mining
Data mining should have been more appropriately named “_____________”.
knowledge mining from data
“____________,” a shorter term, may not reflect the emphasis on mining from large amounts of data.
Knowledge mining
Terms that carry a similar or slightly different meaning to data mining:
knowledge mining from data,
knowledge extraction,
data/pattern analysis,
data archaeology,
data dredging
Many people treat data mining as a synonym for another popularly used term, ____________.
Knowledge Discovery from Data, or KDD
Knowledge discovery process:
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
to remove noise and inconsistent data
Data cleaning
where multiple data sources may be combined
Data integration
where data relevant to the analysis task are retrieved from the database
Data selection
where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance
Data transformation
an essential process where intelligent methods are applied in order to extract data patterns
Data mining
to identify the truly interesting patterns representing knowledge based on some interestingness measures
Pattern evaluation
where visualization and knowledge representation techniques are used to present the mined knowledge to the user
Knowledge presentation
Steps 1 to 4 are different forms of ___________, where the data are prepared for mining
data preprocessing
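The seven steps of the knowledge discovery process can be sketched as a small pipeline. This is only an illustration on toy data: the records, function names, and the count-based interestingness threshold are all assumptions made for the example, not part of the source material.

```python
from collections import Counter

# Two hypothetical sources of (customer, item) purchase records.
source_a = [("ann", "computer"), ("bob", None), ("ann", "software")]
source_b = [("cy", "computer"), ("cy", "software"), ("bob", "printer")]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(records, items_of_interest):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r[1] in items_of_interest]

def transform(records):
    """Data transformation: consolidate records into one item set per customer."""
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep items that meet the (assumed) threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def run_pipeline():
    # Steps 1 to 4 are the preprocessing steps.
    records = integrate(clean(source_a), clean(source_b))
    records = select(records, {"computer", "software"})
    baskets = transform(records)
    # Steps 5 and 6: mining and pattern evaluation.
    return evaluate(mine(baskets), min_count=2)

patterns = run_pipeline()

# Step 7, knowledge presentation: here just a sorted printout.
for item, n in sorted(patterns.items()):
    print(f"{item}: bought by {n} customers")
```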
(T/F) Data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation.
True
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
Database, data warehouse, World Wide Web, or other information repository
It is responsible for fetching the relevant data, based on the user’s data mining request.
Database or data warehouse server
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
Knowledge base
This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Data mining engine
This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
Pattern evaluation module
used to organize attributes or attribute values into different levels of abstraction.
concept hierarchies
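A concept hierarchy can be pictured as a mapping from each value to its parent at the next level of abstraction. The sketch below assumes a hypothetical location hierarchy (city, then province or state, then country); the names and the `generalize` helper are illustrative only.

```python
# Hypothetical concept hierarchy for a "location" attribute:
# city -> province/state -> country.
hierarchy = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Chicago": "Illinois",
    "British Columbia": "Canada",
    "Illinois": "USA",
}

def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels.

    Values with no parent (already at the top) are returned unchanged.
    """
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value

province = generalize("Vancouver")
country = generalize("Vancouver", levels=2)
```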
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
User interface
For an algorithm to be _________, its running time should grow approximately linearly in proportion to the size of the data, given the available system resources such as main memory and disk space.
scalable
A ________ consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
database system or DBMS
A _________ is a collection of tables, each of which is assigned a unique name.
relational database
Each table consists of a set of _________ (columns or fields) and usually stores a large set of _______ (records or rows).
attributes, tuples
A semantic data model, such as an __________ data model, is often constructed for relational databases.
entity-relationship (ER)
Relational data can be accessed by __________ written in a relational query language, such as SQL, or with the assistance of graphical user interfaces.
database queries
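As a minimal illustration of a database query against a relational table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customer` table, its attributes, and its tuples are made up for the example.

```python
import sqlite3

# In-memory relational database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Each inserted row is a tuple; each column is an attribute.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "Vancouver"), (2, "Bob", "Chicago"), (3, "Cy", "Vancouver")],
)

# A database query written in SQL: names of customers in one city.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name",
    ("Vancouver",),
).fetchall()
conn.close()
```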
A _________ is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site.
data warehouse
A data warehouse is usually modeled by a multidimensional database structure, where each ________ corresponds to an attribute or a set of attributes in the schema, and each ______ stores the value of some aggregate measure.
dimension, cell
A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.
multidimensional data cube
Allow the user to view the data at differing degrees of summarization
drill-down and roll-up
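Roll-up and drill-down can be sketched by aggregating the same facts over fewer or more dimensions. This is a simplified illustration: the sales facts are invented, and roll-up is shown only as dropping a dimension (climbing a concept hierarchy is another way to roll up).

```python
from collections import defaultdict

# Hypothetical fact records: (city, quarter, sales amount).
sales = [
    ("Vancouver", "Q1", 100),
    ("Vancouver", "Q2", 150),
    ("Chicago", "Q1", 80),
    ("Chicago", "Q2", 120),
]

def aggregate(facts, dims):
    """Sum the sales measure over the chosen dimension indexes.

    dims uses 0 for city and 1 for quarter; each resulting cell is keyed
    by the values of the kept dimensions.
    """
    cells = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        cells[key] += fact[2]
    return dict(cells)

detailed = aggregate(sales, dims=(0, 1))   # most detailed view
by_city = aggregate(sales, dims=(0,))      # roll-up: quarter removed
grand_total = aggregate(sales, dims=())    # roll-up to a single cell
```

Going from `by_city` back to `detailed` corresponds to drilling down: the same totals are split across the quarter dimension again.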
a _________ consists of a file where each record represents a transaction.
transactional database
_____________ are constructed based on an object-relational data model.
Object-relational databases
Extends the relational model by providing a rich data type for handling complex objects and object orientation.
Object-relational databases
The object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an _______.
object
Each object has a set of:
variables
messages
methods
A __________ typically stores relational data that include time-related attributes.
temporal database
A __________ stores sequences of ordered events, with or without a concrete notion of time.
sequence database
A __________ stores sequences of values or events obtained over repeated measurements of time.
time-series database
____________ contain spatial-related information.
Spatial databases
Spatial data may be represented in __________, consisting of n-dimensional bit maps or pixel maps.
raster format
Maps can be represented in ___________, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs.
vector format
A spatial database that stores spatial objects that change with time is called a _____________, from which interesting information can be mined.
spatiotemporal database
_________ are databases that contain word descriptions for objects.
Text databases
____________ store image, audio, and video data.
Multimedia databases
Because video and audio data require real-time retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system buffer overflows, such data are referred to as ___________ data.
continuous-media
A __________ consists of a set of interconnected, autonomous component databases.
heterogeneous database
A __________ is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.
legacy database
Data flow in and out of an observation platform dynamically.
stream data
Capturing user access patterns in such distributed information environments is called ____________.
Web usage mining (or Weblog mining)
___________ based on linkages among Web pages can help rank Web pages based on their importance, influence, and topics.
authoritative Web page analysis
____________ help group and arrange Web pages in a multidimensional manner based on their contents.
Automated Web page clustering and classification
___________ helps identify hidden Web social networks and communities and observe their evolution.
Web community analysis
_______ characterize the general properties of the data in the database.
Descriptive mining tasks
____________ perform inference on the current data in order to make predictions.
Predictive mining tasks
Descriptions of a class or a concept are called __________.
class/concept descriptions
_________ is a summarization of the general characteristics or features of a target class of data.
Data characterization
The resulting descriptions can also be presented as _________ or in rule form (called characteristic rules)
generalized relations
_________ is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
Data discrimination
Discrimination descriptions expressed in rule form are referred to as _________.
discriminant rules
_________ are patterns that occur frequently in data.
Frequent patterns
A __________ typically refers to a set of items that frequently appear together in a transactional data set.
frequent itemset
A frequently occurring subsequence is a ________.
(frequent) sequential pattern
If a substructure occurs frequently, it is called a _____________.
(frequent) structured pattern
Association rules that contain a single predicate are referred to as ______________.
single-dimensional association rules
A 1% _______ means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
support
A ________, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
confidence
Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum _______ threshold and a minimum ________ threshold.
support, confidence
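Support and confidence for a rule such as computer ⇒ software can be computed directly from transaction counts. The five transactions and the helper names below are assumptions for illustration.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: computer => software
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)

# A rule is kept only if it passes both (assumed) thresholds.
is_interesting = rule_support >= 0.3 and rule_confidence >= 0.6
```

Here two of the five transactions contain both items, and two of the three transactions containing a computer also contain software.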
Additional analysis can be performed to uncover interesting statistical ________ between associated attribute-value pairs.
correlations
__________ is the simplest form of frequent pattern mining.
Frequent itemset mining
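A brute-force sketch of frequent itemset mining is shown below: enumerate candidate itemsets by size and keep those whose support meets a minimum threshold. This is not an efficient algorithm and is not taken from the source; the transactions and the `frequent_itemsets` helper are illustrative assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: support} for all itemsets meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[candidate] = count / n
                found = True
        if not found:
            # If no itemset of this size is frequent, no larger one can be,
            # because every subset of a frequent itemset is also frequent.
            break
    return result

# Hypothetical transactional data set.
baskets = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
```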
_________ is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
Classification
The derived model is based on the analysis of a set of ___________.
training data
A ________ is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
decision tree
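The structure described above can be sketched as nested dictionaries: an internal node names the attribute it tests, each branch is keyed by a test outcome, and a leaf is a class label. The tree below is hand-written for illustration (it is not learned from training data), and its attributes and labels are assumptions.

```python
# Hypothetical hand-built decision tree. Internal nodes are dicts;
# leaves are plain class-label strings.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "buys", "excellent": "does_not_buy"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

label = classify(tree, {"age": "youth", "student": "yes"})
```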
A _______, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
neural network
_________ is a statistical methodology that is most often used for numeric prediction, although other methods exist as well.
Regression analysis
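One common form of regression analysis for numeric prediction is a straight line fitted by least squares. The sketch below is a minimal example of that one technique; the data points and the `fit_line` helper are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Numeric prediction for an unseen x value.
predicted = slope * 5 + intercept
```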
Attempts to identify attributes that do not contribute to the classification or prediction process.
relevance analysis
Analyzes data objects without consulting a known class label.
clustering
Clustering can also facilitate ___________, that is, the organization of observations into a hierarchy of classes that group similar events together.
taxonomy formation
A database may contain data objects called ________ that do not comply with the general behavior or model of the data.
outliers
The analysis of outlier data is referred to as _________.
outlier mining
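One simple statistical way to flag outliers is to mark values that lie far from the mean in units of standard deviation. This is only one possible approach and is not prescribed by the source; the data and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one value unlike the rest.
readings = [10, 11, 9, 10, 12, 10, 50]
outliers = find_outliers(readings)
```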
Data ____________ describes and models regularities or trends for objects whose behavior changes over time.
evolution analysis
A pattern is interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents __________.
knowledge
Data mining systems can be classified according to:
kinds of databases mined
kinds of knowledge mined
kinds of techniques utilized
applications adapted
Each user will have a _________ in mind, that is, some form of data analysis that he or she would like to have performed
data mining task
input to the data mining system.
data mining query
A data mining query is defined in terms of ____________.
data mining task primitives
designed as a teaching tool, based on the primitives.
DMQL (Data Mining Query Language)
an attribute may be specified as the ___________, whose values explicitly represent the classes.
class label attribute
Means that a DM system will not utilize any function of a DB or DW system.
No coupling
Means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
Loose coupling
Means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
Semitight coupling
Means that a DM system is smoothly integrated into the DB/DW system.
Tight coupling
These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
Mining methodology and user interaction issues
These include efficiency, scalability, and parallelization of data mining algorithms.
Performance issues
The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of __________.