Ch01 - Data Mining Introduction

  1. __________ can be viewed as a result of the natural evolution of information technology.
    Data mining
  2. The database system industry has witnessed an evolutionary path in the development of the following functionalities:
    • data collection and database creation
    • data management
    • advanced data analysis
  3. With numerous database systems offering query and transaction processing as common practice, _________________ has naturally become the next target.
    advanced data analysis
  4. Efficient methods for __________, where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.
    on-line transaction processing (OLTP)
  5. One data repository architecture that has emerged is the __________, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making.
    data warehouse
  6. Data warehouse technology includes:
    • data cleaning
    • data integration
    • on-line analytical processing (OLAP)
  7. Analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.
    on-line analytical processing (OLAP)
  8. Data flow in and out like streams, as in applications like video surveillance, telecommunication, and sensor networks.
    data streams
  9. The abundance of data, coupled with the need for powerful data analysis tools, has been described as a _______________ situation.
    data rich but information poor
  10. As a result, data collected in large data repositories become “__________”—data archives that are seldom visited.
    data tombs
  11. Consider expert system technologies, which typically rely on users or domain experts to _________ input knowledge into knowledge bases.
    manually
  12. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into “__________” of knowledge.
    golden nuggets
  13. __________ refers to extracting or “mining” knowledge from large amounts of data.
    data mining
  14. Data mining should have been more appropriately named “_____________.”
    knowledge mining from data
  15. “____________,” a shorter term, may not reflect the emphasis on mining from large amounts of data.
    Knowledge mining
  16. Terms that carry a similar or slightly different meaning to data mining:
    • knowledge mining from data,
    • knowledge extraction,
    • data/pattern analysis,
    • data archaeology,
    • data dredging
  17. Many people treat data mining as a synonym for another popularly used term, ____________.
    Knowledge Discovery from Data, or KDD
  18. Knowledge discovery process:
    • Data cleaning 
    • Data integration 
    • Data selection
    • Data transformation
    • Data mining
    • Pattern evaluation
    • Knowledge presentation
  19. to remove noise and inconsistent data
    Data cleaning
  20. where multiple data sources may be combined
    Data integration
  21. where data relevant to the analysis task are retrieved from the database
    Data selection
  22. where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance
    Data transformation
  23. an essential process where intelligent methods are applied in order to extract data patterns
    Data mining
  24. to identify the truly interesting patterns representing knowledge based on some interestingness measures
    Pattern evaluation
  25. where visualization and knowledge representation techniques are used to present the mined knowledge to the user
    Knowledge presentation
  26. Steps 1 to 4 are different forms of ___________, where the data are prepared for mining
    data preprocessing
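
A minimal sketch of the four preprocessing steps named in the cards above (cleaning, integration, selection, transformation). The records, field names, and the `min_amount` threshold are hypothetical and chosen only for illustration; real preprocessing is usually far more involved.

```python
# Hypothetical toy records from two sources (illustrative data only).
store_a = [{"customer": "ann", "amount": 120.0}, {"customer": "bob", "amount": None}]
store_b = [{"customer": "ann", "amount": 80.0}, {"customer": "cy", "amount": 45.0}]


def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]


def transform(records):
    """Data transformation: aggregate the total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


prepared = transform(
    select(integrate(clean(store_a), clean(store_b)), min_amount=50.0)
)
print(prepared)
```

The output of `transform` is the form that a mining step would then consume.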
  27. (T/F) Data mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for evaluation.
    True
  28. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
    Database, data warehouse, World Wide Web, or other information repository
  29. It is responsible for fetching the relevant data, based on the user’s data mining request.
    Database or data warehouse server
  30. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
    Knowledge base
  31. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
    Data mining engine
  32. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
    Pattern evaluation module
  33. Used to organize attributes or attribute values into different levels of abstraction.
    concept hierarchies
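
One way to picture a concept hierarchy is as a chain of lookups from a low-level value to more general ones. The sketch below assumes a hypothetical location hierarchy (city, then state or province, then country); the mappings and the `generalize` function are illustrative names, not part of any particular system.

```python
# Hypothetical location hierarchy: city -> state/province -> country.
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {"British Columbia": "Canada", "Illinois": "USA"}


def generalize(city, level):
    """Return the value of `city` at the requested abstraction level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")


print(generalize("Victoria", "state"))
print(generalize("Victoria", "country"))
```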
  34. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
    User interface
  35. For an algorithm to be _________, its running time should grow approximately linearly in proportion to the size of the data, given the available system resources such as main memory and disk space.
    scalable
  36. A ________ consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
    database system, also called a database management system (DBMS)
  37. A _________ is a collection of tables, each of which is assigned a unique name.
    relational database
  38. Each table consists of a set of _________ (columns or fields) and usually stores a large set of _______ (records or rows).
    attributes, tuples
  39. A semantic data model, such as an __________ data model, is often constructed for relational databases.
    entity-relationship (ER)
  40. Relational data can be accessed by __________ written in a relational query language, such as SQL, or with the assistance of graphical user interfaces.
    database queries
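
To make the relational cards above concrete, here is a small sketch using Python's standard `sqlite3` module. The `customer` table, its attributes, and its tuples are made up for illustration.

```python
import sqlite3

# In-memory database holding one hypothetical table named "customer".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Each inserted row is a tuple (record); each column is an attribute (field).
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann", "Vancouver"), (2, "Bob", "Chicago"), (3, "Cy", "Vancouver")],
)

# A database query written in SQL.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Vancouver",)
).fetchall()
names = [name for (name,) in rows]
print(names)
conn.close()
```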
  41. A _________ is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site.
    data warehouse
  42. A data warehouse is usually modeled by a multidimensional database structure, where each ________ corresponds to an attribute or a set of attributes in the schema, and each ______ stores the value of some aggregate measure.
    dimension, cell
  43. A _________ provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.
    data cube
  44. Allow the user to view the data at differing degrees of summarization
    drill-down and roll-up
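
A rough sketch of roll-up and drill-down on a tiny two-dimensional cube. It assumes cells keyed by hypothetical (city, quarter) dimension values, with sales as the aggregate measure; the data and function names are illustrative only.

```python
from collections import defaultdict

# Each cell stores an aggregate measure (sales) for a (city, quarter) pair.
cells = {
    ("Vancouver", "Q1"): 100,
    ("Toronto", "Q1"): 70,
    ("Chicago", "Q1"): 200,
    ("Vancouver", "Q2"): 150,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def roll_up_to_country(cells):
    """Roll-up: summarize city-level cells into country-level cells."""
    summary = defaultdict(int)
    for (city, quarter), sales in cells.items():
        summary[(CITY_TO_COUNTRY[city], quarter)] += sales
    return dict(summary)


def drill_down(cells, country, quarter):
    """Drill-down: show the city-level cells behind one country-level cell."""
    return {
        city: sales
        for (city, q), sales in cells.items()
        if q == quarter and CITY_TO_COUNTRY[city] == country
    }


by_country = roll_up_to_country(cells)
detail = drill_down(cells, "Canada", "Q1")
print(by_country)
print(detail)
```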
  45. A _________ consists of a file where each record represents a transaction.
    transactional database
  46. _____________ are constructed based on an object-relational data model.
    Object-relational databases
  47. Extends the relational model by providing a rich data type for handling complex objects and object orientation.
    Object-relational databases
  48. The object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an _______.
    object
  49. Each object has a set of:
    • variables
    • messages
    • methods
  50. A __________ typically stores relational data that include time-related attributes.
    temporal database
  51. A __________ stores sequences of ordered events, with or without a concrete notion of time.
    sequence database
  52. A __________ stores sequences of values or events obtained over repeated measurements of time
    time-series database
  53. ____________ contain spatial-related information.
    Spatial databases
  54. Spatial data may be represented in __________, consisting of n-dimensional bit maps or pixel maps.
    raster format
  55. Maps can be represented in ___________, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs.
    vector format
  56. A spatial database that stores spatial objects that change with time is called a _____________, from which interesting information can be mined.
    spatiotemporal database
  57. _________ are databases that contain word descriptions for objects.
    Text databases
  58. ____________ store image, audio, and video data.
    Multimedia databases
  59. Because video and audio data require real-time retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system buffer overflows, such data are referred to as ___________ data.
    continuous-media
  60. A __________ consists of a set of interconnected, autonomous component databases.
    heterogeneous database
  61. A __________ is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.
    legacy database
  62. Data flow in and out of an observation platform dynamically.
    stream data
  63. Capturing user access patterns in such distributed information environments is called ____________.
    Web usage mining (or Weblog mining)
  64. ___________ based on linkages among Web pages can help rank Web pages based on their importance, influence, and topics.
    authoritative Web page analysis
  65. ____________ help group and arrange Web pages in a multidimensional manner based on their contents.
    Automated Web page clustering and classification
  66. ___________ helps identify hidden Web social networks and communities and observe their evolution.
    Web community analysis
  67. _______ characterize the general properties of the data in the database.
    Descriptive mining tasks
  68. ____________ perform inference on the current data in order to make predictions.
    Predictive mining tasks
  69. Descriptions of a class or a concept are called __________.
    class/concept descriptions
  70. _________ is a summarization of the general characteristics or features of a target class of data.
    Data characterization
  71. The resulting descriptions can also be presented as _________ or in rule form (called characteristic rules)
    generalized relations
  72. _________ is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
    Data discrimination
  73. Discrimination descriptions expressed in rule form are referred to as _________.
    discriminant rules
  74. _________ are patterns that occur frequently in data.
    Frequent patterns
  75. A __________ typically refers to a set of items that frequently appear together in a transactional data set.
    frequent itemset
  76. A frequently occurring subsequence is a ________.
    (frequent) sequential pattern
  77. If a substructure occurs frequently, it is called a _____________.
    (frequent) structured pattern
  78. Association rules that contain a single predicate are referred to as ______________.
    single-dimensional association rules
  79. A 1% _______ means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
    support
  80. A ________, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
    confidence
  81. Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum _______ threshold and a minimum ________ threshold.
    support, confidence
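
Support and confidence for a rule such as computer ⇒ software can be computed directly from transactions. The sketch below uses four made-up transactions and hypothetical thresholds; the function names are illustrative.

```python
# Four hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, fraction also having the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)

# A rule is kept only if it meets both minimum thresholds (values assumed here).
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6
is_interesting = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(s, c, is_interesting)
```

Here two of the four transactions contain both items, and two of the three computer purchases also include software.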
  82. Additional analysis can be performed to uncover interesting statistical ________ between associated attribute-value pairs.
    correlations
  83. __________ is the simplest form of frequent pattern mining.
    Frequent itemset mining
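
A brute-force sketch of frequent itemset mining: enumerate every candidate itemset and keep those that meet a minimum support. This is deliberately naive (it is not Apriori or any other optimized algorithm) and is practical only for tiny examples; the baskets and threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= min_support:
                result[candidate] = count / n
    return result


baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]
found = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in found.items():
    print(itemset, sup)
```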
  84. _________ is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
    Classification
  85. The derived model is based on the analysis of a set of ___________.
    training data
  86. A ________ is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
    decision tree
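
The structure described in the card above can be sketched as nested dictionaries: internal nodes test an attribute, branches are test outcomes, and leaves are class labels. The tree here is hand-written with hypothetical attributes and labels; it is not learned from training data.

```python
# Hypothetical hand-built tree: internal nodes are dicts, leaves are class labels.
TREE = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "buys", "fair": "does_not_buy"},
        },
    },
}


def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


print(classify(TREE, {"age": "youth", "student": "yes"}))
print(classify(TREE, {"age": "senior", "credit_rating": "fair"}))
```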
  87. A _______, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
    neural network
  88. _________ is a statistical methodology that is most often used for numeric prediction, although other methods exist as well.
    Regression analysis
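
As one simple instance of regression used for numeric prediction, the sketch below fits a straight line by ordinary least squares. The data points are invented and lie exactly on y = 2x + 1, so the fit is easy to check by hand; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Numeric prediction for an unseen x value.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```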
  89. Attempts to identify attributes that do not contribute to the classification or prediction process.
    relevance analysis
  90. Analyzes data objects without consulting a known class label.
    clustering
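
Clustering groups objects without any class labels. As a minimal illustration, the sketch below runs a one-dimensional k-means loop; k-means is just one of many clustering methods and is chosen here only for brevity. The values and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-centering."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No labels are supplied; the two groups emerge from the distances alone.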
  91. Clustering can also facilitate ___________, that is, the organization of observations into a hierarchy of classes that group similar events together.
    taxonomy formation
  92. A database may contain data objects, called ________, that do not comply with the general behavior or model of the data.
    outliers
  93. The analysis of outlier data is referred to as _________.
    outlier mining
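
One simple statistical approach to outlier mining flags values that lie far from the mean in units of standard deviation (a z-score test). This is only one of several possible techniques; the data and the threshold of 2.0 are assumptions made for this sketch.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` std devs."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical measurements with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers = z_score_outliers(readings)
print(outliers)
```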
  94. Data ____________ describes and models regularities or trends for objects whose behavior changes over time.
    evolution analysis
  95. A pattern is interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents __________.
    knowledge
  96. Data mining systems can be classified according to:
    • kinds of databases mined
    • kinds of knowledge mined
    • kinds of techniques utilized
    • applications adapted
  97. Each user will have a _________ in mind, that is, some form of data analysis that he or she would like to have performed
    data mining task
  98. A __________ is input to the data mining system.
    data mining query
  99. A data mining query is defined in terms of ____________.
    data mining task primitives
  100. A data mining query language designed as a teaching tool, based on the primitives.
    DMQL (Data Mining Query Language)
  101. An attribute may be specified as the ___________, whose values explicitly represent the classes.
    class label attribute
  102. Means that a DM system will not utilize any function of a DB or DW system.
    No coupling
  103. Means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
    Loose coupling
  104. Means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
    Semitight coupling
  105. Means that a DM system is smoothly integrated into the DB/DW system.
    Tight coupling
  106. These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.
    Mining methodology and user interaction issues
  107. These include efficiency, scalability, and parallelization of data mining algorithms.
    Performance issues
  108. The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of __________.
    parallel and distributed data mining algorithms
Author
FelipeJung
ID
327467
Card Set
Ch01 - Data Mining Introduction
Description
2nd Semester
Updated