Text & Web Mining A - Exam III

  1. text mining (text data mining or knowledge discovery in textual databases)
    the semi-automated process of extracting patterns from large amounts of unstructured data sources - applying data mining to unstructured data
  2. unstructured data
    data that does not have a predetermined format and is stored in the form of textual documents - for humans to process and understand
  3. structured data
    data that has a predetermined format and is usually organized into records with simple data values and stored in databases - for computers to process
  4. corpus
    a large and structured set (body) of texts prepared for the purpose of knowledge discovery (mining)
  5. terms
    a single word or multiword phrase extracted directly from the corpus by means of NLP (natural language processing)
  6. concepts
    features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology - compared to terms, concepts are the result of higher-level abstraction (i.e., generalizations about terms)
  7. stemming - ex: model, models, modeling, etc.
    reducing words to their root or base form
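A minimal sketch of stemming on the card's own example words. The suffix list and the `naive_stem` name are illustrative assumptions; this is a toy suffix stripper, not the Porter algorithm or any library's stemmer.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
SUFFIXES = ("ing", "s")  # assumed rules, tried in this order


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["model", "models", "modeling"]]
print(stems)  # all three words reduce to the same stem, "model"
```

Real stemmers use many more rules; the point here is only that inflected forms collapse to one base form.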
  8. stop words or noise words - ex: a, am, of, the, this, etc.
    words that are filtered out
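A short sketch of stop-word filtering, using only the example words from the card as the stop list. The `remove_stop_words` name and the whitespace-based splitting are assumptions made for illustration.

```python
# Stop list taken from the card's examples; real lists are much longer.
STOP_WORDS = {"a", "am", "of", "the", "this"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


kept = remove_stop_words("This is the bow of a ship")
print(kept)  # ['is', 'bow', 'ship']
```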
  9. synonyms - ex: movie, film, & motion picture
    syntactically different words (i.e., spelled differently) with identical or at least similar meanings - words that mean the same thing
  10. polysemes or homonyms - ex: bow (to bend forward), bow (front of a ship), bow (weapon that shoots arrows)
    syntactically identical words (i.e., spelled exactly the same) with different meanings
  11. bag of words
    simple method of text mining that groups words together irrespective of their order and arranges them by classification (ex: sales, complaint, etc.)
  12. include words - (required words)
    words that are preindexed
  13. tokenizing - ex: Auburn University vs. Auburn & University
    assigning meaning to a block of text (a token) by categorizing it according to the function it performs in the sentence
  14. term dictionary
    a list of words used to narrow a focus on a search
  15. word frequency
    the number of times a word is found in a specific document
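Word frequency can be sketched with the standard library's `collections.Counter`. The `word_frequency` name and the simple regular-expression word pattern are assumptions for illustration. Because a counter ignores word order, the same structure also illustrates the bag-of-words idea.

```python
import re
from collections import Counter


def word_frequency(document: str) -> Counter:
    """Count how many times each word occurs in one document."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(tokens)


freq = word_frequency("The film was long, but the film was good.")
print(freq["film"])  # 2

# Word order does not affect the counts (bag of words).
print(word_frequency("good film") == word_frequency("film good"))  # True
```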
  16. part-of-speech tagging
    the process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, etc.) based on a word's definition and the context in which it is used
  17. term-by-document matrix (occurrence matrix)
    a table relating terms and documents, in which each cell holds the number of times a term occurs in a document
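A minimal sketch of building such a matrix from raw counts. The layout chosen here (one row per document, one column per term) and the `term_document_matrix` name are assumptions; the transposed layout is equally valid.

```python
from collections import Counter


def term_document_matrix(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return the sorted vocabulary and a documents-by-terms count matrix."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[count[term] for term in terms] for count in counts]
    return terms, matrix


docs = ["movie film movie", "film review", "ship bow"]
terms, matrix = term_document_matrix(docs)
print(terms)   # ['bow', 'film', 'movie', 'review', 'ship']
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why the matrix usually needs to be reduced before mining.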
  18. morphology
    a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words (patterns of word-formation within a language or across languages)
  19. singular-value decomposition (latent semantic indexing)
    dimensionality reduction method used to transform the term-by-document matrix to a manageable size - used to reduce the number of terms (dimensions) in the matrix
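To give a feel for the reduction, the sketch below estimates only the largest singular value and its vectors by power iteration, in pure Python. The function name, the example matrix, and the iteration count are illustrative assumptions; in practice a full decomposition would come from a library routine such as `numpy.linalg.svd`.

```python
import math


def top_singular_triplet(A, iterations=200):
    """Estimate the largest singular value of A and its singular vectors
    by power iteration on A^T A (a rank-1 sketch, not a full SVD)."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # starting guess
    for _ in range(iterations):
        u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(A[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


# Rows are documents, columns are terms (assumed layout).
A = [[2, 2, 0],
     [2, 2, 0],
     [0, 0, 1]]
sigma, u, v = top_singular_triplet(A)
# Each document is now described by one number instead of three term counts.
reduced = [sigma * x for x in u]
print(round(sigma, 6), [round(x, 6) for x in reduced])
```

The first two documents use the same terms and land on the same reduced coordinate, while the third, which shares no terms with them, lands near zero.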
  20. information extraction
    topic tracking
    summarization
    categorization
    clustering
    concept linking
    question answering
    applications of text mining
  21. natural language processing (NLP)
    a subfield of artificial intelligence and computational linguistics that studies the problem of "understanding" natural human language
  22. 1. part-of-speech tagging - depends on context
    2. text segmentation - some languages lack single-word boundaries (ex: Chinese)
    3. word sense disambiguation - words have more than one meaning
    4. syntactic ambiguity - grammar is ambiguous
    5. imperfect or irregular input - accents, etc.
    6. speech acts - a sentence can be considered an action
    challenges with NLP
  23. sentiment analysis
    a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (ex: customer feedback) - an important use of NLP
  24. information retrieval
    information extraction
    named-entity recognition
    question answering
    automatic summarization
    natural language generation & understanding
    machine translation
    foreign language reading & writing
    speech recognition
    text proofing
    optical character recognition
    natural language processing (NLP) task categories
  25. establish the corpus
    create the term-document matrix
    extract the knowledge
    three-step text mining process
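The three steps can be sketched end to end as below. All function names and the sample documents are hypothetical, and "knowledge" is reduced here to the most frequent terms across the corpus, which is only one of many possible outputs.

```python
from collections import Counter


def establish_corpus(raw_docs):
    """Step 1: collect documents and normalize them into token lists."""
    return [doc.lower().split() for doc in raw_docs if doc.strip()]


def create_term_document_matrix(corpus):
    """Step 2: count each term in each document (rows are documents)."""
    terms = sorted({token for doc in corpus for token in doc})
    counts = [Counter(doc) for doc in corpus]
    matrix = [[count[term] for term in terms] for count in counts]
    return terms, matrix


def extract_knowledge(terms, matrix, top_n=2):
    """Step 3: here, simply rank terms by total frequency."""
    totals = [sum(column) for column in zip(*matrix)]
    ranked = sorted(zip(terms, totals), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


raw_docs = [
    "late delivery complaint",
    "complaint about billing",
    "billing complaint resolved",
]
corpus = establish_corpus(raw_docs)
terms, matrix = create_term_document_matrix(corpus)
top_terms = extract_knowledge(terms, matrix)
print(top_terms)  # [('complaint', 3), ('billing', 2)]
```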
Author
mjweston
ID
241878
Description
Big Data