Text & Web Mining A - Exam III

text mining (text data mining or knowledge discovery in textual databases)

the semi-automated process of extracting patterns from large amounts of unstructured data sources - applying data mining to unstructured data
unstructured data

data that does not have a predetermined format and is stored in the form of textual documents - for humans to process and understand
structured data

data that has a predetermined format and is usually organized into records with simple data values and stored in databases - for computers to process
corpus

a large and structured set (body) of texts prepared for the purpose of knowledge discovery (mining)
terms

a word or phrase extracted by NLP (natural language processing)
concepts

features generated from a collection of documents by means of a manual, statistical, rule-based, or hybrid categorization methodology - compared to terms, concepts are the result of higher-level abstraction (i.e., generalizations about terms)
stemming - ex: model, models, modeling, etc.

reducing words to their root or base form
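A minimal sketch of the idea, using the card's own example words. The suffix list and the `naive_stem` helper are hypothetical and deliberately tiny; production stemmers (the Porter stemmer, for instance) apply far more rules.

```python
# Naive suffix-stripping stemmer (illustrative only; not the Porter algorithm).
SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny suffix list


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["model", "models", "modeling"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['model', 'model', 'model']
```

All three forms collapse to the same base form, so they would be counted as one term.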
stop words or noise words - ex: a, am, of, the, this, etc.

words that are filtered out before or after processing natural language text because they carry little meaning on their own
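As a rough illustration, stop-word filtering can be sketched as a set lookup. The `STOP_WORDS` set below contains only the card's example words, and `remove_stop_words` is a hypothetical helper name; real stop lists are much longer.

```python
# Stop list taken from the card's examples; real lists are much longer.
STOP_WORDS = {"a", "am", "of", "the", "this"}


def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "This is the summary of a report".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['is', 'summary', 'report']
```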
synonyms - ex: movie, film, & motion picture

syntactically different words (i.e., spelled differently) with identical or at least similar meanings
polysemes or homonyms - ex: bow (to bend forward), bow (front of a ship), bow (weapon that shoots arrows)

syntactically identical words (i.e., spelled exactly the same) with different meanings
bag of words

simple method of text mining that groups words together irrespective of their order and arranges them by classification (ex: sales, complaint, etc.)
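A small sketch of the "irrespective of their order" part, assuming whitespace-separated words and a hypothetical `bag_of_words` helper. It does not show the classification step (sales, complaint, etc.), only that word order is discarded.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word, ignoring the order in which words appear."""
    return Counter(text.lower().split())


a = bag_of_words("the customer filed a complaint")
b = bag_of_words("a complaint the customer filed")
same_bag = a == b
print(same_bag)  # True: same words, different order, identical bag
```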
include words - (required words)

words that are preindexed
tokenizing - ex: Auburn University vs. Auburn & University

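breaking text into tokens (blocks of text such as words or phrases) and assigning meaning to each block according to the function it performs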
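The card's example (one token "Auburn University" versus two tokens "Auburn" and "University") can be sketched with a regular expression. The `PHRASES` list and the `tokenize` helper are assumptions made for illustration, not part of the card set.

```python
import re

PHRASES = ["Auburn University"]  # hypothetical list of multi-word terms


def tokenize(text, phrases=PHRASES):
    """Match listed phrases first, then fall back to single words."""
    alternatives = [re.escape(p) for p in phrases] + [r"\w+"]
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(text)


with_phrase = tokenize("She attends Auburn University")
without_phrase = tokenize("She attends Auburn University", phrases=[])
print(with_phrase)     # ['She', 'attends', 'Auburn University']
print(without_phrase)  # ['She', 'attends', 'Auburn', 'University']
```

Whether the phrase is kept as one token changes what later counts as a term.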
term dictionary

a list of words specific to a narrow field, used to narrow the focus of a search
word frequency

the number of times a word is found in a specific document
part-of-speech tagging

the process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, etc.) based on a word's definition and the context in which it is used
term-by-document matrix (occurrence matrix)

a matrix of documents (rows) and terms (columns) in which each cell holds the number of times the term occurs in the document
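A minimal sketch of building such a matrix, assuming rows are documents, columns are alphabetically sorted terms, and words are split on whitespace. The two sample documents and the `term_document_matrix` helper are made up for illustration.

```python
from collections import Counter

docs = [
    "text mining finds patterns",
    "data mining finds patterns in data",
]


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(d.lower().split()) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


terms, matrix = term_document_matrix(docs)
print(terms)   # ['data', 'finds', 'in', 'mining', 'patterns', 'text']
print(matrix)  # [[0, 1, 0, 1, 1, 1], [2, 1, 1, 1, 1, 0]]
```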
morphology

a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words (patterns of word-formation within a language or across languages)
singular-value decomposition (latent semantic indexing)

method used to transform the term-by-document matrix to a manageable size - used to reduce the number of terms in a matrix
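To give a feel for the reduction, the sketch below finds only the largest singular value and its vectors by power iteration, in pure Python. This is a rank-1 stand-in for a full singular-value decomposition, which would normally come from a numerical library. The 3x3 matrix and the `top_singular_triplet` helper are hypothetical.

```python
import math


def top_singular_triplet(A, iterations=100):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]                 # w = A v
        z = [sum(A[i][j] * w[i] for i in range(n_rows)) for j in range(n_cols)]  # z = A^T w
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v


# Rows are documents, columns are terms (made-up counts).
A = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
sigma, u, v = top_singular_triplet(A)
coords = [sigma * x for x in u]  # each document reduced to one number
print(round(sigma, 6))
print([round(c, 6) for c in coords])
```

The first two documents use the same terms and land on the same coordinate, so three term columns are summarized by a single latent dimension.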
information extraction
topic tracking
summarization
categorization
clustering
concept linking
question answering

applications of text mining
natural language processing (NLP)

a subfield of artificial intelligence and computational linguistics that studies the problem of "understanding" natural human language
1. part-of-speech tagging - in context
2. text segmentation - single word boundaries (ex: Chinese)
3. word sense disambiguation - words have more than one meaning
4. syntactic ambiguity - grammar is ambiguous
5. imperfect or irregular input - accents, etc.
6. speech acts - a sentence can be considered an action

challenges with NLP
sentiment analysis

a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (ex: customer feedback) - an important use of NLP
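One very simple way to approximate this is a word-list (lexicon) count. The sketch below assumes two tiny hypothetical word lists and a hypothetical `sentiment` helper; real systems use much larger lexicons or trained models and handle negation, which this does not.

```python
import re

# Hypothetical, deliberately tiny opinion lexicons.
POSITIVE = {"love", "excellent", "great"}
NEGATIVE = {"broken", "poor", "hate"}


def sentiment(text):
    """Label text by counting positive words minus negative words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


print(sentiment("I love this excellent phone!"))  # favorable
print(sentiment("The screen arrived broken."))    # unfavorable
```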
information retrieval
information extraction
named-entity recognition
question answering
automatic summarization
natural language generation & understanding
machine translation
foreign language reading & writing
speech recognition
text proofing
optical character recognition

natural language processing (NLP) task categories
establish the corpus
create the term-document matrix
extract the knowledge

three-step text mining process
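The three steps can be sketched end to end on a toy example. The three-document corpus is made up, and "extract the knowledge" is reduced here to finding the term that appears in the most documents; in practice that step would be classification, clustering, association, or a similar method.

```python
from collections import Counter

# Step 1: establish the corpus (a hypothetical three-document collection).
corpus = [
    "late delivery complaint",
    "billing complaint",
    "sales inquiry",
]

# Step 2: create the term-document matrix (rows = documents, columns = terms).
counts = [Counter(doc.split()) for doc in corpus]
terms = sorted(set().union(*counts))
matrix = [[c[t] for t in terms] for c in counts]

# Step 3: extract knowledge; here, simply the term found in the most documents.
doc_frequency = {
    t: sum(1 for row in matrix if row[j] > 0) for j, t in enumerate(terms)
}
top_term = max(doc_frequency, key=doc_frequency.get)
print(top_term)  # complaint
```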

Author

mjweston

241878

Card Set

Text & Web Mining A - Exam III

Description

Big Data

Updated

2013-10-22T17:49:44Z