-
text mining (text data mining or knowledge discovery in textual databases)
the semi-automated process of extracting patterns from large amounts of unstructured data sources - applying data mining to unstructured data
-
unstructured data
data that does not have a predetermined format and is stored in the form of textual documents - for humans to process and understand
-
structured data
data that has a predetermined format and is usually organized into records with simple data values and stored in databases - for computers to process
-
corpus
a large and structured set (body) of texts prepared for the purpose of knowledge discovery (mining)
-
terms
a word or phrase extracted by NLP (natural language processing)
-
concepts
features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology - compared to terms, concepts are the result of higher-level abstraction (i.e., generalizations about terms)
-
stemming - ex: model, models, modeling, etc.
reducing words to their root or base form
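A minimal sketch of suffix-stripping stemming, using the example words from this card. The names `SUFFIXES` and `naive_stem` are hypothetical, and the two-suffix list is an assumption made for illustration; real stemmers apply many more rules.

```python
# Hypothetical suffix list for illustration only; longer suffixes are tried first.
SUFFIXES = ("ing", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["model", "models", "modeling"]]
print(stems)  # all three words reduce to "model"
```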
-
stop words or noise words - ex: a, am, of, the, this, etc.
words that are filtered out
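A minimal sketch of stop-word filtering. The stop list holds only the examples given on this card, and `remove_stop_words` is a hypothetical helper name; practical stop lists are much longer.

```python
# Illustrative stop list built from the examples on this card.
STOP_WORDS = {"a", "am", "of", "the", "this"}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words("This is the body of a review".split())
print(filtered)  # ['is', 'body', 'review']
```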
-
synonyms - ex: movie, film, & motion picture
syntactically different words (i.e., spelled differently) with identical or at least similar meanings
-
polysemes or homonyms - ex: bow (to bend forward), bow (front of a ship), bow (weapon that shoots arrows)
syntactically identical words (i.e., spelled exactly the same) with different meanings
-
bag of words
simple method of text mining that treats a text as an unordered collection of words, ignoring their order, and uses those words to assign the text to a class (ex: sales, complaint, etc.)
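A minimal sketch of the "irrespective of their order" idea: two sentences with the same words in a different order produce the same bag. `bag_of_words` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated words, discarding their order."""
    return Counter(text.lower().split())

a = bag_of_words("the customer filed a complaint")
b = bag_of_words("a complaint the customer filed")
print(a == b)  # True: word order does not affect the bag
```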
-
include words - (required words)
a predetermined list of words that are chosen in advance to be indexed
-
tokenizing - ex: Auburn University vs. Auburn & University
splitting text into blocks (tokens) and assigning meaning to each block according to the function it performs
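A minimal sketch of the Auburn example: a tokenizer that keeps a known two-word phrase together as one token instead of splitting it. The `PHRASES` set and `tokenize` function are hypothetical; only the "Auburn University" phrase comes from this card.

```python
# Hypothetical list of two-word phrases that should stay together as one token.
PHRASES = {("auburn", "university")}

def tokenize(text):
    """Split on whitespace, merging any adjacent word pair listed in PHRASES."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if tuple(words[i:i + 2]) in PHRASES:
            tokens.append(" ".join(words[i:i + 2]))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

tokens = tokenize("Auburn University is in Auburn")
print(tokens)  # ['auburn university', 'is', 'in', 'auburn']
```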
-
term dictionary
a list of words used to narrow the focus of a search
-
word frequency
the number of times a word is found in a specific document
-
part-of-speech tagging
the process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, etc.) based on a word's definition and the context in which it is used
-
term-by-document matrix (occurrence matrix)
matrix with documents along one axis and terms along the other, where each cell holds the number of times that term occurs in that document
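A minimal sketch of building such a matrix from word counts. The three-document corpus is made up for illustration, and putting documents in rows is an assumption; the transposed layout is equally common.

```python
from collections import Counter

# Toy corpus; the document texts are invented for illustration.
corpus = {
    "doc1": "sales rose as sales staff grew",
    "doc2": "customer complaint about late delivery",
    "doc3": "complaint volume fell while sales rose",
}

# One word-frequency counter per document.
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# A sorted vocabulary gives a stable column order.
terms = sorted(set().union(*counts.values()))

# Rows are documents, columns are terms, cells are occurrence counts.
matrix = [[counts[name][term] for term in terms] for name in corpus]

for name, row in zip(corpus, matrix):
    print(name, row)
```

Most cells are zero even in this tiny example, which is why the matrix usually needs to be reduced before analysis.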
-
morphology
a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words (patterns of word-formation within a language or across languages)
-
singular-value decomposition (latent semantic indexing)
method used to reduce the term-by-document matrix to a manageable size by lowering its dimensionality - used to reduce the number of terms in a matrix
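In symbols, as a sketch in standard notation that this card does not itself give: A is the term-by-document matrix, U and V have orthonormal columns, and Sigma is diagonal with the singular values in descending order. Keeping only the k largest singular values gives the rank-k approximation A_k, and the rows of U_k Sigma_k then describe each row of A in only k dimensions.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} \quad \text{with } k \ll \operatorname{rank}(A)
```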
-
information extraction
topic tracking
summarization
categorization
clustering
concept linking
question answering
applications of text mining
-
natural language processing (NLP)
a subfield of artificial intelligence and computational linguistics that studies the problem of "understanding" natural human language
-
1. part-of-speech tagging - a word's part of speech depends on its context
2. text segmentation - some written languages lack single-word boundaries (ex: Chinese)
3. word sense disambiguation - words have more than one meaning
4. syntactic ambiguity - the grammar of a sentence is ambiguous
5. imperfect or irregular input - accents, etc.
6. speech acts - a sentence can be considered an action
challenges with NLP
-
sentiment analysis
a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (ex: customer feedback) - an important use of NLP
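A minimal sketch of lexicon-based scoring, one simple way to flag favorable versus unfavorable feedback. The word lists and the `sentiment_score` function are hypothetical; real systems use much larger lexicons or trained models.

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"poor", "hate", "broken", "slow"}

def sentiment_score(text: str) -> int:
    """Return positive-word count minus negative-word count.

    A score above zero suggests a favorable opinion, below zero an unfavorable one.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("Great phone, I love the fast camera!"))  # 3
print(sentiment_score("Poor battery and a broken charger."))    # -2
```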
-
information retrieval
information extraction
named-entity recognition
question answering
automatic summarization
natural language generation & understanding
machine translation
foreign language reading & writing
speech recognition
text proofing
optical character recognition
natural language processing (NLP) task categories
-
establish the corpus
create the term-document matrix
extract the knowledge
three step text mining process