text mining (text data mining or knowledge discovery in textual databases)
the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data - in effect, data mining applied to unstructured text
unstructured data
data that does not have a predetermined format and is stored in the form of textual documents - for humans to process and understand
structured data
data that has a predetermined format and is usually organized into records with simple data values and stored in databases - for computers to process
corpus
a large and structured set (body) of texts prepared for the purpose of knowledge discovery (mining)
term
a single word or multiword phrase extracted from the corpus by NLP (natural language processing)
concept
a feature generated from a collection of documents by means of a manual, statistical, rule-based, or hybrid categorization methodology - compared to terms, concepts are the result of higher-level abstraction (i.e., generalizations about terms)
stemming - ex: model, models, modeling, etc.
reducing words to their root or base form
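A minimal sketch of stemming for the example words on this card. The helper `crude_stem` and its two-suffix list are assumptions made for illustration; a real stemmer (for example, the Porter algorithm) applies many more rules.

```python
def crude_stem(word):
    """Strip a few common English suffixes (hypothetical helper, not Porter)."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip when at least three letters would remain as the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: crude_stem(w) for w in ["model", "models", "modeling"]}
print(stems)  # all three words map to the stem "model"
```

A rule this simple will mis-stem many words (ex: "sing" stays "sing" only because of the length check), so it only shows the idea of collapsing word variants onto one form.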
stop words or noise words - ex: a, am, of, the, this, etc.
words that are filtered out before or after processing natural language text because they carry little distinguishing meaning
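A minimal sketch of stop-word filtering, assuming the stop list is just the example words from this card; real stop lists are much longer and depend on the application.

```python
# Stop list taken from the card's examples (an assumption; real lists are longer).
STOP_WORDS = {"a", "am", "of", "the", "this"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


filtered = remove_stop_words("This is the title of a book")
print(filtered)  # ['is', 'title', 'book']
```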
synonyms - ex: movie, film, & motion picture
syntactically different words (i.e., spelled differently) with identical or at least similar meanings
polysemes or homonyms - ex: bow (to bend forward), bow (front of a ship), bow (weapon that shoots arrows)
syntactically identical words (i.e., spelled exactly the same) with different meanings
bag of words
a simple text mining method that treats a document as an unordered collection of words, ignoring grammar and word order, and uses those words to classify the document (ex: sales, complaint, etc.)
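A minimal sketch of the "irrespective of order" idea, assuming whitespace-separated, lowercased words. The helper name `bag_of_words` and the sample sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word, discarding the order in which the words appeared."""
    return Counter(text.lower().split())


a = bag_of_words("the customer filed a complaint")
b = bag_of_words("a complaint the customer filed")
print(a == b)  # True: same words, different order, identical bags
```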
include words - (required words)
a predetermined set of words under which documents are to be indexed
tokenizing - ex: Auburn University vs. Auburn & University
splitting text into blocks (tokens) and assigning each block a meaning according to the function it performs
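A minimal sketch of the Auburn University example, assuming a hand-supplied list of multiword phrases (`PHRASES` is hypothetical). It only covers the splitting part of tokenizing; it does not assign categories to the tokens.

```python
import re

# Hypothetical list of multiword terms that should stay together as one token.
PHRASES = ["Auburn University"]


def tokenize(text, phrases=()):
    """Match listed phrases first, then fall back to single words."""
    alternatives = [re.escape(p) for p in phrases] + [r"\w+"]
    return re.findall("|".join(alternatives), text)


text = "She attends Auburn University"
print(tokenize(text))           # ['She', 'attends', 'Auburn', 'University']
print(tokenize(text, PHRASES))  # ['She', 'attends', 'Auburn University']
```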
term dictionary
a list of words used to narrow the focus of a search
word frequency
the number of times a word is found in a specific document
part-of-speech tagging
the process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, etc.) based on a word's definition and the context in which it is used
term-by-document matrix (occurrence matrix)
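a matrix relating documents and terms (typically documents as rows and terms as columns) in which each cell holds the number of times the term occurs in the document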
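A minimal sketch of building an occurrence matrix from raw counts, assuming rows are documents and columns are terms (some sources transpose this). The two toy documents are invented for illustration.

```python
from collections import Counter

# Toy documents, invented for illustration.
docs = {
    "doc1": "the model fits the data",
    "doc2": "the data is noisy",
}

# Count words per document, then fix a shared, sorted column order.
counts = {name: Counter(text.split()) for name, text in docs.items()}
terms = sorted({term for c in counts.values() for term in c})
matrix = [[counts[name][term] for term in terms] for name in docs]

print(terms)  # ['data', 'fits', 'is', 'model', 'noisy', 'the']
for name, row in zip(docs, matrix):
    print(name, row)
# doc1 [1, 1, 0, 1, 0, 2]
# doc2 [1, 0, 1, 0, 1, 1]
```

Most cells are zero even in this tiny case; with a real corpus the matrix becomes very large and sparse, which is why size reduction is usually the next step.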
morphology
a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words (patterns of word formation within a language or across languages)
singular-value decomposition (latent semantic indexing)
a dimensionality reduction method used to transform the term-by-document matrix to a manageable size - it reduces the number of term dimensions by keeping only the largest singular values
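A minimal sketch of the size reduction, assuming NumPy is available and that rows are documents and columns are terms. The 4 x 4 count matrix and the choice `k = 2` are made up for illustration.

```python
import numpy as np

# Toy occurrence matrix: rows = documents, columns = terms (invented counts).
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

# Full decomposition: A = U * diag(s) * Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                              # number of dimensions to keep (assumption)
docs_reduced = U[:, :k] * s[:k]    # each document described by k numbers
A_approx = docs_reduced @ Vt[:k]   # best rank-k approximation of A

print(docs_reduced.shape)              # (4, 2)
print(np.linalg.norm(A - A_approx))    # error left after dropping small values
```

Each document is now represented by 2 values instead of 4 term counts; on a real corpus the same step takes thousands of term columns down to a few hundred dimensions.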
information extraction
topic tracking
concept linking
question answering
applications of text mining
natural language processing (NLP)
a subfield of artificial intelligence and computational linguistics that studies the problem of "understanding" natural human language
1. part-of-speech tagging - a word's part of speech depends on its context
2. text segmentation - identifying single-word boundaries, which some written languages do not mark (ex: Chinese)
3. word sense disambiguation - many words have more than one meaning
4. syntactic ambiguity - the grammar allows more than one reading of a sentence
5. imperfect or irregular input - accents, etc.
6. speech acts - a sentence can be considered an action
challenges with NLP
sentiment analysis
a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (ex: customer feedback) - an important use of NLP
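A minimal sketch of detecting favorable and unfavorable opinions with word lists. The two lexicons are hypothetical and far too small for real use; practical systems rely on large lexicons or trained classifiers.

```python
import re

# Hypothetical opinion lexicons, invented for illustration.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "broken", "terrible"}


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


print(sentiment("I love this excellent phone!"))  # favorable
print(sentiment("The screen arrived broken."))    # unfavorable
```

Word counting of this kind misses negation ("not great") and sarcasm, which is one reason sentiment analysis leans on the broader NLP tasks.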
information retrieval
information extraction
named-entity recognition
question answering
automatic summarization
natural language generation & understanding
machine translation
foreign language reading & writing
speech recognition
text proofing
optical character recognition
natural language processing (NLP) task categories
establish the corpus
create the term-document matrix
extract the knowledge
three-step text mining process
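The three steps can be sketched end to end on a toy corpus. The documents are invented, and using cosine similarity to find the two most similar documents is only one assumed form of "extracting the knowledge"; classification, clustering, and association are other common choices.

```python
from collections import Counter
from math import sqrt

# Step 1: establish the corpus (toy documents, invented for illustration).
corpus = [
    "refund request for damaged item",
    "damaged item refund",
    "password reset request",
]

# Step 2: create the term-document matrix (rows = documents, columns = terms).
counts = [Counter(doc.split()) for doc in corpus]
terms = sorted(set().union(*counts))
tdm = [[c[t] for t in terms] for c in counts]


# Step 3: extract the knowledge; here, find the two most similar documents.
def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


pairs = [(i, j) for i in range(len(tdm)) for j in range(i + 1, len(tdm))]
best = max(pairs, key=lambda p: cosine(tdm[p[0]], tdm[p[1]]))
print(best)  # (0, 1): the two refund documents
```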