Web mining is commonly used to refer to three activities. Name them and define them.
- MINING STRUCTURE: the process of extracting information from the topology of the web: which pages are the target of links from other pages, etc.
- MINING CONTENT: the process of extracting information from the text, images, and other forms of content that make up the pages. Example question: What site has the best deal on Xerox printers?
- MINING USAGE: the process of extracting information on how people who follow those links with their browsers actually use the pages. Example question: What pages do the visitors browse?
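As a rough illustration of structure mining, the sketch below counts how many distinct pages link to each page. The `links` dictionary and its page names are made-up sample data, and `count_inbound_links` is a hypothetical helper, not a standard function.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home.html": ["products.html", "about.html"],
    "products.html": ["printer.html", "home.html"],
    "about.html": ["home.html"],
    "printer.html": ["products.html", "home.html"],
}

def count_inbound_links(graph):
    """Return a Counter of how many distinct pages link to each target page."""
    inbound = Counter()
    for source, targets in graph.items():
        for target in set(targets):   # ignore duplicate links on one page
            if target != source:      # ignore self-links
                inbound[target] += 1
    return inbound

inbound = count_inbound_links(links)
```

Pages with many inbound links are the "targets of links from other pages" that structure mining looks for.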
What is the goal of Information Retrieval?
- Maximize recall and precision.
- RECALL: of all the pages that are on the topic, what percent are returned by the search?
- PRECISION: of the pages returned in your search, what percent are on the correct topic?
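The two definitions above can be sketched as a small calculation. The page IDs are invented sample data, and `precision_recall` is a hypothetical helper name; results are fractions, so multiply by 100 for a percentage.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query from two collections of page IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # pages that are both returned and on topic
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four pages are truly on topic, the search returns three.
relevant = {"p1", "p2", "p3", "p4"}
returned = {"p1", "p2", "p9"}
precision, recall = precision_recall(returned, relevant)
```

Here two of the three returned pages are on topic (precision 2/3), and two of the four on-topic pages were found (recall 1/2).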
What is the most common application of data mining?
What is product feature extraction?
Product feature extraction is the process of creating structured data from unstructured data (XML tags make this process much more reliable and efficient). This type of content mining is used for comparative shopping engines.
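A minimal sketch of why XML tags help: once listings are tagged, turning them into structured records is a few lines of parsing. The tag names (`product`, `site`, `name`, `price`), the shop names, and the prices are all illustrative assumptions, not a real feed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML-tagged listings; the tag names are illustrative only.
listing_xml = """
<products>
  <product><site>shop-a.example</site><name>Xerox printer</name><price>199.99</price></product>
  <product><site>shop-b.example</site><name>Xerox printer</name><price>184.50</price></product>
</products>
"""

def extract_products(xml_text):
    """Turn XML-tagged listings into a list of structured records."""
    root = ET.fromstring(xml_text)
    records = []
    for product in root.findall("product"):
        records.append({
            "site": product.findtext("site"),
            "name": product.findtext("name"),
            "price": float(product.findtext("price")),
        })
    return records

records = extract_products(listing_xml)
# A comparison engine can then answer "which site has the best deal?"
best_deal = min(records, key=lambda r: r["price"])
```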
For a marketer responsible for selling products or services on the web, what kind of mining is the most important?
- Usage mining, because our customers' usage patterns are essential. To study them we must collect the paths users take during their sessions.
- This is known as clickstream analysis, which is the analysis of web logs.
- Web logs are the lowest level of data available
- Different servers have different formats
- Web logs are a series of requests for pages received by the web server.
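The idea of collecting click paths from web logs can be sketched as follows. This assumes the Common Log Format; since different servers use different formats, the pattern would need adjusting elsewhere. The log lines are invented sample data, and grouping by client host is only a crude stand-in for a real session.

```python
import re
from collections import defaultdict

# Assumes Common Log Format; other servers may need a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Made-up sample requests, in the order the server received them.
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:55:40 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:02 -0700] "GET /printers.html HTTP/1.0" 200 1045',
    '10.0.0.1 - - [10/Oct/2000:13:57:10 -0700] "GET /checkout.html HTTP/1.0" 200 880',
]

def click_paths(lines):
    """Group requested pages by client host, in the order they appear in the log."""
    paths = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # skip lines that do not fit the assumed format
            paths[match.group("host")].append(match.group("path"))
    return dict(paths)

paths = click_paths(sample_log)
```

Each value in `paths` is one visitor's click path, for example index page, then printers, then checkout.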
What do you need to take into consideration before you start to examine the web logs?
- Spider generated logs must be removed.
- They are easy to spot as they hit every link on the web site visited.
- Users who have visited the site multiple times must be noted so that their behavior can be properly put into perspective versus a first-time visitor.
- If a login is not required by the web site, then identifying prior visitors will be imperfect.
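The two clean-up steps above might look roughly like this. The 0.9 coverage threshold, the visitor IDs, and both helper names are assumptions made for illustration; the sketch also assumes `all_pages` is non-empty and that some visitor ID (a login or a cookie) is available.

```python
def split_spiders(pages_by_visitor, all_pages, coverage_threshold=0.9):
    """Separate likely spiders (visitors that hit nearly every page) from people."""
    humans, spiders = {}, {}
    for visitor, pages in pages_by_visitor.items():
        # Fraction of the site's pages this visitor requested.
        coverage = len(set(pages) & all_pages) / len(all_pages)
        target = spiders if coverage >= coverage_threshold else humans
        target[visitor] = pages
    return humans, spiders

def label_visitors(session_visitor_ids):
    """Label each session's visitor as 'first' or 'repeat', in log order."""
    seen, labels = set(), []
    for visitor_id in session_visitor_ids:
        labels.append((visitor_id, "repeat" if visitor_id in seen else "first"))
        seen.add(visitor_id)
    return labels

# Made-up sample data.
all_pages = {"/", "/a", "/b", "/c"}
pages_by_visitor = {
    "crawler-1": ["/", "/a", "/b", "/c"],  # hits every page
    "alice": ["/", "/a"],
    "bob": ["/c"],
}
humans, spiders = split_spiders(pages_by_visitor, all_pages)
labels = label_visitors(["alice", "bob", "alice"])
```

Without a login, the visitor ID would have to come from something weaker such as a cookie or IP address, which is why the repeat-visitor labels are imperfect.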
How do you exclude or restrict a robot from crawling/indexing a page?
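These notes do not record an answer here. One widely used convention is a robots.txt file at the site root listing `User-agent` and `Disallow` rules, and, for indexing, a `<meta name="robots" content="noindex">` tag in the page itself. Well-behaved robots honor these; nothing forces them to. The sketch below uses Python's standard `urllib.robotparser` to check rules from a made-up robots.txt; the bot names and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everyone is kept out of /private/,
# and the robot calling itself "BadBot" is kept out of the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a given robot may fetch a given URL under these rules.
allowed_public = parser.can_fetch("AnyBot", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("AnyBot", "https://www.example.com/private/report.html")
allowed_badbot = parser.can_fetch("BadBot", "https://www.example.com/index.html")
```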