Improvisation in Web Mining Techniques by Scrubbing Log Files

Rachit Goel
Sandeep Jain


The World Wide Web is one of the most interactive and popular media for disseminating information. Its growing popularity and size have produced an immense amount of widely dispersed, interconnected, and dynamic information. Web pages typically contain a large amount of information that is not part of their main content, e.g. banner ads, navigation bars, and copyright notices. Such noise usually leads to poor results in Web mining, which depends mainly on web page content. It is therefore essential to extract information from this bulk of data and structure it into useful knowledge, a need that gave rise to data mining. Web usage mining is the subfield of Web mining that deals with the discovery and analysis of usage patterns from web data, specifically web logs, in order to improve web-based applications. The Web usage mining process consists of three phases: Data Preprocessing, Pattern Discovery, and Pattern Analysis. This paper proposes an improved algorithm that takes a web log file and outputs a clean file containing only the data relevant from the Web usage mining perspective.

Keywords- Data Mining, Web Mining, Web Usage Mining, Data Preprocessing.
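
To make the idea of scrubbing a log file concrete, the following is a minimal sketch of a log-cleaning step in the Data Preprocessing phase. It is not the algorithm proposed in this paper. It assumes the log is in Common Log Format, and the filtering rules (keep only successful GET requests, drop image, style sheet, and script requests) are illustrative assumptions, as are the names LOG_LINE, IRRELEVANT_SUFFIXES, is_relevant, and clean_log.

```python
import re

# Assumed input: Common Log Format, e.g.
# host ident user [time] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical list of resource types treated as noise for usage mining.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_relevant(line):
    """Return True if a log line looks like a successful page request."""
    match = LOG_LINE.match(line)
    if match is None:
        return False  # malformed entry
    if match.group("method") != "GET":
        return False  # assumption: only GET requests reflect page views
    if not match.group("status").startswith("2"):
        return False  # assumption: keep only 2xx (successful) responses
    # Ignore any query string when checking the file extension.
    path = match.group("path").split("?", 1)[0].lower()
    return not path.endswith(IRRELEVANT_SUFFIXES)


def clean_log(lines):
    """Keep only the entries judged relevant, preserving their order."""
    return [line for line in lines if is_relevant(line)]


sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '10.0.0.3 - - [10/Oct/2000:13:57:10 -0700] "POST /form HTTP/1.0" 200 10',
    'not a log line',
]
cleaned = clean_log(sample)
```

In this sketch only the first sample entry survives cleaning; the rules can be tightened or relaxed depending on which requests count as relevant for a given site.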


