Search Engine Project

An offline search engine that I built in Java for my Information Retrieval & Web Search course. It can parse, index, rank, and efficiently search thousands of HTML web pages extracted from XML files in a few seconds.

Parsing

The XML files were parsed using DocumentBuilder to extract the <HTML> tags. The extracted HTML, including its <body>, was then parsed using jsoup.
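The XML step could look roughly like the sketch below. It is not the project's actual code. It assumes each page is stored as text (here CDATA) inside an <HTML> element. The class name XmlPageExtractor, the method extractHtml, and the <DOCS>/<DOC> wrapper elements are hypothetical. Only the standard library is used, so the jsoup step is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls the raw HTML stored in each <HTML> element of an XML string.
class XmlPageExtractor {

    static List<String> extractHtml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Collect the text content of every <HTML> element, in document order.
            NodeList nodes = doc.getElementsByTagName("HTML");
            List<String> pages = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                pages.add(nodes.item(i).getTextContent());
            }
            return pages;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get a single unchecked failure type.
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    public static void main(String[] args) {
        // Assumed layout: one <DOC> per page, HTML wrapped in CDATA.
        String xml = "<DOCS><DOC><HTML><![CDATA[<html><body>hello</body></html>]]></HTML></DOC></DOCS>";
        System.out.println(extractHtml(xml));
    }
}
```

Each returned string would then be handed to jsoup to get the body text.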

Indexing

The dictionary and posting lists are stored in the following data structure: Map<String, List> index
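A minimal sketch of how such a map could be filled is shown below. It assumes the posting list holds integer document IDs and that documents are added in increasing ID order. The class InvertedIndex, its method names, and the simple tokenizer (lowercasing and splitting on non-alphanumeric characters) are illustrative and may differ from the project's own choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index: term -> sorted list of document IDs.
class InvertedIndex {
    private final Map<String, List<Integer>> index = new TreeMap<>();

    // Tokenizes the text and records docId in the posting list of every term it contains.
    // Assumes documents are added in increasing docId order, so each list stays sorted.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a delimiter
            }
            List<Integer> postings = index.computeIfAbsent(token, k -> new ArrayList<>());
            // Skip duplicates: a term repeated within one document is stored once.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
            }
        }
    }

    // Returns the posting list for a term, or an empty list if the term is unknown.
    List<Integer> postings(String term) {
        List<Integer> found = index.get(term.toLowerCase(Locale.ROOT));
        return found == null ? Collections.<Integer>emptyList() : found;
    }
}
```

Keeping each posting list sorted by document ID is what lets two lists be merged in a single linear pass at query time.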

Search

Using the posting lists, the search runs across 10,700 HTML documents.
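The README does not say how queries are evaluated. As one possible illustration, the sketch below answers a multi-word query by intersecting sorted posting lists (a boolean AND). The class PostingSearch and its methods are hypothetical, the posting lists are assumed to be sorted lists of integer document IDs, and ranking is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical boolean AND search over an index of term -> sorted document IDs.
class PostingSearch {

    // Merges two sorted posting lists, keeping only the IDs present in both.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    // Returns the IDs of documents that contain every term of the query.
    static List<Integer> search(Map<String, List<Integer>> index, String query) {
        List<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> postings = index.get(term);
            if (postings == null) {
                return new ArrayList<>(); // an unknown term means no document can match
            }
            if (result == null) {
                result = new ArrayList<>(postings);
            } else {
                result = intersect(result, postings);
            }
        }
        return result == null ? new ArrayList<Integer>() : result;
    }
}
```

Because each term lookup is a single map access and each intersection is linear in the list lengths, a query only touches the documents listed for its terms rather than scanning every page.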

What I Learnt

  • How to build an inverted index.
  • The importance of ranking in a search engine.
  • Techniques to improve the speed and efficiency of a search engine.
  • How to deal with the complications of extracting searchable information from cluttered and unorganized HTML web pages.