Developed a Python-based search engine that indexes HTML documents in a trie and searches them by term. The engine extracts text from unstructured pages, filters out stopwords, and ranks results by term frequency.
Key Features & Functionalities:
HTML Parsing: Extracts text content from web pages while removing unnecessary formatting.
Tokenization & Stopword Removal: Cleans and preprocesses text using NLTK to eliminate common stopwords and enhance search accuracy.
Trie-based Indexing: Uses a prefix tree (trie) for fast lookups; words that share a prefix share the same nodes, so each common prefix is stored only once.
Efficient Searching: Locates a term in the trie in O(k) time, where k is the length of the search term, regardless of how many words are indexed.
Result Ranking: Sorts matching documents by how often the search term occurs, most frequent first, to improve relevance.
Robust Edge Case Handling: Includes handling for empty queries, special characters, and nonexistent terms.
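The indexing, lookup, and ranking features above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names TrieNode, TrieIndex, insert, and search are hypothetical, and it assumes each terminal node keeps a per-document occurrence count that is later used for ranking.

```python
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}                  # char -> TrieNode
        self.doc_counts = defaultdict(int)  # doc_id -> occurrences of the word ending here


class TrieIndex:
    """Hypothetical trie index; one node per character of each indexed word."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, doc_id):
        # Walk the trie, creating nodes as needed, then record the occurrence.
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.doc_counts[doc_id] += 1

    def search(self, term):
        # Returns [(doc_id, count), ...] sorted by descending count.
        term = term.strip().lower()
        if not term:                        # empty query
            return []
        node = self.root
        for ch in term:                     # O(k) walk, k = len(term)
            node = node.children.get(ch)
            if node is None:                # nonexistent term or unseen character
                return []
        # Ties are broken by doc_id so the ordering is deterministic.
        return sorted(node.doc_counts.items(), key=lambda item: (-item[1], item[0]))


index = TrieIndex()
docs = {"a.html": ["trie", "search", "trie"], "b.html": ["trie", "tree"]}
for doc_id, words in docs.items():
    for word in words:
        index.insert(word, doc_id)

print(index.search("trie"))  # [('a.html', 2), ('b.html', 1)]
print(index.search("tri"))   # [] (a prefix only, not an indexed word)
print(index.search(""))      # []
```

In this sketch the walk down the trie is O(k), while sorting the matching documents adds a cost that depends on how many documents contain the term.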
Technologies Used:
Programming Language: Python
Libraries: BeautifulSoup (for HTML parsing), NLTK (for natural language processing), Regular Expressions (for tokenization)
Data Structures: Trie (Prefix Tree)
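The preprocessing step that feeds the trie (HTML text extraction, tokenization, stopword removal) might look like the sketch below. To keep it self-contained, it swaps in the standard library's html.parser for BeautifulSoup and a tiny hand-written stopword set for NLTK's list; the names TextExtractor, preprocess, and STOPWORDS are hypothetical and only illustrate the flow.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword set (assumption); the project uses NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html):
    # Extract visible text, lowercase it, tokenize with a regular
    # expression, and drop stopwords.
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]


page = ("<html><body><h1>The Trie</h1><p>A trie is a prefix tree.</p>"
        "<script>var x = 1;</script></body></html>")
print(preprocess(page))  # ['trie', 'trie', 'prefix', 'tree']
```

The resulting token list is what would be inserted into the trie, one word at a time, together with the document's identifier.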