Search engine information retrieval with Lucene.
View the Project on GitHub yeexunwei/search-engine-security-breach
Develop a search engine using the Lucene model. Returned results are ranked by tf-idf score. Users provide relevance feedback to improve the effectiveness of the retrieval.
forms.py
- Flask form class
graph.ipynb
- working notebook to determine the calculation for relevance feedback
lucene.py
- includes stemming, indexing, tokenising, search score calculation, and returning search results
webapp.py
- final working Flask app that handles the routes for home, search, result, and add rating
This information retrieval system uses the tf-idf model. Tf-idf, which stands for term frequency-inverse document frequency, is a scoring measure widely used in information retrieval (IR) and summarization. It is intended to reflect how relevant a term is to a given document. The intuition is that if a word occurs multiple times in a document, its relevance should be boosted, because it is likely more meaningful than words that appear fewer times (TF). At the same time, if a word occurs many times in a document but also in many other documents, it is probably just a common word rather than a relevant or meaningful one (IDF). A term's importance therefore increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus.
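The TF and IDF intuition can be made concrete with a small sketch. This is plain Python, not the project's code; the function name `tf_idf`, the sample documents, and the specific variant (raw count for tf, `log(N / df)` for idf, whitespace tokenising) are assumptions chosen for illustration. Libraries typically use smoothed or normalised variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = raw count of the term in the document,
    idf = log(N / df), where df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical sample corpus.
docs = [
    "password breach breach reported",
    "the breach exposed the password database",
    "the weather is sunny",
]
weights = tf_idf(docs)
```

In the first document, "breach" occurs twice and so outweighs "password", which occurs once with the same document frequency. In the second document, "the" also occurs twice, but because it appears in two of the three documents its weight ends up below that of "exposed", which occurs once and only in that document.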
A ranking function is implemented with the Lucene library to rank results for the user's query terms. The tf-idf score gives the term weighting for the user query and the documents, and the initial ranking of the documents is based on these tf-idf weights.
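As a rough illustration of ranking by tf-idf weight, the sketch below scores each document by summing the weights of the query terms it contains. It is a plain-Python stand-in, not the library's scoring function; the name `rank`, the sample documents, and the unsmoothed `tf * log(N / df)` weighting are assumptions.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending score.

    Assumed scoring: sum over query terms of tf * log(N / df).
    Documents with a zero score are left out.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    results = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in query.lower().split() if t in tf)
        if score > 0:
            results.append((i, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample corpus and query.
docs = [
    "password breach breach reported",
    "the breach exposed the password database",
    "the weather is sunny",
]
ranked = rank("breach database", docs)
```

Here the second document ranks first: it matches both query terms, and "database" is rare in the corpus, which outweighs the repeated "breach" in the first document. The third document matches nothing and is omitted.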
Using the ratings given by the user, the system generates the next set of relevant results. For each query, the user can rate a result from one to five. Three steps of calculation:
The data source is from Kaggle.
# flask
pip install flask
# whoosh
pip install whoosh
export FLASK_APP=webapp.py
flask run
Running on: