Boolean Retrieval Model: Inverted Index and Positional Index

Syed Ahsan Akhtar Naqvi
5 min read · Mar 20, 2022
Information Retrieval

While it’s amazing to use a search engine to fulfill your information need, it’s equally important to learn the mechanics behind it to obtain accurate results.

Information Retrieval is the process of retrieving relevant documents from a large unstructured collection of data (usually free text) in response to a user query, so as to satisfy his or her information requirement. An IR system has the ability to represent, store, organize, and access information items. [1]

In other words, an Information Retrieval system sorts and ranks documents by making use of crawling, parsing, indexing, and various weighting schemes.

What is the Boolean Model?

The standard Boolean Model is a simple approach to Information Retrieval that provides the core foundation for numerous other Information Retrieval models, such as the Extended Boolean Model and the Vector Space Model. The model is based on Boolean logic and set-theory operations. The user query is transformed into a Boolean expression, written in either conjunctive normal form or disjunctive normal form. [2]
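Before diving into the implementation, the core idea can be illustrated with a toy sketch (not part of the blog’s code; the documents and terms here are made up): each term maps to the set of documents containing it, and a Boolean query is evaluated with plain set operations.

```python
# Toy illustration of the Boolean model: documents as term sets,
# and a Boolean query evaluated with set operations over doc IDs.
docs = {
    1: {"boolean", "retrieval", "model"},
    2: {"vector", "space", "model"},
    3: {"boolean", "logic"},
}

def posting(term):
    # Set of doc IDs whose documents contain the term
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Query in disjunctive normal form: (boolean AND model) OR logic
result = (posting("boolean") & posting("model")) | posting("logic")
print(sorted(result))  # [1, 3]
```

Documents 1 and 3 match: document 1 contains both “boolean” and “model”, and document 3 contains “logic”.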

In this blog, we will explore the concepts and implementation of Inverted Index and Positional Index in Python.

Python Implementation of Inverted Index and Positional Index

You may find the code in the GitHub repository.

Step by step guide to generating an Inverted and Positional index is as follows:

1. Gather Data:

I have used an offline dataset (corpus) for this implementation, but you may find various datasets on Kaggle.

2. Import Libraries and Packages:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
from flask import Flask, render_template, url_for, request

These are a few libraries that will help you throughout the retrieval process. NLTK is imported here to deal with human-language data, and Flask is an additional library used to design a UI.

3. Document Pre-processing:

It is the foremost step in generating an efficient Information Retrieval model. The pre-processing stage involves stemming, removal of stopwords, removal of punctuation marks, and generation of stemmed tokens from the dataset.

def ReadStopWords(self):
    f = open("stopword.txt", "r")
    stopWords = f.read()
    self.stopWords = stopWords.split("\n")

def RemovePuncs(self):
    self.fileText = self.fileText.lower()
    punctuation_marks = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for char in self.fileText:
        if char in punctuation_marks:
            self.fileText = self.fileText.replace(char, " ")

def StopWordsRemoval(self):
    # Removing stopwords
    custom_sw_list = stopwords.words('english')
    Tokens = word_tokenize(self.fileText)
    self.fileText = [word for word in Tokens if word not in custom_sw_list]

# Stemming of document words and tokenization
def PorterStemmer(self):
    stemmer = PorterStemmer()
    stemmedTokens = []
    for token in self.fileText:
        stemmedTokens.append(stemmer.stem(token))
    self.fileText = stemmedTokens

Here, I have read my own stopwords file, but you can make full use of the NLTK stopwords package to remove stopwords (the most common words) from your textual document with the same method.
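The same pipeline can be sketched without NLTK, which makes the steps easier to follow in isolation. This is only an illustration: the stopword list and the crude suffix-stripping below are stand-ins for NLTK’s stopword corpus and Porter stemmer, which are far more thorough.

```python
# Minimal dependency-free preprocessing sketch: lowercase, strip punctuation,
# drop stopwords, then apply a crude suffix-stripping "stemmer".
# The stopword list and suffix rules are illustrative stand-ins for NLTK's.
import string

STOPWORDS = {"the", "is", "a", "of", "and"}

def preprocess(text):
    text = text.lower()
    # Replace punctuation with spaces, as in RemovePuncs above
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # Very rough stemming: strip a few common suffixes
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The indexing of documents, and retrieval!"))
# → ['index', 'document', 'retrieval']
```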

4. Create Inverted Index Dictionary:

An Inverted Index dictionary can simply be seen as a look-up table where multiple document IDs correspond to each term.

def Dictionary(self, docID):
    # Picking out the unique tokens
    unique = []
    for word in self.fileText:
        if word not in unique:
            unique.append(word)
    self.fileText = unique
    self.fileText.sort()
    # Forming the dictionary: { Term -> set of doc IDs }
    for term in self.fileText:
        if term in self.inverted_index:
            self.inverted_index[term].add(docID)
        else:
            self.inverted_index[term] = {docID}
    return self.inverted_index

Step 2 generates the stemmed tokens from our data collection. In this step, we will pick out the unique items or words from the tokens and form a dictionary against them. A dictionary can simply be thought of as a hashmap where each term exists as a key and document IDs are mapped as values corresponding to those keys.
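The same hashmap idea can be shown as a self-contained sketch, independent of the class above (the function and document names here are illustrative, not from the repository):

```python
# Standalone sketch of an inverted index: term -> set of doc IDs.
# Assumes documents are already preprocessed into token lists.
def build_inverted_index(documents):
    # documents: {docID: [token, token, ...]}
    index = {}
    for doc_id, tokens in documents.items():
        for term in set(tokens):  # unique terms per document
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: ["inform", "retriev", "model"],
    2: ["boolean", "model"],
    3: ["inform", "system"],
}
index = build_inverted_index(docs)
print(sorted(index["model"]))   # [1, 2]
print(sorted(index["inform"]))  # [1, 3]
```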

5. Create Positional Index Dictionary:

An Inverted Index may fail to produce efficient results for proximity or “phrase” queries. In such cases, Positional Indexing is used to enable faster phrase searches and faster relevance ranking. [3]

def Dictionary(self, docID):
    temp_dict = {}
    Position = 0
    for word in self.fileText:
        temp_dict.setdefault(word, [])
        temp_dict[word].append(Position)
        Position += 1
    for term in temp_dict:
        if self.positional_Index.get(term):
            self.positional_Index[term][docID] = temp_dict.get(term)
        else:
            self.positional_Index[term] = {}
            self.positional_Index[term][docID] = temp_dict.get(term)
    # The dictionary is going to be: { Term: { docID: [pos1, pos2, ...] } }

Here, I have not done anything different from the Inverted Index dictionary, except that I have constructed a nested Python dictionary to save, for each term, the positions at which its tokens (stemmed words) occur in each document.
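The nested-dictionary structure can again be shown as a minimal standalone sketch (names here are illustrative):

```python
# Standalone sketch of a positional index: { term: { docID: [pos1, pos2, ...] } }.
def build_positional_index(documents):
    index = {}
    for doc_id, tokens in documents.items():
        for pos, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: ["to", "be", "or", "not", "to", "be"],
    2: ["be", "or", "be"],
}
index = build_positional_index(docs)
print(index["be"])  # {1: [1, 5], 2: [0, 2]}
```

Each term now carries not just *which* documents it appears in, but *where*, which is exactly what a proximity query needs.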

6. Query Processing:

  • Proximity Query Processing:

Proximity queries need to be processed before moving on to the posting-list intersection and retrieval phase. An instance of a proximity query is X Y /15, which states that X and Y are 15 positions apart.

def QueryProcessingProximity(self, Query):
    Query = Query.lower()
    result = []
    # Tokenizing the query
    Tokens = word_tokenize(Query)
    print(Tokens)
    # Stemming the query
    stemmer = PorterStemmer()
    stemmedTokens = []
    for token in Tokens:
        stemmedTokens.append(stemmer.stem(token))
    # Extracting the skip position
    Query = stemmedTokens
    Query[2] = re.sub("/", "", Query[2])
    skipMarker = int(Query[2])
    word1 = self.positional_Index.get(Query[0])
    word2 = self.positional_Index.get(Query[1])
    intersect = set(word1).intersection(set(word2))
    # This intersect is used in the posting-list retrieval stage below.

The query processing is similar to the document pre-processing: the query terms are stemmed, and the skip distance is extracted from the query. word1 and word2 hold the positional-index entries (dictionaries keyed by document ID) for the two query terms.
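With the positional entries for both terms in hand, the proximity check itself reduces to comparing position lists within each shared document. A hedged standalone sketch of that check (function and data names are illustrative; like the blog’s code, it tests for an *exact* distance of k):

```python
# Sketch of proximity matching over a positional index: find documents
# where word1 and word2 occur exactly k positions apart, as in "X Y /k".
def proximity_match(positional_index, word1, word2, k):
    postings1 = positional_index.get(word1, {})
    postings2 = positional_index.get(word2, {})
    result = []
    # Only documents containing both words can match
    for doc_id in set(postings1) & set(postings2):
        if any(abs(p1 - p2) == k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            result.append(doc_id)
    return sorted(result)

index = {
    "inform": {1: [0, 7], 2: [3]},
    "retriev": {1: [2], 2: [4]},
}
print(proximity_match(index, "inform", "retriev", 2))  # [1]
```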

  • Boolean Query:

A query is translated into conjunction or disjunction operators and is then evaluated.

def QueryProcessingBooleanQuery(self, Query):
    print("User query is:", Query)
    excluded = ["AND", "and", "OR", "or", "NOT", "not"]
    operations = []
    Tokens = word_tokenize(Query)
    # Collect the Boolean operators in order of appearance
    for word in Tokens:
        if word in excluded:
            operations.append(word)
    # Stem the query tokens (PorterStemmer also lowercases them)
    stemmer = PorterStemmer()
    stemmedTokens = []
    for token in Tokens:
        stemmedTokens.append(stemmer.stem(token))
    Query = stemmedTokens
    # Select the query terms (operators excluded) to retrieve posting lists
    QueryTerms = []
    for term in Query:
        if term not in excluded:
            QueryTerms.append(term)
    # These values are used in the posting-list retrieval process.
    return QueryTerms, operations

7. Postings List Retrieval:

  • Inverted Index:
# ---------- Retrieve all posting lists for the query terms ----------
def Posting_Lists_Retrieval(self, QueryTerms):
    # QueryTerms comes from the query processing step above.
    PostingList = []
    for term in QueryTerms:
        PostingList.append(self.inverted_index[term])
    return PostingList

# ---------- Intersect (AND) or union (OR) the posting lists ----------
def Posting_Lists_Intersect(self, PostingList, Operations):
    count = len(PostingList)
    if count == 1:
        return sorted(PostingList[0])
    result = set(PostingList[0])
    k = 1
    while count > 1:
        s1 = set(PostingList[k])
        if Operations[k - 1] in ("AND", "and"):
            result = result.intersection(s1)
        elif Operations[k - 1] in ("OR", "or"):
            result = result.union(s1)
        k = k + 1
        count = count - 1
    return sorted(result)
  • Positional Index:
positionListW1 = self.positional_Index[Query[0]][i]
positionListW2 = self.positional_Index[Query[1]][i]
PositionListLength1 = len(positionListW1)
PositionListLength2 = len(positionListW2)
count1 = 0
while count1 != PositionListLength1:
    count2 = 0
    while count2 != PositionListLength2:
        if abs(positionListW1[count1] - positionListW2[count2]) == skipMarker:
            result.append(i)
        count2 = count2 + 1
    count1 = count1 + 1
# Keep each matching document ID only once
unique = []
for docID in result:
    if docID not in unique:
        unique.append(docID)
data = unique
return data

This is the final step for both the Inverted and the Positional Index. It prints the document IDs in which the query terms are found and, in the case of the Positional Index, the document IDs where the terms occur at the queried distance.
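As an aside, the set-based intersection used above can also be done with the textbook merge-style algorithm over two *sorted* posting lists, which walks both lists once in O(m + n). A sketch (not from the repository):

```python
# Textbook merge-style intersection of two sorted posting lists,
# an alternative to the set intersection used in Posting_Lists_Intersect.
def intersect_sorted(p1, p2):
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc ID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller doc ID
        else:
            j += 1
    return answer

print(intersect_sorted([1, 2, 4, 8, 16], [2, 3, 5, 8, 13]))  # [2, 8]
```

This is the approach that skip pointers build on, since a sorted list lets you jump ahead when one pointer lags far behind the other.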

8. Output:

The following UI is deployed on a local server and designed with Flask.

(Screenshots: Home Page, Inverted Index Results, and Positional Index Results.)

References:

[1]https://www.geeksforgeeks.org/what-is-information-retrieval/

[2]https://www.sciencedirect.com/topics/mathematics/boolean-model

[3]https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/cbdv_phrasesearch_positindex.html#:~:text=Positional%20indexing%20improves%20the%20performance,word%20thesaurus%20expansions%20as%20well.

Feel free to comment with your suggestions below!

Thank you.
