Mathematical document retrieval system based on signature hashing

Scientific documents and magazines contain a large number of mathematical expressions and formulae along with text. The continuous growth of such documents calls for specialized tools and techniques that can handle and analyse mathematical expressions and formulae. Mathematical expressions are highly structured and quite different from traditional text, so conventional text retrieval systems perform poorly when retrieving scientific documents for a query formulated as a mathematical expression. Mathematical information retrieval is concerned with finding information in documents that include mathematics. To address the challenges posed by mathematical formulae as compared to text, this paper constructs a math-aware search engine that retrieves relevant scientific documents for a mathematical query. It proposes a novel signature-based hashing scheme for indexing raw mathematical web documents, which also takes mathematical notational equivalences into account. The proposed system demonstrates better precision and stability of the ranked results when compared with related state-of-the-art math-aware search engines.


Introduction
Mathematics is an important constituent of Science, Technology, Engineering and Mathematics (STEM), and it is needed in many spheres of research, education and industry. A scientific document without a single mathematical expression (ME) or symbol is rare. In this digital era, with more and more scientific documents being generated, an information explosion was inevitable. Novel strategies, principles and tools have been developed over the last decade to store, manage and retrieve this vast number of scientific documents and the mathematical expressions they contain.
Research in information retrieval (IR) began in the early 1950s, and many IR models now exist, such as the Boolean model, the Vector Space Model (VSM) and the probabilistic model. However, a vector representation does not consider the ordering of words in a document, which is a crucial factor for MEs, and exact matching may retrieve too few or too many documents [1][2]. IR has been explored exhaustively for many decades, but Mathematical Information Retrieval (MIR) requires a distinct focus because conventional text retrieval systems are not suitable for retrieving mathematical expressions [3][4].
As stated in [5] "Mathematical Information Retrieval is concerned with finding information in documents that include mathematics. This is important for technical disciplines that use math frequently. (e.g. Physics and Computer Science). Mathematical Information Retrieval (MIR) systems are formula based search engine. User information needs requires careful investigation and good understanding to develop firm principles and foundations in the area of MIR systems." The order of the terms in a mathematical expression (ME) is a crucial factor that influences its semantics. However, most existing text-based MIR systems implement a bag-of-words approach, so the order of the terms, and consequently the structure of the ME, is lost. Furthermore, most of these systems use an inverted index with tf-idf ranking. This paper therefore proposes an alternative indexing scheme, a signature-based hash index for mathematical information retrieval, and uses it to construct a math-aware search engine: SigMa. Moreover, we extend the concept of structure-encoded strings (SES) to MathML documents in order to eliminate extraneous symbols such as <mi> and <mo> without losing the structure of a ME.

Background
Classically information retrieval (IR) models can be classified into three broad categories namely set-theoretic, algebraic and probabilistic models [1,6].

Set Theoretic Model
Documents are modeled as sets of the terms they contain, and the standard set-theoretic operations are then used to derive similarities. The Standard Boolean Model is derived from the foundations of set theory and Boolean algebra: connectives such as ∧, ∨ and ¬ are used together with the key terms to issue a query [7]. Although the model is very simple and efficient to implement, it has some limitations. Firstly, it fails to retrieve results that match only partially, and secondly, general users find it very difficult to form complex queries. For these reasons, it yields either high precision with low recall or low precision with high recall. The strict Boolean and fuzzy-set models are preferable to other models in terms of computational requirements [8].
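The set-theoretic view can be illustrated with a minimal sketch. The three toy documents and the helper name `boolean_and_not` are assumptions made for illustration only and do not come from the cited work.

```python
# Toy collection: each document is modeled as the set of terms it contains.
docs = {
    "d1": {"integral", "sine", "limit"},
    "d2": {"integral", "matrix"},
    "d3": {"limit", "series"},
}

def boolean_and_not(required, excluded):
    """Return ids of documents containing every required term and no excluded term."""
    return sorted(
        doc_id
        for doc_id, terms in docs.items()
        if required <= terms and not (excluded & terms)
    )

# Query: integral AND NOT matrix
print(boolean_and_not({"integral"}, {"matrix"}))
# Query: integral AND series (no document holds both, so nothing is returned)
print(boolean_and_not({"integral", "series"}, set()))
```

The second query shows the partial-match limitation: documents d1 and d3 each contain one of the two terms, yet neither is retrieved.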

Algebraic Model
Documents are modeled as vectors, matrices or tuples. Document and query terms are represented as vectors, and the similarity measure is obtained as a scalar value. The popular vector space model falls under this category. In abstract terms, the model is based on the notion that important terms convey the meaning of the document. Two features are widely used to calculate the weight of a term: term frequency and inverse document frequency [9].
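A minimal sketch of this weighting follows, assuming raw term counts for tf and log(N/df) for idf (one of several common variants). The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["a", "-", "b"], ["b", "-", "a"], ["x", "+", "y"]]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two token lists contain the same tokens in a different order, so they receive identical vectors and a cosine similarity of 1 (up to rounding) even though a - b and b - a are different expressions. This is the loss of ordering that matters for mathematical expressions.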

Probabilistic Model
In this model, the notion of relevance is captured within a probabilistic framework, as described in [6,8,9]. In other words, the model estimates the probability that a document d_j is relevant to a given query q_i. It is built on a concrete mathematical foundation of probability and also considers term dependence, relationships, the weights of the query terms etc., but it has many variations depending on the assumptions made. Another substantial problem is that the model is very hard to implement for large-scale information retrieval systems such as web search.
One of the fundamental differences between text and mathematical expressions (MEs) lies in their encoding schemes and formats. Several encoding schemes are available for mathematical expressions, such as MathML [10], LaTeX [11] and OpenMath [12]. Figure 1 shows the representation of mathematical expressions in different encoding schemes, adapted from [2]. Moreover, mathematical notation is quite inconsistent, and its symbol set is limited. A notation is commonly reused, and there often exist several different ways of writing down the same core meaning [13]. Like text, MEs exhibit polysemy. For instance, the Greek letter α (alpha) could be Sommerfeld's constant in physics, the dominant animal or human in zoology, or the brightest star in a constellation in astronomy, which makes it ambiguous. Furthermore, using different variables, constants or symbols gives numerous ways of writing an implicitly equivalent mathematical expression, such as a² + b² vs. α² + β², which demonstrates synonymy. Normalization is the process of reducing mismatch among expressions that are semantically similar, and it also reduces the index size [2,13].
Indexing is another major concern for mathematical information retrieval (MIR) systems and math search engines (MSE). Broadly, there are two breeds of MIR systems based on the indexing scheme: text-based and tree-based. Text-based MIR systems construct a plain-text representation of a mathematical expression or formula and then employ popular information retrieval frameworks such as Lucene or Solr to index it automatically. However, the text representation results in either complete or partial loss of the structure of the equation [14].
For instance, [15] proposed a clustering technique combined with regular expressions to extract feature vectors, while [16] used finite state automata for the same task. Similarly, Miner et al. proposed MathDex [17], which uses text-based n-gram indexing but does not consider several fundamental mathematical equivalences [18].
LaTeXSearch [19], provided by Springer, supports LaTeX and text queries to retrieve documents from its database, while SearchOnMath [20], now part of the Microsoft BizSpark program, indexes five math datasets for retrieval: the English version of Wikipedia, Wolfram MathWorld, DLMF, Socratic and PlanetMath. The indexing schemes of both engines are not available because they are proprietary products. EgoMath [21] stores a mathematical formula in reverse Polish notation and applies an augmentation algorithm, consisting of transformation and generalization rules, together with an ordering algorithm to the input. Although all these systems achieve high recall, their precision needs substantial improvement.
On the other hand, tree-based systems employ trees and variants of trees such as tries or substitution trees, where the leaves of the tree point to the expressions and the posting list. These trees are generally inspired by the data structures of automatic theorem proving. The benefit of this approach is that the structure of the mathematical expression and each attribute of its representation are arranged in a well-structured manner, and retrieval is quite fast. For example, MathWebSearch [22] forms a substitution tree of each substructure for the semantic representation of formulae. It supports exact and similar matching by backtracking over the substitution tree. Schellenberg et al. [23] proposed a similar substitution-tree approach for indexing and retrieval that depends on the layout of the mathematical expression. MIaS [24] follows the same principle for indexing its documents and creates a separate tree for each substructure of a single mathematical formula, which increases the recall of the system and makes it more useful across a broad range of real-world applications. WikiMirs 2.0 [25] considers only formula information, whereas WikiMirs 3.0 [26] adds a context index. The basic system is based on LaTeX markup extracted from a Wikipedia dataset. Although these systems offer very high precision, they suffer from low recall.
This paper constructs a math-aware search engine with an alternative approach to indexing that is based on signature hashing, together with an implementation of structure-encoded strings for mathematical expressions extended to MathML documents. The alternative approach is motivated by the fact that most of the systems discussed above use a bag-of-words approach along with tf-idf scores. The major bottleneck of that approach is the loss of order, and thereby of the whole structure, which is a crucial aspect of a ME.
Most of the math-aware systems discussed in this section are either academic prototypes that are currently inactive or proprietary products. Hence, we compare our system with MIaS and WikiMirs, because they are available and closely related to our approach.

RESEARCH METHOD
Typically, a document d ∈ D can be represented as an m-dimensional feature vector. Similarly, a query q ∈ Q can also be represented as a vector. A similarity coefficient between the two can be measured using a function sim(d, q), which associates a score (a real number) with a document. This score generally lies in the range [0, 1], with 0 representing no similarity and 1 an exact match. However, searching over m-dimensional feature vectors cannot do better than O(|D|). A hash-based indexing scheme can overcome this difficulty, as it can easily determine whether or not q is a member of D in constant time [27]. The central notion of this scheme is to maximize the probability of collision for similar mathematical structures.
The workflow of the proposed system, SigMa, is shown in Figure 2. It can be divided into two phases: an off-line phase for constructing the index and an on-line phase for retrieval as the user issues a query. The corpus consists of math articles, which make up approximately 10% of the collection, and 287,850 text articles, which contribute approximately 90% of the collection. There are around 590,000 formulas in this corpus, encoded using presentation and content MathML. The corpus is divided into 160 parts with the prefix *wpmath* or *wp*, and each sub-directory contains around 2000 articles.
After all the formulae in a file are translated into MathML, each formula appears as a <math> tag annotated with a unique identifier. The identifier consists of the name of the file followed by the relative offset of the formula in the file, e.g. *id="FileName:0"* for the first formula in *FileName.html*. LaTeXML (http://dlmf.nist.gov/LaTeXML/) is used to convert each formula from LaTeX to MathML, producing three representations for each formula:
a. Presentation MathML: It is used to specify the layout and the appearance of the formula.
b. Content MathML: LaTeXML provides an operator tree representation for the semantics of a mathematical expression.
c. LaTeX String: It specifies the symbol layout of the formula using the LaTeX representation.
The size of the corpus in uncompressed form is 5.15 GB.
The query set was downloaded along with the necessary relevance judgments. It is in JSON format and comprises approximately 100 queries. Each query contains a query string in LaTeX along with a list of labels containing the URL and its score.

Off-line Phase
In this phase, the raw data passes through the preprocessing stages and the index is created using the signature-based hashing scheme, without any intervention from the user. This step is necessary for fast retrieval of the documents. We have considered P-MML as our source input document format. The off-line phase has the following modules:
a. Math Extractor
b. Structure Encoded String Generator
c. Hash-key Generator and Index creation

Math Extractor
This module parses the documents of our data set and extracts all the mathematical expressions. We have considered Presentation MathML (P-MML) as our primary supported format. MathML, a W3C standard, is used for the representation of mathematical formulae [28]. The following assumptions are made during the pre-processing stage of the document:
a. Mathematical text and space are not considered, so the <mtext>, <mspace> and <ms> elements are eliminated.
b. MathML elements that contribute mostly to appearance or styling, with little or no bearing on content and semantics, are not considered. Hence, <mstyle>, <merror>, <mpadded>, <mphantom>, <mlabeledtr> and <menclose> are eliminated.
c. Tensors are not considered in this system because they can be represented in many ways, so <mmultiscripts> elements are removed. Support for them may be incorporated in a subsequent version.
d. Similar to the pre-processing stages described in [10,40], Elementary Math Layout and Enlivening Expressions are completely ignored, because these elements are generally used for grouping, binding actions or alignment purposes.
Next, the source document is segregated into two parts: math-text for mathematical content and body-text for the other textual content present in the document.
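The element filtering described in this module could be sketched as follows. This is a minimal illustration rather than the SigMa implementation: it assumes well-formed XHTML input, and it assumes that some eliminated elements are dropped together with their content (`DROP`) while pure wrappers are removed but keep their children (`UNWRAP`). That split, and the names `clean` and `extract_math`, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these elements are discarded together with their content.
DROP = {"mtext", "mspace", "ms", "merror", "mphantom", "mlabeledtr", "mmultiscripts"}
# Assumption: these wrappers are removed but their children are kept.
UNWRAP = {"mstyle", "mpadded", "menclose"}

def local(tag):
    """Drop an XML namespace prefix: '{uri}mi' -> 'mi'."""
    return tag.rsplit("}", 1)[-1]

def clean(elem):
    """Return a copy of elem without DROP elements and with UNWRAP elements flattened.

    Attributes and tail text are not copied in this sketch.
    """
    out = ET.Element(local(elem.tag))
    out.text = (elem.text or "").strip()
    for child in elem:
        name = local(child.tag)
        if name in DROP:
            continue
        cleaned = clean(child)
        if name in UNWRAP:
            out.extend(list(cleaned))   # splice the wrapper's children in place
        else:
            out.append(cleaned)
    return out

def extract_math(document):
    """Parse an XML/XHTML string and return the cleaned <math> elements."""
    root = ET.fromstring(document)
    return [clean(m) for m in root.iter() if local(m.tag) == "math"]

sample = (
    '<html><body><p>Some text</p>'
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<mstyle><mi>a</mi><mtext>where</mtext><mo>+</mo></mstyle>'
    '<mspace/><mi>b</mi></math></body></html>'
)
for math in extract_math(sample):
    print([(child.tag, child.text) for child in math])
```

Everything outside the returned <math> elements corresponds to the body-text part of the document.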

Structure Encoded String Generator
In this module, we have adopted and extended the work reported in [29]. Its authors addressed the problem of automated performance evaluation of Mathematical Expression (ME) recognition and proposed a novel way to convert a ME, which may be non-linear in nature, into a Structure Encoded String (SES): a linear representation that does not lose the ME's spatial relationships such as superscripts and subscripts. Their work was based on LaTeX input. According to their hypothesis, any symbol in a ME is spatially associated with six surrounding positions. In Figure 3(a) and 3(b), "a, b, c, +, =" are base mathematical symbols (M), and the superscript "2", which lies in the northern region, is at the top right (TR). The SES of the Pythagoras formula a² + b² = c² is therefore <aNS2NE+bNS2NE=cNS2NE>. Here, NS marks the start of the northern region and NE marks its end.
After extracting the mathematical expressions from the documents, we generate the equivalent SES for further processing. The SES is generated by scanning the Presentation MathML (P-MML) markup from <math> to </math>. Two special pairs of structure symbols, NS and NE (SS and SE), preserve the structural information of the ME: NS stands for north start and NE for north end, and similarly SS and SE delimit a sub-expression in the southern region. In this way a mathematical expression is converted into a structure encoded string, which makes it linear without losing any structural information. The approach could easily be extended to other formats such as Content MathML or chemical structures. A complete list of the other structural symbols used in the algorithm is given in Table 1.
Table 1
Structural symbol | Description | Code
MATRIX START | Marks the start of a matrix row | MTS
MATRIX END | Marks the end of a matrix | ME
ROOT | For all kinds of roots | RT
TOP START | Captures the start of the above sub-expression | TS
TOP END | Marks the end of the above sub-expression | TE
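The conversion from P-MML to an SES can be sketched for the superscript and subscript cases. This is a simplified illustration, not the authors' algorithm: it handles only <mi>, <mn>, <mo>, <msup> and <msub>, uses the codes NS/NE and SS/SE for the northern and southern regions, and flattens every other element by concatenating its children. The function name `ses_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Drop an XML namespace prefix: '{uri}msup' -> 'msup'."""
    return tag.rsplit("}", 1)[-1]

def ses_tokens(elem):
    """Return the list of SES tokens for a Presentation MathML element."""
    name = local(elem.tag)
    children = list(elem)
    if name in ("mi", "mn", "mo"):           # leaf symbols carry their own text
        return [(elem.text or "").strip()]
    if name == "msup" and len(children) == 2:
        base, sup = children
        return ses_tokens(base) + ["NS"] + ses_tokens(sup) + ["NE"]
    if name == "msub" and len(children) == 2:
        base, sub = children
        return ses_tokens(base) + ["SS"] + ses_tokens(sub) + ["SE"]
    # math, mrow and any unhandled element: concatenate the children in order
    tokens = []
    for child in children:
        tokens.extend(ses_tokens(child))
    return tokens

PYTHAGORAS = (
    "<math><mrow>"
    "<msup><mi>a</mi><mn>2</mn></msup><mo>+</mo>"
    "<msup><mi>b</mi><mn>2</mn></msup><mo>=</mo>"
    "<msup><mi>c</mi><mn>2</mn></msup>"
    "</mrow></math>"
)
print("".join(ses_tokens(ET.fromstring(PYTHAGORAS))))  # aNS2NE+bNS2NE=cNS2NE
```

Keeping the result as a token list rather than a single string makes the later token-level comparison straightforward.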

Hash-key Generator and Index creation
As reported in [30], a hash function f(x) maps a set of keywords into an integer interval from 1 to n. "A signatures is defined as a sequence of w bits created to represent the data contained in each document in a collection. The signature for a document is created by hashing each term to a w string, and OR'ing each of these bit strings together" [31].
Query processing takes the same route: a query signature is created first and then compared with the signatures in the collection [32]. A document signature is a bit vector in which a bit takes the value 0 when there is no match for a particular symbol and 1 when there is a match. It is based on a fairly obvious representation of the "structure" of the word as a bit word, which is used as a hash (signature) in the hash table. When searching for a keyword w, the system examines all the signatures in turn and finds those in which the component f(w) equals 1. Only these documents may contain the keyword w, and they are then scanned sequentially for matches.
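A generic signature file of this kind can be sketched as follows. The signature width `W`, the number of bits set per term `k`, and the use of SHA-256 as the term hash are illustrative assumptions and are not taken from [30], [31] or [32].

```python
import hashlib

W = 64  # signature width in bits (illustrative choice)

def term_bits(term, k=3):
    """Hash a term to a W-bit string with up to k bits set."""
    bits = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:4], "big") % W)
    return bits

def signature(terms):
    """OR the bit strings of all terms into one document signature."""
    sig = 0
    for term in terms:
        sig |= term_bits(term)
    return sig

def may_contain(doc_sig, term):
    """True if every bit of the term is set in the signature.

    A True result can be a false positive, so the document must still be scanned.
    A False result is definite: the term was not used to build the signature.
    """
    bits = term_bits(term)
    return (doc_sig & bits) == bits

doc_sig = signature(["a", "NS", "2", "NE", "+", "b"])
print(may_contain(doc_sig, "a"), may_contain(signature([]), "a"))
```

Because different terms can set the same bits, a signature match only marks a candidate document, which is why the matching documents are scanned afterwards.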
As per [33], it can be formally defined as: "The signature sign(w) of the word w is an m-dimensional vector whose kth element equals 1 if the word w contains the symbol a such that f(a)=k and zero otherwise." The signature number of a word is derived from this vector. While indexing, we calculate a hash for each signature generated from the documents, i.e. for each SES. The SES, along with its doc_id, is added to the corresponding row of the hash table, which we construct during the process. We also create an empty bit vector (size=12) and a mapping table containing 12 classes of mathematical operators and symbols, which is used to fill the bit vector. For instance, the generated SES a|NS|2|NE|+|b|NS|2|NE|=|c|NS|2|NE|, which represents the formula a² + b² = c², is encoded into the bit vector 100000001110.
To compute the hash, the symbols of the SES are matched against the mapping table, one bit of the hash per class. A 1 at position i of the hash means that a symbol from the i-th set of the mapping table is present. Finally, a complete signature hash table is generated. To handle collisions, we use chaining, which allows many items to exist at the same location in the hash table by holding a reference to a collection (or chain) of items. The central idea is that similar SES yield similar bit vectors and are consequently hashed to the same location. The complete process is illustrated in the accompanying figure.
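The bit-vector hash and the chained hash table can be sketched as follows. The 12 symbol classes in `CLASSES` are hypothetical, because the actual mapping table is not listed in the text, so the bit vectors produced here will not match the one quoted for SigMa. The class `SignatureIndex` is likewise an illustrative name.

```python
from collections import defaultdict

# Hypothetical mapping table: 12 symbol classes, one per bit position.
CLASSES = [
    set("abcdefghijklmnopqrstuvwxyz"),   # 0: Latin letters
    set("αβγδθλμπσφω"),                  # 1: Greek letters
    set("0123456789"),                   # 2: digits
    {"+", "-"},                          # 3: additive operators
    {"*", "/", "×", "÷"},                # 4: multiplicative operators
    {"=", "≠"},                          # 5: equality relations
    {"<", ">", "≤", "≥"},                # 6: order relations
    {"NS", "NE"},                        # 7: northern region markers
    {"SS", "SE"},                        # 8: southern region markers
    {"RT"},                              # 9: roots
    {"MTS", "ME"},                       # 10: matrix markers
    {"TS", "TE"},                        # 11: top (above) markers
]

def bit_vector(tokens):
    """Set bit i to '1' when any token belongs to class i of the mapping table."""
    return "".join(
        "1" if any(tok in cls for tok in tokens) else "0" for cls in CLASSES
    )

class SignatureIndex:
    """Hash table keyed by bit vector; entries with the same key are chained in a list."""

    def __init__(self):
        self.table = defaultdict(list)

    def add(self, doc_id, tokens):
        self.table[bit_vector(tokens)].append((doc_id, tokens))

    def lookup(self, tokens):
        """Return the chain stored under the bit vector of the given tokens."""
        return self.table.get(bit_vector(tokens), [])

abc = ["a", "NS", "2", "NE", "+", "b", "NS", "2", "NE", "=", "c", "NS", "2", "NE"]
xyz = ["x", "NS", "2", "NE", "+", "y", "NS", "2", "NE", "=", "z", "NS", "2", "NE"]
index = SignatureIndex()
index.add("doc1:0", abc)
index.add("doc2:3", xyz)
print(bit_vector(abc), [doc_id for doc_id, _ in index.lookup(xyz)])
```

Both formulae use the same symbol classes, so they collide on purpose and end up in the same chain, where a finer comparison can rank them.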

Online Retrieval Phase
In the on-line phase, a LaTeX query string is taken as input and converted to P-MML on the fly. The P-MML then passes through the SES converter module, and the hash-key generator of the index module generates a signature for the query.

Matcher and Ranker
The proposed approach uses the Jaccard similarity coefficient [34,35] to match the query against the index database. It measures the similarity between two sets A and B and is given by the following expression:

SCORE=|A∩B|/|A∪B|
The numerator is the number of elements common to A and B, and the denominator is the number of elements in their union. Our implementation operates at the token level: once a match is found in the hash table, we tokenize the query SES and each SES in the chain and divide the number of tokens they share by the total number of distinct tokens in the two. We then retrieve the top k documents in descending order of score. If two or more documents receive the same score, they are ordered on a first-come, first-served basis.
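The token-level scoring and ranking can be sketched as follows. The function names and the toy chain are assumptions for illustration, and the chain is taken to be a list of (doc_id, tokens) pairs in insertion order.

```python
def jaccard(a_tokens, b_tokens):
    """|A ∩ B| / |A ∪ B| over the sets of tokens; 1.0 if both are empty."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def rank(query_tokens, chain, k=10):
    """Score every (doc_id, tokens) entry of a chain and return the top k."""
    scored = [(jaccard(query_tokens, toks), doc_id) for doc_id, toks in chain]
    # sorted() is stable, so entries with equal scores keep their original
    # (first come, first served) order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

query = ["a", "NS", "2", "NE", "+", "b"]
chain = [
    ("d1", ["a", "NS", "2", "NE"]),
    ("d2", ["a", "NS", "2", "NE", "+", "b"]),
    ("d3", ["a", "NS", "2", "NE"]),
]
for score, doc_id in rank(query, chain):
    print(doc_id, round(score, 3))
```

Because the score is computed over token sets, it ignores token order and repetition. That keeps the comparison cheap, but it is one reason why a structure-aware similarity measure remains of interest.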

RESULTS AND ANALYSIS
We evaluated our system using the following evaluation measures:
a. Precision: It measures the exactness of the retrieval process [9,36]. If I denotes the actual set of relevant documents and O denotes the retrieved set of documents, then precision is given by: PRECISION=|I∩O|/|O|
b. Discounted Cumulative Gain (DCG): DCG measures the usefulness, or gain, of a document based on its position in the result list [1,37]. The DCG of the top-k retrieved results is calculated over a list rel, in which the i-th element (rel_i) denotes whether the i-th retrieved formula is relevant to the query (rel_i=1) or not (rel_i=0).
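Both measures can be sketched in a few lines. For DCG, the sketch assumes one widely used formulation, the sum of rel_i / log2(i + 1) over the top k positions. Other variants exist, and the text does not state which one was used in the experiments. The function names are hypothetical.

```python
import math

def precision_at_k(retrieved, relevant, k=10):
    """|I ∩ O| / |O|, where O is the top-k retrieved list and I the relevant set."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)

def dcg_at_k(rel, k=10):
    """Sum of rel_i / log2(i + 1) for positions i = 1..k (assumed DCG variant)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rel[:k], start=1))

print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))  # 0.5
print(dcg_at_k([1, 0, 1]))                                    # 1.5
```

In the second call the relevant results sit at positions 1 and 3, so they contribute 1/log2(2) = 1 and 1/log2(4) = 0.5: the same relevant document earns less gain the lower it is ranked.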
We have taken LaTeX representations of mathematical equations as queries, with query ids 1, 2, 3, … as shown in Table 2. For each query we retrieved the top 10 results (documents) on the basis of score. We compared our results with three state-of-the-art MIR systems, namely MIaS, WikiMirs 1 and WikiMirs 2. Precision@10 was calculated, and a comparative analysis for 25 queries is shown in Figure 5. We also calculated DCG for each system based on the results returned for each query. The relevance of a document is measured on a scale of 1 to 5, where 1 means not relevant and 5 means highly relevant; 2, 3 and 4 indicate partial relevance, based on how closely the retrieved documents match the query. The DCG comparison is shown in Figure 6.

Figure 5. Precision@10 Comparison
It may be observed that our system, SigMa, performs better than MIaS and WikiMirs1 in terms of precision and is comparable to WikiMirs2. As far as the usefulness of the results is concerned, SigMa yields a much better DCG than MIaS and WikiMirs1, but WikiMirs2 achieves a better DCG than SigMa. This may be due to the fact that WikiMirs2, the improved version of WikiMirs, incorporates an additional context index that improves the ranking of the results. We are currently in the process of indexing the Mathematical Retrieval Collection, which contains more than 324,000 XHTML documents and has a size of approximately 48 GB (uncompressed). We are also analyzing different similarity measures and weighting schemes for mathematical expressions. Although the precision of our system is decent, false positives are inevitable in a signature-based hash scheme. We are therefore also examining other data structures, such as tries, directed acyclic graphs and Bloom filters, to address the issues of structure preservation, ordering, normalization and false positives/negatives.

CONCLUSION
In an attempt to craft a better retrieval model for MIR systems, we theorized that a signature-based hashed indexing scheme would be a better alternative to tree-based or text-based models. To test this theory, we constructed a mathematical search engine, "SigMa", aimed at scientific documents with mathematical content.
First, mathematical information is extracted from the scientific documents and converted to structure encoded strings. These strings serve as the input to the hash-based indexing scheme, which converts each SES into a bit vector (signature). A hash table of these signatures is created, which enables online searching. Queries in the form of LaTeX strings are converted to P-MML on the fly, and their bit vectors are generated in the same way. Finally, these bit vectors are looked up in the hash table of signatures, and relevant results are retrieved if a match is found. The system was compared with state-of-the-art MIR systems, and the preliminary results of this scheme are encouraging and competitive with those systems.
SigMa is aimed at faster retrieval and for this purpose employs a hashing scheme based on document signatures. The limitation of this scheme is that false positives are inevitable. SigMa is also not free of false negatives. Similarity matching and weighting schemes have to be dealt with differently for mathematical expressions, because they have to take into consideration both the order and the equivalence of mathematical symbol notation. In future, other optimization techniques and weighting schemes can also be explored. Moreover, to reduce false hits, Bloom filters may be explored for their proven ability to eliminate false negatives. A better weighting scheme for ranking, one that exploits the semantics of mathematical expressions along with the metadata of the scientific documents, could point to other research directions. Finally, how to compute the similarity score according to the structural features remains an open problem, because the intent of different users of MIR systems varies with their context and precise needs.
ISSN: 2528-2417 Mathematical document retrieval system based on signature hashing (Sourish Dhar) 55