Freetext Indexing Tutorial

Table of Contents

Getting set up

The basics: adding and querying

A silly but more interesting example

Querying simple expressions

Querying wildcards

A larger example with realistic data

Querying Gov

Querying from Prolog

Using Free-Text Indexing from SPARQL

Why a magic predicate?

As in the other tutorials, you should read this while running an interactive Lisp session. Evaluate the forms one after another, from the top of the tutorial to the bottom.

Getting set up

(require :agraph)                    ; load AllegroGraph

(in-package :triple-store-user)      ; switch to the user package

(enable-print-decoded t)             ; make printing look nicer

(make-tutorial-store)                ; create a triple-store

Tell AllegroGraph which predicates you want to index, here rdfs:comment and rdfs:label:

(register-freetext-predicate  
  !<http://www.w3.org/2000/01/rdf-schema#comment>)  
(register-freetext-predicate  
  !<http://www.w3.org/2000/01/rdf-schema#label>) 

Check that AllegroGraph registered both predicates:

> (freetext-registered-predicates)  
("http://www.w3.org/2000/01/rdf-schema#label"                                                      
 "http://www.w3.org/2000/01/rdf-schema#comment")  
 

The basics: adding and querying

Now we add some triples:

(add-triple !"Jans" !rdfs:comment  
            !"Born in Amsterdam in the Netherlands")  
(add-triple !"Gary" !rdfs:comment  
            !"Born in Springfield in the USA")  
(add-triple !"Steve" !rdfs:label  
            !"Born in New Amsterdam in the USA") 

Using the index:

; return all triple-ids that match "amsterdam"  
(freetext-get-ids "amsterdam")        
;=> (1 3)  
  
; a boolean expression  
(freetext-get-ids '(and "amsterdam" "usa"))  
  
; a phrase, note the quotation marks  
(freetext-get-ids "\"New Amsterdam\"")      
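
To see what these boolean expressions select, here is a minimal sketch in plain Common Lisp that evaluates the same kind of and/or expressions against the three strings added above. It is only a mental model: the names *toy-texts*, tokenize, toy-eval, and toy-get-ids are hypothetical, phrase queries are not handled, and nothing here claims to be how AllegroGraph builds or searches its index.

```lisp
;; Toy model only: triple id -> object text, mirroring the three triples above.
(defparameter *toy-texts*
  '((1 . "Born in Amsterdam in the Netherlands")
    (2 . "Born in Springfield in the USA")
    (3 . "Born in New Amsterdam in the USA")))

(defun tokenize (text)
  "Split TEXT on spaces and return a list of lower-cased words."
  (let ((words '()) (start 0))
    (loop
      (let ((end (position #\Space text :start start)))
        (when (< start (or end (length text)))
          (push (string-downcase (subseq text start end)) words))
        (if end (setf start (1+ end)) (return))))
    (nreverse words)))

(defun toy-eval (query)
  "Return the ids matching QUERY, a word or an (and ...) / (or ...) list."
  (if (stringp query)
      (loop for (id . text) in *toy-texts*
            when (member (string-downcase query) (tokenize text)
                         :test #'string=)
              collect id)
      (let ((sets (mapcar #'toy-eval (rest query))))
        (ecase (first query)
          (and (reduce #'intersection sets))   ; every sub-query must match
          (or  (reduce #'union sets))))))      ; any sub-query may match

(defun toy-get-ids (query)
  "Like TOY-EVAL, but with the ids sorted for stable output."
  (sort (copy-list (toy-eval query)) #'<))

(toy-get-ids "amsterdam")                 ;=> (1 3)
(toy-get-ids '(and "amsterdam" "usa"))    ;=> (3)
(toy-get-ids '(and "usa" "born"))         ;=> (2 3)
```

Nested expressions work the same way, because each sub-expression is evaluated to a set of ids before the sets are combined.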

We can also return a cursor:

(freetext-get-triples '(and "usa" "born"))  ; return a cursor  
;=> #<db.agraph::triple-id-list-cursor @ #x10034e1232> 

We can use the cursor in the usual way. First we bind it to the variable cursor:

(setf cursor (freetext-get-triples '(and "usa" "born"))) 

Then we loop over the cursor with the handy iterate-cursor:

(iterate-cursor (triple cursor)  
    (print triple)) 

And sometimes it is handy to get them in a list:

(freetext-get-triples-list '(and "usa" "born"))  
  
;=>  
(<triple 3: "Steve" rdfs:label Born in New Amsterdam in the USA default-graph>  
    <triple 2: "Gary" rdfs:comment Born in Springfield in the USA default-graph>) 

And sometimes you only want the subjects back, which is especially useful when calling from your favorite query language:

(freetext-get-unique-subjects '(and "netherlands" "born"))  
;=> ({"Jans"})  
 

A silly but more interesting example

(register-namespace "ex" "http://www.franz.com/simple#") 

First we add some new triples to our open triple-store. Note that the object of each new triple is a string of one to eight randomly chosen number words (in English). We're going to add the triples in a somewhat roundabout fashion:

  1. first we'll create an N-Triples file of our data
  2. and then we'll use load-ntriples to load this file

Here is the code:

(defun fill-dummy-ntriple-file-and-load-it (max)
  ;; Write MAX subjects with five rdfs:comment triples each,
  ;; then load the file and index the new triples.
  (let ((words '("one " "two " "three " "four " "five "
                 "six " "seven " "eight " "nine " "ten "))
        (predicate "<http://www.w3.org/2000/01/rdf-schema#comment> "))
    (with-open-file (out "dum.ntriples" :direction :output
                                        :if-exists :supersede)
      (dotimes (i max)
        (let ((subject (string+ "<subject-" i "> ")))
          (dotimes (j 5)
            ;; the object is 1 to 8 randomly chosen number words
            (let ((object (apply 'string+
                                 (let ((li nil))
                                   (dotimes (k (1+ (random 8)))
                                     (push (nth (random 10) words) li))
                                   li))))
              (format out "~a~a~s .~%" subject predicate object))))))
    (load-ntriples "dum.ntriples")
    (index-all-triples)))
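
The format call at the heart of this function writes one N-Triples line per triple. As a small sketch, the hypothetical helper ntriple-line below (not part of AllegroGraph) applies the same directives to a fixed object string instead of a random one, so you can see what a line in the file looks like:

```lisp
(defun ntriple-line (subject predicate object)
  "Format one N-Triples line. ~a prints SUBJECT and PREDICATE as-is;
~s prints OBJECT with surrounding double quotes, making it a literal."
  (format nil "~a~a~s ." subject predicate object))

(ntriple-line "<subject-0> "
              "<http://www.w3.org/2000/01/rdf-schema#comment> "
              "ten eight three ")
;=> "<subject-0> <http://www.w3.org/2000/01/rdf-schema#comment> \"ten eight three \" ."
```

The trailing space inside the literal comes from the word list, where every word carries its own separator.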

Let's try it out:

(fill-dummy-ntriple-file-and-load-it 10) 

And look at some triples:

(dolist (e (get-triples-list))  
  (print e)) 

Now we want to query this data, so let us write a little test function:

(defun print-freetext-triples (query)  
  (print-triples (freetext-get-triples query)))  
  
;; an easier to type version!  
(defun pft (query)  
  (print-freetext-triples query)) 

Querying simple expressions

(pft "eight")  
  
(pft '(and "ten" "eight"))  
  
(pft '(and "ten" "eight" (or "three" "four")))  
  
(pft '(or (and "five" "one")  
    (and "ten" "eight" (or "three" "four")))) 

Querying wildcards

;; wildcards: * matches zero or more characters
;;            ? matches exactly one character
;; * is not allowed inside phrases

(pft "?i*") ; five six eight nine

(pft "?i?e") ; five nine
  
(pft  
 '(or (and "fiv*" "on*")  
   (and "te*" "eigh*" (or "th*ree" "fo*ur" "\"one five\"")))) 
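
If you want to predict which of the ten number words a pattern will hit, the wildcard rules from the comments above can be sketched in plain Common Lisp. The names wildcard-match-p, *number-words*, and matching-words are hypothetical, and the sketch only models the stated rules (? is exactly one character, * is zero or more); it makes no claim about how AllegroGraph tokenizes or matches text internally.

```lisp
(defparameter *number-words*
  '("one" "two" "three" "four" "five" "six" "seven" "eight" "nine" "ten"))

(defun wildcard-match-p (pattern word &optional (p 0) (w 0))
  "True if WORD matches PATTERN from positions P and W onward."
  (cond ((= p (length pattern))
         ;; pattern used up: succeed only if the word is used up too
         (= w (length word)))
        ((char= (char pattern p) #\*)
         ;; * matches nothing, or consumes one character and tries again
         (or (wildcard-match-p pattern word (1+ p) w)
             (and (< w (length word))
                  (wildcard-match-p pattern word p (1+ w)))))
        ((= w (length word)) nil)
        ((or (char= (char pattern p) #\?)
             (char-equal (char pattern p) (char word w)))
         (wildcard-match-p pattern word (1+ p) (1+ w)))
        (t nil)))

(defun matching-words (pattern)
  "Return the number words that match PATTERN, in their original order."
  (remove-if-not (lambda (word) (wildcard-match-p pattern word))
                 *number-words*))

(matching-words "?i*")    ;=> ("five" "six" "eight" "nine")
(matching-words "?i?e")   ;=> ("five" "nine")
(matching-words "th*ree") ;=> ("three")
```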

A larger example with realistic data

Here is an example that uses a large file, filled with weapon systems, terrorists, and a lot of common knowledge from the Cyc database (available on request: please mail ssears@franz.com).

We include this non-trivial example because it lets us run some select queries.

(defun read-gov ()  
  (format t "~%Add triples")  
  (load-ntriples "/path/to/the/gov/data/Gov.ntriples")  
  (format t "~%Index-all-triples")  
  (index-all-triples))  
  
(time (read-gov))  
  
(register-namespace "c" "http://www.cyc.com/2002/04/08/cyc#")  
  
(get-triples-list :p !rdfs:comment) 

Querying Gov

The pft (print-freetext-triples) function was defined above.

(pft '(and "collection" "people"))  
  
(pft "\"collection of people\"") 

Querying from Prolog

(select (?person)  
  (lisp ?list  
        (freetext-get-unique-subjects  
          '(and "collection" "people")))  
  (member ?person ?list)  
  (q ?person !rdfs:subClassOf !c:AsianCitizenOrSubject))  
  
(select (?person)  
  (lisp ?list (freetext-get-unique-subjects  
                "\"collection of people\""))  
  (member ?person ?list)  
  (q ?person !rdfs:subClassOf !c:AsianCitizenOrSubject)) 

Using Free-Text Indexing from SPARQL

You can refer to the contents of the free-text index from within your SPARQL queries by using one of two "magic" predicates:

Use fti:match when you want to match simple strings and phrases. Use fti:matchExpression when you need more complex text-matching expressions (e.g., ones with ands and ors in them).

A triple pattern such as

?x fti:match "baseball" 

will generate bindings for ?x, where each binding is the subject of a matching triple. "Matching" means that the predicate of the triple is registered with the free-text indexing system and that the object of the triple matches the query (in this case, "baseball"). For example:

(sparql:run-sparql  
  "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>  
   SELECT ?x WHERE { ?x fti:match \"baseball\" }"  
   :results-format :table) 
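
The two conditions in that definition can be modelled in a few lines of plain Common Lisp. This is a sketch with invented data: *registered*, *triples*, and toy-fti-match are hypothetical names, the three triples are made up for illustration, and a case-insensitive substring test stands in for real free-text matching.

```lisp
;; Predicates registered for free-text indexing (toy model).
(defparameter *registered* '("rdfs:comment" "rdfs:label"))

;; Invented triples as (subject predicate object) lists.
(defparameter *triples*
  '(("ex:a" "rdfs:comment" "Plays baseball on weekends")
    ("ex:b" "rdfs:comment" "Plays chess on weekends")
    ("ex:c" "ex:note"      "Owns a baseball glove")))  ; predicate not registered

(defun toy-fti-match (word)
  "Return the subjects of triples whose predicate is registered
and whose object contains WORD (case-insensitive substring test)."
  (remove-duplicates
   (loop for (s p o) in *triples*
         when (and (member p *registered* :test #'string=)
                   (search word o :test #'char-equal))
           collect s)
   :test #'string=))

(toy-fti-match "baseball")  ;=> ("ex:a")
```

Note that ex:c is not returned even though its object mentions baseball, because its predicate was never registered.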

You can use all of your normal free-text patterns here, and you can use multiple fti:match triple patterns in one query. Recall that strings in SPARQL can be delimited with single quotes, which greatly reduces the number of characters you need to escape inside a Lisp string.
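
To make the escaping point concrete, here is a small sketch in plain Common Lisp that writes the same phrase pattern both ways. The variable names are hypothetical; the two strings describe the same SPARQL literal, and only the amount of backslash noise in the Lisp source differs.

```lisp
;; SPARQL literal delimited with single quotes: only the inner
;; double quotes need a Lisp escape.
(defparameter *single-quoted*
  "?x fti:match '\"collection of people\"'")

;; SPARQL literal delimited with double quotes: the inner quotes must be
;; escaped for SPARQL, and every quote and backslash again for Lisp.
(defparameter *double-quoted*
  "?x fti:match \"\\\"collection of people\\\"\"")

(write-line *single-quoted*)
;; prints: ?x fti:match '"collection of people"'
(write-line *double-quoted*)
;; prints: ?x fti:match "\"collection of people\""
```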

Phrase Searches (note the mix of double and single quotes):

(sparql:run-sparql  
  "PREFIX fti: <http://franz.com/ns/allegrograph/2.2/textindex/>  
   SELECT ?x WHERE { ?x fti:match '\"collection of people\"' }"  
   :results-format :table)  
  
(sparql:run-sparql  
    "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
     PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
     SELECT ?x WHERE {  
      ?x fti:match '\"collection of people\"' .  
      ?x rdfs:subClassOf c:PersonWithOccupation  
     }"  
    :results-format :table)

Multiple fti:match predicates in a single query (here we use single quotes instead of double):

(sparql:run-sparql  
    "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
     PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
     SELECT ?x WHERE {  
        ?x fti:match 'people' .  
        ?x fti:match 'murder' .  
      }"  
 :results-format :table) 

And, finally, an example of fti:matchExpression:

(sparql:run-sparql  
    "PREFIX fti:  <http://franz.com/ns/allegrograph/2.2/textindex/>  
     PREFIX c:    <http://www.cyc.com/2002/04/08/cyc#>  
     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
     SELECT ?x WHERE {  
        ?x fti:matchExpression  
           '(and (or \"people\" \"person\") \"murder\")' .  
      }"  
 :results-format :table) 

Why a magic predicate?

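The motivation for providing a magic predicate is that SPARQL FILTERs cannot generate new bindings; they can only discard solutions that the rest of the pattern has already produced. In many cases generating new bindings is unnecessary, because a triple pattern already binds the variable being tested: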

SELECT ?name {  
    ?x foaf:name ?name .  
    FILTER (regex(?name, "John", "i"))  
} 

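but this is not always true. With free-text search we want the index itself to supply the bindings for the subject variable, rather than enumerating every candidate triple and filtering afterwards. There is also precedent for the magic predicate approach in other implementations.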