cmis search: jon doe, john p. doe, john doe

Hi all,
I'm wondering if someone has already found an appropriate approach for dealing with search/score for names like these below with CMIS, such that a search phrase "jon doe" or "john doe" returns all three phrases (i.e. all documents that have cmis:name of any of these 3):

    Jon Doe
    John Doe
    John P. Doe

Using CONTAINS() in the query is not working as expected: if I search only for the first name 'john' I get two of the three, but in every other scenario I get only one of the three in this example. (For anyone reviewing this thread for CONTAINS usage: wrapping the phrase in quotes is what allows spaces in it.)

ItemIterable<QueryResult> results = session.query("SELECT * FROM cmis:document  WHERE  CONTAINS ( 'cmis:name: \"" + fullname + " \"')", false); 

(SELECT * is used for brevity; I would include SCORE() and other specific fields if possible.)

Standard CMIS would be preferred, but if that's not possible, that's fine too; I'm just trying to figure out the appropriate approach.

references used:
http://wiki.alfresco.com/wiki/Full_Text_Search_Query_Syntax
http://docs.alfresco.com/4.0/index.jsp?topic=%2Fcom.alfresco.enterprise.doc%2Fconcepts%2Frm-searchsyntax-APIs.html
http://docs.oasis-open.org/cmis/CMIS/v1.0/os/cmis-spec-v1.0.html#_Toc243905420

-Darren

Re: cmis search: jon doe, john p. doe, john doe

This would do it:
SELECT * FROM cmis:document  WHERE  cmis:name like 'Jo%Doe'

So would this:
SELECT * FROM cmis:document  WHERE  cmis:name = 'Jon Doe' or cmis:name = 'John Doe' or cmis:name = 'John P. Doe'
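
When the variants are known up front, the second query can be assembled from a list rather than typed by hand. A minimal sketch in plain Java, assuming the backslash escaping for string literals described in the CMIS 1.0 spec. The class and method names (NameVariantQuery, anyOfNames, escapeLiteral) are made up for illustration, and the resulting string would be handed to session.query(...) as in the first post.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OpenCMIS or Alfresco.
class NameVariantQuery {

    // Assumption: backslash and single quote are escaped with a backslash
    // inside a CMIS string literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    // Builds one equality test per known variant and joins them with OR.
    static String anyOfNames(List<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("at least one name is required");
        }
        String predicate = names.stream()
                .map(name -> "cmis:name = '" + escapeLiteral(name) + "'")
                .collect(Collectors.joining(" OR "));
        return "SELECT * FROM cmis:document WHERE " + predicate;
    }

    public static void main(String[] args) {
        System.out.println(anyOfNames(List.of("Jon Doe", "John Doe", "John P. Doe")));
    }
}
```

This only helps when the list of spellings is already known; it does nothing for a variant nobody anticipated.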

Jeff

Chief Community Officer
Alfresco Software
Blog: ecmarchitect.com | Twitter: jeffpotts01
CMIS APIs: Apache Chemistry | CMIS and Apache Chemistry in Action
Alfresco tutorials: Alfresco Developer Series

Re: cmis search: jon doe, john p. doe, john doe

Thanks Jeff, but I may have worded the question poorly.

Given a full-name phrase that a user enters, such as "john doe", how can I pass that phrase to a query so that it returns similar documents? For that phrase, the user would expect documents containing any of the following:

Jon Doe
John P Doe
John Doe

Re: cmis search: jon doe, john p. doe, john doe

The CMIS query examples I gave you query against the name property. If instead you want to search the full-text contents, you can use the CONTAINS keyword in a CMIS query language query. So, for example:
SELECT * FROM cmis:document WHERE CONTAINS('John Doe')

Returns the documents that contain "John Doe" in the text. In my test, I also get a hit for a document containing "John P. Doe".

The CMIS spec (and the underlying Lucene engine embedded in Alfresco) supports wildcarding in full-text searches. So you might expect to be able to search for:
SELECT * FROM cmis:document WHERE CONTAINS('Jo*Doe')

to get back documents containing the names you listed.

But in my test on 4.2.c Community Edition with Lucene (not SOLR) this returns zero hits. I think it is because of the word break between "John" and "Doe".

So, refining the search to:
SELECT * FROM cmis:document WHERE CONTAINS('Doe') and CONTAINS('Jo*')

I get hits for docs containing Jon Doe, John P Doe, and John Doe.
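
The same split can be derived from whatever the user typed. A rough sketch, not a definitive implementation: it assumes the last whitespace-separated token is the surname, shortens the first token to a prefix plus * (the prefix length of 2 is an arbitrary choice that happens to produce Jo* here), drops anything in between, and strips punctuation so nothing needs escaping. PrefixNameQuery and its methods are hypothetical names.

```java
// Hypothetical helper, not part of OpenCMIS or Alfresco.
class PrefixNameQuery {

    // Keeps letters and digits only, so the token needs no escaping.
    static String clean(String token) {
        return token.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Assumption: first token = given name, last token = surname,
    // middle names and initials are ignored.
    static String build(String fullName, int prefixLength) {
        String[] tokens = fullName.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("expected a first and a last name");
        }
        String first = clean(tokens[0]);
        String last = clean(tokens[tokens.length - 1]);
        if (first.isEmpty() || last.isEmpty()) {
            throw new IllegalArgumentException("name tokens must contain letters or digits");
        }
        String prefix = first.substring(0, Math.min(prefixLength, first.length()));
        return "SELECT * FROM cmis:document WHERE CONTAINS('" + last
                + "') AND CONTAINS('" + prefix + "*')";
    }

    public static void main(String[] args) {
        System.out.println(build("John P. Doe", 2));
    }
}
```

A short prefix is deliberately loose: Jo* would also pick up names such as Joan or Joseph, so the prefix length trades recall against noise.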

Jeff

Re: cmis search: jon doe, john p. doe, john doe

Thanks for looking into this, Jeff. Unfortunately, it looks like this is better handled during the content model design phase, by making sure 'first name' and 'last name' are distinct fields if the name is a critical search field for documents that only have one name.

If a document has multiple names, or you are running full-text search on an OCR'd document, you just have to manage expectations (particularly with multi-part names with a middle initial, hyphenated last names, generational suffixes, etc.) :-)

-D

Re: cmis search: jon doe, john p. doe, john doe

Hi

You should be able to combine phrase slop and wildcarding:

CONTAINS('cmis:name:\\'Jo*n Doe\\'~1')

~2 will allow more of a gap, but will also allow the tokens in reverse order, without requiring them to be next to each other.
(~1 will match tokens at the same position, which can be odd when using Lucene as opposed to SOLR.)

~1 allows a token to be out of place by one; it takes two moves to reverse the order.
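
For completeness, a small sketch that assembles that CONTAINS clause for a given phrase and slop value. It assumes the inner quotes reach the repository as a backslash followed by a single quote, and that the phrase itself contains no quotes or backslashes. SlopQuery and nameWithSlop are hypothetical names, and the effect of each slop value is as described above, not something this code checks.

```java
// Hypothetical helper, not part of OpenCMIS or Alfresco.
class SlopQuery {

    // Wraps the phrase in escaped single quotes and appends ~slop.
    static String nameWithSlop(String phrase, int slop) {
        if (slop < 0) {
            throw new IllegalArgumentException("slop must not be negative");
        }
        if (phrase.contains("'") || phrase.contains("\\")) {
            throw new IllegalArgumentException("phrase must not contain quotes or backslashes");
        }
        // "\\'" in Java source is the two characters \' in the query text.
        return "SELECT * FROM cmis:document WHERE CONTAINS('cmis:name:\\'"
                + phrase + "\\'~" + slop + "')";
    }

    public static void main(String[] args) {
        System.out.println(nameWithSlop("Jo*n Doe", 1));
    }
}
```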

Hope this helps
Span queries etc. are still on the list ....

Andy

Andy Hind
Alfresco Development