• Plone Summer of code 2008 discussion

  • Leveraging the Plone external indexing and searching story - Take II

    from deo on Apr 06, 2008 10:33 PM
    Hello folks,
    
    I finally found the time to join and write down some
    of the ideas floating in my head, so here goes:
    
    Leveraging the Plone external indexing and searching story
    ==========================================================
    
    :Author: Dorneles Treméa
    :Contact: dorneles@...
    :Date: 2008-04-06
    :Version: 0.1
    
    .. contents::
    
    
    Abstract
    --------
    
    The main goal of this proposal is to improve Plone's default
    indexing and searching functionality (which is currently
    tightly coupled to ``portal_catalog``) so that we can use external
    search servers, like `Solr`_ and `Xapian`_ (just to cite two
    known examples), while keeping compatibility with the current
    code base. We should be able to:
    
    - make the indexing/searching more scalable
    
    - make the indexed data easily accessible by other Plone sites
      and/or even applications written in other languages (Java/C#/...)
    
    - improve resource allocation: one shared cache, instead of one
      cache per Zope instance
    
    
    Detailed Description
    --------------------
    
    Motivation
    **********
    
    Plone's indexing and searching mechanisms are **tightly** coupled to
    the ``portal_catalog`` implementation, which uses the ``ZODB`` as
    the storage layer.
    
    This has some drawbacks:
    
    1) when a search is made using the ``portal_catalog``, the catalog
       data objects need to be loaded into memory, which may cause
       existing active objects to be deactivated; that's one of the
       reasons why it's recommended to mount the ``portal_catalog`` in
       a separate ``ZODB`` file
    
    2) to scale you need to use ``ZEO``, and each Zope client
       keeps its own memory cache, which *wastes resources*
    
    3) due to its nature, indexing can be a **heavy/long**
       operation, causing *read* and/or *write* conflicts; this is a
       problem that `PloneQueueCatalog`_ tried to address in the past
    
    4) storing the data inside the ``ZODB`` makes it *harder* to
       share with external applications, especially non-Python ones
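
To make drawback 3 more concrete, here is a minimal sketch of deferred indexing, assuming operations can be collected during a request and processed once at the end. It is not the actual code of `PloneQueueCatalog`_ or any other product named in this proposal; ``IndexQueue`` and its method names are hypothetical.

```python
INDEX, UNINDEX = "index", "unindex"


class IndexQueue:
    """Hypothetical queue that defers and de-duplicates index operations."""

    def __init__(self):
        # uid -> last requested operation; a dict keeps first-seen order
        self._pending = {}

    def index(self, uid):
        self._pending[uid] = INDEX

    def unindex(self, uid):
        self._pending[uid] = UNINDEX

    def flush(self, process):
        """Call process(operation, uid) once per queued uid.

        Returns the number of operations handed to ``process``.
        """
        pending, self._pending = self._pending, {}
        for uid, operation in pending.items():
            process(operation, uid)
        return len(pending)


# Usage: repeated requests for the same object collapse into one.
queue = IndexQueue()
queue.index("doc-a")
queue.index("doc-a")
queue.index("doc-b")
queue.unindex("doc-a")

processed = []
count = queue.flush(lambda operation, uid: processed.append((operation, uid)))
```

Only the last operation requested for each object reaches the backend, so the expensive work happens once per object instead of once per change.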
    
    In recent years some good search engine solutions have started to
    appear. Two promising ones are `Lucene`_ and `Xapian`_.
    Plone should work with these external tools in a way that avoids the
    drawbacks above, and not only with those two options but also with
    anything else that fulfills a set of requirements.
    
    Focus Areas
    ***********
    
    These particular requirements need to be defined and Plone needs to be
    improved to allow the flexibility required. Some of the areas affected
    by this proposal are:
    
    - Search
    - LiveSearch
    - Advanced Search
    - Navigation Portlet
    - Topics/SmartFolders
    
    All of them need to be improved to use a central indexing/searching
    mechanism, which in turn will be pluggable (and so, extensible).
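
The pluggable mechanism described above could look roughly like the following sketch. All names here (``SearchBackend``, ``InMemoryBackend``, ``SearchTool``) are hypothetical and only illustrate the shape of such a contract; the real interfaces are still to be defined, and a real backend would wrap ``portal_catalog`` or talk to an external server instead of a dictionary.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical contract that every pluggable engine would implement."""

    @abstractmethod
    def index(self, uid, data):
        """Store or update the indexable fields of one content item."""

    @abstractmethod
    def unindex(self, uid):
        """Forget one content item."""

    @abstractmethod
    def search(self, **query):
        """Return sorted uids whose fields equal every given value."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in for a real engine, used only for illustration."""

    def __init__(self):
        self._docs = {}

    def index(self, uid, data):
        self._docs[uid] = dict(data)

    def unindex(self, uid):
        self._docs.pop(uid, None)

    def search(self, **query):
        return sorted(
            uid
            for uid, doc in self._docs.items()
            if all(doc.get(key) == value for key, value in query.items())
        )


class SearchTool:
    """Single entry point that Search, LiveSearch, Topics, etc. would call."""

    def __init__(self, backend):
        self.backend = backend

    def index(self, uid, data):
        self.backend.index(uid, data)

    def unindex(self, uid):
        self.backend.unindex(uid)

    def search(self, **query):
        return self.backend.search(**query)


# Usage: callers only ever see SearchTool, never the engine behind it.
tool = SearchTool(InMemoryBackend())
tool.index("doc-1", {"portal_type": "Document", "review_state": "published"})
tool.index("doc-2", {"portal_type": "Document", "review_state": "private"})
tool.index("img-1", {"portal_type": "Image", "review_state": "published"})
documents = tool.search(portal_type="Document")
published = tool.search(review_state="published")
```

Swapping the engine then means passing a different ``SearchBackend`` to ``SearchTool``, while the calling code stays unchanged.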
    
    There are already some initial projects (especially for `Solr`_, an
    enterprise search server based on `Lucene`_) that partly address
    the issues raised by this proposal:
    
    - in the indexing area: `enfold.indexing`_/`collective.indexing`_ (and
      more recently `z3c.indexing.dispatch`_); both are generic and are at
      an advanced stage
    
    - in the searching area: `enfold.solr`_/`collective.solr`_; both are
      `Solr`_ specific and are at an intermediate stage
    
    - in the integration area: `SolrIntegration`_, which is `Solr`_ specific
      and is at an initial stage
    
    Deliverables
    ************
    
    A good number of interfaces need to be defined, lots of tests need
    to be written and fine-grained integration work needs to be done.
    
    Goals
    *****
    
    The primary goal is to have a feature-complete implementation, working
    out-of-the-box both with the standard ``portal_catalog`` and also with
    a `Solr`_ server. If time permits, I'll also work on the `Xapian`_
    integration.
    
    About Me
    ********
    
    Hello! I'm Dorneles ``deo`` Treméa, a Brazilian guy living in
    Garibaldi (at the extreme south of the country) with his lovely
    wife and two wonderful daughters.
    
    I've been in touch with Plone since before the 1.0 version was
    released (yeah, at some point back in 2002...), so you probably know
    me from the Plone mailing lists or IRC channels, or even personally!
    
    For the past 3 years I worked with the folks from Jarn (formerly
    known as Plone Solutions), where I made great friends and had a lot of
    fun working directly with Alexander Limi, Geir Baekholt, Helge Tesdal,
    Stefan Holek, Florian Schulz, Denis Mishunov, Martijn Pieters and
    Wichert Akkerman. At the end of 2007 I joined Enfold Systems, to help
    Alan Runyan and his gang with the challenges of integrating Plone with
    heterogeneous environments.
    
    I also currently hold the Administrative Director position at the
    `Brazilian Python Association (APyB)`_ and I'm the CEO of X3ng, one
    of the pioneering Plone companies in Brazil.
    
    Talking about the GSoC, I was one of the three original Plone mentors
    in 2006, but unfortunately my student didn't successfully complete
    his project. This year my postgraduate proposal was accepted by the
    university and I decided to apply as a student, so here I am... ;-)
    
    It would be great to be mentored by anyone with interest in this
    particular area, including the authors of all cited products. In
    truth, I would like to be multi-mentored to make sure the results
    match the expectations of the whole Plone Community!
    
    .. _Lucene: http://lucene.apache.org/
    .. _Xapian: http://www.xapian.org/
    .. _Solr: http://lucene.apache.org/solr
    .. _PloneQueueCatalog: http://dev.plone.org/collective/browser/PloneQueueCatalog
    .. _enfold.indexing: https://svn.enfoldsystems.com/browse/public/enfold.solr/trunk/enfold.indexing
    .. _collective.indexing: http://dev.plone.org/collective/browser/collective.indexing
    .. _z3c.indexing.dispatch: http://svn.zope.org/z3c.indexing.dispatch
    .. _enfold.solr: https://svn.enfoldsystems.com/browse/public/enfold.solr/trunk/enfold.solr
    .. _collective.solr: http://dev.plone.org/collective/browser/collective.solr
    .. _SolrIntegration: https://svn.enfoldsystems.com/browse/public/enfold.solr/trunk/SolrIntegration
    .. _Brazilian Python Association (APyB): http://associacao.pythonbrasil.org/
    
    -- 
    
    Dorneles Treméa
    X3ng Web Technology
    http://nosleepforyou.blogspot.com
    
  • Re: Leveraging the Plone external indexing and searching story - Take II

    from fschulze on Apr 07, 2008 06:14 AM
    > Hello folks,
    >
    > I finally found the time to join and write down some
    > of the ideas floating in my head, so here goes:
    
    Does your proposal aim to define proper interfaces to let the search,  
    advanced search, livesearch and collection/topics be agnostic to the  
    underlying engine? Basically the integration part of any search engine  
    into Plone?
    
    If so, then I like this proposal!
    
    Regards,
    Florian Schulze
    
    
    • Re: Leveraging the Plone external indexing and searching story - Take II

      from deo on Apr 07, 2008 10:38 AM
      Hey Florian,
      
      > Does your proposal aim to define proper interfaces to let the search,  
      > advanced search, livesearch and collection/topics be agnostic to the  
      > underlying engine? Basically the integration part of any search engine  
      > into Plone?
      > 
      > If so, then I like this proposal!
      
      yes sir! :-)
      
      -- 
      
      Dorneles Treméa
      X3ng Web Technology
      http://nosleepforyou.blogspot.com