1
Writing a Perl XS swig interface to the CLucene C++ text search engine
  • Peter Edwards
2
Introduction
  • Peter Edwards ~ background
  • Subject ~ writing a Perl XS swig interface to the CLucene C++ text search engine
3
Aims
  • Give an idea of the process involved in selecting and using an external library from Perl
  • Introduction to extending Perl using XS, swig, GNU autotools
  • Entertainment
  • Audience: What is your background and interest?
4
Topics
  • Understanding the Problem
  • The Answer (at a high level)
  • Technical Options
  • Investigating Options
  • Writing a perl / C++ Interface
  • Layers and Components
  • Lessons Learned
5
Terms
  • Perl ~ Pathologically Eclectic Rubbish Lister
    $_ = "wftedskaebjgdpjgidbsmnjgc";
    tr/a-z/oh, turtleneck Phrase Jar!/; print;
  • Perl XS ~ eXternal Subroutine
    allows a perl program to call a C language subroutine
    XS is also the “glue” language specifying the calling interface
    contains complex “perlguts” stuff that will destroy your sanity
  • SWIG ~ Simplified Wrapper and Interface Generator
    makes it easy to call a C/C++ library from many languages (perl, python, ruby, PHP…)
  • C++ ~ Object Oriented version of C programming language
  • text search ~ boolean searching of stemmed words, wildcards
  • CLucene ~ C++ text search engine based on Java Lucene


6
Understanding the Problem
  • Recruitment software written in Perl
  • 20,000+ candidate Word CVs/resumes
  • Boolean searching using words or partial words and wildcards
    e.g. (“BA” or “MA”) and “literature”
  • Combined with SQL searching
    e.g. geographic area, skill profile codes, pay rate
  • Speed < 2 seconds
  • Old system used dtSearch proprietary s/w
7
The Answer (at a high level)
  • Load
    • Convert candidate CVs from Word to text using wvWare (OpenOffice) converter
    • Index text against candidate no.
  • Search
    • Search text -> cand nos -> SQL temp table
    • Normal SQL search on other criteria
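
The search flow above can be sketched as plain set operations on candidate numbers. This is only an illustration of the idea, not the production code: the names `CandSet`, `either` and `both` are hypothetical, and the sets stand in for what the text index and the SQL query would each return.

```cpp
#include <algorithm>
#include <cassert>   // assert() for the usage checks
#include <iterator>
#include <set>

// Candidate numbers returned by one search term or one SQL filter.
using CandSet = std::set<int>;

// Boolean OR: candidates matching either term.
CandSet either(const CandSet& a, const CandSet& b) {
    CandSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// Boolean AND: candidates matching both terms. The same operation
// combines the text-search hits with the SQL search results.
CandSet both(const CandSet& a, const CandSet& b) {
    CandSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}
```

For the query ("BA" or "MA") and "literature", the text hits would be `both(either(ba, ma), literature)`, and a further `both(text_hits, sql_hits)` plays the role of joining against the SQL temp table.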
8
Technical Options (at 2003/4)
  • Proprietary
    • dtSearch ~ cost; hard to get cand nos out; Windows interface when perl app is Web
  • Open Source
    • Java Lucene ~ slow but good API and power
    • C++ CLucene ~ alpha quality rewrite of Lucene in Visual C++ as degree project by Ben van Klinken
    • Perl CPAN (PLucene etc.) below
      http://search.cpan.org/modlist/String_Language_Text_Processing
9
Investigating Perl Options
  • Wrote test harness to load 1000 CVs then do some searches
  • Tried about 5 CPAN modules
  • PLucene search speed okay for small volumes but exponential increase in insert time
    >60 seconds per insert
  • Why? Tokenises doc, multi-lingual word stemming, adds doc id to reverse lookup index for each stem token
  • Other modules faster but search options weak
  • Need to look further
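
To make the "reverse lookup index" idea concrete, here is a toy sketch of what an insert involves: tokenise the document, then add the document id under every token. It is an assumption-laden illustration, not how PLucene or CLucene are implemented; the class name `InvertedIndex` is hypothetical and stemming is left out entirely.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>

// Toy inverted index: maps each lower-cased token to the set of document
// ids containing it. Real engines also stem tokens; that step is omitted.
class InvertedIndex {
public:
    // Split `text` on non-alphanumeric characters and record `doc_id`
    // against every token found.
    void add_document(int doc_id, const std::string& text) {
        assert(doc_id >= 0);  // assumption: ids are non-negative
        std::string token;
        for (unsigned char ch : text) {
            if (std::isalnum(ch)) {
                token.push_back(static_cast<char>(std::tolower(ch)));
            } else if (!token.empty()) {
                postings_[token].insert(doc_id);
                token.clear();
            }
        }
        if (!token.empty()) postings_[token].insert(doc_id);
    }

    // Return the ids of all documents containing `word` (empty if none).
    std::set<int> lookup(const std::string& word) const {
        std::string key;
        for (unsigned char ch : word)
            key.push_back(static_cast<char>(std::tolower(ch)));
        std::map<std::string, std::set<int> >::const_iterator it =
            postings_.find(key);
        if (it == postings_.end()) return std::set<int>();
        return it->second;
    }

private:
    std::map<std::string, std::set<int> > postings_;
};
```

Each insert touches one postings entry per distinct token, so the cost of an insert depends heavily on how the postings are stored and rewritten on disk, which is where implementations differ.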
10
Investigating CLucene
  • Wrote similar C++ test harness
  • Speed good: search 20,000 CVs <1 second
    load 3 CVs per sec (mostly Word->text)
  • Code written as VC++ degree project and registered at SourceForge
  • Jimmy Pritts changed layout and added GNU autoconf files (configure.ac and Makefile.in) to let it build cross-platform on Windows, cygwin, Linux
  • Had C DLL interface used by PHP wrapper
  • Decided to write Perl wrapper
11
Interfacing Perl to C++
  • When I wrote this wrapper, Perl to C++ interfacing via XS or SWIG was tricky. Despite the optimism expressed at http://www.johnkeiser.com/perl-xs-c++.html, I had difficulties mapping the CLucene API to XS
  • Reasons: C++ name mangling; object and method mapping; C++ memory management (who owns and frees each object)
  • So I decided to go via the C DLL wrapper to hide this complexity
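
The general technique of hiding C++ behind a flat C interface can be sketched as below. This is not the CLucene DLL source: `toy::Index`, `toy_open`, `toy_add` and `toy_close` are hypothetical names, and the only detail borrowed from the real wrapper is the idea of passing an integer "resource" handle instead of an object pointer.

```cpp
#include <cassert>   // assert() for the usage checks
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical C++ class standing in for a search-engine object.
namespace toy {
class Index {
public:
    explicit Index(std::string path) : path_(std::move(path)), count_(0) {}
    void add(const std::string&) { ++count_; }
    int doc_count() const { return count_; }
private:
    std::string path_;
    int count_;
};
}  // namespace toy

namespace {
// Open objects live here; callers only ever see the integer key.
std::map<int, std::unique_ptr<toy::Index> > g_open;
int g_next_handle = 1;
}  // namespace

// extern "C" switches off C++ name mangling, so the symbols can be
// called from C, and therefore from XS or SWIG-generated C code.
extern "C" {

// Returns a positive handle, or -1 on error.
int toy_open(const char* path) {
    if (!path) return -1;
    int handle = g_next_handle++;
    g_open[handle].reset(new toy::Index(path));
    return handle;
}

// Adds one document; returns the new document count, or -1 on error.
int toy_add(int resource, const char* text) {
    std::map<int, std::unique_ptr<toy::Index> >::iterator it =
        g_open.find(resource);
    if (it == g_open.end() || !text) return -1;
    it->second->add(text);
    return it->second->doc_count();
}

// Destroys the object behind the handle; returns 0 on success.
int toy_close(int resource) {
    return g_open.erase(resource) == 1 ? 0 : -1;
}

}  // extern "C"
```

The three difficulties listed above each have an answer in this shape: no mangled names are exported, methods become plain functions taking a handle, and object lifetime is managed entirely on the C++ side. A real wrapper would also need to stop C++ exceptions from escaping through the C functions.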
12
Perl XS
  • Always start with h2xs utility
  • Code is C with macro extensions
  • Write C code (XSUBs)
  • Call internal Perl routines (perlguts) to create variables, allocate arrays…
     newSViv(IV), sv_setiv(SV*, IV) ~ scalar integer variable
  • Complicated
  • Nyarlathotep / “Crawling Chaos”


13
Enter SWIG
  • Creates XS for you from a .i definition file
  • Parses C/C++ .h header files to get types and function prototypes
  • Allows for inline C/XS code
14
Swig XS Sample
  • From argv.i


    // Creates a new Perl array and places a NULL-terminated char ** into it
    %typemap(out) char ** {
            AV *myav;
            SV **svs;
            int i = 0,len = 0;
            /* Figure out how many elements we have */
            while ($1[len])
               len++;
            svs = (SV **) malloc(len*sizeof(SV *));
            for (i = 0; i < len ; i++) {
                svs[i] = sv_newmortal();
                sv_setpv((SV*)svs[i],$1[i]);
            };
            myav =  av_make(len,svs);
            free(svs);
            $result = newRV((SV*)myav);
            sv_2mortal($result);
            argvi++;
    }


15
Diagram of Layers
16
CLucene C++ Interface
  • src/CLucene/search/SearchHeader.h:
    #include "CLucene/StdHeader.h"
    #ifndef _lucene_search_SearchHeader_
    #define _lucene_search_SearchHeader_

    #include "CLucene/index/IndexReader.h"
    …
    using namespace lucene::index;
    namespace lucene{ namespace search{

        //predefine classes
        class Searcher;
        class Query;
        class Hits;

        class HitDoc {
          public:
            float_t score;
            int_t id;
            lucene::document::Document* doc;

            HitDoc* next;    // in doubly-linked cache
            HitDoc* prev;    // in doubly-linked cache

            HitDoc(const float_t s, const int_t i);
            ~HitDoc();
        };
17
CLucene C DLL Interface
  • src/wrappers/dll/clucene_dll.h:
    #ifndef _DLL_CLUCENE
    #define _DLL_CLUCENE
    #include "CLucene/CLConfig.h"
    …
    #ifdef _UNICODE
    //unicode methods
    # define CL_UNLOCK CL_U_Unlock
    # define CL_OPEN CL_U_Open
    # define CL_DOCUMENT_INFO CL_U_Document_Info
    # define CL_ADD_FILE CL_U_Add_File
    …
    CLUCENEDLL_API int CL_U_Unlock(const wchar_t* dir);
    CLUCENEDLL_API int CL_U_Delete(const int resource, const wchar_t* query,
                                   const wchar_t* field);
    CLUCENEDLL_API int CL_U_Add_Field(const int resource, const wchar_t* field,
                                      const wchar_t* value, const int value_length,
                                      const int store, const int index, const int token);
    …


18
SWIG Definition File clucene.i
    %module "FulltextSearch::CLuceneWrap"
    %{
    #include "clucene_dllp.h"
    %}
    // our definitions for CLucene variables and functions
    %include "clucene_perl.h"
    //%include "clucene_dll.h" // could use this but then would need to call CL_N_Search not CL_SEARCH etc.

    %include typemaps.i

    %include argv.i

    // helper functions where pointers to result buffers are expected
    // would be better done with a %typemap(out) if I knew enough about perlguts

    %inline %{

    int val_len;
    char * val;

    int CL_GetField1(int resource, char * field)
    {
            return CL_GETFIELD(resource,field,&val,&val_len);
    }
    …
    %}
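
The helper pattern above (call a function that fills result buffers, then read the buffers afterwards) can be shown in isolation. This is a sketch under assumptions: `fake_get_field` is a hypothetical stand-in for a C API with out-parameters, not the real CL_GETFIELD, and `get_field1` and `get_field_value` are illustrative names only.

```cpp
#include <cassert>   // assert() for the usage checks
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style API: writes a pointer and a length through
// out-parameters and returns 0 on success, -1 on failure.
static const char kStoredName[] = "Peter Edwards";

extern "C" int fake_get_field(int resource, const char* field,
                              const char** value, int* value_len) {
    if (resource != 1 || !field || std::strcmp(field, "name") != 0) return -1;
    *value = kStoredName;
    *value_len = static_cast<int>(std::strlen(kStoredName));
    return 0;
}

// Pattern 1: module-level result buffers. The caller invokes the helper
// and then reads `val` and `val_len`. Simple, but not re-entrant.
const char* val = nullptr;
int val_len = 0;

int get_field1(int resource, const char* field) {
    return fake_get_field(resource, field, &val, &val_len);
}

// Pattern 2: copy the result into a value owned by the caller, which
// avoids the shared globals.
bool get_field_value(int resource, const char* field, std::string& out) {
    const char* buf = nullptr;
    int len = 0;
    if (fake_get_field(resource, field, &buf, &len) != 0) return false;
    out.assign(buf, static_cast<std::size_t>(len));
    return true;
}
```

The global-buffer version is the easier one to expose to a scripting language because every argument is a plain scalar; the price is that each call overwrites the previous result.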
19
SWIG-Generated Perl Module CLuceneWrap.pm
    # This file was automatically generated by SWIG
    package FulltextSearch::CLuceneWrap;
    require Exporter;
    require DynaLoader;
    @ISA = qw(Exporter DynaLoader);
    package FulltextSearch::CLuceneWrapc;
    bootstrap FulltextSearch::CLuceneWrap;
    package FulltextSearch::CLuceneWrap;
    @EXPORT = qw( );

    # ---------- BASE METHODS -------------

    package FulltextSearch::CLuceneWrap;

    sub TIEHASH {
        my ($classname,$obj) = @_;
        return bless $obj, $classname;
    }

    sub CLEAR { }
    …
20
SWIG-Generated XS clucene_wrap.c
    #ifdef __cplusplus
    extern "C" {
    #endif
    XS(_wrap_CL_OPEN) {
        {
            char *arg1 ;
            int arg2 = (int) 1 ;
            int result;
            int argvi = 0;
            dXSARGS;

            if ((items < 1) || (items > 2)) {
                SWIG_croak("Usage: CL_OPEN(path,create);");
            }
            if (!SvOK((SV*) ST(0))) arg1 = 0;
            else arg1 = (char *) SvPV(ST(0), PL_na);
            if (items > 1) {
                arg2 = (int) SvIV(ST(1));
            }
            result = (int)CL_OPEN(arg1,arg2);

            ST(argvi) = sv_newmortal();
            sv_setiv(ST(argvi++), (IV) result);
            XSRETURN(argvi);
            fail:
            ;
        }
        croak(Nullch);
    }
21
CLucene.pm Perl OO Wrapper
  • Back into the realms of sanity
  • Normal OO package with methods
  • Calls XS wrapper functions
22
Build Environment
  • Uses GNU autotools and m4 macro processor
  • Definition files
    • configure.ac ~ top level build definitions
    • Makefile.am ~ makefile flags definitions
  • Programs
    • libtool ~ generalised library building
    • aclocal ~ builds aclocal.m4 from configure.ac
    • autoconf ~ reads configure.ac to create configure script
    • autoheader ~ creates the config.h.in template of C #defines that configure fills in
    • automake ~ creates Makefile.in from Makefile.am
    • autoreconf ~ reruns the above tools as needed to remake the whole tree of GNU build files
23
Bootstrap shell script
    #!/bin/sh
    # Bootstrap the CLucene installation.

    mkdir -p ./build/gcc/config
    set -x
    libtoolize --force --copy --ltdl --automake
    aclocal
    autoconf
    autoheader
    automake -a --copy --foreign


24
Autoconf configure.ac file
    dnl Process this file with autoconf to produce a configure script.
    dnl Written by Jimmy Pritts.

    dnl initialize autoconf and automake
    AC_INIT([clucene], [1])
    AC_PREREQ([2.54])
    AC_CONFIG_SRCDIR([src/CLucene.h])
    AC_CONFIG_AUX_DIR([./build/gcc/config])
    AC_CONFIG_HEADERS([config.h])
    AM_INIT_AUTOMAKE

    dnl Check for existence of a C and C++ compilers.
    AC_PROG_CC
    AC_PROG_CXX

    dnl Check for headers
    AC_HEADER_DIRENT

    dnl Configure libtool.
    AC_PROG_LIBTOOL

    dnl option to use UTF-8 as internal 8-bit charset to support characters in Unicode™
    AC_ARG_ENABLE(utf8,
     AC_HELP_STRING([--enable-utf8],[UTF-8 as internal 8-bit charset to support characters in Unicode™ (default=no)]),
     [AC_DEFINE([UTF8],[],[use UTF-8 as internal 8-bit charset to support characters in Unicode™])],enable_utf8=no)

    AM_CONDITIONAL(USEUTF8, test x$enable_utf8 = xyes)

    AC_CONFIG_FILES([Makefile src/Makefile examples/Makefile examples/demo/Makefile examples/tests/Makefile examples/util/Makefile wrappers/Makefile wrappers/dll/Makefile wrappers/dll/dlltest/Makefile])
    AC_OUTPUT


25
Makefile.am files
26
Recap
  • We saw how and why I selected an external library to use from Perl
  • We looked at GNU autotools to provide a cross-platform build environment
  • We investigated the layers of code needed to interface perl to a C++ library ~ SWIG, C, XS inline helpers, low and high level Perl modules
27
Lessons Learned
  • Start off a new external library using GNU autotools and keeping in mind that the API should be easy to use through SWIG
  • Use SWIG not XS to wrap a C/C++ library
  • Always use h2xs to start a Perl extension
  • Open Source feedback and testing are more valuable than you expect (2 emails this week alone)


28
Where to Get More Information
  • Perl XS http://en.wikipedia.org/wiki/XS_%28Perl%29
    http://www.perl.com/doc/manual/html/pod/perlguts.html
  • C++ / XS http://www.johnkeiser.com/perl-xs-c++.html
  • SWIG http://en.wikipedia.org/wiki/SWIG
    http://www.swig.org/
  • Lucene http://en.wikipedia.org/wiki/Lucene
  • CLucene http://sourceforge.net/projects/clucene/
  • Autoconf http://www.gnu.org/software/autoconf/
  • Book  “Extending and Embedding Perl”, Jenness & Couzens (Manning, 2002)


  • Any Questions?


  • These slides are at http://perl.dragonstaff.com/