|
1
|
|
|
2
|
- Peter Edwards ~ background
- Subject ~ writing a Perl XS/SWIG interface to the CLucene C++ text
search engine
|
|
3
|
- Give an idea of the process involved in selecting and using an external
library from Perl
- Introduction to extending Perl using XS, SWIG, GNU autotools
- Entertainment
- Audience: What is your background and interest?
|
|
4
|
- Understanding the Problem
- The Answer (at a high level)
- Technical Options
- Investigating Options
- Writing a Perl / C++ Interface
- Layers and Components
- Lessons Learned
|
|
5
|
- Perl ~ Pathologically Eclectic Rubbish Lister
$_ = "wftedskaebjgdpjgidbsmnjgc";
tr/a-z/oh, turtleneck Phrase Jar!/; print;
- Perl XS ~ eXternal Subroutine
allows a Perl program to call a C language subroutine
XS is also the “glue” language specifying the calling interface
contains complex “perlguts” stuff that will destroy your sanity
- SWIG ~ Simplified Wrapper and Interface Generator
makes it easy to call a C/C++ library from many languages (Perl,
Python, Ruby, PHP…)
- C++ ~ Object Oriented version of C programming language
- text search ~ boolean searching of stemmed words, wildcards
- CLucene ~ C++ text search engine based on Java Lucene
|
|
6
|
- Recruitment software written in Perl
- 20,000+ candidate Word CVs/resumes
- Boolean searching using words or partial words and wildcards
e.g. (“BA” or “MA”) and “literature”
- Combined with SQL searching
e.g. geographic area, skill profile codes, pay rate
- Search response time under 2 seconds
- Old system used dtSearch proprietary software
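A boolean query such as (“BA” or “MA”) and “literature” comes down to set operations on the candidate numbers matched by each term. The following is a minimal sketch of that idea only, not how dtSearch or CLucene evaluate queries; the names `Postings`, `match_any` and `match_all` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

// A posting set: the candidate numbers whose CV contains a given term.
// (Hypothetical type name, for illustration only.)
using Postings = std::set<int>;

// OR: candidates that appear in either set.
Postings match_any(const Postings& a, const Postings& b) {
    Postings out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// AND: candidates that appear in both sets.
Postings match_all(const Postings& a, const Postings& b) {
    Postings out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}
```

With posting sets `ba`, `ma` and `lit`, the example query would be `match_all(match_any(ba, ma), lit)`.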
|
|
7
|
- Load
- Convert candidate CVs from Word to text using wvWare (OpenOffice)
converter
- Index text against candidate no.
- Search
- Search text -> cand nos -> SQL temp table
- Normal SQL search on other criteria
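The search step above can be sketched as generating SQL: the text engine returns candidate numbers, these go into a temporary table, and the ordinary SQL criteria are joined against it. This is a rough sketch under assumptions: the table and column names (`tmp_hits`, `candidate`, `cand_no`) are hypothetical, and a real application would use its database layer's bind parameters rather than building strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the statements that load text-search hits into a temporary table.
// tmp_hits and cand_no are hypothetical names, not from the real system.
std::vector<std::string> hit_table_sql(const std::vector<int>& cand_nos) {
    std::vector<std::string> sql;
    sql.push_back(
        "CREATE TEMPORARY TABLE tmp_hits (cand_no INTEGER PRIMARY KEY)");
    for (int n : cand_nos) {
        // Only integers are interpolated here, so no quoting is needed.
        sql.push_back("INSERT INTO tmp_hits (cand_no) VALUES (" +
                      std::to_string(n) + ")");
    }
    return sql;
}

// Combine the text hits with the other search criteria.
// where_clause is assumed to be trusted, application-generated SQL.
std::string combined_query_sql(const std::string& where_clause) {
    return "SELECT c.cand_no FROM candidate c "
           "JOIN tmp_hits h ON h.cand_no = c.cand_no WHERE " +
           where_clause;
}
```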
|
|
8
|
- Proprietary
- dtSearch ~ costly; hard to get candidate numbers out; Windows interface,
whereas the Perl app is web-based
- Open Source
- Java Lucene ~ slow but good API and power
- C++ CLucene ~ alpha quality rewrite of Lucene in Visual C++ as degree
project by Ben van Klinken
- Perl CPAN (PLucene etc.) below
http://search.cpan.org/modlist/String_Language_Text_Processing
|
|
9
|
- Wrote test harness to load 1000 CVs then do some searches
- Tried about 5 CPAN modules
- PLucene search speed okay for small volumes but exponential increase in
insert time (>60 seconds per insert)
- Why? Tokenises doc, multi-lingual word stemming, adds doc id to reverse
lookup index for each stem token
- Other modules faster but search options weak
- Need to look further
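To make the insert cost concrete, here is a toy version of the indexing step described above: tokenise the document, stem each word, and add the document id to the reverse lookup index under each stem. It is a sketch only and is not PLucene's or CLucene's code; `toy_stem` is a deliberately crude stand-in for real multi-lingual stemming, and all names are invented for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// stem -> ids of the documents containing it (the reverse lookup index).
using InvertedIndex = std::map<std::string, std::set<int>>;

// Crude stand-in for a stemmer: keep letters and digits, lower-case them,
// and drop one trailing "s" from words longer than three characters.
std::string toy_stem(const std::string& word) {
    std::string out;
    for (unsigned char c : word) {
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    }
    if (out.size() > 3 && out.back() == 's') out.pop_back();
    return out;
}

// Tokenise on whitespace and record doc_id under every stem found.
void add_document(InvertedIndex& index, int doc_id, const std::string& text) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        const std::string stem = toy_stem(word);
        if (!stem.empty()) index[stem].insert(doc_id);
    }
}
```

Every word of every document touches the index once, so how the index is stored and updated dominates insert time.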
|
|
10
|
- Wrote similar C++ test harness
- Speed good: search 20,000 CVs in <1 second;
load 3 CVs per second (time mostly spent on Word->text conversion)
- Code written as VC++ degree project and registered at SourceForge
- Jimmy Pritts changed layout and added GNU autoconf files configure.ac Makefile.in to let it build
cross-platform on Windows, cygwin, Linux
- Already had a C DLL interface, used by a PHP wrapper
- Decided to write Perl wrapper
|
|
11
|
- When I wrote this wrapper, interfacing Perl to C++ via XS or SWIG was
tricky. Despite the optimism expressed at http://www.johnkeiser.com/perl-xs-c++.html,
I had difficulty mapping the CLucene API to XS
- Reasons: C++ name mangling and namespaces; object and method mapping;
C++ memory management
- So I decided to go via the C DLL wrapper to hide this complexity
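The general shape of such a C wrapper is a set of `extern "C"` functions that hide the C++ objects behind integer handles, so the caller never sees classes, namespaces or mangled names. The sketch below illustrates the pattern only: `demo::Index` and the `DEMO_*` functions are hypothetical and are not the CLucene DLL API, and the handle table is an assumption about how such a wrapper might be organised.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for a C++ library class living in a namespace (hypothetical).
namespace demo {
class Index {
public:
    explicit Index(std::string path) : path_(std::move(path)) {}
    void add(const std::string&) { ++docs_; }
    int doc_count() const { return docs_; }
private:
    std::string path_;
    int docs_ = 0;
};
}  // namespace demo

namespace {
// handle -> live object; the owning pointers free objects on close.
std::map<int, std::unique_ptr<demo::Index>> g_open;
int g_next_handle = 1;
}  // namespace

extern "C" {

// Returns a positive handle, or -1 on failure.
int DEMO_Open(const char* path) {
    if (path == nullptr) return -1;
    try {
        const int handle = g_next_handle++;
        g_open[handle].reset(new demo::Index(path));
        return handle;
    } catch (...) {
        return -1;  // never let a C++ exception cross the C boundary
    }
}

// Returns 0 on success, -1 for an unknown handle or NULL text.
int DEMO_Add(int resource, const char* text) {
    auto it = g_open.find(resource);
    if (it == g_open.end() || text == nullptr) return -1;
    it->second->add(text);
    return 0;
}

// Returns the document count, or -1 for an unknown handle.
int DEMO_Count(int resource) {
    auto it = g_open.find(resource);
    return it == g_open.end() ? -1 : it->second->doc_count();
}

// Destroys the object behind the handle; returns 0 on success.
int DEMO_Close(int resource) {
    return g_open.erase(resource) == 1 ? 0 : -1;
}

}  // extern "C"
```

Because only `int` and `const char*` cross the boundary, a wrapper generator has nothing C++-specific to map, and object lifetime stays on the C++ side.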
|
|
12
|
- Always start with h2xs utility
- Code is C with macro extensions
- Write C code (XSUBs)
- Call internal Perl routines (perlguts) to create variables, allocate
arrays…
newSViv(IV), sv_setiv(SV*, IV) ~ scalar integer variable
- Complicated
- Nyarlathotep / “Crawling Chaos”
|
|
13
|
- Creates XS for you from a .i definition file
- Parses C/C++ .h header files to get types and function prototypes
- Allows for inline C/XS code
|
|
14
|
- From argv.i
    // Creates a new Perl array and places a NULL-terminated char ** into it
    %typemap(out) char ** {
        AV *myav;
        SV **svs;
        int i = 0,len = 0;
        /* Figure out how many elements we have */
        while ($1[len])
            len++;
        svs = (SV **) malloc(len*sizeof(SV *));
        for (i = 0; i < len ; i++) {
            svs[i] = sv_newmortal();
            sv_setpv((SV*)svs[i],$1[i]);
        };
        myav = av_make(len,svs);
        free(svs);
        $result = newRV((SV*)myav);
        sv_2mortal($result);
        argvi++;
    }
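Stripped of the Perl API calls, the typemap is a walk over a NULL-terminated `char **`: count the elements, then copy each string into a container. The same walk in plain C++ looks roughly like this; `to_vector` is a made-up helper for illustration and is not part of SWIG or CLucene.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy a NULL-terminated array of C strings into a vector of strings.
std::vector<std::string> to_vector(const char* const* items) {
    std::vector<std::string> out;
    if (items == nullptr) return out;

    // Figure out how many elements we have (same loop as the typemap).
    std::size_t len = 0;
    while (items[len] != nullptr) ++len;

    // Copy each string, as the typemap does with sv_setpv().
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) out.emplace_back(items[i]);
    return out;
}
```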
|
|
15
|
|
|
16
|
- src/CLucene/search/SearchHeader.h:
    #include "CLucene/StdHeader.h"
    #ifndef _lucene_search_SearchHeader_
    #define _lucene_search_SearchHeader_
    #include "CLucene/index/IndexReader.h"
    …
    using namespace lucene::index;
    namespace lucene{ namespace search{
        //predefine classes
        class Searcher;
        class Query;
        class Hits;
        class HitDoc {
        public:
            float_t score;
            int_t id;
            lucene::document::Document* doc;
            HitDoc* next;  // in doubly-linked cache
            HitDoc* prev;  // in doubly-linked cache
            HitDoc(const float_t s, const int_t i);
            ~HitDoc();
        };
|
|
17
|
- src/wrappers/dll/clucene_dll.h:
    #ifndef _DLL_CLUCENE
    #define _DLL_CLUCENE
    #include "CLucene/CLConfig.h"
    …
    #ifdef _UNICODE
    //unicode methods
    # define CL_UNLOCK CL_U_Unlock
    # define CL_OPEN CL_U_Open
    # define CL_DOCUMENT_INFO CL_U_Document_Info
    # define CL_ADD_FILE CL_U_Add_File
    …
    CLUCENEDLL_API int CL_U_Unlock(const wchar_t* dir);
    CLUCENEDLL_API int CL_U_Delete(const int resource, const wchar_t* query,
                                   const wchar_t* field);
    CLUCENEDLL_API int CL_U_Add_Field(const int resource, const wchar_t* field,
                                      const wchar_t* value, const int value_length,
                                      const int store, const int index,
                                      const int token);
    …
|
|
18
|
    %module "FulltextSearch::CLuceneWrap"
    %{
    #include "clucene_dllp.h"
    %}
    // our definitions for CLucene variables and functions
    %include "clucene_perl.h"
    //%include "clucene_dll.h" // could use this but then would need to call CL_N_Search not CL_SEARCH etc.
    %include typemaps.i
    %include argv.i
    // helper functions where pointers to result buffers are expected
    // would be better done with a %typemap(out) if I knew enough about perlguts
    %inline %{
    int val_len;
    char * val;
    int CL_GetField1(int resource, char * field)
    {
        return CL_GETFIELD(resource,field,&val,&val_len);
    }
    …
    %}
|
|
19
|
    # This file was automatically generated by SWIG
    package FulltextSearch::CLuceneWrap;
    require Exporter;
    require DynaLoader;
    @ISA = qw(Exporter DynaLoader);
    package FulltextSearch::CLuceneWrapc;
    bootstrap FulltextSearch::CLuceneWrap;
    package FulltextSearch::CLuceneWrap;
    @EXPORT = qw( );
    # ---------- BASE METHODS -------------
    package FulltextSearch::CLuceneWrap;
    sub TIEHASH {
        my ($classname,$obj) = @_;
        return bless $obj, $classname;
    }
    sub CLEAR { }
    …
|
|
20
|
    #ifdef __cplusplus
    extern "C" {
    #endif
    XS(_wrap_CL_OPEN) {
        {
            char *arg1 ;
            int arg2 = (int) 1 ;
            int result;
            int argvi = 0;
            dXSARGS;
            if ((items < 1) || (items > 2)) {
                SWIG_croak("Usage: CL_OPEN(path,create);");
            }
            if (!SvOK((SV*) ST(0))) arg1 = 0;
            else arg1 = (char *) SvPV(ST(0), PL_na);
            if (items > 1) {
                arg2 = (int) SvIV(ST(1));
            }
            result = (int)CL_OPEN(arg1,arg2);
            ST(argvi) = sv_newmortal();
            sv_setiv(ST(argvi++), (IV) result);
            XSRETURN(argvi);
            fail:
            ;
        }
        croak(Nullch);
    }
|
|
21
|
- Back into the realms of sanity
- Normal OO package with methods
- Calls XS wrapper functions
|
|
22
|
- Uses GNU autotools and m4 macro processor
- Definition files
- configure.ac ~ top level build definitions
- Makefile.am ~ makefile flags definitions
- Programs
- libtool ~ generalised library building
- aclocal ~ builds aclocal.m4 from configure.ac
- autoconf ~ reads configure.ac to create configure script
- autoheader ~ creates C header defines for configure
- automake ~ creates Makefile.in from Makefile.am
- autoreconf ~ manually remake whole tree of GNU build files
|
|
23
|
    #!/bin/sh
    # Bootstrap the CLucene installation.
    mkdir -p ./build/gcc/config
    set -x
    libtoolize --force --copy --ltdl --automake
    aclocal
    autoconf
    autoheader
    automake -a --copy --foreign
|
|
24
|
    dnl Process this file with autoconf to produce a configure script.
    dnl Written by Jimmy Pritts.
    dnl initialize autoconf and automake
    AC_INIT([clucene], [1])
    AC_PREREQ([2.54])
    AC_CONFIG_SRCDIR([src/CLucene.h])
    AC_CONFIG_AUX_DIR([./build/gcc/config])
    AC_CONFIG_HEADERS([config.h])
    AM_INIT_AUTOMAKE
    dnl Check for existence of a C and C++ compilers.
    AC_PROG_CC
    AC_PROG_CXX
    dnl Check for headers
    AC_HEADER_DIRENT
    dnl Configure libtool.
    AC_PROG_LIBTOOL
    dnl option to use UTF-8 as internal 8-bit charset to support characters in Unicode™
    AC_ARG_ENABLE(utf8,
        AC_HELP_STRING([--enable-utf8],[UTF-8 as internal 8-bit charset to support characters in Unicode™ (default=no)]),
        [AC_DEFINE([UTF8],[],[use UTF-8 as internal 8-bit charset to support characters in Unicode™])],enable_utf8=no)
    AM_CONDITIONAL(USEUTF8, test x$enable_utf8 = xyes)
    AC_CONFIG_FILES([Makefile src/Makefile examples/Makefile
        examples/demo/Makefile examples/tests/Makefile examples/util/Makefile
        wrappers/Makefile wrappers/dll/Makefile wrappers/dll/dlltest/Makefile])
    AC_OUTPUT
|
|
25
|
|
|
26
|
- We saw how and why I selected an external library to use from Perl
- We looked at GNU autotools to provide a cross-platform build environment
- We investigated the layers of code needed to interface perl to a C++
library ~ SWIG, C, XS inline helpers, low and high level Perl modules
|
|
27
|
- Start off a new external library using GNU autotools and keeping in mind
that the API should be easy to use through SWIG
- Use SWIG not XS to wrap a C/C++ library
- Always use h2xs to start a Perl extension
- Open Source feedback and testing are more valuable than you expect (2
emails this week alone)
|
|
28
|
- Perl XS http://en.wikipedia.org/wiki/XS_%28Perl%29
http://www.perl.com/doc/manual/html/pod/perlguts.html
- C++ / XS http://www.johnkeiser.com/perl-xs-c++.html
- SWIG http://en.wikipedia.org/wiki/SWIG
http://www.swig.org/
- Lucene http://en.wikipedia.org/wiki/Lucene
- CLucene http://sourceforge.net/projects/clucene/
- Autoconf http://www.gnu.org/software/autoconf/
- Book “Extending and Embedding Perl”, Jenness & Cozens (Manning, 2002)
- Any questions?
- These slides are at http://perl.dragonstaff.com/
|