Sunday, March 13, 2011
Search indexer on local desktop
Hi Foxes!
Have you ever wanted something like Windows Search or Google Desktop, but lighter, fast, open source and with a small memory footprint? I'm tired of Windows Search: I don't use it very often, although it is pretty cool when you need it.
So today I was looking for two kinds of tools:
- an open source indexer / search engine that works on filenames only, which will be the one I use most,
- one that indexes file contents, such as PDF, HTML, DOC, PPT, DjVu and so on.
Comparison links about desktop search engines
- very interesting, based on a database of Twitter content.
- list of open source indexers, from 2001
- Baza 2007 paper
- list of 20 popular desktop search engines, December 2010
- a huge list of search engine tools
- desktop search engine (Wikipedia list)
- reference management software (Wikipedia)
- relational database management system (Wikipedia)
- paper on information retrieval
- TREC website: the Text REtrieval Conference
- Stack Overflow: lightweight HTML search indexing
- Stack Overflow: list of tools for crawling the web
Indexer based only on the filename
I have found:
- everything is freeware, light and fast. It works only on Windows, with an NTFS filesystem.
- locate32 works like the Linux locate and updatedb commands. It is open source.
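The updatedb / locate idea that locate32 follows (scan the disk once, then answer every query from the stored list of paths) can be sketched in a few lines of Python. This is only a minimal illustration and not how everything or locate32 are actually implemented: the function names build_index and search are made up for the example, the index is a plain in-memory list, and matching is a case-insensitive substring test on the file name.

```python
import os


def build_index(root):
    """Walk `root` once and return the full path of every file (the 'updatedb' step)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


def search(index, pattern):
    """Return the indexed paths whose file name contains `pattern` (the 'locate' step)."""
    needle = pattern.lower()
    return [p for p in index if needle in os.path.basename(p).lower()]
```

Building the index is the slow part, so it is done once and reused; each search is then just a pass over the list of paths, with no disk access.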
Search engine based on the file content
Hot, worth having a look at!
- indri
- lucene, which is developed in pure Java. There are many implementations in other languages too, such as CLucene in C++. Open source and widely used.
- swish-e
- zettair only supports HTML and plain text, so basically no PDF. You have to convert everything to PS, then run the ps2ascii tool before indexing. Pretty tedious.
- doc fetcher, which in the background only records that a file has been updated. It parses the file again only when you run the application. It uses a GUI. But if you don't have much RAM, you may want to avoid running the JVM.
- strigi (old homepage) must be used with a backend engine like CLucene. It can work on WinXP, but you have to compile everything with Cygwin.
- datapark search is open source
- mind retrieve indexes only the web pages you have visited. It can be useful.
- mendeley, a collaborative indexer for PDFs and other documents (Wikipedia). Very useful if you want to write papers with many references. Mendeley is a free reference manager and academic social network.
- sphinx written in C++, works on WinXP
Other results
- Basilic, server side
- hyper estraier
- refdb, in C
- refbase in PHP
- BM25 ranking (Wikipedia) is present in lucene
- sino. It has no dependencies. You can compile it as is, even under Windows. Simple to use, but when I tried to index a folder, it used around 500 MB of RAM! Too much for me.
- Wilma (new) and Wilbur (old version)
- regain, in Java, based on lucene
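For the BM25 ranking listed above, here is a minimal Python sketch of the standard Okapi BM25 formula: each query term contributes an inverse document frequency weight multiplied by a term frequency that saturates and is normalized by document length. It is not lucene's code. The function name bm25_scores, the whitespace tokenizer and the defaults k1 = 1.5 and b = 0.75 are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (a list of strings) against `query`."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # The "+ 1" inside the log keeps the IDF positive for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents need more hits to score the same.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

Sorting the documents by descending score gives the ranked result list; documents that contain none of the query terms keep a score of zero.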
My combo winners are...
- everything is impressive, amazing!
- swish-e. It is a command-line tool, and you have to use it with additional software such as pdftohtml, pdf2txt, catdoc, src2dest filters and so on. It is very fast, the index size looks quite reasonable, and it doesn't use too much memory when indexing (around 60 MB on my PC running WinXP).
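The swish-e workflow (convert each document to plain text with an external filter, then index the text) can be sketched as follows. This is not swish-e's configuration or API, just an illustration in Python: the FILTERS table, the function names and the choice of pdftotext for PDF files are assumptions, and pdftotext and catdoc have to be installed separately for those two entries to work.

```python
import os
import re
import subprocess
from collections import defaultdict

# Hypothetical filter table: extension -> command that prints plain text to stdout.
FILTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["catdoc", "{path}"],
}


def extract_text(path):
    """Return the plain text of `path`, running an external filter if one is registered."""
    ext = os.path.splitext(path)[1].lower()
    if ext in FILTERS:
        cmd = [part.replace("{path}", path) for part in FILTERS[ext]]
        result = subprocess.run(cmd, capture_output=True, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    # No filter registered: assume the file is already plain text.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def index_text(index, doc_id, text):
    """Add every word of `text` to the inverted index (word -> set of document ids)."""
    for word in re.findall(r"\w+", text.lower()):
        index[word].add(doc_id)


def build_content_index(paths):
    """Filter and index every file in `paths`."""
    index = defaultdict(set)
    for path in paths:
        index_text(index, path, extract_text(path))
    return index


def search_content(index, query):
    """Return the documents that contain every word of `query`."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))
```

Supporting another format only means adding one entry to the filter table; the indexing and search code never needs to know where the text came from.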
Labels: clucene, dataparksearch, engine, everything, grep, index, indexer, indri, information retrieval, lucene, mendeley, retrieval, search, sino, strigi, swish-e, zettair