Elasticsearch FSCrawler intermediate file location? Help

I am looking for a way to intercept the output of FSCrawler before it gets passed to Elasticsearch for indexing.

Take a case: I have a PDF file that I want to index, and I also have some attributes/metadata for the same PDF stored in a database. I want to index both kinds of content (the PDF file content plus the attributes stored in PostgreSQL) in a single index so that I can refine my search criteria and get correct results. As far as I understand, for PDF indexing the text needs to be extracted from the file, stored in a text format, and then passed to the indexing engine. I am relatively new (an HTDIG veteran :-D ) to the Elastic ecosystem, so I am looking for a way to intercept the text extracted from a PDF file, append other text to it (in this scenario, text fetched from PostgreSQL), and then pass it on to the indexing mechanism: a kind of single index entry covering both kinds of content.

Has anyone encountered this kind of scenario and can provide some pointers? It would also be good to know where FSCrawler stores the intermediate files created during the indexing process. Can we intercept them and add some custom info?
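To make the question concrete, here is a minimal sketch in shell of the merge I have in mind. Everything in it is an assumption, not actual FSCrawler behavior: the field names, the stand-in values (in a real run `pdf_text` would come from a text extractor such as pdftotext, and `attrs` from a psql query), and the index URL in the commented-out curl call are all hypothetical.

```shell
#!/bin/sh
# Sketch: combine text extracted from a PDF with attributes fetched
# from PostgreSQL into one JSON document before it reaches the index.
# Both values below are stand-ins so the sketch runs anywhere:
pdf_text="text extracted from the PDF"   # e.g. $(pdftotext TestPDF.pdf -)
attrs="author=ppant"                     # e.g. $(psql -t -A -c "SELECT ...")

# Build a single JSON document holding both the file content and the
# DB metadata; this is the "intercept and append" step.
doc="{\"content\":\"$pdf_text\",\"meta\":\"$attrs\"}"
echo "$doc"

# It could then be indexed directly, bypassing FSCrawler's own output:
# curl -X POST "http://localhost:9200/docs/_doc" -d "$doc"
```

The open question is whether FSCrawler exposes a hook at the equivalent of the `doc=` step above, or whether the extraction has to be done separately as sketched here.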

Comments will be much appreciated!


Apache Lucy search examples

Investigating search engines, this time Apache Lucy 0.4.2. I am showing a basic indexer and a small search application. See the code below for the indexer (it takes documents one by one and indexes them). The search module takes a search term as a command-line argument and then shows the search results.

This is a pure command-line utility, just to show how basic indexing and searching work with Apache Lucy.



use strict;
use warnings;
use Lucy::Simple;

# Ensure the index directory is both available and empty.
my $index = "/ppant/LucyTest/index";
system( "rm", "-rf", $index );
system( "mkdir", "-p", $index );

# Create the helper... a new Lucy::Simple object.
my $lucy = Lucy::Simple->new( path => $index, language => 'en' );

# Add the first "document". (We are mainly adding metadata of the document.)
my %one = (
    title => "This is a title of first article",
    body  => "some text inside the body we need to test the implementation of lucy",
    id    => 1,
);
$lucy->add_doc( \%one );

# Add the second "document".
my %two = (
    title => "This is another article",
    body  => "I am putting some basic content, using some words which are also in first document like implementation",
    id    => 2,
);
$lucy->add_doc( \%two );

# Both documents are now indexed under $index.

Once indexing of the documents is done, we'll write a small search script.



use strict;
use warnings;

use Lucy::Search::IndexSearcher;

my $term = shift || die "Usage: $0 search-term";

my $searcher = Lucy::Search::IndexSearcher->new( index => '/ppant/LucyTest/index' );

# A basic command-line search: look up the query term in the index and
# show, for each hit, which document it was found in.
my $hits = $searcher->hits( query => $term );
while ( my $hit = $hits->next ) {
    print "Title: $hit->{title} - ID: $hit->{id}\n";
}
# End of search.cgi


If you want to explore more, check the full code on GitHub.

Implementing the mapper attachment type in Elasticsearch

Create a new index
[code lang="bash"]curl -X PUT "" -d '{
"settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}
}'[/code]
Mapping the attachment type
[code lang="bash"]curl -X PUT "" -d '{
"attachment" : {
  "properties" : {
    "file" : {
      "type" : "attachment",
      "fields" : {
        "title" : { "store" : "yes" },
        "file"  : { "term_vector" : "with_positions_offsets", "store" : "yes" }
      }
    }
  }
}
}'[/code]

Shell script to convert the content to base64 encoding and index it
[code lang="bash"]#!/bin/sh

# Base64-encode the PDF so it can be embedded in the JSON payload.
coded=`cat TestPDF.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
# Wrap the encoded content in a JSON document for the attachment field.
json="{\"file\":\"$coded\"}"
echo "$json" > json.file
curl -X POST "" -d @json.file[/code]
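As a quick sanity check of the encoding step used in the script, here is a base64 round trip through the same MIME::Base64 module; the sample string is arbitrary and this assumes only that perl with MIME::Base64 is available, as the script above already does.

```shell
#!/bin/sh
# Round-trip check for the base64 step: encode a sample string, decode
# it back, and compare with the original.
original="hello attachment"
coded=$(printf '%s' "$original" | perl -MMIME::Base64 -e 'local $/; print encode_base64(<STDIN>)')
decoded=$(printf '%s' "$coded" | perl -MMIME::Base64 -e 'local $/; print decode_base64(<STDIN>)')

if [ "$decoded" = "$original" ]; then
  echo "round trip OK"
else
  echo "round trip FAILED"
fi
```

If the round trip fails on a given machine, the indexed attachment would be garbage, so this is worth running once before indexing real PDFs.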
Query (search results will be highlighted)
[code lang="bash"]curl "" -d '{
"fields" : ["title"],
"query" : {
  "query_string" : {
    "query" : "Cycling tips"
  }
},
"highlight" : {
  "fields" : {
    "file" : {}
  }
}
}'[/code]


If you want to explore more, check the full code on GitHub.

Apache Lucy search engine

Apache Lucy is a full-text search engine library written in C and targeted at dynamic languages. The good news is that the inaugural release provides Perl bindings. For more information, you can visit the Apache Lucy website. Let's hope for some good results. I still have to try it, but you can download it from here.