apache lucy search examples

Investigating search engines and this time apache Lucy 0.4.2. I am showing a basic indexer and a small search application. See below code for indexer (This will take documents one by one and then index them). Search module will take arugument as STDIN and then will show the search result.

This is pure command line utility just to show how basic indexing and searching works using apache lucy.

indexer.pl

#!/usr/local/bin/perl

use strict;
use warnings;
use Lucy::Simple;

#
# Ensure the index directory is both available and empty.
#
my $index = "/ppant/LucyTest/index";
system( "rm", "-rf", $index );
system( "mkdir", "-p", $index );
# Create the helper...a new Lucy::Simple object
my $lucy = Lucy::Simple new( path = $index, language = 'en', );

# Add the first "document". (We are mainly adding meta data of the document)
my %one = ( title ="This is a title of first article" , body ="some text inside the body we need to test the implementaion of lucy", id =1 );
$lucy-add_doc( \%one );

# Add the second "document".
my %two = ( title ="This is another article" , body ="I am putting some basic content, using some words which are also in first document like implementation", id =2 );
$lucy add_doc( \%two );

# Both the documents are now indexed in path

One indexing of the documents is done we'll make a small search script.

search.cgi

#!/usr/local/bin/perl

use strict;
use warnings;

use Lucy::Search::IndexSearcher;

my $term = shift || die "Usage: $0 search-term";

my $searcher = Lucy::Search::IndexSearcher new( index ='/ppant/LucyTest/index');
# A basic search command line which will look for indexed items based on STDIN and will show that in which document query string is found and no of hits
my $hits = $searcher hits( query =$term );
while ( my $hit = $hits next ) {
print "Title: $hit {title} - ID: $hit {id}\n";
}
# End of search.cgi


***********************************************************************

If you want to explore more check Full Code on GitHub

Import RDBMS table to HDFS with sqoop from postgreSQL

Steps:

1. Download JDBC driver

 

$wget http://jdbc.postgresql.org/download/postgresql-9.3-1102.jdbc4.jar

 

 2. Copy: 

cp /home/cloudera/Desktop/postgresql-9.3-1102.jdbc4.jar /usr/lib/sqoop/lib/ 

 

3. Configure: 

/var/lib/pgsql/data/pg_hba.conf

file. You need to allow the IP/host of machine running Hadoop.

Restart postgreSQL using 

$pg_ctl restart

 

4. Run sqoop: Open the terminal on machine running hadoop and type the below command.

 

 cloudera@cloudera-vm:/usr/lib/sqoop bin/sqoop import --connect jdbc:postgresql://192.168.0.34:5432/Testdb--table employee --username postgres -P --target-dir /sqoopOut1 -m 1 

 

Enter password:

 

prerequisites:

  • Cloudera hadoop VM distribution or any other machine running hadoop.
  • postgreSQL installation.
  • database Testdb and employee table on a running instance of postgreSQL (e.g.; 192.168.0.34:5432 in point 4).

 

All set! Your pgsql table data is now available on HDFS of  VM hadoop cluster.

 

Enjoy hadoop learning!

WEBDAV authentication for Office docunments on SSL enabled sites

As explained in my post related to the Webdav authentication issue using IE for Office documents on Windows 7, I found that the fix will not work for SSL enabled sites. To support HTTPS you will have to add the following lines in /etc/httpd/conf.d/ssl.conf file.


<VirtualHost>
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^(OPTIONS|PROPFIND)$ [NC]
RewriteRule ^.*$ – [F,L]
</VirtualHost>

Restart httpd

To generate an SSL certificate you can use OpenSSL library

Solving the authentication problem while opening Office documents hosted on Apache in IE8/IE9 on Windows 7

We were facing a problem in IE 8/9 on Windows 7 while accessing  Office 2007/ Office 2010 documents hosted on apache/Cent OS 4.6. After some analysis I found the reason and finally ended in a fix. See below my findings and solution. Hope this helps:

The main issue is with the Microsoft’s way of implementing Webdav protocol for accessing web content through Microsoft Web Client. When we click on a Office document then web client  sends HTTP /1.1 OPTIONS Request header to server to check the WebDav communication (My server doesn’t have WebDav). In response Apache return 200 OK Response header to Web Client which results in prompting the authentication screen by Windows 7.  Well you have option in IE to pass the authentication login automatically but that would be security breach as you will be exposing your machine authentication to internet so I would not prefer that. Best way is to configure Apache to reject these request. This is how i have solved. These changes needs to be done in httpd.conf file in /etc/httpd/conf folder (Cent OS 4.6)

# One way to doing it – Deny access based on request method


RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^(OPTIONS|PROPFIND)$ [NC]
RewriteRule ^.*$ - [F,L]

# Another way to implementing – Deny acess based on user agent (Vista and Windows 7 used same user agent with different version so this Regx shall work for both

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Microsoft-WebDAV-MiniRedir
RewriteRule ^.*$ - [F,L]

Explanation on Flags:

1. [F] flag causes the server to return a 403 Forbidden status code to the client.

2. Use of the [NC] flag causes the RewriteRule to be matched in a case-insensitive manner. That is, it doesn’t care whether letters appear as upper-case or lower-case in the matched URI.

3. The [L] flag causes mod_rewrite to stop processing the rule set. In most contexts, this means that if the rule matches, no further rules will be processed. This corresponds to the last command in Perl.

 Some References:

Microsoft knowledge article on authentication requests from office documents 

Apache mod_rewrite rule documentaion

fiddler tool for debugging HTTP requests

Apache JIRA compromised

This is how apache JIRA is hacked

Browse the link to see the details https://blogs.apache.org/infra/entry/apache_org_04_09_2010

If any of you have account on apache.org for various projects then you might have received mail from them to change the password. If not then change your password for apache account (JIRA, Bugzilla etc)

I must say this is a big learning. Let’s be a bit more alert.