elasticserach FScrawler intermediate file location? help

I am looking for a method that can intercept the output of FSCrwaler before this gets passed to the elastic engine for indexing.

Just take a case, what I have is a PDF file which I wanted to index plus I also have some attributes/metadata for the same PDF which are stored in the database. I want to index both the content (PDF file content + attributes stored in PostgreSQL) in a single index file so that I can refine my search criteria and get correct results. As far I understand for PDF indexing text needs to be extracted from the file and stored in a text format and then passed to the indexing engine. I am relatively new (an HTDIG veteran -:D ) to the elastic ecosystem so looking for a way to intercept the text extracted from a PDF file so that I can append other text (in this scenario fetched from PostgreSQL) and then pass to the indexing mechanism, a kind of single index file for both the content.

Wondering if anyone has encountered this kind of scenario and can provide some pointers? It will be good to know that where does FSCrawler stores intermediate files created during the indexing process in elastic search? Can we intercept them and add some custom info?

Comments will be much appreciated!


Resize a column in a PostgreSQL table without changing data: Repost

This morning I realized that one of the columns in the table I have recently created is running out of space. It is varchar(10) and I want to make it varchar(20) and of course wanted to do without losing any data which got filled in the last couple of days. Though there is a standard way to changing the size and type of the col using ALTER commands which can become tricky sometimes with the data. So I was looking for an alternate way and I found an awesome post on how to do this in an easy way without disturbing data, this is an almost 10 yrs old post so reposting the commands in case they get lost. For detailed reading please visit the blog link given in references.

Command to check the current size of a given column:

SELECT atttypmod FROM pg_attribute WHERE attrelid = 'TABLE1'::regclass AND attname = 'COL1';

In my case, the present size is 10 and I want to increase it to 25, so the command to update column will be as below:

UPDATE pg_attribute SET atttypmod = 25 WHERE attrelid = 'TABLE1'::regclass AND attname = 'COL1';

I hope this info and re-post helps.

Original blog link:

Deep learning on large images: challenges and CNN

Applying deep learning on large images is always a challenge but there a solution using convolutional but first, let’s understand in brief where is the challenge? So one of the challenges in computer vision is that inputs become very big as we increase image size.

Ok. Consider the example of a basic image classification problem e.g.; cat detection.

Let’s take an input 64×64 image of a cat and try to figure out if that is a cat or not? to do that we’ll need 64x64x3, (where 3 is the no of RGB channel) parameters, so the x input feature will have 12288 dimensions though this is not a very large number considering just the 64×64 image size which is very small there are lots of input features to deal with..

Say now if we take 1000×1000 image (1MB) which is decent in size but there will be 1000x1000x3 (where 3 are the RGB channels) ~ 3M input features

so if you put in a deep network then x will be 3M, and suppose if first hidden layer has 1000 hidden units then the total no of weights for a fully connected network will be (Weight matrix, X) ~ (1000, 3M) dimension means that the matrix will have 3 Billion parameters which are very very large, so with that much of data it is difficult to avoid overfitting in Neural Network and also the computational requirement to train the 3 Billion params is not feasible.

This is just for a 1MB image but in computer vision problem you don’t want to stick with using just tiny images using bigger images results in overfitting and huge input feature vector so here comes convolution operation which is the basic building block of Convolutional Neural Network (CNN).

Image Source: Andrew Ng Deep Learning Course

More on CNN in the next post.


Inspiration & source of understanding: Andrew Ng, Deep Learning Specialization