Eclipse Community Forums: Platform - User Assistance (UA)

Add pdf files to search index [message #484994]

Thu, 10 September 2009 03:49

Eclipse User

Hello,

I would like to add PDF files to the help search index of my RCP product.

I understand that the Lucene library can index PDF files (and Word documents),
but it seems that Eclipse only supports HTML files.

Do I need to add a new LuceneSearchProvider and associate it with the pdf
extension?

Can I then just delegate adding to the index to the default indexer, or do I
need to use Lucene APIs directly to parse and index the documents?

Thanks in advance for your help,
Mohamed.

Re: Add pdf files to search index [message #485412 is a reply to message #484994]

Fri, 11 September 2009 12:16

Eclipse User

This sounds like the sort of feature that others in the community may
have already implemented, and if so it would be great to get this
contributed to Eclipse. I'll ask around at IBM to see if we have anyone
looking into this. Meanwhile if anyone else on the newsgroup has
implemented this or thought about implementing this I'd be interested to
know what approach you used.

Re: Add pdf files to search index [message #485978 is a reply to message #485412]

Tue, 15 September 2009 15:13

Eclipse User

Mohamed Hussein wrote:
> Can I then just delegate adding to the index to the default indexer, or do I
> need to use Lucene APIs directly to parse and index the documents?

I think the Eclipse help system also parses the HTML files, although I'm
not sure at which point in the indexing process. I do know that there is an
HTMLParser.java class in the org.eclipse.help.base.source JAR, and I
presume it is there because the HTML files have to be parsed at some
point in the process.

So I would imagine that to get PDF files into the Lucene index, the PDF
files would have to be parsed.
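
To illustrate the parse-then-index idea, here is a minimal sketch in Java. It assumes a text extractor is chosen by file extension and that the extracted plain text is then handed to the index. All names in it (ExtractThenIndexSketch, register, add, search) are hypothetical; they are not Eclipse Help or Lucene APIs, and the in-memory word map is only a stand-in for a real Lucene index.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Eclipse Help or Lucene.
class ExtractThenIndexSketch {

    // Lower-case file extension -> function that turns raw bytes into plain text.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    // Stand-in for the real search index: word -> names of documents containing it.
    private final Map<String, List<String>> index = new HashMap<>();

    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Extracts text with the extractor registered for the file's extension,
    // then indexes each word. Returns false when no extractor is registered.
    boolean add(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = extractors.get(ext);
        if (extractor == null) {
            return false;
        }
        String text = extractor.apply(content);
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            List<String> docs = index.computeIfAbsent(word, k -> new ArrayList<>());
            if (!docs.contains(fileName)) {
                docs.add(fileName);
            }
        }
        return true;
    }

    List<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        ExtractThenIndexSketch sketch = new ExtractThenIndexSketch();
        // Placeholder "PDF" extractor: a real one would call a PDF parser here.
        sketch.register("pdf", bytes -> new String(bytes, StandardCharsets.UTF_8));
        sketch.add("guide.pdf", "Installing the product".getBytes(StandardCharsets.UTF_8));
        System.out.println(sketch.search("installing"));
    }
}
```

The point of the sketch is only the ordering: the index never sees the PDF itself, just the text an extractor produced from it.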

Chris Goldthorpe wrote:
> This sounds like the sort of feature that others in the community may
> have already implemented, and if so it would be great to get this
> contributed to Eclipse. I'll ask around at IBM to see if we have anyone
> looking into this. Meanwhile if anyone else on the newsgroup has
> implemented this or thought about implementing this I'd be interested to
> know what approach you used.

One thought is that the PDF document would need to be parsed. I just
went over to lucene.apache.org and the FAQ has this about indexing PDF:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422

"In order to index PDF documents you need to first parse them to extract
text that you want to index from them. Here are some PDF parsers that
can help you with that:

PDFBox is a Java API from Ben Litchfield that will let you access the
contents of a PDF document. It comes with integration classes for Lucene
to translate a PDF into a Lucene document.

XPDF is an open source tool that is licensed under the GPL. It's not a
Java tool, but there is a utility called pdftotext that can translate
PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called pdftohtml that can translate
PDF files into HTML files. This is also not a Java application.

JPedal is a Java API for extracting text and images from PDF documents."
----------------------------------------

A link about PDFBox to extract the text from a PDF:
http://www.pdfbox.org/userguide/text_extraction.html

The PDFBox site says that it is licensed under the BSD License. I don't
know if that is compatible with the Eclipse license, such that PDFBox
would be a viable solution to ship with the Eclipse Platform itself.

XPDF and JPedal seem to be GPL or LGPL.
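
For the pdftotext route from the FAQ, a rough Java sketch could run the command-line tool and capture its output as a string. It assumes pdftotext is installed and on the PATH, and it relies on the tool's convention that "-" as the output file name sends the text to stdout. The class and method names (PdfToTextSketch, buildCommand, extractText) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: assumes the pdftotext executable is on the PATH.
class PdfToTextSketch {

    // Builds the command line; "-" as the output file asks pdftotext to write to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text, or throws if the tool fails.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Read all of stdout before waiting, so a large document cannot fill the pipe and block.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextSketch.java <file.pdf>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

The returned string is what would then be passed on to whatever does the indexing. Because this shells out to a separate GPL-licensed program rather than linking a library, the licensing question is different from bundling PDFBox, but I have not checked what that means for shipping it with a product.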

Hope that helps,
Lee Anne

Re: Add pdf files to search index [message #899262 is a reply to message #623554]

Tue, 31 July 2012 05:12

Eclipse User

Hi. Yes, there needs to be a parser that extracts the text from the PDF in order to build the index, but do we need to code it ourselves, or can we just deploy a plug-in for that purpose?
