search inside pdf

I have a new install of Nuxeo (I downloaded the 5.7.1 VMware image). It seems to launch and work fine.

I uploaded 2 different PDF files, but the search does not find the text inside one of them. Why?

That PDF is not an image; its text is selectable and searchable in Adobe Reader.


Marwane K.A.

Hi, does this happen only with this particular PDF?

Maybe the PDF library didn't manage to parse this file in particular. You can check whether that's the case by looking for errors in the Nuxeo logs (look for "logs/server.log" in the Nuxeo folder; I'm not sure where it is on the VM, though).

08/02/2013

jbrowne

After re-uploading the file (renamed to 'eli.pdf'), this is what is in server.log (located in /var/log/nuxeo/):

2013-08-02 13:06:56,525 WARN [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli.pdf' for document: 1189530d-0b13-4ce6-93e8-8834a759ff88: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox

08/02/2013

Florent Guillaume

If you activate the DEBUG level for org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork in lib/log4j.xml then you'll get more info about the cause of the error.

08/02/2013

jbrowne

There is no entry for 'Fulltext' in that file. Do I have to add that section? If so, what should it look like?

08/02/2013

Florent Guillaume

This:

<category name="org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork">
  <priority value="DEBUG" />
</category>

08/04/2013

jbrowne

2013-08-05 14:33:37,425 WARN [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli3.pdf' for document: f8af5c77-d464-430d-b7f4-e54f1dfecd2d: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
2013-08-05 14:33:37,425 DEBUG [org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork] Could not extract fulltext of file 'eli3.pdf' for document: f8af5c77-d464-430d-b7f4-e54f1dfecd2d: org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox
org.nuxeo.ecm.core.convert.api.ConversionException: Error during text extraction with PDFBox

    at org.nuxeo.ecm.core.convert.plugins.text.extractors.PDF2TextConverter.convert(PDF2TextConverter.java:154)
    at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:168)
    at org.nuxeo.ecm.core.convert.plugins.text.extractors.FullTextConverter.convert(FullTextConverter.java:73)
    at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:168)
    at org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork.convert(FulltextExtractorWork.java:231)
    at org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork.blobsToText(FulltextExtractorWork.java:198)
    at org.nuxeo.ecm.core.storage.sql.FulltextExtractorWork.work(FulltextExtractorWork.java:148)
    at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:164)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Caused by: java.lang.IllegalArgumentException: Comparison method violates its general contract!

    at java.util.TimSort.mergeHi(TimSort.java:868)
    at java.util.TimSort.mergeAt(TimSort.java:485)
    at java.util.TimSort.mergeCollapse(TimSort.java:408)
    at java.util.TimSort.sort(TimSort.java:214)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at java.util.Collections.sort(Collections.java:217)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:551)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:443)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
    at org.nuxeo.ecm.core.convert.plugins.text.extractors.PDF2TextConverter.convert(PDF2TextConverter.java:142)
    ... 10 more

08/05/2013

ANSWERS

From the trace above, you're the victim of a PDFBox bug: PDFBOX-1512.

A workaround is to add this to JAVA_OPTS:

-Djava.util.Arrays.useLegacyMergeSort=true
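
For background: the "Comparison method violates its general contract!" message comes from TimSort, the sort that Java 7 introduced, which can detect a comparator returning inconsistent answers; the option above switches back to the older merge sort, which does not perform that check. The sketch below is only a generic illustration of such an inconsistency, not PDFBox's actual comparator. The class name `FuzzyComparatorDemo`, the tolerance of 1.0, and the sample values are all hypothetical.

```java
import java.util.Comparator;

public class FuzzyComparatorDemo {
    // Hypothetical comparator: values within 1.0 of each other count as "equal".
    // This looks harmless but makes equality non-transitive.
    static final Comparator<Double> FUZZY = (a, b) -> {
        if (Math.abs(a - b) <= 1.0) {
            return 0;
        }
        return Double.compare(a, b);
    };

    // True when cmp says a == b and b == c, yet a != c,
    // which breaks the contract that Comparator implementations must honor.
    static <T> boolean breaksTransitivity(Comparator<T> cmp, T a, T b, T c) {
        return cmp.compare(a, b) == 0
            && cmp.compare(b, c) == 0
            && cmp.compare(a, c) != 0;
    }

    public static void main(String[] args) {
        // 0.0 ~ 0.8 and 0.8 ~ 1.6, but 0.0 and 1.6 differ by more than the tolerance.
        System.out.println(breaksTransitivity(FUZZY, 0.0, 0.8, 1.6)); // prints "true"
        // The natural ordering of Double never shows this inconsistency.
        System.out.println(breaksTransitivity(Comparator.<Double>naturalOrder(), 0.0, 0.8, 1.6)); // prints "false"
    }
}
```

Whether TimSort actually throws for a given input depends on the data being sorted, which would be consistent with only one of the two PDFs failing.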


jbrowne

Excuse my lack of knowledge: where do I add this?

08/05/2013

Florent Guillaume

Inside bin/nuxeo.conf. See the other places where JAVA_OPTS is used in this file.
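
As a sketch, assuming the file follows the usual pattern of appending to the existing value (check the other JAVA_OPTS lines in your copy to confirm), the added line would look like this:

```
# bin/nuxeo.conf
# Assumed form: append the workaround flag to whatever JAVA_OPTS already contains
JAVA_OPTS=$JAVA_OPTS -Djava.util.Arrays.useLegacyMergeSort=true
```

The server needs a restart for the change to take effect, and the failing PDF probably has to be uploaded again so that fulltext extraction runs on it once more.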

08/05/2013

Florent Guillaume
