Nuxeo-Platform-OCR Question

Hi:

I'm trying to install 'Nuxeo-platform-ocr' (https://github.com/nuxeo/nuxeo-platform-ocr), but I do not know where to find the 'content_in_doc' file that Nuxeo needs in order to analyze documents.

I have followed the instructions at https://github.com/nuxeo/nuxeo-platform-ocr, but it is not clear where the file is located.

I'm using Ubuntu 10.11 + Tesseract 3 + Nuxeo + Olena (Scribo)

Could you tell me where I can find the file 'content_in_doc'?

Thanks, and regards.

1 votes

12 answers

3779 views

ANSWER



Hi. Install the olena-tools and olena-doc packages from http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/Download (the Mandriva packages), and you will find content_in_doc at /usr/lib/scribo/content_in_doc

0 votes



Hi @rbahntje, I was facing the same problem: $NUXEO_HOME/server/tmp/ocr_olena_xxx.xml was generated successfully, but no annotations were created. After a few days I solved this by editing the code from GitHub, and it now works fine in production. I have created a pull request at https://github.com/nuxeo/nuxeo-platform-ocr/pulls describing the problem and the solution.

0 votes



Yes, for example, the image:

produces the following XML file in $NUXEO_HOME/server/tmp (I only reproduce some lines):

File: ocr_olena_1355615432288.xml

more ocr_olena_1355615432288.xml
<?xml version="1.0" encoding="UTF-8"?>
<PcGts>
  <Metadata>
    <Creator>LRDE</Creator>
    <Created>2012-12-15T20:51:01</Created>
    <LastChange>2012-12-15T20:51:01</LastChange>
    <Comments>Generated by Scribo from Olena.</Comments>
  </Metadata>
  <Page imageFilename="noname" imageWidth="1200" imageHeight="880">
    <TextRegion id="1" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="turquoise" kerning="4" color="#567BC8" colorReliability="0" baseline="51" meanline="34" xHeight="18" dHeight="-7" aHeight="27" charWidth="14">
      <Coords>
        <Point x="30" y="41"/>
        <Point x="30" y="32"/>
        <Point x="42" y="32"/>
        <Point x="42" y="33"/>
        <Point x="43" y="33"/>
        <Point x="43" y="34"/>
        <Point x="44" y="34"/>
        <Point x="44" y="36"/>
        <Point x="45" y="36"/>
        <Point x="45" y="37"/>
        <Point x="171" y="37"/>
        <Point x="171" y="36"/>
        <Point x="275" y="36"/>
        <Point x="275" y="31"/>
        <Point x="276" y="31"/>
        <Point x="276" y="30"/>
        <Point x="322" y="30"/>
        <Point x="322" y="29"/>
        <Point x="377" y="29"/>
        <Point x="377" y="34"/>
        <Point x="378" y="34"/>
        <Point x="378" y="35"/>
        <Point x="459" y="35"/>
        <Point x="459" y="34"/>
        <Point x="498" y="34"/>
        <Point x="498" y="30"/>
        <Point x="499" y="30"/>
        <Point x="499" y="28"/>
        <Point x="499" y="34"/>
        <Point x="568" y="34"/>
        <Point x="568" y="29"/>
        <Point x="569" y="29"/>
        <Point x="569" y="28"/>
        <Point x="569" y="33"/>
        <Point x="663" y="33"/>
        <Point x="663" y="32"/>
        <Point x="666" y="32"/>
        <Point x="666" y="30"/>
        <Point x="667" y="30"/>
        <Point x="667" y="27"/>
        <Point x="668" y="27"/>
        <Point x="668" y="26"/>
        <Point x="690" y="26"/>
        <Point x="690" y="27"/>
        <Point x="699" y="27"/>
        <Point x="699" y="32"/>
        <Point x="876" y="32"/>
        <Point x="876" y="26"/>
        <Point x="878" y="26"/>
        <Point x="878" y="25"/>
        <Point x="903" y="25"/>
        <Point x="903" y="31"/>
        <Point x="920" y="31"/>
        <Point x="920" y="32"/>
        <Point x="921" y="32"/>
        <Point x="921" y="33"/>
        <Point x="923" y="33"/>
        <Point x="923" y="35"/>
        <Point x="924" y="35"/>
        <Point x="924" y="38"/>
        <Point x="925" y="38"/>
        <Point x="925" y="40"/>
        <Point x="924" y="40"/>
        <Point x="924" y="44"/>
        <Point x="923" y="44"/>
        <Point x="923" y="45"/>
        <Point x="922" y="45"/>
        <Point x="922" y="46"/>
        <Point x="921" y="46"/>
        <Point x="921" y="47"/>
        <Point x="919" y="47"/>
        <Point x="919" y="48"/>
        <Point x="860" y="48"/>
        <Point x="860" y="53"/>
        <Point x="859" y="53"/>
        <Point x="859" y="49"/>
        <Point x="784" y="49"/>
        <Point x="784" y="50"/>
        <Point x="783" y="50"/>
        <Point x="783" y="53"/>
        <Point x="782" y="53"/>
        <Point x="782" y="54"/>
        <Point x="781" y="54"/>
        <Point x="781" y="49"/>
        <Point x="717" y="49"/>
        <Point x="717" y="51"/>
        <Point x="716" y="51"/>
        <Point x="716" y="54"/>
        <Point x="715" y="54"/>
        <Point x="715" y="55"/>
        <Point x="715" y="49"/>
        <Point x="651" y="49"/>
        <Point x="651" y="50"/>
        <Point x="486" y="50"/>
        <Point x="486" y="51"/>
        <Point x="401" y="51"/>
        <Point x="401" y="55"/>
        <Point x="401" y="52"/>
        <Point x="226" y="52"/>
        <Point x="226" y="57"/>
        <Point x="225" y="57"/>
        <Point x="225" y="58"/>
        <Point x="224" y="58"/>
        <Point x="224" y="53"/>
        <Point x="54" y="53"/>
        <Point x="54" y="54"/>
        <Point x="54" y="53"/>
        <Point x="30" y="53"/>
        <Point x="30" y="42"/>
      </Coords>
        <Line text="Pese a su compacto diseﬁo, es realmente versétil y muy completo" id="7" boldness="2.78846" boldnessReliability="22.8793" color="#567BC8" colorReliability="4.84687" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="turquoise" kerning="4" baseline="51" meanline="34" xHeight="18" dHeight="-7" aHeight="27" charWidth="14">
          <Coords>
            <Point x="30" y="25"/>
            <Point x="925" y="25"/>
            <Point x="925" y="58"/>
            <Point x="30" y="58"/>
          </Coords>
        </Line>
    </TextRegion>
    <TextRegion id="2" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="black" kerning="5" color="#352F20" colorReliability="0" baseline="127" meanline="85" xHeight="43" dHeight="-2" aHeight="60" charWidth="34">
      <Coords>
        <Point x="33" y="98"/>
        <Point x="33" y="73"/>
        <Point x="81" y="73"/>
        <Point x="81" y="85"/>
        <Point x="216" y="85"/>
        <Point x="216" y="75"/>
        <Point x="217" y="75"/>
        <Point x="217" y="74"/>
        <Point x="219" y="74"/>
        <Point x="219" y="73"/>
        <Point x="220" y="73"/>
        <Point x="220" y="72"/>
        <Point x="222" y="72"/>
        <Point x="222" y="71"/>
        <Point x="224" y="71"/>
        <Point x="224" y="70"/>
        <Point x="471" y="70"/>
        <Point x="471" y="69"/>
        <Point x="473" y="69"/>
        <Point x="473" y="68"/>
        <Point x="475" y="68"/>
        <Point x="475" y="69"/>
        <Point x="568" y="69"/>
        <Point x="568" y="70"/>
        <Point x="571" y="70"/>
        <Point x="571" y="71"/>
        <Point x="573" y="71"/>
        <Point x="573" y="72"/>
        <Point x="575" y="72"/>
        <Point x="575" y="73"/>
        <Point x="576" y="73"/>
        <Point x="576" y="74"/>
        <Point x="577" y="74"/>
        <Point x="577" y="75"/>
        <Point x="578" y="75"/>
        <Point x="578" y="77"/>
        <Point x="579" y="77"/>
        <Point x="579" y="80"/>
        <Point x="580" y="80"/>
        <Point x="580" y="115"/>
        <Point x="581" y="115"/>
        <Point x="581" y="125"/>
        <Point x="507" y="125"/>
        <Point x="507" y="126"/>
        <Point x="421" y="126"/>
        <Point x="421" y="127"/>
        <Point x="256" y="127"/>
        <Point x="256" y="128"/>
        <Point x="145" y="128"/>
        <Point x="145" y="129"/>
        <Point x="57" y="129"/>
        <Point x="57" y="128"/>
        <Point x="33" y="128"/>
        <Point x="33" y="99"/>
      </Coords>
        <Line text="Mountain Serie 2" id="25" boldness="8.28571" boldnessReliability="33.6896" color="#352F20" colorReliability="3.10364" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="black" kerning="5" baseline="127" meanline="85" xHeight="43" dHeight="-2" aHeight="60" charWidth="34">
          <Coords>
            <Point x="33" y="68"/>
            <Point x="581" y="68"/>
            <Point x="581" y="129"/>
            <Point x="33" y="129"/>
          </Coords>
        </Line>
    </TextRegion>
    <TextRegion id="3" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" color="#5D5559" colorReliability="13.0052" baseline="184" meanline="174" xHeight="11" dHeight="-4" aHeight="16" charWidth="8">
      <Coords>
        <Point x="29" y="513"/>
        <Point x="29" y="357"/>
        <Point x="28" y="357"/>
        <Point x="28" y="281"/>
        <Point x="29" y="281"/>
        <Point x="29" y="280"/>
        <Point x="30" y="280"/>
        <Point x="30" y="279"/>
        <Point x="34" y="279"/>
        <Point x="34" y="276"/>
        <Point x="35" y="276"/>
        <Point x="35" y="275"/>
        <Point x="102" y="275"/>
        <Point x="102" y="180"/>
        <Point x="103" y="180"/>
        <Point x="103" y="175"/>
        <Point x="136" y="175"/>
        <Point x="136" y="174"/>
        <Point x="194" y="174"/>
        <Point x="194" y="171"/>
        <Point x="302" y="171"/>
        <Point x="302" y="170"/>
        <Point x="415" y="170"/>
        <Point x="415" y="169"/>
        <Point x="415" y="173"/>
        <Point x="427" y="173"/>
        <Point x="427" y="174"/>
        <Point x="428" y="174"/>
        <Point x="428" y="202"/>
        <Point x="429" y="202"/>
        <Point x="429" y="233"/>
        <Point x="428" y="233"/>
        <Point x="428" y="253"/>
        <Point x="429" y="253"/>
        <Point x="429" y="331"/>
        <Point x="427" y="331"/>
        <Point x="427" y="354"/>
        <Point x="429" y="354"/>
        <Point x="429" y="409"/>
        <Point x="430" y="409"/>
        <Point x="430" y="512"/>
        <Point x="429" y="512"/>
        <Point x="429" y="518"/>
        <Point x="430" y="518"/>
        <Point x="430" y="565"/>
        <Point x="431" y="565"/>
        <Point x="430" y="565"/>
        <Point x="430" y="615"/>
        <Point x="431" y="615"/>
        <Point x="431" y="750"/>
        <Point x="430" y="750"/>
        <Point x="430" y="769"/>
        <Point x="431" y="769"/>
        <Point x="431" y="776"/>
        <Point x="430" y="776"/>
        <Point x="430" y="795"/>
        <Point x="431" y="795"/>
        <Point x="431" y="854"/>
        <Point x="427" y="854"/>
        <Point x="427" y="855"/>
        <Point x="397" y="855"/>
        <Point x="397" y="857"/>
        <Point x="396" y="857"/>
        <Point x="396" y="858"/>
        <Point x="245" y="858"/>
        <Point x="245" y="859"/>
        <Point x="244" y="859"/>
        <Point x="244" y="858"/>
        <Point x="183" y="858"/>
        <Point x="183" y="857"/>
        <Point x="32" y="857"/>
        <Point x="32" y="848"/>
        <Point x="31" y="848"/>
        <Point x="31" y="847"/>
        <Point x="30" y="847"/>
        <Point x="30" y="822"/>
        <Point x="31" y="822"/>
        <Point x="31" y="770"/>
        <Point x="30" y="770"/>
        <Point x="30" y="572"/>
        <Point x="31" y="572"/>
        <Point x="31" y="560"/>
        <Point x="30" y="560"/>
        <Point x="30" y="543"/>
        <Point x="29" y="543"/>
        <Point x="29" y="514"/>
      </Coords>
        <Line text="a propuesta de Mountain para esta" id="49" boldness="16.6923" boldnessReliability="118.987" color="#666265" colorReliability="23.2958" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="184" meanline="174" xHeight="11" dHeight="-4" aHeight="16" charWidth="8">
          <Coords>
            <Point x="102" y="169"/>
            <Point x="428" y="169"/>
            <Point x="428" y="188"/>
            <Point x="102" y="188"/>
          </Coords>
        </Line>
        <Line text="comparative es peculiar en tanto que" id="66" boldness="18" boldnessReliability="144.741" color="#6A6467" colorReliability="33.4482" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="210" meanline="200" xHeight="11" dHeight="-4" aHeight="16" charWidth="8">
          <Coords>
            <Point x="102" y="195"/>
            <Point x="429" y="195"/>
            <Point x="429" y="214"/>
            <Point x="102" y="214"/>
          </Coords>
        </Line>
        <Line text="ha elegido una caja Antec Minuet" id="73" boldness="8.85185" boldnessReliability="83.2387" color="#524749" colorReliability="33.7844" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="236" meanline="226" xHeight="11" dHeight="-4" aHeight="16" charWidth="9">
          <Coords>
            <Point x="103" y="221"/>
            <Point x="429" y="221"/>
            <Point x="429" y="240"/>
            <Point x="103" y="240"/>
          </Coords>
        </Line>
        <Line text="para alojar una equlllbrada seleccién" id="86" boldness="8.36364" boldnessReliability="85.951" color="#4A4143" colorReliability="31.384" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="261" meanline="251" xHeight="11" dHeight="-5" aHeight="16" charWidth="8">
          <Coords>
            <Point x="103" y="246"/>
            <Point x="429" y="246"/>
            <Point x="429" y="266"/>
            <Point x="103" y="266"/>
          </Coords>
        </Line>
        <Line text="de componentes. No destacan por ser Ios" id="96" boldness="10.4375" boldnessReliability="96.4558" color="#52494D" colorReliability="28.5853" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="288" meanline="277" xHeight="12" dHeight="-4" aHeight="16" charWidth="9">
          <Coords>
            <Point x="28" y="273"/>
            <Point x="429" y="273"/>
            <Point x="429" y="292"/>
            <Point x="28" y="292"/>
          </Coords>
        </Line>
        <Line text="ma's répidos de este informe. pero no quedan" id="112" boldness="18.4857" boldnessReliability="164.682" color="#696166" colorReliability="29.5019" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="314" meanline="303" xHeight="12" dHeight="-4" aHeight="16" charWidth="9">
          <Coords>
            <Point x="29" y="299"/>
            <Point x="429" y="299"/>
            <Point x="429" y="318"/>
            <Point x="29" y="318"/>
          </Coords>
        </Line>
        <Line text="mal en ninquna de las pruebas de rendimien-" id="119" boldness="20.6571" boldnessReliability="168.983" color="#655D62" colorReliability="27.1991" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="3" baseline="339" meanline="329" xHeight="11" dHeight="-5" aHeight="16" charWidth="8">
          <Coords>
            <Point x="29" y="324"/>
            <Point x="429" y="324"/>
            <Point x="429" y="344"/>
            <Point x="29" y="344"/>
          </Coords>
        </Line>
        <Line text="to del banco de benchmarks. La ausencia més" id="128" boldness="22.7143" boldnessReliability="184.03" color="#676064" colorReliability="27.9409" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="365" meanline="354" xHeight="12" dHeight="-1" aHeight="17" charWidth="9">
          <Coords>
            <Point x="28" y="349"/>
            <Point x="429" y="349"/>
            <Point x="429" y="366"/>
            <Point x="28" y="366"/>
          </Coords>
        </Line>
        <Line text="llamativa es la de ma's memorla RAM. que" id="145" boldness="11.9355" boldnessReliability="114.239" color="#564E51" colorReliability="33.6988" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="2" baseline="391" meanline="380" xHeight="12" dHeight="-2" aHeight="16" charWidth="9">
          <Coords>
            <Point x="29" y="376"/>
            <Point x="429" y="376"/>
            <Point x="429" y="393"/>
            <Point x="29" y="393"/>
          </Coords>
        </Line>
        <Line text="aunque es de buena factura, nunca esté de" id="154" boldness="15.1515" boldnessReliability="135.539" color="#554D50" colorReliability="33.7148" orientation="0" readingOrientation="0" readingDirection="left-to-right" type="text" reverseVideo="false" indented="false" textColour="indigo" kerning="3" baseline="417" meanline="406" xHeight="12" dHeight="-4" aHeight="17" charWidth="8">
          <Coords>
            <Point x="29" y="401"/>
            <Point x="430" y="401"/>
            <Point x="430" y="421"/>
            <Point x="29" y="421"/>
          </Coords>
        </Line>
0 votes



Hi, I was facing the same problem: $NUXEO_HOME/server/tmp/ocr_olena_xxx.xml was generated successfully, but no annotations were created. After a few days I solved this by editing the code from GitHub, and it now works fine in production. I have created a pull request at https://github.com/nuxeo/nuxeo-platform-ocr/pulls describing the problem and the solution.

02/02/2013


Does the XML file produced by content_in_doc contain anything? Sometimes, if the document is of poor quality, content_in_doc may produce an empty file… :)
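
One way to answer that without reading the XML by eye is to list the text attributes of the Line elements. The following is only a minimal sketch: it assumes the output looks like the excerpt posted in this thread (Line elements carrying a text attribute), and the function name recognized_lines and the SAMPLE string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def recognized_lines(xml_text):
    """Return the non-empty text attributes of all Line elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        # Drop a "{namespace}" prefix in case the file declares one.
        local_name = element.tag.rsplit("}", 1)[-1]
        text = element.get("text", "").strip()
        if local_name == "Line" and text:
            lines.append(text)
    return lines


# Hypothetical, heavily shortened sample modelled on the excerpt above.
SAMPLE = """<PcGts>
  <Page imageFilename="noname" imageWidth="1200" imageHeight="880">
    <TextRegion id="2">
      <Line text="Mountain Serie 2" id="25"/>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(recognized_lines(SAMPLE))
```

An empty list would suggest that content_in_doc found no text in the image, which points at the input rather than at the Nuxeo side.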

0 votes



Sorry, I was not very clear in my previous post. Last year I was able to get this plugin working under Nuxeo 5.4.2 on an Oracle Linux installation. Then I tried to use the plugin in Nuxeo 5.5 under Ubuntu, with the situation described above.

With the release of Nuxeo 5.6 I decided to make a fresh installation under Ubuntu, and I had trouble getting the content_in_doc binary. Following Guillaume's suggestion to use the new Olena package (thanks Guillaume!), I was able to get the file.

But when I try to use the OCR plugin, I get the same situation as under 5.5: the content_in_doc command works fine. I tried converting an image from the command line and it works.

When I upload an image to Nuxeo, I can see a process like this running:

nuxeo 17203 17198 0 00:23 pts/0 00:00:00 content_in_doc /var/lib/nuxeo/server/tmp/cmdLineBasedConverter2216478922130777180.JPG /var/lib/nuxeo/server/tmp/ocr_olena_1355109823244.xml

And the file ocr_olena_xxxxxxx.xml is created under $NUXEO_HOME/tmp

But... no annotations are generated in the document in Nuxeo.

And no errors are generated in server.log. This is all the information the log registers after I do an upload:

2012-12-10 00:34:16,310 INFO [it.tidalwave.image.op.ReadOp] readMetadata(java.io.FileInputStream@705d5338, 0)
2012-12-10 00:34:16,319 INFO [it.tidalwave.image.op.ReadOp] read(java.io.FileInputStream@44f4660b, 0)
2012-12-10 00:34:17,438 DEBUG [com.hp.hpl.jena.shared.LockMRSW] Lock : Nuxeo-Work-default-4
2012-12-10 00:34:17,439 DEBUG [com.hp.hpl.jena.shared.LockMRSW] Nuxeo-Work-default-4 >> enterCS: Thread R/W: 0/0 :: Model R/W: 0/0 (thread: Nuxeo-Work-default-4)
2012-12-10 00:34:17,439 DEBUG [com.hp.hpl.jena.shared.LockMRSW] Nuxeo-Work-default-4 << enterCS: Thread R/W: 1/0 :: Model R/W: 1/0 (thread: Nuxeo-Work-default-4)
2012-12-10 00:34:17,440 DEBUG [com.hp.hpl.jena.shared.LockMRSW] Nuxeo-Work-default-4 >> leaveCS: Thread R/W: 1/0 :: Model R/W: 1/0 (thread: Nuxeo-Work-default-4)
2012-12-10 00:34:17,441 DEBUG [com.hp.hpl.jena.shared.LockMRSW] Nuxeo-Work-default-4 << leaveCS: Thread R/W: 0/0 :: Model R/W: 0/0 (thread: Nuxeo-Work-default-4)

Any suggestions for getting the OCR plugin to work under Nuxeo 5.6?

Ruben

0 votes



Is there any sort of configuration for what to extract, where to extract it from, and so on? Or can we only get what Olena thinks we expect? :)

0 votes



Regarding the content_in_doc binary itself, there are not many options provided: you can choose the output format, the result location and the OCR language, and enable or disable some detection and OCR steps. It is meant to extract all the possible data related to the document: content, layout, typographical information and images. If you want a more "dedicated" tool, I am afraid that you need to get into the source code :D

What would your needs be?

12/07/2012

Configuration from within Nuxeo, for example. Having a component under Nuxeo is good, but my customer wants to configure every aspect directly from Nuxeo :) and other people also need some specific configuration. For example, for me and my company it would be great if I could specify some items to detect in documents (like the invoice number in my invoices). I know that there is an integration between Ephesoft and Nuxeo, for example, but Nuxeo could do this better (and without the heavy configuration and the many functions of the beautiful Ephesoft software).

Thank you for your question :)

12/07/2012


Thanks for the info. I've got this package and content_in_doc is now working. Now I cannot get the plugin to work under Nuxeo 5.6; I will try again with Nuxeo 5.4.2 to see whether the problem is caused by the changes introduced with 5.6.

0 votes



Olena is now available as deb packages for Debian and Ubuntu: http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/Download

Once installed, the content_in_doc binary is located in /usr/lib/scribo/
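
To confirm where the binary actually ended up, a small check of the PATH and of the package directory can help. This is just a sketch under assumptions: the helper name find_content_in_doc is hypothetical, and /usr/lib/scribo is the location reported here for the deb packages, so a source build may place the binary somewhere else.

```python
import os
import shutil


def find_content_in_doc(extra_dirs=("/usr/lib/scribo",)):
    """Return the path of an executable content_in_doc, or None.

    Looks on PATH first, then in extra_dirs. The default directory is the
    one reported in this thread for the deb packages (an assumption for
    any other kind of install).
    """
    found = shutil.which("content_in_doc")
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, "content_in_doc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print(find_content_in_doc() or "content_in_doc not found")
```

If the binary is only found through extra_dirs, it is not on the PATH, which may matter because the process listings in this thread show it being invoked by its bare name.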

0 votes



Oliver

The content_in_doc command is working fine. I tried converting an image from the command line and it works.

When I upload an image to Nuxeo, I can see a process like this running:

root 25994 25991 97 19:25 pts/0 00:00:15 content_in_doc /opt/nuxeo-cap-5.5-tomcat/tmp/cmdLineBasedConverter22108.jpg /opt/nuxeo-cap-5.5-tomcat/tmp/ocr_olena_1333236340089.xml

And the file ocr_olena_xxxxxxx.xml is created under $NUXEO_HOME/tmp

But... no annotations are generated in the document in Nuxeo. I will try to recompile everything again.

0 votes



Thanks to you, I just discovered ocr_olena_XX.xml files are also created in my tmp directory. Good to know.

Still, I can't see where and how Nuxeo is using them.

04/02/2012

OK, just one small thing: have you looked at the module's source code? Especially https://github.com/nuxeo/nuxeo-platform-ocr/blob/develop/src/main/java/org/nuxeo/ecm/platform/ocr/annotation/ImageAnnotationHelper.java

(since the other components seem to work well)

The very end of the file states:

        annotationsService.addAnnotation(annotation, new UserPrincipal(
                "OCR", null, false, true), "http://server/");

I don't know about you, but I haven't created (nor been told to) an "OCR" user. And "http://server/" doesn't match anything interesting in my deployment context.

04/02/2012

I tried to modify the UserPrincipal to an existing user, and the baseURL to my server's, but it doesn't work any better.
04/02/2012


I've installed the JAI package (http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-java-client-419417.html#7341-JAI-1.1.2-oth-JPR) and copied jai_codec.jar, jai_core.jar and mlibwrapper_jai.jar into my $NUXEO_HOME/nxserver/lib

Now I do not get any error messages anymore, but nothing happens when I upload an image file to Nuxeo.

How can I debug what is happening?

0 votes



Same here. The JAI warnings disappeared (thanks for the hint!), but nothing is happening.
02/27/2012

Oliver, did you find a solution? I am still having the same problem
03/22/2012

Sadly no, I'm still stuck on this, and without time to investigate it further for now.
03/23/2012


OK, I finally managed to get every piece together (using Olena's git repository instead of the release package, and still patching here and there).

The first time I imported an image, I got an error about Tesseract being unable to find language data. Right (by the way: how do we tell Nuxeo which language it should use for OCR?). Then I added the language data, and now I don't get any messages about OCR anymore; it is perfectly silent. But no annotations are created.

The only thing that could be related is:

2012-02-09 17:02:36,993 WARN [it.tidalwave.image.java2d.ImplementationFactoryJ2D] JAI not available: java.lang.ClassNotFoundException: javax.media.jai.PlanarImage

Any idea?

0 votes



Finally I made a fresh install from scratch on Oracle ELinux 5U7 and I can now get the content_in_doc binary (I was missing the GDCM2 library), but now I am having the same issue as OlivierM: when I upload an image, server.log shows this message:

WARN [it.tidalwave.image.java2d.ImplementationFactoryJ2D] JAI not available: java.lang.ClassNotFoundException: javax.media.jai.PlanarImage

Any idea?

02/24/2012


I just tried to build against the latest stable version (2.0) of Olena and it seems to work fine. I have updated the README.md of nuxeo-platform-ocr to point to the right source archive.

Beware that the build of Olena has several steps and requires 2 calls to make in 2 separate folders (the build root and the scribo/src subfolder):

$ wget http://www.lrde.epita.fr/dload/olena/2.0/olena-2.0.tar.bz2
$ tar jxvf olena-*.tar.bz2
$ cd olena-2.0/
$ mkdir _build
$ cd _build
$ ../configure && make
$ cd scribo/src
$ make

The scribo/src folder should then hold the content_in_doc binary. If not, check for error messages in the output of the build. Maybe you are missing the development headers for Tesseract? Have you installed Tesseract 3 from the source tarball and installed it system-wide using sudo make install?

0 votes



I've compiled Olena 1.0 with Tesseract 3.0 with no problem. Now I am trying to build the plugin with Olena 2.0 and Tesseract 3.0.1, but I cannot get content_in_doc. The only files I've found with content_in_doc in the name are the following:

[root@nx scribo]# find ./ -name "content_in_*"
./src/.deps/content_in_doc-content_in_doc.Po
./src/.deps/content_in_hdoc-content_in_hdoc.Po
./src/contest/DAE-2011/content_in_hdoc_dae
./src/contest/DAE-2011/.deps/content_in_doc_dae-content_in_doc_dae.Po
./src/contest/DAE-2011/.deps/content_in_hdoc_dae-content_in_hdoc_dae.Po
./src/contest/DAE-2011/content_in_hdoc_dae-content_in_hdoc_dae.o
./src/contest/DAE-2011/content_in_doc_dae-content_in_doc_dae.o
./src/contest/DAE-2011/content_in_doc_dae
./src/contest/hdlac-2011/.deps/content_in_hdoc_hdlac-content_in_hdoc_hdlac.Po
./src/contest/hdlac-2011/content_in_hdoc_hdlac-content_in_hdoc_hdlac.o
./src/contest/hdlac-2011/content_in_hdoc_hdlac

Is content_in_doc_dae the same? The Makefile in scribo/src has the lines targeting content_in_doc commented out:

DIST_COMMON = README $(dist_bin_SCRIPTS) $(srcdir)/Makefile.am \
        $(srcdir)/Makefile.in $(top_srcdir)/scribo/scribo.mk
utilexec_PROGRAMS = $(am__EXEEXT_1) $(am__EXEEXT_2) $(am__EXEEXT_3) \
        $(am__EXEEXT_4)
#am__append_1 = pbm_text_in_doc
am__append_2 = text_in_doc_preprocess \
        text_in_picture text_in_picture_neg
#am__append_3 = text_recognition_in_picture
##am__append_4 = content_in_doc \
##      content_in_hdoc \
##      non_text_components
subdir = scribo/src

What am I missing?

01/03/2012

As written in the README.md file, and as I already answered, you have to run make in the $SOURCE_ROOT/_build/scribo/src folder as well, and the content_in_doc binary will be created there too.
01/03/2012

I am running make inside the $SOURCE_ROOT/_build/scribo/src folder.
01/04/2012

I just tried from scratch in a new empty folder from the original tarball and the content_in_doc related lines in the Makefile are not commented out and the binary is built successfully. I suspect that in your case the configure script did not detect some missing dependency: re-run it from scratch and try to spot a warning message in its output. If you can't find any such message then please join the olena mailing lists where there are people more knowledgeable with the olena build than I am.
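
A quick way to tell whether configure disabled the target is to scan the generated Makefile for commented-out append lines that name content_in_doc, like the ones in the excerpt quoted above. The sketch below is a rough check only: the helper name content_in_doc_disabled is hypothetical, and it recognizes just the pattern seen in this thread, not every way automake can disable a program.

```python
import re

# Matches lines such as "##am__append_4 = content_in_doc \" from the
# Makefile excerpt quoted in this thread.
_DISABLED = re.compile(r"^\s*#+\s*am__append_\d+\s*=.*\bcontent_in_doc\b")


def content_in_doc_disabled(makefile_text):
    """Return True if a commented-out append line names content_in_doc."""
    return any(_DISABLED.match(line) for line in makefile_text.splitlines())


if __name__ == "__main__":
    # Path assumed from the build layout described in this thread.
    with open("scribo/src/Makefile", encoding="utf-8") as handle:
        print(content_in_doc_disabled(handle.read()))
```

If this reports True, the configure output is the place to look for the dependency that was not detected.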
01/05/2012

Right now I'm trying to compile Olena/content_in_doc on Debian Squeeze. I had to install the following packages to get content_in_doc enabled in the Makefiles:

apt-get install tesseract-ocr-fra tesseract-ocr-dev libboost1.42-dev libmagick++-dev libtiff4-dev libgdcm2-dev libqt4-dev libcfitsio3-dev build-essential

01/06/2012

In my case I built Tesseract 3 from the source tarball (it is not yet available in Ubuntu; I don't know about Debian). Tesseract 3 gives much better results than Tesseract 2 in practice.
01/09/2012

Here I did it using Squeeze's own Tesseract.
01/10/2012

Yet another try. I did it using hand-compiled libleptonica and libtesseract (3). Apparently, Olena 2 only detects the latter when it's compiled "--with-multiple-libraries" (so that it has libtesseract_api.so and so on, and not just libtesseract.so).

But after that, I got a bunch of C++ errors here and there. I fixed a few, but that's pretty much all I can do.

02/09/2012