Automatic key extraction full example

From OpenKM Documentation

Revision as of 09:59, 16 April 2013

SVN checkout modules

To create the KEA model, you must check out the openkm and thesaurus modules.

Select the svn repository type and enter the URL https://openkm.svn.sourceforge.net/svnroot/openkm/trunk/openkm to check out the openkm module.

Select the svn repository type and enter the URL https://openkm.svn.sourceforge.net/svnroot/openkm/trunk/thesaurus to check out the thesaurus module.

Installing openkm classes into maven repository

Make sure openkm is installed in your local Maven repository. To install it, run the command:

mvn clean package install -Dmaven.test.skip=true

Downloading AGROVOC thesaurus

We'll use AGROVOC for testing purposes. You can download it from http://oaei.ontologymatching.org/2007/environment/ (please read the terms of use).

Copy the files ag_skos_20070219.rdf and agrovoc_oaei2007.owl into the thesaurus/src/test/resources/vocabulary folder.

The vocabulary folder contains a testdocs folder with some AGROVOC training documents for creating the KEA model.

Create runtime configuration

Now we can create the runtime configuration. The ModelBuilder class must be executed with some parameters.


Okm installation guide 004.jpeg


To train the KEA model, the ModelBuilder class must be executed with the following parameters:

sourceFolder 
trainingFolder 
vocabularyFile 
vocabularyType
stopwordFile 
modelFileName 
porterStemmerClass 
stopwordClass 
language 
documentEncoding
testDocs ( optional )

In my case:

sourceFolder=/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary (all other paths are relative to sourceFolder)

trainingFolder=testdocs/en/train

vocabularyFile=ag_skos_20070219.rdf

vocabularyType=skos

stopwordFile=stopwords_en.txt

modelFileName=ag_skos_20070219.model

porterStemmerClass=com.openkm.kea.stemmers.PorterStemmer

stopwordClass=com.openkm.kea.stopwords.StopwordsEnglish

language=en

documentEncoding=UTF-8

testDocs=testdocs/en/test


The parameters to execute the ModelBuilder class are "/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary testdocs/en/train ag_skos_20070219.rdf skos stopwords_en.txt ag_skos_20070219.model com.openkm.kea.stemmers.PorterStemmer com.openkm.kea.stopwords.StopwordsEnglish en UTF-8 testdocs/en/test" and the VM argument is "-Xmx526M", as you can see in the next screenshot:

Okm installation guide 005.jpeg
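
The positional argument string above follows the order of the parameter list. As a rough sketch only, the mapping from named parameters to positional arguments could be written as below. ModelBuilderArgs is a hypothetical helper written for this page, not an OpenKM class, and it assumes the optional testDocs value is simply left off the end when not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of OpenKM): turns the named parameters of
// this guide into the positional arguments passed to ModelBuilder.
class ModelBuilderArgs {

    // Order taken from the parameter list in this guide.
    static final List<String> ORDER = Arrays.asList(
            "sourceFolder", "trainingFolder", "vocabularyFile", "vocabularyType",
            "stopwordFile", "modelFileName", "porterStemmerClass", "stopwordClass",
            "language", "documentEncoding", "testDocs");

    static String[] build(Map<String, String> params) {
        List<String> args = new ArrayList<>();
        for (String name : ORDER) {
            String value = params.get(name);
            if (value != null) {
                args.add(value);
            } else if (!name.equals("testDocs")) {
                // Assumption: every parameter except testDocs is mandatory.
                throw new IllegalArgumentException("Missing parameter: " + name);
            }
        }
        return args.toArray(new String[0]);
    }

    // The values used in this guide.
    static Map<String, String> example() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("sourceFolder", "/home/jllort/softwareFactoryGalileo/thesaurus/vocabulary");
        p.put("trainingFolder", "testdocs/en/train");
        p.put("vocabularyFile", "ag_skos_20070219.rdf");
        p.put("vocabularyType", "skos");
        p.put("stopwordFile", "stopwords_en.txt");
        p.put("modelFileName", "ag_skos_20070219.model");
        p.put("porterStemmerClass", "com.openkm.kea.stemmers.PorterStemmer");
        p.put("stopwordClass", "com.openkm.kea.stopwords.StopwordsEnglish");
        p.put("language", "en");
        p.put("documentEncoding", "UTF-8");
        p.put("testDocs", "testdocs/en/test");
        return p;
    }

    public static void main(String[] unused) {
        // Prints the argument string to paste into the run configuration.
        System.out.println(String.join(" ", build(example())));
    }
}
```

Keeping the order in one list makes it easy to spot a swapped or missing argument before launching the training run.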


The classpath must look like this:


Okm installation guide 006.jpeg


If all goes well, a file called ag_skos_20070219.model (the modelFileName given above) is generated in the vocabulary folder.



Copying vocabulary files into OpenKM

Create a folder called vocabulary in %JBOSS_HOME% and copy the files ag_skos_20070219.rdf, agrovoc_oaei2007.owl, ag_skos_20070219.model, and stopwords_en.txt into it.
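
The copy can of course be done by hand. If you prefer to script it, a minimal sketch follows. VocabularyCopier is a hypothetical helper written for this page, not an OpenKM class, and it assumes that all four files sit together in one source folder and that the JBOSS_HOME environment variable is set.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of OpenKM): copies the vocabulary files
// named in this guide into <jbossHome>/vocabulary.
class VocabularyCopier {

    static final String[] FILES = {
        "ag_skos_20070219.rdf",
        "agrovoc_oaei2007.owl",
        "ag_skos_20070219.model",
        "stopwords_en.txt"
    };

    static Path copy(Path sourceDir, Path jbossHome) throws IOException {
        Path target = jbossHome.resolve("vocabulary");
        Files.createDirectories(target); // no error if it already exists
        for (String name : FILES) {
            // Fails with NoSuchFileException if a source file is missing.
            Files.copy(sourceDir.resolve(name), target.resolve(name),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        String jbossHome = System.getenv("JBOSS_HOME");
        if (args.length < 1 || jbossHome == null) {
            System.err.println("Usage: java VocabularyCopier <sourceDir> (JBOSS_HOME must be set)");
            return;
        }
        System.out.println("Copied to " + copy(Paths.get(args[0]), Paths.get(jbossHome)));
    }
}
```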

Configuring OpenKM.cfg

Thesaurus configuration values

kea.thesaurus.owl.file=/vocabulary/agrovoc_oaei2007.owl
kea.thesaurus.base.url=http://www.fao.org/aos/agrovoc
kea.thesaurus.tree.root=SELECT DISTINCT UID, TEXT FROM {UID} Y {OBJECT}, {UID} rdfs:label {TEXT} ; [rdfs:subClassOf {CLAZZ}] where not bound(CLAZZ) and lang(TEXT)="en" USING NAMESPACE foaf=<http://xmlns.com/foaf/0.1/>, dcterms=<http://purl.org/dc/terms/>, rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, owl=<http://www.w3.org/2002/07/owl#>, rdfs=<http://www.w3.org/2000/01/rdf-schema#>, skos=<http://www.w3.org/2004/02/skos/core#>, dc=<http://purl.org/dc/elements/1.1/>
kea.thesaurus.tree.childs=SELECT DISTINCT UID, TEXT FROM {UID} rdfs:subClassOf {CLAZZ}, {UID} rdfs:label {TEXT} where xsd:string(CLAZZ) = "RDFparentID" and lang(TEXT)="en" USING NAMESPACE foaf=<http://xmlns.com/foaf/0.1/>, dcterms=<http://purl.org/dc/terms/>, rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, owl=<http://www.w3.org/2002/07/owl#>, rdfs=<http://www.w3.org/2000/01/rdf-schema#>, skos=<http://www.w3.org/2004/02/skos/core#>, dc=<http://purl.org/dc/elements/1.1/>


KEA model configuration values

kea.thesaurus.skos.file=/vocabulary/ag_skos_20070219.rdf
kea.thesaurus.vocabulary.serql=SELECT X,UID FROM {X} skos:prefLabel {UID} WHERE lang(UID) ="en" USING NAMESPACE rdf=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, skos=<http://www.w3.org/2004/02/skos/core#>,rdfs=<http://www.w3.org/2000/01/rdf-schema#>,dc=<http://purl.org/dc/elements/1.1/>, dcterms=<http://purl.org/dc/terms/>, foaf=<http://xmlns.com/foaf/0.1/>
kea.model.file=/vocabulary/ag_skos_20070219.model
kea.stopwords.file=/vocabulary/stopwords_en.txt
kea.automatic.keyword.extraction.number=10
kea.automatic.keyword.extraction.restriction=on

kea.automatic.keyword.extraction.restriction is an optional parameter indicating that only words present in the thesaurus may be extracted.
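
Before restarting the server it can help to check that none of the KEA model keys were lost while editing the file. The following is a sketch under two assumptions: that OpenKM.cfg can be read as a standard Java properties file, and that every KEA model key except the restriction is required. KeaConfigCheck is a hypothetical helper written for this page, not an OpenKM class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of OpenKM): reports which KEA model keys
// from this guide are absent or empty in a properties-style configuration.
class KeaConfigCheck {

    // Assumption: all of these are required; only the restriction is optional.
    static final String[] EXPECTED = {
        "kea.thesaurus.skos.file",
        "kea.thesaurus.vocabulary.serql",
        "kea.model.file",
        "kea.stopwords.file",
        "kea.automatic.keyword.extraction.number"
    };

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java KeaConfigCheck <path to OpenKM.cfg>");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("Missing keys: " + missingKeys(parse(text)));
    }
}
```

An empty list means all five keys are present. The check does not verify that the referenced files exist or that the SeRQL queries are valid.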

Creating thesaurus

Log in to OpenKM as a user with administrator grants, go to the Administration tab and select the Generate Thesaurus option. Then select the "show level" and execute the "send" option.


Okm installation guide 008.jpeg


Please be patient: building the whole thesaurus takes some time. Depending on your hardware configuration (RAM), it could take some hours before the process finishes.


Okm installation guide 009.jpeg


After the thesaurus creation finishes, you can see the thesaurus folder representation in your desktop, as shown:


Okm installation guide 010.jpeg


Automatic key extraction in a newly uploaded document

Upload a new document, for example one of the documents from testdocs/en/train.

In your JBoss console, something like this will appear:


Okm installation guide 011.jpeg


The extracted keywords appear in your OpenKM UI as shown:


Okm installation guide 012.jpeg