Difference between revisions of "Creating automatic key extraction training files"

From OpenKM Documentation

Revision as of 08:26, 30 September 2010

Creating training files is easy: you only need to create pairs of files that KEA will use to build its key extraction model.


The main file analyzed by KEA must be a plain text file, for example foo.txt (PDF, DOC, RTF or other file types must first be converted to txt). Each foo.txt file must have a matching foo.key file. The foo.key file contains the keys that identify the document, and those keys must be present in your thesaurus.

Example of foo.key

 AMARANTHUS
 PLANT PRODUCTION
 GEOGRAPHICAL DISTRIBUTION
 NUTRITIVE VALUE
 SEEDS
 MERCHANTS
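
As an illustration of the pairing rule above, here is a minimal Python sketch that writes a foo.key file next to foo.txt, one key per line as in the example, and refuses keys that are not in the thesaurus. The function name write_key_file and the idea of passing the thesaurus as a plain list of terms are assumptions made for this example; they are not part of KEA or OpenKM.

```python
from pathlib import Path


def write_key_file(txt_path, keys, thesaurus_terms):
    """Write <name>.key next to <name>.txt, one key per line.

    Hypothetical helper: raises ValueError if any key is missing
    from thesaurus_terms (exact match after trimming whitespace).
    """
    txt_path = Path(txt_path)
    allowed = {term.strip() for term in thesaurus_terms}
    cleaned = [key.strip() for key in keys if key.strip()]

    unknown = [key for key in cleaned if key not in allowed]
    if unknown:
        raise ValueError(f"keys not in thesaurus: {unknown}")

    # foo.txt -> foo.key, in the same directory
    key_path = txt_path.with_suffix(".key")
    key_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return key_path
```

Checking the keys before writing catches a key that is missing from the thesaurus before the file reaches the training folder.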


Both files, along with all the other file pairs, must be placed in the same directory. The path of that directory is what KEA uses to create the model. Take a look at Automatic_key_extraction_full_example and at how the application uses the trainingFolder parameter to create the KEA model.


You need a significant number of document pairs to build a good key extraction model. About 100 files or more (depending on how large your thesaurus is, among other factors) is a good size to start with.
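
Before building the model, it can help to check that the training folder really contains complete pairs and enough of them. The following Python sketch does that; check_training_folder and its minimum_pairs default of 100 are assumptions for illustration (the default simply mirrors the rough figure given above) and are not part of KEA or OpenKM.

```python
from pathlib import Path


def check_training_folder(folder, minimum_pairs=100):
    """Report complete and incomplete txt/key pairs in a folder.

    Hypothetical helper: only file names are compared, the
    contents of the files are not inspected.
    """
    folder = Path(folder)
    txt_names = {p.stem for p in folder.glob("*.txt")}
    key_names = {p.stem for p in folder.glob("*.key")}

    pairs = sorted(txt_names & key_names)
    return {
        "pairs": pairs,                                  # foo.txt and foo.key both present
        "missing_key": sorted(txt_names - key_names),    # foo.txt without foo.key
        "missing_txt": sorted(key_names - txt_names),    # foo.key without foo.txt
        "enough": len(pairs) >= minimum_pairs,
    }
```

Files listed under missing_key or missing_txt should be completed or removed before the folder is passed as the training folder.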


We suggest you take a look at the KEA project to see how these files are defined in the training folder: http://www.nzdl.org/Kea


How to optimize the model

The KEA model is a living thing. The idea is that users tune the KEA model in OpenKM. To do this, we suggest creating a metadata field indicating that a user has validated a document's keys (a flag marking the documents that can be used to create a new model). After some time, you can write a small application that extracts the relevant documents (txt files can easily be created with the OpenOffice conversion) together with their key files (the keywords assigned to the documents).
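
The small export application described above could look roughly like the following Python sketch. It assumes the documents have already been read from the repository and converted to text, and that each one is represented as a dictionary with the hypothetical fields name, text, keywords and validated. No OpenKM API is called here; how the documents and the validation flag are fetched is left open.

```python
from pathlib import Path


def export_training_pairs(documents, training_folder):
    """Write a txt/key pair for every validated document.

    Hypothetical helper: 'documents' is an iterable of dicts with
    the assumed fields name, text, keywords and validated.
    Returns the base names of the pairs that were written.
    """
    out = Path(training_folder)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for doc in documents:
        # Skip documents the user has not validated or that have no keywords
        if not doc.get("validated") or not doc.get("keywords"):
            continue
        stem = Path(doc["name"]).stem
        (out / f"{stem}.txt").write_text(doc["text"], encoding="utf-8")
        (out / f"{stem}.key").write_text(
            "\n".join(doc["keywords"]) + "\n", encoding="utf-8"
        )
        written.append(stem)
    return written
```

Running such an export periodically and rebuilding the model from the resulting folder is one way to let validated keywords feed back into the model.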


As your repository grows, your KEA model will become more efficient.