Summarization of documents

Written by Ana Canteli on 14 September 2018

Much of the information we read today is the product of document summaries: headlines, meeting minutes, synopses of films and series, book editorials, weather forecasts, press releases, and many more. The impact of new technologies on the production and distribution of summaries must also be taken into account. Looking only at the internet, the amount of information produced every day in the form of documents, images, audio and video grows exponentially, and it is not possible to analyze all of it in its entirety. This volume of information is why it is especially appealing to determine how to summarize documents properly, in order to use that information appropriately and make better-informed decisions.

In electronic document management, summarization is a process that reduces a document, or a group of documents (multi-document summarization), to a set of words or paragraphs that conveys the main idea of the source.

The international standard ISO 214:1976, "Documentation - Abstracts for publications and documentation", indicates that a summary is the abbreviated and precise presentation of a document, without interpretation, criticism or express mention of the author of the summary. We can summarize a text, a picture, a video, an audio recording, online information or hypertexts, a file or a documentary series.

Writing a summary is easy; writing a good one is difficult. Quality matters: the better the summary, the more useful it will be in a document management system.

A good text summary should have the following characteristics:

  • Concision. Preliminary data or topics of common knowledge should be omitted.

  • Relevance. The summary must stick to the main message of the document, without omitting or interpreting the data.

  • Clarity and coherence. It must contain complete sentences, endowed with linear and global coherence.

  • Depth. It will be different depending on the type of summary or the various levels of detail that are pursued.

  • Linguistic consistency. A text summary must adapt to the linguistic guidelines in use and must take into account the morphological and syntactic rules of the language.

  • Chronological proximity between the publication of the original document and the summary. The time elapsed between the two should not be excessive, especially in scientific and technical areas.

In addition, according to ISO, a text summary has other uses: disseminating information, determining relevance, avoiding the need to read the full text of documents of secondary interest, and helping automated search:

  • Helps to determine relevance. A well-prepared summary enables readers to quickly and accurately identify the content of a document and decide whether to read it in its entirety.

  • Avoids reading the entire text. For documents of secondary interest, a well-prepared summary provides sufficient information. This saves the user time.

  • Helps in automated search. Summaries incorporated in catalogues or directories are very useful for:

    • Extracting index terms from the text, that is, indexing from the summary.

    • Searching for keywords that are not found in the title.

One of the solutions provided by natural language processing - a field of computer science, data science, artificial intelligence and linguistics that studies the interaction between computers and human language - has been automatic text summarization software, which works on texts, images, web pages or emails.

Summaries incorporated in document catalogues are very useful for several purposes: extracting indexing terms from the text, searching for keywords that do not appear in the title, serving as bibliometric control, and helping dissemination through alert services.

Document summarization is useful in two phases: in the entry phase, during the selection and acquisition processes that take place when documents are collected and integrated into the document management system; and in the exit phase, where it is an excellent instrument for information retrieval, for example through the search engine.

Summaries may be written by the author of the document, a specialist in the field, the publisher, a documentalist or a computer program (automatic text summarization). In business environments, however, the ideal would be to have one or two people specialized in categorization, indexing and summarization, so that the cataloguing of the document repository is uniform.

Summarization can be described as a set of diverse processes and techniques applied to a text, among which are:

  • The selection of what is important.

  • The omission of what is not.

  • Generalization from the particular to the general.

  • The identification of general or global structures.

There are two main approaches when carrying out the automatic summarization process:

  • The extractive approach, which selects subsets of existing words, phrases or sentences from the original text to form the summary.

  • The abstractive approach, which builds an internal semantic representation and then uses natural language generation methods to create a summary that is close to what a human could write.
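As a rough illustration of the extractive approach, the sketch below scores each sentence by the average frequency of its words in the whole text and keeps the best-scoring sentences in their original order. The function names, the tiny stop-word list and the scoring rule are assumptions made for this example; they are not taken from any particular summarization product.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use larger,
# language-specific lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "it", "that", "on", "for"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]


def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

Because the output is built only from sentences that already exist in the source, this is extraction, not abstraction: nothing is rephrased or generated.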

Also, within the literature, two particular types of automatic text summarization stand out and are often used. One is the extraction of key phrases, whose objective is to select individual words or phrases to label a document. The other is document summarization, where the goal is to select whole sentences to create a short summary paragraph.

There are also different types of summaries, depending on the approach the automatic summarization tool takes. Summaries based on relevance to a query (query-relevant summaries) and multi-document summaries (generated by multi-document summarization) stand out.

For example, let's imagine that we have summarization software containing an algorithm that extracts keywords from a text. The document may already include prominent keywords as tags, but this is not usually the case. To select which words are important enough to be considered keywords, we can rely on a thesaurus - a controlled dictionary of terms. Any thesaurus term that appears in the text is regarded as a key term and becomes part of the automatic text summary. To improve the performance of natural language processing, we work not only with dictionaries of terms but also with synonyms. Algorithms can also use other logic to detect keywords, for example the number of times a term appears: the more often a word is repeated, the more significant it becomes relative to the rest of the terms in the text. Another complementary logic could be the position of the term within the text: if it appears in the first paragraph, that condition makes it a keyword.
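The keyword logic described above (thesaurus lookup, synonyms, frequency and position) can be sketched as follows. The thesaurus contents, the function name `extract_keywords` and the size of the first-paragraph bonus are hypothetical values chosen for illustration, and the sketch does no stemming, so plural forms are not matched.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
THESAURUS = {
    "invoice": {"bill"},
    "contract": {"agreement"},
    "summary": {"abstract", "synopsis"},
}

# Reverse lookup: any accepted word form -> preferred term.
LOOKUP = {term: term for term in THESAURUS}
for term, synonyms in THESAURUS.items():
    for synonym in synonyms:
        LOOKUP[synonym] = term


def extract_keywords(text, top_n=3, first_paragraph_bonus=2):
    """Rank thesaurus terms by frequency, with a bonus for early position."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    scores = Counter()
    early_terms = set()
    for index, paragraph in enumerate(paragraphs):
        for word in re.findall(r"[a-z]+", paragraph.lower()):
            term = LOOKUP.get(word)
            if term is None:
                continue  # word is not in the controlled vocabulary
            scores[term] += 1  # frequency: each occurrence counts once
            if index == 0:
                early_terms.add(term)
    # Position: terms seen in the first paragraph get a one-off bonus.
    for term in early_terms:
        scores[term] += first_paragraph_bonus
    return [term for term, _ in scores.most_common(top_n)]
```

In this sketch "bill" is counted under the preferred term "invoice", which is how the synonym list widens coverage without adding new keywords to the output.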

Apart from the frequency and position of the keywords, the algorithms can also take into account, as input for machine learning, the relationship with other terms. Summarization techniques can work on unigrams (a single word), bigrams (2 words) or trigrams (3 words), which can lead to a more coherent selection of relevant keywords for the summary, since words that appear close to one another are considered to be significantly related and to "recommend" each other. Additional machine learning conditions can be added; for example, whether a three-word key phrase (trigram) begins with a capitalized word.
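A minimal sketch of how unigram, bigram and trigram candidates might be generated and counted is shown below, together with the capital-letter condition. The function names are assumptions for this example, and the tokenizer is deliberately simple (ASCII letters only, no phrases crossing a sentence boundary).

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_phrases(text, max_n=3):
    """Count 1- to max_n-word phrases, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def starts_capitalized(phrase):
    """Extra condition: does the phrase begin with a capital letter?"""
    return phrase[0][:1].isupper()
```

For instance, `candidate_phrases(text)[("Document", "management", "systems")]` gives the number of times that trigram occurs. With `max_n=3` no candidate longer than three words is ever produced, so a relevant four-word phrase cannot be selected.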

Summarization techniques are part of machine learning and data mining. When an algorithm is trained to recognize a pattern, that pattern can also become a limitation. If, for example, we prepare the summarization algorithm to detect key phrases of 3 words, it will ignore phrases composed of 4 or more words, even if they are relevant.

Summarization systems and applications are built to obtain summaries faster: they can process more documents than a human. Therefore they are more productive and cheaper, although the quality of the results is not optimal. The best summarization processes at the moment are not automatic but manual: a person reads the document and, thanks to their linguistic and subject knowledge, summarizes its content. That person can, however, count on the support of applications that help to perform automatic summarization tasks. Supervised machine learning in summarization tools makes it possible to provide models as examples that show the system the most suitable summarization techniques, so that it compares its results with the models. The supervisor discards the incorrect keywords, and the summarization algorithm learns. This is, in short, how KEA (Keyphrase Extraction Algorithm), available in the OpenKM document management system, works; it can be executed manually or automatically.
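The supervision loop, in which a person discards incorrect keywords and the system learns from those decisions, can be illustrated with a deliberately simple sketch. It is not how KEA is implemented: the class `KeywordFeedback`, its methods and the 0.5 threshold are hypothetical, and it only remembers decisions per keyword instead of learning general features.

```python
from collections import defaultdict


class KeywordFeedback:
    """Learns from supervisor decisions which candidate keywords to keep."""

    def __init__(self):
        self.accepted = defaultdict(int)  # keyword -> times kept
        self.rejected = defaultdict(int)  # keyword -> times discarded

    def record(self, keyword, keep):
        """Store one supervisor decision for a keyword."""
        if keep:
            self.accepted[keyword] += 1
        else:
            self.rejected[keyword] += 1

    def keep_probability(self, keyword):
        """Laplace-smoothed share of 'keep' decisions (0.5 if never seen)."""
        kept = self.accepted[keyword]
        dropped = self.rejected[keyword]
        return (kept + 1) / (kept + dropped + 2)

    def filter(self, candidates, threshold=0.5):
        """Keep candidates the supervisor has not predominantly rejected."""
        return [k for k in candidates
                if self.keep_probability(k) >= threshold]
```

After a supervisor rejects "page" twice and accepts "invoice" once, `filter(["invoice", "page", "contract"])` drops "page" and keeps the other two, since the unseen "contract" starts at the neutral value of 0.5.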
