Graphs in clusters: a hybrid approach to unsupervised extractive long document summarization using language models

  • Open access
  • Published: 29 June 2024
  • Volume 57, article number 189 (2024)

  • Tuba Gokhan 1,
  • Malcolm James Price 2 &
  • Mark Lee 2

Effective summarization of long documents is a challenging task. When addressing this challenge, Graph and Cluster-Based methods stand out as effective unsupervised solutions. Graph-Based Unsupervised methods are widely employed for summarization due to their success in identifying relationships within documents. Cluster-Based methods excel in minimizing redundancy by grouping similar content together before generating a concise summary. Therefore, this paper merges Cluster-Based and Graph-Based methods by applying language models for Unsupervised Extractive Summarization of long documents. The approach simultaneously extracts key information while minimizing redundancy. First, we use BERT-based sentence embeddings to create sentence clusters using k-means clustering and select the optimum number of clusters using the elbow method to ensure that sentences are categorized based on their semantic similarities. Then, the TextRank algorithm is employed within each cluster to rank sentences based on their importance and representativeness. Finally, the total similarity score of the graph is used to rank the clusters and eliminate less important sentence groups. Our method achieves comparable or better summary quality and reduced redundancy compared to both individual Cluster-Based and Graph-Based methods, as well as other supervised and Unsupervised baseline models across diverse datasets.

1 Introduction

Summarization systems strive to achieve a core objective: generating concise and compact summaries that encapsulate the central themes of source documents while minimizing redundancy (Radev et al. 2002; Vilca and Cabezudo 2017). They are categorized into two primary approaches: Extractive Summarization and Abstractive Summarization (Nallapati et al. 2016). In Extractive Summarization, salient sentences are chosen directly from the original document. Abstractive Summarization, on the other hand, constructs the summary using words or phrases that may differ from those in the source.

Supervised methods have proven successful in summarization. However, their effectiveness hinges on the availability of large-scale, human-generated summaries, which are costly and challenging to obtain. Moreover, these methods encounter difficulties in summarizing long documents due to limitations in input length Zheng and Lapata ( 2019 ).

Unsupervised methods have the potential to alleviate the challenges faced by supervised methods, especially when dealing with long documents. Graph-Based and Cluster-Based methods are widely used for Unsupervised Summarization. Graph-Based methods represent sentences as nodes and assign weights to the edges between nodes based on the similarity of sentences (Erkan and Radev 2004; Liu et al. 2021). The summarization task becomes a node selection process, where nodes (i.e., sentences) are chosen for inclusion in the final summary based on their centrality scores. Although these methods have achieved success, they often produce summaries that include redundant sentences, because similar sentences receive similar centrality scores. Clustering-based methods, on the other hand, group similar sentences or documents into clusters and select representative sentences from each cluster to compose the summary (Qazvinian and Radev 2008; Pawar et al. 2022). Each cluster represents a distinct topic within the document, with high internal similarity and lower similarity to components in other clusters. This approach helps reduce redundancy by including only one representative sentence from each cluster.

This study focuses on Unsupervised approaches for Extractive Summarization of long documents and offers a hybrid solution that combines the strengths of both Graph-Based and clustering-based methods. First, we create sentence embeddings using pre-trained language models to represent the sentences in the document. These embeddings capture the semantic meaning of the sentences, enabling efficient similarity comparisons. Secondly, we determine the optimal number of clusters based on the sentence embeddings, ensuring that similar sentences are grouped together while minimizing redundancy in the final summary. Thirdly, within each cluster, we employ Graph-Based methods to select the most important representative sentence. This step ensures that the essential information from each cluster is included in the summary. Lastly, we rank the clusters to identify the most significant ones, thereby reducing their collective count. In long documents, the number of clusters may exceed the desired number of sentences in the summary. To overcome this challenge we add another stage and assign a significance value to each cluster using sentence embeddings. By doing so, we can identify and prioritize clusters that contain the most crucial sentences for inclusion in the final summary. This approach removes similar sentence groups that may not add significant value to the summary.

To assess the efficacy of our proposed method, we conduct analyses using two datasets of scientific papers: arXiv and PubMed. We perform a comparative study between our hybrid approach and several existing Unsupervised text Summarization methods. To evaluate performance, we use the well-established ROUGE metric and several redundancy metrics. Across both datasets, our results show a notable improvement in summarization quality compared to other methods. These findings underscore the potential of our hybrid approach to enhance the extraction of essential information from diverse document types. By effectively combining the benefits of Cluster-Based and Graph-Based methods, our approach offers a promising avenue for advancing Unsupervised text summarization techniques.

The paper is organized as follows: Sect.  2 reviews related work in the field of Text Summarization. Section  3 presents the methodology of our proposed hybrid approach. In Sect.  4 , we describe the experimental setup and present the results and analysis. Section  5 discusses the findings and implications of our study, highlighting the contributions and limitations. Finally, Sect.  6 concludes the paper and outlines future research directions.

2 Related work

2.1 Cluster based methods

Clustering-based methods are a class of Unsupervised Summarization techniques used for grouping topics. In these methods, similar sentences or documents are grouped into clusters, and a representative sentence is selected from each cluster to create the summary. Each cluster represents a distinct topic within the document. While the components of a cluster share a high degree of similarity with each other, they are less similar to components belonging to other clusters. The underlying idea behind this approach is that redundant or overlapping information is typically present in similar sentences or documents, and selecting one representative sentence from each cluster helps ensure that multiple sentences conveying similar information are not included in the summary.

The clustering process begins by representing the sentences as feature vectors. A feature vector is a mathematical representation of the sentence, where each dimension corresponds to a feature or characteristic of the sentence. The features can be simple, such as word frequencies or TF-IDF scores, or more complex, such as semantic embeddings or syntactic structures. Various techniques can be applied to create these feature vectors, such as Word2Vec, GloVe, and BERT.

Table 1 illustrates that cluster-based methods are common in multi-document summarization because they can group distinct topics into individual clusters. We argue that clustering-based methods are also effective for summarizing long documents, because long documents frequently cover multiple topics. In our work, we create the feature vectors with SentenceBERT models, an advanced technique for generating sentence embeddings, and use them as input to clustering.

2.2 Graph based methods

Graph-Based methods are widely employed in Unsupervised Extractive Summarization due to their ability to represent text as a graph in a flexible and efficient way. This representation helps the algorithm understand the overall structure and flow of the text, which is important for making effective summaries. Within our research, we examine several Graph-Based methods, as detailed in Table 2 .

TextRank Mihalcea and Tarau ( 2004 ) and LexRank Erkan and Radev ( 2004 ) emerge as notable early contributions within the realm of Graph-Based Extractive Summarization. The primary difference between these algorithms is in their methods for calculating sentence similarity (as indicated in Table 2 ) and the generation of graph edges. In the TextRank approach, sentences are represented as nodes within an undirected graph, and edge weights are computed based on the similarity of sentence occurrences. The LexRank algorithm introduces a pruning threshold during the initialization of the graph, leading to the removal of edges with lower weights.

More recent methods include PacSum (Zheng and Lapata 2019), which constructs a directed graph by using BERT (Devlin et al. 2019) to calculate sentence similarities. Within the constructed directed graph of PacSum, edges reflect the relative positioning of sentences within the document. STAS (Xu et al. 2020) involves the pre-training of a hierarchical transformer model using unlabeled documents. This model is then employed for ranking sentences through sentence-level self-attention mechanisms and pre-training objectives. Liu et al. (2021) introduce a Graph-Based approach that leverages both similarities and relative distances within the local context of each sentence. They extend their methodology from single-document summarization to a multi-document context by incorporating document-level graphs through proximity-based cross-document edges. Liang et al. (2021) propose FAR, a facet-aware centrality-based model. They introduce a modified Graph-Based ranking strategy to filter out irrelevant sentences by utilizing sentence-document similarity. Furthermore, Dong et al. (2021) present a long document summarization technique named HipoRank. This approach enhances the assessment of sentence centrality by integrating directionality and hierarchy into the graph structure, using boundary positional functions and hierarchical topic information grounded in discourse structure. Gokhan et al. (2022) propose GUSUM, a node-weighted graph model for document summarization that employs sentence feature scoring methods to define node weights. Similarly, NoWRank (Gokhan et al. 2023) employs node weights for long document summarization, introducing a novel ranking approach.

3 The GrinCH: overview and methodology

In this section, we present the methodology of our proposed approach, GrinCH (Graphs in Clusters: A Hybrid Approach to Unsupervised Extractive Long Document Summarization using Language Models). Footnote 1

figure 1

Overview of GrinCH selecting a 3-sentence summary from a 19-sentence corpus. The figure illustrates the process of dividing the corpus into 5 distinct clusters, each represented by a large circle. Within each cluster, sentences are depicted in a graph structure, with individual sentences represented as nodes (small circles) and their relationships indicated by edges. The edges are formed based on sentence relations identified by SentenceBERT models. The sentences selected by the TextRank algorithm are depicted as black nodes. Among the clusters, three are highlighted with dark circles, signifying their significance. Specifically, Sentence 2 (S2), Sentence 5 (S5), and Sentence 15 (S15) are selected for inclusion in the generated summary.

This approach encompasses three distinct stages. The first stage involves clustering sentences using pre-trained language models and determining the optimal number of clusters. This step ensures that sentences are grouped in a manner that captures their inherent similarities within the given document. In the second stage, a Graph-Based ranking algorithm is employed within each cluster to rank sentences based on their importance and representativeness. This is used to identify the most important sentence within each cluster. In the third stage, clusters are ranked to identify and eliminate less important sentence groups (Fig.  1 ).

The forthcoming sections provide a detailed description of each stage. By adopting this hybrid strategy, the objective is to enhance the extraction of essential information from long documents while minimizing redundancy.

3.1 Clustering using pre-trained language models

3.1.1 Pre-trained language models with SentenceBERT

In our study, we vectorize the sentences using SentenceBERT (Reimers and Gurevych 2019 ). To do this, we consider 3 language models (Table 3 ) which are compared in an ablation study.

The language models nli-distilroberta-base-v2 , all-mpnet-base-v2 , and all-distilroberta-v1 are all based on the Transformer architecture and employ masked language modeling during pre-training. They generate sentence embeddings that capture the contextual meaning of sentences and are well-suited for a wide range of downstream NLP tasks.

The nli-distilroberta-base-v2 model, with 66 million parameters, is particularly useful for tasks like sentiment analysis, text classification, named entity recognition, and question-answering. It provides high-quality sentence embeddings, making it suitable for tasks that require understanding the overall context and meaning of sentences.

With 147 million parameters, the all-mpnet-base-v2 model offers advanced capabilities for tasks like text generation, language translation, text summarization, and text classification. It is beneficial for tasks that involve generating natural language text or summarizing large volumes of text efficiently.

The all-distilroberta-v1 model, with 134 million parameters, stands out due to its domain-specific pre-training. It is tailored to perform well on domain-specific NLP tasks, such as text classification and NER. This model is particularly useful when working with text data from specific domains where domain-specific knowledge and terminology play a crucial role.

Each of the 3 models boasts strong representation capabilities. The ablation study reveals their nearly equivalent performance; however, nli-distilroberta-base-v2 exhibits slightly better results.

3.1.2 Determining the optimal number of clusters with K-means

For clustering, we employ the k-means algorithm. k-means groups similar data points into clusters based on their similarity to a cluster centroid (Hartigan and Wong 1979). We use the sentence embeddings generated by SentenceBERT as input data for k-means clustering (Algorithm 1). We use the elbow method (Nainggolan et al. 2019) to determine the optimal number of clusters (Fig. 2).

figure a

Optimal Number of Clusters with k -means and Elbow Method

figure 2

Elbow curve for an 89-sentence corpus, depicting the relationship between the number of clusters and the inertia. The graph shows that the optimum cluster number, determined with k-means and the elbow method, is 21.
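The elbow selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the inertia values have already been obtained by running k-means for each candidate k (for example, from the `inertia_` attribute of scikit-learn's `KMeans`), and it locates the elbow as the point farthest from the straight line joining the two ends of the curve. That geometric rule and the function name `find_elbow` are assumptions; the paper does not state how the elbow is detected.

```python
import math


def find_elbow(ks, inertias):
    """Return the k whose (k, inertia) point lies farthest from the
    straight line joining the first and last points of the curve."""
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs")
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    k_span = (ks[-1] - ks[0]) or 1
    i_min, i_max = min(inertias), max(inertias)
    i_span = (i_max - i_min) or 1
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - i_min) / i_span for v in inertias]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0) or 1.0
    best_k, best_dist = ks[0], -1.0
    for k, x, y in zip(ks, xs, ys):
        # Perpendicular distance from (x, y) to the end-to-end line.
        dist = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Toy inertia curve (made-up numbers): the drop flattens after k = 3.
ks = [1, 2, 3, 4, 5, 6]
inertias = [100.0, 50.0, 20.0, 18.0, 17.0, 16.0]
elbow_k = find_elbow(ks, inertias)
```

Rescaling both axes before measuring distances keeps the result independent of the absolute magnitude of the inertia, which varies with document length and embedding dimension.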

3.2 Graph-based ranking within clusters

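We employ the TextRank (Mihalcea and Tarau 2004) algorithm to identify the most important sentence within each cluster. The algorithm ranks sentences with the weighted TextRank update, written here for an undirected graph:

\[ S(v_i) = (1 - d) + d \sum_{v_j \in N(v_i)} \frac{w_{ji}}{\sum_{v_k \in N(v_j)} w_{jk}} \, S(v_j) \tag{1} \]

where V is the set of nodes (sentences) in the graph, with \(v_i, v_j, v_k \in V\); E is the set of edges between nodes, established through similarity scores; \(w_{ji}\) is the weight of the edge between \(v_j\) and \(v_i\); \(N(v)\) is the set of nodes that share an edge with v; \(S(v)\) is the importance score of node v; and d is the damping factor, usually set between 0.8 and 0.85.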

This equation captures the iterative nature of the TextRank algorithm, where the importance score of a sentence node is determined by a combination of its neighbors’ scores and the damping factor (See Algo. 2 ). In our approach, we calculate sentence similarity by using the cosine similarity among sentence embeddings previously generated within the relevant clusters.

figure b

TextRank in clusters for sentence selection
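A hedged sketch of the within-cluster ranking step is given below. It assumes each sentence in a cluster is already represented by an embedding vector (plain Python lists here), builds an undirected graph whose edge weights are cosine similarities, and iterates the weighted TextRank update of Mihalcea and Tarau (2004) until the scores stabilise. Clipping negative similarities to zero, the convergence tolerance, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def textrank_scores(embeddings, d=0.85, tol=1e-6, max_iter=200):
    """Weighted TextRank scores for the sentences of one cluster."""
    n = len(embeddings)
    if n == 0:
        return []
    # Edge weights: cosine similarity, clipped at zero (assumption), no self-loops.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = max(0.0, cosine(embeddings[i], embeddings[j]))
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to w[j][i].
            rank = sum(w[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def top_sentence(embeddings, d=0.85):
    """Index of the highest-scoring sentence in the cluster."""
    scores = textrank_scores(embeddings, d)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy cluster: the middle vector is similar to both of the others.
toy_cluster = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_scores = textrank_scores(toy_cluster)
toy_best = top_sentence(toy_cluster)
```

In the toy cluster the middle sentence is selected because it is the only one connected to both of the others, which is the behaviour expected of a representative sentence.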

3.3 Ranking clusters

Figure  3 illustrates the distribution of optimal cluster numbers among 100 documents from the arXiv dataset (Datasets are detailed in Sec.  4.1 ). These documents demonstrate an average optimal cluster count of 30, while the average sentence count in the gold standard summaries of the arXiv dataset is 7.

figure 3

Distribution of Optimum Cluster Numbers for Randomly Chosen 100 Documents from the arXiv Dataset

To identify the most significant clusters, we assign values to individual clusters based on the weights of the edges present within the graphs of those clusters. The summation of these edge weights yields the graph weight, which subsequently defines the total weight of the cluster. We then select the clusters with the highest weights for inclusion in the summary, ensuring that the number of sentences equals the average count found in the gold-standard summaries for that dataset.
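The cluster ranking step can be illustrated with the short sketch below. It assumes each cluster is given as a list of sentence embeddings, takes the cluster weight to be the sum of pairwise cosine similarities (the edge weights of the cluster graph), and keeps the heaviest clusters. The dictionary layout, the clipping of negative similarities, and the names `cluster_weight` and `select_clusters` are hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_weight(embeddings):
    """Sum of the edge weights (pairwise similarities) of one cluster graph."""
    total = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            # Negative similarities are clipped to zero (assumption).
            total += max(0.0, cosine(embeddings[i], embeddings[j]))
    return total


def select_clusters(clusters, summary_len):
    """Return the ids of the `summary_len` clusters with the largest weight.

    `clusters` maps a cluster id to the list of its sentence embeddings.
    """
    ranked = sorted(clusters, key=lambda cid: cluster_weight(clusters[cid]),
                    reverse=True)
    return ranked[:summary_len]


# Toy data: cluster "a" is large and tight, "b" is loose, "c" is in between.
toy_clusters = {
    "a": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "b": [[1.0, 0.0], [0.0, 1.0]],
    "c": [[1.0, 1.0], [1.0, 0.0]],
}
kept = select_clusters(toy_clusters, 2)
```

Because the weight is a plain sum over edges, larger and more cohesive clusters rank higher, and a single-sentence cluster has weight zero. The summary is then formed from the top-ranked sentence of each kept cluster.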

figure c

Grinch Summarization Algorithm (Part 1)

figure d

Grinch Summarization Algorithm (Part 2)

4 Experimental results

4.1 Datasets

In our study, we investigate datasets categorized by the length of the documents. We use two widely used benchmark datasets of long scientific papers, each with distinct characteristics (Table 4).

ArXiv and PubMed are datasets of scientific papers collected from the arXiv and PubMed repositories, respectively (Cohan et al. 2018). They are among the earliest datasets used for large-scale long document summarization research, and they use abstracts of articles as the gold standard summaries. For our experiments, we use the respective test sets for each dataset taken from the Hugging Face dataset library. Footnote 2

4.2 Automated evaluation results

We use ROUGE (Lin and Hovy 2003) as our first automated evaluation of the summaries generated by different models. We report F1-based ROUGE-1, ROUGE-2, and ROUGE-L scores on both datasets, calculated using the py-rouge package. Footnote 3 In Table 5, we compare our approach with other methods for summarizing long documents.

The first block of Table 5 includes upper bound and baseline methods. Oracle (Nallapati et al. 2017) uses a greedy algorithm to create an Oracle summary for each document (Zheng and Lapata 2019). It tries out different combinations of sentences and selects a set that gets the highest ROUGE score against the gold standard summary. Lead, TextRank (Mihalcea and Tarau 2004), and LexRank (Erkan and Radev 2004) are strong baseline methods. Lead simply picks the first sentences from the document to create a summary. In the next block of the table, we consider supervised neural Extractive methods for summarization. We compare our method with SummaRuNNer (Nallapati et al. 2017), GlobalLocalCont (Xiao and Carenini 2019), and Sent-PTR (Pilault et al. 2020). The third block covers Unsupervised Graph-Based Extractive methods (see Sec. 2.2). Finally, we present the performance of our GrinCH method.
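To make the Oracle and Lead baselines concrete, the sketch below shows one way a greedy oracle can be built: sentences are added one at a time, each time choosing the sentence that most improves the score against the gold summary, and stopping when no sentence helps. For brevity it scores candidates with a simplified unigram-overlap ROUGE-1 F1 on whitespace tokens instead of the py-rouge package, so the scoring function, tokenisation, and function names are assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, reference, max_sentences=None):
    """Greedily pick sentence indices that maximise ROUGE-1 F1 against `reference`."""
    ref_tokens = reference.lower().split()
    sent_tokens = [s.lower().split() for s in sentences]
    selected, best_score = [], 0.0
    limit = max_sentences or len(sentences)
    while len(selected) < limit:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the current selection plus sentence i, in document order.
            tokens = [t for j in sorted(selected + [i]) for t in sent_tokens[j]]
            score = rouge1_f1(tokens, ref_tokens)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected)


def lead(sentences, k):
    """Lead baseline: the first k sentences of the document."""
    return sentences[:k]


doc = ["graphs rank sentences", "clusters reduce redundancy", "the weather is nice"]
gold = "graphs rank sentences and clusters reduce redundancy"
oracle_idx = greedy_oracle(doc, gold)
lead_summary = lead(doc, 2)
```

The greedy search stops as soon as adding any remaining sentence would lower the score, so the off-topic third sentence in the toy document is never selected.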

Table 5 shows that our method outperforms all baselines by wide margins in terms of ROUGE-1, ROUGE-2, and ROUGE-L (arXiv: R-1 +8.86, R-2 +1.69, R-L +9.64; PubMed: R-1 +6.61, R-L +7.07). In addition, GrinCH attains results competitive with supervised models that require hundreds of thousands of training instances, and surpasses all Extractive models in terms of R-L (arXiv: R-L +0.56; PubMed: R-1 +0.95, R-L +2.19). Our method also performs similarly to recent Unsupervised Graph-Based techniques (PubMed: R-1 +0.82).

4.3 Redundancy analysis

To assess redundancy, we use a range of metrics and systematically compare the redundancy levels between the gold standard summaries and our summaries across two datasets (Table 6). We start by following the approach of Bommasani and Cardie (2020). In their research, they quantify redundancy by calculating the average ROUGE-L F-score across all distinct pairs of sentences within the summary. For a summary with sentence set S, redundancy is formulated as:

\[ \mathrm{RED}(S) = \underset{(x, y) \in S \times S,\ x \neq y}{\operatorname{mean}} \ \text{ROUGE-L}(x, y) \tag{2} \]

Peyrard et al. (2017) introduce another redundancy metric centered around the recurrence of n-grams within a summary. This metric is the ratio of unique n-grams to the total number of n-grams present in the summary:

\[ \text{n-gram ratio} = \frac{\text{number of unique n-grams}}{\text{total number of n-grams}} \tag{3} \]

In our research, we adopt this approach, referred to as the n-gram ratio (Xiao and Carenini 2020). This metric gauges the uniqueness of n-grams; lower values indicate a higher degree of redundancy within the document.
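The two redundancy measures described above can be sketched in a few lines. The example assumes whitespace tokenisation and lower-casing, computes the ROUGE-L F-score from the longest common subsequence of each sentence pair, and counts n-grams over the concatenated summary tokens. These preprocessing choices and the function names are assumptions; the original implementations may differ in tokenisation and in how n-grams are counted across sentence boundaries.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(a, b):
    """ROUGE-L F-score between two token lists."""
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def rl_redundancy(sentences):
    """Mean ROUGE-L F-score over all distinct sentence pairs (higher = more redundant)."""
    tokenised = [s.lower().split() for s in sentences]
    pairs = list(combinations(tokenised, 2))
    if not pairs:
        return 0.0
    return sum(rouge_l_f1(x, y) for x, y in pairs) / len(pairs)


def ngram_ratio(sentences, n):
    """Unique n-grams divided by total n-grams (lower = more redundant)."""
    tokens = [t for s in sentences for t in s.lower().split()]
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


summary = ["the cat sat", "the cat sat", "a dog ran"]
red = rl_redundancy(summary)
uni = ngram_ratio(summary, 1)
bi = ngram_ratio(summary, 2)
```

Because the F-score is symmetric in its two arguments, averaging over unordered pairs gives the same value as averaging over all ordered pairs with distinct sentences.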

In addition to these metrics, we use our own redundancy metric. It uses sentence embeddings to calculate the similarity between sentences within the documents themselves. This allows us to quantify redundancy from a different perspective.

Using our redundancy evaluation method and the R-L Redundancy metric, GrinCH exhibits significant improvement in reducing redundancy when compared to gold standard summaries (Our Method: arXiv \(-\) 0.02, PubMed \(-\) 0.03; R-L Redundancy: arXiv \(-\) 0.04, PubMed \(-\) 0.03). Furthermore, GrinCH demonstrates substantial advancement in redundancy reduction compared to GUSUM and NoWRank (Our Method: arXiv \(-\) 0.13, PubMed \(-\) 0.15; R-L Redundancy: arXiv \(-\) 0.05, PubMed \(-\) 0.05).

Regarding the n-gram ratio, GrinCH surpasses gold standard summaries in terms of unique word usage (UniGram: PubMed +0.02; BiGram: arXiv +0.01, PubMed +0.02; TriGram: arXiv +0.01, PubMed +0.01). Additionally, GrinCH shows improvement in redundancy reduction compared to GUSUM and NoWRank (UniGram: arXiv +0.08, PubMed +0.07; BiGram: arXiv +0.06, PubMed +0.07; TriGram: arXiv +0.04, PubMed +0.05).

4.4 Ablation study

For the ablation study, we use the PubMed test set. We follow a step-by-step deconstruction process, where we analyze the effects of excluding key components: Without Clustering, Without TextRank, and Without Cluster Ranking. Additionally, we explore the impact of varying parameters, particularly by altering the embedding models. The results are presented in Table 7 and Table 8.

In our ablation study, we found that the ROUGE metric highlights clear differences in how each of the three components affects the quality of summarization. This distinction is especially noticeable when looking at R-1. On the other hand, the choice of sentence embedding models has only a modest effect.

Table 8 shows a considerable reduction in redundancy for all metrics, validating the effectiveness of our approach. However, the various sentence embedding models yield similar results in terms of redundancy.

figure 4

The sentence positions of extracted sentences for each model applied to the PubMed dataset. Each dot on the graph represents a selected sentence for a generated summary. The x-axis shows the 6658 documents from PubMed ordered by document length (calculated by sentence count for the document) from lowest sentence count on the left to highest sentence count on the right. The y-axis shows the position of each extracted sentence in the original document with the beginning of the document being at the bottom and the end of the document at the top

4.5 Sentence distribution comparison

Figure 4 shows the sentence positions in the source document for Extractive summaries generated by different models on the PubMed test set.

In Oracle, the distribution of the positions of the selected sentences in the original document differs depending on document length. In shorter documents, selected sentences are uniformly distributed in the original document, with sentences in early, middle, and later sections of the document being roughly equally likely to be selected for the summary. However, in longer documents we see from the increasing white space in the centre of the y-axis that sentences in earlier and later sections are chosen more often than sentences in the middle. Both Lead and PacSum select sentences only from towards the beginning of documents, with this pattern becoming more pronounced as the document length increases. HipoRank's sentence distribution is similar to Oracle, except that in longer documents the pattern of choosing earlier and later sentences is even more pronounced. NoWRank and GrinCH both exhibit a consistent and uniform distribution of selected sentences across documents of all lengths.

5 Discussion

We introduce a novel method for long document Extractive Summarization. Our approach combines cluster and Graph-Based techniques, resulting in a hybrid strategy. The experimental results indicate that our method achieves comparable summary quality to other Unsupervised Graph-Based Extractive Summarization methods. Notably, our method demonstrates a distinct advantage in terms of redundancy reduction, as evidenced by the lower redundancy observed in the generated summaries compared to other methods. This outcome underscores the effectiveness of our hybrid approach in producing concise and informative summaries while maintaining a higher level of content diversity.

However, there are certain limitations to acknowledge in this study. First, determining the optimal cluster number for extremely long documents can be time-consuming, potentially impacting the scalability of the approach. A potential solution could involve introducing a threshold or applying a heuristic approach in the clustering stage, such as a Genetic Algorithm (Murthy and Chowdhury 1996), an Ant Colony approach (Shelokar et al. 2004), or Particle Swarm Optimization (Chen and Ye 2012). Secondly, evaluating redundancy in summarization is limited by the small set of metrics available in the literature. The absence of an automated metric that accurately quantifies redundancy without requiring human intervention presents a challenge. Given the crucial role of redundancy elimination in the summarization process, developing a robust automated evaluation metric for redundancy is important. A dedicated metric, playing a role for redundancy similar to the one ROUGE plays as a benchmark for summary quality, could significantly support the development of summarization systems.

6 Conclusion and future work

In this paper, we introduce a novel hybrid approach for Unsupervised Extractive long document summarization. Our approach integrates cluster and Graph-Based techniques through the utilization of pre-trained language models. The experimental outcomes from summarization datasets demonstrate that our approach attains performance levels akin to those of state-of-the-art supervised neural models, along with the strong baselines established by earlier Unsupervised Graph-Based Summarization models. In the future, we would like to explore the adaptability of this method to diverse languages and domains. Additionally, investigating its potential for sentence selection in multi-document summarization holds exciting prospects.

References

Agarwal N, Gvr K, Reddy RS, Rosé CP (2011) Scisumm: A multi-document summarization system for scientific articles. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations. HLT ’11, pp. 115–120. Association for Computational Linguistics, USA

Alguliyev R, Aliguliyev R, Isazade N, Abdi A, Idris N (2019) Cosum: text summarization based on clustering and optimization. Exp Syst.

Bommasani R, Cardie C (2020) Intrinsic evaluation of summarization datasets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8075–8096. Association for Computational Linguistics, Online.

Chen C-Y, Ye F (2012) Particle swarm optimization algorithm and its application to clustering analysis. In: 2012 Proceedings of 17th Conference on Electrical Power Distribution, pp. 789–794

Chen J, Zhuge H (2014) Summarization of scientific documents by detecting common facts in citations. Future Gener Comput Syst 32:246–252.

Cohan A, Dernoncourt F, Kim DS, Bui T, Kim S, Chang W, Goharian N (2018) A discourse-aware attention model for abstractive summarization of long documents. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 615–621. Association for Computational Linguistics, New Orleans, Louisiana.

Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota.

Dong Y, Mircea A, Cheung JCK (2021) Discourse-Aware Unsupervised Summarization for Long Scientific Documents. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1089–1102. Association for Computational Linguistics, Online.

Erkan G, Radev DR (2004) Lexrank: graph-based lexical centrality as salience in text summarization. J Artif Intell Res 22:457–479

Gokhan T, Smith P, Lee M (2023) Node-weighted centrality ranking for unsupervised long document summarization. In: Métais E, Meziane F, Sugumaran V, Manning W, Reiff-Marganiec S (eds) Natural language processing and information systems. Springer, Cham, pp 299–312.

Gokhan T, Smith P, Lee M (2022) GUSUM: Graph-based unsupervised summarization using sentence features scoring and sentence-BERT. In: Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing, pp. 44–53. Association for Computational Linguistics, Gyeongju, Republic of Korea.

Hartigan JA, Wong MA (1979) Algorithm as 136: A k-means clustering algorithm. J Royal Stat Soc 28(1):100–108

Koh HY, Ju J, Liu M, Pan S (2022) An empirical survey on long document summarization: Datasets, models, and metrics. ACM Comput Surv.

Liang X, Wu S, Li M, Li Z (2021) Improving unsupervised extractive summarization with facet-aware modeling. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 1685–1697. Association for Computational Linguistics, Online.

Lin C-Y, Hovy E (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157.

Liu J, Hughes DJD, Yang Y (2021) Unsupervised Extractive Text Summarization with Distance-Augmented Sentence Graphs, pp. 2313–2317. Association for Computing Machinery, New York, NY, USA.

Mihalcea R, Tarau P (2004) TextRank: Bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411. Association for Computational Linguistics, Barcelona, Spain.

Miller D (2019) Leveraging BERT for extractive text summarization on lectures

Murthy CA, Chowdhury N (1996) In search of optimal clusters using genetic algorithms. Pattern Recognit Lett 17(8):825–832.

Nainggolan R, Perangin-angin R, Simarmata E, Tarigan AF (2019) Improved the performance of the k-means cluster using the sum of squared error (sse) optimized by using the elbow method. J Phys 1361(1):012015.

Nallapati R, Zhai F, Zhou B (2017) SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. Proceedings of the AAAI Conference on Artificial Intelligence 31(1).

Nallapati R, Zhou B, Santos C, Gülçehre Ç, Xiang B (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290. Association for Computational Linguistics, Berlin, Germany.

Pawar S, Manjula Gururaj H, Chiplunkar NN (2022) Text summarization using document and sentence clustering. Procedia Computer Science 215:361–369. 4th International Conference on Innovative Data Communication Technology and Application.

Peyrard M, Botschen T, Gurevych I (2017) Learning to score system summaries for better content selection evaluation. In: Proceedings of the Workshop on New Frontiers in Summarization, pp. 74–84. Association for Computational Linguistics, Copenhagen, Denmark.

Pilault J, Li R, Subramanian S, Pal C (2020) On extractive and abstractive neural document summarization with transformer language models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9308–9319. Association for Computational Linguistics, Online.

Qazvinian V, Radev DR (2008) Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 689–696. Coling 2008 Organizing Committee, Manchester, UK.

Radev DR, Hovy E, McKeown K (2002) Introduction to the special issue on summarization. Comput Linguist 28(4):399–408.

Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China.

Shelokar PS, Jayaraman VK, Kulkarni BD (2004) An ant colony approach for clustering. Anal Chimica Acta 509(2):187–195.

Vilca GCV, Cabezudo MAS (2017) A study of abstractive summarization using semantic representations and discourse level information. In: International Conference on Text, Speech, and Dialogue. Springer, pp. 482–490

Wang Z, Ma L, Zhang Y (2016) A novel method for document summarization using word2vec. In: 2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 523–529.

Xiao W, Carenini G (2019) Extractive summarization of long documents by combining global and local context. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3011–3021. Association for Computational Linguistics, Hong Kong, China.

Xiao W, Carenini G (2020) Systematically exploring redundancy reduction in summarizing long documents. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 516–528. Association for Computational Linguistics, Suzhou, China.

Xu S, Zhang X, Wu Y, Wei F, Zhou M (2020) Unsupervised extractive summarization by pre-training hierarchical transformers. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1784–1795. Association for Computational Linguistics, Online.

Zheng H, Lapata M (2019) Sentence Centrality Revisited for Unsupervised Summarization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6236–6247. Association for Computational Linguistics, Stroudsburg, PA, USA.

The first author would like to acknowledge the Ministry of National Education of Turkey for the financial support of her research activity.

Author information

Malcolm James Price and Mark Lee have contributed equally to this work.

Authors and Affiliations

MBZUAI, Masdar City, UAE

Tuba Gokhan

University of Birmingham, Birmingham, UK

Malcolm James Price & Mark Lee

Corresponding author

Correspondence to Tuba Gokhan .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .

About this article

Gokhan, T., Price, M.J. & Lee, M. Graphs in clusters: a hybrid approach to unsupervised extractive long document summarization using language models. Artif Intell Rev 57, 189 (2024).

Accepted: 06 June 2024

Published: 29 June 2024


  • SentenceBERT
  • Language models
  • Sentence centrality




