A SURVEY REPORT ON THE EXISTING METHODS OF BUILDING A PARALLEL CORPUS

Sonal Khosla; Haridasa Acharya

doi:10.26483/ijarcs.v9i4.6171

Published: Aug 20, 2018

Keywords:

Sentence Alignment, Web Mining, Parallel Corpus, Manual, Corpus

Abstract

This paper surveys the existing methods of building a parallel corpus. It starts with a short introduction to parallel corpora, followed by their applications. Parallel corpora built for different language pairs, and the methods adopted to build them, are then discussed and presented. The paper covers the methodologies behind some of the major parallel corpora. The survey is restricted to corpora aligned at the sentence and document levels.

Issue

Vol. 9 No. 4 (2018): July – August 2018

Section

Articles

COPYRIGHT

Submission of a manuscript implies: that the work described has not been published before, that it is not under consideration for publication elsewhere; that if and when the manuscript is accepted for publication, the authors agree to automatic transfer of the copyright to the publisher.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
The journal allows the author(s) to retain publishing rights without restrictions.
The journal allows the author(s) to hold the copyright without restrictions.
