Advances in information extraction offer Army path beyond language barriers

By U.S. Army CCDC Army Research Laboratory Public Affairs, May 7, 2020

Breakthroughs in information extraction methods may help Soldiers more easily sift through large volumes of text written in a foreign language and obtain valuable data. (Photo Credit: U.S. Army photo illustration - Shutterstock)

ADELPHI, Md. -- The U.S. Army is working to break down language barriers and increase information flow. Soldiers may not have time to sift through hundreds of documents to find the specific answers they seek -- especially when that information is in another language.

In a partnership that began with the now-completed Network Science Collaborative Technology Alliance, the U.S. Army Combat Capabilities Development Command’s Army Research Laboratory and the University of Illinois at Urbana-Champaign devised a new way to extract valuable information from text written in a foreign language.

In a text-processing system, this approach will help users organize large amounts of source data into an information network by way of a common vector space that identifies semantic patterns in multiple languages.

“The advantage of going from text data to an information network via a previously trained extraction system is that you can retain links to the original source data for later access since you may not know until much later what question you want to ask,” said Dr. Clare Voss, an Army computer scientist who collaborated with Illinois’ Dr. Heng Ji on the team that developed this approach. “Our program takes the language data and converts it over several phases into this information network so that, downstream, other programs can use it as a source of information.”

Information overload can hinder Soldiers in gathering information just as much as they are hindered by their inability to understand foreign language texts, Voss said. Since there is often too much accessible information for a human to scrutinize, researchers rely on information extraction techniques to annotate and distill the content to be presented to Soldiers for their mission.

“Information extraction converts otherwise unstructured, complex data into some kind of recognizable structure,” Voss said. “Once you’ve structured it, you can make it computational for downstream automatic applications or visualize it as a diagram for a human.”

The methods that Ji and Voss developed under the alliance take advantage of neural networks and methods developed in recent studies to create vectors to represent words in natural language text. They preprocess the text structurally before encoding it into vectors in order to incorporate contextual information about the sentences in which the words appear, while also including lexical information about the words themselves.

During preprocessing, the words in each sentence are tagged as nouns, verbs, adjectives and other parts of speech, and the sentence is analyzed with a symbolic method known as dependency parsing. At the end of the full extraction pipeline, these elements and the relationships among them are categorized as entities, events or relations.
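
As a rough illustration of what such preprocessing produces, the sketch below stores a part-of-speech tag and a dependency link for each word. It is not the researchers' code: the `Token` class, the `dependency_edges` helper and the hand-annotated sentence are hypothetical, and the tag and relation names simply follow the Universal Dependencies naming style as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One preprocessed word (hypothetical structure, for illustration only)."""
    text: str    # surface form of the word
    pos: str     # part-of-speech tag
    head: int    # index of the word this one depends on, -1 for the root
    deprel: str  # label of the dependency relation to the head


def dependency_edges(tokens):
    """Return (head_index, dependent_index, label) triples, skipping the root."""
    return [(t.head, i, t.deprel) for i, t in enumerate(tokens) if t.head >= 0]


# Hand-annotated example; a real pipeline would obtain these from a tagger and a parser.
sentence = [
    Token("Soldiers", "NOUN", 1, "nsubj"),
    Token("read", "VERB", -1, "root"),
    Token("foreign", "ADJ", 3, "amod"),
    Token("documents", "NOUN", 1, "obj"),
]

pos_tags = [t.pos for t in sentence]
edges = dependency_edges(sentence)
print(pos_tags)
print(edges)
```

The tags and edges, rather than the words themselves, are the kind of structural information that can be shared across languages.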

“The essential problem is that there is just too much information, as stored in all these elements, for a human to keep track of,” Voss said. “We have more text than what our analysts who receive all these documents could possibly read. So, we want to get the content into a form that can be exploited by humans in many different ways. For the [alliance], our end goal was to construct information networks. To get there, we built other intermediary structures we call common structured semantic spaces from preprocessed vectors and parser relations.”

The researchers then trained a graph convolutional neural network that encoded the contextualized vectors into a common vector space. Within this space, they can compare the semantic structures both within and across different languages and measure how close the structures are to each other.
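
The idea of encoding sentence structure into a shared space and measuring closeness can be sketched in a few lines. The example below is a toy under stated assumptions, not the team's model: the weights are an untrained identity matrix, the input features are one-hot part-of-speech vectors rather than contextualized word vectors, and mean pooling with cosine similarity is an assumed way of comparing structures. All function names are hypothetical.

```python
import math


def normalized_adjacency(n, edges):
    """Row-normalized adjacency matrix with self-loops; edges treated as undirected."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for head, dep in edges:
        a[head][dep] = 1.0
        a[dep][head] = 1.0
    return [[v / sum(row) for v in row] for row in a]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(adj, feats, weight):
    """One graph convolution step: ReLU(adj @ feats @ weight)."""
    hidden = matmul(matmul(adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def mean_pool(rows):
    """Average the word vectors into a single sentence vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity, used here as an assumed closeness measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


ONE_HOT = {"NOUN": [1.0, 0.0, 0.0], "VERB": [0.0, 1.0, 0.0], "ADJ": [0.0, 0.0, 1.0]}
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # placeholder for trained weights


def encode(pos_tags, edges):
    """Map a tagged, parsed sentence to one vector in the shared space."""
    feats = [ONE_HOT[tag] for tag in pos_tags]
    adj = normalized_adjacency(len(pos_tags), edges)
    return mean_pool(gcn_layer(adj, feats, IDENTITY))


# Two sentences from different languages with the same subject-verb-object structure.
first = encode(["NOUN", "VERB", "NOUN"], [(1, 0), (1, 2)])
second = encode(["NOUN", "VERB", "NOUN"], [(1, 0), (1, 2)])
# A structurally different fragment: an adjective modifying a noun.
other = encode(["ADJ", "NOUN"], [(1, 0)])

sim_same = cosine(first, second)
sim_diff = cosine(first, other)
print(sim_same, sim_diff)
```

Because the toy encoder sees only tags and parse edges, sentences with matching structure land at the same point regardless of which words or language they use, while a differently structured fragment lands farther away.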

“What this means is that you can have sentences in two languages that mean the same thing and their structures will logically resemble each other,” Voss said. “For our research, we took texts in different languages, structured and mapped them into this common space, and found extractable patterns that span across languages.”

The researchers then evaluated the capabilities of this common semantic space by building it with contextualized vectors in one language and testing how well it could be used to identify the entities, relations and events of a different language. While a common vector space generally performs best when tested with the same language that it was built upon, Voss and her colleagues discovered that some languages created semantic spaces that yielded higher accuracy extractions of entities and relations when tested with a different language.

For example, a semantic space constructed only with English inputs supported relation extraction cross-lingually; it yielded an accuracy score of 42.5 percent when subsequently tested on identifying relations in Chinese sentences, but an even higher accuracy score of 58.7 percent on relations in Arabic sentences.

These test scores are most striking when compared to the English test counterpart, Voss said. The relation extraction accuracy that this same English-based semantic space supported on new English test sentences was 68.2 percent. This indicates, as the researchers pointed out, that a sufficient number of semantic similarities exists between English, Chinese and Arabic that it is feasible to use the structures in the semantic space of one language to identify some structural aspects of another for information extraction.
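
One way to read these figures side by side is to express each cross-lingual score as a share of the English-on-English score. The short calculation below does only that, using the three accuracy values reported above; the "retained share" is an illustrative ratio computed here, not a metric reported by the researchers, and the variable names are hypothetical.

```python
# Relation extraction accuracy (percent) of the English-built semantic space,
# as reported in the article.
ENGLISH_ON_ENGLISH = 68.2
CROSS_LINGUAL = {"Chinese": 42.5, "Arabic": 58.7}


def retained_share(score, baseline=ENGLISH_ON_ENGLISH):
    """Fraction of the same-language accuracy kept when testing on another language."""
    return score / baseline


retained = {lang: retained_share(score) for lang, score in CROSS_LINGUAL.items()}
gaps = {lang: ENGLISH_ON_ENGLISH - score for lang, score in CROSS_LINGUAL.items()}

for lang in CROSS_LINGUAL:
    print(f"{lang}: {retained[lang]:.1%} of the English score, "
          f"{gaps[lang]:.1f} points lower")
```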

“Not every noun in English is a noun in Chinese, but enough of them are that our approach puts the shared language elements together in this common space,” Voss said. “So if you build the space in English, you can preprocess Chinese text, run it through the neural network to place its structure into the common space and still successfully extract some Chinese entity-to-entity relations.”

Researchers found common semantic spaces built with preprocessed text from two different languages support higher accuracy extraction in a third language than when the space is built with only one of the languages. A common space built from a combination of vectors and parsed relations from English and Arabic preprocessed texts yielded higher accuracy extractions on Chinese texts than common spaces built with English, Arabic or Chinese individually.

The team said this discovery suggests that systems could use this new method of information extraction not only to process large amounts of single-language text more accurately, but also, given the appropriate common vector space, to process large amounts of foreign-language text without prior access to training documents in that language.

The results of this research have already caught the attention of an Army expert who has expressed interest in applying this technique to build information networks from multilingual, open source documents within web-enabled systems used by civil affairs analysts.

For Soldiers stationed around the world, effective communication with the local population plays an important role in fostering cooperation and achieving mission success. This new method of information extraction could augment current system capabilities to include pointers to relevant foreign language content for those who want to know more about these local communities.

“While this approach could ultimately supplement human intelligence activities in a tactical environment, it has been designed for downstream use by people who are trying to gather information about a region and put together descriptions of resources there,” Voss said.

The Association for Computational Linguistics published this research in the Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing.


CCDC Army Research Laboratory is an element of the U.S. Army Combat Capabilities Development Command. As the Army’s corporate research laboratory, ARL discovers, innovates and transitions science and technology to ensure dominant strategic land power. Through collaboration across the command’s core technical competencies, CCDC leads in the discovery, development and delivery of the technology-based capabilities required to make Soldiers more lethal to win the nation’s wars and come home safely. CCDC is a major subordinate command of the U.S. Army Futures Command.