Army researchers improve translation of U.S. -- Korean documents
May 15, 2013
ABERDEEN PROVING GROUND, Md. (May 15, 2013) -- Joint exercises across the globe require U.S. Soldiers to communicate with armed forces who speak different languages.
U.S. Army researchers are improving text translation technologies for U.S. Forces Korea, or USFK, allowing them to easily translate documents used by the United States and Republic of Korea, or ROK, during joint operations.
The U.S. Army Research, Development and Engineering Command's communications and electronics center, or CERDEC, is evaluating documents and improving software for the Korean Advanced Text Translator project, or KATT, in the domain of mission command.
KATT traces its roots to 2008 when USFK requested assistance in translation technologies. USFK contacted CERDEC Command, Power and Integration, or CP&I, based on its previous research into translation technologies. After several meetings and some initial funding setbacks, the program was funded by the Defense Acquisition Challenge Program in 2012.
Congress established the Defense Acquisition Challenge, or DAC, program to support Department of Defense research and development through test and evaluation of commercial technologies with a view toward procurement and fielding. Proposals were evaluated based on how well they met the needs of select operational areas outlined by the program.
Once the project was selected, the OSD Comparative Technology Office provided funding. The OSD CTO and the RDECOM HQ Global Technology Integration Team, which manages the DAC program for the entire Army, provide oversight for the KATT project.
"There was always this need for translation into Korean but no funding," said Tim George, CERDEC CP&I engineer. "Nobody was serving that need, and then we found DAC and we were able to go in and get them [USFK] something."
The project uses a commercial translation software program, SYSTRAN, as a foundation. CERDEC CP&I improves translation by altering the software's database of words and its translation algorithm.
Much of CERDEC CP&I's translation work has focused on speech-to-speech translation, like the Medical Applications for Speech Translation project and the Spontaneous Speech to Speech Translation Tool. CERDEC CP&I also worked on the text translation tool Coalition Chat Line Plus, which used SYSTRAN to translate between English, Spanish, Portuguese and French.
"How do you improve technology? The answer is the same, whether it is speech recognition or text translation," George said. "You collect data, and put that data in the system. You tell the system, this text means that and that's how the system becomes more accurate.
"We've been doing a lot of speech-to-speech, pick up a device and speak into it, but what they [USFK] realistically have are a lot of documents. They need to do joint planning together and that's what this [KATT] is really focused on -- it's not conversational, it's all document based."
The data being used to improve the accuracy of the SYSTRAN system comes from thousands of mission command documents provided by USFK. These documents include Microsoft Word and PowerPoint files, PDFs and emails. All documents are releasable to the ROK.
These documents cannot immediately be imported into the system for translation improvements, said Daniel Yaeger, CERDEC CP&I engineer.
"There is a whole lot of work that has to be done to the documents before we can even upload them into SYSTRAN and begin to train the model," Yaeger said. "We can't just take the documents and throw them into the system. Different documents come in different formats, so they need to be treated in a certain way."
This treatment involves normalizing the documents through a program in development by CERDEC CP&I that will extract information from the documents and create a standardized format for all data that will enter SYSTRAN.
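The article does not describe the normalization program itself. The following Python sketch only illustrates the general idea of reducing text from different sources to one standardized record. The `Segment` record, its fields, and the function names are hypothetical, and the sketch assumes the text has already been extracted from the Word, PowerPoint, PDF or email file.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical standardized record for one line of extracted text."""
    source_file: str
    source_format: str
    index: int
    text: str


def normalize_text(text):
    # NFC composes decomposed Hangul jamo into single syllable characters,
    # so the same Korean word always has the same code points.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces, tabs and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def to_segments(source_file, source_format, raw_lines):
    # Drop lines that are empty after normalization and number the rest.
    segments = []
    for raw in raw_lines:
        text = normalize_text(raw)
        if text:
            segments.append(Segment(source_file, source_format, len(segments), text))
    return segments
```

Keeping the source file and format on each record is a design choice made here for traceability, so that a corrected line can be traced back to the document it came from.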
"Our first real surprise of the project is that the documents would be mixed Korean and English, line-by-line," George said. "We thought they would give us English documents and give us Korean documents, and we'd use standard tools on them. We never imagined they would give us English and Korean alternating documents."
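One way to handle documents that alternate between English and Korean line by line is to classify each line by script and pair each English line with the Korean line that follows it. The Python sketch below is an illustration under that assumption, not the tool CERDEC CP&I built. The function names and the 50 percent threshold are hypothetical, and the resulting pairs are only candidates that still need checking by a linguist.

```python
def is_hangul(ch):
    # Hangul Syllables, Hangul Jamo and Hangul Compatibility Jamo blocks.
    code = ord(ch)
    return (
        0xAC00 <= code <= 0xD7A3
        or 0x1100 <= code <= 0x11FF
        or 0x3130 <= code <= 0x318F
    )


def line_language(line, threshold=0.5):
    # Look only at letters; digits and punctuation carry no script signal.
    letters = [ch for ch in line if ch.isalpha()]
    if not letters:
        return None
    hangul = sum(1 for ch in letters if is_hangul(ch))
    return "ko" if hangul / len(letters) >= threshold else "en"


def pair_alternating_lines(text):
    # Pair each English line with the next Korean line that follows it.
    # An English line followed by another English line is replaced, and a
    # Korean line with no pending English line is skipped.
    pairs = []
    pending_en = None
    for raw in text.splitlines():
        line = raw.strip()
        lang = line_language(line)
        if lang == "en":
            pending_en = line
        elif lang == "ko" and pending_en is not None:
            pairs.append((pending_en, line))
            pending_en = None
    return pairs
```

Calling `pair_alternating_lines` on a document's text returns a list of `(english, korean)` tuples in document order.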
For those documents to be useful, it must be determined whether the English and Korean texts are correct translations of each other. A Korean linguist from RDECOM's Army Research Laboratory, or ARL, is working with CERDEC CP&I engineers to determine the accuracy of those translations.
"A lot of value is added to the data before it even touches the translation software," Yaeger said. "Our linguist at ARL is assisting us with verifying alignment of the data and actually correcting translation errors."
"We have to fix these inconsistencies beforehand, or else you will get an incorrect translation when it's added to the system. The quantity of the data is not as important as the quality of data," said Jason Binder, CERDEC CP&I engineer. "Yes, you need a certain amount of data for it to be statistically relevant, but it also needs to be of high quality."
Once the documents are normalized and checked for translation accuracy, the data are added to a training model within SYSTRAN, where CERDEC CP&I engineers can track whether translation accuracy within the software is improving, staying the same or declining. They do this using a test set of documents kept separate from the documents used to train the system.
Several performance metrics measure the accuracy of translation software, but not all of them reflect how usable the product is in practice. While a system may be accurate when tested in the lab by specific mathematical formulas and metrics, the system may prove to be more or less accurate in the field.
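The practice of scoring against a separate test set can be sketched in Python as follows. This is a minimal illustration, not the KATT evaluation itself: the article does not name the metric used, so `word_overlap` is a deliberately simple stand-in, and the `translate` callable, the tolerance value and all function names are assumptions.

```python
import random


def split_pairs(pairs, test_fraction=0.1, seed=0):
    # Shuffle with a fixed seed so the same pairs are held out on every run.
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training pairs, test pairs)


def word_overlap(hypothesis, reference):
    # Stand-in metric: fraction of reference words found in the hypothesis,
    # with each hypothesis word allowed to match only once.
    remaining = hypothesis.split()
    ref = reference.split()
    if not ref:
        return 0.0
    matched = 0
    for word in ref:
        if word in remaining:
            remaining.remove(word)
            matched += 1
    return matched / len(ref)


def evaluate(translate, test_pairs):
    # Average score of the system's output over the held-out pairs only.
    scores = [word_overlap(translate(src), ref) for src, ref in test_pairs]
    return sum(scores) / len(scores)


def trend(previous, current, tolerance=0.01):
    # Classify the change in score between two training rounds.
    if current > previous + tolerance:
        return "improving"
    if current < previous - tolerance:
        return "declining"
    return "staying the same"
```

Because the held-out pairs never enter training, a rising score on them suggests the system is generalizing rather than memorizing the documents it was given.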
"You can put a translation system between two people who don't speak each other's languages, and what we want to know is can they understand that something is a bad translation and re-do the translation attempt with the system? Do they have to call over a translator?" George asked.
For example, a device can be tested in the lab to be accurate for 70 percent of the words translated but can have a working accuracy of 90 percent in the field, meaning the individuals using the device may not need assistance from a human translator despite a mistranslation, Yaeger said.
"Can they get their job done? That's the gold standard for the military," George said.
Another challenge is getting the system accredited for use by the ROK, said Scott Merker, CERDEC CP&I engineer. The system must be scanned for vulnerabilities and comply with system requirements of the U.S. and Korean networks.
"This program will have to be connected to the Korean network," Merker said. "They aren't going to just let us hook up to it. We have to get the system accredited, which is a long, arduous process."
The final product will be a version of the SYSTRAN software with a profile tailored to mission command documents, making it more useful to USFK than an unaltered commercial product. The translation software can be accessed via a web browser, or the user can click a button within Microsoft Word or PowerPoint that will translate the document directly within those programs.
The project is on schedule to finish by August. The final translation product will be transitioned to Product Director Machine Foreign Language Translation Software and the Training and Doctrine Command.
Because of the nature of human language and the statistical nature of the systems involved, translation software will never be 100 percent accurate, Yaeger said. Human language continually changes over time, meaning translation software can always be improved.
"The key with translation technologies, which is different than most acquisition technologies you deal with in the Army, is that it is constantly evolving," he said.
CERDEC is part of the U.S. Army Research, Development and Engineering Command, which has the mission to develop technology and engineering solutions for America's Soldiers.
RDECOM is a major subordinate command of the U.S. Army Materiel Command. AMC is the Army's premier provider of materiel readiness -- technology, acquisition support, materiel development, logistics power projection, and sustainment -- to the total force, across the spectrum of joint military operations. If a Soldier shoots it, drives it, flies it, wears it, eats it or communicates with it, AMC provides it.