Recover a damaged MS Word file
Recently I was asked to repair a Word file. Opening it produced the following message:
“The name in the end tag of the element must match the element type in the start tag”
The file was full of mathematical formulas, and attempts to restore it or to open it in earlier versions failed. So where do you start?
Starting with version 2007, Microsoft adopted a standard format for Word, Excel and PowerPoint documents, called the Open XML format. In fact, every Word file is a collection of XML files, compressed together as a zip archive and given a “docx” extension. Suppose we have a file named MyDoc.docx. By renaming it to MyDoc.docx.zip, we can extract it with any zip software and get a folder with all the XML files. To turn it back into a Word file, simply compress the contents again and remove the .zip extension from the file name.
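If you prefer a script to renaming the file by hand, the same unpacking can be sketched in Python with the standard zipfile module. This is only a minimal sketch: the function name extract_docx and the paths in the usage note are my own, not part of any Word tooling.

```python
import zipfile


def extract_docx(docx_path, out_dir):
    """Unpack a .docx (which is a zip archive) into out_dir.

    Returns the list of member names found in the archive,
    for example "word/document.xml".
    """
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Calling extract_docx("MyDoc.docx", "MyDoc_extracted") would leave the XML files under MyDoc_extracted, with no need to rename the file first.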
Step 1: Extracting to XML Files
First we’ll make a copy of the corrupted file, so that we don’t damage it any further. For example, let’s call the copy ErrorFile.docx. We add the .zip extension to the file name, and double-clicking it shows the files inside the archive. At this point we can see the structure of the XML files.
Step 2: Finding the damaged line
Pay attention to the error message we got: its last line points us to word/document.xml, line 2, column 93,496. Unfortunately, Word does not lay out the XML file in a readable way, so almost the entire file sits on one line, which is why the column number is so large. document.xml is the file that holds the whole structure of the Word document, and is effectively the main file of any Word document. To work on it comfortably, we’ll copy it to another folder.
To find the specific error, go to the exact column given in the error description and work out why the file is invalid around that point. In practice, since we are dealing with large files, we may have to scan quite a lot of text before we understand where the mistake is. To save time, you can create a new XML file in Visual Studio 2010 and paste the content into the editor, which reformats it automatically. Reformatting changes the line breaks, so the line number in the error message no longer applies, but do not worry: the red marks on the scroll bar indicate the location of the error.
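Without Visual Studio at hand, a parser can report the position for you. The sketch below assumes Python's standard xml.etree.ElementTree, whose ParseError carries a (line, column) position; the helper name find_xml_error and the context parameter are hypothetical. The parser may stop at a slightly different spot than the one Word reports, so treat the result as a starting point.

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_bytes, context=80):
    """Return (line, column, snippet) for the first parse error, or None.

    xml_bytes is the raw content of document.xml. The snippet is the text
    surrounding the reported column, so the broken tag is easy to spot
    even when the whole file sits on a single line.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        lines = xml_bytes.splitlines()
        bad_line = lines[line - 1] if 0 < line <= len(lines) else b""
        start = max(column - context, 0)
        snippet = bad_line[start:column + context]
        return line, column, snippet.decode("utf-8", errors="replace")
    return None
```

Reading the file with open("document.xml", "rb").read() and passing the bytes to find_xml_error gives a short window of text around the failure instead of a 93,000-character line.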
A quick check, collapsing the unnecessary elements, gave the following picture, which explains it all:
A sequence of elements nested in the wrong order: the opening tag of <AlternateContent> comes before <oMath>, but its closing tag also comes before the closing tag of <oMath>. The position of the <Choice> element is not so clear either.
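The same kind of mis-nesting can be detected with a simple stack of open tags. The following is a rough sketch, not a real XML parser: it uses a regular expression, ignores comments and CDATA sections, and would be confused by a raw ">" inside an attribute value. The function name first_mismatch and the sample string in the usage note are made up for illustration.

```python
import re

# Matches opening, closing and self-closing tags such as <w:p>, </w:p>, <w:br/>.
# Processing instructions (<?xml ...?>) and comments are not matched.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


def first_mismatch(xml_text):
    """Return (expected_name, found_name, offset) for the first closing tag
    that does not match the most recently opened element, or None."""
    stack = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if closing:
            expected = stack.pop() if stack else None
            if expected != name:
                return expected, name, match.start()
        elif not self_closing:
            stack.append(name)
    return None
```

For a made-up fragment such as "<mc:AlternateContent><mc:Choice><m:oMath></mc:Choice></mc:AlternateContent></m:oMath>", the function reports that m:oMath was expected to close where </mc:Choice> was found, which is the same kind of crossing described above.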
Step 3: The solution
If you are unfamiliar with this format, the way to find the correct nesting of these elements is to look at the rest of the file and see how it is done properly. You can find many places where AlternateContent wraps Choice, so all that’s left is to move oMath to the right place.
Step 4: Running the new file
After fixing document.xml, save it and use it to replace the original document.xml inside the ErrorFile.docx.zip we opened. There is no need to compress anything (on the contrary, recompressing with zip software usually does not work). Just go back to the parent directory, remove the .zip extension, and open the file as usual.
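If you would rather script the swap than drag the file into the archive, one possible approach is to copy the archive member by member into a new file and substitute only document.xml. This is a sketch under my own assumptions: the function name replace_document_xml and the output name in the usage note are hypothetical, and I have not verified it against every Word version.

```python
import zipfile


def replace_document_xml(src_docx, fixed_xml_path, dst_docx,
                         member="word/document.xml"):
    """Copy src_docx to dst_docx, replacing one member with the fixed file.

    Every other member is copied unchanged, in the original order and
    under its original name, so the internal layout of the archive stays
    the same.
    """
    with open(fixed_xml_path, "rb") as handle:
        fixed = handle.read()
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = fixed if info.filename == member else src.read(info)
            dst.writestr(info, data)
```

For example, replace_document_xml("ErrorFile.docx", "document.xml", "Repaired.docx") writes a new file and leaves the copy of the damaged one untouched.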
Next time, please keep backup versions.