Google Patent Data Analytics: viewing XML format patent bibliographic data

Monday 4 November 2013

viewing XML format patent bibliographic data

OK—you have obtained some bibliographic patent data in XML format from one of the available sources (see last week’s post). Now what?

You can inspect individual XML files with an XML file viewer. XML files consist of text only, so a text editor such as the Microsoft Windows Notepad utility will do. Here is a small portion of the USPTO’s XML file for US patent no. 8329177 as viewed in Notepad. You can see the descriptive tag pairs (e.g. <document-id></document-id>) encapsulating the information content, but the tags’ hierarchical structure isn’t readily apparent via Notepad.

The tags’ hierarchical structure is more apparent if we inspect the XML file with a spreadsheet program such as Microsoft Excel. Here is a small portion of the same XML file (i.e. US 8329177) as viewed in Excel. The file has been converted into the familiar spreadsheet row/column format, with the tags appearing as column headings and the encapsulated information shown in rows beneath the respective headings. The conversion (or "flattening") process repeats information in some cells, as seen here. The column headers have been narrowed to show more columns. When viewed in Excel, the XML file for US 8329177 has 53 rows and 131 columns, so the worksheet has a total of 53 × 131 = 6,943 cells. However, the worksheet is rather sparse: only 2,408 of those cells (about 35%) contain information.
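To make the flattening and sparseness idea concrete, here is a minimal Python sketch. It is not how Excel performs its conversion (in particular, it does not repeat information across rows); it simply turns each record into a row whose columns are the tag paths found anywhere in the data, then counts the filled cells. The sample XML, its values, and the function names leaf_paths, flatten and fill_ratio are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the tag names and values are illustrative only.
SAMPLE = """<citations>
  <citation>
    <document-id><country>US</country><doc-number>1111111</doc-number><kind>A</kind></document-id>
  </citation>
  <citation>
    <document-id><country>US</country><doc-number>2222222</doc-number></document-id>
    <category>cited by examiner</category>
  </citation>
</citations>"""


def leaf_paths(element, prefix=""):
    """Yield (path, text) for every element that has no child elements."""
    path = prefix + "/" + element.tag if prefix else element.tag
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


def flatten(root):
    """One row per child of root; columns are the union of all leaf paths."""
    rows = [dict(leaf_paths(record)) for record in root]
    columns = sorted({path for row in rows for path in row})
    return columns, [[row.get(col, "") for col in columns] for row in rows]


def fill_ratio(filled, rows, cols):
    """Fraction of worksheet cells that contain information."""
    return filled / (rows * cols)


columns, table = flatten(ET.fromstring(SAMPLE))
filled = sum(1 for row in table for cell in row if cell)
print(columns)
print(f"{filled} of {len(table) * len(columns)} cells filled")

# The figures quoted above for US 8329177: 2,408 filled cells in 53 x 131.
print(f"{fill_ratio(2408, 53, 131):.1%}")
```

Because a column is created for every tag path that occurs in any record, a record lacking that tag leaves an empty cell. That is one plausible reason a flattened worksheet ends up sparse. Note that this sketch keeps only the last value when a tag path repeats within a single record.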

A web browser can also be used to inspect an XML file. Here is the same small portion of the XML file for US 8329177 shown above in Notepad, as viewed in Microsoft Internet Explorer. In this case, the tags’ hierarchical structure is made apparent by color highlighting and by hierarchical indentation levels. The encapsulated information is bolded. This makes it somewhat easier to browse through the contents of a single XML file, if that is what you want to do.
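A similar indented view can be produced with a few lines of Python, using the standard library's xml.etree.ElementTree module. This is only a sketch: the sample fragment below is a made-up stand-in built around the document-id tag mentioned earlier, not an excerpt from the actual USPTO file, and the function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; real USPTO files are far larger.
SAMPLE = """<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>8329177</doc-number>
  </document-id>
</publication-reference>"""


def outline(element, depth=0):
    """Return one indented line per element: the tag, plus its text if any."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Each level of nesting adds two spaces of indentation, so the parent/child relationships between tags show up much as they do in the browser view.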

None of this is very helpful, unless you are only interested in a particular XML file’s bibliographic patent data content and don’t mind manually inspecting and deciphering the file’s tagged information as outlined above. More generally, what we want to do is examine the information content of a number (preferably a very large number) of bibliographic patent data XML files in parallel. Future posts will delve into that topic.
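As a small taste of what examining many files at once might look like, here is a hedged Python sketch that pulls one tagged value out of every XML file in a folder. It assumes one patent record per file, and the function name collect, the file names, and the sample contents are all hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def collect(directory, tag="doc-number"):
    """Return (file name, text) for the first `tag` element in each XML file."""
    results = []
    for path in sorted(Path(directory).glob("*.xml")):
        element = ET.parse(path).getroot().find(f".//{tag}")
        results.append((path.name, element.text if element is not None else None))
    return results


# Demo with two tiny made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for number in ("1111111", "2222222"):
        xml = (
            "<document-id><country>US</country>"
            f"<doc-number>{number}</doc-number></document-id>"
        )
        Path(tmp, f"US{number}.xml").write_text(xml, encoding="utf-8")
    for name, value in collect(tmp):
        print(name, value)
```

The same loop works whether the folder holds two files or two thousand, which is the point: the per-file inspection described above does not scale, but a script does.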