Google Patent Data Analytics: XML format patent bibliographic data

Monday, 21 October 2013

XML format patent bibliographic data

Many patent offices (e.g. USPTO, EPO, WIPO, SIPO, CIPO) publish their bibliographic patent data in XML format in accordance with WIPO Standard ST.36. Some (e.g. CIPO) make the data available free of charge for non-commercial use. Others (e.g. USPTO) make it available free of charge with no usage restrictions.

Like HTML, XML (Extensible Markup Language) employs tags to encapsulate information. Unlike HTML tags, XML tags impart no display characteristics (e.g. fonts) to the tagged information. Also unlike HTML tags, XML tags are user-definable, which means that they can be, and usually are, self-describing. XML tags can also be arranged (e.g. nested) to present information hierarchically. Patent bibliographic data stored in the XML format defined by WIPO's ST.36 standard uses self-describing tags that are defined and hierarchically arranged in accordance with the standard.
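As a rough illustration of nesting, the Python sketch below parses a small made-up fragment and lists each tag indented by its depth in the hierarchy. The tag names and values are hypothetical and are not taken from ST.36. The `outline` helper is likewise just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names and values are invented for illustration only.
SAMPLE = """
<book>
  <title>Example Title</title>
  <author>
    <first-name>Ada</first-name>
    <last-name>Example</last-name>
  </author>
</book>
""".strip()


def outline(element, depth=0):
    """Return one line per element: the tag indented by nesting depth, plus any text."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


lines = outline(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

Because the tag names describe their contents, the indented outline is readable without any separate schema or legend.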


Consider this extract from the USPTO's XML document for US patent no. 8309744. Notice the field tags. For example, the <country></country> tag pair encapsulates the “US” country code, telling us that this document pertains to a US patent.

The <doc-number></doc-number> tag pair encapsulates “08309744”, telling us the document's number.

The <kind></kind> tag pair encapsulates “B2”, telling us that the document is a granted utility patent.

The <date></date> tag pair encapsulates “20121113”, telling us that the patent issued on November 13, 2012.

Those four tag pairs are nested within the <document-id></document-id> tag pair, which is in turn nested within the <publication-reference></publication-reference> tag pair. The information encapsulated by those tag pairs identifies the published document.
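The nesting just described can be sketched in Python with the standard library's `xml.etree.ElementTree`. The fragment below is reconstructed from the tag names and values quoted above, so it is an approximation rather than a verbatim copy of the USPTO file, which contains many more elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Reconstructed from the values quoted in the text; not a verbatim copy of the USPTO file.
FRAGMENT = """
<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08309744</doc-number>
    <kind>B2</kind>
    <date>20121113</date>
  </document-id>
</publication-reference>
""".strip()

root = ET.fromstring(FRAGMENT)
doc_id = root.find("document-id")

# Collect each child tag and its text into a plain dictionary.
publication = {child.tag: child.text for child in doc_id}

# The date is stored as YYYYMMDD; convert it to a date object.
issued = datetime.strptime(publication["date"], "%Y%m%d").date()

print(publication)
print(issued.isoformat())
```

Reading the fields by tag name rather than by position means the code keeps working even if the order of the child elements changes.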


The <document-id></document-id>, <country></country>, <doc-number></doc-number> and <date></date> tag pairs are also hierarchically nested within a pair of <application-reference></application-reference> tags. Since the tags are self-describing, you can easily understand that the encapsulated information tells us that the '744 patent issued from US application serial no. 13/081,794, which was filed on April 7, 2011.
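A similar sketch reads the application details. The fragment is again a reconstruction: it assumes the serial number is stored as the unpunctuated digits "13081794" and the filing date as "20110407", by analogy with the publication fields. The `format_serial` helper is a hypothetical name introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Reconstructed fragment; the digit-only doc-number and YYYYMMDD date are assumptions.
FRAGMENT = """
<application-reference>
  <document-id>
    <country>US</country>
    <doc-number>13081794</doc-number>
    <date>20110407</date>
  </document-id>
</application-reference>
""".strip()


def format_serial(doc_number):
    """Format an 8-digit US application number in the NN/NNN,NNN style."""
    return f"{doc_number[:2]}/{doc_number[2:5]},{doc_number[5:]}"


root = ET.fromstring(FRAGMENT)
doc_id = root.find("document-id")

serial = format_serial(doc_id.findtext("doc-number"))
filed = datetime.strptime(doc_id.findtext("date"), "%Y%m%d").date()

print(serial)
print(filed.isoformat())
```

The same <document-id> structure appears under both parent tags, so the parent tag is what tells a program whether a number refers to the publication or to the application.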

The depicted extract is just a small part of the USPTO's XML document publication for US patent no. 8309744. Anyone familiar with patent information could read the raw XML document and discern its meaning fairly readily. However, XML documents are not normally intended for human reading. Their primary purpose is to preserve a document's organization and structure in computer-readable form. The visualizations presented via this blog were developed by computer processing of XML documents corresponding to the visualized patent publications.