Google Patent Data Analytics: The USPTO’s patent bibliographic data—concatenated XML

Monday 18 November 2013


In a previous post I mentioned that the Canadian Intellectual Property Office’s (CIPO’s) 2012 XML-format patent bibliographic data is provided in a single 188 MB archive from which 58,572 separate XML files can be extracted. Of those files, 21,592 correspond to Canadian utility patents which issued in 2012. The other files correspond to Canadian laid-open applications (kind code A1), reissue patents (kind code E), re-examined patents (kind code F) and republished versions of previously published files (e.g. to correct errors). The files range in size from about 2 KB to 39 KB, with an average size of about 3.3 KB.

In contrast, as listed on Google’s "USPTO Bulk Downloads: Patent Grant Bibliographic Data" web page, the USPTO’s 2012 patent bibliographic data is provided in 52 separate .zip archives, one per week (recall that US patents are issued in batches, on Tuesday of each week throughout the year). Unless you relish the prospect of manually downloading 52 separate archives one at a time, you’ll want to consider using a bulk file download utility.
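If you would rather script the downloads than use a utility, the weekly issue dates are easy to generate. The following is a minimal sketch, not a tested downloader: it lists every Tuesday in a year and pairs each one with a week number and a candidate archive name. The `NAME_TEMPLATE` pattern is my assumption for illustration only, so check the download page for the actual file names and base URL before relying on it.

```python
from datetime import date, timedelta

# Hypothetical naming pattern (an assumption, not taken from the USPTO or
# Google pages); adjust it to match the names on the download page.
NAME_TEMPLATE = "ipgb{day:%Y%m%d}_wk{week:02d}.zip"


def issue_tuesdays(year):
    """Return every Tuesday in `year`, in chronological order."""
    day = date(year, 1, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    day += timedelta(days=(1 - day.weekday()) % 7)
    tuesdays = []
    while day.year == year:
        tuesdays.append(day)
        day += timedelta(weeks=1)
    return tuesdays


def archive_names(year):
    """Pair each issue date with its week number and a candidate file name."""
    return [
        (week, day, NAME_TEMPLATE.format(day=day, week=week))
        for week, day in enumerate(issue_tuesdays(year), start=1)
    ]


if __name__ == "__main__":
    for week, day, name in archive_names(2012):
        print(week, day.isoformat(), name)
```

Each generated name could then be appended to the base URL and fetched with something like `urllib.request.urlretrieve`, pausing between requests so as not to hammer the server.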

After downloading one or more of the USPTO’s weekly bibliographic data archives you can extract the contents of each archive. Unlike the CIPO’s archive, which extracts into many XML files (one per Canadian patent bibliographic document), each of the USPTO’s archives extracts into just three files: a .txt file, an .html file and an .xml file. The example used here is the USPTO’s bibliographic data archive for week 4 of 2012, i.e. bibliographic data for US patents which issued on Tuesday, 24 January 2012.
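Extraction can also be scripted. Here is a small sketch using Python's standard `zipfile` module; the function name and the idea of grouping the extracted names by extension are my own, added so you can quickly confirm that an archive really does contain one .txt, one .html and one .xml file.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir):
    """Extract one weekly archive and return its member names by extension."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    # Group names by lower-cased extension, e.g. {".xml": ["..."], ...}.
    by_extension = {}
    for name in names:
        by_extension.setdefault(Path(name).suffix.lower(), []).append(name)
    return by_extension
```

Calling `extract_archive("week04.zip", "week04")` (both names are placeholders) returns a dictionary keyed by extension, so an archive that does not contain the expected three files is easy to spot.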

The .txt file contains a checklist of the patent numbers included in the archive, with one patent number per line. For week 4 of 2012 the checklist includes design patent numbers D0652606 through D0653015, plant patent numbers PP022464 through PP022468, reissue patent numbers RE043120 through RE043146 and utility patent numbers 08099794 through 08104093. The .html file includes a header summarizing the contents of the archive.
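Because the checklist has one patent number per line, it is simple to tally by patent type. This sketch assumes only the prefixes mentioned above (D for design, PP for plant, RE for reissue, digits only for utility); anything else is counted as "other" rather than guessed at. The function names are mine.

```python
from collections import Counter


def patent_kind(number):
    """Classify a checklist entry by its prefix."""
    number = number.strip().upper()
    if number.startswith("PP"):
        return "plant"
    if number.startswith("RE"):
        return "reissue"
    if number.startswith("D"):
        return "design"
    if number.isdigit():
        return "utility"
    return "other"


def summarize_checklist(lines):
    """Count checklist entries per kind, skipping blank lines."""
    return Counter(patent_kind(line) for line in lines if line.strip())
```

Passing an open file object for the .txt file to `summarize_checklist` gives per-type counts that can be compared against the summary in the .html file.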

Notice that there is only one rather large .xml file. This is a concatenated XML file. As explained in the USPTO’s Bulk Data Product FAQs:
It is important to understand that the concatenated XML documents in the ZIP files, which have file extension "XML," are not the same as standard XML files and therefore will not be immediately readable by an ordinary XML parser. Instead, the files must be broken into individual XML documents, by splitting them apart at the XML declarations and/or DOCTYPE declarations.

Thus, unlike the CIPO’s archive, from which one may directly extract separate XML files corresponding to individual Canadian bibliographic patent documents, the USPTO’s concatenated XML files require some further processing. Since XML files consist only of text, and since each separate XML document within the USPTO’s concatenated XML file begins with its own XML declaration (e.g. <?xml version="1.0" encoding="UTF-8"?>), it is relatively straightforward to split the concatenated XML file into separate XML files. For week 4 of 2012 this should yield 4,725 separate XML files, the total given in the header of the .html file described above.
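The splitting step can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not the USPTO’s recommended tool: it assumes each XML declaration starts on its own line and that the file is UTF-8 encoded, as the example declaration suggests. The function names and the doc_00001.xml output naming are hypothetical. The file is read line by line, so the large .xml file never has to fit in memory all at once.

```python
from pathlib import Path

# The trailing space avoids matching other processing instructions such as
# <?xml-stylesheet ...?>.
XML_DECLARATION_PREFIX = "<?xml "


def split_concatenated_xml(lines):
    """Yield each XML document found in an iterable of text lines."""
    current = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith(XML_DECLARATION_PREFIX) and current:
            yield "".join(current)
            current = []
        # Ignore blank lines that appear before the first document.
        if current or line.strip():
            current.append(line)
    if current:
        yield "".join(current)


def write_documents(xml_path, out_dir):
    """Split `xml_path` into numbered files in `out_dir`; return the count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(xml_path, encoding="utf-8") as source:
        for count, document in enumerate(split_concatenated_xml(source), start=1):
            target = out_dir / f"doc_{count:05d}.xml"
            target.write_text(document, encoding="utf-8")
    return count
```

The value returned by `write_documents` can be compared against the total in the .html header as a quick sanity check that nothing was dropped or merged.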