Google Patent Data Analytics: November 2013

Monday 25 November 2013

Plant patents

The US is one of the few countries that grants plant patents. Canada grants "plant breeders’ rights" which are administered not by the Canadian Intellectual Property Office but by the Plant Breeders’ Rights Office, which is part of the Canadian Food Inspection Agency.

The USPTO’s group art unit 1661 handles plant patent applications. US plant patents are allocated one of two kind codes, P2 or P3, depending on whether the application from which the patent issued underwent pre-grant publication (P3) or not (P2).

The USPTO’s bibliographic patent data contains a wealth of information for plant patents, including the botanical denomination (i.e. Latin name) of the genus and species of the patented plant, and the plant variety designation (i.e. cultivar name) of the patented plant. For example, US plant patent PP23241 issued on 4 December 2012 for a plant having the botanical denomination Echinacea purpurea and variety designation "Quills and Thrills". According to the patent’s abstract, the plant is "characterized by large inflorescences with quilled ray florets of purple pink, a compact, multicrown habit, a long bloom time, and excellent vigor".


It’s relatively straightforward to feed bibliographic data of this sort directly from a visualization tool into a search engine, to obtain further information. For example, the botanical denomination or the variety designation can be fed into a search engine directly off the visualization to get an image of the plant—as seen here for "Quills and Thrills".


As further examples, either the botanical denomination or the variety designation can be fed into a search engine directly off the visualization to look up the corresponding plant patent(s) in the Google patents database, or to see whether the botanical denomination appears in The Plant List (as seen here for Echinacea purpurea).
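As a rough illustration of what "feeding a search engine" amounts to, here is a minimal Python sketch that turns a botanical denomination or variety designation into a search URL. The base URL and the `tbm=isch` image-search parameter are my assumptions about how such a link might be built; they are not taken from the visualization tool itself.

```python
from urllib.parse import urlencode

# Assumed base URL for the search engine; swap in whichever one you use.
SEARCH_BASE = "https://www.google.com/search"

def search_url(term, images=False):
    """Build a search URL for an exact-phrase query on the given term."""
    params = {"q": f'"{term}"'}
    if images:
        # Assumed query parameter that requests image results.
        params["tbm"] = "isch"
    return f"{SEARCH_BASE}?{urlencode(params)}"

text_link = search_url("Echinacea purpurea")
image_link = search_url("Quills and Thrills", images=True)
```

Quoting the term keeps a multi-word name such as "Quills and Thrills" together as a phrase.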



Click the "Plant Patents" tab above to explore these and other aspects of the USPTO’s 2012 plant patents.

Monday 18 November 2013

The USPTO’s patent bibliographic data—concatenated XML

In a previous post I mentioned that the Canadian Intellectual Property Office’s 2012 XML format patent bibliographic data is provided in a single 188 MB archive from which 58,572 separate XML files can be extracted. Of those files, 21,592 correspond to Canadian utility patents which issued in 2012. The other files correspond to Canadian laid-open applications (kind code A1), reissue patents (kind code E), re-examined patents (kind code F) and republished versions of previously published files (e.g. to correct errors). The files range in size from about 2 KB to 39 KB, with an average size of about 3.3 KB.

In contrast, as shown in this portion of Google’s "USPTO Bulk Downloads: Patent Grant Bibliographic Data" web page, the USPTO’s 2012 patent bibliographic data is provided in 52 separate .zip archives—one per week (recall that US patents are issued in batches, on Tuesday of each week throughout the year). Unless you relish the prospect of manually downloading 52 separate archives one at a time, you’ll want to consider using a bulk file download utility.
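If you script the downloads, the first step is simply a list of the year's Tuesday issue dates. The Python sketch below generates that list. The `archive_name` helper and its file naming pattern are hypothetical, shown only to illustrate how a date and week number might be turned into a file name; check the actual names on the download page before relying on it.

```python
from datetime import date, timedelta

def tuesdays_in_year(year):
    """Return every Tuesday in the given year, in calendar order."""
    day = date(year, 1, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    day += timedelta(days=(1 - day.weekday()) % 7)
    result = []
    while day.year == year:
        result.append(day)
        day += timedelta(days=7)
    return result

def archive_name(issue_date, week_number):
    # Hypothetical naming pattern, for illustration only.
    return f"ipgb{issue_date:%Y%m%d}_wk{week_number:02d}.zip"

issue_dates = tuesdays_in_year(2012)
names = [archive_name(d, week) for week, d in enumerate(issue_dates, start=1)]
```

For 2012 this gives 52 dates, consistent with the 52 weekly archives, and the fourth date is 24 January 2012.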

After downloading one or more of the USPTO’s weekly bibliographic data archives you can extract the contents of each archive. Unlike the CIPO’s archive, which extracts into a multiplicity of XML files—one per Canadian patent bibliographic document—the USPTO’s archives extract into just three files. This example shows details of the three files extracted from the USPTO’s bibliographic data archive for week 4 of 2012, i.e. bibliographic data for US patents which issued on Tuesday, 24 January 2012.

The .txt file contains a checklist of the patent numbers included in the archive, with one patent number per line. For week 4 of 2012 the checklist includes design patent numbers D0652606 through D0653015, plant patent numbers PP022464 through PP022468, reissue patent numbers RE043120 through RE043146 and utility patent numbers 08099794 through 08104093. The .html file includes a header (shown here) summarizing the contents of the archive.
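Because the checklist is plain text with one number per line, it is easy to tally the patent types by prefix. The following Python sketch assumes only the prefixes seen in the week 4 checklist (D, PP, RE, and purely numeric utility numbers); anything else is lumped into "other".

```python
from collections import Counter

def patent_type(number):
    """Classify a checklist entry by its prefix."""
    number = number.strip().upper()
    if number.startswith("PP"):
        return "plant"
    if number.startswith("RE"):
        return "reissue"
    if number.startswith("D"):
        return "design"
    if number.isdigit():
        return "utility"
    return "other"

def count_by_type(lines):
    """Count entries per patent type, skipping blank lines."""
    return Counter(patent_type(line) for line in lines if line.strip())

# A few entries taken from the week 4 ranges quoted above.
sample = ["D0652606", "PP022464", "RE043120", "08099794", "08104093", ""]
counts = count_by_type(sample)
```

To run it against a real checklist, pass an open file object to `count_by_type` in place of the sample list.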

Notice that there is only one rather large .xml file. This is a concatenated XML file. As explained in the USPTO’s Bulk Data Product FAQs:
It is important to understand that the concatenated XML documents in the ZIP files, which have file extension "XML," are not the same as standard XML files and therefore will not be immediately readable by an ordinary XML parser. Instead, the files must be broken into individual XML documents, by splitting them apart at the XML declarations and/or DOCTYPE declarations.

Thus, unlike the CIPO’s archive from which one may directly extract separate XML files corresponding to individual Canadian bibliographic patent documents, some further processing of the USPTO’s concatenated XML files is required. Since XML files consist only of text and since each separate XML document within the USPTO’s concatenated XML file begins with its own XML declaration (e.g. <?xml version="1.0" encoding="UTF-8"?>), it is relatively straightforward to split the concatenated XML file into separate XML files. For week 4 of 2012 this should yield 4,725 separate XML files, as shown in the header above.
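Here is one way the split might be done in Python: a minimal sketch that cuts the text at each XML declaration, as the FAQ suggests. It reads the whole file into memory, which I'm assuming is acceptable for a weekly file, and the two-document sample string merely stands in for a real concatenated file.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
XML_DECL = re.compile(r"<\?xml\s[^>]*\?>")

def split_concatenated_xml(text):
    """Split concatenated XML text into a list of single-document strings."""
    starts = [match.start() for match in XML_DECL.finditer(text)]
    documents = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        chunk = text[start:end].strip()
        if chunk:
            documents.append(chunk)
    return documents

# Stand-in for a concatenated file: two tiny documents back to back.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<doc><n>1</n></doc>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<doc><n>2</n></doc>\n'
)
documents = split_concatenated_xml(sample)

# Each piece is now an ordinary XML document that a standard parser accepts.
values = [ET.fromstring(d.encode("utf-8")).findtext("n") for d in documents]
```

Comparing `len(documents)` with the count reported in the archive's .html header is a quick sanity check that nothing was lost in the split.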

Monday 11 November 2013

Working with XML format patent bibliographic data

The USPTO issues United States patents in batches, on Tuesday of each week throughout the year. The Canadian Intellectual Property Office does the same for Canadian patents.

According to the USPTO’s statistics 253,155 US utility patents were granted in 2012. I’m ignoring reissue, design and plant patents for comparison purposes. Canada does not grant design or plant patents. Instead of design patents, Canada grants industrial design registrations. Instead of plant patents, Canada grants plant breeders’ rights (these are administered not by the CIPO but by the Plant Breeders’ Rights Office, which is part of the Canadian Food Inspection Agency).

A search of the CIPO’s online patent database reveals that 21,592 Canadian utility patents issued in 2012. So, in 2012, the volume of Canadian utility patent grants was about 8.5% of the volume of US utility patent grants. An even greater disparity appears in relation to reissue patents: the USPTO granted 822 reissue patents in 2012, but only 20 Canadian reissue patents were granted in the decade spanning 2001-2011.

The USPTO and the CIPO publish bibliographic data for their respective granted patents in XML format, in accordance with WIPO’s ST.36 standard. The CIPO’s Canadian patent bibliographic data XML files are typically provided in .zip type archive files. For example, the CIPO’s 2012 XML format patent bibliographic data is provided in a 188 MB archive from which 58,572 separate XML files can be extracted. However, those XML files pertain not only to granted utility patents (kind code C) but also to laid-open applications (kind code A1), reissue patents (kind code E) and re-examined patents (kind code F).

Moreover, the CIPO may republish a patent bibliographic data XML file if an error is detected in a previously published version. For example, the CIPO’s 2012 patent bibliographic data archive includes an XML file for Canadian patent no. 2121906 which issued on 29 April 1993. As shown here, that XML file contains a pair of <ca-date-updated></ca-date-updated> XML tags encapsulating the 31 December 2012 date on which the CIPO republished its XML bibliographic data file for the ’906 patent (New Year’s Eve 2012 fell on a Monday). In processing the CIPO’s patent bibliographic data, one must take any such republication into account and perform appropriate update operations on existing data.
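To give a flavour of such an update operation, here is a minimal Python sketch that keeps only the most recently updated version of each patent's record. Several things here are assumptions on my part: the sample XML layout is invented, the date inside <ca-date-updated> is assumed to be in YYYYMMDD form (so that plain string comparison orders the dates correctly), and a record without the tag is treated as older than any record that has it.

```python
import xml.etree.ElementTree as ET

def updated_date(xml_text):
    """Return the text of the first <ca-date-updated> element, or ''."""
    root = ET.fromstring(xml_text)
    node = next(root.iter("ca-date-updated"), None)
    return node.text.strip() if node is not None and node.text else ""

def upsert(store, patent_number, xml_text):
    """Store xml_text unless a newer version is already held.

    Returns True if the store was changed.
    """
    new_date = updated_date(xml_text)
    current = store.get(patent_number)
    # Assumes YYYYMMDD dates, so string order matches date order.
    if current is None or new_date >= current[0]:
        store[patent_number] = (new_date, xml_text)
        return True
    return False

# Invented record layout, for illustration only.
original = "<patent><number>2121906</number></patent>"
republished = (
    "<patent><number>2121906</number>"
    "<ca-date-updated>20121231</ca-date-updated></patent>"
)

store = {}
upsert(store, "2121906", original)
upsert(store, "2121906", republished)
stale = upsert(store, "2121906", original)  # older version is ignored
```

A real pipeline would apply the same comparison to rows in a database rather than to an in-memory dictionary.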

This brief discussion touches on only some issues that one must be cognizant of in processing XML format patent bibliographic data. Next week I’ll discuss another issue specific to the USPTO’s XML format patent bibliographic data.

Monday 4 November 2013

Viewing XML format patent bibliographic data

OK—you have obtained some bibliographic patent data in XML format from one of the available sources (see last week’s post). Now what?

You can inspect individual XML files with an XML file viewer. XML files consist of text only, so a text editor such as the Microsoft Windows Notepad utility will do. Here is a small portion of the USPTO’s XML file for US patent no. 8329177 as viewed in Notepad. You can see the descriptive tag pairs (e.g. <document-id></document-id>) encapsulating the information content, but the tags’ hierarchical structure isn’t readily apparent via Notepad.

The tags’ hierarchical structure is more apparent if we inspect the XML file with a spreadsheet program such as Microsoft Excel. Here is a small portion of the same XML file (i.e. US 8329177) as viewed in Excel. The file has been converted into the familiar spreadsheet row/column format, with the tags appearing as column headings and the encapsulated information shown in rows beneath the respective headings. The conversion (or "flattening") process repeats information in some cells, as seen here. The column headers have been narrowed to show more columns. When viewed in Excel, the XML file for US 8329177 has 53 rows and 131 columns, so the worksheet has a total of 6,943 cells. However, the worksheet is rather sparse: only 2,408 of those cells contain information.
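The idea behind flattening can be sketched in a few lines of Python. This is a simplified version, not Excel's actual algorithm: each leaf element becomes one (column heading, value) pair, with the heading built from the chain of enclosing tags. The sample fragment is illustrative only; its element names other than <document-id> are my assumptions about the layout.

```python
import xml.etree.ElementTree as ET

def flatten(element, prefix=""):
    """Return (path, text) pairs for every leaf element under element."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        return [(path, (element.text or "").strip())]
    pairs = []
    for child in children:
        pairs.extend(flatten(child, path))
    return pairs

# Illustrative fragment, not copied from the real file.
sample = (
    "<publication-reference><document-id>"
    "<country>US</country><doc-number>08329177</doc-number>"
    "</document-id></publication-reference>"
)
pairs = flatten(ET.fromstring(sample))
```

Where a tag repeats (several inventors, say), a spreadsheet has to create extra rows and repeat the non-repeating values in each one, which is one reason the resulting worksheet is so sparse.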

A web browser can also be used to inspect an XML file. Here is the same small portion of the XML file for US 8329177 shown above in Notepad, as viewed in Microsoft Internet Explorer. In this case, the tags’ hierarchical structure is made apparent by color highlighting and by hierarchical indentation levels. The encapsulated information is bolded. This makes it somewhat easier to browse through the contents of a single XML file, if that is what you want to do.
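The browser's indented view is easy to approximate in Python. The sketch below walks the element tree and indents each tag by its depth; the sample fragment is illustrative and is not copied from the real file.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0, indent="  "):
    """Return one line per element, indented by nesting depth."""
    text = (element.text or "").strip()
    line = f"{indent * depth}{element.tag}" + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1, indent))
    return lines

# Illustrative fragment; element names other than document-id are assumed.
sample = (
    "<publication-reference><document-id>"
    "<country>US</country><doc-number>08329177</doc-number>"
    "</document-id></publication-reference>"
)
tree_lines = outline(ET.fromstring(sample))
print("\n".join(tree_lines))
```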

None of this is very helpful, unless you are only interested in a particular XML file’s bibliographic patent data content and don’t mind manually inspecting and deciphering the file’s tagged information as outlined above. More generally, what we want to do is examine the information content of a number (preferably a very large number) of bibliographic patent data XML files in parallel. Future posts will delve into that topic.