Google Patent Data Analytics: October 2013

Monday, 28 October 2013

Bibliographic patent data sources

As previously mentioned, many patent offices publish their bibliographic patent data in XML format in accordance with WIPO standard ST.36.  Where can you find the data?  As of the date of this post, the following web pages provide download links or purchase order information for such data.
Note that the data may not be free.  The USPTO makes its data available free of charge, with no usage restrictions.  The CIPO also makes its data available free of charge, but only for non-commercial use.  Each patent office may impose different pricing and/or usage constraints on its data.

Monday, 21 October 2013

XML format patent bibliographic data

Many patent offices (e.g. USPTO, EPO, WIPO, SIPO, CIPO) publish their bibliographic patent data in XML format in accordance with WIPO standard ST.36.   Some patent offices (e.g. CIPO) make the data available free of charge, for non-commercial use.   Others (e.g. USPTO) make the data available free of charge with no usage restrictions.

Like HTML, XML (extensible markup language) employs tags to encapsulate information.  Unlike HTML tags, XML tags impart no display characteristics (e.g. fonts) to the tagged information.  Also unlike HTML tags, XML tags are user-definable.  This means that they can be—and usually are—self-describing.  XML tags can also be arranged, e.g. nested to present information hierarchically.  Patent bibliographic data stored in the XML format defined by WIPO's ST.36 standard utilizes self-describing tags which are defined and hierarchically arranged in accordance with the standard.


Consider this extract from the USPTO's XML document for US patent no. 8309744.  Notice the field tags.  For example, the <country></country> tag pair encapsulates the “US” country code, telling us that this document pertains to a US patent.

The <doc-number></doc-number> tag pair encapsulates “08309744”, telling us the document's number.

The <kind></kind> tag pair encapsulates “B2”, telling us that the document is a granted utility patent.

The <date></date> tag pair encapsulates “20121113”, telling us that the patent issued on November 13, 2012.

Those four tag pairs are nested within the <document-id></document-id> tag pair which is in turn nested within the <publication-reference></publication-reference> tag pair.  The information encapsulated by those tag pairs identifies the published document.


The <document-id></document-id>, <country></country>, <doc-number></doc-number> and <date></date> tag pairs are also hierarchically nested within a pair of <application-reference></application-reference> tags.  Since the tags are self-describing, you can easily understand that the encapsulated information tells us that the '744 patent issued from US application serial no. 13/081,794 which was filed on April 7, 2011.
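To make the nesting concrete, here is a minimal sketch of how such an extract could be read with Python's standard library. The XML fragment is reconstructed by hand from the values quoted above; it is not copied from the USPTO file. The enclosing <biblio> element and the read_document_id helper are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-reconstructed fragment based on the values quoted in the post.
# <biblio> is a hypothetical wrapper element, not a tag from the USPTO file.
SAMPLE = """
<biblio>
  <publication-reference>
    <document-id>
      <country>US</country>
      <doc-number>08309744</doc-number>
      <kind>B2</kind>
      <date>20121113</date>
    </document-id>
  </publication-reference>
  <application-reference>
    <document-id>
      <country>US</country>
      <doc-number>13081794</doc-number>
      <date>20110407</date>
    </document-id>
  </application-reference>
</biblio>
"""


def read_document_id(root, reference_tag):
    """Return the fields nested in <document-id> under reference_tag as a dict."""
    doc_id = root.find(f"{reference_tag}/document-id")
    return {child.tag: child.text for child in doc_id}


root = ET.fromstring(SAMPLE.strip())
publication = read_document_id(root, "publication-reference")
application = read_document_id(root, "application-reference")

print(publication)
print(application)
```

The same helper reads both references because the tag names, not their position in the file, say what each value means.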

The depicted extract is just a small part of the USPTO's XML document publication for US patent no. 8309744.  Anyone familiar with patent information could read the raw XML document and discern its meaning fairly readily.  However, XML documents are not normally intended for human reading.  Their primary purpose is to preserve a document's organization and structure in computer-readable form.  The visualizations presented via this blog were developed by computer processing of XML documents corresponding to the visualized patent publications.

Monday, 14 October 2013

Patent bibliographic data basics

Bibliography is the description of books using details such as author, publication date, edition, etc. which collectively constitute bibliographic data. In relation to patents, bibliographic data encompasses details such as country, patent number & issue date; application number & filing date; priority number(s), country(ies) & date(s); invention title; inventor name(s), citizenship & address; assignee name(s), nationality & residence; and much more.

Have a look at the cover sheet of this United States patent. Everything that you see here—plus more information that you do not see here—constitutes this patent’s bibliographic data.

The visualizations presented via this blog make only limited use of the full range of available patent bibliographic data. In general, text and image information (e.g. abstract, description, claims, drawings) is not used. For the most part, information that can be counted is used.

For example, the question “how many patents did firm X prosecute on behalf of assignee Y for inventions handled by USPTO art unit Z?” is answered by counting the number of patents which satisfy all three of those criteria. Accordingly, patent bibliographic details such as firm names, assignee names and art unit numbers are utilized. But, apart from counting the total number of claims in a patent, neither the text comprising a patent’s abstract, description and claims nor the drawing images are useful for the purposes of the visualizations presented via this blog.
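A counting question of this kind can be sketched in a few lines of Python. The records, field names (firm, assignee, art_unit) and the count_matching helper below are all hypothetical; real data would first have to be extracted from the patent offices' files.

```python
# Hypothetical records; names and values are illustrative only.
patents = [
    {"firm": "Firm X", "assignee": "Assignee Y", "art_unit": "2617"},
    {"firm": "Firm X", "assignee": "Assignee Y", "art_unit": "2617"},
    {"firm": "Firm X", "assignee": "Assignee Y", "art_unit": "2624"},
    {"firm": "Firm W", "assignee": "Assignee Y", "art_unit": "2617"},
    {"firm": "Firm X", "assignee": "Assignee Q", "art_unit": "2617"},
]


def count_matching(records, **criteria):
    """Count the records whose fields equal every given criterion."""
    return sum(
        all(record.get(field) == value for field, value in criteria.items())
        for record in records
    )


total = count_matching(
    patents, firm="Firm X", assignee="Assignee Y", art_unit="2617"
)
print(total)
```

Only the first two records satisfy all three criteria, so the count is 2.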

Some dates can be useful, especially if they facilitate calculation of meaningful statistics for a large group of documents. For example, the time span between an application’s filing date and the corresponding patent’s issue date provides a useful measure that can be used to address questions such as “What is the average filing-to-issue time in years for US patents which issued in 2012 to assignee X for inventions in IPC subclass G06Q?”
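As a rough sketch, the filing-to-issue span can be computed from dates written as YYYYMMDD strings. The first pair below uses the filing and issue dates of US patent no. 8309744 quoted in the XML post on this page; the second pair is made up. The function names and the 365.25-day year are my assumptions, not a standard definition.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    """Convert a YYYYMMDD string into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def filing_to_issue_days(filing, issue):
    """Number of days between the filing date and the issue date."""
    return (parse_yyyymmdd(issue) - parse_yyyymmdd(filing)).days


def average_years(pairs, days_per_year=365.25):
    """Average filing-to-issue time in years over (filing, issue) pairs."""
    spans = [filing_to_issue_days(f, i) / days_per_year for f, i in pairs]
    return sum(spans) / len(spans)


pairs = [
    ("20110407", "20121113"),  # US 8309744: filed April 7, 2011; issued November 13, 2012
    ("20090615", "20120320"),  # hypothetical second patent
]

print(filing_to_issue_days(*pairs[0]))
print(round(average_years(pairs), 2))
```

For the '744 patent the span is 586 days, or about 1.6 years.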

In future posts I’ll delve more deeply into other aspects of patent bibliographic data.

Monday, 7 October 2013

Bubble charts

Bubble charts are sometimes useful for visualizing data. This example uses color to encode country (mauve = Finland, peach = Israel, green = Italy) and size to encode number of patent documents. The labels identify USPTO art units. Overall, the visualization compares Finland, Israel and Italy in terms of the number of US patents which issued in 2012 to assignees located in those countries and which were allocated by the USPTO to one of five different art units. The five art units are:

  • 2617 (cellular telephony)
  • 2618 (radio/satellite communications)
  • 2624 (image analysis)
  • 2916 (a design patent art unit)
  • 2913 (another design patent art unit)

You can easily see that, for Finland, art unit 2617 is the most significant one of the five. For Italy it’s art unit 2913 and for Israel it’s art unit 2624. In the underlying dataset, the Finland/2617 bubble corresponds to 127 patents, the Italy/2913 bubble corresponds to 54 patents and the Israel/2624 bubble corresponds to 48 patents.

For Finland, the next two most significant art units are 2916 and 2618 in that order, but you need to look closely to determine each bubble's size to get them in the right sequence. The Finland/2916 bubble corresponds to 51 patents and the Finland/2618 bubble corresponds to 46 patents. Difficulty in distinguishing bubble sizes is a downside of bubble charts.

For Israel, the next two most significant art units are 2617 and 2618 in that order, as is reasonably apparent from the bubbles’ respective sizes.

For Italy, the next two most significant art units are 2617 and 2624 in that order, but again you need to look closely to get them in the right order.  The Italy/2617 bubble corresponds to 23 patents and the Italy/2624 bubble corresponds to 18 patents.
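A little arithmetic suggests why these pairs are hard to tell apart. The sketch below assumes each bubble's area is proportional to its patent count, which is a common convention but is not stated for this chart. The diameter_ratio helper is a hypothetical name.

```python
import math


def diameter_ratio(larger_count, smaller_count):
    """Ratio of bubble diameters when area is proportional to count."""
    return math.sqrt(larger_count / smaller_count)


# Patent counts quoted in the post.
finland = diameter_ratio(51, 46)  # Finland/2916 vs Finland/2618
italy = diameter_ratio(23, 18)    # Italy/2617 vs Italy/2624

print(f"Finland: {finland:.3f}")
print(f"Italy:   {italy:.3f}")
```

Under that assumption, the Finland/2916 bubble is only about 5% wider than the Finland/2618 bubble, even though it represents about 11% more patents.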

The bubble size discrimination problem can be addressed by adding ranking values (e.g. 1, 2, 3...) to the bubbles within each color group, by applying different patterns corresponding to the number of patents represented by each bubble, etc. However, such techniques can distract the viewer without adequately addressing the problem.
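The ranking-value idea could be sketched as follows, using only the counts quoted above. Each country's two smallest art units are omitted because their counts are not given, and Israel is omitted because only one of its counts is given. The label format and the rank_within_country helper are hypothetical.

```python
from collections import defaultdict

# (country, art unit, patent count) as quoted in the post.
rows = [
    ("Finland", "2617", 127),
    ("Finland", "2916", 51),
    ("Finland", "2618", 46),
    ("Italy", "2913", 54),
    ("Italy", "2617", 23),
    ("Italy", "2624", 18),
]


def rank_within_country(rows):
    """Build a label with rank 1, 2, 3... for each bubble within its country."""
    by_country = defaultdict(list)
    for country, art_unit, count in rows:
        by_country[country].append((art_unit, count))

    labels = {}
    for country, items in by_country.items():
        ordered = sorted(items, key=lambda item: item[1], reverse=True)
        for rank, (art_unit, count) in enumerate(ordered, start=1):
            labels[(country, art_unit)] = f"{rank}. AU {art_unit} ({count})"
    return labels


labels = rank_within_country(rows)
for key in sorted(labels):
    print(key, labels[key])
```

Labels like these give the order at a glance, but they add text to every bubble, which is the distraction mentioned above.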

Bubble charts are useful if you only want to see an approximation. But, if precision matters, bubble charts may not be the best choice. If you look back at my "Top technology sectors by country" post, you’ll see that I used data bars to compare Finland, Israel and Italy in a different context. Consider whether it’s easier to understand the data bar visualization or the bubble chart visualization.