We currently convert patent text data from 15 different formats delivered by the various patent authorities into a single, coherent, easy-to-use format that we call MAPS (Modified APS), and now MAPS-XML (MAPS-flavored XML with a complete, well-commented DTD). APS was the original USPTO mainframe data storage format: 80-column and line-oriented. It is still available for free from various sources, in the ASCII character set, for issue weeks from 1976 through 1999. The raw APS data contains over 1600 data errors, but you'll find them. After all, we did, over a period of 5 years and with many errors reported by our clients, for which we are very thankful!
Even though WIPO produced and maintains an agreed-upon XML standard for patent text data (currently Standard ST.36), the data sets from the authorities that use it differ enough that a single parser reading and indexing data from multiple authorities, or converting it to another format, would need hundreds of exceptions. We handle the parsing and conversion of all the different sets in a modular fashion: multiple front-end modules (reader/parser), a character set conversion pipe (in-convert-out), symbol conversion pipe(s), OCR cleanup and dictionary pipes, and a final output module for the desired destination format. We can also handle multiple inputs in the weekly flow with a single output, using various character conversion or OCR cleanup pipes, as shown in the accompanying diagram.
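To make the modular idea concrete, here is a minimal Python sketch of a reader front end feeding a chain of pipes and a final output module. It is an illustration only and not our production code: the tagged-line input layout, the function names (`read_tagged`, `symbol_pipe`, `cleanup_pipe`, `write_xml`, `convert`), the symbol table, and the `<doc>` output element are all assumptions made up for this example.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List
from xml.sax.saxutils import escape

Record = Dict[str, str]  # one publication: field tag -> text
Pipe = Callable[[Iterator[Record]], Iterator[Record]]


def read_tagged(raw_lines: Iterable[bytes], encoding: str) -> Iterator[Record]:
    """Hypothetical front end: 'TAG value' lines, a blank line ends a record.

    Decoding with the source encoding here is the 'in' half of the
    character set conversion; the writer produces Unicode text on the way out.
    """
    record: Record = {}
    for raw in raw_lines:
        line = raw.decode(encoding).rstrip("\r\n")
        if not line.strip():
            if record:
                yield record
                record = {}
            continue
        tag, _, value = line.partition(" ")
        record[tag] = value
    if record:
        yield record


def symbol_pipe(records: Iterator[Record]) -> Iterator[Record]:
    """Replace plain-text stand-ins with real symbols (illustrative table)."""
    table = {"(deg)": "\u00b0", "(R)": "\u00ae"}
    for rec in records:
        out = {}
        for tag, value in rec.items():
            for old, new in table.items():
                value = value.replace(old, new)
            out[tag] = value
        yield out


def cleanup_pipe(records: Iterator[Record]) -> Iterator[Record]:
    """Stand-in for an OCR cleanup pipe: collapse runs of whitespace."""
    for rec in records:
        yield {tag: re.sub(r"\s+", " ", value).strip() for tag, value in rec.items()}


def write_xml(records: Iterator[Record]) -> Iterator[str]:
    """Output module: one XML element per field (tags assumed to be valid names)."""
    for rec in records:
        fields = "".join(f"<{tag}>{escape(value)}</{tag}>" for tag, value in rec.items())
        yield f"<doc>{fields}</doc>"


def convert(raw_lines: Iterable[bytes], encoding: str, pipes: List[Pipe],
            writer: Callable[[Iterator[Record]], Iterator[str]]) -> List[str]:
    """Chain reader -> pipes -> writer; each stage can be swapped per source."""
    records = read_tagged(raw_lines, encoding)
    for pipe in pipes:
        records = pipe(records)
    return list(writer(records))


if __name__ == "__main__":
    sample = [b"TTL Heating  to 90(deg) C", b"INV M\xfcller & Co", b""]
    for doc in convert(sample, "latin-1", [symbol_pipe, cleanup_pipe], write_xml):
        print(doc)
```

Because every stage consumes and yields records, supporting another authority's data would mean writing one new reader rather than adding exceptions to a shared parser, and several readers could in principle feed the same pipes and writer.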
If you have data that needs to be converted to or from various formats, we almost certainly already have what is needed to handle the job. We can also tailor the conversion so that it is easy to add to your workflow.
Contact us and let us know the following particulars:
- Source Format (specification),
- Source character set and language,
- Number of Source publications,
- How they are grouped when stored (single files or multiple pubs per physical file),
- Total storage size on disk of Source data,
- Destination Format (specification),
- Destination character set desired,
- Any additional translations required, such as:
  - HTML Entities to UTF-8 characters or HTML format (ex: X² to X<sup>2</sup>),
  - HTML Entities to Plain Text (words for scientific symbols or characters),
  - UTF or ISO characters to Text or HTML Entities,
  - Additional character data translations and insertions for indexing, such as following each Scientific Symbol with its plain text name in parentheses, for example: Å (Angstrom), Ø (Phase),
- Any reports required such as lists of all scientific symbols or conversions, and
- Anything else you can think of that you may require.
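For readers who want to picture the character translations listed above, the following Python sketch shows one way each could look. It is a hedged illustration, not our actual tooling: the function names and the two lookup tables are assumptions for this example, with the symbol names taken from the Å and Ø examples in the list. `html.unescape` and `html.entities.codepoint2name` are standard-library facilities.

```python
import html
from collections import Counter
from html.entities import codepoint2name

# Illustrative tables only; a real job would use the client's agreed lists.
SUPERSCRIPTS = {"\u00b9": "1", "\u00b2": "2", "\u00b3": "3"}
SYMBOL_NAMES = {"\u00c5": "Angstrom", "\u00d8": "Phase"}


def entities_to_unicode(text: str) -> str:
    """HTML entities to Unicode characters, e.g. 'X&sup2;' -> 'X²'."""
    return html.unescape(text)


def superscripts_to_html(text: str) -> str:
    """Superscript characters to HTML markup, e.g. 'X²' -> 'X<sup>2</sup>'."""
    return "".join(
        f"<sup>{SUPERSCRIPTS[ch]}</sup>" if ch in SUPERSCRIPTS else ch for ch in text
    )


def unicode_to_entities(text: str) -> str:
    """Non-ASCII characters to named entities, falling back to numeric ones."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")
        else:
            parts.append(f"&#{cp};")
    return "".join(parts)


def annotate_symbols(text: str) -> str:
    """Follow each known symbol with its plain-text name in parentheses."""
    return "".join(
        f"{ch} ({SYMBOL_NAMES[ch]})" if ch in SYMBOL_NAMES else ch for ch in text
    )


def symbol_report(text: str) -> Counter:
    """Count every non-ASCII character, as raw material for a symbol report."""
    return Counter(ch for ch in text if ord(ch) > 127)


if __name__ == "__main__":
    print(superscripts_to_html(entities_to_unicode("X&sup2;")))  # X<sup>2</sup>
    print(unicode_to_entities("5 \u00c5"))                       # 5 &Aring;
    print(annotate_symbols("5 \u00c5"))                          # 5 Å (Angstrom)
```

Keeping each translation as a separate small function means they can be applied in whatever order and combination a destination format calls for.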
The bottom line is that we can probably save you money and time. Give us a call at one of the above numbers, or send an email with a brief description to IPDataCorp.com with the user name Support, and put Data Conversion Info Request in the subject line. If you provide enough detail about the data, we will let you know what we can do for you, provide you with a "ballpark" cost, and possibly an estimated completion time. Then you can decide whether you'd like a formal quote.
* * * * *