INN Antibody Annotation Format → JSON Format
This document describes a mapping between Prof. Andrew Martin’s “INN Antibody Annotation Format” for antibody-based therapeutics and a new JSON data schema. The JSON data is to be imported into a MongoDB database and made accessible to queries through a web front-end. The conversion from the INN format to the JSON format is carried out using this parser, and the constraints of the output JSON data are defined in this schema document, which is viewable in interactive form on the github.io domain associated with this repository.
The original INN annotation format is flat text and consists of
keyword[i]: value; tuples, where the keyword often has one or more
instance numbers appended in square brackets. These instances represent
different chains (or single-origin portions of a fused chain) that
require separate annotations for the same attribute. Some keywords may
also be followed by domain, region or isoform indicators in parentheses.
See here for more detail on the INN
annotation format.
Below is the terminology used in this documentation for roughly equivalent concepts in the two data formats.
| INN annotation | JSON |
|---|---|
| field / record* | property+ |
| keyword | name |
| value | value |
* record in the INN annotation should not be confused with the same term in tabular/relational data, where it refers to the whole units of observation (i.e. rows). In the INN annotation, a record is more similar to an individual cell in tabular data, while the observational units (antibody-based therapeutics) are in separate text files.
+ property is used here to refer both to the general attribute (similar to field in the INN annotation) and to an individual case (similar to record in INN annotation terms).
An example of two records in the INN format:
VLRange[1]: 1-112;
VLRange[5]: 865-974;
The same data looks like this in the JSON format:
"VLRange": [
  {
    "Instance": 1,
    "Start": 1,
    "End": 112
  },
  {
    "Instance": 5,
    "Start": 865,
    "End": 974
  }
]
The instance numbers are moved from the INN format keywords to nested objects within the JSON values. Since there is more than one instance with the same property, the JSON value is an array of objects. In the simpler and more flexible INN format, the values for different instances are in separate records, while JSON allows a nested structure with a single property name referencing an array of values, with each item in the array corresponding to a different instance.
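The restructuring described above can be sketched in a few lines of Python. This is only an illustration for simple range records of the form shown in the example, not the project's parser: the function name `ranges_to_json` and the regular expression are assumptions made for this sketch.

```python
import json
import re
from collections import defaultdict

# Matches simple range records such as "VLRange[1]: 1-112;"
# (keyword, instance number, start, end). Assumed pattern for this sketch.
RANGE_RECORD = re.compile(r"^(\w+)\[(\d+)\]:\s*(\d+)-(\d+);$")


def ranges_to_json(lines):
    """Group range records by keyword into arrays of instance objects."""
    result = defaultdict(list)
    for line in lines:
        match = RANGE_RECORD.match(line.strip())
        if match is None:
            continue  # not a simple range record; ignored in this sketch
        keyword, instance, start, end = match.groups()
        result[keyword].append(
            {"Instance": int(instance), "Start": int(start), "End": int(end)}
        )
    return dict(result)


records = ["VLRange[1]: 1-112;", "VLRange[5]: 865-974;"]
converted = ranges_to_json(records)
print(json.dumps(converted, indent=2))
```

The instance number moves out of the keyword and into each array item, so both records end up under the single `VLRange` property.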
Below is an example of the conversion of a complete annotation file for a fairly complicated antibody:
Overview of the mapping
The JSON property names match the INN field keywords as closely as possible, including the use of PascalCase. The only changes are as follows:
- Removal of instance numbers and surrounding square brackets,
- Removal of disulfide isoform indicators in square brackets, e.g. in HeavyDisulfidesIntra[A],
- Removal of other domain/region indicators in parentheses, e.g. in HeavyChainClass(CH1) and CDRSource(L1,L2,L3),
- Removal of parts of keywords that are aggregated in the JSON format, e.g. Confirmed and Potential in the *NGlycos and *OGlycos fields<sup>1</sup>,
- Removal of the spaces in Heavy Chain and Light Chain,
- For top-level Note records, the keyword of the related record (the preceding one) is prepended with a hyphen, e.g. Format-Note.
Any information removed from a name is moved to a subproperty in the JSON format. More details and examples are provided further down in this documentation.
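As a rough illustration of the first three rules and the space removal, the sketch below splits an INN keyword into a JSON property name plus the indicators that were stripped from it. The function name `split_keyword` and the returned dictionary layout are hypothetical, and the sketch assumes that numeric bracket contents are instance numbers while alphabetic ones are isoform indicators. Keyword aggregation and the Note naming rule are not covered.

```python
import re


def split_keyword(keyword):
    """Separate an INN keyword into its base name and stripped indicators."""
    brackets = re.findall(r"\[([^\]]*)\]", keyword)   # e.g. "[1]" or "[A]"
    parens = re.findall(r"\(([^)]*)\)", keyword)      # e.g. "(L1,L2,L3)"
    # Drop every bracketed and parenthesised part, then remove spaces.
    base = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", keyword)
    return {
        "name": base.replace(" ", ""),
        # Assumption: digits inside brackets are instance numbers.
        "instances": [
            int(n) for b in brackets for n in b.split(",") if n.strip().isdigit()
        ],
        # Assumption: letters inside brackets are disulfide isoform indicators.
        "isoforms": [b.strip() for b in brackets if b.strip().isalpha()],
        "indicators": [p.strip() for group in parens for p in group.split(",")],
    }


for example in ["VLRange[5]", "HeavyDisulfidesIntra[A]",
                "CDRSource(L1,L2,L3)", "Heavy Chain[1]"]:
    print(example, "->", split_keyword(example))
```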
In terms of their structure and types of values within the JSON format, the annotations fall into two groups, depending on their cardinality with the whole antibody:
- Single annotations for the whole antibody. The JSON value is a string and is unchanged from the original annotation. The only properties in this group are Request, Format, ID, AbML, HeavySource, LightSource, and most notes that relate to these properties.
- The vast majority are instance-specific: they can (but do not necessarily) apply to one or more of the instances or chains that make up an antibody. The JSON value is an array of objects, with each item containing a mandatory Instance/Instances subproperty. Many properties, like Antigen, Type, and Conjugate<sup>2</sup>, only ever apply to the whole antibody in the case of standard antibody formats, but may describe several instances in more complicated formats. For consistency, the values are always arrays of objects, even for standard antibodies, where the array contains a single object and the instance subproperty takes the value 0. The item subproperties for this group of properties can be summarised as follows (with more details and examples later):
  - Instance/Instances is mandatory. If there are no square brackets and thus no instances, the value is 0 (or [0] for the plural Instances). For inter-chain disulfide annotations, InstanceX/InstanceL and InstanceY/InstanceH are used, and both are given the value 0 if there are no square brackets.
  - The value in the INN format is either made a single subproperty or split into several. The single subproperty in the former case, or the primary information in the latter case, is called Value/Values or a more descriptive name if it is not already indicated by the parent property. For Fusion and Interaction properties, the values are the instances themselves, so there are no additional Values subproperties.<sup>3</sup>
  - Note is optional.
  - Additional subproperties may arise from the aggregation of some fields into single properties, e.g. Confirmed and Potential in the *NGlycos and *OGlycos fields.
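The aggregation mentioned in the last point can be pictured as follows. This is a minimal sketch, assuming the records have already been reduced to hypothetical `(keyword, instance, raw_value)` tuples; the raw values are kept as opaque strings and the example values are made up. Filling a missing counterpart with `None` is also an assumption of the sketch, not a documented rule.

```python
import re

# Heavy/Light glycosylation keywords that are merged into one JSON property.
AGGREGATED = re.compile(r"^(Heavy|Light)(Potential|Confirmed)([NO]Glycos)$")


def merge_glycos(records):
    """Merge Potential/Confirmed records into one item per instance."""
    merged = {}
    for keyword, instance, raw_value in records:
        match = AGGREGATED.match(keyword)
        if match is None:
            continue  # not an aggregated glycosylation keyword
        prefix, kind, suffix = match.groups()
        name = prefix + suffix  # e.g. "HeavyNGlycos"
        items = merged.setdefault(name, {})
        item = items.setdefault(
            instance,
            {"Instance": instance, "Potential": None, "Confirmed": None},
        )
        item[kind] = raw_value  # "Potential" or "Confirmed" subproperty
    return {name: list(items.values()) for name, items in merged.items()}


# Hypothetical input values, for illustration only.
example = [
    ("HeavyPotentialNGlycos", 1, "52"),
    ("HeavyConfirmedNGlycos", 1, "297"),
]
print(merge_glycos(example))
```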
Note records are treated differently depending on their associated record.
Those that relate to whole-antibody properties (e.g. ID) become top-level
JSON properties with the related keyword prepended with a hyphen (e.g. ID-Note).
These notes contain single string values, except Format-Note, which can
contain multiple notes for different instances (e.g. in request
12155), and so is an array of objects. All
other Note records are subsumed into their related properties.
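A simplified sketch of this note handling is given below. It assumes the records arrive in file order as hypothetical `(keyword, value)` pairs, where whole-antibody values are strings and instance-specific values are already-built item dictionaries. Format-Note, which is an array of objects, is deliberately left out.

```python
# Whole-antibody properties whose notes become "<keyword>-Note" strings.
# Format is excluded here because Format-Note is an array of objects.
STRING_NOTE_PARENTS = {"Request", "ID", "AbML", "HeavySource", "LightSource"}


def attach_notes(records):
    """Attach each Note record to the record that precedes it."""
    doc = {}
    previous = None
    for keyword, value in records:
        if keyword != "Note":
            if isinstance(value, dict):
                # Instance-specific property: collect items in an array.
                doc.setdefault(keyword, []).append(dict(value))
            else:
                doc[keyword] = value
            previous = keyword
        elif previous in STRING_NOTE_PARENTS:
            doc[previous + "-Note"] = value
        elif isinstance(doc.get(previous), list):
            # Subsume the note into the most recent item of the property.
            doc[previous][-1]["Note"] = value
    return doc


# Hypothetical records, for illustration only.
example_records = [
    ("ID", "example-id"),
    ("Note", "note about the ID"),
    ("VLRange", {"Instance": 1, "Start": 1, "End": 112}),
    ("Note", "note about instance 1"),
    ("VLRange", {"Instance": 5, "Start": 865, "End": 974}),
]
print(attach_notes(example_records))
```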
The grammar of the subproperty names indicates when to expect an array
of values. It can be different between properties but is consistent
within properties. If the Instance(s) subproperty is written in the
singular form, its value is always a single number.<sup>4</sup> If plural, it
may refer to one or more instances, so its value is always an array<sup>5</sup>
(even if single-item). The same rule is followed for the Value(s) and
Domain(s)<sup>6</sup> subproperties. The Regions, Positions and
Mutations subproperties are always arrays and so are always written in the
plural.
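The naming rule lends itself to a simple consistency check. The sketch below is hypothetical (the function `check_grammar` is not part of the project) and assumes that `None` is tolerated for any subproperty, since NONE values are converted to null.

```python
SINGULAR = {"Instance", "Value", "Domain"}
PLURAL = {"Instances", "Values", "Domains", "Regions", "Positions", "Mutations"}


def check_grammar(item):
    """Return a list of subproperties whose value contradicts their name."""
    problems = []
    for key, value in item.items():
        if value is None:
            continue  # assumption: null is accepted regardless of grammar
        if key in PLURAL and not isinstance(value, list):
            problems.append(f"{key} should be an array")
        elif key in SINGULAR and isinstance(value, list):
            problems.append(f"{key} should be a single value")
    return problems


print(check_grammar({"Instance": 1, "Start": 1, "End": 112}))
print(check_grammar({"Instances": 0, "Values": ["VH", "CL"]}))
```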
The integrity of the data values is maintained through the format conversion by making only minor structural or syntactic (and not semantic) changes, which are as follows:
- Some values are split into several subproperties (as described above), but they always remain grouped together within a single item of the parent property.
- Instance, domain and region indicators (within square brackets or parentheses after INN keywords) are moved to JSON subproperties. Where there are several comma-separated indicators, the subproperty is an array. In the absence of square brackets for annotations with a mandatory instance subproperty, the value 0 is chosen (for Instance) or [0] (for Instances).
- Positions, Values and Mutations subproperties are split into arrays.
- For Fusion and Interaction record values, the space-separated string of instance numbers (e.g. 1 2 3) is split into an array of number items (e.g. [1, 2, 3]) for consistency with all of the other Instances subproperties.
- Similarly, Domains values are changed from a space-separated string to an array of strings (e.g. VH CL becomes ["VH", "CL"]).
- All NONE values are converted to JSON’s null type. This value means the explicit absence of the attribute (rather than a failure of measurement or the data getting lost).
- For CHSRange records that only mention two deletions, the Start and End subproperties of the JSON property are given null values.
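Several of the conversions listed above are small enough to show as helper functions. These are illustrative sketches with hypothetical names, not the parser's actual methods; Python's `None` is what the standard `json` module serialises as JSON `null`.

```python
import json


def none_to_null(raw):
    """Map the INN value NONE to None (serialised as JSON null)."""
    text = raw.strip()
    return None if text == "NONE" else text


def split_instances(raw):
    """Turn a space-separated string of instance numbers into a list of ints."""
    return [int(part) for part in raw.split()]


def split_domains(raw):
    """Turn a space-separated string of domain names into a list of strings."""
    return raw.split()


def instance_value(bracket, plural):
    """Build the Instance/Instances value from the bracket contents, if any."""
    if bracket is None:
        return [0] if plural else 0  # no square brackets on the keyword
    numbers = [int(n) for n in bracket.split(",")]
    return numbers if plural else numbers[0]


print(json.dumps({
    "Instances": split_instances("1 2 3"),
    "Values": split_domains("VH CL"),
    "Value": none_to_null("NONE"),
}))
```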
The integrity of the data as it is run through the parser will be verified with unit tests covering the full range of expected data and edge cases. See the test/README page for a complete overview of how this project is tested.
Details of the mapping for every INN annotation keyword
| INN keyword (excluding optional instance/domain/region/isoform indicators) | JSON property name | JSON array item subproperties (parentheses = non-mandatory) |
|---|---|---|
| Request | Request | N/A |
| Format | Format | N/A |
| ID | ID | N/A |
| AbML | AbML | N/A |
| HeavySource<sup>7</sup> | HeavySource | N/A |
| LightSource | LightSource | N/A |
| Note (relating to any of the above keywords, except Format) | <keyword>-Note (relating to any of the above keywords, except Format) | N/A |
| Note (relating to Format) | Format-Note | Instances, Value |
| Note (all other Note records) | N/A (subsumed into related property) | N/A |
| Source | Source | Instances, Value, (Note) |
| Fusion | Fusion | Instances, (Note) |
| Linker | Linker | Instances, Start, End, (Note) |
| Domains | Domains | Instances, Values, (Note) |
| Type | Type | Instances, Value, (Note) |
| Conjugate | Conjugate | Instances, DrugName, (Note) |
| Antigen | Antigen | Instances, Species, Names, Gene, (Note) |
| Binding | Binding | Instances, Species, Names, Gene, (Note) |
| PDB | PDB | Instances, ID, (Note) |
| HeavyChainClass | HeavyChainClass | Instance, (Domains), Value, (Note) |
| HeavyChainLength | HeavyChainLength | Instance, Value, (Note) |
| ChainLength | ChainLength | Instances, Value, (Note) |
| LightChainClass | LightChainClass | Instance, (Domain), Value, (Note) |
| LightChainLength | LightChainLength | Instance, Value, (Note) |
| HeavyPotentialNGlycos, HeavyConfirmedNGlycos | HeavyNGlycos | Instance, Potential, Confirmed, (ConfirmedPartial), (ConfirmedRare), (Note) |
| LightPotentialNGlycos, LightConfirmedNGlycos | LightNGlycos | Instance, Potential, Confirmed, (ConfirmedPartial), (ConfirmedRare), (Note) |
| PotentialNGlycos, ConfirmedNGlycos | NGlycos | Instances, Potential, Confirmed, (ConfirmedPartial), (ConfirmedRare), (Note) |
| HeavyPotentialOGlycos, HeavyConfirmedOGlycos | HeavyOGlycos | Instance, Potential, Confirmed, (ConfirmedPartial), (ConfirmedRare), (Note) |
| PotentialOGlycos, ConfirmedOGlycos | OGlycos | Instances, Potential, Confirmed, (ConfirmedPartial), (ConfirmedRare), (Note) |
| HeavyConfirmedPTM | HeavyConfirmedPTM | Instance, Type, Positions, (PositionsPartial), (PositionsRare), (Note) |
| LightConfirmedPTM | LightConfirmedPTM | Instance, Type, Positions, (PositionsPartial), (PositionsRare), (Note) |
| ConfirmedPTM | ConfirmedPTM | Instances, Type, Positions, (PositionsPartial), (PositionsRare), (Note) |
| HeavyCysPositions | HeavyCysPositions | Instance, Values, (Note) |
| LightCysPositions | LightCysPositions | Instance, Values, (Note) |
| CysPositions | CysPositions | Instances, Values, (Note) |
| HeavyDisulfidesIntra | HeavyDisulfidesIntra | Instance, (Isoform), Positions, (Note) |
| LightDisulfidesIntra | LightDisulfidesIntra | Instance, (Isoform), Positions, (Note) |
| DisulfidesIntra | DisulfidesIntra | Instances, (Isoform), Positions, (Note) |
| DisulfidesInterH1H2<sup>8</sup> | DisulfidesInterH1H2 | InstanceX, InstanceY, (Isoform), Positions, (Note) |
| DisulfidesInterL1H1 | DisulfidesInterL1H1 | InstanceL, InstanceH, (Isoform), Positions, (Note) |
| DisulfidesInterL2H2 | DisulfidesInterL2H2 | InstanceL, InstanceH, (Isoform), Positions, (Note) |
| DisulfidesInterL3H3<sup>9</sup> | DisulfidesInterL3H3 | InstanceL, InstanceH, (Isoform), Positions, (Note) |
| DisulfidesInter | DisulfidesInter | InstanceX, InstanceY, (Isoform), Positions, (Note) |
| HVGermline | HVGermline | Instance, Species, GeneID, (Note) |
| HJGermline | HJGermline | Instance, Species, GeneID, (Note) |
| HCGermline | HCGermline | Instance, (Domains), Species, GeneID, (Note) |
| LVGermline | LVGermline | Instance, Species, GeneID, (Note) |
| LJGermline | LJGermline | Instance, Species, GeneID, (Note) |
| LCGermline | LCGermline | Instance, (Domain), Species, GeneID, (Note) |
| VHRange | VHRange | Instance, Start, End, (Note) |
| CH1Range | CH1Range | Instance, Start, End, (Mutations), (Note) |
| HingeRange | HingeRange | Instance, Start, End, (Mutations), (Note) |
| CH2Range | CH2Range | Instance, Start, End, (Mutations), (Note) |
| CH3Range | CH3Range | Instance, Start, End, (Mutations), (Note) |
| CH4Range | CH4Range | Instance, Start, End, (Mutations), (Note) |
| CHSRange | CHSRange | Instance, Start, End, (Mutations), (Note) |
| VLRange | VLRange | Instance, Start, End, (Note) |
| CLRange | CLRange | Instance, Start, End, (Mutations), (Note) |
| MutationH | MutationH | Instance, Mutations, Reason, (Note) |
| MutationL | MutationL | Instance, Mutations, Reason, (Note) |
| Mutation | Mutation | Instance, Mutations, Reason, (Note) |
| CDRKabatH1 | CDRKabatH1 | Instance, Sequence, Start, End, (Note) |
| CDRKabatH2 | CDRKabatH2 | Instance, Sequence, Start, End, (Note) |
| CDRKabatH3 | CDRKabatH3 | Instance, Sequence, Start, End, (Note) |
| CDRKabatL1 | CDRKabatL1 | Instance, Sequence, Start, End, (Note) |
| CDRKabatL2 | CDRKabatL2 | Instance, Sequence, Start, End, (Note) |
| CDRKabatL3 | CDRKabatL3 | Instance, Sequence, Start, End, (Note) |
| CDRSource | CDRSource | Instances, (Regions), Species, (Note) |
| FusionProteinHeavy | FusionProteinHeavy | Instance, Start, End, (Multimer), (Note) |
| FusionProteinLight | FusionProteinLight | Instance, Start, End, (Multimer), (Note) |
| FusionProtein | FusionProtein | Instance, Start, End, (Multimer), (Note) |
| FusionProteinHeavyLinker | FusionProteinHeavyLinker | Instances, Start, End, (Note) |
| FusionProteinLightLinker | FusionProteinLightLinker | Instances, Start, End, (Note) |
| FusionProteinLinker | FusionProteinLinker | Instances, Start, End, (Note) |
| FusionProteinHeavyDisulfides | FusionProteinHeavyDisulfides | Instance, (Isoform), Positions, (Note) |
| FusionProteinLightDisulfides | FusionProteinLightDisulfides | Instance, (Isoform), Positions, (Note) |
| FusionProteinDisulfides | FusionProteinDisulfides | Instance, (Isoform), Positions, (Note) |
| Interaction | Interaction | Instances, (Note) |
| Heavy Chain | HeavyChain | Instances, Sequence |
| Light Chain | LightChain | Instances, Sequence |
| Chain | Chain | Instances, Sequence |
As can be seen from the chosen JSON subproperty names, the 80+ annotation
properties can be condensed into around 30 groups according to shared
information structures. Each group can be handled by a single method in
the parser’s code. The grouping also informs much of the logic of the JSON
schema’s modular construction (sharing definitions for item subproperties,
or grouping together identical schemas with the patternProperties keyword).
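One way to picture "one method per group" is a lookup from name patterns to handler functions, sketched below. The handlers, the patterns and the fallback behaviour are assumptions for illustration and do not reflect how the parser is actually organised.

```python
import re


def parse_range(value):
    """Handler for plain range values such as "865-974"."""
    start, end = value.split("-")
    return {"Start": int(start), "End": int(end)}


def parse_plain(value):
    """Fallback handler: keep the value as a single Value subproperty."""
    return {"Value": value}


# Each entry covers a whole group of properties with the same structure.
HANDLERS = [
    (re.compile(r"^V[HL]Range$"), parse_range),  # VHRange, VLRange
]


def handler_for(name):
    """Pick the handler for a JSON property name, defaulting to parse_plain."""
    for pattern, handler in HANDLERS:
        if pattern.match(name):
            return handler
    return parse_plain


print(handler_for("VLRange")("865-974"))
```

Adding a new group then means adding one pattern and one function, rather than one branch per keyword.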
JSON Schema and validation
The structure and constraints of the JSON format are formally laid out in
more detail in a JSON Schema document.
It defines the allowed and required properties (at all levels of the
structure) and the data types and a range, set or pattern of values for
many of these properties. There are also examples provided in the
examples subproperty of each property in the schema.
About one third of the properties are given their own unique subschema.
For the rest, the schema defines a number of patternProperties, where
property names matching a given regex pattern are assigned the same
subschema. The properties covered by each pattern are enumerated in
comments so the patternProperty schema that covers a particular property
may be found using Ctrl+F. Note that many of the Heavy/Light-prefixed
properties share the same subschema (with a singular Instance
subproperty), but the more general case (without a prefix) has its own
unique subschema (with a plural Instances subproperty).
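The relationship between pattern-matched and uniquely named properties can be sketched with a small schema fragment held as a Python dictionary. The fragment and the lookup function `subschema_for` are hypothetical and much simpler than the real schema. In JSON Schema, patternProperties patterns are not implicitly anchored, so the sketch uses `re.search` and anchors the pattern explicitly with `^` and `$`.

```python
import re

# Hypothetical fragment: prefixed properties share one subschema by pattern,
# while the general (unprefixed) case gets its own subschema.
SCHEMA_FRAGMENT = {
    "patternProperties": {
        # Covers: HeavyCysPositions, LightCysPositions
        "^(Heavy|Light)CysPositions$": {
            "type": "array",
            "items": {"type": "object", "required": ["Instance", "Values"]},
        },
    },
    "properties": {
        "CysPositions": {
            "type": "array",
            "items": {"type": "object", "required": ["Instances", "Values"]},
        },
    },
}


def subschema_for(name, schema=SCHEMA_FRAGMENT):
    """Find the subschema for a property name (simplified lookup)."""
    if name in schema.get("properties", {}):
        return schema["properties"][name]
    for pattern, subschema in schema.get("patternProperties", {}).items():
        if re.search(pattern, name):
            return subschema
    return None


print(subschema_for("HeavyCysPositions")["items"]["required"])
print(subschema_for("CysPositions")["items"]["required"])
```

A real validator applies every matching `properties` and `patternProperties` entry together; the lookup here returns only the first match, which is enough because the anchored pattern does not match the unprefixed name.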
In this project, the schema is first used (in a non-automated way) as a form of documentation to guide the development of the parser’s source code (here is an overview). It is then imported for an optional automatic validation step that is built into the app, implemented with the Python jsonschema library. Along with Python unit tests, this schema validation supports a test-driven development approach and gives users more confidence in the output of the fully operational parser. It may also flag new input data that does not conform to the expectations derived from the existing therapeutic antibody annotations, which is fairly likely to happen given the ever-increasing complexity of engineered antibody formats. If and when this happens, the parser may need to be modified or extended (or the schema could simply be relaxed), and downstream applications (such as a database and searchable website) may also need updating.
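A minimal sketch of such an optional validation step is shown below. The schema is a small hypothetical stand-in for the real one, the choice of `Draft7Validator` is an assumption (the draft used by the project's schema is not stated here), and the function `validation_errors` is made up for this example. Because validation is optional, the sketch returns `None` when the jsonschema library is not installed.

```python
try:
    from jsonschema import Draft7Validator
except ImportError:  # validation is optional, so degrade gracefully
    Draft7Validator = None

# Hypothetical stand-in for the project's schema (VLRange only).
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "VLRange": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Instance": {"type": "integer", "minimum": 0},
                    "Start": {"type": "integer"},
                    "End": {"type": "integer"},
                    "Note": {"type": "string"},
                },
                "required": ["Instance", "Start", "End"],
                "additionalProperties": False,
            },
        }
    },
}


def validation_errors(document, schema=EXAMPLE_SCHEMA):
    """Return sorted error messages, or None if jsonschema is unavailable."""
    if Draft7Validator is None:
        return None
    validator = Draft7Validator(schema)
    return sorted(error.message for error in validator.iter_errors(document))


good = {"VLRange": [{"Instance": 1, "Start": 1, "End": 112}]}
bad = {"VLRange": [{"Instance": 1, "Start": "1"}]}  # wrong type, missing End
print(validation_errors(good))
print(validation_errors(bad))
```

Collecting all errors with `iter_errors`, rather than stopping at the first one, makes it easier to see at a glance how far a new annotation file departs from the schema.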