Towards a more perfect union type - Michał J. Gajda

Page created by Carrie Carrillo
 
CONTINUE READING
Towards a more perfect union type
 Michał J. Gajda

 Abstract derive JSON types [2]1 . This approach requires considerable
 We present a principled theoretical framework for inferring engineering time due to the implementation of unit tests in
 and checking the union types, and show its work in practice a case-by-case mode, instead of formulating laws applying
 on JSON data structures. to all types. Moreover, this approach lacks a sound underly-
 The framework poses a union type inference as a learn- ing theory. Regular expression types were also used to type
arXiv:2011.03076v2 [cs.PL] 12 Jan 2021

 ing problem from multiple examples. The categorical frame- XML documents [13], which does not allow for selecting al-
 work is generic and easily extensible. ternative representation. In the present study, we generalize
 previously introduced approaches and enable a systematic
 CCS Concepts: • Software and its engineering → Gen- addition of not only value sets, but inference subalgorithms,
 eral programming languages; • Social and professional to the union type system.
 topics → History of programming languages.
 1.1.2 Frameworks for describing type systems. Type
 ACM Reference Format: systems are commonly expressed as partial relation of typ-
 Michał J. Gajda. 2020. Towards a more perfect union type. In Pro- ing. Their properties, such as subject reduction are also ex-
 ceedings of 32nd International Symposium on Implementation and
 pressed relatively to the relation (also partial) of reduction
 Application of Functional Languages (IFL2020). ACM, New York, NY,
 USA, 14 pages.
 within a term rewriting system. General formulations have
 been introduced for the Damas-Milner type systems param-
 eterized by constraints [23]. It is also worth noting that tradi-
 1 Introduction tional Damas-Milner type disciplines enjoy decidability, and
 Typing dynamic languages has been long considered a chal- embrace the laws of soundness, and subject-reduction. How-
 lenge [3]. The importance of the task grown with the ubiq- ever these laws often prove too strict during type system
 uity cloud application programming interfaces (APIs) utiliz- extension, dependent type systems often abandon subject-
 ing JavaScript object notation (JSON), where one needs to reduction, and type systems of widely used programming
 infer the structure having only a limited number of sample languages are either undecidable [21], or even unsound [27].
 documents available. Previous research has suggested it is Early approaches used lattice structure on the types [25],
 possible to infer adequate type mappings from sample data which is more stringent than ours since it requires idem-
 [2, 8, 14, 20]. potence of unification (as join operation), as well as com-
 In the present study, we expand on these results. We pro- plementary meet operation with the same properties. Se-
 pose a modular framework for type systems in program- mantic subtyping approach provides a characterization of
 ming languages as learning algorithms, formulate it as equa- a set-based union, intersection, and complement types [9,
 tional identities, and evaluate its performance on inference 10], which allows model subtype containment on first-order
 of Haskell data types from JSON API examples. types and functions. This model relies on building a model
 using infinite sets in set theory, but its rigidity fails to gen-
 1.1 Related work eralize to non-idempotent learning2 . We are also not aware
 of a type inference framework that consistently and com-
 1.1.1 Union type providers. The earliest practical effort
 pletely preserve information in the face of inconsistencies
 to apply union types to JSON inference to generate Haskell
 nor errors, beyond using bottom and expanding to infamous
 types [14]. It uses union type theory, but it also lacks an ex-
 undefined behaviour [5].
 tensible theoretical framework. F# type providers for JSON
 We propose a categorical and constructive framework that
 facilitate deriving a schema automatically; however, a type
 preserves the soundness in inference while allowing for con-
 system does not support union of alternatives and is given
 sistent approximations. Indeed our experience is that most
 shape inference algorithm, instead of design driven by de-
 of the type system implementation may be generic.
 sired properties [20]. The other attempt to automatically in-
 fer schemas has been introduced in the PADS project [8].
 Nevertheless, it has not specified a generalized type-system
 1 This
 design methodology. One approach uses Markov chains to approach uses Markov chains to infer best of alternative type
 representations.
 2 Which would allow extension with machine learning techniques like
 IFL2020, September, 2020, Markov chains to infer optimal type representation from frequency of oc-
 2020. curing values[2].
IFL2020, September, 2020, Michał J. Gajda

2 Motivation data Example = Example { f_6408f5 :: O_6408f5
Here, we consider several examples similar to JSON API de- , f_54fced :: O_6408f5
scriptions. We provide these examples in the form of a few , f_6c9589 :: O_6408f5 }
JSON objects, along with desired representation as Haskell data O_6408f5 = O_6408f5 { size, height :: Int
data declaration. , difficulty :: Double
 , previous :: String }
 1. Subsets of data within a single constructor: However, when this object has multiple keys with values
 a. API argument is an email – it is a subset of valid of the same structure, the best representation is that of a
 String values that can be validated on the client- mapping shown below. This is also an example of when user
 side. may decide to explicitly add evidence for one of the alter-
 b. The page size determines the number of results to re- native representations in the case when input samples are
 turn (min: 10, max:10,000) – it is also a subset of in- insufficient. (like when input samples only contain a single
 teger values (Int) between 10, and 10, 000 element dictionary.)
 c. The date field contains ISO8601 date – a record field data ExampleMap = ExampleMap ( Map Hex ExampleElt)
 represented as a String that contains a calendar data ExampleElt = ExampleElt {
 date in the format "2019-03-03" size :: Int
 , height :: Int
 2. Optional fields: The page size is equal to 100 by default
 , difficulty :: Double
 – it means we expect to see the record like {"page_size":
 , previous :: String }
 50} or an empty record {} that should be interpreted
 in the same way as {"page_size": 100}
 2.1 Goal of inference
 3. Variant fields: Answer to a query is either a number Given an undocumented (or incorrectly labelled) JSON API,
 of registered objects, or String "unavailable" - this is we may need to read the input of Haskell encoding and
 integer value (Int) or a String (Int :|: String) avoid checking for the presence of unexpected format devia-
 4. Variant records: Answer contains either a text message tions. At the same time, we may decide to accept all known
 with a user identifier or an error. – That can be repre- valid inputs outright so that we can use types3 to ensure
 sented as one of following options: that the input is processed exhaustively.
 Accordingly, we can assume that the smallest non-singleton
{"message": "Where can I submit proposal?", "uid" : 1014} set is a better approximation type than a singleton set. We
{"message": "Submit it to HotCRP", "uid" : 317}
 call it minimal containing set principle.
{"error" : "Authorization failed", "code": 401}
 Second, we can prefer types that allow for a fewer num-
{"error" : "User not found", "code": 404}
 ber of degrees of freedom compared with the others, while
 data Example4 = Message { message :: String conforming to a commonly occurring structure. We denote
 , uid :: Int } it as an information content principle.
 | Error { error :: String Given these principles, and examples of frequently oc-
 , code :: Int } curring patterns, we can infer a reasonable world of types
 that approximate sets of possible values. In this way, we
 5. Arrays corresponding to records:
 can implement type system engineering that allows deriving
 type system design directly from the information about data
[ [1, "Nick", null ]
 structures and the likelihood of their occurrence.
, [2, "George", "2019-04-11"]
, [3, "Olivia", "1984-05-03"] ]
 3 Problem definition
 6. Maps of identical objects (example from [2]): As we focus on JSON, we utilize Haskell encoding of the
{ "6408f5": { "size": 969709 , "height": JSON term for convenient reading(from Aeson package [1]);
 510599
 , "difficulty": 866429.732, "previous":specified
 "54fced" as follows:
 },
 "54fced": { "size": 991394 , "height": data Value = Object (Map String Value)
 510598
 , "difficulty": 866429.823, "previous": "6c9589" }, | Array [ Value ]
 "6c9589": { "size": 990527 , "height": 510597 | Null
 , "difficulty": 866429.931, "previous": "51a0cb" } }| Number Scientific
 | String Text
 It should be noted that the last example presented above | Bool Bool
requires Haskell representation inference to be non-monotonic,
as an example of object with only a single key would be best
represented by a record type: 3 Compiler feature of checking for unmatched cases.
Towards a more perfect union type IFL2020, September, 2020,

3.1 Defining type inference correspond to a statically assigned and a narrow type. In
3.1.1 Information in the type descriptions. If an infer- this setting, mempty would be fully polymorphic type ∀ . .
ence fails, it is always possible to correct it by introducing an Languages with dynamic type discipline will treat beyond
additional observation (example). To denote unification op- as untyped, dynamic value and mempty again is an entirely
eration, or information fusion between two type descrip- unknown, polymorphic value (like a type of an element of
tions, we use a Semigroup interface operation to merge an empty array)6 .
types inferred from different observations. If the semigroup class (Monoid t , Eq t, Show t)
is a semilattice, then is meet operation (least upper bound). => Typelike t where
Note that this approach is dual to traditional unification that beyond :: t -> Bool
narrows down solutions and thus is join operation (greatest Besides, the standard laws for a commutative Monoid,
lower bound). We use a neutral element of the Monoid to we state the new law for the beyond set: The beyond set
indicate a type corresponding to no observations. is always closed to information addition by (a) or
 class Semigroup ty where (a) for any value of a, or submonoid. In other words,
 (⋄ ) :: ty -> ty -> ty the beyond set is an attractor of on both sides.7 However,
 class Semigroup ty we do not require idempotence of , which is uniformly
 => Monoid ty where present in union type frameworks based on the lattice [25]
 mempty :: ty and set-based approaches8 [9]. Concerning union types, the
 In other words, we can say that mempty (or bottom) ele- key property of the beyond set, is that it is closed to infor-
ment corresponds to situation where no information was mation acquisition:
accepted about a possible value (no term was seen, not even In this way, we can specify other elements of beyond set
a null). It is a neutral element of Typelike. For example, instead of a single top. When under strict type discipline,
an empty array [] can be referred to as an array type with like that of Haskell [21], we seek to enable each element of
mempty as an element type. This represents the view that the beyond set to contain at least one error message9 .
 always gathers more information about the type, as We abolish the semilattice requirement that has been con-
opposed to the traditional unification that always narrows ventionally assumed for type constraints [24], as this require-
down possible solutions. We describe the laws described be- ment is valid only for the strict type constraint inference,
low as QuickCheck [4] properties so that unit testing can be not for a more general type inference considered as a learn-
implemented to detect apparent violations. ing problem. As we observe in example 5 in sec. 2, we need
 to perform a non-monotonic step of choosing alternative
3.1.2 Beyond set. In the domain of permissive union types, representation after monotonic steps of merging all the in-
a beyond set represents the case of everything permitted formation.
or a fully dynamic value when we gather the information When a specific instance of Typelike is not a semilat-
that permits every possible value inside a type. At the first tice (an idempotent semigroup), we will explicitly indicate
reading, it may be deemed that a beyond set should com- that is the case. It is convenient validation when testing a
prise of only one single element – the top one (arriving at recursive structure of the type. Note that we abolish semi-
complete bounded semilattice), but this is too narrow for lattice requirement that was traditionally assumed for type
our purpose of monotonically gathering information constraints here [25]. That is because this requirement is
 However, since we defined generalization operator valid only for strict type constraint inference, not for a more
as information fusion (corresponding to unification in cat- general type inference as a learning problem. As we saw
egorically dual case of strict type systems.), we may encounter on ExampleMap in sec. 2, we need non-monotonic inference
difficulties in assuring that no information has been lost dur- when dealing with alternative representations. We note that
ing the generalization4 . Moreover, strict type systems usu- this approach significantly generalized the assumptions com-
ally specify more than one error value, as it should con- pared with a full lattice subtyping [24, 25].
tain information about error messages and keep track from Time to present the relation of typing and its laws. In
where an error has been originated5 . order to preserve proper English word order, we state that
 This observation lets us go well beyond typing statement ty 8 Types 8 val instead of classical val :ty. Specifying the laws
of gradual type inference as a discovery problem from in- of typing is important, since we may need to separately con-
complete information [22]. Here we consider type inference sider the validity of a domain of types/type constraints, and
as a learning problem furthermore, find common ground
between the dynamic and the static typing discipline. The 6 May sound similar until we consider adding more information to the type.

languages relying on the static type discipline usually con- 7 So both for ∀a( a) and ∀a.(a), the result is kept in the beyond set.
 8 Which use Heyting algebras, which have more assumptions that the lat-
sider beyond as a set of error messages, as a value should
 tice approaches.
4 Examples will be provided later. 9 Note that many but not all type constraints are semilattice. Please refer to
5 In this case: beyond (Error _) = True | otherwise = False. the counting example below.
IFL2020, September, 2020, Michał J. Gajda

that of the sound typing of the terms by these valid types. 3.3 Constraint definition
The minimal definition of typing inference relation and type 3.3.1 Flat type constraints. Let us first consider typing
checking relation is formulated as consistency between these of flat type: String (similar treatment should be given to
two operations. the Number.type.)
 class Typelike ty => ty ‘ Types‘ val where data StringConstraint = SCDate | SCEmail
 infer :: val -> ty | SCEnum ( Set Text) {- non-empty set of observed values -}
 check :: ty -> val -> Bool | SCNever {- mempty -} | SCAny {- beyond -}
 First, we note that to describe no information, mempty can- instance StringConstraint ‘Types ‘ Text where
not correctly type any term. A second important rule of typ- infer ( isValidDate -> True ) = SCDate
ing is that all terms are typed successfully by any value in infer ( isValidEmail -> True ) = SCEmail
the beyond set. Finally, we state the most intuitive rule for infer value = SCEnum
typing: a type inferred from a term, must always be valid $ Set.singleton value
for that particular term. The law asserts that the diagram infer _ = SCAny
on the figure commutes:
 The last law states that the terms are correctly type-checked check SCDate s = isValidDate s
after adding more information into a single type. (For infer- check SCEmail s = isValidEmail s
ence relation, it would be described as principal type prop- check ( SCEnum vs) s = s ‘Set.member ‘ vs
erty.) The minimal Typelike instance is the one that con- check SCNever _ = False
tains only mempty corresponding to the case of no sample check SCAny _ = True
data received, and a single beyond element for all values
 instance Semigroup StringConstraint where
permitted. We will define it below as PresenceConstraint
 SCNever ⋄ = 
in sec. 3.3.3. These laws are also compatible with the strict,
 SCAny ⋄ _ = SCAny
static type discipline: namely, the beyond set corresponds to
 SCDate ⋄ SCDate = SCDate
a set of constraints with at least one type error, and a task
 SCEmail ⋄ SCEmail = SCEmail
of a compiler to prevent any program with the terms that
 (SCEnum ) ⋄ ( SCEnum )
type only to the beyond as a least upper bound.
 | length ( ‘ Set.union ‘ ) ≤ 10 = SCEnum ( ⋄ )
 _ ⋄ _ = SCAny
3.2 Type engineering principles
Considering that we aim to infer a type from a finite number 3.3.2 Free union type. Before we endeavour on finding
of samples, we encounter a learning problem, so we need to type constraints for compound values (arrays and objects),
use prior knowledge about the domain for inferring types. it might be instructive to find a notion of free type, that is
Observing that : false we can expect that in particular a type with no additional laws but the ones stated above.
cases, we may obtain that : true. After noting that : 123, Given a term with arbitrary constructors we can infer a free
we expect that : 100 would also be acceptable. It means type for every term set as follows: For any value type
that we need to consider a typing system to learn a reason- Set satisfies our notion of free type specified as follows:
able general class from few instances. This observation moti- data FreeType = FreeType { captured :: Set } | Full
vates formulating the type system as an inference problem. instance (Ord , Eq )
As the purpose is to deliver the most descriptive10 types, we => Semigroup ( FreeType ) where
assume that we need to obtain a broader view rather than fo- Full ⋄ _ = Full
cusing on a free type and applying it to larger sets whenever _ ⋄ Full = Full
it is deemed justified. ⋄ = FreeType
 The other principle corresponds to correct operation. It $ (Set.union ‘ on‘ captured) 
implies that having operations regarded on types, we can instance (Ord , Eq , Show )
find a minimal set of types that assure correct operation in => Typelike ( FreeType ) where
the case of unexpected errors. Indeed we want to apply this beyond = (==Full )
theory to infer a type definition from a finite set of examples.
We also seek to generalize it to infinite types. We endeavour instance (Ord , Eq , Show )
rules to be as short as possible. The inference must also be => FreeType ‘ Types‘ where
a contravariant functor with regards to constructors. For
example, if AType x y types {"a": X, "b": Y}, then x infer = FreeType . Set.singleton
must type X, and y must type Y. check Full _term = True
 check (FreeType s) term = term ‘Set.member‘ s
 This definition is deemed sound and applicable to finite
10 The shortest one according to the information complexity principle. sets of terms or values. For a set of values: ["yes", "no",
Towards a more perfect union type IFL2020, September, 2020,

 Type 1 × Type 2

 1 2
 
 Type 2 Type 1 
 Type 1 Type 1 Type 2 Type 2

 infer check value 1 check value 2 infer

 Value 1 check with 1
 True check with 2
 Value 2

 Figure 1. Categorical diagram for Typelike.

"error"], we may reasonably consider that type is an ap- (with options being more complex to process because they
propriate approximation of C-style enumeration, or Haskell- need additional case expression.) In such cases, the latter
style ADT without constructor arguments. However, the de- definition has only one Maybe field (on the toplevel optional-
ficiency of this notion of free type is that it does not allow ity is one), while the former definition has four Maybe fields
generalizing in infinite and recursive domains! It only al- (optionality is four). When we obtain more samples, the pat-
lows to utilize objects from the sample. tern emerges:
3.3.3 Presence and absence constraint. We call the de-
generate case of Typelike a presence or absence constraint. {"error" : "Authorization failed", "code": 401
It just checks that the type contains at least one observation {"message": "Where can I submit my proposal?", "uid" : 1014
of the input value or no observations at all. It is vital as it {"message": "Sent it to HotCRP", "uid" : 93
can be used to specify an element type of an empty array. Af- {"message": "Thanks!", "uid" : 1014
ter seeing true value, we also expect false, so we can say {"error" : "Missing user", "code": 404
that it is also a primary constraint for pragmatically indivis-
ible like the set of boolean values. The same observation is
valid for null values, as there is only one null value ever Type cost function. Since we are interested in types with
to observe. less complexity and less optionality, we will define cost func-
 type BoolConstraint = PresenceConstraint Bool tion as follows:
 type NullConstraint = PresenceConstraint () class Typelike ty => TypeCost ty where
 data PresenceConstraint = Present | Absent typeCost :: ty -> TyCost
 typeCost = if == mempty then 0 else 1
 Variants. It is simple to represent a variant of two mutu-
 instance Semigroup TyCost where (⋄ ) = ( +)
ally exclusive types. They can be implemented with a type re-
 instance Monoid TyCost where mempty = 0
lated to Either type that assumes these types are exclusive,
we denote it by :|:. In other words for Int :|: String
 newtype TyCost = TyCost Int
type, we first control whether the value is an Int, and if
 When presented with several alternate representations
this check fails, we attempt to check it as a String. Variant
 from the same set of observations, we will use this function
records are slightly more complicated, as it may be unclear
 to select the least complex representation of the type. For
which typing is better to use:
 flat constraints as above, we infer that they offer no option-
{"message": "Where can I submit my proposal?", "uid" :ality 1014}
 when no observations occurred (cost of mempty is 0),
{"error" : "Authorization failed", "code": 401} the cost is 1. Type cost should be non-negative,
 otherwise,
 data OurRecord = OurRecord { and non-decreasing when we add new observations to the
 message , error :: Maybe String type.
 , code, uid :: Maybe Int }
 data OurRecord2 =
 Message { message :: String , uid :: Int } 3.3.4 Object constraint. To avoid information loss, a con-
 | Error { error :: String , code :: Int } straint for JSON object type is introduced in such a way to
 The best attempt here is to rely on the available examples simultaneously gather information about representing
being reasonably exhaustive. That is, we can estimate how it either as a Map, or a record. The typing of Map would be
many examples we have for each, and how many of them specified as follows, with the optionality cost being a sum
match. Then, we compare this number with type complexity of optionalities in its fields.
IFL2020, September, 2020, Michał J. Gajda

 data MappingConstraint = MappingNever – mempty 3.3.5 Array constraint. Similarly to the object type, ArrayConstraint
 | MappingConstraint { is used to simultaneously obtain information about all pos-
 keyConstraint :: StringConstraint sible representations of an array, differentiating between an
 , valueConstraint :: UnionType } array of the same elements, and a row with the type depend-
 instance TypeCost MappingConstraint where ing on a column. We need to acquire the information for
 typeCost MappingNever = 0 both alternatives separately, and then, to measure a relative
 typeCost MappingConstraint {..} = likelihood of either case, before mapping the union type to
 typeCost keyConstraint Haskell declaration. Here, we specify the records for two
 + typeCost valueConstraint different possible representations:
 Separately, we acquire the information about a possible data ArrayConstraint = ArrayNever – mempty
typing of a JSON object as a record of values. Note that | ArrayConstraint { rowCase :: RowConstraint
RCTop never actually occurs during inference. That is, we , arrayCase :: UnionType }
could have represented the RecordConstraintas a Typelike Semigroup operation just merges information on the com-
with an empty beyond set. The merging of constraints would ponents, and the same is done when inferring types or check-
be simply merging of all column constraints. ing them: For the arrays, we plan to choose again only one
 data RecordConstraint = of possible representations, so the cost of optionality is the
 RCTop – beyond lesser of the costs of the representation-specific constraints.
 | RCBottom – mempty instance ArrayConstraint ‘Types ‘ Array where
 | RecordConstraint { infer vs = ArrayConstraint { rowCase = infer vs
 fields :: HashMap Text UnionType } , arrayCase = mconcat ( infer ♦$
 instance RecordConstraint ‘Types ‘ Object where Foldable.toList vs) }
 infer = RecordConstraint . Map.fromList check ArrayNever vs = False
 . fmap (second infer) . Map.toList check ArrayConstraint {..} vs =
 check RecordConstraint {fields } obj = check rowCase vs
 all (‘elem‘ Map.keys fields ) && and ( check arrayCase ♦$
 ( Map.keys obj) Foldable.toList vs)
 && and (Map.elems $ Map.intersectionWith
 check fields obj) 3.3.6 Row constraint. A row constraint is valid only if
 && all isNullable ( Map.elems there is the same number of entries in all rows, which is rep-
 $ fields ‘ Map.difference‘ obj) resented by escaping the beyond set whenever there is an
 – absent values are nullable uneven number of columns. Row constraint remains valid
 Observing that the two abstract domains considered above only if both constraint describe the record of the same length;
are independent, we can store the information about both otherwise, we yield RowTop to indicate that it is no longer
 11
options separately in a record . It should be noted that this valid. In other words, RowConstraint is a levitated semilat-
 tice[16] 12 with a neutral element over the content type that
representation is similar to intersection type: any value that
satisfies ObjectConstraintmust conform to both mappingCase, is a list of UnionType objects.
and recordCase. Also, this intersection approach in order to data RowConstraint = RowTop | RowNever
address alternative union type representations benefit from | Row [UnionType]
principal type property, meaning that a principal type serves 3.3.7 Combining the union type. It should note that
to acquire the information corresponding to different rep- given the constraints for the different type constructors, the
resentations and handle them separately. Since we plan to union type can be considered as mostly a generic Monoid in-
choose only one representation for the object, we can say stance [11]. Merging information with and mempty follow
that the minimum cost of this type is the minimum of com- the pattern above, by just lifting operations on the compo-
ponent costs. nent.
 data ObjectConstraint = ObjectNever – mempty data UnionType = UnionType {
 | ObjectConstraint { unionNull :: NullConstraint
 mappingCase :: MappingConstraint , unionBool :: BoolConstraint
 , recordCase :: RecordConstraint } , unionNum :: NumberConstraint
 instance TypeCost ObjectConstraint where , unionStr :: StringConstraint
 typeCost ObjectConstraint { ..} = typeCost mappingCase , unionArr :: ArrayConstraint
 ‘min ‘ typeCost recordCase , unionObj :: ObjectConstraint }

11 The choice of representation will be explained later. Here we only con- 12 Levitated
 lattice is created by appending distinct bottom and top to a set
sider acquiring information about possible values. that does not possess them by itself.
Towards a more perfect union type IFL2020, September, 2020,

 The generic structure of union type can be explained by 3.3.9 Counting observations. In this section, we discuss
the fact that the information contained in each record field is how to gather information about the number of samples sup-
independent from the information contained in other fields. porting each alternative type constraint. To explain this, the
It means that we generalize independently over different di- other example can be considered:
mensions13 {"history": [
 Inference breaks down disjoint alternatives correspond-
 {"error" : "Authorization failed",
ing to different record fields, depending on the constructor
 "code" : 401}
of a given value. It enables implementing a clear and effi-
 ,{"message": "Where can I submit my proposal?",
cient treatment of different alternatives separately14 . Since
 "uid" : 1014}
union type is all about optionality, we need to sum all op- ,{"message": "Sent it to HotCRP",
tions from different alternatives to obtain its typeCost.
 "uid" : 93}
 instance UnionType ‘ Types‘ Value where
 ,{"message": "Thanks!",
 infer ( Bool ) = mempty { unionBool =
 "uid" : 1014}
 infer }
 ,{"error" : "Authorization failed",
 infer Null = mempty { unionNull =
 "code" : 401}]}
 infer ( ) }
 infer ( Number n ) = mempty { unionNum = First, we need to identify it as a list of similar elements.
 infer n } Second, there are multiple instances of each record example.
 infer ( String s) = mempty { unionStr = We consider that the best approach would be to use the mul-
 infer s } tisets of inferred records instead. To find the best represen-
 infer ( Object o) = mempty { unionObj = tation, we can a type complexity, and attempt to minimize
 infer o } the term. Next step is to detect the similarities between type
 infer ( Array ) = mempty { unionArr = descriptions introduced for different parts of the term:
 infer } {"history" : [...]
 ,"last_message" : {"message": "Thanks!",
 check UnionType { unionBool } (Bool )= "uid" : 1014} }
 check unionBool 
 check UnionType { unionNull } Null = We can add the auxiliary information about a number of
 check unionNull () samples observed, and the constraint will remain a Typelike
 check UnionType { unionNum } ( Number n)= object. The Counted constraint counts the number of sam-
 check unionNum n ples observed for the constraint inside so that we can de-
 check UnionType { unionStr } (String s )= cide on which alternative representation is best supported
 check unionStr s by evidence. It should be noted that Counted constraint is
 check UnionType { unionObj } (Object o )= the first example that does not correspond to a semilattice,
 check unionObj o that is aa≠a. This is natural for a Typelike object; it is
 check UnionType { unionArr } (Array )= not a type constraint in a conventional sense, just an accu-
 check unionArr mulation of knowledge.
 data Counted = Counted { count::Int, constraint:: }
3.3.8 Overlapping alternatives. The essence of union instance Semigroup 
type systems have long been dealing with the conflicting => Semigroup ( Counted ) where
types provided in the input. Motivated by the examples above, ⋄ = Counted {
we also aim to address conflicting alternative assignments. count = count + count 
It is apparent that examples 4. to 6. hint at more than one , constraint = constraint ⋄ constraint 
assignment: in example 5, a set of lists of values that may }
correspond to Int, String, or null, or a table that has the Therefore, at each step, we may need to maintain a cardi-
same (and predefined) type for each values; in example 6 A nality of each possible value, and is provided with sufficient
record of fixed names or the mapping from hash to a single number of samples, we may attempt to detect15 . To preserve
object type. efficiency, we may need to merge whenever the number of
 alternatives in a multiset crosses the threshold. We can at-
13 In this example, JSON terms can be described by terms without variables, tempt to narrow strings only in the cases when cardinality
and sets of tuples for dictionaries, so generalization by anti-unification is crosses the threshold.
straightforward.
14 The question may arise: what is the union type without set union? When

the sets are disjoint, we just put the values in different bins to enable easier 15 If we detect a pattern too early, we risk to make the types too narrow to

handling. work with actual API responses.
IFL2020, September, 2020, Michał J. Gajda

4 Finishing touches the inference would represent a principal type in Damas-
The final touch would be to perform the post-processing Milner sense.
of an assigned type before generating it to make it more
resilient to common uncertainties. These assumptions may 5.1 Conclusion
bypass the defined least-upper-bound criterion specified in In the present study, we aimed to derive the types that were
the initial part of the paper; however, they prove to work valid with respect to the provided specification16 , thereby
well in practice[2, 14]. obtaining the information from the input in the most com-
 If we have no observations corresponding to an array type, prehensive way. We defined type inference as representa-
it can be inconvenient to disallow an array to contain any tion learning and type system engineering as a meta-learning
values at all. Therefore, we introduce a non-monotonic step problem in which the priors corresponding to the data
of converting the mempty into a final Typelike object aim- structure induced typing rules. We show how the type
ing to introduce a representation allowing the occurrence of safety can be quickly tested as equational laws with QuickCheck,
any Value in the input. That still preserves the validity of which is a useful prototyping tool, and may be supplemented
the typing. We note that the program using our types must with fully formal proof in the future.
not have any assumptions about these values; however, at We also formulated the union type discipline as ma-
the same time, it should be able to print them for debugging nipulation of Typelike commutative monoids, that repre-
purposes. sented knowledge about the data structure. In addition, we
 In most JSON documents, we observe that the same object proposed a union type system engineering methodology that
can be simultaneously described in different parts of sample was logically justified by theoretical criteria. We demonstrated
data structures. Due to this reason, we compare the sets of that it was capable of consistently explaining the decisions
labels assigned to all objects and propose to unify those that made in practice. We followed a strictly constructive proce-
have more than 60% of identical labels. For transparency, the dure, that can be implemented generically.
identified candidates are logged for each user, and a user can We hope that this kind of straightforward type system
also indicate them explicitly instead of relying on automa- engineering will become widely used in practice, replacing
tion. We conclude that this allows considerably decreasing less modular approaches of the past. The proposed approach
the complexity of types and makes the output less redun- may be used to underlie the way towards formal construc-
dant. tion and derivation of type systems based on the specifica-
 tion of value domains and design constraints.
5 Future work Bibliography
In the present paper, we only discuss typing of tree-like val- [1] Aeson: Fast JSON parsing and generation: 2011. https:// hackage.haskel
ues. However, it is natural to scale this approach to multi- [2] A first look at quicktype: 2017. https:// blog.quicktype.io/ first-look/ .
ple types in APIs, in which different types are referred to [3] Anderson, C. et al. 2005. Towards type inference for javascript.
by name and possibly contain each other. To address these ECOOP 2005 - object-oriented programming (Berlin, Hei-
cases, we plan to show that the environment of Typelike delberg, 2005), 428–452.
objects is also Typelike, and that constraint generalization [4] Claessen, K. and Hughes, J. 2000. QuickCheck: A light-
(anti-unification) can be extended in the same way. weight tool for random testing of haskell programs. SIG-
 It should be noted that many Typelike instances for non- PLAN Not. 35, 9 (Sep. 2000), 268–279. DOI:https://doi.org/10.1145/3577
simple types usually follow one the two patterns of (1) for [5] C Standard undefined behaviour versus Wittgenstein:
a finite sum of disjoint constructors, we bin this informa- https:// www.yodaiken.com/2018/ 05/20/ depressing-and-faintly-terrif y
tion by each constructor during the inference (2) for typing [6] C Undefined Behavior - Depressing and Terrifying (Up-
terms with multiple alternative representations, we infer all dated): 2018. https:// www.yodaiken.com/2018/ 05/20/ depressing-and-f
constraints separately for each alternative representation. [7] EnTangleD: A bi-directional literate programming tool:
In both cases, Generic derivation procedure for the Monoid, 2019. https:// blog.esciencecenter.nl/entangled-1744448f 4b9f .
Typelike, and TypeCost instances is possible [17]. This al- [8] Fisher, K. and Walker, D. 2011. The PADS Project: An
lows us to design a type system by declaring datatypes them- Overview. Proceedings of the 14th international confer-
selves and leave implementation to the compiler. Manual ence on database theory (New York, NY, USA, 2011), 11–
implementation would be only left for special cases, like 17.
StringConstraint and Counted constraint. [9] Frisch, A. et al. 2002. Semantic subtyping. Proceedings
 Finally, we believe that we can explain the duality of cate- 17th annual ieee symposium on logic in computer science
gorical framework of Typelike categories and use general- (2002), 137–146.
ization (anti-unification) instead of unification (or narrow-
ing) as a type inference mechanism. The beyond set would 16 Specificationwas given in the motivation section descriptions of JSON in-
then correspond to a set of error messages, and a result of put examples, and the expected results given as Haskell type declarations.
Towards a more perfect union type IFL2020, September, 2020,

[10] Frisch, A. et al. 2008. Semantic subtyping: Dealing set-
 theoretically with function, union, intersection, and nega-
 tion types. J. ACM. 55, 4 (Sep. 2008). DOI:https://doi.org/10.1145/1391289.1391293.
[11] Generics example: Creating monoid instances: 2012. https:// www.yesodweb.com/blog/ 2012/10/ generic-monoid .
[12] GHCID - a new ghci based ide (ish): 2014. http:// neilmitchell.blogspot.com/ 2014/09/ ghcid-new-ghci-based-ide-ish.html.
[13] Hosoya, H. and Pierce, B. 2000. XDuce: A typed xml
 processing language. (Jun. 2000).
[14] JSON autotype: Presentation for Haskell.SG: 2015. https:// engineers.sg/video/json-autotype-1-0-haskell-sg--429 .
[15] Knuth, D.E. 1984. Literate programming. Comput. J. 27,
 2 (May 1984), 97–111. DOI:https://doi.org/10.1093/comjnl/27.2.97.
[16] Lattices: Fine-grained library for constructing and ma-
 nipulating lattices: 2017. http:// hackage.haskell.org/ package/ lattices-2.0.2/ docs/Algebra-Lattice-Levitated.html .
[17] Magalhães, J.P. et al. 2010. A generic deriving mecha-
 nism for haskell. SIGPLAN Not. 45, 11 (Sep. 2010), 37–48.
 DOI:https://doi.org/10.1145/2088456.1863529.
[18] Michal J. Gajda, D.K. 2020. Fast XML/HTML tools for
 Haskell: XML Typelift and improved Xeno. Manuscript
 under review.
[19] Pandoc: A universal document converter: 2000. https:// pandoc.org.
[20] Petricek, T. et al. 2016. Types from Data: Making Struc-
 tured Data First-Class Citizens in F#. SIGPLAN Not. 51, 6
 (Jun. 2016), 477–490. DOI:https://doi.org/10.1145/2980983.2908115.
[21] Peyton Jones, S. 2019. Type inference as constraint solv-
 ing: How ghc’s type inference engine actually works.
 Zurihac keynote talk.
[22] Siek, J. and Taha, W. 2007. Gradual typing for objects.
 Proceedings of the 21st european conference on object-oriented
 programming (Berlin, Heidelberg, 2007), 2–27.
[23] Sulzmann, M. and Stuckey, P.j. 2008. Hm(x) Type In-
 ference is Clp(x) Solving. J. Funct. Program. 18, 2 (Mar.
 2008), 251–283. DOI:https://doi.org/10.1017/S0956796807006569.
[24] Tiuryn, J. 1992. Subtype inequalities. [1992] Proceedings
 of the Seventh Annual IEEE Symposium on Logic in Com-
 puter Science. (1992), 308–315.
[25] Tiuryn, J. 1997. Subtyping over a lattice (abstract). Com-
 putational logic and proof theory (Berlin, Heidelberg, 1997),
 84–88.
[26] Undefined behavior in 2017: 2017. https:// blog.regehr.org/ archives/1520.
[27] 2019. https:// github.com/microsoft/ TypeScript/issues/ 9825.
6 Appendix: all laws of Typelike

 check mempty = False (mempty contains no terms)
 beyond ⇒ check = True (beyond contains all terms)
 check 1 ⇒ check ( 1 ⋄ 2 ) = True (left fusion keeps terms)
 check 2 ⇒ check ( 1 ⋄ 2 ) = True (right fusion keeps terms)
 check (infer ) = False (inferred type contains the source term)
 1 ⋄ ( 2 ⋄ 3 ) = 1 ⋄ ( 2 ⋄ 3 ) (semigroup associativity)
 mempty ⋄ = (left identity of the monoid)
 ⋄ mempty = (right identity of the monoid)
IFL2020, September, 2020, Michał J. Gajda

7 Appendix: definition module headers
{-# language AllowAmbiguousTypes #-}
{-# language DeriveGeneric #-}
{-# language DuplicateRecordFields #-}
{-# language FlexibleInstances #-}
{-# language GeneralizedNewtypeDeriving #-}
{-# language MultiParamTypeClasses #-}
{-# language NamedFieldPuns #-}
{-# language PartialTypeSignatures #-}
{-# language ScopedTypeVariables #-}
{-# language TypeOperators #-}
{-# language RoleAnnotations #-}
{-# language ViewPatterns #-}
{-# language RecordWildCards #-}
{-# language OverloadedStrings #-}
{-# options_ghc -Wno-orphans #-}
module Unions where

import Control.Arrow(second)
import Data.Aeson
import Data.Maybe (isJust ,catMaybes)
import qualified Data.Foldable as Foldable
import Data.Function ( on )
import Data.Text ( Text )
import qualified Data.Text as Text
import qualified Data.Text.Encoding as Text
import qualified Text.Email.Validate( isValid )
import qualified Data.Set as Set
import Data.Set (Set )
import Data.Scientific
import Data.String
import qualified Data.HashMap.Strict as Map
import Data.HashMap.Strict ( HashMap )
import GHC.Generics ( Generic )
import Data.Hashable
import Data.Typeable
import Data.Time.Format ( iso8601DateFormat
 , parseTimeM
 , defaultTimeLocale)
import Data.Time.Calendar ( Day)
import Missing

≪ freetype >>
≪ typelike >>
≪ basic-constraints >>
≪ row-constraint>>
≪ array- constraint >>
≪ object -constraint >>
≪ presence-absence- constraints >>
≪ union- type-instance >>
≪ type>>
≪ counted >>
≪ typecost >>
≪ representation>>
Towards a more perfect union type IFL2020, September, 2020,

8 Appendix: test suite 9 Appendix: package dependencies
{-# language FlexibleInstances #-} name: union-types
{-# language Rank2Types #-} version: '0.1.0.0'
{-# language MultiParamTypeClasses #-} category: Web
{-# language MultiWayIf #-} author: Anonymous
{-# language NamedFieldPuns #-} maintainer: example@example.com
{-# language ScopedTypeVariables #-} license: BSD-3
{-# language StandaloneDeriving #-} extra-source-files:
{-# language TemplateHaskell #-} - CHANGELOG.md
{-# language TypeOperators #-} - README.md
{-# language TypeApplications #-} dependencies:
{-# language TupleSections #-} - base
{-# language UndecidableInstances #-} - aeson
{-# language AllowAmbiguousTypes #-} - containers
{-# language OverloadedStrings #-} - text
{-# language ViewPatterns #-} - hspec
{-# options_ghc -Wno-orphans #-} - QuickCheck
module Main where - unordered-containers
 - scientific
import qualified Data.Set as Set- hspec
import qualified Data.Text as Text - QuickCheck
import qualified Data.ByteString.Char8 as BS- validity
import Control.Monad(when , replicateM)- vector
import Control.Exception( assert) - unordered-containers
import Data.FileEmbed - scientific
import Data.Maybe - genvalidity
import Data.Scientific - genvalidity-hspec
import Data.Aeson - genvalidity-property
import Data.Proxy - time
import Data.Typeable - email-validate
import qualified Data.HashMap.Strict as Map- generic-arbitrary
import Data.HashMap.Strict ( HashMap)- mtl
import Test.Hspec - hashable
import Test.Hspec.QuickCheck library:
import Test.QuickCheck source-dirs: src
import Test.Arbitrary.Laws exposed-modules:
import Test.LessArbitrary.Laws - Unions
import Test.Validity.Shrinking.Property tests:
import Test.Validity.Utils(nameOf) spec:
import qualified GHC.Generics as Generic main: Spec.hs
import Test.QuickCheck.Classes source-dirs:
import System.Exit(exitFailure ) - test/spec
 dependencies:
import Test.QuickCheck.Arbitrary - union-types
import Test.LessArbitrary as LessArbitrary - mtl
import Unions - random
 - transformers
instance Arbitrary Value where - hashable
 arbitrary = fasterArbitrary - quickcheck-classes
 - file-embed
instance LessArbitrary Value where - bytestring
 lessArbitrary = cheap $$$? genericLessArbitrary - less-arbitrary
 where
 cheap = LessArbitrary.oneof [
 pure Null
 , Bool ♦$ lessArbitrary
 , Number ♦$ lessArbitrary
 ]

instance LessArbitrary 
IFL2020, September, 2020, Michał J. Gajda

10 Appendix: representation of generated htop :: IsString s => s
 Haskell types htop = "Data.Aeson.Value"
We will not delve here into identifier conversion between
JSON and Haskell, so it suffices that we have an abstract
datatypes for Haskell type and constructor identifiers:
 newtype HConsId = HConsId String
 deriving (Eq, Ord , Show,Generic,IsString)
 newtype HFieldId = HFieldId String
 deriving (Eq, Ord , Show,Generic,IsString)
 newtype HTypeId = HTypeId String
 deriving (Eq, Ord , Show,Generic,IsString)
 For each single type we will either describe its exact rep-
resentation or reference to the other definition by name:
 data HType =
 HRef HTypeId
 | HApp HTypeId [HType]
 | HADT [ HCons ]
 deriving (Eq, Ord, Show, Generic)
 For syntactic convenience, we will allow string literals to
denote type references:
 instance IsString HType where 10.1 Code for selecting representation
 fromString = HRef . fromString Below is the code to select Haskell type representation. To
 When we define a single constructor, we allow field and convert union type discipline to strict Haskell type repre-
constructor names to be empty strings (""), assuming that sentations, we need to join the options to get the actual rep-
the relevant identifiers will be put there by post-processing resentation:
that will pick names using types of fields and their contain- toHType :: ToHType ty => ty -> HType
ers [18]. toHType = joinAlts . toHTypes
 data HCons = HCons {
 name :: HConsId joinAlts :: [ HType ] -> HType
 , args :: [ (HFieldId, HType)] joinAlts [] = htop – promotion of empty type
 } joinAlts alts = foldr1 joinPair alts
 deriving ( Eq, Ord, Show, Generic) where
 At some stage we want to split representation into indi- joinPair = HApp ":|:" [ , ]
vidually named declarations, and then we use environment Considering the assembly of UnionType, we join all the
of defined types, with an explicitly named toplevel type: options, and convert nullable types to Maybe types
 data HTypeEnv = HTypeEnv { instance ToHType UnionType where
 toplevel :: HTypeId toHTypes UnionType { ..} =
 , env :: HashMap HTypeId HType prependNullable unionNull opts
 } where
 When checking for validity of types and type environ- opts = concat [ toHTypes unionBool
ments, we might need a list of predefined identifiers that , toHTypes unionStr
are imported: , toHTypes unionNum
 predefinedHTypes :: [HType] , toHTypes unionArr
 predefinedHTypes = [ , toHTypes unionObj ]
 "Data.Aeson.Value"
 , "()" prependNullable :: PresenceConstraint -> [ HType] -> [HT
 , "Double" prependNullable Present tys = [HApp "Maybe" [ joinAlts tys]
 , "String" prependNullable Absent tys = tys
 , "Int" The type class returns a list of mutually exclusive type
 , "Date" – actually: "Data.Time.CalendarDay" representations:
 , "Email" – actually: "Data.Email" class Typelike ty
 ] => ToHType ty where
 Consider that we also have an htop value that represents toHTypes :: ty -> [HType]
any possible JSON value. It is polimorphic for ease of use: Conversion of flat types is quite straightforward:
Towards a more perfect union type IFL2020, September, 2020,

 instance ToHType BoolConstraint where instance ToHType ObjectConstraint where
 toHTypes Absent = [ ] toHTypes ObjectNever = []
 toHTypes Present = [ "Bool"] toHTypes ObjectConstraint { ..} =
 instance ToHType NumberConstraint where if typeCost recordCase ≤ typeCost mappingCase
 toHTypes NCNever = [ ] then toHTypes recordCase
 toHTypes NCFloat = [ "Double"] else toHTypes mappingCase
 toHTypes NCInt = [ "Int"]
 instance ToHType StringConstraint where instance ToHType RecordConstraint where
 toHTypes SCAny = [ "String" ] toHTypes RCBottom = [ ]
 toHTypes SCEmail = [ "Email" ] toHTypes RCTop = [ htop ] – should never happen
 toHTypes SCDate = [ "Date" ] toHTypes ( RecordConstraint fields) =
 toHTypes ( SCEnum es) = [ HADT $ [HADT
 mkCons ♦$ Set.toList es [ HCons "" $ fmap convert $ Map.toList fields]
 ] ]
 where where
 mkCons = (‘ HCons‘ []) convert (k,v) = ( HFieldId $ Text.unpack k
 . HConsId , toHType v )
 . Text.unpack
 toHTypes SCNever = [] instance ToHType MappingConstraint where
 For array and object types we pick the representation toHTypes MappingNever = []
which presents the lowest cost of optionality: toHTypes MappingConstraint {..} =
 [HApp "Map" [ toHType keyConstraint
 , toHType valueConstraint
 ]]

 instance ToHType RowConstraint where
 toHTypes RowNever = [ ]
 toHTypes RowTop = [ htop]
 toHTypes ( Row cols) =
 [HADT
 [ HCons "" $ fmap (\ut -> ("", toHType ut)) cols]
 ]

 instance ToHType ArrayConstraint where
 toHTypes ArrayNever = []
 toHTypes ArrayConstraint { .. } =
 if typeCost arrayCase ≤ typeCost rowCase
 – || count
IFL2020, September, 2020, Michał J. Gajda

 isValidDate :: Text -> Bool {-# language AllowAmbiguousTypes #-}
 isValidDate = isJust {-# language DeriveGeneric #-}
 . parseDate {-# language DuplicateRecordFields #-}
 . Text.unpack {-# language FlexibleInstances #-}
 where {-# language GeneralizedNewtypeDeriving #-}
 parseDate :: String -> Maybe Day {-# language MultiParamTypeClasses #-}
 parseDate = parseTimeM True {-# language NamedFieldPuns #-}
 defaultTimeLocale $ {-# language PartialTypeSignatures #-}
 iso8601DateFormat Nothing {-# language ScopedTypeVariables #-}
 {-# language StandaloneDeriving #-}
 isValidEmail :: Text -> Bool {-# language TypeOperators #-}
 isValidEmail = Text.Email.Validate.isValid {-# language RoleAnnotations #-}
 . Text.encodeUtf8 {-# language ViewPatterns #-}
 instance ( Hashable k {-# language RecordWildCards #-}
 , Hashable v) {-# language OverloadedStrings #-}
 => Hashable (HashMap k v ) where {-# options_ghc -Wno-orphans #-}
 hashWithSalt s = hashWithSalt s module Missing where
 . Foldable.toList
 import Control.Arrow(second)
 instance Hashable v import Data.Aeson
 => Hashable ( V.Vector v ) where import Data.Maybe (isJust ,catMaybes)
 hashWithSalt s = hashWithSalt s import qualified Data.Foldable as Foldable
 . Foldable.toList import Data.Function ( on )
 import Data.Text ( Text )
 – instance Hashable Scientific where import qualified Data.Text as Text
 – instance Hashable Value where import qualified Data.Text.Encoding as Text
 Then we put all the missing code in the module: import qualified Text.Email.Validate( isValid )
 import qualified Data.Set as Set
 import Data.Set (Set )
 import Data.Scientific
 import Data.String
 import qualified Data.Vector as V
 import qualified Data.HashMap.Strict as Map
 import Data.HashMap.Strict ( HashMap )
 import GHC.Generics ( Generic )
 import Data.Hashable
 import Data.Typeable
 import Data.Time.Format ( iso8601DateFormat,parse
 import Data.Time.Calendar ( Day)

 ≪ missing>>

 Acknowledgments
 The author thanks for all tap-on-the-back donations to his
 past projects.
 We wrote the article with the great help of bidirectional
 literate programming [15] tool [7], Pandoc [19] markdown
 publishing system and live feedback from GHCid [12].
You can also read