Coding categorical data¶
Patsy allows great flexibility in how categorical data is coded,
via the function C()
. C()
marks some data as being
categorical (including data which would not automatically be treated
as categorical, such as a column of integers), while also optionally
setting the preferred coding scheme and level ordering.
Let’s get some categorical data to work with:
In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly
In [2]: data = demo_data("a", nlevels=3)
In [3]: data
Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}
As you know, simply giving Patsy a categorical variable causes it
to be coded using the default Treatment
coding
scheme. (Strings and booleans are treated as categorical by default.)
In [4]: dmatrix("a", data)
Out[4]:
DesignMatrix with shape (6, 3)
Intercept a[T.a2] a[T.a3]
1 0 0
1 1 0
1 0 1
1 0 0
1 1 0
1 0 1
Terms:
'Intercept' (column 0)
'a' (columns 1:3)
We can also alter the level ordering, which is useful for, e.g.,
Diff
coding:
In [5]: l = ["a3", "a2", "a1"]
In [6]: dmatrix("C(a, levels=l)", data)
Out[6]:
DesignMatrix with shape (6, 3)
Intercept C(a, levels=l)[T.a2] C(a, levels=l)[T.a1]
1 0 1
1 1 0
1 0 0
1 0 1
1 1 0
1 0 0
Terms:
'Intercept' (column 0)
'C(a, levels=l)' (columns 1:3)
But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:
In [7]: dmatrix("C(a, Poly)", data)
Out[7]:
DesignMatrix with shape (6, 3)
Intercept C(a, Poly).Linear C(a, Poly).Quadratic
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
Terms:
'Intercept' (column 0)
'C(a, Poly)' (columns 1:3)
There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:
In [8]: contrast = [[1, 2], [3, 4], [5, 6]]
In [9]: dmatrix("C(a, contrast)", data)
Out[9]:
DesignMatrix with shape (6, 3)
Intercept C(a, contrast)[custom0] C(a, contrast)[custom1]
1 1 2
1 3 4
1 5 6
1 1 2
1 3 4
1 5 6
Terms:
'Intercept' (column 0)
'C(a, contrast)' (columns 1:3)
In [10]: dmatrix("C(a, [[1], [2], [-4]])", data)
Out[10]:
DesignMatrix with shape (6, 2)
Intercept C(a, [[1], [2], [-4]])[custom0]
1 1
1 2
1 -4
1 1
1 2
1 -4
Terms:
'Intercept' (column 0)
'C(a, [[1], [2], [-4]])' (column 1)
Hmm, those [custom0]
, [custom1]
names that Patsy
auto-generated for us are a bit ugly looking. We can attach names to
our contrast matrix by creating a ContrastMatrix
object, and
make things prettier:
In [11]: contrast_mat = ContrastMatrix(contrast, ["[pretty0]", "[pretty1]"])
In [12]: dmatrix("C(a, contrast_mat)", data)
Out[12]:
DesignMatrix with shape (6, 3)
Intercept C(a, contrast_mat)[pretty0] C(a, contrast_mat)[pretty1]
1 1 2
1 3 4
1 5 6
1 1 2
1 3 4
1 5 6
Terms:
'Intercept' (column 0)
'C(a, contrast_mat)' (columns 1:3)
And, finally, if we want to get really fancy, we can also define our
own “smart” coding schemes like Poly
. Just define a class
that has two methods, code_with_intercept()
and
code_without_intercept()
. They have identical signatures, taking
a list of levels as their argument and returning a
ContrastMatrix
. Patsy will automatically choose the
appropriate method to call to produce a full-rank design matrix
without redundancy; see Redundancy and categorical factors for the full details on how
Patsy makes this decision.
As an example, here’s a simplified version of the built-in
Treatment
coding object:
import numpy as np
class MyTreat(object):
def __init__(self, reference=0):
self.reference = reference
def code_with_intercept(self, levels):
return ContrastMatrix(np.eye(len(levels)),
["[My.%s]" % (level,) for level in levels])
def code_without_intercept(self, levels):
eye = np.eye(len(levels) - 1)
contrasts = np.vstack((eye[:self.reference, :],
np.zeros((1, len(levels) - 1)),
eye[self.reference:, :]))
suffixes = ["[MyT.%s]" % (level,) for level in
levels[:self.reference] + levels[self.reference + 1:]]
return ContrastMatrix(contrasts, suffixes)
And it can now be used just like the built-in methods:
# Full rank:
In [13]: dmatrix("0 + C(a, MyTreat)", data)
NameErrorTraceback (most recent call last)
<ipython-input-13-fc2731b99fa5> in <module>()
----> 1 dmatrix("0 + C(a, MyTreat)", data)
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
441 for data in data_iter_maker():
442 for factor in list(examine_needed):
--> 443 value = factor.eval(factor_states[factor], data)
444 if factor in cat_sniffers or guess_categorical(value):
445 if factor not in cat_sniffers:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, memorize_state, data)
564 return self._eval(memorize_state["eval_code"],
565 memorize_state,
--> 566 data)
567
568 __getstate__ = no_pickling
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in _eval(self, code, memorize_state, data)
549 memorize_state["eval_env"].eval,
550 code,
--> 551 inner_namespace=inner_namespace)
552
553 def memorize_chunk(self, state, which_pass, data):
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs):
116 try:
--> 117 return f(*args, **kwargs)
118 except Exception as e:
119 if sys.version_info[0] >= 3:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, expr, source_name, inner_namespace)
164 code = compile(expr, source_name, "eval", self.flags, False)
165 return eval(code, {}, VarLookupDict([inner_namespace]
--> 166 + self._namespaces))
167
168 @classmethod
<string> in <module>()
NameError: name 'MyTreat' is not defined
# Reduced rank:
In [14]: dmatrix("C(a, MyTreat)", data)
NameErrorTraceback (most recent call last)
<ipython-input-14-09011f1be5a2> in <module>()
----> 1 dmatrix("C(a, MyTreat)", data)
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
441 for data in data_iter_maker():
442 for factor in list(examine_needed):
--> 443 value = factor.eval(factor_states[factor], data)
444 if factor in cat_sniffers or guess_categorical(value):
445 if factor not in cat_sniffers:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, memorize_state, data)
564 return self._eval(memorize_state["eval_code"],
565 memorize_state,
--> 566 data)
567
568 __getstate__ = no_pickling
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in _eval(self, code, memorize_state, data)
549 memorize_state["eval_env"].eval,
550 code,
--> 551 inner_namespace=inner_namespace)
552
553 def memorize_chunk(self, state, which_pass, data):
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs):
116 try:
--> 117 return f(*args, **kwargs)
118 except Exception as e:
119 if sys.version_info[0] >= 3:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, expr, source_name, inner_namespace)
164 code = compile(expr, source_name, "eval", self.flags, False)
165 return eval(code, {}, VarLookupDict([inner_namespace]
--> 166 + self._namespaces))
167
168 @classmethod
<string> in <module>()
NameError: name 'MyTreat' is not defined
# With argument:
In [15]: dmatrix("C(a, MyTreat(2))", data)
NameErrorTraceback (most recent call last)
<ipython-input-15-324e4d268f2e> in <module>()
----> 1 dmatrix("C(a, MyTreat(2))", data)
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in dmatrix(formula_like, data, eval_env, NA_action, return_type)
289 eval_env = EvalEnvironment.capture(eval_env, reference=1)
290 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 291 NA_action, return_type)
292 if lhs.shape[1] != 0:
293 raise PatsyError("encountered outcome variables for a model "
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/build.py in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
441 for data in data_iter_maker():
442 for factor in list(examine_needed):
--> 443 value = factor.eval(factor_states[factor], data)
444 if factor in cat_sniffers or guess_categorical(value):
445 if factor not in cat_sniffers:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, memorize_state, data)
564 return self._eval(memorize_state["eval_code"],
565 memorize_state,
--> 566 data)
567
568 __getstate__ = no_pickling
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in _eval(self, code, memorize_state, data)
549 memorize_state["eval_env"].eval,
550 code,
--> 551 inner_namespace=inner_namespace)
552
553 def memorize_chunk(self, state, which_pass, data):
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/compat.py in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs):
116 try:
--> 117 return f(*args, **kwargs)
118 except Exception as e:
119 if sys.version_info[0] >= 3:
/builddir/build/BUILD/patsy-0.4.1/doc/../patsy/eval.py in eval(self, expr, source_name, inner_namespace)
164 code = compile(expr, source_name, "eval", self.flags, False)
165 return eval(code, {}, VarLookupDict([inner_namespace]
--> 166 + self._namespaces))
167
168 @classmethod
<string> in <module>()
NameError: name 'MyTreat' is not defined