Token

An individual token — i.e. a word, punctuation symbol, whitespace, etc.

Implements

Fields

has_vector (Boolean)

A boolean value indicating whether a word vector is associated with the object.

text (String)

Verbatim text content.

text_with_ws (String)

Text content, with trailing space character if present.

vector ([Float])

A real-valued meaning representation.

vector_norm (Float)

The L2 norm of the token’s vector representation.
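As a rough illustration of how vector_norm relates to vector, the L2 norm is the square root of the sum of the squared components. The sketch below is plain Python; the helper name `l2_norm` and the sample values are made up for this example and are not part of the API.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vector))


# Hypothetical two-dimensional vector, chosen so the result is easy to check.
example_vector = [3.0, 4.0]
print(l2_norm(example_vector))  # 5.0
```

A token with no vector would presumably yield an all-zero or empty vector, for which this helper returns 0.0.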

ancestors ([Token])

A sequence of this token’s syntactic ancestors (parent, grandparent, etc.).

children ([Token])

A sequence of the token’s immediate syntactic children.

cluster (Int)

Brown cluster ID.

conjuncts ([Token])

A tuple of coordinated tokens, not including the token itself.

dep (String)

Syntactic dependency relation.

end (Int)

The ending character offset of the token within the parent document.

ent_iob (String)

IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
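The numeric codes above can be decoded into the conventional IOB letters with a small lookup. This is only a sketch: the names `IOB_LABELS` and `iob_label` are hypothetical, and because the field is typed as String, the helper assumes the code may arrive either as an integer or as a numeric string.

```python
# Mapping taken from the description above: 3 = begin, 2 = outside,
# 1 = inside, 0 = no entity tag set.
IOB_LABELS = {3: "B", 2: "O", 1: "I", 0: ""}


def iob_label(code):
    """Translate an ent_iob code (int or numeric string) to B, I, O or ''."""
    return IOB_LABELS[int(code)]


print(iob_label(3))    # B
print(iob_label("1"))  # I
```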

ent_type (String)

Named entity type.

extension (TokenExtension)

head (Token)

The syntactic parent, or "governor", of this token.

id (Int)

The index of the token within the parent document.

is_alpha (Boolean)

Does the token consist of alphabetic characters?

is_ascii (Boolean)

Does the token consist of ASCII characters?

is_bracket (Boolean)

Is the token a bracket?

is_currency (Boolean)

Is the token a currency symbol?

is_digit (Boolean)

Does the token consist of digits?

is_left_punct (Boolean)

Is the token a left punctuation mark, e.g. "("?

is_lower (Boolean)

Is the token in lowercase?

is_oov (Boolean)

Is the token out-of-vocabulary (i.e. does it not have a word vector)?

is_punct (Boolean)

Is the token punctuation?

is_quote (Boolean)

Is the token a quotation mark?

is_right_punct (Boolean)

Is the token a right punctuation mark, e.g. ")"?

is_sent_start (Boolean)

A boolean value indicating whether the token starts a sentence.

is_space (Boolean)

Does the token consist of whitespace characters?

is_stop (Boolean)

Is the token part of a “stop list”?

is_title (Boolean)

Is the token in titlecase?

is_upper (Boolean)

Is the token in uppercase?

lang (String)

Language of the parent document’s vocabulary.

left_edge (Token)

The leftmost token of this token’s syntactic descendants.

lefts ([Token])

The leftward immediate children of the word in the syntactic dependency parse.

lemma (String)

Base form of the token, with no inflectional suffixes.

like_email (Boolean)

Does the token resemble an email address?

like_num (Boolean)

Does the token represent a number, e.g. "10.9", "10", or "ten"?

like_url (Boolean)

Does the token resemble a URL?

lower (String)

Lowercase form of the token.

norm (String)

The token’s norm, i.e. a normalized form of the token text.

orth (String)

Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes.

pos (String)

Coarse-grained part-of-speech.

prefix (String)

Hash value of a length-N substring from the start of the token.

prob (Float)

Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary).

right_edge (Token)

The rightmost token of this token’s syntactic descendants.

rights ([Token])

The rightward immediate children of the word in the syntactic dependency parse.

shape (String)

Transform of the token’s string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd".
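The shape transform described above can be sketched in a few lines of Python. The function name `word_shape` is hypothetical, and the choice of X for uppercase letters and x for everything else alphabetic is an assumption based on the "Xxxx" example; the actual implementation behind this field may differ in detail.

```python
def word_shape(text):
    """Map a string to its orthographic shape, e.g. 'Apple' -> 'Xxxxx'."""
    shape = []
    last = ""  # most recent shape character
    run = 0    # length of the current run of identical shape characters
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other symbols are kept as-is
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        # Keep at most four consecutive copies of the same shape character.
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


print(word_shape("Apple"))  # Xxxxx
print(word_shape("12"))     # dd
print(word_shape("12345"))  # dddd
```

Because runs are truncated, long words that differ only in length collapse to the same shape, which is what makes the attribute useful as a coarse feature.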

start (Int)

The starting character offset of the token within the parent document.

subtree ([Token])

A sequence containing the token and all the token’s syntactic descendants.
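To make the relationships between head, children, ancestors, subtree, left_edge and right_edge concrete, here is a minimal sketch over a toy parse. It represents the tree as a list of head indices and assumes the root token is its own head; that convention, the function names, and the example sentence are all assumptions for illustration rather than part of this API.

```python
def children(heads, i):
    """Indices of the immediate syntactic children of token i."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


def ancestors(heads, i):
    """Indices of the parent, grandparent, etc. of token i, nearest first."""
    result = []
    while heads[i] != i:  # assumption: the root points at itself
        i = heads[i]
        result.append(i)
    return result


def subtree(heads, i):
    """Indices of token i and all of its descendants, in document order."""
    result = [i]
    for child in children(heads, i):
        result.extend(subtree(heads, child))
    return sorted(result)


def left_edge(heads, i):
    return subtree(heads, i)[0]


def right_edge(heads, i):
    return subtree(heads, i)[-1]


# Toy parse of "The quick fox jumps": "The" and "quick" attach to "fox",
# "fox" attaches to "jumps", and "jumps" is the root.
words = ["The", "quick", "fox", "jumps"]
heads = [2, 2, 3, 3]

print(children(heads, 2))   # [0, 1]
print(ancestors(heads, 0))  # [2, 3]
print(subtree(heads, 2))    # [0, 1, 2]
print(words[left_edge(heads, 2)], words[right_edge(heads, 2)])  # The fox
```

In the same representation, lefts and rights would be the children whose index is smaller or larger than the token's own index.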

suffix (String)

Hash value of a length-N substring from the end of the token.

tag (String)

Fine-grained part-of-speech.

whitespace (String)

Trailing space character if present.
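Going by the descriptions of text, whitespace and text_with_ws, the last of these should equal the first two concatenated, and joining text_with_ws across all tokens should give back the original document text. The sketch below illustrates that with hand-written token records; the dictionaries and helper names are hypothetical stand-ins for whatever the API returns.

```python
# Hypothetical token records for the text "Hello, world!".
tokens = [
    {"text": "Hello", "whitespace": ""},
    {"text": ",", "whitespace": " "},
    {"text": "world", "whitespace": ""},
    {"text": "!", "whitespace": ""},
]


def text_with_ws(token):
    """Verbatim text followed by the trailing space, if there is one."""
    return token["text"] + token["whitespace"]


def rebuild_text(tokens):
    """Concatenate text_with_ws for every token, in order."""
    return "".join(text_with_ws(t) for t in tokens)


print(rebuild_text(tokens))  # Hello, world!
```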