Load a CoNLL file, extract features from the tokens, and vectorize them.
The CoNLL file format is a line-oriented text format that describes sequences: each line holds the whitespace-separated parts of one token, sequences are separated by blank lines, and the last part on a line is typically a label.
Since the parts are usually tokens (and perhaps annotations such as part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features.
A sklearn.feature_extraction.FeatureHasher (the “hashing trick”) is used to map symbolic input feature names to columns, so this function does not remember the actual input feature names.
Parameters:
    f : {string, file-like}
    features : callable
    n_features : integer, optional
    split : boolean, default=False

Returns:
    X : scipy.sparse matrix, shape (n_samples, n_features)
    y : np.ndarray of strings, shape (n_samples,)
    lengths : np.ndarray, dtype np.int32, shape (n_sequences,)
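Example (a minimal sketch, not part of the API reference: the file name train.txt, the feature names, and the assumption that each line holds a token followed by a label are all illustrative):

    from seqlearn.datasets import load_conll

    def features(sequence, i):
        # With split=False, sequence[i] is the line minus its label,
        # so the first whitespace-separated part is the token itself.
        token = sequence[i].split()[0]
        yield "word:" + token.lower()
        # Context features: the hasher only sees strings, so prefixes
        # keep neighboring-token features distinct from the current token's.
        if i > 0:
            yield "word-1:" + sequence[i - 1].split()[0].lower()
        if i + 1 < len(sequence):
            yield "word+1:" + sequence[i + 1].split()[0].lower()

    X, y, lengths = load_conll("train.txt", features)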
Sequence-aware (repeated) k-fold CV splitter.
Uses a greedy heuristic to partition input sequences into sets with roughly equal numbers of samples, while keeping the sequences intact.
Parameters:
    lengths : array-like of integers, shape (n_samples,)
    n_folds : int, optional
    n_iter : int, optional
    shuffle : boolean, optional
    random_state : {np.random.RandomState, integer}, optional
    yield_lengths : boolean, optional

Returns:
    folds : iterable
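Example (a sketch reusing X, y and lengths from the load_conll example above; it assumes the default yield_lengths=True, under which each fold is a (train_indices, train_lengths, test_indices, test_lengths) tuple):

    import numpy as np
    from seqlearn.evaluation import SequenceKFold
    from seqlearn.perceptron import StructuredPerceptron

    for train, train_lengths, test, test_lengths in SequenceKFold(lengths, n_folds=5):
        clf = StructuredPerceptron()
        clf.fit(X[train], y[train], train_lengths)
        y_pred = clf.predict(X[test], test_lengths)
        print(np.mean(y_pred == y[test]))  # per-token accuracy on this fold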
F-score for BIO-tagging scheme, as used by CoNLL.
This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment in both y_true and y_pred, with false positives and false negatives defined similarly.
Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.
Parameters:
    y_true : array-like of strings, shape (n_samples,)
    y_pred : array-like of strings, shape (n_samples,)

Returns:
    f : float
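Example (labels chosen for illustration; y_true contains two segments, of which y_pred recovers only the first exactly, so precision and recall are both 1/2):

    from seqlearn.evaluation import bio_f_score

    y_true = ["B", "I", "O", "B", "I"]
    y_pred = ["B", "I", "O", "B", "O"]  # second segment truncated, so no exact match

    print(bio_f_score(y_true, y_pred))  # 0.5: one true positive, one FP, one FN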
Average accuracy measured on whole sequences.
Returns the fraction of sequences in y_true that occur in y_pred without a single error.
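Example (a sketch that assumes the function, like the other sequence-aware routines here, takes a lengths argument delimiting the sequences):

    from seqlearn.evaluation import whole_sequence_accuracy

    y_true = ["B", "I", "O", "B", "I"]
    y_pred = ["B", "I", "O", "B", "O"]
    lengths = [3, 2]  # two sequences: tokens 0-2 and 3-4

    # Only the first sequence is reproduced without a single error.
    print(whole_sequence_accuracy(y_true, y_pred, lengths))  # expect 0.5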
Hidden Markov models (HMMs) with supervised training.
First-order hidden Markov model with multinomial event model.
Parameters:
    decode : string, optional
    alpha : float
Methods
Fit HMM model to data.
Parameters:
    X : {array-like, sparse matrix}, shape (n_samples, n_features)
    y : array-like, shape (n_samples,)
    lengths : array-like of integers, shape (n_sequences,)

Returns:
    self : MultinomialHMM
Notes
Make sure the training set (X) is one-hot encoded; if more than one feature in a sample is on, the corresponding emission probabilities will be multiplied together.
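Example (a sketch of the required encoding, using scikit-learn's OneHotEncoder on hypothetical token ids so that exactly one feature is on per sample):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from seqlearn.hmm import MultinomialHMM

    # Hypothetical token ids for two sequences of lengths 3 and 2.
    token_ids = np.array([[0], [1], [2], [0], [2]])
    X = OneHotEncoder().fit_transform(token_ids)  # one active feature per row
    y = np.array(["B", "I", "O", "B", "O"])
    lengths = [3, 2]

    clf = MultinomialHMM()  # decode and alpha are left at their defaults
    clf.fit(X, y, lengths)
    y_pred = clf.predict(X, lengths)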
Structured perceptron for sequence classification.
This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.
Parameters:
    decode : string, optional
    lr_exponent : float, optional
    max_iter : integer, optional
    random_state : {integer, np.random.RandomState}, optional
    trans_features : boolean, optional
    verbose : integer, optional
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.
Methods
Fit to a set of sequences.
Parameters:
    X : {array-like, sparse matrix}, shape (n_samples, n_features)
    y : array-like, shape (n_samples,)
    lengths : array-like of integers, shape (n_sequences,)

Returns:
    self : StructuredPerceptron
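Example (an end-to-end sketch reusing X, y and lengths from the load_conll example; the parameter values are illustrative, not defaults):

    from seqlearn.perceptron import StructuredPerceptron

    clf = StructuredPerceptron(max_iter=10, verbose=1, random_state=42)
    clf.fit(X, y, lengths)

    # Decoding needs the sequence boundaries at prediction time as well.
    y_pred = clf.predict(X, lengths)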