Load CoNLL file, extract features on the tokens and vectorize them.
The CoNLL file format is a line-oriented text format that describes sequences in a space-separated format, separating the sequences with blank lines. Typically, the last space-separated part is a label.
Since the space-separated parts are usually tokens (and maybe things like part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features.
A sklearn.feature_extraction.FeatureHasher (the “hashing trick”) is used to map symbolic input feature names to columns, so this function does not remember the actual input feature names.
Parameters:  f : {string, file-like}
features : callable
n_features : integer, optional
split : boolean, default=False


Returns:  X : scipy.sparse matrix, shape (n_samples, n_features)
y : np.ndarray, dtype np.string, shape n_samples
lengths : np.ndarray, dtype np.int32, shape n_sequences
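
As a rough illustration of the file format and of the kind of callable that could be passed as features, the sketch below parses CoNLL-style text with plain Python. The helper names (read_conll_sequences, token_features) and the (sequence, index) signature of the feature function are assumptions made for this example, not the library's code, and the hashing step is left out.

```python
def read_conll_sequences(lines):
    """Group CoNLL-style lines into sequences of (fields, label) pairs.

    Hypothetical helper: blank lines separate sequences, and the last
    whitespace-separated field on each line is taken as the label.
    """
    sequences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:            # a blank line closes the current sequence
            if current:
                sequences.append(current)
                current = []
            continue
        current.append((parts[:-1], parts[-1]))
    if current:                  # the input may not end with a blank line
        sequences.append(current)
    return sequences


def token_features(sequence, i):
    """Yield symbolic feature names for token i, looking at its left neighbour.

    Assumed signature: the whole sequence plus an index, so that context
    features can be extracted.
    """
    token = sequence[i][0]
    yield "word:" + token.lower()
    if token[0].isupper():
        yield "capitalized"
    if i > 0:
        yield "prev:" + sequence[i - 1][0].lower()
    else:
        yield "first-token"


text = """Alice NNP B
lives VBZ O
here RB O

Bob NNP B
"""
sequences = read_conll_sequences(text.splitlines())
lengths = [len(seq) for seq in sequences]          # one entry per sequence
tokens = [fields for fields, _ in sequences[0]]    # drop the labels
feats = list(token_features(tokens, 1))            # features of "lives"
```

Here lengths plays the same role as the returned lengths array: it records how many consecutive samples belong to each sequence.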

Sequence-aware (repeated) k-fold CV splitter.
Uses a greedy heuristic to partition input sequences into sets with roughly equal numbers of samples, while keeping the sequences intact.
Parameters:  lengths : array-like of integers, shape (n_samples,)
n_folds : int, optional
n_iter : int, optional
shuffle : boolean, optional
random_state : {np.random.RandomState, integer}, optional
yield_lengths : boolean, optional


Returns:  folds : iterable
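
One way such a greedy partition could work is sketched below. This is an illustrative heuristic (longest sequence first, always into the currently smallest fold); the function name greedy_partition is hypothetical and the library's actual heuristic may differ.

```python
def greedy_partition(lengths, n_folds):
    """Assign whole sequences to folds so that sample counts stay balanced.

    Illustrative only: sequences are visited from longest to shortest and
    each one goes to the fold that currently holds the fewest samples.
    """
    folds = [[] for _ in range(n_folds)]
    sizes = [0] * n_folds
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for seq in order:
        smallest = sizes.index(min(sizes))   # fold with the fewest samples so far
        folds[smallest].append(seq)          # the sequence is never split
        sizes[smallest] += lengths[seq]
    return folds, sizes


lengths = [5, 3, 3, 2, 2, 1]
folds, sizes = greedy_partition(lengths, n_folds=2)
```

Each fold holds sequence indices rather than sample indices, so a sequence always ends up entirely in one fold.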

F-score for BIO-tagging scheme, as used by CoNLL.
This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment in both y_true and y_pred, with false positives and false negatives defined similarly.
Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.
Parameters:  y_true : array-like of strings, shape (n_samples,)
y_pred : array-like of strings, shape (n_samples,)


Returns:  f : float
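
The segment-level definition above can be sketched in plain Python as follows. This is a minimal illustration for untyped “B”/“I”/“O” tags on a single flat tag list; the function names are hypothetical, and the treatment of stray “I” tags and of sequence boundaries is an assumption, not necessarily what the library does.

```python
def bio_segments(tags):
    """Return the set of (start, end) spans of B I* segments in a tag list."""
    segments, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "I" and start is not None:
            continue                      # the open segment continues
        if start is not None:
            segments.add((start, i))      # "B" or "O" closes the open segment
            start = None
        if tag == "B":
            start = i                     # a new segment begins here
    if start is not None:
        segments.add((start, len(tags)))  # segment running to the end
    return segments


def bio_f_score(y_true, y_pred):
    """F1 over segments: a true positive is a span present in both inputs."""
    true_segs = bio_segments(y_true)
    pred_segs = bio_segments(y_pred)
    tp = len(true_segs & pred_segs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_segs)
    recall = tp / len(true_segs)
    return 2 * precision * recall / (precision + recall)


y_true = ["B", "I", "O", "B", "O", "B", "I"]
y_pred = ["B", "I", "O", "B", "I", "O", "B"]
score = bio_f_score(y_true, y_pred)
```

Only the first segment matches exactly in this example, so one of three predicted segments and one of three true segments count as correct.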

Average accuracy measured on whole sequences.
Returns the fraction of sequences in y_true that occur in y_pred without a single error.
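
A minimal sketch of this metric is shown below. It assumes that sequence boundaries are given by a lengths list, as elsewhere in this reference; the function name and argument order are hypothetical.

```python
def whole_sequence_accuracy(y_true, y_pred, lengths):
    """Fraction of sequences whose predicted labels match without any error."""
    correct, start = 0, 0
    for n in lengths:
        end = start + n
        # Compare the label slice belonging to this sequence only.
        if list(y_true[start:end]) == list(y_pred[start:end]):
            correct += 1
        start = end
    return correct / len(lengths)


y_true = ["B", "I", "O", "B", "O"]
y_pred = ["B", "I", "O", "B", "I"]
acc = whole_sequence_accuracy(y_true, y_pred, lengths=[3, 2])
```

The first sequence (three samples) is predicted exactly and the second contains one error, so half of the sequences count as correct.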
Hidden Markov models (HMMs) with supervised training.
First-order hidden Markov model with multinomial event model.
Parameters:  decode : string, optional
alpha : float


Methods
Fit HMM model to data.
Parameters:  X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
lengths : array-like of integers, shape (n_sequences,)


Returns:  self : MultinomialHMM 
Notes
Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied.
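
The effect described in this note can be illustrated with a small calculation. The sketch assumes that the emission score of a sample is a dot product between its 0/1 feature row and a vector of log-probabilities; the probabilities and feature names are made up for the example.

```python
import math

# Hypothetical emission probabilities P(feature | state) for a single state.
emission = {"word:the": 0.5, "word:cat": 0.25, "capitalized": 0.1}
log_emission = {name: math.log(p) for name, p in emission.items()}
names = sorted(emission)   # column order: capitalized, word:cat, word:the


def row_log_score(row):
    """Dot product of a 0/1 feature row with the log-probability vector."""
    return sum(x * log_emission[name] for x, name in zip(row, names))


one_hot = [0, 0, 1]        # only "word:the" is on
two_hot = [1, 0, 1]        # "capitalized" and "word:the" are both on

p_one = math.exp(row_log_score(one_hot))   # a single emission probability
p_two = math.exp(row_log_score(two_hot))   # product of two probabilities
```

With a one-hot row the score is one emission probability; with two active features the log-probabilities add, which is the same as multiplying the probabilities.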
Structured perceptron for sequence classification.
This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.
Parameters:  decode : string, optional
lr_exponent : float, optional
max_iter : integer, optional
random_state : {integer, np.random.RandomState}, optional
trans_features : boolean, optional
verbose : integer, optional


References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.
Methods
Fit to a set of sequences.
Parameters:  X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
lengths : array-like of integers, shape (n_sequences,)


Returns:  self : StructuredPerceptron 