Load a CoNLL file, extract features from the tokens, and vectorize them.
The CoNLL file format is a line-oriented text format that describes sequences in a space-separated format, with blank lines separating the sequences. Typically, the last space-separated part on each line is a label.
Since the space-separated parts are usually tokens (and maybe things like part-of-speech tags) rather than feature vectors, a function must be supplied that does the actual feature extraction. This function has access to the entire sequence, so that it can extract context features; a sketch of such a function follows the parameter table below.
A sklearn.feature_extraction.FeatureHasher (the “hashing trick”) is used to map symbolic input feature names to columns, so the loader does not remember the actual input feature names.
| Parameters: | f : {string, file-like}<br>features : callable<br>n_features : integer, optional<br>split : boolean, default=False |
|---|---|
| Returns: | X : scipy.sparse matrix, shape (n_samples, n_features)<br>y : np.ndarray, dtype np.string, shape (n_samples,)<br>lengths : np.ndarray, dtype np.int32, shape (n_sequences,) |
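A minimal sketch of such a feature extraction callable. It assumes the convention suggested above: the callable receives the whole token sequence plus the index of the current token, and yields symbolic feature names as strings.

```python
def features(sequence, i):
    """Yield string feature names for the token at position i.

    The whole sequence is available, so context features such as
    the previous and next token can be extracted as well.
    """
    token = sequence[i]
    yield "word=" + token.lower()
    yield "suffix3=" + token[-3:]
    if i == 0:
        yield "BOS"                              # beginning of sequence
    else:
        yield "prev=" + sequence[i - 1].lower()
    if i == len(sequence) - 1:
        yield "EOS"                              # end of sequence
    else:
        yield "next=" + sequence[i + 1].lower()
```

Because the feature hasher maps these strings to fixed column indices, the same callable can be reused for separate training and test files and the resulting matrices will have compatible columns.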
Sequence-aware (repeated) k-fold cross-validation splitter.
Uses a greedy heuristic to partition the input sequences into folds with roughly equal numbers of samples, while keeping each sequence intact; the sketch after the parameter table below illustrates the idea.
| Parameters: | lengths : array-like of integers, shape (n_sequences,)<br>n_folds : int, optional<br>n_iter : int, optional<br>shuffle : boolean, optional<br>random_state : {np.random.RandomState, integer}, optional<br>yield_lengths : boolean, optional |
|---|---|
| Returns: | folds : iterable |
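The following self-contained sketch shows the kind of greedy heuristic described above (assign each sequence, longest first, to the currently smallest fold); it illustrates the idea and is not this splitter's actual implementation.

```python
import numpy as np

def greedy_partition(lengths, n_folds):
    """Assign each sequence (longest first) to the smallest fold so far,
    so folds get roughly equal sample counts and sequences stay intact."""
    lengths = np.asarray(lengths)
    folds = [[] for _ in range(n_folds)]
    sizes = np.zeros(n_folds, dtype=np.intp)
    for seq in np.argsort(-lengths):      # longest sequence first
        fold = int(np.argmin(sizes))      # currently smallest fold
        folds[fold].append(int(seq))
        sizes[fold] += lengths[seq]
    return folds

print(greedy_partition([4, 2, 3, 5, 1], n_folds=2))
# [[3, 1, 4], [0, 2]] -- 8 samples vs. 7; no sequence is split across folds
```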
F-score for the BIO tagging scheme, as used by CoNLL.
This F-score variant is used for evaluating named-entity recognition and related problems, where the goal is to predict segments of interest within sequences and mark these as a “B” (begin) tag followed by zero or more “I” (inside) tags. A true positive is then defined as a BI* segment that occurs in both y_true and y_pred, with false positives and false negatives defined similarly; a worked example follows the parameter table below.
Support for tag schemes with classes (e.g. “B-NP”) is limited: reported scores may be too high for inconsistent labelings.
| Parameters: | y_true : array-like of strings, shape (n_samples,)<br>y_pred : array-like of strings, shape (n_samples,) |
|---|---|
| Returns: | f : float |
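A worked example of the segment-based definition, using class-less tags. This is a self-contained sketch of the metric for illustration, not the library's own implementation:

```python
def bio_segments(tags):
    """Extract (start, end) spans of B I* segments from BIO tags."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                     # a new segment begins
            if start is not None:
                segments.append((start, i))
            start = i
        elif tag != "I" and start is not None:
            segments.append((start, i))    # segment ended at an "O"
            start = None
    if start is not None:
        segments.append((start, len(tags)))
    return set(segments)

y_true = ["B", "I", "I", "O", "B", "I"]
y_pred = ["B", "I", "O", "O", "B", "I"]
true_seg, pred_seg = bio_segments(y_true), bio_segments(y_pred)
tp = len(true_seg & pred_seg)              # spans that match exactly: 1
precision = tp / len(pred_seg)             # 1 / 2
recall = tp / len(true_seg)                # 1 / 2
print(2 * precision * recall / (precision + recall))   # 0.5
```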
Average accuracy measured on whole sequences.
Returns the fraction of sequences in y_true that are reproduced in y_pred without a single error.
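A sketch of the computation, assuming the (y_true, y_pred, lengths) convention used elsewhere on this page; the function's exact signature is not shown above.

```python
import numpy as np

def whole_sequence_accuracy_sketch(y_true, y_pred, lengths):
    """Fraction of sequences predicted without a single token error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ends = np.cumsum(lengths)
    starts = ends - np.asarray(lengths)
    correct = sum(np.array_equal(y_true[s:e], y_pred[s:e])
                  for s, e in zip(starts, ends))
    return correct / len(lengths)

# Two sequences of lengths 2 and 3; only the first is error-free.
print(whole_sequence_accuracy_sketch(
    ["B", "O", "B", "I", "O"],
    ["B", "O", "B", "O", "O"],
    lengths=[2, 3]))                       # 0.5
```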
Hidden Markov models (HMMs) with supervised training.
First-order hidden Markov model with multinomial event model.
| Parameters: | decode : string, optional<br>alpha : float |
|---|---|
Methods
Fit the HMM to data.
| Parameters: | X : {array-like, sparse matrix}, shape (n_samples, n_features)<br>y : array-like, shape (n_samples,)<br>lengths : array-like of integers, shape (n_sequences,) |
|---|---|
| Returns: | self : MultinomialHMM |
Notes
Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities of all active features will be multiplied together. The example below shows a correctly encoded input.
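A minimal usage sketch with one-hot encoded input. The import path assumes the seqlearn library, which this page appears to document:

```python
import numpy as np
from seqlearn.hmm import MultinomialHMM   # import path is an assumption

# Six tokens over a 3-symbol vocabulary, one-hot encoded:
# exactly one feature is on per row, as the note above requires.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["A", "B", "B", "A", "C", "B"])
lengths = [3, 3]                  # two sequences of three tokens each

clf = MultinomialHMM(alpha=0.01)  # alpha: additive smoothing strength
clf.fit(X, y, lengths)
print(clf.predict(X, lengths=lengths))
```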
Structured perceptron for sequence classification.
This implements the averaged structured perceptron algorithm of Collins and Daumé, with the addition of an adaptive learning rate.
| Parameters: | decode : string, optional<br>lr_exponent : float, optional<br>max_iter : integer, optional<br>random_state : {integer, np.random.RandomState}, optional<br>trans_features : boolean, optional<br>verbose : integer, optional |
|---|---|
References
M. Collins (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.
Hal Daumé III (2006). Practical Structured Learning Techniques for Natural Language Processing. Ph.D. thesis, U. Southern California.
Methods
Fit to a set of sequences; see the usage example after the parameter table below.
| Parameters: | X : {array-like, sparse matrix}, shape (n_samples, n_features)<br>y : array-like, shape (n_samples,)<br>lengths : array-like of integers, shape (n_sequences,) |
|---|---|
| Returns: | self : StructuredPerceptron |
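An end-to-end usage sketch combining the loader described earlier with this classifier. Import paths assume the seqlearn library, which this page appears to document; the file names are hypothetical.

```python
# Import paths and file names are assumptions for illustration.
from seqlearn.datasets import load_conll
from seqlearn.perceptron import StructuredPerceptron

def features(sequence, i):
    """Minimal feature extractor: the current word only."""
    yield "word=" + sequence[i].lower()

X_train, y_train, len_train = load_conll("train.txt", features)
clf = StructuredPerceptron(max_iter=10, verbose=1)
clf.fit(X_train, y_train, len_train)

# The feature hasher gives train and test matrices compatible columns.
X_test, y_test, len_test = load_conll("test.txt", features)
y_pred = clf.predict(X_test, len_test)
```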