crispio package

Submodules

crispio.annotate module

Tools for annotating guide RNAs from GFF data

crispio.annotate.annotate_from_gff(sgRNA: Mapping[str, str | int], gff_data: GffFile, seqid: str, tags: Iterable[str] | None = None) → Dict[str, int | str][source]

Annotate dictionary of guide information with GFF annotations.

Dictionary must have at least the keys ‘pam_start’ and ‘pam_end’, mapping to numerical values.

Parameters:

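sgRNA (dict) – Dictionary containing ‘pam_start’ and ‘pam_end’, and optionally other information about a guide.
gff_data (bioino.GffFile) – GffFile object which was loaded with lookup=True.
seqid (str) – Sequence ID (chromosome or contig) that the guide coordinates refer to. If it is not present in gff_data, the input is returned without annotations.
tags (list of str, optional) – Which GFF tags to extract from attributes of GFF features.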

Returns:

Guide RNA dictionary updated with GFF annotations.

Return type:

dict

Examples

Set up a minimal single-gene GFF and build its lookup table:

>>> from io import StringIO
>>> from bioino import GffFile
>>> gff_line = '\t'.join([
...     'chr1', 'RefSeq', 'gene', '100', '500', '.', '+', '.',
...     'ID=g1;Name=geneA;locus_tag=b0001',
... ])
>>> gff = GffFile.from_file(StringIO(gff_line), lookup=True)

Guide PAM midpoint at position 251, inside the gene body (offset from gene start = 251 - 100 = 151):

>>> from crispio.annotate import annotate_from_gff
>>> result = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chr1')
>>> result['ann_Name']
'geneA'
>>> result['ann_locus_tag']
'b0001'
>>> result['pam_offset']
151
>>> result['ann_strand']
'+'
>>> result['ann_start'], result['ann_end']
(100, 500)

Intergenic guide (PAM midpoint 61, upstream of gene start): bioino 0.0.3 automatically assigns the _up- prefix and computes the distance to the nearest feature, so no manual prefix logic is needed in crispio:

>>> result2 = annotate_from_gff({'pam_start': 60, 'pam_end': 63}, gff, seqid='chr1')
>>> result2['ann_locus_tag']
'_up-geneA'
>>> result2['pam_offset']
39

Unknown seqid (e.g. a plasmid not in the GFF) returns the input dict unchanged rather than raising:

>>> result3 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chrX')
>>> sorted(result3.keys())
['pam_end', 'pam_start']

Custom tag set — only extract Name:

>>> result4 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff,
...                              seqid='chr1', tags=['Name'])
>>> 'ann_Name' in result4
True
>>> 'ann_locus_tag' in result4
False
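
The offset arithmetic in the examples above can be sketched in plain Python. This is an illustration only, not the crispio or bioino implementation: the `locate` helper is hypothetical, it handles a single gene, and it assumes the PAM midpoint is the floor of the mean of pam_start and pam_end, which agrees with the midpoints quoted above (251 and 61).

```python
def locate(pam_start, pam_end, gene_start, gene_end, label):
    """Return (label, offset) for a PAM relative to one gene (hypothetical helper)."""
    midpoint = (pam_start + pam_end) // 2  # assumed: floor of the mean
    if gene_start <= midpoint <= gene_end:
        # Inside the gene body: the offset counts from the gene start.
        return label, midpoint - gene_start
    if midpoint < gene_start:
        # Upstream of the gene: prefix the label and report the distance.
        return "_up-" + label, gene_start - midpoint
    # Downstream labelling is not described in the text above, so no label here.
    return None, midpoint - gene_end


inside = locate(250, 253, 100, 500, "geneA")
upstream = locate(60, 63, 100, 500, "geneA")
print(inside)    # ('geneA', 151)
print(upstream)  # ('_up-geneA', 39)
```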

crispio.cli module

Command-line interface for crispio.

crispio.cli.main()[source]

crispio.crosstalk module

Tools for detecting guide crosstalk.

crispio.features module

Featurize a guide RNA represented by a bioino.GffLine.

Depending on the feature to be calculated, the GFF should have attributes ‘pam_sequence’, ‘guide_sequence’, ‘guide_context_up’, ‘guide_context_down’, and ‘ann_strand’.

Parameters:

gff (bioino.GffLine) – Input guide RNA with additional attributes.
features (str or list of str, optional) – The names of the features to be calculated. Default: calculate all.
scaffold (str, optional) – Guide scaffold. Required for some features. If features is the default, scaffold must be provided.

Returns:

If features is a string, then returns the value of the feature. If it is a list, then returns a dictionary mapping feature names to values.

Return type:

dict, float, or str

Raises:

KeyError – If any features are not supported.
ValueError – If features is neither a string nor iterable.
AttributeError – If features is default but scaffold is not provided.

Examples

Build a representative GffLine (attributes required by the feature set):

>>> from bioino import GffLine
>>> gff_line = GffLine(
...     ['chr1', 'crispio', 'protospacer', 1, 20, '.', '+', '.'],
...     {
...         'guide_sequence':   'ATCGATCGATCGATCGATCG',
...         'pam_sequence':     'CGG',
...         'pam_search':       'NGG',
...         'guide_context_up': 'AAAAAAAAAAAAAAAAAACC',
...         'guide_context_down':'TTTTTTTTTTTTTTTTTTGG',
...         'ann_strand':       '-',
...     },
... )

Single feature by name — returns the raw value (not wrapped in a dict):

>>> from crispio.features import featurize
>>> featurize(gff_line, 'guide_gc')
'0.500'
>>> featurize(gff_line, 'guide_purine')
'0.500'
>>> featurize(gff_line, 'seed_seq')
'GATCG'
>>> featurize(gff_line, 'guide_start3')
'ATC'
>>> featurize(gff_line, 'guide_end3')
'TCG'
>>> featurize(gff_line, 'pam_gc')
'1.000'
>>> featurize(gff_line, 'pam_n')
'C'
>>> featurize(gff_line, 'pam_def')
'GG'
>>> featurize(gff_line, 'context_up2')
'CC'
>>> featurize(gff_line, 'context_down2')
'TT'
>>> featurize(gff_line, 'on_nontemplate_strand')
True
>>> featurize(gff_line, 'guide_autocorr')
'8.223'
>>> featurize(gff_line, 'pam_autocorr')
'1.500'

List of features — returns a feat_-prefixed dict:

>>> featurize(gff_line, features=['guide_gc', 'seed_seq', 'guide_start3'])
{'feat_guide_gc': '0.500', 'feat_seed_seq': 'GATCG', 'feat_guide_start3': 'ATC'}

Calculating all features (the default) requires a scaffold sequence, not a scaffold name. Retrieve it from crispio.utils.sequences:

>>> from crispio.utils import sequences
>>> scaffold_seq = sequences.scaffolds['Sth1']
>>> result = featurize(gff_line, scaffold=scaffold_seq)
>>> sorted(result.keys())   
['feat_context_down2', 'feat_context_up2', 'feat_context_up_autocorr',
 'feat_guide_autocorr', 'feat_guide_end3', 'feat_guide_gc', 'feat_guide_purine',
 'feat_guide_scaff_corr', 'feat_guide_start3', 'feat_on_nontemplate_strand',
 'feat_pam_autocorr', 'feat_pam_def', 'feat_pam_gc', 'feat_pam_n',
 'feat_pam_scaff_corr', 'feat_seed_seq']
>>> result['feat_guide_scaff_corr']
'9.770'
>>> result['feat_pam_scaff_corr']
'2.667'

Calling without a scaffold when computing all features raises AttributeError, not TypeError:

>>> featurize(gff_line)
Traceback (most recent call last):
    ...
AttributeError: Scaffold must be provided to calculate all features.

Unknown feature name raises KeyError:

>>> featurize(gff_line, 'not_a_feature')
Traceback (most recent call last):
    ...
KeyError: 'not_a_feature'
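
Several of the simpler features shown above are plain string operations, and can be reproduced without crispio as a sanity check. The sketch below is an independent reimplementation inferred from the example outputs, not the library's code; `fraction` and `simple_features` are hypothetical names, and the PAM split assumes a three-letter NGG-style PAM.

```python
def fraction(seq, letters):
    """Fraction of bases in seq found in letters, as a 3-decimal string."""
    return f"{sum(base in letters for base in seq) / len(seq):.3f}"


def simple_features(guide, pam, context_up, context_down):
    """Recompute a few sequence-composition features from raw strings."""
    return {
        "guide_gc": fraction(guide, "GC"),
        "guide_purine": fraction(guide, "AG"),
        "seed_seq": guide[-5:],             # last 5 nt of the spacer
        "guide_start3": guide[:3],
        "guide_end3": guide[-3:],
        "pam_gc": fraction(pam, "GC"),
        "pam_n": pam[0],                    # base matched by N (assumes NGG)
        "pam_def": pam[1:],                 # the defined part of the PAM
        "context_up2": context_up[-2:],     # 2 nt next to the guide, upstream
        "context_down2": context_down[:2],  # first 2 nt of the downstream context
    }


feats = simple_features(
    guide="ATCGATCGATCGATCGATCG",
    pam="CGG",
    context_up="AAAAAAAAAAAAAAAAAACC",
    context_down="TTTTTTTTTTTTTTTTTTGG",
)
print(feats["guide_gc"], feats["seed_seq"], feats["pam_def"])  # 0.500 GATCG GG
```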

crispio.features.get_context(pam_start: int, pam_end: int, guide_start: int, guide_end: int, genome: str, reverse: bool, extra_bases: int = 20) → Tuple[str, str][source]

Get surrounding sequence.

Examples

>>> from crispio.features import get_context
>>> genome = 'AAAA' + 'CCCC' + 'GGGG' + 'TTTT' + 'ACGT' + 'TGCA'

Forward strand — guide at [4:8], PAM at [8:12], context window of 4 nt: upstream context is the 4 nt before the guide; downstream is the 4 nt after the PAM:

>>> get_context(pam_start=8, pam_end=12,
...             guide_start=4, guide_end=8,
...             genome=genome, reverse=False, extra_bases=4)
('TTTT', 'AAAA')

Reverse strand — PAM at [4:8] (on forward), guide at [8:12]; context is reverse-complemented and directions are flipped:

>>> get_context(pam_start=4, pam_end=8,
...             guide_start=8, guide_end=12,
...             genome=genome, reverse=True, extra_bases=4)
('TTTT', 'AAAA')

Context window at the right edge of the genome is truncated gracefully — Python slice semantics give an empty string rather than an error:

>>> get_context(pam_start=20, pam_end=24,
...             guide_start=16, guide_end=20,
...             genome=genome, reverse=False, extra_bases=4)
('', 'TTTT')
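
The slicing behind these examples can be sketched as follows. This is a hedged illustration rather than crispio's implementation: `context_window` is a hypothetical helper that returns a labelled dict instead of a tuple, so the ordering of the real return value is not assumed. It clamps the left edge at 0 because a negative slice start would otherwise wrap around to the end of the string.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def context_window(pam_start, pam_end, guide_start, guide_end,
                   genome, reverse, extra_bases=20):
    """Return the sequence flanking a guide and its PAM (hypothetical helper)."""
    if not reverse:
        # Forward strand: upstream precedes the guide, downstream follows the PAM.
        up = genome[max(0, guide_start - extra_bases):guide_start]
        down = genome[pam_end:pam_end + extra_bases]
    else:
        # Reverse strand: directions flip and both windows are reverse-complemented.
        up = revcomp(genome[guide_end:guide_end + extra_bases])
        down = revcomp(genome[max(0, pam_start - extra_bases):pam_start])
    return {"up": up, "down": down}


genome = "AAAA" + "CCCC" + "GGGG" + "TTTT" + "ACGT" + "TGCA"
forward = context_window(8, 12, 4, 8, genome, reverse=False, extra_bases=4)
flipped = context_window(4, 8, 8, 12, genome, reverse=True, extra_bases=4)
edge = context_window(20, 24, 16, 20, genome, reverse=False, extra_bases=4)
print(forward)  # {'up': 'AAAA', 'down': 'TTTT'}
print(edge)     # {'up': 'TTTT', 'down': ''}
```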

crispio.features.get_features() → List[str][source]

Get the list of available features.

crispio.fitness module

crispio.map module

Classes for representing guide RNA libraries.

class crispio.map.GuideLibrary(genome: str, guide_matches: Iterable[GuideMatchCollection])[source]

Bases: object

Library of guides from a genome.

genome

Genome sequence that guides are matched to.

Type: str

guide_matches

List of matches to the genome.

Type: list of GuideMatchCollection

Convert the library into an iterable of bioino.GffLine objects (via as_gff, as used in the examples below).

Parameters:

max (int, optional) – Maximum number of bioino.GffLine objects to return for each GuideMatchCollection. Default: return all.
annotations_from (bioino.GffFile, optional) – If provided, use its lookup table to annotate the returned GffLine objects.
tags (list of str, optional) – Which tags to take from annotations_from.
gff_defaults (dict) – Values for the essential GFF columns (columns 1-8) are taken from this dictionary when they are otherwise missing.

Yields:

bioino.GffLine – Corresponding to a GuideMatch.

Examples

>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> lib = GuideLibrary.from_generating(genome=genome)
>>> for gff in lib.as_gff(gff_defaults=dict(seqid="my_seq", source="here", feature="protospacer")):  
...     print(gff)
...
my_seq    here    protospacer     23      42      .       +       .       ID=sgr-06a4ba9b;Name=42-united_exodus;guide_context_down=ATATATATATATAATATATA;guide_context_up=ATATATATATATATATATAT;guide_length=20;guide_re_sites=;guide_sequence=ATACCGTTTTTTTAAAAAAA;guide_sequence_hash=a3987295;mnemonic=united_exodus;pam_end=45;pam_replichore=L;pam_search=NGG;pam_sequence=CGG;pam_start=42;source_name=42-united_exodus
my_seq    here    protospacer     29      48      .       -       .       ID=sgr-f84d1c6a;Name=25-zigzag_state;guide_context_down=TATATATATATATATATATA;guide_context_up=ATATATATATTATATATATA;guide_length=20;guide_re_sites=;guide_sequence=TATCCGTTTTTTTAAAAAAA;guide_sequence_hash=188c9ee6;mnemonic=zigzag_state;pam_end=28;pam_replichore=R;pam_search=NGG;pam_sequence=CGG;pam_start=25;source_name=25-zigzag_state

The seqid supplied in gff_defaults propagates to every output GffLine. This is the mechanism used by the multi-chromosome CLI path to tag each guide with the chromosome it was found on:

>>> genome = ('ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGG'
...           'ATATATATATATAATATATATATATAATATATATATATA')
>>> lib = GuideLibrary.from_generating(genome=genome, in_memory=True)
>>> defaults = dict(seqid='NC_000913.3', source='crispio',
...                 feature='protospacer', score='.', phase='.')
>>> seqids = {line.columns.seqid for line in lib.as_gff(gff_defaults=defaults)}
>>> seqids
{'NC_000913.3'}
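
The way defaults fill the eight mandatory GFF columns can be sketched as below. This is an assumption-laden illustration, not the library's code: `gff_columns` is a hypothetical helper, and it assumes match coordinates are 0-based and half-open while GFF coordinates are 1-based and inclusive, which is consistent with the output above (a guide reported at 23 to 42 with pam_start=42).

```python
GFF_COLUMNS = ("seqid", "source", "feature", "start", "end",
               "score", "strand", "phase")


def gff_columns(match, defaults):
    """Build the first eight GFF columns for one match (hypothetical helper)."""
    columns = {
        "seqid": ".", "source": ".", "feature": ".",
        "start": match["guide_start"] + 1,  # 0-based start -> 1-based start
        "end": match["guide_end"],          # half-open end == inclusive end
        "score": ".",
        "strand": "-" if match["reverse"] else "+",
        "phase": ".",
    }
    # Caller-supplied defaults (e.g. seqid) overwrite the "." placeholders.
    columns.update(defaults)
    return "\t".join(str(columns[name]) for name in GFF_COLUMNS)


defaults = dict(seqid="my_seq", source="here", feature="protospacer")
fwd_line = gff_columns({"guide_start": 22, "guide_end": 42, "reverse": False}, defaults)
rev_line = gff_columns({"guide_start": 28, "guide_end": 48, "reverse": True}, defaults)
print(fwd_line)
print(rev_line)
```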

classmethod from_generating(genome: str, max_length: int = 20, min_length: int | None = None, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)[source]

Find all guides matching a PAM sequence in a given genome.

The default behavior is to find matches lazily to save memory and time.

Parameters:

genome (str) – Genome sequence to search.
max_length (int, optional) – Maximum guide length. Default: 20.
min_length (int, optional) – Minimum guide length. Default: same as max_length.
pam_search (str, optional) – IUPAC PAM sequence to search for. Default: “NGG”.
in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.

Examples

>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> gl = GuideLibrary.from_generating(genome=genome)
>>> len(gl)
2
>>> for match_collection in gl:
...     for guide in match_collection:
...         print(guide)
...
ATACCGTTTTTTTAAAAAAA
TATCCGTTTTTTTAAAAAAA
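
A lazy PAM scan of this kind can be sketched with a regular expression and a generator. This is a minimal illustration under stated assumptions, not crispio's implementation: `find_guides` is a hypothetical function, it uses a fixed guide length, and it expands IUPAC codes into character classes before scanning the forward strand and then the reverse complement.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_guides(genome, pam_search="NGG", length=20):
    """Yield every spacer of the given length that sits directly 5' of a PAM."""
    # A lookahead makes finditer report overlapping PAM hits as well.
    pam_regex = re.compile("(?=(" + "".join(IUPAC[c] for c in pam_search) + "))")
    for strand in (genome, genome.translate(COMPLEMENT)[::-1]):
        for hit in pam_regex.finditer(strand):
            if hit.start() >= length:  # skip PAMs too close to the 5' end
                yield strand[hit.start() - length:hit.start()]


genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
guides = list(find_guides(genome))
print(guides)  # ['ATACCGTTTTTTTAAAAAAA', 'TATCCGTTTTTTTAAAAAAA']
```

Because `find_guides` is a generator, nothing is computed until it is iterated, which mirrors the lazy default described above; wrapping it in `list` corresponds loosely to in_memory=True.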

classmethod from_mapping(guide_seq: str | Iterable[str] | FastaSequence | Iterable[FastaSequence], genome: str, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)[source]

Map a set of expected guides to a genome.

The default behavior is to find matches lazily to save memory and time.

Parameters:

guide_seq (str or bioino.FastaSequence or list) – Guides to map.
genome (str) – Genome to map against.
pam_search (str) – IUPAC PAM sequence to search against.
in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.

Return type:

GuideLibrary

Examples

>>> genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
>>> guide_seq = ["ATGATCGATCGATCG", "ATGATCGATCGATCGCCC"]
>>> gl = GuideLibrary.from_mapping(guide_seq=guide_seq, genome=genome)
>>> for collection in gl:
...     for match in collection:
...         print(match.as_dict())
...
{'pam_search': 'NGG', 'guide_seq': 'ATGATCGATCGATCG', 'pam_seq': 'AGG', 'pam_start': 45, 'reverse': False, 'guide_context_up': 'CTTTTTTTTTTAAAAAAAAA', 'guide_context_down': 'AAAAAAAAAACCCCCCCCCC', 'pam_end': 48, 'length': 15, 'guide_start': 30, 'guide_end': 45}

genome: str

guide_matches: Iterable[GuideMatchCollection]

class crispio.map.GuideMatch(pam_search: str, guide_seq: str, pam_seq: str, pam_start: int, reverse: bool)[source]

Bases: object

Information of guide matching a genome.

pam_search

IUPAC search string for PAM.

Type: str

guide_seq

Guide spacer sequence.

Type: str

pam_seq

Actual PAM sequence.

Type: str

pam_start

Chromosome coordinate of PAM start.

Type: int

pam_end

Chromosome coordinate of PAM end.

Type: int

length

Length of guide.

Type: int

Examples

>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CGG", pam_start=10, reverse=False)
GuideMatch(pam_search='NGG', guide_seq='ATCGATCG', pam_seq='CGG', pam_start=10, reverse=False, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=2, guide_end=10)
>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CCG", pam_start=10, reverse=True)
GuideMatch(pam_search='NGG', guide_seq='CGATCGAT', pam_seq='CGG', pam_start=10, reverse=True, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=13, guide_end=21)
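
The derived fields in these reprs follow from pam_start, the sequence lengths, and the strand. The dataclass below is a sketch inferred from the two examples above, not the actual GuideMatch source; `MatchSketch` is a hypothetical name and only the coordinate and reverse-complement logic is modelled.

```python
from dataclasses import dataclass, field

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class MatchSketch:
    guide_seq: str
    pam_seq: str
    pam_start: int
    reverse: bool
    pam_end: int = field(init=False)
    length: int = field(init=False)
    guide_start: int = field(init=False)
    guide_end: int = field(init=False)

    def __post_init__(self):
        if self.reverse:
            # Reverse-strand inputs are stored as their reverse complements.
            self.guide_seq = self.guide_seq.translate(COMPLEMENT)[::-1]
            self.pam_seq = self.pam_seq.translate(COMPLEMENT)[::-1]
        self.length = len(self.guide_seq)
        self.pam_end = self.pam_start + len(self.pam_seq)
        if self.reverse:
            # On the reverse strand the guide lies after the PAM in forward coordinates.
            self.guide_start = self.pam_end
            self.guide_end = self.pam_end + self.length
        else:
            # On the forward strand the guide ends where the PAM starts.
            self.guide_start = self.pam_start - self.length
            self.guide_end = self.pam_start


fwd = MatchSketch("ATCGATCG", "CGG", 10, reverse=False)
rev = MatchSketch("ATCGATCG", "CCG", 10, reverse=True)
print(fwd.guide_start, fwd.guide_end, fwd.pam_end)  # 2 10 13
print(rev.guide_seq, rev.guide_start, rev.guide_end)  # CGATCGAT 13 21
```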

as_dict()[source]

guide_context_down: str | None = None

guide_context_up: str | None = None

guide_end: int

guide_seq: str

guide_start: int

length: int

pam_end: int

pam_search: str

pam_seq: str

pam_start: int

reverse: bool

class crispio.map.GuideMatchCollection(guide_seq: str, pam_search: str, matches: Iterable[GuideMatch], guide_name: str | None = None)[source]

Bases: object

Set of guides with the same sequence but potentially with multiple matches.

guide_seq

Guide spacer sequence.

Type: str

pam_search

IUPAC search string for PAM.

Type: str

matches

Objects with matching information.

Type: iterable of GuideMatch

guide_name

Name or identifier of guide.

Type: str, optional

classmethod from_search(guide_seq: str, genome: str, pam_search: str = 'NGG', guide_name: str | None = None, in_memory: bool = False)[source]

Find the location of a guide sequence in a genome.

Searches the genome on the forward strand, then on the reverse strand, returning the matches that have an adjacent PAM in the order found.

The default behavior is to find matches lazily to save memory and time.

Parameters:

guide_seq (str) – The sequence of the guide to be found.
pam_search (str, optional) – The sequence (IUPAC codes allowed) of the PAM to match. Default: “NGG”.
genome (str) – The genome sequence to search.
guide_name (str, optional) – Name or identifier of guide.

Raises:

ValueError – If guide not found in genome with appropriate PAM.

Returns:

An iterator of dictionaries of match information.

Return type:

GuideMatchCollection

Examples

>>> gmc = GuideMatchCollection.from_search("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG")
>>> len(gmc)
2
>>> for match in gmc:
...     print(match)
...
TTTTTTTAAAAAAA
TTTTTTTAAAAAAA
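
The forward-then-reverse ordering and the lazy evaluation can be sketched with a generator. This is an illustration only, not the library's implementation: `search` is a hypothetical function, the PAM is passed as an already expanded regular expression, and positions are reported in the coordinates of the strand being scanned.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def search(guide_seq, genome, pam_regex="[ACGT]GG"):
    """Lazily yield (strand, guide_start, pam) for guide hits followed by a PAM."""
    # Lookahead so that overlapping occurrences are all reported.
    pattern = re.compile("(?=(" + re.escape(guide_seq) + ")(" + pam_regex + "))")
    strands = (("+", genome), ("-", genome.translate(COMPLEMENT)[::-1]))
    for strand, seq in strands:  # forward strand first, then reverse
        for hit in pattern.finditer(seq):
            yield strand, hit.start(1), hit.group(2)


# The example genome is its own reverse complement, so the guide is found twice.
found = list(search("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG"))
print(found)  # [('+', 3, 'CGG'), ('-', 3, 'CGG')]
```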

guide_name: str | None = None

guide_seq: str

matches: Iterable[GuideMatch]

pam_search: str

crispio.plot module

crispio.utils module

Utilities for crispio package.

class crispio.utils.SequenceCollection(pams, scaffolds)

Bases: tuple

pams: Alias for field number 0

scaffolds: Alias for field number 1

Module contents