crispio package

Submodules

crispio.annotate module

Tools for annotating guide RNAs from GFF data

crispio.annotate.annotate_from_gff(sgRNA: Mapping[str, str | int], gff_data: GffFile, seqid: str, tags: Iterable[str] | None = None) → Dict[str, int | str]

Annotate dictionary of guide information with GFF annotations.

The dictionary must have at least the keys ‘pam_start’ and ‘pam_end’, both mapping to numerical values.

Parameters:
  • sgRNA (dict) – Dictionary containing ‘pam_start’ and ‘pam_end’, and optionally other information about a guide.

  • gff_data (bioino.GffFile) – GffFile object which was loaded with lookup=True.

  • seqid (str) – Sequence ID (GFF column 1) of the sequence the guide was found on.

  • tags (list of str, optional) – Which GFF tags to extract from attributes of GFF features.

Returns:

Guide RNA dictionary updated with GFF annotations.

Return type:

dict

Examples

Set up a minimal single-gene GFF and build its lookup table:

>>> from io import StringIO
>>> from bioino import GffFile
>>> gff_line = '\t'.join([
...     'chr1', 'RefSeq', 'gene', '100', '500', '.', '+', '.',
...     'ID=g1;Name=geneA;locus_tag=b0001',
... ])
>>> gff = GffFile.from_file(StringIO(gff_line), lookup=True)

Guide PAM midpoint at position 251, inside the gene body (offset from gene start = 251 - 100 = 151):

>>> from crispio.annotate import annotate_from_gff
>>> result = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chr1')
>>> result['ann_Name']
'geneA'
>>> result['ann_locus_tag']
'b0001'
>>> result['pam_offset']
151
>>> result['ann_strand']
'+'
>>> result['ann_start'], result['ann_end']
(100, 500)

Intergenic guide (PAM midpoint 61, upstream of gene start): bioino 0.0.3 automatically assigns the _up- prefix and computes the distance to the nearest feature, so no manual prefix logic is needed in crispio:

>>> result2 = annotate_from_gff({'pam_start': 60, 'pam_end': 63}, gff, seqid='chr1')
>>> result2['ann_locus_tag']
'_up-geneA'
>>> result2['pam_offset']
39

Unknown seqid (e.g. a plasmid not in the GFF) returns the input dict unchanged rather than raising:

>>> result3 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chrX')
>>> sorted(result3.keys())
['pam_end', 'pam_start']

Custom tag set — only extract Name:

>>> result4 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff,
...                              seqid='chr1', tags=['Name'])
>>> 'ann_Name' in result4
True
>>> 'ann_locus_tag' in result4
False
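The offset arithmetic quoted in the examples above can be reproduced in a few lines of plain Python. This is a minimal sketch, not the crispio or bioino implementation: the helper names pam_midpoint and offset_from_feature are hypothetical, flooring the midpoint to an integer is an assumption inferred from the example values, and strand is ignored.

```python
def pam_midpoint(pam_start: int, pam_end: int) -> int:
    """Integer midpoint of the PAM (assumption: floor division)."""
    return (pam_start + pam_end) // 2


def offset_from_feature(midpoint: int, feature_start: int, feature_end: int):
    """Return (offset, relation) of a midpoint relative to a feature.

    Hypothetical helper. Inside the feature, the offset is measured from
    the feature start; outside, it is the distance to the nearest edge.
    Strand is ignored, so "upstream" simply means a lower coordinate.
    """
    if feature_start <= midpoint <= feature_end:
        return midpoint - feature_start, "inside"
    if midpoint < feature_start:
        return feature_start - midpoint, "upstream"
    return midpoint - feature_end, "downstream"


# Values taken from the examples above (the gene spans 100..500).
print(offset_from_feature(pam_midpoint(250, 253), 100, 500))  # (151, 'inside')
print(offset_from_feature(pam_midpoint(60, 63), 100, 500))    # (39, 'upstream')
```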

crispio.cli module

Command-line interface for crispio.

crispio.cli.main()

crispio.crosstalk module

Tools for detecting guide crosstalk.

crispio.features module

crispio.features.featurize(gff: GffLine, features: str | Iterable[str] | None = None, scaffold: str | None = None) → int | str | Dict[str, int | str]

Featurize a guide RNA represented by a bioino.GffLine.

Depending on the feature to be calculated, the GFF should have attributes ‘pam_sequence’, ‘guide_sequence’, ‘guide_context_up’, ‘guide_context_down’, and ‘ann_strand’.

Parameters:
  • gff (bioino.GffLine) – Input guide RNA with additional attributes.

  • features (str or list of str, optional) – The names of the features to be calculated. Default: calculate all.

  • scaffold (str, optional) – Guide scaffold. Required for some features. If features is the default, scaffold must be provided.

Returns:

If features is a string, then returns the value of the feature. If it is a list, then returns a dictionary mapping feature names to values.

Return type:

dict, float, or str

Raises:
  • KeyError – If any features are not supported.

  • ValueError – If features is neither a string nor iterable.

  • AttributeError – If features is default but scaffold is not provided.

Examples

Build a representative GffLine (attributes required by the feature set):

>>> from bioino import GffLine
>>> gff_line = GffLine(
...     ['chr1', 'crispio', 'protospacer', 1, 20, '.', '+', '.'],
...     {
...         'guide_sequence':   'ATCGATCGATCGATCGATCG',
...         'pam_sequence':     'CGG',
...         'pam_search':       'NGG',
...         'guide_context_up': 'AAAAAAAAAAAAAAAAAACC',
...         'guide_context_down':'TTTTTTTTTTTTTTTTTTGG',
...         'ann_strand':       '-',
...     },
... )

Single feature by name — returns the raw value (not wrapped in a dict):

>>> from crispio.features import featurize
>>> featurize(gff_line, 'guide_gc')
'0.500'
>>> featurize(gff_line, 'guide_purine')
'0.500'
>>> featurize(gff_line, 'seed_seq')
'GATCG'
>>> featurize(gff_line, 'guide_start3')
'ATC'
>>> featurize(gff_line, 'guide_end3')
'TCG'
>>> featurize(gff_line, 'pam_gc')
'1.000'
>>> featurize(gff_line, 'pam_n')
'C'
>>> featurize(gff_line, 'pam_def')
'GG'
>>> featurize(gff_line, 'context_up2')
'CC'
>>> featurize(gff_line, 'context_down2')
'TT'
>>> featurize(gff_line, 'on_nontemplate_strand')
True
>>> featurize(gff_line, 'guide_autocorr')
'8.223'
>>> featurize(gff_line, 'pam_autocorr')
'1.500'

List of features — returns a feat_-prefixed dict:

>>> featurize(gff_line, features=['guide_gc', 'seed_seq', 'guide_start3'])
{'feat_guide_gc': '0.500', 'feat_seed_seq': 'GATCG', 'feat_guide_start3': 'ATC'}

Computing all features requires a scaffold sequence (not a scaffold name). Retrieve it from crispio.utils.sequences:

>>> from crispio.utils import sequences
>>> scaffold_seq = sequences.scaffolds['Sth1']
>>> result = featurize(gff_line, scaffold=scaffold_seq)
>>> sorted(result.keys())   
['feat_context_down2', 'feat_context_up2', 'feat_context_up_autocorr',
 'feat_guide_autocorr', 'feat_guide_end3', 'feat_guide_gc', 'feat_guide_purine',
 'feat_guide_scaff_corr', 'feat_guide_start3', 'feat_on_nontemplate_strand',
 'feat_pam_autocorr', 'feat_pam_def', 'feat_pam_gc', 'feat_pam_n',
 'feat_pam_scaff_corr', 'feat_seed_seq']
>>> result['feat_guide_scaff_corr']
'9.770'
>>> result['feat_pam_scaff_corr']
'2.667'

Calling without a scaffold when computing all features raises AttributeError, not TypeError:

>>> featurize(gff_line)
Traceback (most recent call last):
    ...
AttributeError: Scaffold must be provided to calculate all features.

Unknown feature name raises KeyError:

>>> featurize(gff_line, 'not_a_feature')
Traceback (most recent call last):
    ...
KeyError: 'not_a_feature'
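The simpler sequence features, and the convention of returning a bare value for a string but a dictionary for a list, can be illustrated without crispio. The sketch below is a stand-in under stated assumptions: the registry _FEATURES and the function featurize_sketch are hypothetical names, only a handful of guide-sequence features are covered, and the three-decimal string formatting is inferred from the example outputs above.

```python
def _fraction(seq: str, letters: str) -> str:
    """Fraction of bases found in `letters`, formatted to three decimals."""
    return f"{sum(base in letters for base in seq) / len(seq):.3f}"


# Hypothetical registry: feature name -> function of the guide sequence.
_FEATURES = {
    "guide_gc": lambda seq: _fraction(seq, "GC"),
    "guide_purine": lambda seq: _fraction(seq, "AG"),
    "seed_seq": lambda seq: seq[-5:],      # last five bases of the guide
    "guide_start3": lambda seq: seq[:3],
    "guide_end3": lambda seq: seq[-3:],
}


def featurize_sketch(guide_seq, features=None):
    """Return one value for a string, or a feat_-prefixed dict for a list."""
    if features is None:
        features = list(_FEATURES)
    if isinstance(features, str):
        return _FEATURES[features](guide_seq)  # KeyError if unsupported
    try:
        names = list(features)
    except TypeError:
        raise ValueError("features must be a string or an iterable of strings")
    return {f"feat_{name}": _FEATURES[name](guide_seq) for name in names}


guide = "ATCGATCGATCGATCGATCG"
print(featurize_sketch(guide, "guide_gc"))                    # 0.500
print(featurize_sketch(guide, ["seed_seq", "guide_start3"]))
```

Unlike featurize, this sketch never needs a scaffold, because none of the features it covers compare the guide against one.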

crispio.features.get_context(pam_start: int, pam_end: int, guide_start: int, guide_end: int, genome: str, reverse: bool, extra_bases: int = 20) → Tuple[str, str]

Get surrounding sequence.

Examples

Use a genome with visually distinct regions to make direction clear: AAAA|CCCC|GGGG|TTTT|ACGT|TGCA (blocks of 4, 24 bp total)

>>> from crispio.features import get_context
>>> genome = 'AAAA' + 'CCCC' + 'GGGG' + 'TTTT' + 'ACGT' + 'TGCA'

Forward strand: guide at [4:8], PAM at [8:12], context window of 4 nt. The upstream context is the 4 nt before the guide and the downstream context is the 4 nt after the PAM; in the output shown here, the downstream context is listed first:

>>> get_context(pam_start=8, pam_end=12,
...             guide_start=4, guide_end=8,
...             genome=genome, reverse=False, extra_bases=4)
('TTTT', 'AAAA')

Reverse strand — PAM at [4:8] (on forward), guide at [8:12]; context is reverse-complemented and directions are flipped:

>>> get_context(pam_start=4, pam_end=8,
...             guide_start=8, guide_end=12,
...             genome=genome, reverse=True, extra_bases=4)
('TTTT', 'AAAA')

Context window at the right edge of the genome is truncated gracefully — Python slice semantics give an empty string rather than an error:

>>> get_context(pam_start=20, pam_end=24,
...             guide_start=16, guide_end=20,
...             genome=genome, reverse=False, extra_bases=4)
('', 'TTTT')
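The slicing and reverse-complement logic described above can be sketched in plain Python. This is not the crispio implementation: context_sketch and reverse_complement are hypothetical helpers, the function returns a dictionary with named keys rather than a tuple so that no ordering is implied, and it assumes "upstream" means the flank on the guide side and "downstream" the flank on the PAM side, read on the guide's own strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def context_sketch(genome, guide_start, guide_end, pam_start, pam_end,
                   reverse, extra_bases=20):
    """Return the flanking sequence on either side of guide + PAM."""
    left = min(guide_start, pam_start)
    right = max(guide_end, pam_end)
    # Clamp at 0: a negative slice start would wrap around the string.
    left_flank = genome[max(0, left - extra_bases):left]
    # Slicing past the end simply yields a shorter (or empty) string.
    right_flank = genome[right:right + extra_bases]
    if reverse:
        # On the reverse strand the flanks swap roles and are complemented.
        return {"upstream": reverse_complement(right_flank),
                "downstream": reverse_complement(left_flank)}
    return {"upstream": left_flank, "downstream": right_flank}


genome = "AAAA" + "CCCC" + "GGGG" + "TTTT" + "ACGT" + "TGCA"
print(context_sketch(genome, 4, 8, 8, 12, reverse=False, extra_bases=4))
# {'upstream': 'AAAA', 'downstream': 'TTTT'}
print(context_sketch(genome, 16, 20, 20, 24, reverse=False, extra_bases=4))
# {'upstream': 'TTTT', 'downstream': ''}
```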

crispio.features.get_features() → List[str]

Get the list of available features.

crispio.fitness module

crispio.map module

Classes for representing guide RNA libraries.

class crispio.map.GuideLibrary(genome: str, guide_matches: Iterable[GuideMatchCollection])

Bases: object

Library of guides from a genome.

genome

Genome sequence that guides are matched to.

Type:

str

guide_matches

List of matches to the genome.

Type:

list of GuideMatchCollection

as_gff(max_per_collection: int | None = None, annotations_from: GffFile | None = None, tags: Iterable[str] | None = None, gff_defaults: dict[str, str | int] | None = None) → Iterator[GffLine]

Convert into an iterable of bioino.GffLine objects.

Parameters:
  • max_per_collection (int, optional) – Maximum number of bioino.GffLine objects to return for each GuideMatchCollection. Default: return all.

  • annotations_from (bioino.GffFile, optional) – If provided, use its lookup table to annotate the returned GffLine objects.

  • tags (list of str, optional) – Which tags to take from annotations_from.

  • gff_defaults (dict, optional) – Fallback values for the columns that the GFF format requires (columns 1-8), used when a guide does not supply them.

Yields:

bioino.GffLine – Corresponding to a GuideMatch.

Examples

>>> from crispio.map import GuideLibrary
>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> lib = GuideLibrary.from_generating(genome=genome)
>>> for gff in lib.as_gff(gff_defaults=dict(seqid="my_seq", source="here", feature="protospacer")):  
...     print(gff)
...
my_seq    here    protospacer     23      42      .       +       .       ID=sgr-06a4ba9b;Name=42-united_exodus;guide_context_down=ATATATATATATAATATATA;guide_context_up=ATATATATATATATATATAT;guide_length=20;guide_re_sites=;guide_sequence=ATACCGTTTTTTTAAAAAAA;guide_sequence_hash=a3987295;mnemonic=united_exodus;pam_end=45;pam_replichore=L;pam_search=NGG;pam_sequence=CGG;pam_start=42;source_name=42-united_exodus
my_seq    here    protospacer     29      48      .       -       .       ID=sgr-f84d1c6a;Name=25-zigzag_state;guide_context_down=TATATATATATATATATATA;guide_context_up=ATATATATATTATATATATA;guide_length=20;guide_re_sites=;guide_sequence=TATCCGTTTTTTTAAAAAAA;guide_sequence_hash=188c9ee6;mnemonic=zigzag_state;pam_end=28;pam_replichore=R;pam_search=NGG;pam_sequence=CGG;pam_start=25;source_name=25-zigzag_state

The seqid supplied in gff_defaults propagates to every output GffLine. This is the mechanism used by the multi-chromosome CLI path to tag each guide with the chromosome it was found on:

>>> genome = ('ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGG'
...           'ATATATATATATAATATATATATATAATATATATATATA')
>>> lib = GuideLibrary.from_generating(genome=genome, in_memory=True)
>>> defaults = dict(seqid='NC_000913.3', source='crispio',
...                 feature='protospacer', score='.', phase='.')
>>> seqids = {line.columns.seqid for line in lib.as_gff(gff_defaults=defaults)}
>>> seqids
{'NC_000913.3'}

classmethod from_generating(genome: str, max_length: int = 20, min_length: int | None = None, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)

Find all guides matching a PAM sequence in a given genome.

The default behavior is to find matches lazily to save memory and time.

Parameters:
  • genome (str) – Genome sequence to search.

  • max_length (int, optional) – Maximum guide length. Default: 20.

  • min_length (int, optional) – Minimum guide length. Default: same as max_length.

  • pam_search (str, optional) – IUPAC PAM sequence to search for. Default: “NGG”.

  • in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.

Examples

>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> gl = GuideLibrary.from_generating(genome=genome)
>>> len(gl)
2
>>> for match_collection in gl:
...     for guide in match_collection:
...             print(guide)
...
ATACCGTTTTTTTAAAAAAA
TATCCGTTTTTTTAAAAAAA
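One way to picture the search is as a regular-expression scan over both strands. The sketch below is an illustration under assumptions, not the crispio algorithm: find_guides and reverse_complement are hypothetical names, the guide length is fixed, the genome is assumed to be uppercase ACGT, and the PAM is assumed to lie immediately 3' of the guide.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_guides(genome: str, pam_search: str = "NGG", length: int = 20):
    """Lazily yield (guide, pam, reverse) for every guide followed by a PAM."""
    pam_regex = "".join(IUPAC[base] for base in pam_search.upper())
    # A lookahead keeps each match zero-width, so overlapping sites are found.
    pattern = re.compile(f"(?=([ACGT]{{{length}}})({pam_regex}))")
    for reverse, strand in ((False, genome), (True, reverse_complement(genome))):
        for match in pattern.finditer(strand):
            yield match.group(1), match.group(2), reverse


genome = ("ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGG"
          "ATATATATATATAATATATATATATAATATATATATATA")
for guide, pam, reverse in find_guides(genome):
    print(guide, pam, reverse)
# ATACCGTTTTTTTAAAAAAA CGG False
# TATCCGTTTTTTTAAAAAAA CGG True
```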

classmethod from_mapping(guide_seq: str | Iterable[str] | FastaSequence | Iterable[FastaSequence], genome: str, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)

Map a set of expected guides to a genome.

The default behavior is to find matches lazily to save memory and time.

Parameters:
  • guide_seq (str or bioino.FastaSequence or list) – Guides to map.

  • genome (str) – Genome to map against.

  • pam_search (str, optional) – IUPAC PAM sequence to search against. Default: “NGG”.

  • in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.

Return type:

GuideLibrary

Examples

>>> genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
>>> guide_seq = ["ATGATCGATCGATCG", "ATGATCGATCGATCGCCC"]
>>> gl = GuideLibrary.from_mapping(guide_seq=guide_seq, genome=genome)
>>> for collection in gl:
...     for match in collection:
...             print(match.as_dict())
...
{'pam_search': 'NGG', 'guide_seq': 'ATGATCGATCGATCG', 'pam_seq': 'AGG', 'pam_start': 45, 'reverse': False, 'guide_context_up': 'CTTTTTTTTTTAAAAAAAAA', 'guide_context_down': 'AAAAAAAAAACCCCCCCCCC', 'pam_end': 48, 'length': 15, 'guide_start': 30, 'guide_end': 45}
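The difference between lazy matching and instantiating matches in memory is the usual Python distinction between a generator and a list. The sketch below shows that general pattern only; iter_matches is a hypothetical function, and how crispio implements in_memory internally is not shown in this reference.

```python
def iter_matches(genome: str, motif: str, log: list):
    """Yield each start position of `motif`, recording work done in `log`."""
    start = genome.find(motif)
    while start != -1:
        log.append(start)          # proof that this position was searched
        yield start
        start = genome.find(motif, start + 1)


genome = "CGGATCGGTTCGG"

# Lazy: creating the generator does no searching at all.
log = []
lazy = iter_matches(genome, "CGG", log)
print(log)         # []
print(next(lazy))  # 0
print(log)         # [0]  (only the first match has been computed)

# In memory: the whole genome is scanned up front and the results kept.
in_memory = list(iter_matches(genome, "CGG", []))
print(in_memory)   # [0, 5, 10]
```

A lazy library can be consumed only once unless it is rebuilt, which is the usual trade-off against the memory saved.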

genome: str
guide_matches: Iterable[GuideMatchCollection]

class crispio.map.GuideMatch(pam_search: str, guide_seq: str, pam_seq: str, pam_start: int, reverse: bool)

Bases: object

Information about a guide match in a genome.

pam_search

IUPAC search string for PAM.

Type:

str

guide_seq

Guide spacer sequence.

Type:

str

pam_seq

Actual PAM sequence.

Type:

str

pam_start

Chromosome coordinate of PAM start.

Type:

int

pam_end

Chromosome coordinate of PAM end.

Type:

int

length

Length of guide.

Type:

int

Examples

>>> from crispio.map import GuideMatch
>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CGG", pam_start=10, reverse=False)
GuideMatch(pam_search='NGG', guide_seq='ATCGATCG', pam_seq='CGG', pam_start=10, reverse=False, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=2, guide_end=10)
>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CCG", pam_start=10, reverse=True)
GuideMatch(pam_search='NGG', guide_seq='CGATCGAT', pam_seq='CGG', pam_start=10, reverse=True, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=13, guide_end=21)
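The derived fields in the two reprs above (pam_end, length, guide_start, guide_end) follow a pattern that a dataclass with __post_init__ can reproduce. MatchSketch below is a hypothetical stand-in whose rules are inferred only from those two examples; it is not the GuideMatch source.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class MatchSketch:
    """Hypothetical stand-in showing how the derived fields could be filled."""
    guide_seq: str
    pam_seq: str
    pam_start: int
    reverse: bool
    pam_end: int = field(init=False)
    length: int = field(init=False)
    guide_start: int = field(init=False)
    guide_end: int = field(init=False)

    def __post_init__(self):
        if self.reverse:
            # Assumption: inputs are forward-strand sequences, stored as
            # their reverse complements for reverse-strand matches.
            self.guide_seq = reverse_complement(self.guide_seq)
            self.pam_seq = reverse_complement(self.pam_seq)
        self.pam_end = self.pam_start + len(self.pam_seq)
        self.length = len(self.guide_seq)
        if self.reverse:
            # In forward coordinates the guide sits after the PAM.
            self.guide_start = self.pam_end
            self.guide_end = self.pam_end + self.length
        else:
            # On the forward strand the guide ends where the PAM starts.
            self.guide_start = self.pam_start - self.length
            self.guide_end = self.pam_start


print(MatchSketch("ATCGATCG", "CGG", 10, reverse=False))
print(MatchSketch("ATCGATCG", "CCG", 10, reverse=True))
```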

as_dict()
guide_context_down: str | None = None
guide_context_up: str | None = None
guide_end: int
guide_seq: str
guide_start: int
length: int
pam_end: int
pam_search: str
pam_seq: str
pam_start: int
reverse: bool

class crispio.map.GuideMatchCollection(guide_seq: str, pam_search: str, matches: Iterable[GuideMatch], guide_name: str | None = None)

Bases: object

Set of guides with the same sequence but potentially with multiple matches.

guide_seq

Guide spacer sequence.

Type:

str

pam_search

IUPAC search string for PAM.

Type:

str

matches

Objects with matching information.

Type:

iterable of GuideMatch

guide_name

Name or identifier of guide.

Type:

str, optional

classmethod from_search

Find the location of a guide sequence in a genome.

Searches the forward strand of the genome, then the reverse strand, returning every match that has an adjacent PAM, in the order found.

The default behavior is to find matches lazily to save memory and time.

Parameters:
  • guide_seq (str) – The sequence of the guide to be found.

  • pam_search (str, optional) – The sequence (IUPAC codes allowed) of the PAM to match. Default: “NGG”.

  • genome (str) – The genome sequence to search.

  • guide_name (str) – Name or identifier of guide.

Raises:

ValueError – If guide not found in genome with appropriate PAM.

Returns:

An iterator of dictionaries of match information.

Return type:

GuideMatches

Examples

>>> from crispio.map import GuideMatchCollection
>>> gmc = GuideMatchCollection.from_search("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG")
>>> len(gmc)
2
>>> for match in gmc:
...     print(match)
...
TTTTTTTAAAAAAA
TTTTTTTAAAAAAA
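The forward-then-reverse search with an adjacent-PAM requirement can be sketched as follows. search_guide is a hypothetical function, not the library's method: it returns a list rather than a lazy iterator, assumes an uppercase ACGT genome, and reports PAM coordinates on the forward strand, which is an assumption based on the GuideMatch examples.

```python
import re

_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def search_guide(guide_seq: str, genome: str, pam_search: str = "NGG"):
    """Return match dicts, forward strand first, then reverse strand."""
    pam_regex = "".join(_IUPAC[base] for base in pam_search.upper())
    # Lookahead so that overlapping occurrences are all reported.
    pattern = re.compile(f"(?=({re.escape(guide_seq)})({pam_regex}))")
    size = len(genome)
    matches = []
    for reverse, strand in ((False, genome), (True, reverse_complement(genome))):
        for match in pattern.finditer(strand):
            pam_start, pam_end = match.start(2), match.end(2)
            if reverse:
                # Convert reverse-strand positions to forward coordinates.
                pam_start, pam_end = size - pam_end, size - pam_start
            matches.append({"pam_seq": match.group(2), "reverse": reverse,
                            "pam_start": pam_start, "pam_end": pam_end})
    if not matches:
        raise ValueError(f"{guide_seq} not found next to a {pam_search} PAM")
    return matches


for hit in search_guide("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG"):
    print(hit)
# {'pam_seq': 'CGG', 'reverse': False, 'pam_start': 17, 'pam_end': 20}
# {'pam_seq': 'CGG', 'reverse': True, 'pam_start': 0, 'pam_end': 3}
```

The example genome is its own reverse complement, which is why the same guide is found once on each strand.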

guide_name: str | None = None
guide_seq: str
matches: Iterable[GuideMatch]
pam_search: str

crispio.plot module

crispio.utils module

Utilities for crispio package.

class crispio.utils.SequenceCollection(pams, scaffolds)

Bases: tuple

pams

Alias for field number 0

scaffolds

Alias for field number 1
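"Bases: tuple" together with the "Alias for field number" entries is what Sphinx emits for a collections.namedtuple. The sketch below rebuilds an equivalent type to show how the two fields behave. The dictionary contents are hypothetical placeholders, not the sequences shipped with crispio, and treating pams as a mapping is an assumption (only scaffolds is indexed by name in the featurize example).

```python
from collections import namedtuple

# Equivalent type: a two-field tuple with named accessors.
SequenceCollection = namedtuple("SequenceCollection", ["pams", "scaffolds"])

# Hypothetical placeholder contents, for illustration only.
example = SequenceCollection(
    pams={"example_pam": "NGG"},
    scaffolds={"example_scaffold": "GTTTTAGAGC"},
)

print(example.scaffolds["example_scaffold"])  # GTTTTAGAGC
print(example[0] is example.pams)             # True: field number 0
print(example[1] is example.scaffolds)        # True: field number 1
```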

Module contents