crispio package
Submodules
crispio.annotate module
Tools for annotating guide RNAs from GFF data
- crispio.annotate.annotate_from_gff(sgRNA: Mapping[str, str | int], gff_data: GffFile, seqid: str, tags: Iterable[str] | None = None) → Dict[str, int | str]
Annotate dictionary of guide information with GFF annotations.
The dictionary must have at least the keys ‘pam_start’ and ‘pam_end’, mapping to numerical values.
- Parameters:
sgRNA (dict) – Dictionary containing ‘pam_start’ and ‘pam_end’, and optionally other information about a guide.
gff_data (bioino.GffFile) – GffFile object which was loaded with lookup=True.
tags (list of str, optional) – Which GFF tags to extract from attributes of GFF features.
- Returns:
Guide RNA dictionary updated with GFF annotations.
- Return type:
dict
Examples
Set up a minimal single-gene GFF and build its lookup table:
>>> from io import StringIO
>>> from bioino import GffFile
>>> gff_line = '\t'.join([
...     'chr1', 'RefSeq', 'gene', '100', '500', '.', '+', '.',
...     'ID=g1;Name=geneA;locus_tag=b0001',
... ])
>>> gff = GffFile.from_file(StringIO(gff_line), lookup=True)
Guide PAM midpoint at position 251, inside the gene body (offset from gene start = 251 - 100 = 151):
>>> from crispio.annotate import annotate_from_gff
>>> result = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chr1')
>>> result['ann_Name']
'geneA'
>>> result['ann_locus_tag']
'b0001'
>>> result['pam_offset']
151
>>> result['ann_strand']
'+'
>>> result['ann_start'], result['ann_end']
(100, 500)
Intergenic guide (PAM midpoint 61, upstream of gene start): bioino 0.0.3 automatically assigns the _up- prefix and computes the distance to the nearest feature, so no manual prefix logic is needed in crispio:
>>> result2 = annotate_from_gff({'pam_start': 60, 'pam_end': 63}, gff, seqid='chr1')
>>> result2['ann_locus_tag']
'_up-geneA'
>>> result2['pam_offset']
39
An unknown seqid (e.g. a plasmid not in the GFF) returns the input dict unchanged rather than raising:
>>> result3 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff, seqid='chrX')
>>> sorted(result3.keys())
['pam_end', 'pam_start']
Custom tag set, extracting only Name:
>>> result4 = annotate_from_gff({'pam_start': 250, 'pam_end': 253}, gff,
...                             seqid='chr1', tags=['Name'])
>>> 'ann_Name' in result4
True
>>> 'ann_locus_tag' in result4
False
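The coordinate arithmetic behind the examples above can be reproduced with a small standalone sketch. It assumes the PAM midpoint is the integer mean of pam_start and pam_end and that pam_offset is the distance between that midpoint and the gene start. The helper name pam_offset_sketch and its labelling rule are hypothetical, inferred only from the example outputs; this is not the crispio or bioino implementation.

```python
def pam_offset_sketch(pam_start, pam_end, gene_start, gene_end, gene_name):
    """Hypothetical helper: label a PAM relative to a single '+' strand gene."""
    midpoint = (pam_start + pam_end) // 2  # integer midpoint of the PAM
    if gene_start <= midpoint <= gene_end:
        # Inside the gene body: offset is measured from the gene start.
        return {"label": gene_name, "pam_offset": midpoint - gene_start}
    if midpoint < gene_start:
        # Upstream of the gene: mimic the '_up-' prefix seen in the example.
        return {"label": "_up-" + gene_name, "pam_offset": gene_start - midpoint}
    # The downstream case is not shown in the documentation, so no label is guessed.
    return {"label": None, "pam_offset": midpoint - gene_end}


print(pam_offset_sketch(250, 253, 100, 500, "geneA"))
print(pam_offset_sketch(60, 63, 100, 500, "geneA"))
```

The two calls mirror the genic and intergenic guides used in the examples for annotate_from_gff.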
crispio.cli module
Command-line interface for crispio.
crispio.crosstalk module
Tools for detecting guide crosstalk.
crispio.features module
- crispio.features.featurize(gff: GffLine, features: str | Iterable[str] | None = None, scaffold: str | None = None) → int | str | Dict[str, int | str]
Featurize a guide RNA represented by a bioino.GffLine.
Depending on the feature to be calculated, the GFF should have attributes ‘pam_sequence’, ‘guide_sequence’, ‘guide_context_up’, ‘guide_context_down’, and ‘ann_strand’.
- Parameters:
gff (bioino.GffLine) – Input guide RNA with additional attributes.
features (str or list of str, optional) – The names of the features to be calculated. Default: calculate all.
scaffold (str, optional) – Guide scaffold. Required for some features. If features is the default, scaffold must be provided.
- Returns:
If features is a string, then returns the value of the feature. If it is a list, then returns a dictionary mapping feature names to values.
- Return type:
dict, int, or str
- Raises:
KeyError – If any features are not supported.
ValueError – If features is neither a string nor iterable.
AttributeError – If features is default but scaffold is not provided.
Examples
Build a representative GffLine (attributes required by the feature set):
>>> from bioino import GffLine
>>> gff_line = GffLine(
...     ['chr1', 'crispio', 'protospacer', 1, 20, '.', '+', '.'],
...     {
...         'guide_sequence': 'ATCGATCGATCGATCGATCG',
...         'pam_sequence': 'CGG',
...         'pam_search': 'NGG',
...         'guide_context_up': 'AAAAAAAAAAAAAAAAAACC',
...         'guide_context_down': 'TTTTTTTTTTTTTTTTTTGG',
...         'ann_strand': '-',
...     },
... )
Single feature by name — returns the raw value (not wrapped in a dict):
>>> from crispio.features import featurize
>>> featurize(gff_line, 'guide_gc')
'0.500'
>>> featurize(gff_line, 'guide_purine')
'0.500'
>>> featurize(gff_line, 'seed_seq')
'GATCG'
>>> featurize(gff_line, 'guide_start3')
'ATC'
>>> featurize(gff_line, 'guide_end3')
'TCG'
>>> featurize(gff_line, 'pam_gc')
'1.000'
>>> featurize(gff_line, 'pam_n')
'C'
>>> featurize(gff_line, 'pam_def')
'GG'
>>> featurize(gff_line, 'context_up2')
'CC'
>>> featurize(gff_line, 'context_down2')
'TT'
>>> featurize(gff_line, 'on_nontemplate_strand')
True
>>> featurize(gff_line, 'guide_autocorr')
'8.223'
>>> featurize(gff_line, 'pam_autocorr')
'1.500'
A list of features returns a feat_-prefixed dict:
>>> featurize(gff_line, features=['guide_gc', 'seed_seq', 'guide_start3'])
{'feat_guide_gc': '0.500', 'feat_seed_seq': 'GATCG', 'feat_guide_start3': 'ATC'}
Computing all features requires a scaffold sequence (not a name string). Retrieve it from crispio.utils.sequences:
>>> from crispio.utils import sequences
>>> scaffold_seq = sequences.scaffolds['Sth1']
>>> result = featurize(gff_line, scaffold=scaffold_seq)
>>> sorted(result.keys())
['feat_context_down2', 'feat_context_up2', 'feat_context_up_autocorr', 'feat_guide_autocorr', 'feat_guide_end3', 'feat_guide_gc', 'feat_guide_purine', 'feat_guide_scaff_corr', 'feat_guide_start3', 'feat_on_nontemplate_strand', 'feat_pam_autocorr', 'feat_pam_def', 'feat_pam_gc', 'feat_pam_n', 'feat_pam_scaff_corr', 'feat_seed_seq']
>>> result['feat_guide_scaff_corr']
'9.770'
>>> result['feat_pam_scaff_corr']
'2.667'
Calling without a scaffold when computing all features raises AttributeError, not TypeError:
>>> featurize(gff_line)
Traceback (most recent call last):
...
AttributeError: Scaffold must be provided to calculate all features.
An unknown feature name raises KeyError:
>>> featurize(gff_line, 'not_a_feature')
Traceback (most recent call last):
...
KeyError: 'not_a_feature'
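Several of the simpler features shown above are plain string operations, which the following sketch re-derives by hand. The function simple_features_sketch is hypothetical and only assumes what the example outputs suggest: fractions are formatted to three decimals, the seed is the last 5 nt of the guide, and the PAM follows the NGG layout (one N followed by the defined bases). The autocorrelation and scaffold correlation features are not reproduced here because their definitions are not given in this documentation.

```python
def fraction(seq, letters):
    """Fraction of bases in seq that belong to letters, as a 3-decimal string."""
    return f"{sum(base in letters for base in seq) / len(seq):.3f}"


def simple_features_sketch(guide, pam, context_up, context_down):
    """Hypothetical re-derivation of a few string-based features."""
    return {
        "guide_gc": fraction(guide, "GC"),
        "guide_purine": fraction(guide, "AG"),
        "seed_seq": guide[-5:],             # assumed: PAM-proximal 5 nt
        "guide_start3": guide[:3],
        "guide_end3": guide[-3:],
        "pam_gc": fraction(pam, "GC"),
        "pam_n": pam[0],                    # assumed NGG layout: first base is the N
        "pam_def": pam[1:],                 # the defined part of the PAM
        "context_up2": context_up[-2:],     # 2 nt immediately before the guide
        "context_down2": context_down[:2],  # first 2 nt of the downstream context
    }


features = simple_features_sketch(
    guide="ATCGATCGATCGATCGATCG",
    pam="CGG",
    context_up="AAAAAAAAAAAAAAAAAACC",
    context_down="TTTTTTTTTTTTTTTTTTGG",
)
for name, value in features.items():
    print(name, value)
```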
- crispio.features.get_context(pam_start: int, pam_end: int, guide_start: int, guide_end: int, genome: str, reverse: bool, extra_bases: int = 20) → Tuple[str, str]
Get surrounding sequence.
Examples
Use a genome with visually distinct regions to make direction clear: AAAA|CCCC|GGGG|TTTT|ACGT|TGCA (blocks of 4, 24 bp total).
>>> from crispio.features import get_context
>>> genome = 'AAAA' + 'CCCC' + 'GGGG' + 'TTTT' + 'ACGT' + 'TGCA'
Forward strand: guide at [4:8], PAM at [8:12], context window of 4 nt. The upstream context is the 4 nt before the guide (AAAA) and the downstream context is the 4 nt after the PAM (TTTT). In the output shown, the downstream context comes first:
>>> get_context(pam_start=8, pam_end=12,
...             guide_start=4, guide_end=8,
...             genome=genome, reverse=False, extra_bases=4)
('TTTT', 'AAAA')
Reverse strand: PAM at [4:8] (in forward coordinates), guide at [8:12]. The context is reverse-complemented and the directions are flipped:
>>> get_context(pam_start=4, pam_end=8,
...             guide_start=8, guide_end=12,
...             genome=genome, reverse=True, extra_bases=4)
('TTTT', 'AAAA')
A context window at the right edge of the genome is truncated gracefully: Python slice semantics give an empty string rather than an error.
>>> get_context(pam_start=20, pam_end=24,
...             guide_start=16, guide_end=20,
...             genome=genome, reverse=False, extra_bases=4)
('', 'TTTT')
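The three outputs above are consistent with a simple slicing scheme, sketched below as a standalone function. This is an illustration under stated assumptions, not the crispio source: context_sketch and reverse_complement are hypothetical names, the tuple is returned as (downstream, upstream) because that order matches the example outputs, and the left edge is clamped at 0 as a guess at sensible behaviour that the documentation does not describe.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def context_sketch(pam_start, pam_end, guide_start, guide_end,
                   genome, reverse, extra_bases=20):
    """Hypothetical helper returning (downstream, upstream) context."""
    left = min(pam_start, guide_start)   # leftmost forward coordinate of guide + PAM
    right = max(pam_end, guide_end)      # rightmost forward coordinate of guide + PAM
    before = genome[max(0, left - extra_bases):left]  # clamped at the left edge
    after = genome[right:right + extra_bases]         # slicing truncates at the right edge
    if reverse:
        # On the reverse strand, what follows on the forward strand is upstream.
        upstream, downstream = reverse_complement(after), reverse_complement(before)
    else:
        upstream, downstream = before, after
    return downstream, upstream


genome = "AAAA" + "CCCC" + "GGGG" + "TTTT" + "ACGT" + "TGCA"
print(context_sketch(8, 12, 4, 8, genome, reverse=False, extra_bases=4))
print(context_sketch(4, 8, 8, 12, genome, reverse=True, extra_bases=4))
print(context_sketch(20, 24, 16, 20, genome, reverse=False, extra_bases=4))
```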
crispio.fitness module
crispio.map module
Classes for representing guide RNA libraries.
- class crispio.map.GuideLibrary(genome: str, guide_matches: Iterable[GuideMatchCollection])
Bases: object
Library of guides from a genome.
- genome
Genome sequence that guides are matched to.
- Type:
str
- guide_matches
List of matches to the genome.
- Type:
list of GuideMatchCollection
- as_gff(max_per_collection: int | None = None, annotations_from: GffFile | None = None, tags: Iterable[str] | None = None, gff_defaults: dict[str, str | int] | None = None) → Iterator[GffLine]
Convert into an iterable of bioino.GffLine objects.
- Parameters:
max_per_collection (int, optional) – Number of bioino.GffLine objects to return for each GuideMatchCollection. Default: return all.
annotations_from (bioino.GffFile, optional) – If provided, use the lookup table to annotate the returned GffLine objects.
tags (list of str, optional) – Which tags to take from annotations_from.
gff_defaults (dict, optional) – In case of missing values that are essential for GFF file formats (namely columns 1-8), take values from this dictionary.
- Yields:
bioino.GffLine – Corresponding to a GuideMatch.
Examples
>>> from crispio.map import GuideLibrary
>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> lib = GuideLibrary.from_generating(genome=genome)
>>> for gff in lib.as_gff(gff_defaults=dict(seqid="my_seq", source="here", feature="protospacer")):
...     print(gff)
...
my_seq here protospacer 23 42 . + . ID=sgr-06a4ba9b;Name=42-united_exodus;guide_context_down=ATATATATATATAATATATA;guide_context_up=ATATATATATATATATATAT;guide_length=20;guide_re_sites=;guide_sequence=ATACCGTTTTTTTAAAAAAA;guide_sequence_hash=a3987295;mnemonic=united_exodus;pam_end=45;pam_replichore=L;pam_search=NGG;pam_sequence=CGG;pam_start=42;source_name=42-united_exodus
my_seq here protospacer 29 48 . - . ID=sgr-f84d1c6a;Name=25-zigzag_state;guide_context_down=TATATATATATATATATATA;guide_context_up=ATATATATATTATATATATA;guide_length=20;guide_re_sites=;guide_sequence=TATCCGTTTTTTTAAAAAAA;guide_sequence_hash=188c9ee6;mnemonic=zigzag_state;pam_end=28;pam_replichore=R;pam_search=NGG;pam_sequence=CGG;pam_start=25;source_name=25-zigzag_state
The seqid supplied in gff_defaults propagates to every output GffLine. This is the mechanism used by the multi-chromosome CLI path to tag each guide with the chromosome it was found on:
>>> genome = ('ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGG'
...           'ATATATATATATAATATATATATATAATATATATATATA')
>>> lib = GuideLibrary.from_generating(genome=genome, in_memory=True)
>>> defaults = dict(seqid='NC_000913.3', source='crispio',
...                 feature='protospacer', score='.', phase='.')
>>> seqids = {line.columns.seqid for line in lib.as_gff(gff_defaults=defaults)}
>>> seqids
{'NC_000913.3'}
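The role of gff_defaults, filling in the eight mandatory GFF columns when a guide does not supply them, can be pictured with the following sketch. The helper fill_gff_columns is hypothetical and is not how as_gff is implemented. The column names follow the keys used in the examples (seqid, source, feature, score, phase), and the built-in "." fallback for score and phase is an assumption suggested by the first example, where those two keys are omitted yet appear as "." in the output.

```python
GFF_COLUMNS = ("seqid", "source", "feature", "start", "end", "score", "strand", "phase")


def fill_gff_columns(known, gff_defaults=None):
    """Hypothetical helper: build the 8 GFF columns, falling back to defaults."""
    defaults = {"score": ".", "phase": "."}  # assumed built-in fallbacks
    defaults.update(gff_defaults or {})
    merged = {**defaults, **known}  # values known for the guide win over defaults
    missing = [col for col in GFF_COLUMNS if col not in merged]
    if missing:
        raise ValueError(f"Missing GFF columns: {missing}")
    return [merged[col] for col in GFF_COLUMNS]


row = fill_gff_columns(
    {"start": 23, "end": 42, "strand": "+"},
    gff_defaults=dict(seqid="my_seq", source="here", feature="protospacer"),
)
print("\t".join(str(value) for value in row))
```

Because every row is built from the same defaults, a single seqid value ends up in column 1 of each line, which is the behaviour the second example checks.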
- classmethod from_generating(genome: str, max_length: int = 20, min_length: int | None = None, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)
Find all guides matching a PAM sequence in a given genome.
The default behavior is to find matches lazily to save memory and time.
- Parameters:
genome (str) – Genome sequence to search.
max_length (int, optional) – Maximum guide length. Default: 20.
min_length (int, optional) – Minimum guide length. Default: same as max_length.
pam_search (str, optional) – IUPAC PAM sequence to search for. Default: “NGG”.
in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.
Examples
>>> from crispio.map import GuideLibrary
>>> genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
>>> gl = GuideLibrary.from_generating(genome=genome)
>>> len(gl)
2
>>> for match_collection in gl:
...     for guide in match_collection:
...         print(guide)
...
ATACCGTTTTTTTAAAAAAA
TATCCGTTTTTTTAAAAAAA
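To make the idea of lazy PAM matching concrete, here is a minimal generator sketch that scans both strands for an IUPAC PAM and yields the spacer immediately 5' of each hit. It is an illustration under assumptions, not the crispio algorithm: iter_guides_sketch, revcomp, and the IUPAC table are names introduced here, only a fixed guide length is handled, and the genome is constructed to mirror the one in the example.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_guides_sketch(genome, pam_search="NGG", length=20):
    """Lazily yield (guide, pam_start_on_forward_strand, reverse) tuples."""
    pam_len = len(pam_search)
    # A lookahead lets overlapping PAM sites be reported as well.
    pam_re = re.compile("(?=(" + "".join(IUPAC[b] for b in pam_search) + "))")
    for reverse, strand in ((False, genome), (True, revcomp(genome))):
        for hit in pam_re.finditer(strand):
            pos = hit.start()
            if pos < length:
                continue  # not enough sequence 5' of the PAM for a full guide
            pam_start = len(genome) - (pos + pam_len) if reverse else pos
            yield strand[pos - length:pos], pam_start, reverse


genome = "AT" * 12 + "A" + "CCG" + "T" * 7 + "A" * 7 + "CGG" + "ATATATATATATA" * 3
for guide, pam_start, reverse in iter_guides_sketch(genome):
    print(guide, pam_start, reverse)
```

Because the function is a generator, nothing is computed until it is iterated; wrapping the call in list() would be the rough analogue of in_memory=True.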
- classmethod from_mapping(guide_seq: str | Iterable[str] | FastaSequence | Iterable[FastaSequence], genome: str, pam_search: str = 'NGG', in_memory: bool = False, limit: int | None = None)
Map a set of expected guides to a genome.
The default behavior is to find matches lazily to save memory and time.
- Parameters:
guide_seq (str or bioino.FastaSequence or list) – Guides to map.
genome (str) – Genome to map against.
pam_search (str) – IUPAC PAM sequence to search against.
in_memory (bool, optional) – Whether to instantiate matches in memory. Default: lazy matching.
- Return type:
Examples
>>> from crispio.map import GuideLibrary
>>> genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
>>> guide_seq = ["ATGATCGATCGATCG", "ATGATCGATCGATCGCCC"]
>>> gl = GuideLibrary.from_mapping(guide_seq=guide_seq, genome=genome)
>>> for collection in gl:
...     for match in collection:
...         print(match.as_dict())
...
{'pam_search': 'NGG', 'guide_seq': 'ATGATCGATCGATCG', 'pam_seq': 'AGG', 'pam_start': 45, 'reverse': False, 'guide_context_up': 'CTTTTTTTTTTAAAAAAAAA', 'guide_context_down': 'AAAAAAAAAACCCCCCCCCC', 'pam_end': 48, 'length': 15, 'guide_start': 30, 'guide_end': 45}
- genome: str
- guide_matches: Iterable[GuideMatchCollection]
- class crispio.map.GuideMatch(pam_search: str, guide_seq: str, pam_seq: str, pam_start: int, reverse: bool)
Bases: object
Information about a guide matching a genome.
- pam_search
IUPAC search string for PAM.
- Type:
str
- guide_seq
Guide spacer sequence.
- Type:
str
- pam_seq
Actual PAM sequence.
- Type:
str
- pam_start
Chromosome coordinate of PAM start.
- Type:
int
- pam_end
Chromosome coordinate of PAM end.
- Type:
int
- length
Length of guide.
- Type:
int
Examples
>>> from crispio.map import GuideMatch
>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CGG", pam_start=10, reverse=False)
GuideMatch(pam_search='NGG', guide_seq='ATCGATCG', pam_seq='CGG', pam_start=10, reverse=False, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=2, guide_end=10)
>>> GuideMatch(pam_search="NGG", guide_seq="ATCGATCG", pam_seq="CCG", pam_start=10, reverse=True)
GuideMatch(pam_search='NGG', guide_seq='CGATCGAT', pam_seq='CGG', pam_start=10, reverse=True, guide_context_up=None, guide_context_down=None, pam_end=13, length=8, guide_start=13, guide_end=21)
- guide_context_down: str | None = None
- guide_context_up: str | None = None
- guide_end: int
- guide_seq: str
- guide_start: int
- length: int
- pam_end: int
- pam_search: str
- pam_seq: str
- pam_start: int
- reverse: bool
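The derived fields in the GuideMatch examples (pam_end, length, guide_start, guide_end) follow a pattern that the dataclass sketch below reproduces. MatchSketch is a hypothetical stand-in, not the real class: it assumes that pam_start is always a forward-strand coordinate, that the guide sits immediately before the PAM on the forward strand and immediately after it on the reverse strand, and that reverse-strand inputs are stored reverse-complemented, as the second example output suggests.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class MatchSketch:
    """Hypothetical stand-in showing how derived coordinates could be computed."""
    guide_seq: str
    pam_seq: str
    pam_start: int
    reverse: bool
    pam_end: int = field(init=False)
    length: int = field(init=False)
    guide_start: int = field(init=False)
    guide_end: int = field(init=False)

    def __post_init__(self):
        if self.reverse:
            # Assumed: forward-strand input is stored as its reverse complement.
            self.guide_seq = self.guide_seq.translate(_COMPLEMENT)[::-1]
            self.pam_seq = self.pam_seq.translate(_COMPLEMENT)[::-1]
        self.pam_end = self.pam_start + len(self.pam_seq)
        self.length = len(self.guide_seq)
        if self.reverse:
            # Reverse strand: the guide lies to the right of the PAM in forward coordinates.
            self.guide_start = self.pam_end
            self.guide_end = self.pam_end + self.length
        else:
            # Forward strand: the guide ends where the PAM begins.
            self.guide_start = self.pam_start - self.length
            self.guide_end = self.pam_start


print(MatchSketch("ATCGATCG", "CGG", 10, reverse=False))
print(MatchSketch("ATCGATCG", "CCG", 10, reverse=True))
```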
- class crispio.map.GuideMatchCollection(guide_seq: str, pam_search: str, matches: Iterable[GuideMatch], guide_name: str | None = None)
Bases: object
Set of guides with the same sequence but potentially with multiple matches.
- guide_seq
Guide spacer sequence.
- Type:
str
- pam_search
IUPAC search string for PAM.
- Type:
str
- matches
Objects with matching information.
- Type:
iterable of GuideMatch
- guide_name
Name or identifier of guide.
- Type:
str, optional
- classmethod from_search(guide_seq: str, genome: str, pam_search: str = 'NGG', guide_name: str | None = None, in_memory: bool = False)
Find the location of a guide sequence in a genome.
Searches the forward strand of the genome, then the reverse strand, returning the matches that have an adjacent PAM in the order found.
The default behavior is to find matches lazily to save memory and time.
- Parameters:
guide_seq (str) – The sequence of the guide to be found.
pam_search (str, optional) – The sequence (IUPAC codes allowed) of the PAM to match. Default: “NGG”.
genome (str) – The genome sequence to search.
guide_name (str) – Name or identifier of guide.
- Raises:
ValueError – If guide not found in genome with appropriate PAM.
- Returns:
An iterator of dictionaries of match information.
- Return type:
GuideMatches
Examples
>>> from crispio.map import GuideMatchCollection
>>> gmc = GuideMatchCollection.from_search("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG")
>>> len(gmc)
2
>>> for match in gmc:
...     print(match)
...
TTTTTTTAAAAAAA
TTTTTTTAAAAAAA
- guide_name: str | None = None
- guide_seq: str
- matches: Iterable[GuideMatch]
- pam_search: str
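The search order described for from_search (forward strand first, then reverse, keeping only hits with an adjacent PAM) can be sketched as follows. This is a simplified illustration rather than the crispio code: search_guide_sketch and its helpers are hypothetical names, matches are returned as plain dicts in an eager list, and pam_start is reported in forward-strand coordinates by assumption.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def search_guide_sketch(guide_seq, genome, pam_search="NGG"):
    """Hypothetical helper: find a guide followed directly by a PAM on either strand."""
    pam_regex = "".join(IUPAC[base] for base in pam_search)
    # Lookahead so that overlapping occurrences are all reported.
    pattern = re.compile(f"(?=({re.escape(guide_seq)})({pam_regex}))")
    matches = []
    for reverse, strand in ((False, genome), (True, revcomp(genome))):
        for hit in pattern.finditer(strand):
            pam_pos = hit.start() + len(guide_seq)  # PAM start on the scanned strand
            if reverse:
                pam_start = len(genome) - (pam_pos + len(pam_search))
            else:
                pam_start = pam_pos
            matches.append(
                {"pam_seq": hit.group(2), "pam_start": pam_start, "reverse": reverse}
            )
    if not matches:
        raise ValueError("Guide not found in genome with an appropriate PAM.")
    return matches


# The example genome is its own reverse complement, so the guide matches on both strands.
for match in search_guide_sketch("TTTTTTTAAAAAAA", "CCGTTTTTTTAAAAAAACGG"):
    print(match)
```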
crispio.plot module
crispio.utils module
Utilities for crispio package.