We recommend running the examples below in Jupyter.
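1. Match a specific amino acid sequence with a regular expression
import re

# Amino acid sequence
seq = "VSVLTMFRYAGWLDRLYMLVGTQLAAIIHGVALPLMMLI"

# Search for A or G followed by W. Inside a character class "|" is a literal
# character, so the class is written [AG] rather than [A|G].
match = re.search(r"[AG]W", seq)

# Print the match object
print(match)
# <re.Match object; span=(10, 12), match='GW'>

if match:
    # Print the start and end positions of the match
    print(match.start())  # 10
    print(match.end())    # 12
    # Print the matched amino acids
    print(match.group())  # GW
else:
    print("no match!")

2. Find all matching amino acid sequences with a regular expression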
import re

seq = "RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP"

# Match four consecutive residues: R first, any residue second,
# S or T third, and anything except P fourth
matches = re.findall(r"R.[ST][^P]", seq)
print(matches)
# ['RQSA', 'RRSL', 'RPSK']

# finditer returns an iterator of match objects
match_iter = re.finditer(r"R.[ST][^P]", seq)

# Iterate over the matches
for match in match_iter:
    # Print the group and span, then the start and end positions
    print(match.group(), match.span())
    print(match.start(), match.end())
# RQSA (0, 4)
# 0 4
# RRSL (18, 22)
# 18 22
# RPSK (40, 44)
# 40 44
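The finditer loop above can be wrapped in a small reusable helper. The sketch below is only an illustration: the function name find_motifs is hypothetical and not part of the original post, and reporting 1-based inclusive residue positions is an assumption about how the results would be used.

```python
import re


def find_motifs(seq, pattern):
    """Return (matched_text, start, end) for each non-overlapping match.

    start and end are 0-based, end-exclusive, exactly as re reports them.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, seq)]


seq = "RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP"
hits = find_motifs(seq, r"R.[ST][^P]")

for text, start, end in hits:
    # Convert to 1-based inclusive positions (an assumed reporting convention)
    print(f"{text}: residues {start + 1}-{end}")
```

Note that re.finditer, like re.findall, only reports non-overlapping matches, so a motif that starts inside a previous hit is not returned.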
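3. Split a string on multiple special characters with a regular expression
import re

# Split the string on the special characters | and ;
annotation = "ATOM:CA|RES:ALA|CHAIN:B;NUMRES:166"
split_string = re.split(r"[|;]", annotation)
print(split_string)
# ['ATOM:CA', 'RES:ALA', 'CHAIN:B', 'NUMRES:166']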
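Once the annotation is split, each field still holds a key and a value joined by a colon. As a minimal sketch, assuming every field has the form KEY:VALUE, the fields can be turned into a dictionary. The helper name parse_annotation is hypothetical and not part of the original post.

```python
import re


def parse_annotation(annotation):
    """Split on '|' or ';', then split each field once on ':' into key/value.

    Assumes every non-empty field contains a ':'; a field without one
    would raise a ValueError when the dict is built.
    """
    fields = re.split(r"[|;]", annotation)
    return dict(field.split(":", 1) for field in fields if field)


annotation = "ATOM:CA|RES:ALA|CHAIN:B;NUMRES:166"
parsed = parse_annotation(annotation)
print(parsed)
# {'ATOM': 'CA', 'RES': 'ALA', 'CHAIN': 'B', 'NUMRES': '166'}
```

The values stay as strings, so a numeric field such as NUMRES would need an explicit int() conversion before arithmetic.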
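4. Extract the chromosome count, bands, and CNV size from a karyotype with regular expressions
import re

karyotype1 = "46,XY; -11{p11.2-p13, 48.32Mb}"
karyotype2 = "47,XXX; X{3};-11{p11.2-p13.2, 48.32Mb}"

#### Match the chromosome count ####
match = re.search(r"(\d+,\w+);", karyotype1)
print(match)
# <re.Match object; span=(0, 6), match='46,XY;'>

chrom = match.group(1)
print(chrom)
# 46,XY

#### Match the start band, end band, and CNV size ####
# A band is p or q followed by a number, or the terminal label pter / qter.
# Alternation is written with (a|b); inside [...] the "|" would be a literal.
match2 = re.search(
    r"(pter|[pq]\d+(?:\.\d+)?)-(qter|[pq]\d+(?:\.\d+)?), (\d+(?:\.\d+)?)Mb",
    karyotype2,
)
print(match2)

cyto_start = match2.group(1)
cyto_end = match2.group(2)
size = match2.group(3)

print(cyto_start)
# p11.2
print(cyto_end)
# p13.2
print(size)
# 48.32

5. Extract string content in a specified format with regular expressions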
import re
import pandas as pd

# Header description lines from a structural variant VCF file
string = """##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=INVDUP,Description="InvertedDUP with unknown boundaries">
##ALT=<ID=TRA,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##FILTER=<ID=UNRESOLVED,Description="An insertion that is longer than the read and thus we cannot predict the full size.">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Type of approach used to detect SV">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SEQ,Number=1,Type=String,Description="Extracted sequence from the best representative read.">
##INFO=<ID=STRANDS2,Number=4,Type=Integer,Description="alt reads first + ,alt reads first -,alt reads second + ,alt reads second -.">
##INFO=<ID=REF_strand,Number=.,Type=Integer,Description="plus strand ref, minus strand ref.">
##INFO=<ID=Strandbias_pval,Number=A,Type=Float,Description="P-value for fisher exact test for strand bias.">
##INFO=<ID=STD_quant_start,Number=A,Type=Float,Description="STD of the start breakpoints across the reads.">
##INFO=<ID=STD_quant_stop,Number=A,Type=Float,Description="STD of the stop breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_start,Number=A,Type=Float,Description="Kurtosis value of the start breakpoints across the reads.">
##INFO=<ID=Kurtosis_quant_stop,Number=A,Type=Float,Description="Kurtosis value of the stop breakpoints across the reads.">
##INFO=<ID=SUPTYPE,Number=.,Type=String,Description="Type by which the variant is supported.(SR,AL,NR)">
##INFO=<ID=STRANDS,Number=A,Type=String,Description="Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency.">
##INFO=<ID=ZMW,Number=A,Type=Integer,Description="Number of ZMWs (Pacbio) supporting SV.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DR,Number=1,Type=Integer,Description="# high-quality reference reads">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality variant reads">"""

# Create an empty DataFrame
df_output = pd.DataFrame()

list_type = []
list_id = []
list_description = []

# Iterate over the header lines copied from the structural variant VCF file
for line in string.split("\n"):
    # Strip the trailing newline and surrounding whitespace; spaces inside
    # the line are kept so the descriptions stay readable
    line = line.strip()
    # Skip empty lines
    if not line:
        continue

    # Match the word characters after ##
    match = re.search(r"##(\w+)", line)
    record_type = match.group(1) if match else "ERROR"

    # Match the ID value
    match = re.search(r"ID=(\w+)", line)
    record_id = match.group(1) if match else "ERROR"

    # Match the Description value
    match = re.search(r"Description=\"(.*?)\"", line)
    description = match.group(1) if match else "ERROR"

    # Append to the lists
    list_type.append(record_type)
    list_id.append(record_id)
    list_description.append(description)

print(list_description)

# Add the lists to the DataFrame
df_output["Type"] = list_type
df_output["ID"] = list_id
df_output["Description"] = list_description

# Save to Excel
df_output.to_excel("结构变异描述信息说明.xlsx", index=False)

Recommended bioinformatics algorithm articles
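Bioinformatics Algorithms 1 - DNA sequencing algorithms in practice: sequence operations
Bioinformatics Algorithms 2 - DNA sequencing algorithms in practice: sequence statistics
Bioinformatics Algorithms 3 - Building a sequence alignment index with the k-mer algorithm
Bioinformatics Algorithms 4 - Algorithms for obtaining overlap sequence indexes and sequences
Bioinformatics Algorithms 5 - Sequence alignment: the global alignment algorithm
Bioinformatics Algorithms 6 - Counting aligned read bases and their percentages
Bioinformatics Algorithms 7 - Reading, writing, and searching nucleic acid FASTA and protein PDB files
Bioinformatics Algorithms 8 - HGVS conversion and the amino acid alphabet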