Learning the Python 3 Standard Library: re

python

2017/12/03

re

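Python's re module provides powerful Perl-style regular expressions. Regular expressions greatly extend what Python can do with strings, which makes re an indispensable part of the standard library.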

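Special characters in regular expressions

Basic metacharacters

. : matches any single character except a newline (with DOTALL it also matches a newline)
^ : matches the start of the string, or the start of each line in MULTILINE mode
$ : matches the end of the string (or just before a trailing newline), or the end of each line in MULTILINE mode
* : matches 0 or more repetitions of the preceding expression
+ : matches 1 or more repetitions of the preceding expression
? : matches 0 or 1 repetition of the preceding expression
*?/+?/?? : non-greedy versions of *, + and ?, which match as few characters as possible
{m} : matches exactly m repetitions of the preceding expression
{m,n} : matches at least m and at most n repetitions of the preceding expression
{m,n}? : non-greedy version of {m,n}, which matches as few repetitions as possible
\ : escape character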
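
The difference between greedy and non-greedy quantifiers, and between {m} and {m,n}, is easiest to see on a small input. The following is a minimal sketch; the sample strings are made up for illustration.

```python
import re

text = "<b>bold</b><i>italic</i>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", text).group()

# {m} means exactly m repetitions, so four digits do not fully match \d{3}.
exactly_three = re.fullmatch(r"\d{3}", "123") is not None
too_many = re.fullmatch(r"\d{3}", "1234") is not None

# {m,n} means between m and n repetitions (here: whole numbers of 2 to 4 digits).
two_to_four = re.findall(r"\b\d{2,4}\b", "1 22 333 4444 55555")

print(greedy)         # <b>bold</b><i>italic</i>
print(lazy)           # <b>
print(exactly_three)  # True
print(too_many)       # False
print(two_to_four)    # ['22', '333', '4444']
```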

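Logic and grouping metacharacters

[] : character set
[amk] matches a, m or k
[1-9], [a-zA-Z] match any digit from 1 to 9, and any letter a-z or A-Z
[\d], [\w] match any digit, and any word character (letter, digit or underscore)
[^^] matches any character except ^
| : alternation; A|B matches either A or B
(...) : matches the regular expression inside the parentheses and captures it as a group whose contents can be retrieved later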
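
A short sketch of character sets, alternation and capturing groups in use. The input strings are invented for the example.

```python
import re

# A set matches one character out of the listed ones.
vowels = re.findall(r"[aeiou]", "regex")

# A leading ^ inside the set negates it.
non_digits = re.findall(r"[^0-9]", "a1b2")

# | tries the alternatives from left to right.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# (...) captures the text it matched so it can be read back by group number.
m = re.search(r"(\d{4})-(\d{2})", "released 2017-12")
year, month = m.group(1), m.group(2)

print(vowels)       # ['e', 'e']
print(non_digits)   # ['a', 'b']
print(pets)         # ['cat', 'dog']
print(year, month)  # 2017 12
```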

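Extension constructs

(?...) : extension notation; the character after ? determines the meaning and further syntax of the construct. Extensions usually do not create a group; (?P<name>...) is the exception
(?aiLmsux) : inline flags: a (ASCII-only matching), i (ignore case), L (locale dependent), m (multi-line; ^ and $ match at the start and end of every line), s (DOTALL), u (Unicode matching), x (verbose; the pattern may be split across lines and contain comments)
(?:...) : non-capturing version of (...)
(?P<name>...) : named group
(?#...) : a comment; the contents of the parentheses are ignored
A(?=B) : matches A only if it is followed by B, without consuming B
A(?!B) : matches A only if it is not followed by B, without consuming B
(?<=A)B : matches B only if it is preceded by A, without consuming A
(?<!A)B : matches B only if it is not preceded by A, without consuming A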
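
The four lookaround forms can be compared side by side on one string. This is an illustrative sketch; the price list is a made-up input.

```python
import re

prices = "apple $30, pear 45 EUR, plum $7"

# (?<=\$)\d+ : digits preceded by '$'; the '$' itself is not part of the match.
dollars = re.findall(r"(?<=\$)\d+", prices)

# \d+(?= EUR) : digits followed by ' EUR'; ' EUR' is not consumed.
euros = re.findall(r"\d+(?= EUR)", prices)

# \b\d+\b(?! EUR) : whole numbers that are NOT followed by ' EUR'.
not_euros = re.findall(r"\b\d+\b(?! EUR)", prices)

# (?<!\$)\b\d+ : numbers that are NOT preceded by '$'.
not_dollars = re.findall(r"(?<!\$)\b\d+", prices)

print(dollars)      # ['30', '7']
print(euros)        # ['45']
print(not_euros)    # ['30', '7']
print(not_dollars)  # ['45']
```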

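Predefined escape sequences

\number : matches the same text as the group with that number (backreference)
\A : matches only at the start of the string; same as ^ outside MULTILINE mode
\b : matches the empty string at the beginning or end of a word
\B : matches the empty string when it is not at the beginning or end of a word
\d : matches any decimal digit; same as [0-9] in ASCII mode
\D : matches any non-digit character; same as [^0-9] in ASCII mode
\s : matches any whitespace character; same as [ \t\n\r\f\v] in ASCII mode
\S : matches any non-whitespace character; same as [^ \t\n\r\f\v] in ASCII mode
\w : matches any word character; same as [a-zA-Z0-9_] in ASCII mode
\W : matches any non-word character; same as [^a-zA-Z0-9_] in ASCII mode
\Z : matches only at the end of the string; unlike $, it does not match before a trailing newline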
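
A few of these escapes behave differently from what their one-line descriptions might suggest, in particular \b versus \B and $ versus \Z. A minimal sketch with invented inputs:

```python
import re

# \b anchors at a word boundary, so the 'cat' inside 'concat' is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \B is the opposite: here 'cat' must NOT start at a word boundary.
inner = re.findall(r"\Bcat", "cat concat")

# \w includes the underscore, so snake_case stays in one piece.
words = re.findall(r"\w+", "snake_case, CamelCase")

# \1 refers back to whatever group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# $ also matches just before a trailing newline; \Z does not.
dollar_hit = re.search(r"\d$", "5\n") is not None
z_hit = re.search(r"\d\Z", "5\n") is not None

print(whole_words)        # ['cat', 'cat']
print(inner)              # ['cat']
print(words)              # ['snake_case', 'CamelCase']
print(doubled)            # is
print(dollar_hit, z_hit)  # True False
```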

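Functions in the re module

compile(pattern,flags=0)
Compiles a regular expression pattern into a regular expression object (regex), whose methods can then be used for matching.

>>> import re
>>> pattern=re.compile(r'[a-z]+(?=\d)')
>>> result=pattern.search('test1234')
>>> result.group()
'test'

This is equivalent to:

result=re.search(r'[a-z]+(?=\d)','test1234')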

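search(pattern,string,flags=0)
Scans through the string and returns a match object for the first location where the pattern matches, or None if there is no match.

>>> result=re.search(r'\d+','test1234')
>>> result.group()
'1234'

match(pattern,string,flags=0)
Matches only at the beginning of the string and returns a match object, or None if the pattern does not match there.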

>>> result=re.match(r'\d+','test1234')
>>> result.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> result=re.match(r'\d+','1234test')
>>> result.group()
'1234'
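
Because match() and search() return None when nothing matches, calling .group() directly raises the AttributeError seen above. One common way to guard against that is sketched here; the helper name leading_number is hypothetical and not part of the re module.

```python
import re

def leading_number(text):
    """Return the digits at the start of text, or None if there are none."""
    m = re.match(r"\d+", text)
    # A match object is truthy, None is falsy, so this check is enough.
    return m.group() if m else None

print(leading_number("1234test"))  # 1234
print(leading_number("test1234"))  # None
```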

fullmatch(pattern,string,flags=0)
Returns a match object if the whole string matches the pattern, otherwise returns None.

>>> re.fullmatch(r'\d+','1234test')
>>> re.fullmatch(r'\w+','1234test')
<_sre.SRE_Match object; span=(0, 8), match='1234test'>
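
fullmatch() is a natural fit for input validation, since no ^ or $ anchors are needed. A minimal sketch, assuming a hypothetical is_hex_color helper and a six-digit hex colour format chosen only for illustration:

```python
import re

# Hypothetical format for the example: '#' followed by exactly six hex digits.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")

def is_hex_color(value):
    """True only if the entire string is a six-digit hex colour."""
    return HEX_COLOR.fullmatch(value) is not None

print(is_hex_color("#1a2B3c"))   # True
print(is_hex_color("#1a2B3c;"))  # False, trailing character
print(is_hex_color("#12345"))    # False, too short
```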

split(pattern,string,maxsplit=0,flags=0)
Splits the string at every occurrence of the pattern. If capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the last element of the list.

>>> re.split(r'\W','Hello,world!')
['Hello', 'world', '']
>>> re.split(r'\W','Hello,world!',maxsplit=1)
['Hello', 'world!']
>>> re.split(r'(\W)','Hello,world!')
['Hello', ',', 'world', '!', '']
>>> re.split(r'\d+','a2b3d5')
['a', 'b', 'd', '']

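findall(pattern,string,flags=0)
Returns a list of all non-overlapping matches of the pattern in the string. If the pattern contains a group, the list holds the text of that group instead of the whole match.

>>> re.findall(r'<text>(.*?)</text>','<text>hello</text><text>world</text>')
['hello', 'world']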
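
The shape of the list returned by findall() depends on how many groups the pattern has. The sketch below uses an invented key=value string to show the three cases.

```python
import re

text = "a=1, b=22"

# No groups: each item is the whole match.
whole = re.findall(r"\w+=\d+", text)

# One group: each item is the text of that group.
values = re.findall(r"\w+=(\d+)", text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)   # ['a=1', 'b=22']
print(values)  # ['1', '22']
print(pairs)   # [('a', '1'), ('b', '22')]
```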

finditer(pattern,string,flags=0)
Returns an iterator whose elements are match objects.

>>> re.finditer(r'<text>(.*?)</text>','<text>hello</text><text>world</text>')
<callable_iterator object at 0x10a374c18>
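
The iterator itself is not very informative when printed; it is meant to be looped over. A small sketch that collects the group text and position of every match from the same sample string:

```python
import re

text = '<text>hello</text><text>world</text>'

found = []
for m in re.finditer(r'<text>(.*?)</text>', text):
    # Each m is a full match object, so positions are available too.
    found.append((m.group(1), m.span()))

print(found)  # [('hello', (0, 18)), ('world', (18, 36))]
```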

sub(pattern,repl,string,count=0,flags=0)
Replaces matches of the pattern with repl, similar to the replace method of strings.

>>> re.sub(r'\d+','word','hello 1234')
'hello word'
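
Unlike str.replace, the repl argument of sub() can refer to captured groups or be a function that receives each match object. A minimal sketch; the helper name double is hypothetical.

```python
import re

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello python")

def double(match):
    """Replacement function: called once per match, must return a string."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples, 10 pears")

# count limits how many replacements are made.
first_only = re.sub(r"\d+", "N", "1 2 3", count=1)

print(swapped)     # python hello
print(doubled)     # 6 apples, 20 pears
print(first_only)  # N 2 3
```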

subn(pattern,repl,string,count=0,flags=0)
Same as sub, but returns a tuple (new_string, number_of_subs_made).

>>> re.subn(r'\d+','word','hello 1234')
('hello word', 1)
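
escape(pattern)
Adds the escape character \ in front of the special characters in pattern. Since Python 3.7 only characters that can have a special meaning in a regular expression are escaped; earlier versions escaped everything except ASCII letters, digits and '_'.

>>> re.escape(r'python.exe')
'python\\.exe'
>>> re.escape(r'\d+')
'\\\\d\\+'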
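
escape() matters most when a literal string, for example one typed by a user, has to be searched for as-is. The sketch below assumes a made-up literal "1+1=2" to show what goes wrong without escaping.

```python
import re

literal = "1+1=2"          # hypothetical text that should be matched literally
haystack = "is 1+1=2?"

# Unescaped, '+' is a quantifier, so the pattern means "one or more 1s, then 1=2".
unescaped = re.search(literal, haystack)

# Escaped, every special character is taken literally.
escaped = re.search(re.escape(literal), haystack)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```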

re.purge()
Clears the regular expression cache.

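flags

re.ASCII: (?a) ASCII-only matching
re.DEBUG: prints debug information about the compiled expression
re.IGNORECASE: (?i) case-insensitive matching
re.LOCALE: (?L) locale-dependent matching
re.MULTILINE: (?m) multi-line mode; ^ and $ also match at the start and end of each line
re.DOTALL: (?s) makes . match a newline as well
re.VERBOSE: (?x) allows whitespace and comments inside the pattern, so that long patterns can be laid out readably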
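
Flags are combined with |, and most of them can also be written inline at the start of the pattern. The following sketch uses an invented two-line string to show the effect of the most common ones.

```python
import re

text = "Name: alice\nname: Bob"

# IGNORECASE and MULTILINE together: ^ and $ work per line, case is ignored.
names = re.findall(r"^name: (\w+)$", text, re.IGNORECASE | re.MULTILINE)

# Without MULTILINE, ^ and $ only anchor at the ends of the whole string.
without_m = re.findall(r"^name: (\w+)$", text, re.IGNORECASE)

# DOTALL lets '.' cross the line break.
no_dotall = re.search(r"alice.name", text)
with_dotall = re.search(r"alice.name", text, re.DOTALL)

# The inline form (?i) is equivalent to passing re.IGNORECASE.
inline = re.search(r"(?i)NAME", text).group()

# VERBOSE ignores whitespace in the pattern and allows # comments.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
parts = date.search("2017-12").groups()

print(names)                    # ['alice', 'Bob']
print(without_m)                # []
print(no_dotall)                # None
print(with_dotall is not None)  # True
print(inline)                   # Name
print(parts)                    # ('2017', '12')
```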

Methods and attributes of regex objects

regex.search(string[, pos[, endpos]])
regex.match(string[, pos[, endpos]])
regex.fullmatch(string[, pos[, endpos]])
regex.split(string, maxsplit=0)
regex.findall(string[, pos[, endpos]])
regex.finditer(string[, pos[, endpos]])
regex.sub(repl, string, count=0)
regex.subn(repl, string, count=0)
regex.flags: the flags used for matching

regex.groups: the number of capturing groups in the pattern

>>> pattern=re.compile(r'[a-z]+(?=\d)')
>>> pattern.groups
0
>>> pattern=re.compile(r'[a-z]+(\d+)(?=\d)')
>>> pattern.groups
1

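regex.groupindex: a dictionary mapping the group names defined by (?P<name>...) to group numbers

regex.pattern: the pattern string from which the regex object was compiled

>>> pattern.pattern
'[a-z]+(\\d+)(?=\\d)'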
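
The optional pos and endpos arguments of the regex methods, and the groupindex attribute, are not demonstrated above. A minimal sketch, assuming an invented key=value pattern:

```python
import re

pattern = re.compile(r"(?P<key>[a-z]+)=(?P<value>\d+)")
text = "a=1 b=2"

# pos: start searching at index 3, so the first pair is skipped.
from_pos = pattern.search(text, 3).group()

# endpos: stop searching at index 3, so only the first pair is visible.
until_endpos = pattern.search(text, 0, 3).group()

# match() with pos anchors at that index instead of at index 0.
anchored = pattern.match(text, 4).group()

# groupindex maps group names to group numbers.
index = dict(pattern.groupindex)

print(from_pos)        # b=2
print(until_endpos)    # a=1
print(anchored)        # b=2
print(index)           # {'key': 1, 'value': 2}
print(pattern.groups)  # 2
```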

Methods and attributes of match objects

match.expand(template)
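
expand() fills in a template string using the same backslash substitutions as sub(): \1 and \2 for numbered groups, \g<name> for named ones. A short sketch with made-up group names:

```python
import re

m = re.search(r"(?P<first>\w+) (?P<second>\w+)", "hello python")

# Numbered backreferences in the template.
by_number = m.expand(r"\2, \1")

# Named backreferences in the template.
by_name = m.expand(r"\g<second>-\g<first>")

print(by_number)  # python, hello
print(by_name)    # python-hello
```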

match.group([group1, …]): returns the match for one or more groups (with no arguments, returns the whole match)

>>> m=re.search(r"(\w+) (\w+)",'hello python')
>>> m.groups()
('hello', 'python')
>>> m.group()
'hello python'
>>> m.group(0)
'hello python'
>>> m.group(1)
'hello'
>>> m.group(2)
'python'

match.groups(default=None): returns a tuple containing all the subgroups of the match

>>> m=re.search(r"(\w+) (\w+)",'hello python')
>>> m.groups()
('hello', 'python')

match.groupdict(default=None): returns a dictionary mapping group names to the matched values

>>> m=re.search(r"(?P<hostname>\w+) (?P<ip>\d+\.\d+\.\d+\.\d+)",'database 127.0.0.1')
>>> m.groupdict()
{'hostname': 'database', 'ip': '127.0.0.1'}
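
The default argument of groups() and groupdict() only matters when a group did not take part in the match, which the examples above do not cover. A minimal sketch with an optional group; the group names whole and frac are invented for the example.

```python
import re

# The fractional part is optional, so for "24" the second group is unmatched.
m = re.match(r"(?P<whole>\d+)(?P<frac>\.\d+)?", "24")

plain = m.groups()                   # unmatched groups come back as None
with_default = m.groups("0")         # or as the given default
as_dict = m.groupdict(default="")
several = m.group("whole", "frac")   # group() accepts several names or numbers

print(plain)         # ('24', None)
print(with_default)  # ('24', '0')
print(as_dict)       # {'whole': '24', 'frac': ''}
print(several)       # ('24', None)
```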

match.start([group]): the index of the first character of the match

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> m.group()
'remove_this'
>>> m.start()
7

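match.end([group]): the index just past the last character of the match

>>> m.end()
18

match.span([group]): returns the tuple (m.start(group), m.end(group))

>>> m.span()
(7, 18)

match.pos: the pos value passed to match or search, i.e. the index at which matching started
match.endpos: the endpos value passed to match or search, i.e. the index beyond which the engine does not look
match.lastindex: the number of the last matched capturing group, or None if no group was matched
match.lastgroup: the name of the last matched capturing group, or None if it has no name or no group was matched
match.re: the regex object that produced this match

>>> m.re
re.compile('remove_this')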

match.string: the string that was matched against

>>> m.string
'tony@tiremove_thisger.net'
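
The pos, endpos, lastindex and lastgroup attributes are listed above without an example. The sketch below assumes an invented two-alternative pattern so that lastgroup shows which branch matched.

```python
import re

pattern = re.compile(r"(?P<word>[a-z]+)|(?P<number>\d+)")
text = "  42 apples"

# Start searching at index 1; endpos defaults to len(text).
m = pattern.search(text, 1)

print(m.group())     # 42
print(m.span())      # (2, 4)
print(m.pos)         # 1
print(m.endpos)      # 11
print(m.lastindex)   # 2
print(m.lastgroup)   # number
print(m.re is pattern)   # True
print(m.string is text)  # True
```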

Original author: Zhang Jusene

Original link: http://jusene.github.io/2017/12/03/python8/

Published: December 3rd 2017, 4:19:33 pm

Updated: November 30th 2019, 2:07:54 am
