Python functions for detecting Chinese text and for detecting English text (Part 2)

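<type 'unicode'> 15 我的 English 学的不好

One thing to watch out for: although Unicode is billed as a "unified code", it still comes in two forms:

UCS-2: a 16-bit code with 2^16 = 65536 code points;
UCS-4: a 32-bit code. The current rule is that the top bit of its first byte is 0, so it has 2^31 = 2147483648 code points, but only the code points from 0x00000000 to 0x0010FFFF are in use today, 1114112 in total.

The value of the maxunicode variable in Python's sys module tells you whether the running Python uses UCS-2 or UCS-4 Unicode:

    import sys
    print(sys.maxunicode)

If sys.maxunicode is 1114111, Python is using UCS-4; if it is 65535, Python is using UCS-2.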
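The check described above can be wrapped in a small helper. This is only a sketch: the name unicode_build and its optional parameter are assumptions made for illustration and do not appear in the original article. Note that on Python 3.3 and later sys.maxunicode is always 1114111, so the distinction only matters on older interpreters.

```python
import sys

# Largest code point of each build type, as given in the text above.
UCS2_MAX = 0xFFFF      # 65535
UCS4_MAX = 0x10FFFF    # 1114111


def unicode_build(max_code_point=None):
    """Return 'UCS-2' or 'UCS-4' for a sys.maxunicode value.

    Hypothetical helper: with no argument it inspects the running interpreter.
    """
    if max_code_point is None:
        max_code_point = sys.maxunicode
    return "UCS-4" if max_code_point > UCS2_MAX else "UCS-2"


print(unicode_build())
```

Passing the value explicitly, as in unicode_build(65535), makes it possible to try the helper without needing a narrow build.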
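2. Separating a mixed Chinese-English string

Once the Chinese and English text share a single encoding, splitting them apart is straightforward. First prepare one collector for Chinese text and one for English text; two empty string objects will do, say zh_gather and en_gather. Then prepare a list object that stores the values of zh_gather and en_gather in the order they are split off. The following Python function takes a mixed Chinese-English Unicode string and returns a list of its Chinese and English substrings.

    def split_zh_en(zh_en_str):
        # mark is the tag dict defined further down in this article
        zh_en_group = []   # result: [mark, substring] pairs in order of appearance
        zh_gather = ""     # collects the current run of Chinese characters
        en_gather = ""     # collects the current run of English characters
        zh_status = False  # True while the current run is Chinese

        for c in zh_en_str:
            if not zh_status and is_zh(c):
                # English -> Chinese: flush the English run
                zh_status = True
                if en_gather != "":
                    zh_en_group.append([mark["en"], en_gather])
                    en_gather = ""
            elif not is_zh(c) and zh_status:
                # Chinese -> English: flush the Chinese run
                zh_status = False
                if zh_gather != "":
                    zh_en_group.append([mark["zh"], zh_gather])
                    zh_gather = ""

            if zh_status:
                zh_gather += c
            else:
                en_gather += c

        # Flush whichever run was still being collected at the end
        if en_gather != "":
            zh_en_group.append([mark["en"], en_gather])
        elif zh_gather != "":
            zh_en_group.append([mark["zh"], zh_gather])

        return zh_en_group

In detail, the code walks through the mixed string zh_en_str and classifies it one character at a time: if the current character is Chinese it is appended to zh_gather, and if it is English it is appended to en_gather. zh_status records which of the two kinds of run is currently being collected; whenever its value flips, the Chinese or English substring collected so far is appended to zh_en_group.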
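The same run-by-run grouping can also be sketched with itertools.groupby from the standard library. This is an alternative formulation, not the article's code. The names MARK, is_zh_simple and split_zh_en_groupby are made up for this example, and is_zh_simple is a deliberately simplified stand-in that only recognises the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from itertools import groupby

MARK = {"en": 1, "zh": 2}  # same tag values as the article's mark dict


def is_zh_simple(c):
    # Simplified assumption: only the basic CJK Unified Ideographs block.
    return 0x4E00 <= ord(c) <= 0x9FFF


def split_zh_en_groupby(text):
    """Split text into [tag, substring] pairs of consecutive zh / non-zh runs."""
    groups = []
    # groupby yields one (key, iterator) pair per run of equal is_zh_simple results
    for zh, run in groupby(text, key=is_zh_simple):
        groups.append([MARK["zh"] if zh else MARK["en"], "".join(run)])
    return groups


groups = split_zh_en_groupby(u"我的 English 学的不好")
```

On Python 3 this leaves groups equal to [[2, '我的'], [1, ' English '], [2, '学的不好']]. The spaces around "English" end up in the English run because every character that is not classified as Chinese is treated as English.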
The conditions that test whether a character of zh_en_str is Chinese call a function is_zh(), which is implemented as follows:

    def is_zh(c):
        x = ord(c)
        # Punctuation and radicals
        if x >= 0x2e80 and x <= 0x33ff:
            return True
        # Fullwidth Latin characters
        elif x >= 0xff00 and x <= 0xffef:
            return True
        # CJK Unified Ideographs (the source also labels this range
        # "Extension A", but Extension A, U+3400 to U+4DBF, lies outside it)
        elif x >= 0x4e00 and x <= 0x9fbb:
            return True
        # CJK Compatibility Ideographs
        elif x >= 0xf900 and x <= 0xfad9:
            return True
        # CJK Unified Ideographs Extension B
        elif x >= 0x20000 and x <= 0x2a6d6:
            return True
        # CJK Compatibility Supplement
        elif x >= 0x2f800 and x <= 0x2fa1d:
            return True
        else:
            return False

This code comes from the XeTeX preprocessor written by jjgod.
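The chain of range checks in is_zh can also be written as a lookup over a table of ranges, which makes the covered blocks easier to read and extend. The sketch below copies the six ranges from is_zh unchanged. The names ZH_RANGES and is_zh_table are assumptions made for illustration and are not part of jjgod's program.

```python
# (first, last) code point pairs, copied from the is_zh ranges above.
ZH_RANGES = (
    (0x2E80, 0x33FF),    # punctuation and radicals
    (0xFF00, 0xFFEF),    # fullwidth Latin characters
    (0x4E00, 0x9FBB),    # CJK Unified Ideographs
    (0xF900, 0xFAD9),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6D6),  # CJK Unified Ideographs Extension B
    (0x2F800, 0x2FA1D),  # CJK Compatibility Supplement
)


def is_zh_table(c):
    """Return True if the single character c falls in any listed range."""
    x = ord(c)
    return any(first <= x <= last for first, last in ZH_RANGES)
```

With this table, is_zh_table(u"中") is true and is_zh_table(u"A") is false. Fullwidth punctuation such as "，" (U+FF0C) also counts as Chinese, because it falls in the fullwidth range.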
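For convenience, I tag the separated Chinese and English substrings when storing them in the zh_en_group list, using mark["zh"] and mark["en"] respectively. mark is a dict object defined as follows:

    mark = {"en": 1, "zh": 2}

When you later process the English or Chinese strings in zh_en_group, the tag lets you tell quickly whether a string is Chinese or English, for example:

    for item in zh_en_group:
        if item[0] == mark["en"]:
            pass  # do something with the English substring item[1]
        else:
            pass  # do something with the Chinese substring item[1]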
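As one possible use of the tags, the sketch below rebuilds a string from a tagged list and wraps every Chinese run in a pair of delimiters. The function name wrap_zh, its parameters, and the \zhfont macro are all hypothetical and chosen only for illustration; the article does not say what its processing step actually does.

```python
mark = {"en": 1, "zh": 2}  # the tag dict from the article


def wrap_zh(groups, before=u"{\\zhfont ", after=u"}"):
    """Join tagged substrings, wrapping each Chinese run in before/after.

    The default delimiters use a made-up \\zhfont macro as a placeholder.
    """
    parts = []
    for tag, text in groups:
        if tag == mark["zh"]:
            parts.append(before + text + after)
        else:
            parts.append(text)
    return u"".join(parts)


# A tagged list in the [mark, substring] shape that split_zh_en returns.
groups = [[2, u"我的"], [1, u" English "], [2, u"学的不好"]]
wrapped = wrap_zh(groups)
```

Here wrapped becomes {\zhfont 我的} English {\zhfont 学的不好}, with the English run passed through unchanged.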
Checking whether a string is Chinese in Python

Method 1:

isinstance(s, str) checks whether s is an ordinary (byte) string
isinstance(s, unicode) checks whether s is a unicode string

or

    if type(str).__name__ != "unicode":
        str = unicode(str, "utf-8")
    else: