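Compiled and verified from cat_book_milk's blog.
- Checking whether a string consists only of Chinese characters (a regex that matches one or more Chinese characters)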
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com
* Created Time: Mon May 7 20:18:46 2018
************************************************************************/
public class TestChineseInJava {
    public static void main(String[] args) {
        String allAscii = "China will win in the war of trade with the U.S.A.";
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢";
        String mixed = "芯片是 IT 行业的命脉";
        // Matches one or more characters in the range U+4E00 to U+9FA5
        String regex = "[\\u4e00-\\u9fa5]+";
        System.out.println("String: allAscii should be false, actually is: " + allAscii.matches(regex));
        System.out.println("String: allChinese should be true, actually is: " + allChinese.matches(regex));
        System.out.println("String: chineseWithComma should be false, actually is: " + chineseWithComma.matches(regex));
        System.out.println("String: mixed should be false, actually is: " + mixed.matches(regex));
    }
}
Output: only allChinese prints true; every other line prints false.
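The Unicode code points of the commonly used Chinese characters run from \u4e00 to \u9fa5, so [\\u4e00-\\u9fa5]+ matches a run of one or more Chinese characters. Reference: the CJK (Chinese, Japanese, Korean) Unicode code chart for font editing.
Full-width characters, such as the full-width comma, fall outside this range and are not matched.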
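The pure-Chinese check above can be wrapped in a small helper. This is only a sketch: the class name ChineseCheck and the method isAllChinese are hypothetical names that do not appear in the original post, and the pattern is compiled once instead of calling String.matches on every check. The full-width comma is written as an escape so that it cannot be confused with the ASCII comma.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original post.
final class ChineseCheck {
    // Same range as in the article: U+4E00 to U+9FA5, one or more characters.
    private static final Pattern ALL_CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]+");

    // Returns true only if the whole string consists of characters in that range.
    static boolean isAllChinese(String s) {
        return s != null && ALL_CHINESE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllChinese("爱我中华智造国芯"));       // true
        System.out.println(isAllChinese("芯片是 IT 行业的命脉"));  // false: contains ASCII
        System.out.println(isAllChinese("你好\uFF0C世界"));        // false: full-width comma
        System.out.println(isAllChinese(""));                      // false: + needs at least one char
    }
}
```

An empty string is rejected because the + quantifier requires at least one character; whether that is the desired behavior depends on the caller.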
- Extracting the Chinese characters from a string (using replaceAll to remove every non-Chinese character)
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com
* Created Time: Mon May 7 20:18:46 2018
************************************************************************/
public class TestChineseInJava {
    public static void main(String[] args) {
        String allAscii = "China will win in the war of trade with the U.S.A.";
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢";
        String mixed = "芯片是 IT 行业的命脉";
        String regex = "[\\u4e00-\\u9fa5]+";
        String regex2 = "[^\\u4e00-\\u9fa5]"; // matches any single non-Chinese character
        System.out.println("String: allAscii should be false, actually is: " + allAscii.matches(regex));
        System.out.println("String: allChinese should be true, actually is: " + allChinese.matches(regex));
        System.out.println("String: chineseWithComma should be false, actually is: " + chineseWithComma.matches(regex));
        System.out.println("String: mixed should be false, actually is: " + mixed.matches(regex));
        // Deleting every non-Chinese character leaves only the Chinese ones
        System.out.println("Extracted Chinese characters: " + mixed.replaceAll(regex2, ""));
    }
}
The key is to replace every non-Chinese character with the empty string; what remains is exactly the Chinese characters.
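The replacement trick can be isolated in a single method. The following is a minimal sketch; the class name ChineseExtract and the method keepChinese are hypothetical and only serve the illustration.

```java
// Hypothetical helper for illustration; the names are not from the original post.
final class ChineseExtract {
    // Removes every char outside U+4E00..U+9FA5, so only Chinese characters remain.
    static String keepChinese(String s) {
        return s.replaceAll("[^\\u4e00-\\u9fa5]", "");
    }

    public static void main(String[] args) {
        // The space-delimited "IT" is dropped together with the spaces.
        System.out.println(keepChinese("芯片是 IT 行业的命脉")); // prints 芯片是行业的命脉
        // A string without Chinese characters becomes empty.
        System.out.println(keepChinese("abc 123").isEmpty());     // prints true
    }
}
```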
- Checking whether a string contains any Chinese characters (using the difference in encoded length)
/*************************************************************************
* File Name: TestChineseInJava.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com
* Created Time: Mon May 7 20:18:46 2018
************************************************************************/
public class TestChineseInJava {
    public static void main(String[] args) {
        String allAscii = "China will win in the war of trade with the U.S.A.";
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢";
        String mixed = "芯片是 IT 行业的命脉";
        String regex = "[\\u4e00-\\u9fa5]+";
        String regex2 = "[^\\u4e00-\\u9fa5]";
        System.out.println("String: allAscii should be false, actually is: " + allAscii.matches(regex));
        System.out.println("String: allChinese should be true, actually is: " + allChinese.matches(regex));
        System.out.println("String: chineseWithComma should be false, actually is: " + chineseWithComma.matches(regex));
        System.out.println("String: mixed should be false, actually is: " + mixed.matches(regex));
        System.out.println("Extracted Chinese characters: " + mixed.replaceAll(regex2, ""));
        // getBytes() uses the platform default charset, so the byte count depends on the environment.
        // Any multi-byte character makes the two lengths differ, not only Chinese ones.
        System.out.println("true means no Chinese characters, false means there are some. For mixed: " + (mixed.length() == mixed.getBytes().length));
    }
}
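Output: with a default charset such as UTF-8 or GBK, the last line prints false for mixed, because length() counts each Chinese character as one char while the encoded form takes more than one byte per character.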
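Because getBytes() without an argument depends on the platform default charset, a more predictable variant names the charset explicitly. The sketch below assumes UTF-8 and uses hypothetical names (NonAsciiCheck, hasMultiByteChar, containsChinese). It also shows the limit of the length trick: it flags any non-ASCII character, so a regex search is still needed to detect Chinese characters specifically.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original post.
final class NonAsciiCheck {
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]");

    // True if the UTF-8 byte count differs from the char count,
    // which happens whenever the string holds at least one non-ASCII character.
    static boolean hasMultiByteChar(String s) {
        return s.length() != s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if at least one character in U+4E00..U+9FA5 occurs anywhere in the string.
    static boolean containsChinese(String s) {
        return CHINESE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasMultiByteChar("China"));                // false
        System.out.println(hasMultiByteChar("芯片是 IT 行业的命脉")); // true
        System.out.println(hasMultiByteChar("caf\u00e9"));            // true, yet no Chinese
        System.out.println(containsChinese("caf\u00e9"));             // false
    }
}
```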
- Counting the Chinese characters (regex matching)
/*************************************************************************
* File Name: getChineseCharacters.java
* Author: Kent Lee
* Mail: kent1411390610@gmail.com
* Created Time: Mon May 7 21:23:03 2018
************************************************************************/
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class getChineseCharacters {
    public static void main(String[] args) {
        int count = 0;
        String allAscii = "China will win in the war of trade with the U.S.A.";
        String allChinese = "爱我中华智造国芯";
        String chineseWithComma = "全角逗号能否匹配为汉字,呢";
        String mixed = "芯片是 IT 行业的命脉,所以我们无论如何都不能放弃自主芯片的研究";
        String motto = "历史告诉我们中国必须走独立自主的道路:赫鲁晓夫曾说苏联拥核可以保护中国,劝中国不要研究核武,但是很快中苏交恶。国与国没有永远的蜜月,可以信任依靠的只有万众一心的人民";
        // Matches exactly one Chinese character, so each find() is one character
        String regex = "[\\u4e00-\\u9fa5]";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(mixed);
        while (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                count++;
            }
        }
        System.out.println("there are " + count + " Chinese characters in mixed: " + mixed);
        count = 0;
        matcher = pattern.matcher(motto);
        while (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                count++;
            }
        }
        System.out.println("there are " + count + " Chinese characters in motto: " + motto);
        count = 0;
        matcher = pattern.matcher(allChinese);
        while (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                count++;
            }
        }
        System.out.println("there are " + count + " Chinese characters in allChinese: " + allChinese);
    }
}
One odd observation:
for (int i = 0; i <= matcher.groupCount(); i++) { // without the = the count comes out wrong
To be resolved.
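Follow-up the next day:
After reading the Javadoc for Matcher.find(): find() works loosely like Scanner.hasNextInt(), in that it looks for the next result that matches the Pattern. groupCount() returns the number of capturing groups in the pattern, and this regex has none, so it always returns 0. Without the = the loop condition becomes i < 0, the body never runs, and the count stays at 0; with <= the body runs exactly once per match. See the Matcher.groupCount() documentation for details.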
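The behavior of groupCount() is easy to confirm in isolation. In this small sketch the class name GroupCountDemo and the method groupsIn are hypothetical; the second pattern adds parentheses purely to show how the count changes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo for illustration; the names are not from the original post.
final class GroupCountDemo {
    // groupCount() depends only on the pattern, not on the input or on any match.
    static int groupsIn(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupsIn("[\\u4e00-\\u9fa5]"));   // 0: no parentheses, no capturing group
        System.out.println(groupsIn("([\\u4e00-\\u9fa5])")); // 1: one capturing group

        // Group 0 (the whole match) is always available but is not included in groupCount().
        Matcher m = Pattern.compile("[\\u4e00-\\u9fa5]").matcher("芯片 IT");
        while (m.find()) {
            System.out.println(m.group()); // same as m.group(0)
        }
    }
}
```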
To count the number of matches, a simpler way to write it is:
int count = 0;
while (matcher.find()) {
    count++;
}
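Put together, the simpler loop gives a reusable counting method. This is a sketch under the same assumptions as the listing above (the range U+4E00 to U+9FA5, one match per character); the class name ChineseCount and the method countChinese are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original post.
final class ChineseCount {
    // One match corresponds to exactly one Chinese character.
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]");

    // Counts the matches by advancing find() until no further match exists.
    static int countChinese(String s) {
        Matcher matcher = CHINESE.matcher(s);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countChinese("爱我中华智造国芯"));       // 8
        System.out.println(countChinese("芯片是 IT 行业的命脉"));  // 8
        System.out.println(countChinese("China"));                 // 0
    }
}
```

Creating a fresh Matcher inside the method avoids having to reset the counter and reassign the matcher for each input, which the original listing does by hand.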