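Preface

During development we sometimes need to convert text between character encodings in Java. This post looks at converting between UTF-8, GBK, and similar encodings.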

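Practice

When converting between encodings, we use ISO-8859-1 to receive and hold the data, and then convert it to the target encoding.

Why use ISO-8859-1 as the intermediate storage format?

Let us verify with a program.

Using ISO-8859-1 as the intermediate encoding: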

import java.io.UnsupportedEncodingException;

public class EncodingTest {
    public static void test(String str1, String encode) throws UnsupportedEncodingException {
        System.out.println("String: " + str1);
        // Encode the string into bytes using the original charset
        byte[] byteArray1 = str1.getBytes(encode);
        System.out.println(byteArray1.length);
        // Decode those bytes as an ISO-8859-1 string
        String str2 = new String(byteArray1, "ISO-8859-1");
        System.out.println("As ISO-8859-1: " + str2);
        // Turn the ISO-8859-1 string back into a byte array
        byte[] byteArray2 = str2.getBytes("ISO-8859-1");
        System.out.println(byteArray2.length);
        // Decode the bytes again using the original charset
        String str3 = new String(byteArray2, encode);
        System.out.println("String: " + str3);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str1 = "你好";
        String str2 = "你好呀";
        test(str1, "UTF-8");
        test(str2, "UTF-8");
    }
}

Output:

 String: 你好
 6
 As ISO-8859-1: ä½ å¥½
 6
 String: 你好
 String: 你好呀
 9
 As ISO-8859-1: ä½ å¥½å
 9
 String: 你好呀

Using GBK as the intermediate encoding:

import java.io.UnsupportedEncodingException;

public class EncodingTest {
    public static void test(String str1, String encode) throws UnsupportedEncodingException {
        System.out.println("String: " + str1);
        // Encode the string into bytes using the original charset
        byte[] byteArray1 = str1.getBytes(encode);
        System.out.println(byteArray1.length);
        // Decode those bytes as a GBK string
        String str2 = new String(byteArray1, "GBK");
        System.out.println("As GBK: " + str2);
        // Turn the GBK string back into a byte array
        byte[] byteArray2 = str2.getBytes("GBK");
        System.out.println(byteArray2.length);
        // Decode the bytes again using the original charset
        String str3 = new String(byteArray2, encode);
        System.out.println("String: " + str3);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str1 = "你好";
        String str2 = "你好呀";
        test(str1, "UTF-8");
        test(str2, "UTF-8");
    }
}

Output:

 String: 你好
 6
 As GBK: 浣犲ソ
 6
 String: 你好
 String: 你好呀
 9
 As GBK: 浣犲ソ鍛�
 9
 String: 你好�?

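As you can see, when GBK is used to hold a UTF-8 encoded string temporarily, the last Chinese character comes back garbled.

Why does this happen?

Analysis

Let us add a method that prints a byte array as hex: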

public static void printHex(byte[] byteArray) {
    StringBuilder sb = new StringBuilder();
    for (byte b : byteArray) {
        // Print the high nibble, then the low nibble, of each byte
        sb.append(Integer.toHexString((b >> 4) & 0xF));
        sb.append(Integer.toHexString(b & 0xF));
        sb.append(" ");
    }
    System.out.println(sb.toString());
}

With this method printing the bytes instead of their length, the two runs above produce the following.

ISO-8859-1:

String: 你好
e4 bd a0 e5 a5 bd 
As ISO-8859-1: ä½ å¥½
e4 bd a0 e5 a5 bd 
String: 你好
String: 你好呀
e4 bd a0 e5 a5 bd e5 91 80 
As ISO-8859-1: ä½ å¥½å
e4 bd a0 e5 a5 bd e5 91 80 
String: 你好呀

GBK:

String: 你好
e4 bd a0 e5 a5 bd 
As GBK: 浣犲ソ
e4 bd a0 e5 a5 bd 
String: 你好
String: 你好呀
e4 bd a0 e5 a5 bd e5 91 80 
As GBK: 浣犲ソ鍛�
e4 bd a0 e5 a5 bd e5 91 3f 
String: 你好�?

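As you can see, after the UTF-8 bytes go through GBK and back, the final 80 has become 3f. Why?

Take the three characters "你好呀". Their UTF-8 byte stream is:

[e4 bd a0] [e5 a5 bd] [e5 91 80]

UTF-8 groups these bytes three at a time, one group per character. GBK, however, encodes a Chinese character in two bytes, so a GBK decoder regroups the same stream in pairs:

[e4 bd] [a0 e5] [a5 bd] [e5 91] [80 ?]

The last byte has no partner. The decoder treats the lone 0x80 as an unknown character (the � in the output above), and when the string is encoded back to GBK, that unknown character is written out as the half-width ASCII "?" (0x3f):

[e4 bd] [a0 e5] [a5 bd] [e5 91] [3f]

The data has been corrupted.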
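The silent substitution described above can be made visible. The following is a minimal sketch, not part of the original program: the class name StrictDecodeDemo and the helpers isValidIn and survivesRoundTrip are hypothetical names chosen for illustration. It assumes the JDK in use ships the GBK charset, and it configures a CharsetDecoder with CodingErrorAction.REPORT so that bytes which do not form valid GBK raise an exception instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictDecodeDemo {
    // Hypothetical helper: true if the bytes decode in the given charset
    // without any malformed or unmappable sequence.
    static boolean isValidIn(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: true if decoding and re-encoding with the
    // given charset returns exactly the original bytes.
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        byte[] back = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available
        // "\u4f60\u597d" is 你好, "\u4f60\u597d\u5440" is 你好呀
        byte[] two = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        byte[] three = "\u4f60\u597d\u5440".getBytes(StandardCharsets.UTF_8);

        System.out.println("6 bytes valid as GBK: " + isValidIn(two, gbk));
        System.out.println("9 bytes valid as GBK: " + isValidIn(three, gbk));
        System.out.println("9 bytes survive GBK round trip: "
                + survivesRoundTrip(three, gbk));
    }
}
```

For the two inputs used in this post, the six-byte stream should be reported as valid GBK (it pairs up evenly), while the nine-byte stream should be rejected and should not survive the round trip, which matches the 80 to 3f change seen in the output.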

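Why does ISO-8859-1 not have this problem?

Because ISO-8859-1 is a single-byte encoding, it groups the stream like this:

[e4] [bd] [a0] [e5] [a5] [bd] [e5] [91] [80]

Every byte value from 0x00 to 0xff maps to exactly one character, so nothing is altered along the way and the data comes back unchanged.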
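The single-byte property can be checked directly. The sketch below is an illustration rather than code from the original post; the class name Latin1RoundTrip and its helper methods are hypothetical. It pushes all 256 possible byte values through ISO-8859-1 and back, using the StandardCharsets constants so that no checked exception is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decodes the bytes as ISO-8859-1 and encodes the result back.
    static byte[] roundTrip(byte[] data) {
        String carrier = new String(data, StandardCharsets.ISO_8859_1);
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Builds an array holding every byte value from 0x00 to 0xff.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        String carrier = new String(all, StandardCharsets.ISO_8859_1);
        // One char per byte, so the string has as many chars as the array has bytes
        System.out.println(carrier.length());                   // 256
        System.out.println(Arrays.equals(all, roundTrip(all))); // true
    }
}
```

Because no byte value is rejected or merged with a neighbour, an ISO-8859-1 string can carry arbitrary bytes, including the UTF-8 and GBK streams used in this post.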

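A Question

You may ask: if I convert "你好呀" from UTF-8 to ISO-8859-1 and then from ISO-8859-1 to GBK, the result is garbled too. Isn't the first snippet below essentially the same as the second?

String isoFont = new String(chinese.getBytes("UTF-8"),"ISO-8859-1");
String gbkFont = new String(isoFont.getBytes("ISO-8859-1"),"GBK");

String gbkFont = new String(chinese.getBytes("UTF-8"),"GBK");

The two are indeed equivalent.

Doesn't that contradict what was said above?

It does not. In this code the first step produces bytes in UTF-8 and the second step decodes those same bytes as GBK, so garbled text is inevitable. You are handing UTF-8 bytes to a GBK decoder, and writing a program this way is simply wrong.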

Now consider the following code:

public static void test2() throws UnsupportedEncodingException {
    String chinese = "你好呀";
    // GBK test
    String gbkChinese = new String(chinese.getBytes("GBK"), "ISO-8859-1");
    System.out.println(gbkChinese);
    printHex(gbkChinese.getBytes("ISO-8859-1"));
    String gbkTest = new String(gbkChinese.getBytes("ISO-8859-1"), "GBK");
    System.out.println(gbkTest);

    // UTF-8 test
    String utf8Chinese = new String(chinese.getBytes("UTF-8"), "ISO-8859-1");
    System.out.println(utf8Chinese);
    printHex(utf8Chinese.getBytes("ISO-8859-1"));
    String utfTest = new String(utf8Chinese.getBytes("ISO-8859-1"), "UTF-8");
    System.out.println(utfTest);
}

Output:

ÄãºÃÑ½
c4 e3 ba c3 d1 bd 
你好呀
ä½ å¥½å
e4 bd a0 e5 a5 bd e5 91 80 
你好呀

As you can see:

GBK grouping: [c4 e3] -> 你, [ba c3] -> 好, [d1 bd] -> 呀

UTF-8 grouping: [e4 bd a0] -> 你, [e5 a5 bd] -> 好, [e5 91 80] -> 呀

The string "你好呀" produces different byte streams under GBK and under UTF-8.
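The two byte streams above are related through the string itself: decode with the charset the bytes were written in, then encode with the other one. The following sketch shows this in memory. It is an illustration only; the class name CharsetConvert and the helpers convert and toHex are hypothetical, and it assumes the GBK charset is available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetConvert {
    // Re-encodes bytes: decode with the charset they are actually in,
    // then encode the resulting string with the target charset.
    static byte[] convert(byte[] data, Charset from, Charset to) {
        String text = new String(data, from);
        return text.getBytes(to);
    }

    // Formats bytes as space-separated two-digit lowercase hex.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available
        // "\u4f60\u597d\u5440" is 你好呀
        byte[] utf8Bytes = "\u4f60\u597d\u5440".getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = convert(utf8Bytes, StandardCharsets.UTF_8, gbk);

        System.out.println(toHex(utf8Bytes)); // e4 bd a0 e5 a5 bd e5 91 80
        System.out.println(toHex(gbkBytes));  // c4 e3 ba c3 d1 bd
    }
}
```

The important detail is that the charset passed to new String(...) must be the one the bytes were really produced with. Passing any other charset reinterprets the bytes instead of converting them.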

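Conclusion

So how do we convert data between two encodings correctly?

Note: "conversion" here means the following. Suppose a GBK-encoded file contains the string "你好呀". After it is written to a UTF-8 encoded file, that file should still read "你好呀".

We create a GBK-encoded file containing the three characters 你好呀, read it, and write the same three characters to another file in UTF-8.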

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class Test2 {
    public static void main(String[] args) throws Exception {
        String line = readInFile("/Users/zhangwentong/junrongdai/gbk.txt", "GBK");
        System.out.println(line);
        writeInFile("/Users/zhangwentong/junrongdai/utf8.txt", line, "UTF-8");
    }

    // Reads the whole file as text, decoding its bytes with the given charset.
    // Line terminators are dropped, as in the original version.
    public static String readInFile(String fileName, String charset) {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader and the underlying streams
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            String rline;
            while ((rline = br.readLine()) != null) {
                content.append(rline);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }

    // Writes the text to the file, encoding it with the given charset.
    // FileOutputStream creates the file if it does not exist yet.
    public static void writeInFile(String fileName, String content, String charset) {
        try (FileOutputStream fos = new FileOutputStream(fileName)) {
            fos.write(content.getBytes(charset));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

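You can try the code above: the GBK text is converted to UTF-8 text. The reverse, writing the contents of a UTF-8 file out as GBK, works the same way.

So when reading and writing text, specify the character encoding explicitly and no garbling occurs. Otherwise Java falls back to the JVM's default charset (Charset.defaultCharset()), which depends on the platform before JDK 18 and is UTF-8 from JDK 18 onward unless overridden, so it may not match the file's actual encoding.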
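The same rule can be applied with the java.nio.file API, which takes the charset as an explicit argument on both the reading and the writing side. The sketch below is one possible way to do it, not the method used in this post: the class name FileTranscode and both helper methods are hypothetical, the demo works on temporary files rather than the paths shown earlier, and it assumes the GBK charset is available.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileTranscode {
    // Copies text from source to target, decoding with `from`
    // and encoding with `to`.
    static void transcode(Path source, Charset from, Path target, Charset to)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, from);
             BufferedWriter writer = Files.newBufferedWriter(target, to)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Demo helper: writes `data` to a temp file, transcodes it into a
    // second temp file, and returns the bytes of the result.
    static byte[] transcodeViaTempFiles(byte[] data, Charset from, Charset to) {
        Path source = null;
        Path target = null;
        try {
            source = Files.createTempFile("transcode-in", ".txt");
            target = Files.createTempFile("transcode-out", ".txt");
            Files.write(source, data);
            transcode(source, from, target, to);
            return Files.readAllBytes(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                if (source != null) Files.deleteIfExists(source);
                if (target != null) Files.deleteIfExists(target);
            } catch (IOException ignored) {
                // best-effort cleanup of the temp files
            }
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available
        // "\u4f60\u597d\u5440" is 你好呀
        byte[] gbkBytes = "\u4f60\u597d\u5440".getBytes(gbk);
        byte[] utf8Bytes = transcodeViaTempFiles(gbkBytes, gbk, StandardCharsets.UTF_8);
        System.out.println(gbkBytes.length + " GBK bytes -> "
                + utf8Bytes.length + " UTF-8 bytes"); // 6 GBK bytes -> 9 UTF-8 bytes
        System.out.println("JVM default charset: " + Charset.defaultCharset());
    }
}
```

Copying through a char buffer keeps line terminators intact and avoids loading the whole file into one string, which may matter for larger files.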

Closing

Welcome to my blog

https://www.sakuratears.top

My GitHub

https://github.com/javazwt
