A full record of duplicate-checking and reworking Android projects (static code analysis tool Simi

Author: 逆水寒Stephen | Published: 2025-02-23 15:12


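Background: two of my projects were recently flagged for duplication by an app store, and one of them was taken down. To avoid naming them, I will call them "Project A" and "Project B". Project B really was built on top of Project A, so some of the foundational code is in fact identical.
Once a project is flagged, the identical parts have to be changed, which first requires finding them quickly (mainly code fragments and resource files) and then reworking them.
The main steps are:
1. Use the static code analysis tool Simian to find similar code
2. Rewrite the similar code, by hand or with AI, into a different code structure
3. Compare the projects' resource files (mainly those under the libs and res directories) and change resource names and contents
The complete scripts are available at: https://github.com/woshiluoyong/simianDuplicateCheck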

1. Use the static code analysis tool Simian to find similar code

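Because I needed to find similar code across two projects, this step matters most. I searched Baidu, Google and GitHub, and the tools built for similar needs mostly fell short: they generally only check for duplicates inside a single project, support only a limited set of languages that does not include Java and Kotlin, and are heavy and awkward to deploy. In the end I went with Simian, a well-established tool. The only catch is that you have to parse its duplicate-line output yourself and derive what you need from it.

Simian works by scanning the files under a root directory and comparing them, so to keep unrelated files from interfering I wrote two Python scripts, the first of which copies the source code of both projects into a single directory for analysis. Some files have exactly the same name in both projects, so to tell the two projects apart the copy step also renames every file with a project-specific prefix, which is used later when filtering and printing the results. The script is as follows: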

# coding=utf-8
import os
import shutil

# Walk the source directory tree and copy every file with a supported extension
# into the destination directory, skipping excluded directories (such as build).
# A prefix is added to the top-level directory and to every file name.
def check_copy_files(src_dir, dst_dir, supported_extensions, exclude_dir, prefix):
    # Make sure the destination directory exists
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
        print(f"Created destination directory: {dst_dir}")

    # Walk the source directory and its subdirectories
    for root, dirs, files in os.walk(src_dir):
        # Prune excluded directories in place so os.walk never descends into them
        dirs[:] = [d for d in dirs if d != exclude_dir]

        for file in files:
            # Check the file extension
            if any(file.endswith(ext) for ext in supported_extensions):
                # Full path of the source file
                src_file_path = os.path.join(root, file)

                # Destination path keeps the relative directory structure of the source
                relative_path = os.path.relpath(root, src_dir)
                if relative_path == ".":
                    # Top-level directory: use the prefix as the directory name
                    relative_path = prefix
                else:
                    # Subdirectory: keep the relative path under the prefix
                    relative_path = os.path.join(prefix, relative_path)

                dst_subdir = os.path.join(dst_dir, relative_path)

                # Create the destination subdirectory if needed
                os.makedirs(dst_subdir, exist_ok=True)

                # Add the prefix to the file name
                dst_file_name = f"{prefix}#@#{file}"
                dst_file_path = os.path.join(dst_subdir, dst_file_name)

                # Copy the file
                shutil.copy2(src_file_path, dst_file_path)
                #print(f"Copied: {src_file_path} -> {dst_file_path}")

    print(f"{prefix} All files have been copied successfully.")

if __name__ == "__main__":
    source_one_directory = "/Users/xxxx/Documents/AndroidProjects/xxxx_xxxx_xxxx_A"
    source_two_directory = "/Users/xxxx/Documents/AndroidProjects/xxxx_xxxx_xxxx_B"
    destination_directory = "sourceCode"
    try:
        if os.path.exists(destination_directory):
            shutil.rmtree(destination_directory)
            print(f"Delete destination directory: {destination_directory} Ok")
    except Exception as e:
        print(f"Delete destination directory: {destination_directory} Err: {e}")
    check_copy_files(source_one_directory, destination_directory, [".java", ".kt"], "build", "A")
    check_copy_files(source_two_directory, destination_directory, [".java", ".kt"], "build", "B")
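
To make the resulting layout easier to picture, here is a minimal sketch that computes where a given source file ends up and how the prefix can be split off again later. The helper names `prefixed_destination` and `split_prefixed_name` and the `/work/projA` path are hypothetical and only for illustration; the `#@#` separator and the path rules mirror the copy script above.

```python
import os

SEPARATOR = "#@#"  # same marker the copy script puts between prefix and file name


def prefixed_destination(src_dir, dst_dir, prefix, src_file_path):
    """Return the path the copy script would give src_file_path under dst_dir."""
    rel_dir = os.path.relpath(os.path.dirname(src_file_path), src_dir)
    # Files directly in src_dir go to <dst>/<prefix>, others to <dst>/<prefix>/<rel_dir>
    sub_dir = prefix if rel_dir == "." else os.path.join(prefix, rel_dir)
    file_name = f"{prefix}{SEPARATOR}{os.path.basename(src_file_path)}"
    return os.path.join(dst_dir, sub_dir, file_name)


def split_prefixed_name(path):
    """Split a name such as 'A#@#Foo.kt' back into ('A', 'Foo.kt')."""
    prefix, _, original = os.path.basename(path).partition(SEPARATOR)
    return prefix, original


if __name__ == "__main__":
    dst = prefixed_destination(
        "/work/projA", "sourceCode", "A", "/work/projA/app/src/Main.java"
    )
    print(dst)
    print(split_prefixed_name(dst))
```

Because the prefix is part of the file name itself, two files that share a name in both projects stay distinguishable even when only the base name is printed.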

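Once the copy is done, run the Simian analysis, sort the results by number of similar lines in descending order, optionally print the exact line ranges of the similar code, and write the summary to a log file. The script is as follows: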

# coding=utf-8
import subprocess
import xml.etree.ElementTree as ET
from collections import defaultdict
import itertools
import os
import shutil

# Clean the XML file by dropping any non-XML content before the XML declaration
def clean_xml_file(input_file, output_file):
    try:
        with open(input_file, 'r', encoding='utf-8') as infile:
            lines = infile.readlines()

        # Find the line where the XML declaration starts
        start_index = None
        for i, line in enumerate(lines):
            if line.strip().startswith('<?xml'):
                start_index = i
                break

        if start_index is None:
            raise ValueError("No XML declaration found; the file may not be valid XML.")

        # Keep everything from the XML declaration onwards
        cleaned_lines = lines[start_index:]

        with open(output_file, 'w', encoding='utf-8') as outfile:
            outfile.writelines(cleaned_lines)

        print(f"Non-XML content removed; cleaned file saved to: {output_file}")
    except Exception as e:
        print(f"Error while removing non-XML content: {e}")

def parse_simian_xml(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Shared line counts for every pair of files
    file_pairs = defaultdict(lambda: defaultdict(int))
    # Maps each file to its duplicated line ranges, e.g. "10:21;50:64"
    line_map = {}

    for set_elem in root.find('check').findall('set'):
        blocks = set_elem.findall('block')
        if len(blocks) < 2:
            continue  # skip sets that do not contain at least a pair of blocks

        for block in blocks:
            sourceFile = block.get('sourceFile')
            lineRangeStr = f"{block.get('startLineNumber')}:{block.get('endLineNumber')}"
            #print(f"block: {sourceFile} = {lineRangeStr}")
            if sourceFile in line_map:
                line_map[sourceFile] = line_map[sourceFile] + ";" + lineRangeStr
            else:
                line_map[sourceFile] = lineRangeStr

        line_count = int(set_elem.get('lineCount'))
        file_paths = [block.get('sourceFile') for block in blocks]

        # Add the line count to every possible pair of files in this set
        for file1, file2 in itertools.combinations(file_paths, 2):
            file_pairs[file1][file2] += line_count
            file_pairs[file2][file1] += line_count

    #print(f"line_map: {line_map}")
    return file_pairs, line_map

def truncate_filename(filename):
    return os.path.basename(filename)  # keep only the last path component

def print_similarity(file_pairs, line_map, output_file, care_prefix):
    unique_pairs = set()
    similarities = []  # similarity records, sorted below

    for file1, pairs in file_pairs.items():
        #print(f"print_similarity file1: {file1} {pairs}")
        for file2, shared_lines in pairs.items():
            # Sort the two names so that each pair is reported only once
            pair = tuple(sorted([file1, file2]))
            if pair not in unique_pairs and file1 != file2:
                unique_pairs.add(pair)
                file1_truncated = truncate_filename(file1)
                file2_truncated = truncate_filename(file2)
                file1_prefix = file1_truncated.split("#@#")[0]  # extract the project prefix
                file2_prefix = file2_truncated.split("#@#")[0]  # extract the project prefix
                if file1_prefix != file2_prefix:  # keep only pairs that span both projects
                    similarities.append((file1_truncated, file2_truncated, shared_lines, file1, file2))

    # Sort by shared_lines, largest first
    similarities.sort(key=lambda x: x[2], reverse=True)

    # Write the result to the output file
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for index, (file1_truncated, file2_truncated, shared_lines, file1, file2) in enumerate(similarities, start=1):
            outfile.write(f"{index}. {file1_truncated} and {file2_truncated} share [{shared_lines}] similar lines\n")
            if care_prefix is not None:
                outfile.write("=== line ranges ==>\n")
                if care_prefix in file1:
                    outfile.write(f"{file1}:【{line_map[file1]}】\n")
                elif care_prefix in file2:
                    outfile.write(f"{file2}:〖{line_map[file2]}〗\n")
                else:
                    outfile.write(f"{file1}:【{line_map[file1]}】\n")
                    outfile.write(f"{file2}:〖{line_map[file2]}〗\n")

def run_simian_command(jar_path, threshold, formatter, source_paths, output_file):
    # Build the command
    command = [
        "java", "-jar", jar_path,
        f"-threshold={threshold}",  # option and value are passed as one argument
        f"-formatter={formatter}"   # option and value are passed as one argument
    ]
    command.extend(source_paths)  # append the source code paths

    try:
        # Open the output file and redirect the command's stdout into it
        with open(output_file, "w") as output:
            subprocess.run(command, stdout=output, stderr=subprocess.PIPE, check=True)
        print(f"Simian finished successfully; output saved to {output_file}")
    except subprocess.CalledProcessError as e:
        # A non-zero exit code is only reported when stderr carries a message
        retMsgVal = e.stderr.decode().strip()
        if retMsgVal:
            print(f"Simian command failed: {retMsgVal}")

if __name__ == "__main__":
    output_file = "check_output.xml"  # output file path
    result_file = "check_result.log"
    result_backup_file = "check_result_backup.log"

    jar_path = "simian-4.0.0.jar"  # path to the Simian jar
    threshold = 10  # similarity threshold
    formatter = "xml"  # output format: XML
    source_paths = [
        "sourceCode/**/*.java",
        "sourceCode/**/*.kt"
    ]

    # Run Simian
    run_simian_command(jar_path, threshold, formatter, source_paths, output_file)

    # Clean the XML file in place
    clean_xml_file(output_file, output_file)
    # Parse the cleaned XML file
    file_pairs, line_map = parse_simian_xml(output_file)
    #print(f"parseResult: {file_pairs}「」{line_map}")
    # Write the similarity result. With care_prefix=None no line ranges are written;
    # otherwise line ranges are written for the files matching that prefix.
    print_similarity(file_pairs, line_map, result_file, None)  # or "B#@#"
    print(f"Final similarity result saved to: {result_file}")

    try:
        sureCopy = input('Back up the similarity result? [y/n] (press Enter to skip) ')
    except EOFError:
        sureCopy = 'n'
    if 'y' == sureCopy:
        if os.path.exists(result_backup_file):
            os.remove(result_backup_file)
        shutil.copy2(result_file, result_backup_file)
        print(f"Similarity result backed up to: {result_backup_file}")
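After adjusting the local path parameters in the scripts and running the two scripts above in order with Python, the output looks roughly like this: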

(Screenshot: check_output.xml)

(Screenshot: check_result.log without line ranges)

(Screenshot: check_result.log with line ranges for the project of interest)

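check_output.xml is Simian's raw output, and check_result.log is the summary derived from it. To include line ranges for the project you care about, pass the file-name prefix set in script 1 as the last argument of print_similarity in script 2. Then work through the files, starting with the most duplicated ones. Simple changes include swapping the positions of variables or logic, converting between switch, when, if and ternary expressions, and extracting sub-methods; an easier way to make these changes is described in the next section. After each round of changes, rerun script 1 and script 2 and watch the line counts change, until the duplication has dropped far enough.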
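
Since the loop is "change, rerun, compare", it can help to diff two rounds mechanically instead of eyeballing the logs. A minimal sketch, assuming you keep the per-pair totals of each round as a dict mapping a file pair to its shared line count; the function name `compare_rounds` and the sample numbers are made up for illustration.

```python
def compare_rounds(before, after):
    """Compare two rounds of results.

    before / after: dicts mapping (file_a, file_b) -> shared line count.
    Returns rows of (pair, old, new, delta), largest remaining count first.
    """
    rows = []
    for pair in sorted(set(before) | set(after)):
        old = before.get(pair, 0)
        new = after.get(pair, 0)  # a pair missing from the new round counts as 0
        rows.append((pair, old, new, new - old))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up numbers, purely for illustration
    round_1 = {("A#@#X.kt", "B#@#X.kt"): 40, ("A#@#Y.kt", "B#@#Y.kt"): 12}
    round_2 = {("A#@#X.kt", "B#@#X.kt"): 15}
    for pair, old, new, delta in compare_rounds(round_1, round_2):
        print(f"{pair[0]} / {pair[1]}: {old} -> {new} ({delta:+d})")
```

Pairs that still top the list after a round are the ones worth tackling next.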

2. Use the Cursor editor to rewrite similar code with less effort

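The previous step identified the files and locations that need changing. When duplication is high the amount of rewriting is huge, and the functional logic has to stay intact, so a good AI assistant saves a lot of effort. In Android Studio you can use Copilot or Gemini, but after trying several tools my first recommendation is the composer in the Cursor editor, because it can edit a file and apply the changes to it directly. (Doubao can also apply changes directly to files, but in my experience it was weak on code; that is only my personal impression.)

After a lot of experimenting with Cursor, I found that asking it to rewrite a whole file generally does not work: it may delete methods by mistake, or stop halfway through its output, probably because the input is too large. After many attempts I settled on changing at most about 8 methods at a time, and only the logic inside those methods, leaving everything else alone. This proved the most reliable approach. The interface and workflow are shown below: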

(Screenshot: image.png)

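The prompt text shown in the screenshot above is reproduced below. Usually you only need to change the file context and the list of method names. After applying the changes, it is best to have the same file open in Android Studio, check whether the edits produce any errors (red highlights), and fix them before moving on.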

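Prompt for modifying Java files
Without changing the logic of the methods xxx, xxx, xxx, xxx, xxx, modify only the code inside these methods (for example, collapse empty method bodies, or if and else blocks that contain a single statement, onto one line; where it does not affect the logic, reorder single-line statements inside a method; or apply other such optimizations). Every method must change, and the modified methods should differ substantially from the originals. Do not affect the methods' parameters or outputs, do not delete fields or methods by mistake, and do not leave any method uncallable! Do not list the changes; modify the file directly.
Prompt for modifying Kotlin files
Without changing the logic of the methods xxx, xxx, xxx, xxx, xxx, modify only the code inside these methods (for example, replace switch/when with if else and if else with switch/when; collapse empty method bodies, or if and else blocks that contain a single statement, onto one line; where it does not affect the logic, reorder single-line statements inside a method; use scope functions such as run, let, apply and also wherever they fit; inline variables that are referenced only once; or apply other such optimizations). Every method must change, and the modified methods should differ substantially from the originals. Do not affect the methods' parameters or outputs, do not delete fields or methods by mistake, and do not leave any method uncallable! Do not list the changes; modify the file directly.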

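Some classes have so many methods that copying their names one by one is impractical and tedious, hence the script below: it prints the method names already joined together, ready to paste into Cursor.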

# coding=utf-8
import re

def extract_methods_from_file(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            code_content = file.read()

        # Regex matching method definitions (Java or Kotlin).
        # This is a heuristic, so review the printed names before using them.
        if file_path.endswith('.java'):
            method_pattern = re.compile(r'^\s*(public|private|protected|static|\s)*\s*[\w\<\>\[\]]+\s+(\w+)\s*\([^)]*\)\s*(\{|\:|\@Override|\@JvmStatic)?',re.MULTILINE)
        else:
            method_pattern = re.compile(r'^\s*(public\s+|private\s+|override\s+)?fun\s+([a-zA-Z0-9_]+)\s*\(', re.MULTILINE)
        methods = method_pattern.findall(code_content)

        # Extract the method names (second capture group)
        method_names = [method[1] for method in methods]

        # Exclude methods declared in interfaces or abstract classes (Kotlin and Java)
        interface_pattern = re.compile(
            r'(interface|abstract\s+class)\s+\w+\s*\{([^}]*)\}',
            re.DOTALL
        )
        interface_methods = set()
        for match in interface_pattern.finditer(code_content):
            interface_code = match.group(2)
            interface_method_pattern = re.compile(
                r'\b(public|private|protected|static|\s)*\s*[\w\<\>\[\]]+\s+(\w+)\s*\([^)]*\)\s*(\{|\:)?',
                re.MULTILINE
            )
            interface_methods.update([m[1] for m in interface_method_pattern.findall(interface_code)])

        # Drop the interface methods and remove duplicates
        unique_method_names = set(method_names) - interface_methods

        print(f"Found {len(unique_method_names)} method names:")
        # Decide whether to wrap lines based on the number of names;
        # sorting keeps the output stable between runs
        unique_method_list = sorted(unique_method_names)
        if len(unique_method_list) < 10:
            return ', '.join(unique_method_list)
        else:
            formatted_output = []
            newLineNum = 8 if len(unique_method_list) > 16 else (5 if 10 == len(unique_method_list) else 6)
            for i in range(0, len(unique_method_list), newLineNum):
                formatted_output.append(', '.join(unique_method_list[i:i + newLineNum]))
            return '\n'.join(formatted_output)
    except FileNotFoundError:
        return "File not found; please check the path."
    except Exception as e:
        return f"An error occurred: {e}"

if __name__ == "__main__":
    file_path = 'sourceCode/B/app/src/main/kotlin/com/xxx/xx/android/xx/tool/B#@#xxxManager.kt'
    result = extract_methods_from_file(file_path)
    print(result)

The output looks roughly like this; copy one line at a time:

(Screenshot: 3_read_fun_name.py)
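
The "about 8 methods per request" rule can also be applied mechanically. The sketch below groups method names into batches of at most 8 and fills each batch into a prompt. `PROMPT_TEMPLATE` is a shortened placeholder and not the full prompt quoted earlier, and the function names are hypothetical.

```python
# Shortened placeholder; substitute the full prompt text you actually use.
PROMPT_TEMPLATE = (
    "Without changing the logic of the methods {methods}, "
    "modify only the code inside these methods. "
    "Do not list the changes; modify the file directly."
)


def batch_method_names(names, batch_size=8):
    """Split the de-duplicated, sorted method names into batches of batch_size."""
    ordered = sorted(set(names))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def build_prompts(names, batch_size=8):
    """Return one prompt per batch of method names."""
    return [
        PROMPT_TEMPLATE.format(methods=", ".join(batch))
        for batch in batch_method_names(names, batch_size)
    ]


if __name__ == "__main__":
    # Made-up method names, purely for illustration
    example_names = [f"method{i:02d}" for i in range(18)]
    for prompt in build_prompts(example_names):
        print(prompt)
        print("---")
```

Each printed prompt then covers one batch, so a class with 18 methods takes three separate requests.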

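If you run out of Cursor quota: you can look for Cursor activation offers on Xianyu or Pinduoduo. I got mine on Xianyu for about the price of a bowl of noodles.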

3. Use the project comparison tool kdiff3 to find and change duplicated resources

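On macOS, kdiff3 can be installed with brew install kdiff3. Once it is installed, select the root directories of the two projects, and remember which directory A and B each stand for, since you will need that later.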

(Screenshot: selecting the two project root directories to compare)

(Screenshot: viewing the duplicates)
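After finding the duplicated resources under libs and res, images can be handled by renaming them, and XML files by moving parts of their content around. Once that is done, click ReScan in the menu to rerun the scan and check whether any duplicates remain.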

(Screenshot: rescan)
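
As a scripted complement to the visual comparison, here is a minimal sketch that lists resource files whose bytes are identical in both projects, regardless of file name. It is not part of the original scripts: the function names are hypothetical, and it assumes the resources of interest live under directories named `libs` or `res` and that `build` directories should be skipped, as in the copy script.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_digest(root_dir, sub_dirs=("libs", "res"), exclude_dir="build"):
    """Map digest -> relative paths for files located under a libs or res directory."""
    index = defaultdict(list)
    for current, dirs, files in os.walk(root_dir):
        dirs[:] = [d for d in dirs if d != exclude_dir]  # never descend into build
        parts = os.path.relpath(current, root_dir).split(os.sep)
        if not any(part in sub_dirs for part in parts):
            continue
        for name in files:
            full_path = os.path.join(current, name)
            index[file_digest(full_path)].append(os.path.relpath(full_path, root_dir))
    return index


def identical_resources(project_a, project_b):
    """Return {digest: (paths_in_a, paths_in_b)} for content present in both projects."""
    index_a = index_by_digest(project_a)
    index_b = index_by_digest(project_b)
    common = index_a.keys() & index_b.keys()
    return {d: (sorted(index_a[d]), sorted(index_b[d])) for d in common}
```

Calling `identical_resources` with the two project roots returns the byte-identical files grouped by digest. It only detects exact copies, so files whose content has already been edited still need the side-by-side view in kdiff3.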

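That is about all. With these scripts and tools you can largely solve the project duplication problem, though a fair share of it is still manual labour. May your mouse hand stay healthy 🙂!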
