How to traverse hundreds of millions of files in a folder without stack overflow

Preface: A folder contains many levels of small files. How do you count how many files are under it? Recursive traversal is the simple, brute-force answer, and recursion is indeed the more convenient solution in general. But when the folder hierarchy is very deep, each nested call keeps its stack frame until the calls below it return, so the frames pile up and the JVM stack overflows. What should we do?

  To be honest, I had never run into this problem myself. I heard about it from a senior IT colleague I admire, who was asked it in an interview. He said that, being nervous, he could only think of recursion and no other solution.

  Even when he told me about this problem, he still had not found a good solution. He thought the situation was similar to recursion in web crawlers: to keep a crawler from getting stuck at some depth, the depth of each crawl is usually capped, and various restrictions are then used to make sure every file is still visited.

  At that moment I had a flash of inspiration, because I happened to be reviewing data structures. I said the folder hierarchy looked familiar: it is equivalent to a tree. So how did we traverse a tree's nodes when we studied data structures? There are pre-order, in-order, and post-order traversal. These are usually written recursively, though, so they are not the solution we are looking for.

But tucked away in a corner is a traversal we often overlook: level-order traversal.

Level-order traversal: starting from the root node of the tree, visit the root on the first level, then the nodes on the second level from left to right, then the nodes on the third level, and so on. Visiting the nodes of the tree level by level, from top to bottom and from left to right, is level-order traversal.
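
Before applying this to folders, here is a minimal sketch of level-order traversal on a small in-memory tree. The Node class, the levelOrder method, and the node names are hypothetical, made up only for this illustration. The point is that a queue replaces the call stack, so the depth of the tree no longer matters.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class LevelOrderDemo {
    // Hypothetical tree node used only for this illustration.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        // Appends a child and returns this node so calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Visits nodes level by level, left to right, without recursion.
    static List<String> levelOrder(Node root) {
        List<String> visited = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node current = queue.remove();   // take the oldest waiting node
            visited.add(current.name);
            queue.addAll(current.children);  // its children wait behind the rest of this level
        }
        return visited;
    }

    public static void main(String[] args) {
        Node root = new Node("A")
                .add(new Node("B").add(new Node("D")).add(new Node("E")))
                .add(new Node("C").add(new Node("F")));
        System.out.println(levelOrder(root)); // prints [A, B, C, D, E, F]
    }
}
```

A folder tree works the same way: folders are the nodes that have children, and files are the leaves.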

Code idea:

We only need a list to hold each file or folder. Read the elements of the list in order; whenever an element is a folder, append all of the files and folders directly under it to the end of the list. Then read the next element, and so on until the end of the list is reached.

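import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class demo {
    public static void main(String[] args) {
        // The list doubles as the traversal queue: new entries go on the end and are read in order.
        List<File> list = new ArrayList<File>();
        list.add(new File("C:/intsmaze"));
        for (int i = 0; i < list.size(); i++) {
            File current = list.get(i);
            if (current.isDirectory()) {
                File[] tempList = current.listFiles();
                // listFiles() returns null when the directory cannot be read.
                if (tempList != null) {
                    for (File child : tempList) {
                        list.add(child);
                    }
                }
            }
        }
        // The list now holds the root plus every file and folder beneath it.
    }
}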

We are all experienced developers, so the code above should need little explanation.

Some readers will push the point further: when the number of files is very large, this code does keep the stack from overflowing, but the list keeps growing until the heap runs out of memory.

That can indeed happen, and the fix is simple. Each time an element is read from the list, remove it from the list (writing it to disk first if the result needs to be kept). Because the next element then moves into the same position, the loop no longer increments i.

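import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class demo {
    public static void main(String[] args) {
        List<File> list = new ArrayList<File>();
        list.add(new File("C:/intsmaze"));
        long fileCount = 0; // added here so the traversal produces the count asked for in the preface
        // i stays at 0: removing the current element moves the next one into its place.
        for (int i = 0; i < list.size(); ) {
            File current = list.get(i);
            if (current.isDirectory()) {
                File[] tempList = current.listFiles();
                // listFiles() returns null when the directory cannot be read.
                if (tempList != null) {
                    for (File child : tempList) {
                        list.add(child);
                    }
                }
            } else {
                fileCount++;
            }
            // Note: ArrayList.remove(0) shifts every remaining element, so it is slow on a long list.
            list.remove(i);
        }
        System.out.println(fileCount);
    }
}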
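
One possible refinement, offered as a sketch rather than a definitive implementation: keep only the folders that still have to be read in the queue, and count files as they are seen instead of storing them. The class name FileCounter and the method countFiles are hypothetical names chosen for this example. The sketch assumes the starting path is a readable directory; an unreadable folder makes Files.newDirectoryStream throw an IOException, which a real tool would probably want to catch and log. It uses ArrayDeque because removing from the head of an ArrayDeque does not shift the remaining elements the way ArrayList.remove(0) does.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;

class FileCounter {
    // Counts every entry under root that is not a directory.
    // Assumes root itself is a readable directory.
    static long countFiles(Path root) throws IOException {
        long count = 0;
        Deque<Path> pending = new ArrayDeque<>(); // folders waiting to be read
        pending.add(root);
        while (!pending.isEmpty()) {
            Path dir = pending.poll();
            // DirectoryStream iterates over the entries lazily instead of building an array.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    // NOFOLLOW_LINKS: a symbolic link to a folder is counted, not entered,
                    // which avoids looping forever on cyclic links.
                    if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                        pending.add(entry);
                    } else {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countFiles(Paths.get("C:/intsmaze")));
    }
}
```

Memory use now grows with the number of folders still waiting in the queue rather than with the total number of files. A tree with an enormous number of folders could still strain the heap, in which case the pending folders would have to be spilled to disk as described above.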

If you have a better solution, please share it so we can discuss it together.