Split data directory into training and test directory with sub directory structure preserved












I want to use ImageDataGenerator in Keras for data augmentation, but it requires the training and validation directories, each with one subdirectory per class, to be supplied separately, as below (this is from the Keras documentation). I have a single directory with 2 subdirectories for 2 classes (Data/Class1 and Data/Class2). How do I randomly split this into training and validation directories?



    from keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

    test_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

    validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

    # `model` is assumed to be an already compiled Keras model.
    model.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800)


I am interested in re-running my algorithm multiple times with random training and validation data splits.










Tags: python, machine-learning, neural-network, keras, deep-learning






asked Oct 12 '17 at 19:47 by Sharanya Arcot Desai
edited May 18 '18 at 11:39 by Marcin Możejko
























6 Answers

0 votes

Here's my approach:

    import os
    import random
    import shutil
    from tempfile import TemporaryDirectory

    # train_image_folder, train_label_folder and train_val_split are assumed
    # to be defined earlier.
    # Create temporary validation set.
    with TemporaryDirectory(dir=train_image_folder) as valid_image_folder, \
            TemporaryDirectory(dir=train_label_folder) as valid_label_folder:
        train_images = os.listdir(train_image_folder)
        train_labels = os.listdir(train_label_folder)

        for img_name in train_images:
            single_name, ext = os.path.splitext(img_name)
            label_name = single_name + '.png'
            # Skip anything without a matching label file.
            if label_name not in train_labels:
                continue
            if random.uniform(0, 1) <= train_val_split:
                # Move the files; the label keeps its own file name.
                shutil.move(os.path.join(train_image_folder, img_name), os.path.join(valid_image_folder, img_name))
                shutil.move(os.path.join(train_label_folder, label_name), os.path.join(valid_label_folder, label_name))

Don't forget to move everything back.
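
The "move everything back" step could look roughly like the sketch below. It assumes the validation files sit in a flat directory whose entries belong in the training directory; the helper name `move_back` and its arguments are hypothetical and not part of the answer above.

```python
import os
import shutil


def move_back(valid_dir, train_dir):
    """Move every entry in valid_dir back into train_dir.

    Returns the number of entries moved. Raises FileExistsError
    instead of overwriting an entry that already exists in train_dir.
    """
    moved = 0
    for name in sorted(os.listdir(valid_dir)):
        src = os.path.join(valid_dir, name)
        dst = os.path.join(train_dir, name)
        if os.path.exists(dst):
            raise FileExistsError(dst)
        shutil.move(src, dst)
        moved += 1
    return moved
```

With temporary directories as in the answer, this would have to run inside the `with` block, once for the images and once for the labels, because a `TemporaryDirectory` deletes its contents when the block exits.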






0 votes

Your solution worked, thanks.

    import os
    import shutil

    import numpy as np

    # base_dir is assumed to point at the dataset root.
    sourceN = os.path.join(base_dir, "train", "NORMAL")
    destN = os.path.join(base_dir, "val", "NORMAL")
    sourceP = os.path.join(base_dir, "train", "PNEUMONIA")
    destP = os.path.join(base_dir, "val", "PNEUMONIA")

    filesN = os.listdir(sourceN)
    filesP = os.listdir(sourceP)

    # Move roughly 20% of each class into the validation directory.
    for f in filesN:
        if np.random.rand() < 0.2:
            shutil.move(os.path.join(sourceN, f), os.path.join(destN, f))

    for i in filesP:
        if np.random.rand() < 0.2:
            shutil.move(os.path.join(sourceP, i), os.path.join(destP, i))

    # Report how many files ended up in each directory.
    print(len(os.listdir(sourceN)))
    print(len(os.listdir(sourceP)))
    print(len(os.listdir(destN)))
    print(len(os.listdir(destP)))





• Consider adding an explanation. – Grant Miller, May 23 '18 at 18:38











9 votes

Thank you guys! I was able to write my own function to create training and test data sets. Here's the code for anyone who's looking.

    import os
    import shutil

    import numpy as np

    source1 = "/source_dir"
    dest11 = "/dest_dir"
    files = os.listdir(source1)

    # Move each file to the destination directory with probability 0.2.
    for f in files:
        if np.random.rand() < 0.2:
            shutil.move(os.path.join(source1, f), os.path.join(dest11, f))

answered Oct 20 '17 at 21:54 by Sharanya Arcot Desai
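
For the layout in the question (Data/Class1 and Data/Class2), the same idea can be applied per class so that the subdirectory structure is preserved. The following is only a sketch under stated assumptions: the function name `split_class_dirs`, the `train`/`validation` folder names, and the choice to copy rather than move files are all illustrative and not taken from the answer above.

```python
import os
import random
import shutil


def split_class_dirs(data_dir, out_dir, val_fraction=0.2, seed=None):
    """Copy data_dir/<class>/* into out_dir/train/<class> and
    out_dir/validation/<class>, picking validation files at random.

    Returns {class_name: (n_train, n_validation)}.
    """
    rng = random.Random(seed)
    counts = {}
    for class_name in sorted(os.listdir(data_dir)):
        class_path = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_path):
            continue  # ignore stray files next to the class folders
        files = sorted(
            f for f in os.listdir(class_path)
            if os.path.isfile(os.path.join(class_path, f))
        )
        rng.shuffle(files)
        n_val = int(round(len(files) * val_fraction))
        subsets = {"validation": files[:n_val], "train": files[n_val:]}
        for subset, names in subsets.items():
            target = os.path.join(out_dir, subset, class_name)
            os.makedirs(target, exist_ok=True)
            for name in names:
                shutil.copy2(os.path.join(class_path, name), target)
        counts[class_name] = (len(files) - n_val, n_val)
    return counts
```

Because the originals are left untouched, re-running with a different `seed` and a fresh `out_dir` (which should live outside `data_dir`) gives a new random split each time, and the two resulting folders can be passed to `flow_from_directory` as in the question.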

























2 votes

Unfortunately, this is not possible with the current implementation of keras.preprocessing.image.ImageDataGenerator (as of October 14th, 2017), but since it is a frequently requested feature, I expect it to be added in the near future.

But you could do this using standard Python os operations. Depending on the size of your dataset, you could also first load all images into RAM and then use the classical fit method, which can hold out part of your data for validation.

answered Oct 14 '17 at 14:39 by Marcin Możejko

• Thank you. I was able to write a function to create these directories. – Sharanya Arcot Desai, Oct 19 '17 at 22:50
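
For the in-memory route, one detail is worth checking: according to the Keras documentation, the `validation_split` argument of `fit` takes the last fraction of the samples, before shuffling, so data loaded class by class should be shuffled first. A minimal sketch with plain Python lists follows; the helper name `shuffled_split` is hypothetical, and real image data would more likely be held in NumPy arrays.

```python
import random


def shuffled_split(samples, labels, val_fraction=0.2, seed=None):
    """Shuffle samples and labels together, then split off a validation part.

    Returns (train_samples, train_labels, val_samples, val_labels).
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    n_val = int(round(len(order) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (
        [samples[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [samples[i] for i in val_idx],
        [labels[i] for i in val_idx],
    )
```

Calling it with a different `seed` on each run gives a new random train/validation split without touching the files on disk.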



















1 vote

You will need to either manually copy some of your training data into a validation directory, or write a program that randomly moves data from your training directory to your validation directory. With either option, you will need to pass the validation directory to your validation ImageDataGenerator().flow_from_directory() as the path.

Details for organizing your data in the directory structure are covered in this video.

answered Oct 12 '17 at 20:11 by blackHoleDetector
edited Oct 14 '17 at 16:19

• Thanks for your answer. But I did not see validation_split as a parameter in fit_generator, and fit_generator is what I want to use. It's a parameter in the fit function. – Sharanya Arcot Desai, Oct 13 '17 at 19:33

• Ah, you're right. I was thinking it was a parameter in both fit() and fit_generator(), but it is only for fit(). I've updated my answer. You will have to either manually or programmatically create your directory structure for both the validation and training sets, and then point to these separate directories with your ImageDataGenerators for each of these sets. – blackHoleDetector, Oct 14 '17 at 16:21



















1 vote

https://stackoverflow.com/a/52372042/10111155 provided the easiest way: ImageDataGenerator now supports splitting into train/test directly from a single directory with subdirectories.

This is copied directly from that answer with no changes. I take no credit. I tried it and it worked perfectly.

Note that train_data_dir is the same in train_generator and validation_generator. If you want a three-way split (train/test/valid) using ImageDataGenerator, the source code will need to be modified; there are nice instructions here.

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        validation_split=0.2)  # set validation split

    train_generator = train_datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary',
        subset='training')  # set as training data

    validation_generator = train_datagen.flow_from_directory(
        train_data_dir,  # same directory as training data
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary',
        subset='validation')  # set as validation data

    model.fit_generator(
        train_generator,
        steps_per_epoch=train_generator.samples // batch_size,
        validation_data=validation_generator,
        validation_steps=validation_generator.samples // batch_size,
        epochs=nb_epochs)





                          share|improve this answer


























                            1












                            1








                            1







                            https://stackoverflow.com/a/52372042/10111155 provided the easiest way: ImageDataGenerator now supports splitting into train/test from a single directory with subdirectories directly.



                            This is copied directly from that answer with no changes. I take no credit. I tried it and it worked perfectly.



                            Note that train_data_dir is the same in the train_generator and validation_generator. If you want a three-way split (train/test/valid) using ImageDataGenerator, the source code will need to be modified --- there are nice instructions here.



                            train_datagen = ImageDataGenerator(rescale=1./255,
                            shear_range=0.2,
                            zoom_range=0.2,
                            horizontal_flip=True,
                            validation_split=0.2) # set validation split

                            train_generator = train_datagen.flow_from_directory(
                            train_data_dir,
                            target_size=(img_width, img_height),
                            batch_size=batch_size,
                            class_mode='binary',
                            subset='training') # set as training data

                            validation_generator = train_datagen.flow_from_directory(
                            train_data_dir, # same directory as training data
                            target_size=(img_width, img_height),
                            batch_size=batch_size,
                            class_mode='binary'
                            subset='validation') # set as validation data

                            model.fit_generator(
                            train_generator,
                            steps_per_epoch = train_generator.samples // batch_size,
                            validation_data = validation_generator,
                            validation_steps = validation_generator.samples // batch_size,
                            epochs = nb_epochs)





                            share|improve this answer













                            https://stackoverflow.com/a/52372042/10111155 provided the easiest way: ImageDataGenerator now supports splitting into train/test from a single directory with subdirectories directly.



                            This is copied directly from that answer with no changes. I take no credit. I tried it and it worked perfectly.



                            Note that train_data_dir is the same in the train_generator and validation_generator. If you want a three-way split (train/test/valid) using ImageDataGenerator, the source code will need to be modified --- there are nice instructions here.



                            train_datagen = ImageDataGenerator(rescale=1./255,
                            shear_range=0.2,
                            zoom_range=0.2,
                            horizontal_flip=True,
                            validation_split=0.2) # set validation split

                            train_generator = train_datagen.flow_from_directory(
                            train_data_dir,
                            target_size=(img_width, img_height),
                            batch_size=batch_size,
                            class_mode='binary',
                            subset='training') # set as training data

                            validation_generator = train_datagen.flow_from_directory(
                            train_data_dir, # same directory as training data
                            target_size=(img_width, img_height),
                            batch_size=batch_size,
                            class_mode='binary'
                            subset='validation') # set as validation data

                            model.fit_generator(
                            train_generator,
                            steps_per_epoch = train_generator.samples // batch_size,
                            validation_data = validation_generator,
                            validation_steps = validation_generator.samples // batch_size,
                            epochs = nb_epochs)






                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered Nov 21 '18 at 18:06









                            Beau Hilton





































                                Here's my approach:



                                import os
                                import random
                                import shutil
                                from tempfile import TemporaryDirectory

                                # Create a temporary validation set inside the training folders.
                                with TemporaryDirectory(dir=train_image_folder) as valid_image_folder, \
                                        TemporaryDirectory(dir=train_label_folder) as valid_label_folder:
                                    train_images = os.listdir(train_image_folder)
                                    train_labels = set(os.listdir(train_label_folder))

                                    for img_name in train_images:
                                        single_name, ext = os.path.splitext(img_name)
                                        label_name = single_name + '.png'
                                        # Skip entries without a matching label (this also skips the temporary folder itself).
                                        if label_name not in train_labels:
                                            continue
                                        # train_val_split is the fraction of image/label pairs moved to validation.
                                        if random.uniform(0, 1) <= train_val_split:
                                            # Move the image and its label together; the label keeps its own file name.
                                            shutil.move(os.path.join(train_image_folder, img_name), os.path.join(valid_image_folder, img_name))
                                            shutil.move(os.path.join(train_label_folder, label_name), os.path.join(valid_label_folder, label_name))


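                                Don't forget to move everything back before the with block exits: TemporaryDirectory deletes the temporary folders and everything still inside them.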
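One way to make the "move everything back" step hard to forget is to wrap both moves in a context manager. This is a minimal sketch, not the answer's own code: temporary_validation_split is a hypothetical helper, it handles a single folder of files rather than image/label pairs, and it creates the temporary folder next to the training folder (an assumption made here so the training folder's listing stays clean).

```python
import os
import random
import shutil
from contextlib import contextmanager
from tempfile import TemporaryDirectory


@contextmanager
def temporary_validation_split(train_folder, fraction, seed=None):
    """Move a random share of files out of train_folder and restore them on exit.

    Hypothetical helper for illustration; each file is moved with
    probability `fraction`, so the share is approximate.
    """
    rng = random.Random(seed)
    parent = os.path.dirname(os.path.abspath(train_folder))
    with TemporaryDirectory(dir=parent) as valid_folder:
        names = sorted(
            n for n in os.listdir(train_folder)
            if os.path.isfile(os.path.join(train_folder, n))
        )
        moved = [n for n in names if rng.random() < fraction]
        for n in moved:
            shutil.move(os.path.join(train_folder, n), os.path.join(valid_folder, n))
        try:
            yield valid_folder
        finally:
            # Restore the files before TemporaryDirectory deletes the folder.
            for n in moved:
                shutil.move(os.path.join(valid_folder, n), os.path.join(train_folder, n))
```

Training would then run inside `with temporary_validation_split(train_folder, 0.2) as valid_folder:`, and the files return to the training folder even if training raises an exception.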






                                    answered Mar 14 '18 at 13:55









                                    Richard





































                                        Your solution worked, thanks.



                                        import os
                                        import shutil
                                        import numpy as np

                                        # base_dir is the dataset root; the val folders must already exist.
                                        sourceN = os.path.join(base_dir, "train", "NORMAL")
                                        destN = os.path.join(base_dir, "val", "NORMAL")
                                        sourceP = os.path.join(base_dir, "train", "PNEUMONIA")
                                        destP = os.path.join(base_dir, "val", "PNEUMONIA")

                                        filesN = os.listdir(sourceN)
                                        filesP = os.listdir(sourceP)

                                        # Move each file to the validation folder with probability 0.2.
                                        for f in filesN:
                                            if np.random.rand() < 0.2:
                                                shutil.move(os.path.join(sourceN, f), os.path.join(destN, f))

                                        for i in filesP:
                                            if np.random.rand() < 0.2:
                                                shutil.move(os.path.join(sourceP, i), os.path.join(destP, i))

                                        # Report how many files ended up in each folder.
                                        print(len(os.listdir(sourceN)))
                                        print(len(os.listdir(sourceP)))
                                        print(len(os.listdir(destN)))
                                        print(len(os.listdir(destP)))
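The code above draws a random number for every file, so each folder loses roughly, but not exactly, 20% of its files. If an exact count matters, a variant along the following lines might be used instead. It is a sketch under assumptions: move_fraction is a made-up helper name, and unlike the code above it creates the destination folder when it is missing.

```python
import os
import random
import shutil


def move_fraction(source_dir, dest_dir, fraction=0.2, seed=None):
    """Move round(fraction * n) randomly chosen files from source_dir to dest_dir.

    Hypothetical helper for illustration; returns the moved file names.
    """
    rng = random.Random(seed)  # a fixed seed reproduces the same selection
    files = sorted(
        f for f in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, f))
    )
    k = round(fraction * len(files))
    chosen = rng.sample(files, k)  # sampling without replacement
    os.makedirs(dest_dir, exist_ok=True)
    for f in chosen:
        shutil.move(os.path.join(source_dir, f), os.path.join(dest_dir, f))
    return chosen
```

It would be called once per class, for example with the NORMAL and PNEUMONIA source and destination folders from the code above.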





                                        answered May 23 '18 at 18:07









                                        Jordy













                                        • Consider adding an explanation.

                                          – Grant Miller
                                          May 23 '18 at 18:38


















