Delete 500 million documents from a MongoDB collection containing 18 billion documents



























We are trying to clean up a MongoDB collection that holds 18 billion documents, and we need to remove around 500 million of them. Despite using an indexed query to delete the data in batches of 1,000 and using bulk operations, the execution is painfully slow. Could someone suggest a strategy for this kind of cleanup? I am happy to provide more information if needed.



We are using a Scala process comprising 8 threads, each handling a batch of 1,000 deletes.
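Roughly, each worker thread loops over something like the sketch below (simplified to a single thread, written against the official MongoDB Scala driver; the connection string, database/collection names and the `status` filter are placeholders rather than our real schema): fetch the `_id`s of the next 1,000 matching documents via the indexed query, then remove them with an unordered bulk write.

```scala
import org.mongodb.scala._
import org.mongodb.scala.bson.BsonValue
import org.mongodb.scala.model.{BulkWriteOptions, DeleteOneModel, Filters, Projections}

import scala.concurrent.Await
import scala.concurrent.duration._

object BatchedCleanup {
  def main(args: Array[String]): Unit = {
    // Placeholder connection string, namespace and filter; the real ones differ.
    val client = MongoClient("mongodb://localhost:27017")
    val coll: MongoCollection[Document] =
      client.getDatabase("mydb").getCollection("events")
    val filter = Filters.equal("status", "expired") // must be backed by an index

    // Read only the _ids of the next batch of candidate documents.
    def nextBatch(): Seq[BsonValue] =
      Await.result(
        coll.find(filter)
          .projection(Projections.include("_id"))
          .limit(1000)
          .toFuture(),
        1.minute
      ).flatMap(_.get("_id"))

    var batch = nextBatch()
    while (batch.nonEmpty) {
      // Unordered bulk delete of the current batch by _id.
      val deletes = batch.map(id => DeleteOneModel(Filters.equal("_id", id)))
      val result = Await.result(
        coll.bulkWrite(deletes, BulkWriteOptions().ordered(false)).toFuture(),
        10.minutes
      )
      println(s"deleted ${result.getDeletedCount} documents in this batch")
      batch = nextBatch()
    }

    client.close()
  }
}
```

The bulk write is unordered, so the server is not forced to apply the deletes in submission order, and re-running the same indexed query for each batch is safe because already-deleted documents no longer match the filter.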










mongodb






asked Nov 24 '18 at 8:48









Anish Gupta














  • Small batches and indexed queries are what I was going to suggest. Is your hardware simply not up to the task? Do you know what is taking all the time? Is it page faults?

    – Sergio Tulentsev
    Nov 24 '18 at 8:54













  • I have checked the swap space and it is unused. The host has about 100 GB of cached memory that can be reclaimed when required, so it doesn't look like a memory issue; it seems to have more to do with the way MongoDB executes the delete operations. The host I am currently running this on has a spinning disk, but I haven't seen a considerable speed-up on SSDs either.

    – Anish Gupta
    Nov 24 '18 at 9:06
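To look into the page-fault question from the comments more concretely, the relevant counters can be pulled from `serverStatus` (again only a hedged sketch, using the same placeholder connection string as above; the nested field names vary by MongoDB version and storage engine, so this only shows how to fetch the document, not a definitive list of metrics):

```scala
import org.mongodb.scala._

import scala.concurrent.Await
import scala.concurrent.duration._

object CheckServerStatus {
  def main(args: Array[String]): Unit = {
    val client = MongoClient("mongodb://localhost:27017") // placeholder, as above

    // Fetch the serverStatus document; sections such as "extra_info" (page faults
    // on Linux) and the storage-engine cache statistics can then be inspected.
    val status: Document = Await.result(
      client.getDatabase("admin").runCommand(Document("serverStatus" -> 1)).toFuture(),
      30.seconds
    )
    println(status.toJson())

    client.close()
  }
}
```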




















