With a relational DB I am used to checking if my app works with the DB by querying for the data that should have been inserted etc. With ES this does not seem so straightforward. Cause it is so elastic, you know..
Most places tell you to use a scroll and scan search for this. Otherwise, when you request 10 docs, every node (shard) needs to retrieve its own top 10, and then all of those need to be sorted and the final 10 picked. For the next batch (11-20) this needs to be repeated, but every shard first has to go through its first 10 again to get to number 11.
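To put rough numbers on that cost, here is a minimal sketch. It assumes each shard has to hand back its own top "from + size" docs for every page, and the helper name docs_to_sort and the shard count of 5 are just made up for illustration.

```shell
# Hypothetical helper: how many docs end up being sorted for one page,
# given the number of shards, the offset ("from") and the page size.
docs_to_sort() {
  shards=$1; from=$2; size=$3
  echo $(( shards * (from + size) ))
}

docs_to_sort 5 0 10      # docs 1-10:  prints 50
docs_to_sort 5 10 10     # docs 11-20: prints 100
docs_to_sort 5 10000 10  # a page far down the list: prints 50050
```

So the deeper you page, the more work gets thrown away on every request.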
With “scan” mode ES just returns the docs in the order they are stored. So no need to score, sort, whatever. Of course I have no idea how multiple nodes (shards) then figure out what to look up if they don't all store all docs in the exact same order. But whatever.
Finally, the “scroll” part means that the nodes don't discard the query results immediately. Instead you can request the next 10, and the next 10, and so on, and the nodes know to continue where the previous request left off.
So the answers on Stack Overflow etc. seem to propose doing it like this:
curl -XGET 'localhost:9200/INDEXNAME/_search?scroll=1m&search_type=scan' -d '{"query": {"match_all" : {}}}'
The “match_all” query should end up matching all docs. The “scroll=1m” means the query results are kept for 1 minute, which in turn means you have 1 minute to request the next batch (of 10 or whatever). Once you request those 10, the timer resets, so you have another minute to get the next batch before the results are discarded.
But.. And there always is a but, right? This returns no docs. Just an empty list for “hits”. Reading a bit further, what you get from this search is not the docs but a Base64 encoded identifier for your “scroll” window or whatever you call it. Then you are supposed to repeat queries like this:
GET /_search/scroll?scroll=1m
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0
NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy
UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW
xfaGl0czoxOw==
Where the long set of random chars is the Base64 encoded scroll identifier. Good luck storing that when you manually do curl queries on the command line or write the queries into your browser (especially a GET request with a body..). Well, you can put the ID either in the request body or in the URL as a query parameter (scroll_id I guess; _scroll_id is what the field is called in the response). I am sure plenty of real hackers know how to put the curl return value (the scroll id) into an environment variable and then stick it there or something like this. I am lazy, I just want to copy paste it. BTW if you can do this, please let me know so I can copy paste your solution.. 🙂
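One way the "put it in a variable" trick could look is sketched below. This is not a tested recipe: the function names (extract_scroll_id, start_scroll, next_batch) are made up, it assumes the response carries the id in a "_scroll_id" field, and it uses the same old-style scan/scroll requests as above (newer ES versions may want different parameters or headers). The id is pulled out with plain sed so nothing extra needs to be installed.

```shell
# Hypothetical helper: read a JSON response on stdin and print the value
# of its "_scroll_id" field (first match only).
extract_scroll_id() {
  sed -n 's/.*"_scroll_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: start a scan/scroll on the index given as $1
# and print only the scroll id.
start_scroll() {
  curl -s -XGET "localhost:9200/$1/_search?scroll=1m&search_type=scan" \
    -d '{"query": {"match_all": {}}}' | extract_scroll_id
}

# Hypothetical helper: fetch the next batch, sending the scroll id ($1)
# as the request body.
next_batch() {
  curl -s -XGET 'localhost:9200/_search/scroll?scroll=1m' -d "$1"
}

# Usage (needs a running node, so it is left commented out here):
#   SCROLL_ID=$(start_scroll INDEXNAME)
#   next_batch "$SCROLL_ID"
```

Each response to next_batch should carry a scroll id of its own, so piping it through extract_scroll_id again and reusing that value for the following call is probably the safe way to keep going.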
Anyway, this is what I end up doing for now.
Do a direct query to get any type of data (the “q=*:*” part):
curl -XGET 'http://localhost:9200/INDEXNAME/_search?pretty=true&q=*:*'
But notice this only returns the first 10 results. That might be fine for checking whether your code works or not. But to get more (50 here):
curl -XGET 'http://localhost:9200/INDEXNAME/_search?pretty=true&q=*:*&size=50'
This can be a bit too much if you have many types. So another way is to narrow the query down, for example to a specific type (_type:TYPE):
curl -XGET 'http://localhost:9200/INDEXNAME/_search?pretty=true&q=_type:TYPE&size=50'
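The three variants above differ only in the query and the size, so they could be wrapped in a small shell function. A minimal sketch, assuming ES runs on localhost:9200 as in the commands above; the names es_peek_url and es_peek are made up.

```shell
# Hypothetical helper: build the quick-check URL.
#   $1 = index name, $2 = query (defaults to *:*), $3 = size (defaults to 10)
es_peek_url() {
  index=$1; query=${2:-*:*}; size=${3:-10}
  echo "http://localhost:9200/${index}/_search?pretty=true&q=${query}&size=${size}"
}

# Hypothetical helper: run the query with curl (needs a running node).
es_peek() {
  curl -s -XGET "$(es_peek_url "$@")"
}

# Usage:
#   es_peek INDEXNAME                  # first 10 docs of anything
#   es_peek INDEXNAME '*:*' 50         # first 50 docs
#   es_peek INDEXNAME '_type:TYPE' 50  # first 50 docs of one type
```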
Good enough for little quick checks. And I can copy paste it to my terminal window..