Add method to get URL Status (returns an URLItem) #92

Merged
merged 11 commits into from
Sep 3, 2024

@klockla klockla commented Aug 2, 2024

Add a new API method to retrieve information about a URL

 /** Get status of a particular URL 
     This does not take into account URL scheduling.
     Used to check current status of an URL within the frontier
 **/
 rpc GetURLStatus(URLStatusRequest) returns (URLItem) {}

Implemented only for MemoryFrontier and RocksDb
(may partially fulfill #57)

Unfortunately the internal storage doesn't make a distinction between Discovered and Known URLs which have to be refetched (or I have missed the point)

So all scheduled items will be returned as a KnownURLItem (with a refetch date equal to 0 for completed items)
If the URL is not in URLFrontier, the method will return io.grpc.Status.NOT_FOUND.asRuntimeException()
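
The behaviour described above can be illustrated with a small, self-contained sketch. It is a toy in-memory model, not the code from this PR: the class `UrlStatusSketch`, the `KnownItem` type, its `refetchDate` field, and `NotFoundException` (standing in for `io.grpc.Status.NOT_FOUND.asRuntimeException()`) are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of the lookup semantics; not the URLFrontier code. */
class UrlStatusSketch {

    /** Hypothetical stand-in for the KnownURLItem carried by the response. */
    static final class KnownItem {
        final String url;
        final long refetchDate; // 0 means the URL is completed

        KnownItem(String url, long refetchDate) {
            this.url = url;
            this.refetchDate = refetchDate;
        }

        boolean isCompleted() {
            return refetchDate == 0L;
        }
    }

    /** Hypothetical stand-in for Status.NOT_FOUND.asRuntimeException(). */
    static final class NotFoundException extends RuntimeException {
        NotFoundException(String url) {
            super("NOT_FOUND: " + url);
        }
    }

    // URL -> refetch date. Discovered and known URLs share one store,
    // mirroring the limitation noted in the description.
    private final Map<String, Long> store = new HashMap<>();

    void put(String url, long refetchDate) {
        store.put(url, refetchDate);
    }

    /** Returns the stored item, or throws if the URL was never added. */
    KnownItem getUrlStatus(String url) {
        Long refetchDate = store.get(url);
        if (refetchDate == null) {
            throw new NotFoundException(url);
        }
        return new KnownItem(url, refetchDate);
    }

    public static void main(String[] args) {
        UrlStatusSketch frontier = new UrlStatusSketch();
        frontier.put("https://example.com/done", 0L);
        frontier.put("https://example.com/later", 1725321600L);

        System.out.println(frontier.getUrlStatus("https://example.com/done").isCompleted());
        System.out.println(frontier.getUrlStatus("https://example.com/later").isCompleted());
        try {
            frontier.getUrlStatus("https://example.com/unknown");
        } catch (NotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A caller of the real endpoint would presumably follow the same pattern: treat a refetch date of 0 as "completed" and handle the NOT_FOUND status as "URL not in the frontier".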

Signed-off-by: Laurent Klock Laurent.Klock@arhs-cube.com

@klockla klockla marked this pull request as ready for review August 2, 2024 14:58
@klockla klockla marked this pull request as draft August 2, 2024 15:01
@klockla klockla marked this pull request as ready for review August 2, 2024 15:07
@klockla klockla marked this pull request as draft August 6, 2024 12:07
Implemented only for MemoryFrontier and RocksDb

Unfortunately the internal storage doesn't make a distinction
between Discovered and Known URLs which have to be refetched

So the method will always return a KnownURLItem or throw a Status.NOT_FOUND runtime exception

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
@klockla klockla marked this pull request as ready for review August 8, 2024 14:16

@jnioche jnioche left a comment

A few minor issues and questions. Great to have additional tests!

(To be done in separate PR)

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

jnioche commented Aug 29, 2024

Thanks @klockla
Looks good at this stage but I think it needs an addition to the client so that we can query the new endpoint and display the status of a URL.

@jnioche jnioche left a comment

see comment in the conversation re-client side

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

klockla commented Sep 2, 2024

see comment in the conversation re-client side

Added the method in client.

API/urlfrontier.proto
private String crawl;

@Option(
names = {"-k", "--key"},

see comment about key being generated by default on the server side. Should be optional?

ok
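
For readers following this thread, one way an optional key could be handled is sketched below. The thread only says that the key is generated by default on the server side; deriving it from the URL's hostname is an assumption made here for illustration, and `KeyDefaultSketch` and `resolveKey` are hypothetical names, not part of the client or the service.

```java
import java.net.URI;

/** Sketch of falling back to a derived key when none is supplied. */
class KeyDefaultSketch {

    /**
     * Returns the explicit key if one was given; otherwise derives one
     * from the URL's hostname (an assumption, see the lead-in).
     */
    static String resolveKey(String key, String url) {
        if (key != null && !key.isEmpty()) {
            return key;
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Cannot derive a key from: " + url);
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(resolveKey(null, "https://example.com/page"));
        System.out.println(resolveKey("custom", "https://example.com/page"));
    }
}
```

With a fallback like this, the `-k` / `--key` option would not need to be mandatory on the command line.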

jnioche commented Sep 2, 2024

thanks a lot @klockla - I gave it a try and it seems to work fine
let me know what you think of my comments and suggestions above

Added missing license header

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>

jnioche commented Sep 3, 2024

Tested, works great! Thanks @klockla, this is a great contribution to the project

@jnioche jnioche merged commit 08f09c3 into crawler-commons:master Sep 3, 2024
2 checks passed
@jnioche jnioche added this to the 2.3 milestone Sep 3, 2024
@jnioche jnioche added enhancement New feature or request API Client labels Sep 3, 2024
@jnioche jnioche mentioned this pull request Sep 3, 2024