URL discovery in CSV files where values are not wrapped in quotes #68

Closed
cicdguy opened this issue Nov 30, 2023 · 5 comments

Comments

cicdguy commented Nov 30, 2023

This is a reference to the issue from lycheeverse/lychee#1299, and it was suggested that I post here for feedback.


Hello,

I'm using lychee 0.13.0, which in turn uses linkify v0.10.0, and I'm running it against this file:
https://github.com/pharmaverse/admiraldiscovery/blob/06e6e55b884ef91de9ae457606ed66defc9dba14/data-raw/admiral-lookup-book.csv

Like so:

lychee **/*.csv

And I get the following result:

⠚ 1/47 ETA 80s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed
⠚ 2/47 ETA 39s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Ne
⠚ 3/47 ETA 25s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network
⠚ 4/47 ETA 19s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Networ
⠚ 5/47 ETA 15s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network
⠚ 6/47 ETA 12s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Netwo
⠚ 7/47 ETA 10s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed
⠚ 8/47 ETA 9s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Net
⠚ 9/47 ETA 8s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network
⠚ 10/47 ETA 7s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network
⠚ 11/47 ETA 6s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Ne
⠚ 12/47 ETA 6s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Netw
⠚ 13/47 ETA 5s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network
⠚ 14/47 ETA 4s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network e
⠚ 15/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed:
⠚ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 17/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Netwo
⠒ 18/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Netwo
⠒ 19/47 ETA 1s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Netwo
⠒ 20/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network e
⠒ 21/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: N
⠒ 22/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Fai
⠒ 23/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network e
⠒ 24/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Ne
⠒ 25/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network
⠒ 26/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Net
⠒ 27/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network
⠒ 32/47 ETA 0s █████████████░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Netwo
⠒ 33/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed:
⠒ 34/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: N
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠒ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
  47/47 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[data-raw/admiral-lookup-book.csv]:
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_query.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_wbc_abs.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_obs_number.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_pchg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Network error: Not Found

🔍 47 Total ✅ 12 OK 🚫 35 Errors (HTTP:35)

When I modify the file by wrapping the URLs in the CSV in quotes, I get the expected result:

❯ lychee **/*.csv
  47/47 ETA 0s ████████████████████ Finished extracting links           
  🔍 47 Total ✅ 47 OK 🚫 0 Errors

Although commas are allowed (safe) characters in URLs, would it be possible for linkify to detect CSV files and extract URLs from them without having to wrap the URL strings in quotes?
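
To illustrate the behaviour described above, here is a minimal Python sketch. It is not linkify's actual algorithm: the regular expression is only a stand-in for a plain-text link finder, and the row layout is a hypothetical one shaped like the failing entries in the report. Because a comma is a legal URL character, a scanner that sees the row as plain text has no reason to stop at the field separator.

```python
import re

# Stand-in for a plain-text link finder; NOT linkify's real implementation.
# The character class stops at whitespace, quotes and angle brackets,
# but not at commas, since commas are legal inside URLs.
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")

# Hypothetical rows (the real column layout of the CSV is an assumption).
unquoted_row = (
    "derive_param_bmi,"
    "https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,"
    "Template"
)
quoted_row = (
    "derive_param_bmi,"
    '"https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html",'
    "Template"
)

# Unquoted: the match runs past the comma and swallows the next field.
unquoted_match = URL_PATTERN.findall(unquoted_row)
# Quoted: the closing quote ends the match right after ".html".
quoted_match = URL_PATTERN.findall(quoted_row)

print(unquoted_match)
print(quoted_match)
```

With the unquoted row the match ends in `.html,Template`, which is the URL shape that returns 404 in the output above; with the quoted row the match ends at `.html`.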

Contributor

mre commented Jan 29, 2024

@robinst, what are your thoughts on this? Is this out of scope for linkify?

Owner

robinst commented Jan 31, 2024

Hmm. I wouldn't know how to distinguish this from the plain-text case; linkify doesn't even know the file extension.

Does lychee have support for detecting file types via extension? That would help in this case.

Contributor

mre commented Jan 31, 2024

Yes, it does. Maybe it could be passed as a parameter to linkify, although I could see why one would not want to do that.
Tricky one. Not sure where to draw the line between the tools.

Owner

robinst commented Mar 3, 2024

I think in this case it would be nice if lychee could detect CSV files, parse them with a parser library, and then feed the individual cell values to linkify.
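
The approach suggested here (detect CSV by extension, parse it, then scan each cell separately) could look roughly like the following Python sketch. It is an illustration under stated assumptions, not lychee's or linkify's code: `extract_links` is a hypothetical helper, the regular expression is a stand-in for linkify, and the sample rows are invented.

```python
import csv
import io
import re
from pathlib import Path

# Stand-in for linkify (assumption); commas are allowed inside a match.
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")


def extract_links(filename: str, content: str) -> list[str]:
    """Return URLs in content, scanning CSV files cell by cell (hypothetical helper)."""
    if Path(filename).suffix.lower() == ".csv":
        # Let a real CSV parser decide where the cells begin and end.
        rows = csv.reader(io.StringIO(content, newline=""))
        chunks = [cell for row in rows for cell in row]
    else:
        # Any other file is treated as one block of plain text.
        chunks = [content]
    return [url for chunk in chunks for url in URL_PATTERN.findall(chunk)]


# Invented sample data; the header names are assumptions.
sample = (
    "function,link,type\n"
    "derive_param_bmi,"
    "https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,"
    "Template\n"
)

csv_links = extract_links("admiral-lookup-book.csv", sample)
txt_links = extract_links("notes.txt", sample)
print(csv_links)
print(txt_links)
```

The same content yields a clean URL when the file is treated as CSV and a URL with a trailing `,Template` when it is treated as plain text. One trade-off: a URL that genuinely contains a comma would be split if its cell is not quoted, so quoting would still be needed in that case.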

Contributor

mre commented Mar 3, 2024

That's a great idea and I think that's a solid way forward. Thanks for the insight! I'll update the original issue accordingly.
